- Task 1 - Detection
- Task 2 - Detection-Linking
- Task 3 - Detection-Recognition
- Task 4 - Detection-Recognition-Linking
method: LIGHT + PaLeTTe (2025-06-12)
Authors: Yijun Lin, Rhett Olson, Junhan Wu, Yao-Yi Chiang, and Jerod Weinman
Affiliation: University of Minnesota, Grinnell College
Description: LIGHT integrates Linguistic, Image, and Geometric features for linking Historical map Text. A geometry-aware embedding module encodes the polygonal coordinates of text regions to capture polygon shapes and their relative spatial positions on the image. LIGHT unifies this geometric information with the visual and linguistic token embeddings of LayoutLMv3, a pretrained layout-analysis model, and uses the resulting cross-modal representation to directly predict the reading-order successor of each text instance, with a bi-directional learning strategy that enhances sequence robustness.
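The geometric encoding described above can be illustrated with a minimal sketch: normalize a text region's polygon vertices to the image frame, then project them into a fixed-size embedding that could be summed with visual and linguistic token embeddings. The function name, the random linear projection (standing in for a learned MLP), and the embedding size are all illustrative assumptions, not LIGHT's actual implementation.

```python
import numpy as np

def encode_polygon(polygon, img_w, img_h, dim=16, rng=None):
    """Toy geometry embedding: normalize polygon vertex coordinates
    to the image frame and project them with a random linear map
    (a stand-in for a learned MLP)."""
    rng = np.random.default_rng(0) if rng is None else rng
    pts = np.asarray(polygon, dtype=float)       # (N, 2) vertices
    pts = pts / np.array([img_w, img_h])         # normalize to [0, 1]
    flat = pts.flatten()
    W = rng.standard_normal((dim, flat.size))    # stand-in for learned weights
    return np.tanh(W @ flat)                     # (dim,) geometric embedding

# such an embedding would be combined with the visual/linguistic token
# embeddings before the transformer layers
poly = [(10, 20), (110, 20), (110, 50), (10, 50)]
emb = encode_polygon(poly, img_w=1000, img_h=800)
```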
The text detection and recognition step is performed by PaLeTTe.
Yijun Lin, Rhett Olson, Junhan Wu, Yao-Yi Chiang, and Jerod Weinman. LIGHT: Multi-Modal Text Linking on Historical Maps. ICDAR 2025. (In press.)
method: CREPE + BezierCurve (2025-04-13)
Authors: Youngmin Baek, Michael Hentschel, Yu Nakagome, Shuta Ichimura, Jeong Tae Lee, Chankyu Choi
Affiliation: NAVER/LINE WORKS
Description: We used the CREPE method, an end-to-end KIE (key information extraction) model, to perform the text detection, recognition, and linking tasks without any post-processing. While the original paper regresses four quadrilateral corner coordinates, we modified the approach to predict eight control points of Bézier curves. For pretraining, we use the ArT dataset to learn effective representations of curved text; during fine-tuning, we train exclusively on the Rumsey dataset.
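The eight predicted control points can be read as two cubic Bézier curves (four points for the top boundary of a word, four for the bottom). A minimal sketch of turning them into a closed text-region polygon, under that assumption (function names are illustrative, not CREPE's code):

```python
import numpy as np

def cubic_bezier(ctrl, n=20):
    """Sample n points on a cubic Bézier curve from its 4 control points."""
    ctrl = np.asarray(ctrl, dtype=float)  # (4, 2)
    t = np.linspace(0, 1, n)[:, None]
    return ((1 - t) ** 3 * ctrl[0]
            + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2]
            + t ** 3 * ctrl[3])           # (n, 2) sampled points

def control_points_to_polygon(top4, bottom4, n=20):
    """Assemble the eight control points (top and bottom cubic Béziers)
    into a closed text-region polygon."""
    top = cubic_bezier(top4, n)
    bottom = cubic_bezier(bottom4, n)[::-1]  # reverse to close the loop
    return np.vstack([top, bottom])          # (2n, 2) polygon vertices
```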
method: Self-Sequencer (2025-03-28)
Authors: Mengjie Zou, Tianhao Dai, Remi Petitpierre, Beatrice Vaienti, Frederic Kaplan, Isabella di Lenardo
Affiliation: EPFL, Swiss Federal Institute of Technology in Lausanne
Email: remi.petitpierre@epfl.ch
Description: For word detection and recognition, our approach relies on DeepSolo, whose architecture is derived from Detection Transformers (DETR). In short, DeepSolo extracts hierarchical visual features from map images and processes them through an encoder-decoder architecture to detect words as segments bounded by Bézier curves. The model specifically returns four control points of central Bézier curves per word and then uniformly samples query points along these curves to segment, classify, and delineate each text instance precisely. To resolve duplicate word detections, we implement a postprocessing step inspired by Non-Maximum Suppression. It involves calculating the Fréchet distance between the Bézier curves of potential duplicate word pairs, or "directional synonyms", and merging those below a defined threshold.
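The deduplication step described above can be sketched as follows: sample points along each Bézier curve, compute the discrete Fréchet distance between candidate pairs, and keep only curves sufficiently far from every already-kept curve. The function names and threshold are illustrative; this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines (point arrays)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(ca[i - 1, j] if i else np.inf,
                       ca[i, j - 1] if j else np.inf,
                       ca[i - 1, j - 1] if i and j else np.inf)
            ca[i, j] = max(prev, d[i, j])  # best coupling so far
    return ca[-1, -1]

def deduplicate(curves, threshold=5.0):
    """NMS-style merge: drop a curve whose Fréchet distance to an
    already-kept curve falls below the threshold."""
    kept = []
    for c in curves:
        if all(discrete_frechet(c, k) >= threshold for k in kept):
            kept.append(c)
    return kept
```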
Our text linking methodology consists of four steps: (1) neighbor sampling, (2) self-sequencing, (3) graph assignment, and (4) ordering. In the first step, a word segment is designated as the query, and neighbor segments are used as candidates. For the second step, we introduce Self-Sequencer, a trainable Transformer-based model that iteratively returns ordered local sequences based on the input query segment and candidate neighbors. Each input text segment is represented only by the control points of its bounding Bézier curves. A Transformer encoder generates a deep representation from spatial-directional features, and an Attentive Pointer module then predicts local word sequences from the concatenated hidden states of candidate word pairs. The aim of the third step, graph assignment, is to aggregate the local link predictions: the links predicted by the Self-Sequencer are used to build a global directed graph, and each strongly connected component of this graph is treated as a distinct linked word set. In the fourth and last step, the order of the words in each sequence is recovered by applying a consensus ranking algorithm over the local sequence orders predicted by the Self-Sequencer. More details on the model, algorithms, and specific implementation are provided in our separate article [1].
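The graph-assignment step (3) can be sketched as: build a directed graph from the predicted local links and take its strongly connected components as linked word sets. A minimal sketch using Kosaraju's algorithm (node names and the edge-list format are illustrative, not the authors' code):

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm: group nodes of a directed link graph into
    strongly connected components (each one a linked word set)."""
    graph, rgraph, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        rgraph[v].append(u)
        nodes.update((u, v))

    def dfs(start, g, seen, out):
        # iterative DFS that records nodes in finish order
        stack = [(start, iter(g[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(g[nxt])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)

    order, seen = [], set()
    for u in nodes:                      # pass 1: finish order on the graph
        if u not in seen:
            dfs(u, graph, seen, order)

    comps, seen = [], set()
    for u in reversed(order):            # pass 2: reverse graph, reverse order
        if u not in seen:
            comp = []
            dfs(u, rgraph, seen, comp)
            comps.append(sorted(comp))
    return comps
```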
The training leverages several real and synthetic datasets: ICDAR MapText [2], MapKuratorHuman [3], SynthMap [3], and the Paris and Jerusalem Maps Text Dataset [4].
References:
[1] Zou, M., Dai, T., Petitpierre, R., Vaienti, B., Kaplan, F., & di Lenardo, I. (2025). Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer.
[2] Lin, Y., Li, Z., Chiang, Y.-Y., & Weinman, J. (2024). Rumsey Train and Validation Data for ICDAR'24 MapText Competition (Version 1.3). Zenodo. https://doi.org/10.5281/zenodo.11516933
[3] Kim, J., Li, Z., Lin, Y., Namgung, M., Jang, L., & Chiang, Y.-Y. (2023). The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps. In Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems. https://arxiv.org/abs/2306.17059
[4] Dai, T., Johnson, K., Petitpierre, R., Vaienti, B., & di Lenardo, I. (2025). Paris and Jerusalem Maps Text Dataset (Version 1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14982662
Date | Method | Overall H-Mean | Word Precision | Word Recall | Word Tightness | Char Accuracy | Char Quality | Det Quality | Word F-Score | Link Precision | Link Recall | Link F-Score
---|---|---|---|---|---|---|---|---|---|---|---|---
2025-06-12 | LIGHT + PaLeTTe | 83.47% | 88.94% | 92.17% | 85.79% | 94.64% | 73.50% | 77.66% | 90.53% | 77.12% | 68.51% | 72.56%
2025-04-13 | CREPE + BezierCurve | 80.80% | 87.10% | 86.53% | 73.62% | 95.47% | 61.02% | 63.91% | 86.81% | 75.71% | 71.68% | 73.64%
2025-03-28 | Self-Sequencer | 80.76% | 91.52% | 89.13% | 86.14% | 94.86% | 73.79% | 77.79% | 90.31% | 72.59% | 61.66% | 66.68%
2025-04-29 | Baseline TESTR Finetuned + Heuristic MST | 59.34% | 89.14% | 90.04% | 86.28% | 92.92% | 71.82% | 77.30% | 89.59% | 27.14% | 51.08% | 35.44%
2025-01-10 | [Baseline MapText'24] DS-LP | 46.23% | 71.76% | 78.93% | 71.63% | 90.83% | 48.90% | 53.84% | 75.17% | 55.28% | 16.63% | 25.57%
2025-04-19 | YOLOv8_ViTAE_PolygonDetector | 0.03% | 61.75% | 9.64% | 75.97% | 77.96% | 9.88% | 12.67% | 16.68% | 2.60% | 0.00% | 0.01%
2025-03-31 | MapText Strong Pipeline | 0.00% | 95.88% | 91.84% | 83.75% | 94.04% | 73.89% | 78.57% | 93.82% | 0.00% | 0.00% | 0.00%