ICDAR 2024 Competition on Historical Map Text Detection, Recognition, and Linking
Tasks
The competition consists of multiple interrelated tasks on historical maps, including text detection and recognition at the word and phrase levels. The four primary competition tasks are:
- Word Detection
- Phrase Detection (Word Linking)
- Word Detection and Recognition
- Phrase Detection and Recognition
We will evaluate each task on two datasets. One dataset consists of map images from the David Rumsey Map Collection, covering a wide range of map styles; the other, drawn from French Land Registers, contains maps tailored to a specific place and time. See the Downloads page for details.
File Formats
All tasks share the same file formats for both ground truth and submissions, though certain fields or elements may be ignored when irrelevant to the task.
Note that all coordinates are given in pixels with respect to the image, starting at (0,0) in the top-left corner.
Ground Truth Format
The same ground truth file and format is used for all tasks: a list of dictionaries (one per image), each of which has a list of phrase groups, each group consisting of an ordered list of words. The JSON file (UTF-8 encoded) has the following format:
[ # Begin a list of images
  {
    "image": "IMAGE_NAME1",
    "groups": [ # Begin a list of phrase groups for the image
      [ # Begin a list of words for the phrase
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT1", "illegible": false, "truncated": false},
        ...,
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT2", "illegible": true, "truncated": false}
      ],
      ...,
      [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT3", "illegible": false, "truncated": true}, ... ]
    ]
  },
  {
    "image": "IMAGE_NAME2",
    "groups": [
      [
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT4", "illegible": false, "truncated": false},
        ...,
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT5", "illegible": false, "truncated": false}
      ],
      ...,
      [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT6", "illegible": false, "truncated": false}, ... ]
    ]
  },
  ...
]
(Small sample file.) For each image, the "groups" field stores a list of groups, where each entry represents the list of words within a group, given in reading order (for Task 4). The "vertices" field of each word stores a list of coordinates (number pairs) representing the vertices of a bounding polygon for the given word; there must be at least three points, and no specific arrangement is required otherwise.
For detection tasks (1 and 2), the "text" field of the words can be ignored. For non-grouping tasks (1 and 3), the grouping structure can be ignored and only the lists of words are needed.
Words that are marked truncated or illegible will be ignored in the evaluation. For linking tasks (2 and 4), groups that contain any ignored word will be ignored in the evaluation.
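For concreteness, here is a minimal Python sketch of reading the ground truth and applying these ignore rules; the filename and variable names are illustrative, not part of the competition tooling.

import json

# Load the ground truth annotations (hypothetical filename).
with open("ground_truth.json", encoding="utf-8") as f:
    annotations = json.load(f)

for entry in annotations:
    for group in entry["groups"]:
        # Word-level tasks (1 and 3): only the flagged words are ignored.
        kept_words = [w for w in group
                      if not (w["illegible"] or w["truncated"])]
        # Linking tasks (2 and 4): one flagged word ignores the whole group.
        group_ignored = len(kept_words) < len(group)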
Submission Format
All tasks use the same basic submission format (and can accept the same file): a list of dictionaries (one per image), each of which has a list of phrase groups, each group consisting of an ordered list of predicted words. Some fields and structures will be ignored in certain tasks when they are irrelevant. The JSON file (UTF-8 encoded) has the following format:
[ # Begin a list of images
  {
    "image": "IMAGE_NAME1",
    "groups": [ # Begin a list of phrase groups for the image
      [ # Begin a list of words for the phrase
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT1"},
        ...,
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT2"}
      ],
      ...,
      [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT3"}, ... ]
    ]
  },
  {
    "image": "IMAGE_NAME2",
    "groups": [
      [
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT4"},
        ...,
        {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT5"}
      ],
      ...,
      [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT6"}, ... ]
    ]
  },
  ...
]
(Small sample file.) The "groups" field for each image stores a list of groups, where each entry represents the list of words within a group. The "vertices" field of each word stores a list of coordinates (number pairs) representing the vertices of a bounding polygon for the given word; there must be at least three points and no specific arrangement is required otherwise.
Note:
- The membership of words in groups is only considered during evaluation of Tasks 2 and 4.
- The "text" field is only considered during evaluation of Tasks 3 and 4.
- The word order (within a group) is only considered during evaluation of Task 4.
- The group order (within an image) is never considered.
- The "image" field must contain the relative path to the image, like {"image": "rumsey/test/000001.png" ... } or {"image": "ign/test/000001.jpg" ... }
Task 1 - Word Detection
The task requires detecting individual words on map images, i.e., generating bounding polygons that enclose text instances at the word level.
Submission Format
The optional "text" field of each word is ignored in evaluation and not required in the JSON file; this allows the same file to be submitted for every task. Grouping is not required for this task; although the groups field is required in the JSON file, the group-level organization is ignored by the evaluation. For example, each image could have one group containing all the words or each word could belong to its own group.
Evaluation Metric
Following the HierText competition, the Panoptic Detection Quality (PDQ) evaluates word detection:

$$\mathrm{PDQ} = T \times F,$$

where the tightness $T$ is the average IoU among true positive regions,

$$T = \frac{1}{|\mathrm{TP}|} \sum_{(g,p) \in \mathrm{TP}} \mathrm{IoU}(g, p),$$

and $F$ represents the F-score, the harmonic mean between precision and recall:

$$F = \frac{2PR}{P + R} = \frac{2\,|\mathrm{TP}|}{2\,|\mathrm{TP}| + |\mathrm{FP}| + |\mathrm{FN}|}.$$
Note that matches in TP are determined by solving the linear sum assignment problem on the weighted bipartite graph between ground truth and predicted regions, with edges only where IoU > 0.5. Predictions matched to ground truth regions that are to be ignored are not counted in the evaluation.
While PDQ will be the competition metric, the conventional evaluation metrics precision P, recall R, and F-score F are also reported to give a coarse sense of relative raw detection quality, separate from the localization tightness T.
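To make the computation concrete, here is a minimal Python sketch of word-level PDQ for one image, assuming shapely and scipy are available; the official evaluation code may differ in polygon handling and tie-breaking details.

import numpy as np
from scipy.optimize import linear_sum_assignment
from shapely.geometry import Polygon

def iou(a, b):
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

def word_pdq(gt_polys, pred_polys):
    # gt_polys, pred_polys: lists of shapely Polygons for one image.
    if not gt_polys or not pred_polys:
        return 0.0
    scores = np.array([[iou(g, p) for p in pred_polys] for g in gt_polys])
    # Keep only edges with IoU > 0.5, then maximize total IoU (assignment).
    scores[scores <= 0.5] = 0.0
    rows, cols = linear_sum_assignment(-scores)
    matched = [(r, c) for r, c in zip(rows, cols) if scores[r, c] > 0.5]
    tp = len(matched)
    fp = len(pred_polys) - tp
    fn = len(gt_polys) - tp
    tightness = sum(scores[r, c] for r, c in matched) / tp if tp else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn)
    return tightness * f_score  # PDQ = T x F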
Task 2 - Phrase Detection
This task requires words to be detected (with their polygon boundaries) and grouped into the label phrases they constitute. Words from the same group (phrase) are treated as one unit for (joint) detection.
Submission Format
As in Task 1, the optional "text" field of each word is ignored in evaluation and not required in the JSON file; this allows the same file to be submitted for every task. However, the evaluation is sensitive to group-level organization among the words. The word order is not considered during evaluation.
Evaluation Metric
For each group, the unions of the word polygons in the detection predictions and in the ground truth are used to calculate the IoU and thence the PDQ score at the phrase level. The word order is ignored. While group-level PDQ is the official competition metric, the constituent terms F-score F, precision P, recall R, and tightness T are also reported.
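A minimal sketch of the phrase-level regions, again assuming shapely: each group's word polygons are merged into a single region, after which the same matching machinery as in the Task 1 sketch applies.

from shapely.geometry import Polygon
from shapely.ops import unary_union

def group_region(group):
    # "group" is one entry of an image's "groups" list: a list of word dicts.
    return unary_union([Polygon(w["vertices"]) for w in group])

# Reusing the hypothetical word_pdq sketch from Task 1 on merged regions:
# group_pdq = word_pdq([group_region(g) for g in gt_groups],
#                      [group_region(g) for g in pred_groups])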
Task 3 - Word Detection and Recognition
In this task, participants are expected to produce word-level text detection and recognition results, i.e., a set of word bounding polygons and their corresponding transcriptions.
Submission Format
Submissions require the "text" field for each word. As in Task 1, grouping is not required for this task; although the "groups" field is required in the JSON file, any group-level organization is ignored by the evaluation. For example, each image could have one group containing all the words, or each word could belong to its own group.
Evaluation Metric
We introduce the Panoptic Word Quality (PWQ) metric, which is identical to PDQ, except that true positives must have matching transcriptions in addition to meeting the IoU threshold.
PWQ effectively combines localization accuracy (tightness), detection quality (polygon presence/absence), and word-level recognition accuracy into a single parameter-free measure. PWQ is the competition metric, giving strong preference to well-localized, textually accurate detections.
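In code, the only change from the Task 1 sketch is the true-positive criterion; a minimal sketch, reusing the hypothetical "matched" pairs from the word_pdq sketch above:

def recognition_true_positives(matched, gt_words, pred_words):
    # A pair counts toward TP_Rec only if its transcription also matches exactly.
    return [(r, c) for r, c in matched
            if gt_words[r]["text"] == pred_words[c]["text"]]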
Word-level recognition can be quite stringent, as measured by the PWQ calculated with $\mathrm{TP}_{\mathrm{Rec}}$, the set of matches whose transcriptions agree exactly. Therefore we also desire a metric that accounts for character-level recognition accuracy by comparing the edit distance between ground truth and predicted text. To this end, we introduce the Panoptic Character Quality (PCQ) metric:

$$\mathrm{PCQ} = T \times F \times C,$$

where $T \times F$ are as in the original PDQ metric, and $C$ represents the average complementary normalized edit distance of each word's text among the matched true positive detections $\mathrm{TP}_{\mathrm{Det}}$:

$$C = \frac{1}{|\mathrm{TP}_{\mathrm{Det}}|} \sum_{(g,p) \in \mathrm{TP}_{\mathrm{Det}}} \bigl(1 - \mathrm{NED}(\mathrm{text}_g, \mathrm{text}_p)\bigr).$$

Here NED is the normalized edit distance between the detected and ground truth text.
With its three factors T, F, C ∈ [0, 1], the product PCQ falls in the same range. PCQ thus combines localization accuracy (tightness), detection quality, and character-level recognition accuracy into a single measure without additional parameters.
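To illustrate C, here is a minimal Python sketch over the detection-matched transcription pairs; the plain Levenshtein routine and the max-length normalization are our assumptions about the NED details.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(pred, gt):
    # Normalized edit distance in [0, 1].
    denom = max(len(pred), len(gt))
    return levenshtein(pred, gt) / denom if denom else 0.0

def char_score(matched_texts):
    # matched_texts: (predicted_text, ground_truth_text) pairs in TP_Det.
    return sum(1 - ned(p, g) for p, g in matched_texts) / len(matched_texts)

# PCQ = tightness * f_score * char_score(...)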
Task 4 - Phrase Detection and Recognition
This task requires detection and recognition at the phrase level. Submissions must group words (polygons and transcriptions) into phrases, each an ordered list of words.
Submission Format
All elements of the submission file format described above are required; "text" transcriptions and groupings are all considered in the evaluation.
Evaluation Metric
As in Task 2, the unions of word polygons in the detection predictions and the ground truth groups are used to calculate IoUs. The competition metric is PCQ, where the ground truth linked string and the recognized linked string are the space-separated concatenations of the words in each group (using their list order). Note that word-segmentation errors in detections that are otherwise grouped correctly are not heavily penalized, because the edit distance only counts the spaces that are missing (under-segmentation) or extra (over-segmentation).
The PWQ measure and all the constituent terms are also reported for evaluation.
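For concreteness, a minimal sketch of the group-level strings that Task 4's PCQ compares; words are joined with single spaces in their list order. For example, predicting the single word "NEWYORK" against a ground truth group ("NEW", "YORK") compares "NEWYORK" with "NEW YORK", an edit distance of just 1 for the missing space.

def group_text(group):
    # "group" is an ordered list of word dicts; word order matters for Task 4.
    return " ".join(w["text"] for w in group)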
Evaluation Metric Summary
| Task | Competition Metric | Other Metrics |
| --- | --- | --- |
| 1 - Word Detection | Word PDQ | Word P/R/F/T |
| 2 - Phrase Detection | Group PDQ | Group P/R/F/T |
| 3 - Word Detection and Recognition | Word PWQ | Word PCQ, P/R/F/T/C |
| 4 - Phrase Detection and Recognition | Group PCQ | Group PWQ, P/R/F/T/C |
FAQs
- What training data is allowed?
- Can I use private data for the competition?
- Yes, but only under two conditions:
  - The use is disclosed with the submission, and
  - The data is made public for the benefit of the community. To be included in the competition, link(s) to the data must be included with any submission using private data.

  In particular, competitors must take great care to exclude any labeled training data that overlaps with the competition test data set. Historical maps are often printed in multiple editions from the same engraving or scanned into a variety of digital libraries. Entries whose labeled training data is discovered to contain a test map will be disqualified.
- Can I use synthetic data for the competition?
- Yes; submitters are strongly encouraged to share their data and/or synthesis method.
- Can I use labeled public data sets (e.g., ICDAR13, COCOText, ArT, etc.)?
- Yes; submitters are encouraged to indicate any training data sets used.
- Can I use publicly available unlabeled data?
- Yes.
- Does my submission have to include both data sets (Rumsey and French) for evaluation?
- No, the two data sets will be evaluated separately. Omitting one will not influence evaluation of the other. While such "empty" results will necessarily appear in the online evaluation, they will be manually excluded from the competition report.
- What is the character set for the data?
- The Latin character set (including diacritics and digraphs), numerals, punctuation, common keyboard symbols, and select special symbols such as ™ and ®/©.

  Maps from the Rumsey data set are expected to be in English, but because they cover worldwide historical geography, occasional diacritics (e.g., ü or ø) and digraphs (e.g., æ) are to be expected. The 19th-century Cadastre maps from the French data set are expected to exhibit typical qualities of that language, place, and time.
- How many submissions can I make?
- For the competition, each participant may submit as many times as they like, but only the final pre-deadline submission will be included in the competition report and ranking.
Challenge News
- 06/15/2024 - Competition Results Announced
- 06/07/2024 - Rumsey (General) Data Update 3
- 04/23/2024 - Rumsey (General) Data Update 2
- 04/17/2024 - IGN (French) Data Update
- 04/12/2024 - Final Submission Deadline Extended
- 03/15/2024 - Submission Deadline Extended
- 03/04/2024 - Test Data Available
- 02/20/2024 - Rumsey (General) Data Update
Important Dates
2 January 2024: Competition Announced
1 February 2024: Training and validation data released
1 March 2024: Competition test data released
6 May 2024 [Extended]: Final results submission deadline (AoE time zone)