Tasks - ICDAR 2024 Competition on Historical Map Text Detection, Recognition, and Linking

Tasks

The competition consists of multiple inter-related tasks on historical maps, including text detection and recognition at word and phrase levels. The four primary competition tasks are:

  1. Word Detection
  2. Phrase Detection (word linking)
  3. Word Detection and Recognition
  4. Phrase Detection and Recognition

We will evaluate each task on two datasets. One dataset consists of map images from the David Rumsey Map Collection, covering a wide range of map styles; the other, from French Land Registers, contains maps tailored to a specific place and time. See details on the Downloads page.

File Formats

All tasks share the same file formats for both ground truth and submissions, though certain fields or elements may be ignored when irrelevant to the task.

Note that all coordinates are given in pixels with respect to the image, starting at (0,0) in the top-left corner.

Ground Truth Format

The same ground truth file and format are used for all tasks: a list of dictionaries (one per image), each of which has a list of phrase groups, and each group is an ordered list of words. The JSON file (UTF-8 encoded) has the following format:

[ # Begin a list of images
    {
     "image": "IMAGE_NAME1",
     "groups": [ # Begin a list of phrase groups for the image
         [  # Begin a list of words for the phrase
           {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT1", "illegible": false, "truncated": false},
           ...,
           {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT2", "illegible": true, "truncated": false}
         ],
          ...
         [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT3", "illegible": false, "truncated": true}, ... ]
     ] },
    {
     "image": "IMAGE_NAME2",
     "groups": [
         [
           {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT4", "illegible": false, "truncated": false},
           ...,
           {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT5", "illegible": false, "truncated": false}
         ],
          ...
         [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT6", "illegible": false, "truncated": false}, ... ]
     ] },
     ...
]

(Small sample file.) For each image, the "groups" field stores a list of groups, where each entry represents the list of words within a group, given in reading order (for task 4). The "vertices" field of each word stores a list of coordinates (number pairs) representing the vertices of a bounding polygon for the given word; there must be at least three points and no specific arrangement is required otherwise.


For detection tasks (1 and 2), the "text" field of the words can be ignored. For non-grouping tasks (1 and 3), the grouping structure can be ignored and only the lists of words are needed.

Words that are marked truncated or illegible will be ignored in the evaluation. For linking tasks (2 and 4), groups that contain any ignored word will be ignored in the evaluation.
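As a rough illustration of the ignore rules above, the following Python sketch counts the words and groups that remain scorable in a ground truth file. The helper names (is_ignored, count_scored) and the sample image name are made up for this example and are not part of any official tooling.

```python
import json

def is_ignored(word):
    """A word is ignored when it is marked illegible or truncated."""
    return bool(word.get("illegible")) or bool(word.get("truncated"))

def count_scored(ground_truth):
    """Return {image name: (scored word count, scored group count)}."""
    summary = {}
    for entry in ground_truth:
        groups = entry["groups"]
        # Tasks 1 and 3: flatten the groups and drop ignored words.
        words = [w for g in groups for w in g if not is_ignored(w)]
        # Tasks 2 and 4: drop every group that contains an ignored word.
        kept = [g for g in groups if not any(is_ignored(w) for w in g)]
        summary[entry["image"]] = (len(words), len(kept))
    return summary

# Hypothetical two-group image; the second group contains a truncated word.
sample = json.loads("""
[{"image": "map1.png", "groups": [
  [{"vertices": [[0,0],[10,0],[10,5],[0,5]], "text": "LAKE",
    "illegible": false, "truncated": false},
   {"vertices": [[12,0],[30,0],[30,5],[12,5]], "text": "ERIE",
    "illegible": false, "truncated": false}],
  [{"vertices": [[0,20],[9,20],[9,25]], "text": "Riv",
    "illegible": false, "truncated": true},
   {"vertices": [[11,20],[20,20],[20,25]], "text": "Road",
    "illegible": false, "truncated": false}]
]}]
""")

print(count_scored(sample))  # {'map1.png': (3, 1)}
```

In this sample the word "Road" still counts for the word-level tasks, while its whole group is dropped for the linking tasks because it shares the group with a truncated word.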

Submission Format

All tasks use the same basic submission format (and can accept the same file): a list of dictionaries (one per image), each of which has a list of phrase groups, consisting of an ordered list of predicted words. Some fields and structures will be ignored in certain tasks when they are irrelevant. The JSON file (UTF-8 encoded) has the following format:

[ # Begin a list of images
    {
     "image": "IMAGE_NAME1",
     "groups": [ # Begin a list of phrase groups for the image
        [ # Begin a list of words for the phrase
          {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT1"},
          ...,
          {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT2"}
       ],
       ...
       [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT3"}, ... ]
    ] },
    {
     "image": "IMAGE_NAME2",
     "groups": [
        [
          {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT4"},
          ...,
          {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT5"}
        ],
        ...
        [ {"vertices": [[x1, y1], [x2, y2], ..., [xN, yN]], "text": "TEXT6"}, ... ] 
    ] },
    ...
]

(Small sample file.) The "groups" field for each image stores a list of groups, where each entry represents the list of words within a group. The "vertices" field of each word stores a list of coordinates (number pairs) representing the vertices of a bounding polygon for the given word; there must be at least three points and no specific arrangement is required otherwise.

Note:

  • The membership of words in groups is only considered during evaluation of Tasks 2 and 4.
  • The "text" field is only considered during evaluation of Tasks 3 and 4.
  • The word order (within a group) is only considered during evaluation of Task 4.
  • The group order (within an image) is never considered.
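One possible way to assemble and sanity-check a submission file is sketched below in Python. The function names (make_submission, check_submission) and the input layout (a dictionary from image name to phrases) are assumptions made for this example; only the output structure follows the format described above.

```python
import json

def make_submission(predictions):
    """predictions: {image name: [phrase, ...]}, where each phrase is an
    ordered list of (vertices, text) pairs."""
    return [
        {"image": image,
         "groups": [[{"vertices": v, "text": t} for v, t in phrase]
                    for phrase in phrases]}
        for image, phrases in predictions.items()
    ]

def check_submission(submission):
    """Raise ValueError for the structural problems named in the format."""
    for entry in submission:
        if "image" not in entry or "groups" not in entry:
            raise ValueError("each image entry needs 'image' and 'groups'")
        for group in entry["groups"]:
            for word in group:
                vertices = word.get("vertices", [])
                if len(vertices) < 3:
                    raise ValueError("a polygon needs at least three vertices")
                if any(len(point) != 2 for point in vertices):
                    raise ValueError("each vertex must be an [x, y] pair")

# Hypothetical prediction: one image with a single two-word phrase.
preds = {"map1.png": [[([[0, 0], [10, 0], [10, 5], [0, 5]], "LAKE"),
                       ([[12, 0], [30, 0], [30, 5], [12, 5]], "ERIE")]]}
submission = make_submission(preds)
check_submission(submission)
# ensure_ascii=False keeps diacritics readable in the UTF-8 output.
text = json.dumps(submission, ensure_ascii=False)
```

For Tasks 1 and 3, where grouping is ignored, the same helper could be fed one phrase per word or a single phrase holding every word.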

Task 1 - Word Detection

The task requires detecting individual words on map images, i.e., generating bounding polygons that enclose text instances at the word level.

Submission Format

The optional "text" field of each word is ignored in evaluation and not required in the JSON file; this allows the same file to be submitted for every task. Grouping is not required for this task; although the "groups" field is required in the JSON file, the group-level organization is ignored by the evaluation. For example, each image could have one group containing all the words, or each word could belong to its own group.

Evaluation Metric

Following the HierText competition, word detection is evaluated with the Panoptic Detection Quality (PDQ),

$$\mathrm{PDQ} = T \times F,$$

where the tightness T is the average IoU among true positive regions,

$$T = \frac{1}{|TP|} \sum_{(g,d) \in TP} \mathrm{IoU}(g,d),$$

and F represents the F-score, the harmonic mean of precision P and recall R:

$$F = \frac{2PR}{P+R}.$$

Note that matches in TP are determined by solving the linear sum assignment problem on the weighted bipartite graph between ground truth and predicted regions where IoU > 0.5. Predictions matched to ground truth regions that are to be ignored are not counted in the evaluation.

While PDQ will be the competition metric, the conventional evaluation metrics precision P, recall R, and F-score F are reported to get a coarse sense of relative raw detection quality, separate from the localization tightness T.
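To make the relationship between these terms concrete, here is a minimal Python sketch that computes P, R, F, T, and PDQ from an already-matched set of detections. It assumes the matching step (the linear sum assignment over pairs with IoU > 0.5) has been done elsewhere and that ignored regions have already been removed from the counts; the function name pdq_terms and the sample numbers are illustrative only.

```python
def pdq_terms(matched_ious, num_predictions, num_ground_truth):
    """Compute detection terms from the IoUs of the true positive matches.

    matched_ious: IoU of each matched (ground truth, prediction) pair.
    num_predictions / num_ground_truth: counts after ignored regions
    (and predictions matched to them) have been excluded.
    """
    tp = len(matched_ious)
    if tp == 0:
        return {"P": 0.0, "R": 0.0, "F": 0.0, "T": 0.0, "PDQ": 0.0}
    precision = tp / num_predictions
    recall = tp / num_ground_truth
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    # Tightness: average IoU over the true positives.
    tightness = sum(matched_ious) / tp
    return {"P": precision, "R": recall, "F": f_score,
            "T": tightness, "PDQ": tightness * f_score}

# Hypothetical image: 4 predictions, 5 ground truth words, 3 matches.
terms = pdq_terms([0.9, 0.7, 0.8], num_predictions=4, num_ground_truth=5)
print({name: round(value, 4) for name, value in terms.items()})
```

With these numbers P = 0.75 and R = 0.6, so F is about 0.667; with T = 0.8 the PDQ comes out near 0.533, noticeably lower than F alone because the matches are not perfectly tight.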

Task 2 - Phrase Detection

This task requires words to be detected (with their polygon boundaries) and grouped into constituent lists for label phrases. Words from the same group (phrase) are treated as one unit for (joint) detection.

Submission Format

As in Task 1, the optional "text" field of each word is ignored in evaluation and not required in the JSON file; this allows the same file to be submitted for every task. However, the evaluation is sensitive to group-level organization among the words. The word order is not considered during evaluation.

Evaluation Metric

For each group, the unions of word polygons in the detection predictions and ground truth are used to calculate the IoU and thence the PDQ score at the phrase level. The word order is ignored. With group-level detection quality the official competition metric, the constituent terms F-score F, precision P, recall R, and tightness T are also reported.

Task 3 - Word Detection and Recognition

In this task, participants are expected to produce word-level text detection and recognition results, e.g., generating a set of word bounding polygons and corresponding transcriptions.

Submission Format

Submissions require the "text" field for each word. As in Task 1, grouping is not required for this task; although the "groups" field is required in the JSON file, any group-level organization is ignored by the evaluation. For example, each image could have one group containing all the words, or each word could belong to its own group.

Evaluation Metric

We introduce the Panoptic Word Quality (PWQ) metric, which is identical to PDQ, except that true positives must have matching transcriptions in addition to meeting the IoU threshold.

PWQ effectively combines localization accuracy (tightness), detection quality (polygon presence/absence), and word-level recognition accuracy into a single parameter-free measure. PWQ is the competition metric, giving strong preference for well-localized, textually accurate detections.

Word-recognition accuracy can be quite stringent, as measured by the PWQ calculated with TP_Rec. Therefore we also desire a metric that accounts for character-level recognition accuracy by comparing the edit distance between ground truth and predicted text. To this end, we introduce the Panoptic Character Quality (PCQ) metric:

$$\mathrm{PCQ} = T \times F \times C,$$

where T and F are as in the original PDQ metric, and C represents the average complementary normalized edit distance of each word's text among the matched true positive detections TP_Det:

$$C = \frac{1}{|TP_{\mathrm{Det}}|} \sum_{(g,d) \in TP_{\mathrm{Det}}} \bigl(1 - \mathrm{NED}(g,d)\bigr).$$

Here NED is the normalized edit distance between the detected and ground truth text.

With its three factors T,F,C∈[0,1], the product PCQ falls into the same range. Thus, the PCQ combines localization accuracy (tightness), detection quality, and character-level recognition accuracy into a single measure without additional parameters.
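The PCQ terms can be sketched in a few lines of Python. This example assumes that NED divides the Levenshtein distance by the length of the longer string, which is one common normalization; the text above does not fix the normalizer, so treat that choice, the function names, and the sample words as assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ned(predicted, truth):
    """Assumed normalization: divide by the longer string's length."""
    longest = max(len(predicted), len(truth))
    return edit_distance(predicted, truth) / longest if longest else 0.0

def pcq(matches, num_predictions, num_ground_truth):
    """matches: (IoU, predicted text, ground truth text) for each
    detection-level true positive."""
    tp = len(matches)
    if tp == 0:
        return 0.0
    precision = tp / num_predictions
    recall = tp / num_ground_truth
    f_score = 2 * precision * recall / (precision + recall)
    tightness = sum(iou for iou, _, _ in matches) / tp
    # C: average complementary NED over the matched detections.
    char_quality = sum(1 - ned(p, g) for _, p, g in matches) / tp
    return tightness * f_score * char_quality

# Hypothetical matches: one exact transcription, one with two wrong letters.
matches = [(0.9, "Lake", "Lake"), (0.7, "Eire", "Erie")]
score = pcq(matches, num_predictions=2, num_ground_truth=2)
print(round(score, 4))
```

Here F = 1 and T = 0.8; the second word has NED = 2/4, so C = 0.75 and PCQ is about 0.6. Under exact word matching the second word would not count as a true positive at all, which illustrates why the character-level measure is the gentler of the two.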

Task 4 - Phrase Detection and Recognition

This task requires detection and recognition at the phrase level. Submissions must group words (polygons and transcriptions) into phrases, each of which is an ordered list of words.

Submission Format

All elements of the submission file format described above are required; "text" transcriptions and groupings are all considered in the evaluation.

Evaluation Metric

As in Task 2, the unions of word polygons in the detection predictions and ground truth groups are used for calculating IoUs. The competition metric is PCQ, where the ground truth linked string and the recognized linked string will be the space-separated concatenation of the words in each group (using their list order). Note that detection-level errors in word segmentation that are otherwise grouped correctly are not as heavily penalized because the edit distance will only count the spaces when missing (an under-segmentation) or extra (an over-segmentation).
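The effect of word segmentation errors on the linked string can be seen in a small Python sketch. The helper phrase_text and the example phrase are hypothetical; the edit distance is the standard Levenshtein distance, used here only to count the differing characters.

```python
def phrase_text(group):
    """Space-separated concatenation of a group's words, in list order."""
    return " ".join(word["text"] for word in group)

def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Hypothetical ground truth phrase and two imperfect predictions.
truth = [{"text": "Lake"}, {"text": "Superior"}]
under = [{"text": "LakeSuperior"}]                            # words merged
over = [{"text": "Lake"}, {"text": "Super"}, {"text": "ior"}]  # word split

for name, group in [("under-segmented", under), ("over-segmented", over)]:
    distance = edit_distance(phrase_text(group), phrase_text(truth))
    print(name, repr(phrase_text(group)), distance)
```

Both predictions differ from "Lake Superior" by a single space, so each costs one edit even though the word boundaries are wrong.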

The PWQ measure and all the constituent terms are also reported for evaluation.

Evaluation Metric Summary

 Task                                    Competition Metric    Other Metrics
 1 - Word Detection                      Word PDQ              Word P/R/F/T
 2 - Phrase Detection                    Group PDQ             Group P/R/F/T
 3 - Word Detection and Recognition      Word PWQ              Word PCQ, P/R/F/T/C
 4 - Phrase Detection and Recognition    Group PCQ             Group PWQ, P/R/F/T/C

 

FAQs

  1. What training data is allowed?
    • Can I use private data for the competition?
      • Yes, but only under two conditions:

        1. The use is disclosed with the submission, and
        2. The data is made public for the benefit of the community. To be included in the competition, link(s) to the data must be included with any submission using private data.

        In particular, competitors must take great care to exclude any labeled training data that overlaps with the competition test data set. Historical maps are often printed in multiple editions from the same engraving or scanned into a variety of digital libraries. Entries whose labeled training data is discovered to contain a test map will be disqualified.
    • Can I use synthetic data for the competition?
      • Yes; submitters are strongly encouraged to share their data and/or synthesis method.
    • Can I use labeled public data sets (e.g., ICDAR13, COCOText, ArT)?
      • Yes; submitters are encouraged to indicate any training data sets used.
    • Can I use publicly available unlabeled data?
      • Yes.
  2. Does my submission have to include both data sets (Rumsey and French) for evaluation?
    • No, the two data sets will be evaluated separately. Omitting one will not influence evaluation of the other. While such "empty" results will necessarily appear in the online evaluation, they will be manually excluded from the competition report.
  3. What is the character set for the data?
    • The Latin character set (including diacritics and digraphs), numerals, punctuation, common keyboard symbols, and select special symbols such as ™ and ®/©.

      Maps from the Rumsey data set are expected to be in English, but because they cover worldwide historical geography, occasional diacritics (e.g., ü or ø) and digraphs (e.g., æ) are to be expected. The 19th-century Cadastre maps from the French data are expected to exhibit typical qualities of that language, place, and time.
  4. How many submissions can I make?
    • For the competition, each participant may submit as many times as they like, but only the final pre-deadline submission will be included in the competition report and ranking.

Important Dates

2 January 2024: Competition Announced

1 February 2024: Training and validation data released

1 March 2024: Competition test data released

29 April 2024: Final results submission deadline (AoE time zone)