Tasks - Hierarchical Text: Challenge on Unified OCR and Layout Analysis

Task 1: Hierarchical Text Detection

Task definition

This task combines three sub-tasks: word detection, text line detection, and paragraph detection.

In this task, participants will be provided with images and are expected to produce hierarchical text detection results. Specifically, the results consist of word-level bounding polygons plus line and paragraph clusters built on top of the words. The clusters form a forest: each paragraph is a tree, lines are the children of paragraphs, and words are the children of lines and form the leaves. Note that, for this task, participants do not need to provide text recognition results.

The submission will be a jsonl file in the following format:

{
  "annotations": [  // List of dictionaries, one for each image.
    {
      "image_id": "the filename of corresponding image.",
      "paragraphs": [  // List of paragraphs.
        {
          "lines": [  // List of lines.
            {
              "text": "",  // Set to empty string.
              "words": [  // List of words.
                {
                  "vertices": [[x1, y1], [x2, y2],...,[xm, ym]],
                  "text": "the text content of this word",  // Set to empty string for detection-only evaluation.
                }, ...
              ]
            }, ...
          ]
        }, ...
      ]
    }, ...
  ]
}
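
As a concrete illustration, below is a minimal Python sketch that serializes hierarchical results into this format. The function name `write_submission` and the `results` structure are our own illustrative choices, not part of any official tooling.

import json

def write_submission(results, out_path="submission.jsonl"):
    # `results` maps each image filename to its paragraphs; a paragraph is
    # a list of lines, and a line is a list of (vertices, text) word pairs.
    annotations = []
    for image_id, paragraphs in results.items():
        annotations.append({
            "image_id": image_id,
            "paragraphs": [{
                "lines": [{
                    "text": "",  # empty: recognition is not required in Task 1
                    "words": [{"vertices": v, "text": t} for v, t in line],
                } for line in paragraph],
            } for paragraph in paragraphs],
        })
    with open(out_path, "w") as f:
        json.dump({"annotations": annotations}, f)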

Evaluation and ranking

Fig. 1. Text hierarchies represented as masks.

As illustrated in Fig. 1, we evaluate this task as three instance segmentation sub-tasks, for words, lines, and paragraphs respectively. At the word level, each word is one instance. At the line level, the union of a line's child words forms one instance. At the paragraph level, the union of a paragraph's child lines forms one instance. With this formulation, all three sub-tasks are evaluated with the PQ metric for instance segmentation, as specified in [1].
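
For reference, the standard Panoptic Quality definition from the panoptic segmentation literature, which [1] adapts, matches predictions p to ground truths g at IoU > 0.5 to form the true positive set TP, and computes

PQ = \frac{\sum_{(p, g) \in TP} \mathrm{IoU}(p, g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}

Consult [1] for the exact adaptation used in this challenge.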

Each submission receives three PQ scores, for words, lines, and paragraphs respectively, and each of these sub-tasks has its own sub-ranking. For the final ranking of the whole task, we compute the final score as the harmonic mean of the three PQ scores.
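
For instance, a submission with hypothetical PQ scores of 0.80, 0.70, and 0.60 would receive the following final score (a sketch of the computation described above, not the official scoring script):

from statistics import harmonic_mean

# Hypothetical PQ scores for the word, line, and paragraph sub-tasks.
pq_word, pq_line, pq_paragraph = 0.80, 0.70, 0.60

# 3 / (1/0.80 + 1/0.70 + 1/0.60) ≈ 0.690
final_score = harmonic_mean([pq_word, pq_line, pq_paragraph])
print(f"{final_score:.3f}")  # 0.690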

Task 2: Word-Level End-to-End Text Detection and Recognition

For this task, images will be provided and participants are expected to produce word-level text detection and recognition results, i.e. a set of word bounding polygons and transcriptions for each image. This is a challenging task, as the dataset has the densest images, with more than 100 words per image on average, three times as many as the next densest dataset, TextOCR.

The submission format is the same as in Task 1, except that the "text" fields for words are required. Line and paragraph clustering is not required: if your method does not produce line and paragraph clusters, you can simply treat each word as its own line and paragraph, i.e. each paragraph dict has one line, and each line has one word. The clustering does not affect word-level end-to-end scores. For example, Submissions 1 and 2 below receive exactly the same scores even though they represent the layout differently.

Submission 1:

{
 "annotations": [
  {
   "image_id": "123456789a",
   "paragraphs": [
    {
     "lines": [
      {
       "text": "",
       "words": [
        {
         "vertices": [[0, 0], [10, 0], [10, 10], [0, 10]],
         "text": "test1"
        },
        {
         "vertices": [[10, 0], [20, 0], [20, 10], [10, 10]],
         "text": "test2"
        }
       ]
      }
     ]
    }
   ]
  }
 ]
}

Submission 2:

{
 "annotations": [
  {
   "image_id": "123456789a",
   "paragraphs": [
    {
     "lines": [
      {
       "text": "",
       "words": [
        {
         "vertices": [[0, 0], [10, 0], [10, 10], [0, 10]],
         "text": "test1"
        }
       ]
      }
     ]
    },
    {
     "lines": [
      {
       "text": "",
       "words": [
        {
         "vertices": [[10, 0], [20, 0], [20, 10], [10, 10]],
         "text": "test2"
        }
       ]
      }
     ]
    }
   ]
  }
 ]
}
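
If your system produces only a flat list of words, a sketch like the following wraps each word in its own line and paragraph, yielding the Submission 2 structure. The function name `words_to_submission` is illustrative, not part of any official tooling.

import json

def words_to_submission(image_id, words):
    # `words` is a list of (vertices, text) pairs. Each word becomes its
    # own single-line, single-word paragraph, as in Submission 2 above.
    return {
        "annotations": [{
            "image_id": image_id,
            "paragraphs": [
                {"lines": [{"text": "", "words": [{"vertices": v, "text": t}]}]}
                for v, t in words
            ],
        }]
    }

submission = words_to_submission("123456789a", [
    ([[0, 0], [10, 0], [10, 10], [0, 10]], "test1"),
    ([[10, 0], [20, 0], [20, 10], [10, 10]], "test2"),
])
with open("submission.jsonl", "w") as f:
    json.dump(submission, f)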

Evaluation and ranking

For evaluation, we use the F1 measure, the harmonic mean of word-level precision and recall. A word result is counted as a true positive if its IoU with a ground-truth polygon is greater than or equal to 0.5 and its transcription is identical to the ground truth. The transcription comparison considers all characters and is case-sensitive. Note that the dataset contains illegible words; detecting or missing these words has no effect on the results. Ground truths marked as illegible do not count as false negatives even if they are unmatched, and detections overlapping more than 50% with illegible ground-truth regions are discarded before evaluation.
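
A minimal sketch of this word-level F1 computation is shown below, assuming shapely for polygon IoU. It uses greedy one-to-one matching and omits the illegible-region filtering described above; the official evaluation script may differ in details.

from shapely.geometry import Polygon

def word_f1(predictions, ground_truths, iou_threshold=0.5):
    # predictions / ground_truths: lists of (vertices, text) pairs.
    # A prediction matches a ground truth if IoU >= 0.5 and the
    # transcriptions are identical (case-sensitive, all characters).
    matched = set()
    true_positives = 0
    for pred_vertices, pred_text in predictions:
        pred_poly = Polygon(pred_vertices)
        for i, (gt_vertices, gt_text) in enumerate(ground_truths):
            if i in matched:
                continue
            gt_poly = Polygon(gt_vertices)
            iou = pred_poly.intersection(gt_poly).area / pred_poly.union(gt_poly).area
            if iou >= iou_threshold and pred_text == gt_text:
                matched.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)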


References

  1. Long, Shangbang, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, and Michalis Raptis. "Towards End-to-End Unified Scene Text Detection and Layout Analysis." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1049-1059. 2022.

Important Dates

All dates are final.

- 2023 Jan 2nd: Start of the competition; submission of results opens.

- 2023 Apr 1st 23:59 PST: Deadline for submissions to the ICDAR 2023 competition.

- 2023 Apr 15th: Release of competition results.