Tasks - ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text

Our proposed competition consists of three main tasks:

  1. scene text detection,
  2. scene text recognition,
  3. scene text spotting.

Note

Participants are free to use publicly available datasets (e.g. ICDAR2015, MSRA-TD500, COCO-Text, and MLT) or synthetic images as extra training data for this competition. Private data that is not publicly accessible is not permitted.

Ground truth format

Tasks 1 and 3

We create a single JSON file that covers all images in the dataset to store the ground truth in a structured format, following the naming convention:

gt_[image_id], where image_id refers to the index of the image in the dataset.

In the JSON file, each gt_[image_id] key corresponds to a list, where each entry in the list corresponds to one word in the image and gives its bounding box coordinates, transcription, language type, and difficulty flag, in the following format:

{
    "gt_1": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans1", "language": "Latin", "illegibility": false},
        …
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans2", "language": "Chinese", "illegibility": false}
    ],
    "gt_2": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans3", "language": "Latin", "illegibility": false}
    ],
    ……
}

where x1, y1, x2, y2, …, xn, yn in "points" are the coordinates of the polygon bounding box, which may have 4, 8, 10, or 12 vertices. "transcription" denotes the text of each text line, and "language" denotes the language type of the transcription, which is either "Latin" or "Chinese". Similar to COCO-Text [3] and ICDAR2015 [2], "illegibility" marks a "Do Not Care" text region when it is set to true; such regions do not influence the results.
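
For illustration, a minimal Python sketch of reading this ground truth file is given below. The file name "train_labels.json" is our own placeholder, not necessarily the official one, and "Do Not Care" regions are skipped via the "illegibility" flag.

    import json

    # Placeholder file name; substitute the name of the released annotation file.
    with open("train_labels.json", "r", encoding="utf-8") as f:
        gt = json.load(f)

    for image_key, annotations in gt.items():      # keys are "gt_1", "gt_2", ...
        image_id = int(image_key.split("_")[1])    # index of the image in the dataset
        for ann in annotations:
            if ann["illegibility"]:                # "Do Not Care" region, skip for training
                continue
            polygon = ann["points"]                # [[x1, y1], ..., [xn, yn]]
            text = ann["transcription"]
            script = ann["language"]               # "Latin" or "Chinese"
            # ... hand (polygon, text, script) to your training pipeline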

Task 2

The input will be cropped image patches containing the corresponding text instances, together with the polygon coordinates relative to each patch. Similar to Task 1, we create a single JSON file to store the ground truth for all images in the dataset in a structured format, following the naming convention:

gt_[image_id], where image_id refers to the index of the image in the dataset.

{
    "gt_1": [{"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans1", "language": "Latin", "illegibility": false}],
    "gt_2": [{"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans2", "language": "Latin", "illegibility": false}],
    "gt_3": [{"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans3", "language": "Latin", "illegibility": false}],
    ……
}

Note that the polygon coordinates are provided as optional information. Participants are free to decide whether or not to use them.
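
As an illustration of one possible use of this optional information, the sketch below masks out background pixels of a cropped patch with the polygon before recognition. The file names and the OpenCV-based approach are our own assumptions, not part of the official pipeline.

    import json

    import cv2
    import numpy as np

    # Placeholder file names; substitute the released annotation and image names.
    with open("train_task2_labels.json", "r", encoding="utf-8") as f:
        gt = json.load(f)

    ann = gt["gt_1"][0]                            # single text instance per patch
    patch = cv2.imread("gt_1.jpg")                 # cropped word/line image

    # Fill the polygon (coordinates are relative to the patch) and keep only
    # the pixels inside it, suppressing background clutter.
    mask = np.zeros(patch.shape[:2], dtype=np.uint8)
    points = np.array(ann["points"], dtype=np.int32)
    cv2.fillPoly(mask, [points], 255)
    masked_patch = cv2.bitwise_and(patch, patch, mask=mask)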

Figure 1: Example images of the ArT dataset. Red bounding lines are formed by the polygon ground truth vertices. All images in this dataset are saved with the ".jpg" suffix.

Figure 2: Polygon ground truth format of ArT.

 

Figure 2 illustrates all the mentioned attributes. It is worth pointing out that this polygon ground truth format differs from all previous RRCs, which used axis-aligned bounding boxes [1, 3] or quadrilaterals [2] as the only ground truth format, with two and four vertices respectively; these are inappropriate for the arbitrarily oriented text instances in ArT, especially the curved ones. Both Chinese and Latin scripts are annotated in ArT. Following the practice of the MLT dataset [5], we annotate Chinese scripts with line-level granularity and Latin scripts with word-level granularity.

Download submission example here: ArT-gt-example.zip

Task 1: Scene Text Detection

The main objective of this task is to detect the location of every text instance in a given input image, which is similar to the previous RRC scene text detection tasks. The input of this task is strictly constrained to the image only; no other form of input is allowed to aid the model in detecting the text instances.

  • Input: Scene text image
  • Output: Spatial location of every text instance at word-level for Latin scripts, and line-level for Chinese scripts.

Results format

The naming of all submitted results should follow the format res_[image_id]. For example, the result entry corresponding to the input image "gt_1.jpg" should be named "res_1". Participants are required to submit the detection results for all images in a single JSON file. The submission file format is as follows:

{
    "res_1": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c},
        …
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c}
    ],
    "res_2": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c}
    ],
    ……
}

where the key of the JSON file should adhere to the format res_[image_id]. Also, n is the total number of vertices (which need not be fixed and may vary across predicted text instances), and c is the confidence score of the prediction. To encourage different approaches to this challenge, we provide a default wrapper script for participants whose models produce masks as their final output, to convert the masks into polygon vertices before submitting their results for evaluation. Participants are free and encouraged to use their own methods to convert mask outputs into polygon vertices.
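
As an illustration only (not the provided wrapper script), the sketch below converts a binary instance mask into polygon vertices with OpenCV and assembles the submission dictionary. The function names, the approximation tolerance, and the assumed structure of predictions_per_image are our own choices.

    import json

    import cv2

    def mask_to_polygon(mask, max_points=20):
        """Convert a binary text-instance mask (uint8, 0/255) into a vertex list."""
        # [-2] keeps this compatible with both OpenCV 3.x and 4.x return values.
        contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
        if not contours:
            return None
        contour = max(contours, key=cv2.contourArea)
        epsilon = 0.01 * cv2.arcLength(contour, True)          # assumed tolerance
        approx = cv2.approxPolyDP(contour, epsilon, True).reshape(-1, 2)
        return approx[:max_points].tolist()

    def build_detection_submission(predictions_per_image, out_path="submission_task1.json"):
        """predictions_per_image: {image_id: [(binary_mask, confidence), ...]} (assumed)."""
        submission = {}
        for image_id, predictions in predictions_per_image.items():
            results = []
            for mask, confidence in predictions:
                polygon = mask_to_polygon(mask)
                if polygon is None:
                    continue
                results.append({"points": polygon, "confidence": float(confidence)})
            submission["res_%d" % image_id] = results
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(submission, f)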

Download submission example here: ArT-detection-example.zip

Evaluation metrics

For T1, we adopt an IoU-based evaluation protocol, following CTW1500 [4]. IoU is a threshold-based evaluation protocol, with 0.5 set as the default threshold. We will report results at the 0.5 and 0.7 thresholds, but only the H-Mean at 0.5 will be treated as the final score for each submitted model and used for ranking submissions. To ensure fairness, competitors are required to submit a confidence score for each detection so that we can iterate over all confidence thresholds to find the best H-Mean score. In the case of multiple matches, we only consider the detection region with the highest IoU; the remaining matches are counted as False Positives. Precision, Recall, and F-score are calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F = 2 × Precision × Recall / (Precision + Recall)

 

where TP, FP, FN and F denote true positive, false positive, false negative and H-Mean, respectively.

All illegible text instances and symbols are labeled as "Do Not Care" regions, which do not contribute to the evaluation result.
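
The sketch below illustrates the matching logic described above (highest-IoU, one-to-one matching at a 0.5 threshold, with "Do Not Care" regions excluded); it uses the shapely library for polygon IoU and is not the official evaluation script.

    from shapely.geometry import Polygon

    def polygon_iou(points_a, points_b):
        """IoU of two polygons given as [[x1, y1], ..., [xn, yn]] vertex lists."""
        if len(points_a) < 3 or len(points_b) < 3:
            return 0.0
        a = Polygon(points_a).buffer(0)            # buffer(0) repairs self-intersections
        b = Polygon(points_b).buffer(0)
        if a.is_empty or b.is_empty:
            return 0.0
        union = a.union(b).area
        return a.intersection(b).area / union if union > 0 else 0.0

    def evaluate_image(gt_polys, gt_ignore, det_polys, iou_threshold=0.5):
        """Greedy per-image matching; returns (TP, FP, number of valid GT regions)."""
        matched = set()
        tp = fp = 0
        for det in det_polys:
            best_iou, best_gt = 0.0, None
            for i, gt in enumerate(gt_polys):
                iou = polygon_iou(det, gt)
                if iou > best_iou:
                    best_iou, best_gt = iou, i
            if best_iou >= iou_threshold and gt_ignore[best_gt]:
                continue                           # hits a "Do Not Care" region: ignored
            if best_iou >= iou_threshold and best_gt not in matched:
                matched.add(best_gt)
                tp += 1
            else:
                fp += 1                            # unmatched or duplicate (multiple match)
        valid_gt = sum(1 for ignore in gt_ignore if not ignore)
        return tp, fp, valid_gt

Aggregating TP, FP, and valid GT counts over all images then gives Precision, Recall, and H-Mean as defined above.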

Task 2: Scene Text Recognition

The main objective of this task is to recognize every character in a cropped image patch, which is also one of the common tasks in previous RRCs. Considering that research on text recognition for Chinese scripts is relatively less mature than for Latin scripts, we decided to further break T2 down into two subcategories:

  1. T2.1 - Latin script only,
  2. T2.2 - Latin and Chinese scripts.

We hope that such a split makes this task friendlier for non-Chinese-speaking participants, as the main problem we are trying to address in this competition is the challenge of arbitrarily shaped text.

  • Input: Cropped image patch with text instance.
  • Output: A string of predicted characters.

Results format

For T2, participants are required to submit the predicted transcripts for all the images in a single JSON file:

{
    "res_1": [{"transcription": "trans1"}],
    "res_2": [{"transcription": "trans2"}],
    "res_3": [{"transcription": "trans3"}],
    ……
}

where the key of the JSON file should adhere to the format res_[image_id].

Note: Participants are required to make a single submission only, regardless of script. We will evaluate all submissions under two categories: Latin and mixed (Latin and Chinese) scripts. When evaluating recognition performance for the Latin script, all non-Latin transcriptions will be treated as "Do Not Care" regions.
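
A minimal sketch of writing this file is shown below; serialising with json.dump (with ensure_ascii=False) keeps Chinese transcriptions readable and handles quote escaping. The dictionary structure of the predictions is our own assumption.

    import json

    def build_recognition_submission(transcriptions, out_path="submission_task2.json"):
        """transcriptions: {image_id: predicted_string} (assumed structure)."""
        submission = {
            "res_%d" % image_id: [{"transcription": text}]
            for image_id, text in transcriptions.items()
        }
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(submission, f, ensure_ascii=False)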

Download submission example here: ArT_recognition_example.zip

Evaluation metrics

For T2.1, case-insensitive word accuracy will be taken as the primary challenge metric. Apart from this, we follow the standard practice for text recognition evaluation: i) for ground truths that contain symbols, symbols in the middle of words are kept; ii) the symbols ( !?.:,*"()·[]/'_ ) at the beginning and at the end of both the ground truths and the submissions are removed.
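
A sketch of this comparison, as we read the rules above, is given below: the listed symbols are stripped only from the ends of each string and the comparison is case-insensitive. The official implementation may differ in detail.

    # Symbols stripped from the beginning and end of ground truths and submissions.
    EDGE_SYMBOLS = " !?.:,*\"()·[]/'_"

    def normalise_latin(text):
        return text.strip(EDGE_SYMBOLS).lower()

    def word_correct(prediction, ground_truth):
        return normalise_latin(prediction) == normalise_latin(ground_truth)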

For T2.2, we adopt the Normalized Edit Distance metric (specifically, 1-N.E.D) and case-insensitive word accuracy. 1-N.E.D was also used in the ICDAR 2017 competition on reading Chinese text in the wild (RCTW-17) [6]. Only 1-N.E.D will be treated as the official ranking metric, although the results of both metrics will be published. The Normalized Edit Distance (N.E.D) is formulated as follows:

N.E.D = (1/N) × Σ_{i=1..N} D(s_i, ŝ_i) / max(len(s_i), len(ŝ_i))

where D(·) stands for the Levenshtein distance, and s_i and ŝ_i denote the predicted text string and the corresponding ground truth, respectively. Note that the corresponding ground truth ŝ_i is selected over all ground truth locations as the one with the maximum IoU against the prediction s_i, forming a pair. N is the maximum number of "paired" GT and detected regions, which includes singletons: GT regions that were not matched with any detection (paired with NULL / the empty string) and detections that were not matched with any GT region (paired with NULL / the empty string).
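
A straightforward implementation of this metric might look like the sketch below, where the pairing of predictions with ground truths (including NULL pairings as empty strings) is assumed to have been done already.

    def levenshtein(a, b):
        """Dynamic-programming Levenshtein (edit) distance between two strings."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(previous[j] + 1,                 # deletion
                                   current[j - 1] + 1,              # insertion
                                   previous[j - 1] + (ca != cb)))   # substitution
            previous = current
        return previous[-1]

    def one_minus_ned(pairs):
        """pairs: list of (predicted, ground_truth) strings, including singletons
        paired with the empty string; returns 1 - N.E.D as defined above."""
        if not pairs:
            return 1.0
        total = 0.0
        for pred, gt in pairs:
            denom = max(len(pred), len(gt))
            total += levenshtein(pred, gt) / denom if denom > 0 else 0.0
        return 1.0 - total / len(pairs)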

The reason we chose 1-N.E.D as the official ranking metric for T2.2 is that Chinese script has a much larger vocabulary and usually longer text lines than Latin script, which makes the word accuracy metric too harsh to properly evaluate T2.2. In the 1-N.E.D evaluation protocol, all characters (Latin and Chinese) are treated in a consistent manner.

Note: To avoid ambiguities in the annotations, we perform certain preprocessing steps before evaluation: 1) English letters are treated as case-insensitive; 2) traditional and simplified Chinese characters are treated as the same label; 3) blank spaces and symbols are removed; 4) illegible images do not contribute to the evaluation result.
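
The sketch below applies the parts of this preprocessing that do not need external resources (case folding, removal of blanks and symbols); mapping traditional to simplified Chinese characters requires a conversion table or a library such as OpenCC, which is omitted here.

    import re

    # Blank spaces plus the symbol set listed for Task 2.1.
    _SYMBOLS = re.compile(r"""[\s!?.:,*"()·\[\]/'_]+""")

    def preprocess_transcription(text):
        return _SYMBOLS.sub("", text).lower()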

Task 3: Scene Text Spotting

The main objective of this task is to detect and recognize every text instance in the provided image in an end-to-end manner. Similar to RRC 2017, a generic vocabulary list (90k common English words) will be provided as a reference for this challenge. Identical to T2, we break T3 down into two subcategories:

  1. T3.1 - Latin script only text spotting,
  2. T3.2 - Latin and Chinese scripts text spotting.

  • Input: Scene text image
  • Output: Spatial location of every text instance at word-level for Latin scripts and line-level for Chinese scripts, together with the predicted word for each detection.

Results format

Lastly, participants are required to submit the results for all the images in a single JSON file with the following format:

{
    "res_1": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c, "transcription": "trans1"},
        …
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c, "transcription": "trans2"}
    ],
    "res_2": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c, "transcription": "trans3"}
    ],
    ……
}

where the key of the JSON file should adhere to the format res_[image_id].

Note: Participants are required to make a single submission only, regardless of script. We will evaluate all submissions under two categories: Latin and mixed (Latin and Chinese) scripts. When evaluating recognition performance for the Latin script, all non-Latin transcriptions will be treated as "Do Not Care" regions.

Download submission example here: ArT-end-to-end-result-example.zip

Evaluation metrics

For T3, we first evaluate the detection results by calculating their Intersection over Union (IoU) with the corresponding ground truths. Detection regions with an IoU value higher than 0.5 are matched with the recognition ground truth (i.e. the ground-truth transcript of that text region). In the case of multiple matches, we only consider the detection region with the highest IoU; the remaining matches are counted as False Positives. We then evaluate the predicted transcriptions for T3.1 with both case-insensitive word accuracy H-mean and 1-N.E.D (with 1-N.E.D as the official ranking metric), while Chinese text regions are ignored in this evaluation. Similar to T2.2, we will publish the results of T3.2 under both metrics (1-N.E.D and case-insensitive word accuracy), but the official ranking is based on 1-N.E.D.

Note: The preprocessing steps for the recognition part are the same as in Task 2.
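
For illustration, a sketch of the word-accuracy part of this protocol is given below; it assumes the IoU matching (as in the Task 1 sketch) has already produced one-to-one pairs of predicted and ground-truth transcriptions, and it is not the official evaluation code.

    def end_to_end_word_hmean(matches, num_detections, num_gt):
        """matches: (predicted_text, gt_text) pairs for detections matched to a
        non-ignored ground truth at IoU >= 0.5 (highest-IoU, one-to-one)."""
        correct = sum(1 for pred, gt in matches if pred.lower() == gt.lower())
        precision = correct / num_detections if num_detections else 0.0
        recall = correct / num_gt if num_gt else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)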

 

References

  1. Karatzas, Dimosthenis, et al. "ICDAR 2013 robust reading competition." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.
  2. Karatzas, Dimosthenis, et al. "ICDAR 2015 competition on robust reading." Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015.
  3. Gomez, Raul, et al. "ICDAR2017 robust reading challenge on COCO-Text." 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017.
  4. Liu, Yuliang, Jin, Lianwen, et al. "Curved Scene Text Detection via Transverse and Longitudinal Sequence Connection." Pattern Recognition, 2019.
  5. Nayef, Nibal, et al. "ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT." Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. Vol. 1. IEEE, 2017.
  6. Shi, Baoguang, et al. "ICDAR2017 competition on reading Chinese text in the wild (RCTW-17)." Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. Vol. 1. IEEE, 2017.

Important Dates

1st January to 1st March

i) Q&A period for the competition,

ii) Launch of the initial website.

15th Feb to 1st March

i) Competition formal announcement,

ii) Publicity,

iii) Sample training images available,

iv) Evaluation protocol, file formats etc. available.

25th February

i) Evaluation tools ready,

ii) Full website ready.

1st March

i) Competition kicks off officially,

ii) Release of training set images and ground truth.

9th April

Release of the first part of the test set images (2271 images).

20th April

i) Release of the second part of test set images (2292 images).

ii) Website opens for results submission

30th April

i) Deadline of the competition; result submission closes (at 23:59 PDT),

ii) Release of the evaluation results.

5th May

i) Submission deadline for the 1-page competition report; the final ranking will be released after results checking.

20th to 25th September

i) Announcement of competition results at ICDAR2019.