Tasks - ICDAR2019 Robust Reading Challenge on Large-scale Street View Text with Partial Labeling

The LSVT dataset will include 450,000 images with text freely captured in the streets, e.g., store fronts and landmarks. 50,000 of them are fully annotated and split into i) a training set of 30,000 images and ii) a testing set of 20,000 images. The remaining 400,000 images are weakly annotated and form an additional part of the training set.

To evaluate text reading performance from various aspects, we introduce two common tasks on this large-scale street view dataset, i.e., text detection and end-to-end text spotting.

  • Text detection, where the objective is to localize text from street view images at the level of text lines, which is similar to all the previous RRC scene text detection tasks.
  • End-to-end text spotting, where the objective is to localize and recognize all the text lines in the image in an end-to-end manner.

Note

Participants are free to use publicly available datasets (e.g., ICDAR2015, RCTW-17, MSRA-TD500, COCO-Text, and MLT) or synthetic images as extra training data for this competition, while private data that is not publicly accessible is not permitted.

Ground Truth Format

For all images with full annotations in the dataset, we store the ground truths in a single JSON file in a structured format, following the naming convention:

    gt_[image_id], where image_id refers to the index of the image in the dataset.

In the JSON file, each gt_[image_id] corresponds to a list, where each entry in the list corresponds to one text instance in the image and gives its bounding box coordinates, transcription, and illegibility flag in the following format:

{
    "gt_1": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans1", "illegibility": false},
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans2", "illegibility": false}],
    "gt_2": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "transcription": "trans3", "illegibility": false}],
    ……
}

where x1, y1, x2, y2, …, xn, yn in "points" are the vertex coordinates of the polygon bounding box, which may consist of 4, 8, or 12 vertices. The "transcription" field denotes the text of each text line, and "illegibility" marks a "Do Not Care" text region when set to true; such regions do not influence the evaluation results.

 

Similar to the fully annotated ground truths, for images with weak annotations in the dataset, we store all the ground truths in a single JSON file. In this JSON file, each gt_[image_id] corresponds to one word, which we refer to as the "text-of-interest" of the image:

{
    "gt_0": [{"transcription": "trans1"}],
    "gt_1": [{"transcription": "trans2"}],
    "gt_2": [{"transcription": "trans3"}],
    ……
}

Download ground truth example here: LSVT-gt-example
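For reference, the sketch below shows one way to load and iterate over such a ground-truth file in Python. It is a minimal example under stated assumptions, not part of the official toolkit; the file name train_full_labels.json is an assumption and should be replaced with the actual file from the download above.

    import json

    # Hypothetical file name; substitute the ground-truth file you downloaded.
    GT_PATH = "train_full_labels.json"

    with open(GT_PATH, encoding="utf-8") as f:
        gt = json.load(f)  # {"gt_1": [...], "gt_2": [...], ...}

    for image_key, instances in gt.items():
        for inst in instances:
            transcription = inst["transcription"]
            illegible = inst.get("illegibility", False)  # weak annotations omit this flag
            points = inst.get("points")                  # weak annotations omit the polygon
            if illegible:
                continue  # "Do Not Care" regions are ignored during evaluation
            # points (when present) is a list of [x, y] vertices, e.g. 4, 8, or 12 of them
            print(image_key, transcription, points)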

Task 1: Text detection

This task evaluates text detection performance; participants' methods are expected to localize text in street view images at the level of text lines.

Input: Full street view images

Output: Locations of text lines in quadrangles or polygons for all the text instances.

Results Format

The naming of all submitted results should follow the format res_[image_id]. For example, the result entry corresponding to the input image "gt_1.jpg" should be keyed "res_1". Participants are required to submit the detection results for all the images in a single JSON file. The submission file format is as follows:

{
    "res_1": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c},
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c}],
    "res_2": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c}],
    ……
}

where n is the total number of vertices (not fixed; it may vary across predicted text instances) and c is the confidence score of the prediction.

Download submission example here: LSVT-detection-example
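As an illustration, the following minimal Python sketch assembles predictions into the required single-file format. The in-memory structure of predictions and the output file name submit.json are assumptions made for the example; only the keys res_[image_id], "points", and "confidence" come from the format above.

    import json

    # Assumed structure: image_id -> list of (polygon, confidence), where polygon
    # is a list of [x, y] vertices (4, 8, 12, ... points per text instance).
    predictions = {
        1: [([[10, 20], [110, 20], [110, 60], [10, 60]], 0.93)],
        2: [([[5, 5], [80, 8], [78, 40], [3, 37]], 0.87)],
    }

    submission = {}
    for image_id, dets in predictions.items():
        submission["res_%d" % image_id] = [
            {"points": polygon, "confidence": float(conf)}
            for polygon, conf in dets
        ]

    with open("submit.json", "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False)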

Evaluation Metrics

Following the evaluation protocols of the ICDAR 2015 [1] and ICDAR2017-RCTW [2] datasets, the detection task of LSVT (T1) is evaluated in terms of Precision, Recall, and F-score at IoU (Intersection-over-Union) thresholds of 0.5 and 0.7; only the H-Mean at the 0.5 threshold is used as the primary metric for the final ranking. In the case of multiple matches, only the detection region with the highest IoU is considered; the remaining matches are counted as False Positives. Precision, Recall, and F-score are calculated as follows:

    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
    F = 2 × Precision × Recall / (Precision + Recall)

 

where TP, FP, FN, and F denote the number of true positives, false positives, false negatives, and the H-Mean, respectively.

All detected or missed "Do Not Care" ground truths will not contribute to the evaluation result. Similar to COCO-Text [3] and ICDAR 2015 [1], illegible text instances and symbols are labeled as "Do Not Care" regions.
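To make the matching rule concrete, here is a simplified sketch of the counting logic described above. It assumes axis-aligned boxes (x1, y1, x2, y2) for the IoU computation; the official evaluation operates on polygons and additionally excludes "Do Not Care" regions before counting.

    def iou(box_a, box_b):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def detection_scores(gt_boxes, det_boxes, iou_thresh=0.5):
        """Greedy matching: each GT keeps at most its highest-IoU detection;
        every other detection counts as a False Positive."""
        matched = set()
        tp = 0
        for gt in gt_boxes:
            best_iou, best_j = 0.0, None
            for j, det in enumerate(det_boxes):
                if j in matched:
                    continue
                score = iou(gt, det)
                if score > best_iou:
                    best_iou, best_j = score, j
            if best_j is not None and best_iou >= iou_thresh:
                matched.add(best_j)
                tp += 1
        fp = len(det_boxes) - tp
        fn = len(gt_boxes) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f_score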

Task 2: End-to-end text spotting

The main objective of this task is to detect and recognize every text instance in the provided image in an end-to-end manner.

Input: Full street view images

Output: Locations of text lines in quadrangles or polygons and the corresponding recognized results for all the text instances in the image.

Results Format

The participants are required to submit the predicted detection and recognition results for all the images in a single JSON file:

{
    "res_1": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c, "transcription": "trans1"},
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c, "transcription": "trans2"}],
    "res_2": [
        {"points": [[x1, y1], [x2, y2], …, [xn, yn]], "confidence": c, "transcription": "trans3"}],
    ……
}

where n is the total number of vertices (not fixed; it may vary across predicted text instances), c is the confidence score of the prediction, and "transcription" denotes the recognized text of each text line.

 

All participants in the competition are expected to submit their results on the test set, which will be released a few weeks before the final submission deadline.

Download submission example here: LSVT-end-to-end-example
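The submission can be assembled in the same way as for Task 1; the only difference is the added "transcription" field. A minimal sketch (the predictions structure and output file name are again assumptions for illustration):

    import json

    # Assumed structure: image_id -> list of (polygon, confidence, transcription).
    predictions = {
        1: [([[10, 20], [110, 20], [110, 60], [10, 60]], 0.93, "trans1")],
    }

    submission = {
        "res_%d" % image_id: [
            {"points": poly, "confidence": float(conf), "transcription": text}
            for poly, conf, text in dets
        ]
        for image_id, dets in predictions.items()
    }

    with open("submit_e2e.json", "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False)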

Evaluation Metrics

To compare results on the end-to-end text spotting task (T2) more comprehensively, the submitted models will be evaluated from several aspects: i) the normalized metric [2] in terms of Normalized Edit Distance (specifically, 1-N.E.D.), and ii) Precision, Recall, and F-score. Only 1-N.E.D. will be treated as the official ranking metric, although results for both metrics will be published.

Under the exact-match criterion of the F-score, a detected text line is counted as a true positive only if the Levenshtein distance between its predicted transcription and that of the matched ground truth (IoU higher than 0.5) equals 0.

For the normalized metric, we first evaluate the detection result by calculating its Intersection over Union (IoU) with the corresponding ground truth. Detection regions with an IoU higher than 0.5 are matched with the recognition ground truth (i.e., the transcription of that text region). In the case of multiple matches, only the detection region with the highest IoU is considered; the remaining matches are counted as False Positives. We then evaluate the predicted transcription with the Normalized Edit Distance (N.E.D.), which is formulated as:

    N.E.D. = (1/N) × Σ_{i=1..N} D(s_i, ŝ_i) / max(len(s_i), len(ŝ_i))

where D(·) stands for the Levenshtein distance, and s_i and ŝ_i denote the predicted text line (as a string) and the corresponding ground truth of the region, respectively. The corresponding ground truth ŝ_i is chosen from all ground-truth locations as the one with the maximum IoU against the predicted s_i, forming a pair. N is the total number of "paired" GT and detected regions, which includes singletons: GT regions that were not matched with any detection (paired with an empty string) and detections that were not matched with any GT region (paired with an empty string).

Note: To avoid ambiguity in the annotations, we perform the following preprocessing before evaluation: 1) English letters are not case sensitive; 2) traditional and simplified Chinese characters are treated as the same label; 3) blank spaces and symbols are removed; 4) illegible text regions do not contribute to the evaluation result.
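For clarity, below is a minimal sketch of the 1-N.E.D. computation over already-paired prediction/ground-truth strings. The pairing by maximum IoU is assumed to have been done beforehand (with unmatched regions paired against the empty string), and only rules 1 and 3 of the note above are sketched in normalize; the function names are illustrative, not part of the official evaluation tools.

    def levenshtein(a, b):
        """Levenshtein (edit) distance between strings a and b."""
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def normalize(text):
        """Partial sketch of the preprocessing note: case-insensitive, spaces removed."""
        return "".join(text.lower().split())

    def one_minus_ned(pairs):
        """1 - N.E.D. over (predicted, ground_truth) string pairs, including
        singletons paired with the empty string."""
        if not pairs:
            return 1.0
        total = 0.0
        for pred, gt in pairs:
            pred, gt = normalize(pred), normalize(gt)
            denom = max(len(pred), len(gt))
            total += levenshtein(pred, gt) / denom if denom else 0.0
        return 1.0 - total / len(pairs)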

References

[1] Karatzas, Dimosthenis, et al. "ICDAR 2015 competition on robust reading." 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015.

[2] Shi, Baoguang, et al. "ICDAR2017 competition on reading Chinese text in the wild (RCTW-17)." 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017.

[3] Gomez, Raul, et al. "ICDAR2017 robust reading challenge on COCO-Text." 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017.

Important Dates

1st January to 1st March

i) Q&A period for the competition,

ii) The launching of initial website

15th Feb to 1st March

i) Competition formal announcement,

ii) Publicity,

iii) Sample training images available,

iv) Evaluation protocol, file formats etc. available.

25th February

i) Evaluation tools ready,

ii) Full website ready.

1st March

i) Competition kicks off officially,

ii) Release of training set images and ground truth.

9th April

Release of the first part of test set images (10,000 images).

20th April

i) Release of the second part of test set images (10,000 images).

ii) Website opens for results submission.

30th April

i) Deadline of the competition; result submission closes (23:59 PDT),

ii) Release of the evaluation results.

5th May

i) Submission deadline for the one-page competition report; the final ranking will be released after results checking.

20th to 25th September

i) Announcement of competition results at ICDAR2019.