Tasks - ICDAR2017 Robust Reading Challenge on COCO-Text

The challenge is set up around three tasks:

  • Text Localization, where the objective is to obtain a rough estimate of the text areas in the image, in terms of bounding boxes that correspond to words.

  • Cropped Word Recognition, where the locations (bounding boxes) of words in the image are assumed to be known and the corresponding text transcriptions are sought.

  • End-to-End Recognition, where the objective is to localise and recognise all words in the image in a single step.

COCO-Text contains 63,686 images in total: 43,686 of the images will be used for training, 10,000 for validation, and 10,000 for testing. Unlike many other scene text datasets, some images in COCO-Text do not contain any text at all, since the images were not collected with text in mind.

The annotations of COCO-Text that we provide for the challenge include (a) bounding boxes of text regions, (b) transcriptions of legible text, and (c) attributes covering legibility (‘legible’ / ‘illegible’), language (‘English’ / ‘non-English’ / ‘N/A’), and class (‘machine printed’ / ‘handwritten’ / ‘others’). The annotations for all training and validation images will be stored in a single JSON file. See the Downloads page for details.
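As a concrete illustration, the attributes above could be consumed as in the sketch below. Note that the key names used here ("anns", "bbox", "utf8_string") are illustrative assumptions only; the actual schema is defined by the JSON file on the Downloads page.

```python
import json

# A toy annotation record in the spirit of the scheme described above.
# The key names are assumptions, not the official schema.
sample = json.loads("""
{
  "anns": {
    "1045": {"bbox": [102.5, 58.0, 40.0, 18.5],
             "legibility": "legible",
             "language": "English",
             "class": "machine printed",
             "utf8_string": "STOP"}
  }
}
""")

# Collect the transcriptions of all legible words.
legible_words = [ann["utf8_string"]
                 for ann in sample["anns"].values()
                 if ann["legibility"] == "legible"]
```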

For legible text, bounding boxes are annotated on every word. Each box is a rectangle that encloses an uninterrupted sequence of characters separated by blank spaces. For illegible text, we annotate one bounding box per continuous text region, e.g. a sheet of paper. All bounding boxes are represented by four parameters (x,y,w,h).

All images are provided as JPEG files, and the text files are UTF-8 encoded with CR/LF line endings.






Task 1: Text Localization

The aim of this task is to accurately localise text with word-level bounding boxes. Participants will be asked to run their systems to localise every word in every test image.

Results Format

Localization results will be saved in one UTF-8 encoded text file per test image. Participants will be asked to submit a single zip file that contains all the text files to our evaluation server. Result files will be named after test image IDs (e.g. res_24577.txt), and will have the following format (with CR/LF new line endings):

x1,y1,w1,h1,confidence1
x2,y2,w2,h2,confidence2

Each line describes one detected word: its bounding box (x,y,w,h) followed by the confidence score.

Different from the previous robust reading challenges, participants will be asked to include a confidence score for every bounding box. As we will specify later, the scores allow us to calculate an average precision (AP) score as well as to plot a precision-recall curve from the results.
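For illustration, a submission archive could be assembled as in the sketch below. The detections and the per-line field order (box parameters followed by the confidence score) are assumptions here; the downloadable submission sample is authoritative.

```python
import zipfile

# Hypothetical detections for one test image: (x, y, w, h, confidence),
# with the confidence score required above.  The field order is an
# assumption; check the official submission sample.
detections = {24577: [(120, 45, 80, 30, 0.93),
                      (300, 210, 55, 22, 0.41)]}

with zipfile.ZipFile("submission.zip", "w") as zf:
    for image_id, boxes in detections.items():
        lines = ["%d,%d,%d,%d,%.4f" % box for box in boxes]
        # result files use CR/LF line endings
        zf.writestr("res_%d.txt" % image_id, "\r\n".join(lines) + "\r\n")

# Read the archive back to check the first result line.
with zipfile.ZipFile("submission.zip") as zf:
    first_line = zf.read("res_24577.txt").decode().splitlines()[0]
```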

Download Task 1 submission sample.

Evaluation Metrics

Following the standard practice in object detection [1, 2], we will calculate the average precision (AP) for each submission. The metric will be calculated at IoU = 0.5 and IoU = 0.75, respectively. Illegible or non-English text will be treated as “don’t care” objects. The AP at IoU = 0.5 will be taken as the primary challenge metric for ranking the submissions. This metric is equivalent to the “mean AP” (mAP) metric adopted by PASCAL VOC, since we only have one category.
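The IoU criterion can be computed directly from the (x,y,w,h) box parameters described earlier; a minimal sketch:

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # overlap extents along each axis (zero if the boxes are disjoint)
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

For example, two 10×10 boxes offset horizontally by 5 pixels overlap in a 5×10 region, giving an IoU of 50 / 150 = 1/3, below the 0.5 threshold.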

Compared to Hmean, the metric adopted by the previous ICDAR Robust Reading competitions, AP takes into account the trade-off between precision and recall, thus enabling a more comprehensive view of the performance. AP also exempts the participants from manually adjusting thresholds on their results. Further, we will plot a precision-recall curve for each submission for a detailed comparison.
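In outline, AP is computed by sorting all detections by confidence and accumulating precision over each recall increment. The sketch below uses a simple non-interpolated variant; the organisers' exact implementation may differ.

```python
def average_precision(detections, num_gt):
    """detections: (confidence, is_true_positive) pairs for every
    detection in the test set; num_gt: number of groundtruth boxes.
    Returns the area under the precision-recall curve."""
    tp = fp = 0
    ap = prev_recall = 0.0
    # walk the detections from most to least confident
    for _, is_tp in sorted(detections, key=lambda d: -d[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # accumulate precision over each recall increment
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```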





Task 2: Cropped Word Recognition

The aim of this task is to recognise cropped word images as character sequences. The cropped images are the annotated word boxes padded by 2 pixels on all sides. Only legible English words longer than 3 characters will be considered in this task. The annotations contain symbols, which will be considered in the evaluation.

The recognition will be unconstrained, meaning that there will be no lexicon constraining the output words. We provide a generic dictionary which participants can choose to use. The dictionary, however, may not contain all groundtruth words and numbers.

Results Format

Recognition results on all test images will be saved into a single UTF-8 encoded text file, in the following format:

word_image_id,transcription
word_image_id,transcription

Each line gives the identifier of one cropped word image followed by its transcription.

Anything that follows the comma until the end of the line (CR/LF) will be considered the transcription, and no escape characters will be used.
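Because of this rule, a consumer of these result files should split each line at the first comma only, since the transcription itself may contain further commas. A sketch (the identifier "word_101" below is hypothetical):

```python
def parse_result_line(line):
    """Split a Task 2 result line at the FIRST comma only: everything
    after it, commas included, belongs to the transcription."""
    image_id, transcription = line.rstrip("\r\n").split(",", 1)
    return image_id, transcription
```

For example, the line `word_101,Hello, world` yields the identifier `word_101` and the transcription `Hello, world`.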

Download Task 2 submission sample.

Evaluation Metrics

Metrics for this task include word accuracy (both case sensitive and insensitive) and mean edit distance (both case sensitive and insensitive). Case-insensitive word accuracy will be taken as the primary challenge metric.
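Both metrics can be sketched as below, using the standard Levenshtein distance for the edit-distance component; whether the official evaluation normalises the distance per word is not specified here, so this is only an illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed by
    dynamic programming over one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def word_accuracy(pairs, case_sensitive=False):
    """pairs: (prediction, groundtruth) transcription pairs.
    Returns the fraction of exact matches."""
    if not case_sensitive:
        pairs = [(p.lower(), g.lower()) for p, g in pairs]
    return sum(p == g for p, g in pairs) / len(pairs)
```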





Task 3: End-to-End Recognition

The aim of this task is to both localise and recognise words in images. Only legible English words longer than 3 characters will be considered; the rest are treated as “don’t care” objects. The annotations contain symbols. In the evaluation we will consider symbols in the middle of words, but remove the symbols ( !?.:,*"()·[]/'_ ) at the beginning and at the end of both the groundtruth and the submissions. If there is more than one of those symbols at the beginning or end of a word, all of them will be removed. The evaluation will be case-insensitive.
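Python's `str.strip` models this rule directly, since it removes every leading and trailing character found in a given set, however many there are, while leaving interior symbols untouched; a minimal sketch:

```python
# Symbols stripped from word boundaries, as listed above; symbols in the
# middle of a word are kept.
SYMBOLS = "!?.:,*\"()·[]/'_"

def normalise(word):
    """Strip boundary symbols (all of them, if repeated) and lower-case
    the result for the case-insensitive comparison."""
    return word.strip(SYMBOLS).lower()
```

For example, `"Hello!!"` (including the quotes) normalises to `hello`, while the interior apostrophe in `it's` is kept.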

Results Format

End-to-end results will be saved in a format similar to that of Task 1, as shown below. Result files will be named after test image IDs (e.g. res_24577.txt).


x1,y1,w1,h1,confidence1,transcription1
x2,y2,w2,h2,confidence2,transcription2

Each line describes one detected word: its bounding box (x,y,w,h), the confidence score, and the recognised transcription.

Anything that follows the last comma until the end of the line (CR/LF) will be considered the transcription, and no escape characters will be used.

Download Task 3 submission sample.

Evaluation Metrics

Average Precision (AP) calculated at IoU = 0.5 will be taken as the primary challenge metric. The metrics are calculated in the same way as in Task 1, except that the recognition results will also be taken into consideration. A detection will be considered a true positive if its bounding box sufficiently overlaps with the matching groundtruth box and its recognition matches the groundtruth word.
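Putting the pieces together, a detection could be checked against a groundtruth entry as in the self-contained sketch below. Reading "sufficiently overlaps" as IoU ≥ 0.5 and combining it with the symbol-stripped, case-insensitive word comparison described for this task are the assumptions here.

```python
# Boundary symbols stripped before comparison, per the task description.
SYMBOLS = "!?.:,*\"()·[]/'_"

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(det_box, det_word, gt_box, gt_word):
    """True when the boxes overlap with IoU >= 0.5 (an assumed
    threshold) and the symbol-stripped transcriptions match
    case-insensitively."""
    words_match = (det_word.strip(SYMBOLS).lower()
                   == gt_word.strip(SYMBOLS).lower())
    return iou(det_box, gt_box) >= 0.5 and words_match
```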




[1]    M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman. “The PASCAL Visual Object Classes (VOC) Challenge.” International Journal of Computer Vision, 88(2), 303–338 (2010).

[2]    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick. “Microsoft COCO: Common Objects in Context.” ECCV (2014).


Important Dates

March 13: COCO-Text available (train/val/test).

March 19: Cropped words dataset available (train/val).

March 23: Annotations updated (v1.4).

March 30: Cropped words dataset updated (v1.4).

May 23: Submissions open.

June 30: Deadline for submission of results.

September 28: Results publication.

November 10-15: Results presentation.