Tasks - ICDAR 2019 Robust Reading Challenge on Multi-lingual scene text detection and recognition

To participate in the RRC-MLT-2019 challenge, you have to take part in at least one task. The tasks are described below. The first three tasks are similar to those of RRC-MLT-2017, but they are re-opened for RRC-MLT-2019 with a new language added to the dataset and improved ground-truth quality for the whole dataset. We are also introducing a new fourth task on end-to-end text detection and recognition.

Task-1: Multi-script text detection

In this task, a participating method should be able to generalize to detecting text in different scripts. The input consists of scene images with embedded text in various languages, and detection is required at the word level.

Ground Truth (GT) Format

NOTE: the GT provided for this task contains more information than needed for this task, because this GT is shared with Tasks 3 and 4 as well. So, please make sure the results format generated by your method is as described in the "Results Format" paragraph.

The ground truth is provided in terms of word bounding boxes. The bounding boxes are NOT axis-oriented; they are specified by the coordinates of their four corners in clockwise order. For each image in the training set, a corresponding UTF-8 encoded text file is provided, following the naming convention:

gt_[image name].txt

The text files are comma-separated; each line corresponds to one text block in the image and gives its bounding-box coordinates (four corners, clockwise), its script and its transcription in the format:

x1,y1,x2,y2,x3,y3,x4,y4,script,transcription

Valid scripts are: "Arabic", "Latin", "Chinese", "Japanese", "Korean", "Bangla", "Hindi", "Symbols", "Mixed", "None"

Note that the transcription is anything that follows the 9th comma until the end of line. No escape characters are to be used.

If the transcription is provided as "###", the text block (word) is considered "don't care". Some of the "don't care" words have a script class that corresponds to a language, while others have the "None" script class; the latter is used when the word's script cannot be identified due to low resolution or other distortions.
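For illustration, here is a minimal Python sketch (not part of the official tools; the file name and function names are only examples) showing how a GT file can be parsed while keeping any commas inside the transcription intact:

def parse_gt_line(line):
    # Split on the first 9 commas only, so commas inside the transcription are preserved.
    parts = line.rstrip("\n").split(",", 9)
    corners = [(int(parts[i]), int(parts[i + 1])) for i in range(0, 8, 2)]  # assuming integer pixel coordinates
    script = parts[8]
    transcription = parts[9]
    return corners, script, transcription

with open("gt_img_1.txt", encoding="utf-8-sig") as f:  # UTF-8, tolerating a possible BOM
    for line in f:
        corners, script, transcription = parse_gt_line(line)
        is_dont_care = (transcription == "###")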

Results Format

Localisation (detection) results are expected as one UTF-8 encoded text file per test image. Participants are asked to submit all results in a single zip file. Result files should be named after the test image IDs, following the naming convention:

res_[image name].txt 

(e.g. res_1245.txt). Each line should correspond to one word in the image and provide its bounding box coordinates (four corners, clockwise) and a confidence score in the format:

x1,y1,x2,y2,x3,y3,x4,y4,confidence
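For illustration only, the following Python sketch (example names, not an official submission script) writes one result file per test image and packs them into the single zip file to be submitted:

import os
import zipfile

def write_result_file(out_dir, image_name, detections):
    # detections: list of (corners, confidence); corners = [(x1, y1), ..., (x4, y4)], clockwise
    path = os.path.join(out_dir, "res_{}.txt".format(image_name))
    with open(path, "w", encoding="utf-8") as f:
        for corners, confidence in detections:
            coords = ",".join(str(int(round(v))) for xy in corners for v in xy)
            f.write("{},{:.4f}\n".format(coords, confidence))

def pack_submission(out_dir, zip_path="task1_results.zip"):
    # Pack all res_*.txt files into the single zip file that is submitted.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(out_dir)):
            if name.startswith("res_") and name.endswith(".txt"):
                zf.write(os.path.join(out_dir, name), arcname=name)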

Evaluation

The f-measure (Hmean) is used as the metric for ranking the participating methods. The standard f-measure is based on both the recall and the precision of the detected word bounding boxes compared to the ground truth. A detection is considered correct (a true positive) if the detected bounding box has more than 50% overlap (intersection over union) with the GT box. The details of how the scores are computed are given in Section III-B of the MLT-2017 paper.
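For a rough self-check of this 50% IoU criterion, the sketch below computes the intersection over union of two quadrilaterals; it assumes the third-party shapely package and is not the official evaluation tool:

from shapely.geometry import Polygon

def is_match(det_corners, gt_corners, iou_threshold=0.5):
    # det_corners / gt_corners: four (x, y) corner tuples, clockwise
    det, gt = Polygon(det_corners), Polygon(gt_corners)
    if not det.is_valid or not gt.is_valid:  # e.g. self-intersecting quadrilateral
        return False
    union = det.union(gt).area
    return union > 0 and det.intersection(gt).area / union > iou_threshold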

Question: Tasks 1 and 3: Should we detect the "don't care" boxes (transcribed as "###"), and how will this be evaluated?
Answer: "don't care" boxes do not count in the evaluation, so detecting or missing them will not affect your final score.

Task-2: Cropped Word Script identification

The text in our dataset images appears in 10 different languages, some of which share the same script. Additionally, punctuation and some mathematical symbols sometimes appear as separate words; such words are assigned a special script class called "Symbols". Hence, we have a total of 8 different scripts. Words with the "Mixed" script have been excluded from this task, as have all "don't care" words, whether they have an identified script or not.

Ground Truth Format

For the word script identification task, we provide all the words (cropped words) in our dataset as separate image files, along with the corresponding ground-truth script and transcription. The transcription is not used in this task and can be ignored. For each text block, the axis-oriented area that tightly contains it is provided.

The script and transcription of all words are provided in a SINGLE UTF-8 encoded text file for the whole collection. Each line in the ground-truth file has the following format:

[word image name],script,transcription

Note that the transcription is anything that follows the 2nd comma until the end of line. No escape characters are to be used. Valid scripts are "Arabic", "Latin", "Chinese", "Japanese", "Korean", "Bangla", "Hindi", "Symbols".

In addition, we provide information about the original image from which each word image has been extracted: the relative coordinates of the (non-axis-oriented) bounding box that defines the text block within the cut-out word image are provided in a separate SINGLE text file for the whole collection. These coordinates are given with reference to the cut-out box, as the four corners of the bounding box in clockwise order. Each line in this ground-truth file has the following format:

[word image name], x1, y1, x2, y2, x3, y3, x4, y4,[original image name]
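For illustration, a minimal Python sketch (example names, not an official parser) for reading these two ground-truth files, assuming the word image names themselves contain no commas:

def parse_word_gt_line(line):
    # [word image name],script,transcription -- split on the first 2 commas only
    name, script, transcription = line.rstrip("\n").split(",", 2)
    return name, script, transcription

def parse_word_coords_line(line):
    # [word image name], x1, y1, x2, y2, x3, y3, x4, y4,[original image name]
    parts = [p.strip() for p in line.rstrip("\n").split(",")]
    corners = [(int(parts[i]), int(parts[i + 1])) for i in range(1, 9, 2)]
    return parts[0], corners, parts[9]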

Results Format

A participating method should output the script of each input image, where each input image is a cropped word image (a cut-out text block from a scene image). A single script name per image is requested. All output scripts should be listed in a single UTF-8 encoded text file, one line per word image, using the following format:

[word image name],script

Evaluation

Results are evaluated against the ground truth as follows: participants provide a script class for each word image, and each correct result increments the count of correct predictions. The final score for a given method is the accuracy of these predictions, which can be summarized by the simple definition below:

Let G = {g1, g2, ..., gi, ..., gm} be the set of correct script classes in the ground truth, and T = {t1, t2, ..., ti, ..., tm} be the set of script classes returned by a given method, where gi and ti refer to the same word image. The script identification of a word is counted as correct (1) if gi = ti and incorrect (0) otherwise; the sum of all m identifications divided by m gives the overall accuracy for this task.
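A minimal sketch of this accuracy computation, assuming two aligned lists of script labels:

def script_accuracy(gt_scripts, predicted_scripts):
    assert len(gt_scripts) == len(predicted_scripts)
    correct = sum(1 for g, t in zip(gt_scripts, predicted_scripts) if g == t)
    return correct / len(gt_scripts)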

Task-3: Joint text detection and script identification

This task combines all the preparation steps needed for multi-script text recognition. A participating method should take a full scene image as input, find the bounding boxes of all the words, and identify the script of each word.

Ground Truth Format

The ground truth is provided in the same format as in Task 1.

Results Format

Joint detection and script identification results should be provided in a single zip file, with one text file per test image. Each file should be named after the test image ID, using the following naming convention:

res_[image name].txt 

Inside each text file, the detected bounding-box coordinates (four corners, clockwise) should be listed, along with the detection confidence and the script class:

x1,y1,x2,y2,x3,y3,x4,y4,confidence,script

Evaluation

The evaluation of this task is a cascade of correct localization (detection) of a text box and correct script classification. If a word bounding box is correctly detected according to the evaluation criterion of Task 1, and the script of this correctly detected word is also identified correctly as in Task 2, then the joint detection and script identification of this word is counted as correct.
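A minimal sketch of this cascaded match count, assuming the detection/GT pairs and their IoU values have already been computed (for example as sketched for Task 1):

def count_task3_matches(matched_pairs, iou_threshold=0.5):
    # matched_pairs: list of (iou, gt_script, predicted_script) per detection/GT pair
    return sum(1 for iou, gt_script, predicted_script in matched_pairs
               if iou > iou_threshold and gt_script == predicted_script)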

Task-4: End-to-End text detection and recognition

This is a very challenging task: a unified OCR for multiple languages. The end-to-end scene text detection and recognition task in a multi-language setting is consistent with its English-only counterparts. Given an input scene image, the objective is to localize a set of bounding boxes and produce their corresponding transcriptions.

Ground Truth Format

The ground truth is provided in the same format as in Task 1.

Results Format

Joint detection and recognition results should be provided in a single zip file, with one text file per test image. Each file should be named after the test image ID, using the following naming convention:

res_[image name].txt 

Inside each text file, the detected bounding-box coordinates (four corners, clockwise) should be listed, along with the detection confidence and the transcription:

x1,y1,x2,y2,x3,y3,x4,y4,confidence,transcription

Evaluation

The evaluation of this task is a cascade of correct localization (detection) of a text box and correct recognition (word transcription). If a word bounding box is correctly detected according to the evaluation criterion of Task 1, and the transcription of this correctly detected word is also recognized correctly (according to an edit-distance measure), then the joint detection and recognition of this word is counted as correct.
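For reference, a plain Levenshtein edit distance is sketched below; how this distance is turned into an accept/reject decision is defined by the official evaluation tool, not by this sketch:

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]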

All words in the test set that contain characters which do not appear in the training set will be set to "don't care", so whether or not your method detects/recognizes them correctly will not affect the evaluation; they are simply not counted. This means you can train based on the lexicon of the training set.

Frequently Asked Questions:

Q: How is the ranking/evaluation done for Tasks 1, 3 and 4?

For Task 1:
The ranking is based on the f-measure (Hmean) [NOT average precision], calculated at the end from the overall recall and precision of the method, where:
methodRecall = number of matches / number of bounding boxes in the GT   (a match is a box that is detected correctly)
methodPrecision = number of matches / number of bounding boxes in the detection results
methodHmean = 2 * methodRecall * methodPrecision / (methodRecall + methodPrecision)

The recall, precision and f-measure are NOT calculated for each image individually; they are computed over the detected boxes in all the images (the boxes are, of course, matched/processed image by image). There was some confusion because the MLT-2017 paper contains a mistake in the description of the evaluation protocol (it states that the f-measure is computed per image and then averaged across images, which is not what was done).
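A minimal sketch of this dataset-level computation (not the official tool), assuming the per-image match counts have already been obtained as in Task 1:

def hmean_over_dataset(per_image_counts):
    # per_image_counts: list of (num_matches, num_gt_boxes, num_detected_boxes) per image
    matches = sum(m for m, g, d in per_image_counts)
    num_gt = sum(g for m, g, d in per_image_counts)
    num_det = sum(d for m, g, d in per_image_counts)
    recall = matches / num_gt if num_gt else 0.0
    precision = matches / num_det if num_det else 0.0
    return 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0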

For Task 3: The same ranking and evaluation apply, except for the definition of a "match": a match is counted when the box is both detected correctly and its script is identified correctly.

For Task 4: The same ranking and evaluation apply, but a match is counted when the box is both detected and recognized correctly. Extra information: 1) the recognition measure is edit distance; 2) test-set words which contain characters that did not appear in the training set will be set to "don't care" for both detection and recognition, so whether or not you detect or recognize them correctly, they are not counted in the evaluation.

Q: Multiple submissions (multiple results submissions for the same task):

1. You can have more than one final submission ONLY if the methods are different (at least one step of the method is different, not just parameter tuning). You can keep all your submissions online; just inform us which one is the final submission to be shown in the competition results.

2. Choose a different method name for each submission (for example: Method_V1, Method_V2, etc.); you can then describe the different submissions in the description field.

Q: May participants use other datasets (public, private or synthetic) for training?

A: You are free to use any dataset (private, public, synthetic, etc.) for training for any task; it is up to you to train your model the best way you can. For Task 4, we suggest the optional use of the synthetic dataset that was created to match the MLT dataset in terms of languages, since the training data is unbalanced across languages. Again, it is up to you how you use our training set and any complementary datasets for training purposes.

Q: Is the evaluation code available to participants?

A: The offline evaluation tool is currently not available to participants for the MLT challenges, because the ground truth of the MLT test set, which the offline tool requires, is not public. As MLT-2017 continues in a new edition this year (MLT-2019), we have decided not to release the test-set ground truth. You may either use the online version of the tools or run your own evaluation based on the information provided in the MLT-2017 paper and on this website.

Q: What is the process for submitting results, and what are the deadlines?

A: The process of results submission is as follows:
27th May 2019: deadline for submitting 1) participant information (names/teams and affiliations), 2) method descriptions for the task(s) in which you are participating, and 3) initial results* (see below).

3rd June 2019: submission of the final results (you can update the results you submit to us until 3rd June, for example after further tuning). Earlier submission is preferable; in that case, please notify us which one is your final submission.

* Initial results: same format and completeness as the final results. Only the final results will appear in the MLT-2019 challenge results, and the initial results may be removed after 3rd June.

This schedule is also shown in the right-hand column of this web page.

Important Dates

15 Feb to 2 May: Manifestation of interest by participants opens; asking/answering questions about the details of the competition

1 Mar: Competition formal announcement

15 Mar: Website fully ready; registration of participants continues; evaluation protocol, file formats, etc. available

15 Mar to 2 May: Training set available; training period; MLT challenge in progress; participants evaluate their methods on the training/validation sets and prepare for submission; registration is still open

2 May: Registration closes for this MLT challenge for ICDAR-2019

2 May to 3 June: Test set available

27 May: Deadline for submitting: 1) participant information (names and affiliations), 2) method descriptions, 3) initial (or final) results

3 June: Deadline for submission of the final results by participants

20 - 25 Sept: Announcement of results at ICDAR2019

1 Oct: Public release of the full dataset