Tasks - ICDAR2017 Competition on Multi-lingual scene text detection and script identification

You can take part in the MLT challenge by participating in one or more of its tasks.

After the task descriptions, you will find notes about training and evaluation for the different tasks.

Task-1: Multi-script text detection

This task can be viewed as a generalization of the text detection task of the previous robust reading competitions, where a participating method should be able to detect text of different scripts. Detection is performed at the word level.

Ground Truth Format

NOTE: the GT provides more information than is needed for this task, because it is shared with Task 3 as well. Please check the results format below.

The ground truth is provided in terms of word bounding boxes. Bounding boxes are NOT axis-oriented; they are specified by the coordinates of their four corners in clockwise order. For each image in the training set, a corresponding UTF-8 encoded text file is provided, following the naming convention:

gt_[image name].txt

The text files are comma-separated, with each line corresponding to one text block in the image and giving its bounding box coordinates (four corners, clockwise), its script and its transcription in the format:

x1,y1,x2,y2,x3,y3,x4,y4,script,transcription

Valid scripts are: "Arabic", "Latin", "Chinese", "Japanese", "Korean", "Bangla", "Symbols", "Mixed", "None"

Note that the transcription is anything that follows the 9th comma until the end of line. No escape characters are to be used.

If the transcription is given as "###", the text block (word) is considered a "don't care" region. Some "don't care" words have a script class that corresponds to a language, and the others have the "None" script class. The latter case is used when the word's script cannot be identified due to low resolution or other distortions.
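
A minimal parsing sketch in Python (not official competition code; the helper name parse_gt_line and the file name are ours, and integer pixel coordinates are assumed). Splitting at most 9 times keeps any commas inside the transcription intact:

def parse_gt_line(line):
    # x1,y1,x2,y2,x3,y3,x4,y4,script,transcription -- the transcription is
    # everything after the 9th comma, so split at most 9 times.
    parts = line.rstrip("\n").split(",", 9)
    coords = [int(v) for v in parts[:8]]      # four corners, clockwise
    script = parts[8]                         # e.g. "Latin", "Arabic", ...
    transcription = parts[9]                  # may itself contain commas
    dont_care = (transcription == "###")      # ignored in the evaluation
    return coords, script, transcription, dont_care

with open("gt_img_1.txt", encoding="utf-8") as f:   # hypothetical file name
    entries = [parse_gt_line(line) for line in f]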

Results Format

Localisation results are expected in a format similar to that of the ground truth. One UTF-8 encoded text file per test image is expected. Participants will be asked to submit all results in a single zip file. Result files should be named after the test image IDs, following the naming convention:

res_[image name].txt 

(e.g. res_1245.txt). Each line should correspond to one word in the image and provide its bounding box coordinates (four corners, clockwise) and a confidence score in the format:

x1,y1,x2,y2,x3,y3,x4,y4,confidence

A confidence score must be included for every bounding box, and the four corner points must be listed in clockwise order.
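
As a hedged illustration, a result file in this format could be written as follows (the detection values and image name are made up):

# one illustrative detection: four clockwise corners plus a confidence
detections = [(10, 20, 110, 22, 108, 60, 8, 58, 0.93)]

image_name = "1245"   # test image ID
with open("res_" + image_name + ".txt", "w", encoding="utf-8") as f:
    for det in detections:
        f.write(",".join(str(v) for v in det) + "\n")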

Task-2: Cropped Word Script identification

The text in our dataset images appears in 9 different languages, some of which share the same script. Additionally, punctuation and some mathematical symbols sometimes appear as separate words; such words are assigned a special script class called "Symbols". Hence, we have a total of 7 different scripts. Words of "Mixed" script are excluded from this task, as are all "don't care" words, whether or not their script has been identified.

Ground Truth Format

For the word script identification task, we provide all the words in our dataset as separate image files, along with the corresponding ground-truth script and transcription. The transcription is not used in this task and can be ignored. For each text block, the axis-oriented area that tightly contains the text block is provided.

The script and transcription of all words are provided in a SINGLE UTF-8 text file for the whole collection. Each line in the ground truth file has the following format:

[word image name],script,transcription

Note that the transcription is anything that follows the 2nd comma until the end of line. No escape characters are to be used. Valid scripts are "Arabic", "Latin", "Chinese", "Japanese", "Korean", "Bangla", "Symbols".

In addition, we provide information about the original image from which each word image has been extracted: the relative coordinates of the (non-axis-oriented) bounding box that defines the text block within the cut-out word image are given in a separate SINGLE text file for the whole collection. The coordinates are expressed relative to the cut-out box, as the four corners of the bounding box in clockwise order. Each line in this ground truth file has the following format:

[word image name],x1,y1,x2,y2,x3,y3,x4,y4,[original image name]
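
A minimal Python sketch for reading both collection-wide files (the file names gt_script.txt and gt_coords.txt, and the helper names, are hypothetical):

def load_scripts(path="gt_script.txt"):
    # [word image name],script,transcription -- split at most twice so
    # that commas inside the transcription are preserved.
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, script, transcription = line.rstrip("\n").split(",", 2)
            entries[name] = (script, transcription)
    return entries

def load_coords(path="gt_coords.txt"):
    # [word image name],x1,y1,...,x4,y4,[original image name]
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(",")
            entries[parts[0]] = ([int(v) for v in parts[1:9]], parts[9])
    return entries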

Results Format

For testing, we will provide the cropped images of text blocks and ask for the script of each image. A single script name per image is requested. Participants should return all scripts in a single UTF-8 encoded text file, one line per word image, using the following format:

[word image name],script
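
Writing this submission file is straightforward; a sketch with made-up predictions and an assumed output file name:

predictions = {"word_1.png": "Latin", "word_2.png": "Arabic"}  # illustrative

with open("script_results.txt", "w", encoding="utf-8") as f:   # hypothetical name
    for name, script in predictions.items():
        f.write(name + "," + script + "\n")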

Task-3: Joint text detection and script identification

This task combines all the preparation steps needed for multi-lingual text recognition. A method should take a full input image, find the bounding boxes of all the words, and identify the script of each word.

Ground Truth Format

The ground truth is provided in the same format as in Task 1.

Results Format

Joint detection and script identification results should be provided in a single zip file. One text file per image, named after the test image ID, is expected, using the following naming convention:

res_[image name].txt 

Inside each text file, a list of detected bounding box coordinates (four corners, clockwise) should be provided, along with the detection confidence and the script class:

x1,y1,x2,y2,x3,y3,x4,y4,confidence,script
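
A minimal sketch that writes one such file per image and bundles everything into the single zip expected at submission time (the result values and archive name are illustrative):

import zipfile

# per-image results: four clockwise corners, confidence, script class
results = {"1245": [(10, 20, 110, 22, 108, 60, 8, 58, 0.93, "Latin")]}

with zipfile.ZipFile("task3_submission.zip", "w") as zf:   # hypothetical name
    for image_name, boxes in results.items():
        lines = [",".join(str(v) for v in box) for box in boxes]
        zf.writestr("res_" + image_name + ".txt", "\n".join(lines) + "\n")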

Training and Evaluation

Tasks 1 and 3: Should we detect the "don't care" boxes (transcribed as "###"), and how will this be evaluated?
Answer: "don't care" boxes do not count in the evaluation. This means detecting or missing the don't care boxes will not affect your final score. 

Task 2: The training and test sets of this task are word images extracted from the full images of Task 1. We extracted all words except "don't care" words and words of "Mixed" script.
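
For reference, these word crops correspond to the tight axis-oriented rectangle around each word's four corners, as described in the Task 2 ground truth format; a minimal sketch of such a crop (using Pillow; the function name is ours):

from PIL import Image

def crop_word(image_path, coords):
    # coords = [x1, y1, ..., x4, y4]; crop the tight axis-aligned box
    # around the (possibly rotated) four corners.
    xs, ys = coords[0::2], coords[1::2]
    with Image.open(image_path) as im:
        return im.crop((min(xs), min(ys), max(xs) + 1, max(ys) + 1))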

Question: can we use the validation set for training?
Answer: Yes. You can use the full available dataset (training + validation) for training. The training/validation split is only there to help you evaluate your methods during development; eventually, the full set of 9000 images should be considered the training set.

Important Dates

1 Feb to 31 Mar

  • Manifestation of interest by participants opens
  • Asking/Answering questions about the details of the competition
  • Initial website available

1 Mar

  • Competition formal announcement

31 Mar

  • Website fully ready
  • Registration of participants continues
  • Evaluation protocol, file formats etc. available

1 Apr to 31 May

  • Train set available; training period; MLT challenge in progress
  • Participants evaluate their methods on the training/validation sets
  • Prepare for submission
  • Registration is still open

1 Jun

  • Registration closes for this MLT challenge for ICDAR-2017

1 Jun to 1 Jul

  • Test set available

1 Jul

  • Deadline for submission of results by participants

1 Nov

  • The public release of the full dataset