Tasks - Focused Scene Text

The Challenge is set up around four tasks:

Text Localization, where the objective is to obtain a rough estimation of the text areas in the image, in terms of bounding boxes that correspond to parts of text (words or text lines).
Text Segmentation, where the objective is the pixel level separation of text from the background.
Word Recognition, where the locations (bounding boxes) of words in the image are assumed to be known and the corresponding text transcriptions are sought.
End-to-End, where the objective is to localise and recognise all words in the image in a single step.

For the 2015 edition, the focus is solely on Task 2.4, "End to End". The other tasks remain open for submissions but will not be included or analysed in the ICDAR 2015 report.

A training set of 229 images (containing 848 words) is provided through the downloads section. Different ground truth data is provided for each task.

All images are provided as PNG files and the text files are ASCII files with CR/LF new line endings.

Task 2.1: Text Localization

For the text localization task the ground truth data is provided in terms of word bounding boxes (see Figure 1). For each image in the training set a separate ASCII text file is provided, following the naming convention:

gt_[image name].txt

The text files are comma separated files, where each line corresponds to one word in the image and gives its bounding box coordinates and its transcription in the format:

left, top, right, bottom, "transcription"

Please note that the escape character (\) is used for double quotes and backslashes.
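As an illustration of the line format and the escaping rule, the following is a minimal Python sketch of a reader and writer for such lines. The names `WordBox`, `parse_gt_line` and `format_gt_line` are hypothetical and not part of any official toolkit, and the sketch assumes integer coordinates and that a backslash escapes only the character that immediately follows it.

```python
import re
from typing import NamedTuple


class WordBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int
    text: str


# Four integers, then a double-quoted transcription in which a backslash
# escapes the next character (covering \" and \\).
_LINE = re.compile(
    r'\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,'
    r'\s*"((?:[^"\\]|\\.)*)"\s*$'
)


def parse_gt_line(line: str) -> WordBox:
    """Parse one 'left, top, right, bottom, "transcription"' line."""
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised ground truth line: {line!r}")
    left, top, right, bottom = (int(g) for g in match.groups()[:4])
    # Undo the escaping: drop each escaping backslash, keep the next character.
    text = re.sub(r"\\(.)", r"\1", match.group(5))
    return WordBox(left, top, right, bottom, text)


def format_gt_line(box: WordBox) -> str:
    """Write a WordBox back in the same format, escaping \\ and \"."""
    escaped = box.text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{box.left}, {box.top}, {box.right}, {box.bottom}, "{escaped}"'
```

Because the transcription is matched as a quoted string rather than by splitting on commas, transcriptions that themselves contain commas are handled. Trailing CR/LF characters are ignored by the pattern.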

The authors will be required to automatically localise the text in the images and return bounding boxes. The results will have to be submitted in separate text files for each image, with each line corresponding to a bounding box (comma separated values) as per the above format. A single compressed (zip or rar) file should be submitted containing all the result files. In the case that your method fails to produce any results for an image, you can either include an empty result file or no file at all.

The evaluation of the results will be based on the algorithm of Wolf et al. [1], which in turn improves on the algorithms used in the robust reading competitions of previous ICDAR instalments.

Task 2.2: Text Segmentation

For the text segmentation task, the ground truth data is provided in the form of colour-coded BMP images and text files following the naming convention:

[image name]_GT.bmp

[image name]_GT.txt

Ground-Truth image files contain pixel-level assignment to individual characters. In the ground truth images, white pixels should be interpreted as background pixels and non-white pixels as text (see Figure 2). The non-white pixels are colour coded so that the pixels of each character have a unique colour; hence, by iterating through the different colours in the image, one can get individual character images. To get a binary mask containing all pixels belonging to text regions, simply ignore the colour labels (white corresponds to background while all other colours represent foreground pixels).
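The two uses of the colour coding described above can be sketched in a few lines of Python. This is only an illustration: it assumes the image has already been loaded (with any image library) into rows of (R, G, B) values, that background is exactly (255, 255, 255), and the function names are hypothetical.

```python
from collections import defaultdict

WHITE = (255, 255, 255)  # assumed exact background colour


def binary_mask(pixels):
    """Return rows of booleans: True for text pixels, False for background."""
    return [[tuple(px) != WHITE for px in row] for row in pixels]


def character_pixels(pixels):
    """Group the (x, y) coordinates of non-white pixels by their colour.

    Each key is the colour of one character, so each value is the set of
    pixels that make up that character.
    """
    chars = defaultdict(list)
    for y, row in enumerate(pixels):
        for x, px in enumerate(row):
            colour = tuple(px)
            if colour != WHITE:
                chars[colour].append((x, y))
    return dict(chars)
```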

A Ground-Truth text file containing Unicode labels corresponding to each marked character in the ground-truth image file is also provided. Each line of the ground-truth text file has the following format:

200 77 18 457 142 443 128 473 169 "T"

The first three numbers are the RGB values of the colour corresponding to character "T" in the ground truth image. The 4th and 5th columns give the coordinates of the centre of "T", and the last four columns give the bounding box of "T" (top-left and bottom-right corners).

Note that the concept of "Don't Care" regions is used, referring to all text in the image that is illegible, usually because of very small text height or because of a high amount of occlusion. The evaluation method will be modified such that if algorithm 'A' identifies such regions as text while algorithm 'B' ignores them, both algorithms will get the same score (hence the name "Don't Care" for these regions). Don't Care regions are marked with filled rectangular boxes in the ground-truth image. The corresponding line in the ground-truth text file starts with a "#" symbol and has a space as the label of the region.
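A line of the ground-truth text file, including the "#" marker for Don't Care regions, could be read along the following lines. This is a sketch under assumptions: `CharRecord` and `parse_char_line` are hypothetical names, the nine numeric fields are taken to be integers, and the label is taken to be everything between the first and last double quote on the line, since the escaping rules for this file are not stated.

```python
from typing import NamedTuple, Tuple


class CharRecord(NamedTuple):
    colour: Tuple[int, int, int]     # RGB value used in the ground truth image
    centre: Tuple[int, int]          # centre of the character
    box: Tuple[int, int, int, int]   # top-left and bottom-right corners
    label: str
    dont_care: bool


def parse_char_line(line: str) -> CharRecord:
    """Parse one line such as: 200 77 18 457 142 443 128 473 169 "T"."""
    body = line.strip("\r\n").lstrip()
    dont_care = body.startswith("#")   # Don't Care lines start with '#'
    body = body.lstrip("#")
    first, last = body.find('"'), body.rfind('"')
    if first == -1 or first == last:
        raise ValueError(f"no quoted label in line: {line!r}")
    numbers = [int(token) for token in body[:first].split()]
    if len(numbers) != 9:
        raise ValueError(f"expected 9 numeric fields, got {len(numbers)}")
    return CharRecord(
        colour=tuple(numbers[0:3]),
        centre=tuple(numbers[3:5]),
        box=tuple(numbers[5:9]),
        label=body[first + 1:last],
        dont_care=dont_care,
    )
```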

The authors will be asked to automatically segment the test images and submit their segmentation result as a series of bi-level images, following the same format. A single compressed (zip or rar) file should be submitted containing all the result files. In the case that your method fails to produce any results for an image, you can either include an empty result file or no file at all.

Evaluation will be primarily based on the methodology proposed by the organisers in the paper [2], while a typical precision / recall measurement will also be provided for consistency, in the same fashion as [3].

Task 2.3: Word Recognition

For the word recognition task, we provide all the words in our dataset in separate image files, along with the corresponding ground-truth transcription (See Figure 2 for examples). Note that there are many short words and even single letters in this dataset. The transcription of all words is provided in a SINGLE text file for the whole collection. Each line in the ground truth file has the following format:

[image name], "transcription"

An example is given in Figure 3. Please note that the escape character (\) is used for double quotes and backslashes.

For testing we provide the images of 1095 words and we will ask for the transcription of each image. A single transcription per image will be requested. The authors should return all result transcriptions in a single text file of the same format as the ground truth.

For the evaluation we will calculate the edit distance between the submitted transcription and the ground truth transcription. Equal weights will be set for all edit operations. The best performing method will be the one with the smallest total edit distance.
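The measure described here can be illustrated with the standard Levenshtein distance, in which insertions, deletions and substitutions each cost 1. The sketch below is not the official evaluation script; it assumes a case-sensitive comparison and that a word with no submitted transcription is scored as an empty string, neither of which is stated above. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for every edit operation."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def total_edit_distance(results: dict, ground_truth: dict) -> int:
    """Sum the distances over all ground truth words (image name -> text)."""
    return sum(
        edit_distance(results.get(name, ""), text)  # missing result: assumed ""
        for name, text in ground_truth.items()
    )
```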

Note that words are cut out with a frame of 4 pixels around them (instead of the tight bounding box), in order to preserve the immediate context. This is usual practice to facilitate processing (see for example the MNIST character dataset).
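The framing can be pictured as a small box expansion. In the sketch below the function name is hypothetical, and clamping the frame to the image borders is an assumption about how words near the edge are handled, not something stated above.

```python
def framed_box(box, image_width, image_height, margin=4):
    """Expand (left, top, right, bottom) by `margin` pixels on every side.

    The result is clamped to the image so the crop never leaves the image
    (an assumption for words that touch the border).
    """
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(image_width, right + margin),
        min(image_height, bottom + margin),
    )
```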

Task 2.4: End to End

Ground truth is provided for each image of the training set that comprises the bounding quadrilateral of each word as well as the transcription of the word. The ground truth is the same as for Task 2.1. One- or two-character words as well as words deemed unreadable are annotated in the dataset as “do not care” following the ground truthing protocol (to be made public).

Vocabularies

Apart from the transcription and location ground truth, we provide a generic vocabulary of about 90k words, a vocabulary of all words in the training set, and per-image vocabularies of 100 words comprising all words in the corresponding image as well as distractor words selected from the rest of the training set vocabulary, following the setup of Wang et al. [4]. Authors are free to incorporate other vocabularies / text corpora during training to enhance their language models, in which case they will be requested to indicate so at submission time to facilitate the analysis of results.

All vocabularies provided contain words of 3 characters or longer comprising only letters.

Vocabularies comprise only the proper words in the images. They do not contain alphanumeric structures that correspond to prices, URLs, times, dates, emails etc. Such structures, when deemed readable, are tagged in the images and an end-to-end method should be able to recognise them, although the vocabularies provided do not include them explicitly.

Words were stripped of any preceding or trailing symbols and punctuation marks before they were added to the vocabulary. Words that still contained any symbols or punctuation marks (with the exception of hyphens) were filtered out as well. So for example "e-mail" is a valid vocabulary entry, while "rrc.cvc.uab.es" is a non-word and is not included.
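The filtering rules above can be approximated as follows. This is a sketch of one reading of those rules, not the script used to build the vocabularies: the function name is hypothetical, "symbols and punctuation" is taken to mean any non-alphanumeric character, and the 3-character minimum is applied after stripping.

```python
def vocabulary_entry(token: str):
    """Return the cleaned word if it qualifies for the vocabulary, else None."""
    # Strip preceding and trailing symbols and punctuation marks.
    start, end = 0, len(token)
    while start < end and not token[start].isalnum():
        start += 1
    while end > start and not token[end - 1].isalnum():
        end -= 1
    word = token[start:end]

    # Keep only words of 3 characters or longer.
    if len(word) < 3:
        return None
    # Letters only, with hyphens as the single permitted exception.
    if not all(ch.isalpha() or ch == "-" for ch in word):
        return None
    return word
```

With these rules, `vocabulary_entry("e-mail")` keeps the word, while `vocabulary_entry("rrc.cvc.uab.es")` returns `None` because of the interior dots.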

Submission Stage

For the test phase, we will provide a set of test images along with three specific lists of words for each test image that comprise:

Strongly Contextualised: per-image vocabularies of 100 words including all words (3 characters or longer, only letters) that appear in the image as well as a number of distractor words chosen at random from the same test subset, following the setup of Wang et al. [4],
Weakly Contextualised: all proper words (3 characters or longer, only letters) that appear in the entire test set, and
Generic: any vocabulary can be used; a 90k word vocabulary is provided.

For each of the above variants, participants can make use of the corresponding vocabulary given to guide the end-to-end word detection and recognition process.

Participants will be able to submit end-to-end results for these variants in a single submission step. The Strongly Contextualised variant is obligatory, while the Weakly Contextualised and Generic variants are optional.

Along with the submission of results, participants will have the option to submit the corresponding executable binary file (Windows, Linux or Mac executable). This optional binary file can be added to the submission at a later time (there is no need to delay the submission of results). The executable of the method will be used over a hidden test subset to further analyse the method and provide insight to the authors. The ownership of the file remains with the authors, and the organisers of the competition will keep the executable private and will not make use of the executable in any way unrelated to the competition. The executable should be:

A Windows, Linux or Mac executable
Compiled for single-core architectures
Free of external dependencies (statically linked, or with all libraries supplied)
Command-line only, with no graphical interface
Input parameters: vocabulary filename (e.g. images/img.txt), image filename (e.g. images/img.png)
Output: a text file named out.txt containing the results for the image, in the same format as the submission

Evaluation

The evaluation protocol proposed by Wang et al. [4] will be used, which considers a detection as a match if it overlaps a ground truth bounding box by more than 50% (as in [5]) and the words match, ignoring case. Detecting or missing words marked as “do not care” will not affect (positively or negatively) the results. Any detections overlapping more than 50% with “do not care” ground truth regions will be discarded from the submitted results before evaluation takes place, and evaluation will not take into account ground truth regions marked as “do not care”.
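The matching rule can be sketched as below. This is an illustration, not the official evaluation code. It assumes that "overlap" means intersection over union of axis-aligned boxes, that the same measure is used against "do not care" regions, that each ground truth word can be matched at most once, and that box areas are computed as (right - left) * (bottom - top). All function names are hypothetical.

```python
def area(box):
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def count_matches(detections, ground_truth, threshold=0.5):
    """Return (matches, detections kept, ground truth words considered).

    detections:   list of (box, text)
    ground_truth: list of (box, text, dont_care)
    """
    dont_care = [box for box, _, dc in ground_truth if dc]
    targets = [(box, text) for box, text, dc in ground_truth if not dc]

    # Discard detections that fall on "do not care" regions.
    kept = [(box, text) for box, text in detections
            if not any(iou(box, dc) > threshold for dc in dont_care)]

    used = set()
    matches = 0
    for box, text in kept:
        for i, (gt_box, gt_text) in enumerate(targets):
            if i in used:
                continue
            # A match needs sufficient overlap and the same word, ignoring case.
            if iou(box, gt_box) > threshold and text.lower() == gt_text.lower():
                used.add(i)
                matches += 1
                break
    return matches, len(kept), len(targets)
```

The three counts returned are enough to derive measures such as precision (matches over detections kept) and recall (matches over ground truth words considered).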

References

[1] C. Wolf and J.M. Jolion, "Object Count / Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms", International Journal on Document Analysis and Recognition, vol. 8, no. 4, pp. 280-296, 2006.
[2] A. Clavelli, D. Karatzas, and J. Llados, "A Framework for the Assessment of Text Extraction Algorithms on Complex Colour Images", in Proceedings of the 9th IAPR Workshop on Document Analysis Systems, Boston, MA, 2010, pp. 19-28.
[3] K. Ntirogiannis, B. Gatos, and I. Pratikakis, "An Objective Methodology for Document Image Binarization Techniques", in Proceedings of the 8th International Workshop on Document Analysis Systems, Nara, Japan, 2008, pp. 217-224.
[4] K. Wang, B. Babenko, and S. Belongie, "End-to-End Scene Text Recognition", in Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), November 2011, pp. 1457-1464.
[5] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge: A Retrospective", International Journal of Computer Vision, vol. 111, no. 1, pp. 98-136, 2014.