Tasks - Incidental Scene Text

The Challenge is set up around three tasks, all of them new for the 2015 edition of the competition:

  • Text Localization, where the objective is to obtain a rough estimation of the text areas in the image, in terms of bounding boxes that correspond to parts of text (words or text lines).
  • Word Recognition, where the locations (bounding boxes) of words in the image are assumed to be known and the corresponding text transcriptions are sought.
  • End-to-End, where the objective is to localise and recognise all words in the image in a single step.

A training set of 1000 images containing about 4500 readable words will be provided through the downloads section. Different ground truth data is provided for each task.

All images are provided as JPEG or PNG files and the text files are UTF-8 files with CR/LF new line endings.

Task 4.1: Text Localization

For the text localization task we will provide bounding boxes of words for each of the images. The ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word's bounding box and its transcription in a comma separated format (see Figure 1).

[Figure 1: Ch4_Task1_Figure1.png]

For the text localization task the ground truth data is provided in terms of word bounding boxes. Unlike Challenges 1 and 2, bounding boxes are NOT axis oriented in Challenge 4, and they are specified by the coordinates of their four corners in clockwise order. For each image in the training set a separate UTF-8 text file will be provided, following the naming convention:

gt_[image name].txt

The text files are comma separated files, where each line corresponds to one word in the image and gives its bounding box coordinates (four corners, clockwise) and its transcription in the format:

x1, y1, x2, y2, x3, y3, x4, y4, transcription

Please note that anything that follows the eighth comma is part of the transcription, and no escape characters are used. "Do Not Care" regions are indicated in the ground truth with a transcription of "###".
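The following is a minimal parsing sketch for one ground truth line, not an official reader. It assumes integer pixel coordinates, and the function name parse_gt_line is hypothetical. The key point it illustrates is that the line is split on the first eight commas only, so commas inside the transcription are preserved.

```python
def parse_gt_line(line):
    """Split one ground truth line into (corner points, transcription, do-not-care flag)."""
    # Drop the CR/LF line ending and, defensively, a UTF-8 byte-order mark.
    line = line.rstrip("\r\n").lstrip("\ufeff")
    # Split on the first eight commas only; everything after them is the transcription.
    fields = line.split(",", 8)
    if len(fields) != 9:
        raise ValueError(f"expected 8 coordinates and a transcription: {line!r}")
    # Assumption: coordinates are integers (int() tolerates surrounding spaces).
    coords = [int(value) for value in fields[:8]]
    points = list(zip(coords[0::2], coords[1::2]))  # [(x1, y1), ..., (x4, y4)]
    transcription = fields[8]
    # "###" marks a "Do Not Care" region.
    dont_care = transcription.strip() == "###"
    return points, transcription, dont_care
```

For a made-up line such as "10,20,50,20,50,40,10,40,Hello, world" the transcription returned is "Hello, world", with the inner comma intact.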

The authors will be required to automatically localise the text in the images and return bounding boxes. The results will have to be submitted in separate text files for each image, with each line corresponding to a bounding box (comma separated values) as per the above format. A single compressed (zip or rar) file should be submitted containing all the result files. In the case that your method fails to produce any results for an image, you can either include an empty result file or no file at all.

Contrary to Challenges 1 and 2, the evaluation of the results will be based on a single Intersection-over-Union criterion, with a threshold of 50%, following standard practice in object recognition and the Pascal VOC challenge [1].
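As an illustration of the Intersection-over-Union criterion for non-axis-oriented boxes, the sketch below computes the IoU of two quadrilaterals using Sutherland-Hodgman polygon clipping. It is not the official evaluation code: it assumes both quadrilaterals are convex, and the function names are hypothetical.

```python
def signed_area(poly):
    """Shoelace formula; the sign encodes the winding direction."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))


def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex `clipper`
    whose signed area is positive."""
    output = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        points, output = output, []

        def side(p):
            # Non-negative when p lies on the inner side of edge a -> b.
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        for p, q in zip(points, points[1:] + points[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                # Segment p -> q crosses the clipping edge: keep the crossing point.
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output


def quad_iou(quad_a, quad_b):
    """IoU of two convex quadrilaterals given as lists of four (x, y) corners."""
    # Normalise the winding so the clipping test has a consistent "inside".
    a = quad_a if signed_area(quad_a) > 0 else quad_a[::-1]
    b = quad_b if signed_area(quad_b) > 0 else quad_b[::-1]
    inter_poly = clip(a, b)
    inter = abs(signed_area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    union = abs(signed_area(a)) + abs(signed_area(b)) - inter
    return inter / union if union > 0 else 0.0
```

Under the 50% threshold, a detection would then count as correct for a ground truth box when quad_iou(detection, ground_truth) exceeds 0.5.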

Task 4.2: Text Segmentation

Not available for this Challenge.

Task 4.3: Word Recognition


For the word recognition task, we provide all the words in our dataset with 3 characters or more in separate image files, along with the corresponding ground-truth transcription (see Figure 2 for examples). For each word the axis oriented area that tightly contains the word will be provided.

The transcription of all words is provided in a SINGLE UTF-8 text file for the whole collection. Each line in the ground truth file has the following format:

[word image name], "transcription"

An example is given in Figure 2. Please note that the escape character (\) is used for double quotes and backslashes.
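A small sketch of reading and writing lines in this format is given below. It assumes that image names contain no commas and that the only escape sequences are \" and \\; both function names are hypothetical.

```python
def parse_word_gt_line(line):
    """Return (image name, transcription) from a line like: name, "text"."""
    line = line.rstrip("\r\n").lstrip("\ufeff")
    # Assumption: the image name itself contains no comma.
    name, _, quoted = line.partition(",")
    quoted = quoted.strip()
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError(f"transcription is not double-quoted: {line!r}")
    chars = []
    body = iter(quoted[1:-1])
    for ch in body:
        if ch == "\\":
            # A backslash escapes the next character (a quote or a backslash).
            ch = next(body, "\\")
        chars.append(ch)
    return name.strip(), "".join(chars)


def format_word_gt_line(name, transcription):
    """Inverse of parse_word_gt_line: escape backslashes first, then quotes."""
    escaped = transcription.replace("\\", "\\\\").replace('"', '\\"')
    return f'{name}, "{escaped}"'
```

Escaping backslashes before quotes matters when writing: doing it the other way round would double the backslashes that were just added in front of the quotes.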

In addition, the relative coordinates of the (non-axis oriented) bounding box that defines the word within the cut-out word image will be provided in a separate SINGLE text file for the whole collection. Coordinates of the words are given in reference to the cut-out box, as the four corners of the bounding box in clockwise order. Each line in the ground truth file has the following format:

[word image name], x1, y1, x2, y2, x3, y3, x4, y4

An example is given in Figure 2.

For testing we will provide the images of about 2000 words and we will ask for the transcription of each image. A single transcription per image will be requested. The authors should return all transcriptions in a single text file of the same format as the ground truth.

For the evaluation we will calculate the edit distance between the submitted transcription and the ground truth transcription for each image. Equal weights will be set for all edit operations. The best performing method will be the one with the smallest total edit distance.
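The metric described above can be sketched as a standard Levenshtein distance with unit cost for insertions, deletions and substitutions, summed over all word images. This is an illustration rather than the official scoring script; in particular, treating a missing transcription as an empty string is an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with equal (unit) weights for all edit operations."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def total_edit_distance(submitted, ground_truth):
    """Sum the distances over all images; both arguments map image name -> text.

    Assumption: an image with no submitted transcription is scored
    against the empty string.
    """
    return sum(edit_distance(submitted.get(name, ""), text)
               for name, text in ground_truth.items())
```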

Task 4.4: End to End

Ground truth is provided for each image of the training set that comprises the bounding quadrilateral of each word as well as the transcription of the word. The ground truth is the same as for Task 4.1. One- or two-character words as well as words deemed unreadable are annotated in the dataset as “do not care” following the ground truthing protocol (to be made public).


Apart from the transcription and location ground truth we provide a generic vocabulary of about 90k words, a vocabulary of all words in the training set and per-image vocabularies of 100 words comprising all words in the corresponding image as well as distractor words selected from the rest of the training set vocabulary, following the setup of Wang et al. [2]. Authors are free to incorporate other vocabularies / text corpora during training to enhance their language models, in which case they will be requested to indicate so at submission time to facilitate the analysis of results.

All vocabularies provided contain words of 3 characters or longer comprising only letters.

Vocabularies do not contain alphanumeric structures that correspond to prices, URLs, times, dates, emails etc. Such structures, when deemed readable, are tagged in the images and an end-to-end method should be able to recognise them, although the vocabularies provided do not include them explicitly.

Words were stripped of any preceding or trailing symbols and punctuation marks before they were added to the vocabulary. Words that still contained any symbols or punctuation marks (with the exception of hyphens) were filtered out as well. So for example "e-mail" is a valid vocabulary entry, while "rrc.cvc.uab.es" is a non-word and is not included.
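The filtering rules above could be approximated as follows. This is only a sketch of one possible reading: it assumes that "symbols and punctuation" means any non-alphanumeric character, that the 3-character minimum is checked after stripping, and that hyphens count towards the length. The function name is hypothetical.

```python
def normalise_vocab_word(token):
    """Return the vocabulary form of `token`, or None if it is filtered out."""
    # Strip preceding and trailing symbols and punctuation marks.
    start, end = 0, len(token)
    while start < end and not token[start].isalnum():
        start += 1
    while end > start and not token[end - 1].isalnum():
        end -= 1
    word = token[start:end]
    # Keep words of 3 characters or longer...
    if len(word) < 3:
        return None
    # ...that consist only of letters, with hyphens as the one exception.
    if not all(ch.isalpha() or ch == "-" for ch in word):
        return None
    return word
```

With these assumptions, "e-mail" is kept unchanged, while "rrc.cvc.uab.es" is rejected because of the inner dots.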

Submission Stage

For the test phase, we will provide a set of test images along with three specific lists of words for each test image that comprise:
  1. Strongly Contextualised: per-image vocabularies of 100 words including all words (3 characters or longer, only letters) that appear in the image as well as a number of distractor words chosen at random from the same test subset, following the setup of Wang et al. [2],
  2. Weakly Contextualised: all words (3 characters or longer, only letters) that appear in the entire test set, and
  3. Generic: any vocabulary can be used; a 90k word vocabulary is provided.

For each of the above variants, participants can make use of the corresponding vocabulary given at test time to guide the end-to-end word detection and recognition process.

Participants will be able to submit end-to-end results for these variants in a single submission step. Variant (1) will be obligatory, while variants (2) and (3) are optional.

Along with the submission of results, participants will have the option to submit the corresponding executable binary file (Windows, Linux or Mac executable). This optional binary file can be added to the submission at a later time (there is no need to delay the submission of results). The executable of the method will be used over a hidden test subset to further analyse the method and provide insight to the authors. The ownership of the file remains with the authors, and the organisers of the competition will keep the executable private and will not make use of the executable in any way unrelated to the competition. The executable should be:

  • A Windows, Linux or Mac executable
  • Compiled for single core architectures
  • Free of external dependencies (statically linked, or with all libraries supplied)
  • Command line only, with no graphical interface
  • Input parameters: vocabulary filename (e.g. images/img.txt), image filename (e.g. images/img.png)
  • Output: a text file called out.txt containing the results for the image, in the same format as the submission


The evaluation protocol proposed by Wang et al. [2] will be used, which considers a detection as a match if it overlaps a ground truth bounding box by more than 50% (same as [1] and Task 4.1) and the words match, ignoring case. Detecting or missing words marked as “do not care” will not affect (positively or negatively) the results. Any detections overlapping more than 50% with “do not care” ground truth regions will be discarded from the submitted results before evaluation takes place, and evaluation will not take into account ground truth regions marked as “do not care”.
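The matching rules described above can be sketched for a single image as follows. This is a simplified illustration and not the official evaluation script: boxes are axis-aligned rectangles (left, top, right, bottom) as a stand-in for quadrilaterals, the same IoU measure is used for the “do not care” test, and each ground truth word is allowed to match at most one detection. All of these are assumptions, and the function names are hypothetical.

```python
def rect_iou(a, b):
    """IoU of axis-aligned rectangles (left, top, right, bottom).

    Stand-in only: the real task uses quadrilaterals.
    """
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    inter = width * height
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def score_image(detections, ground_truth, overlap=rect_iou, threshold=0.5):
    """Return (matches, detections kept, ground truth words that count).

    detections:   list of (box, word)
    ground_truth: list of (box, word, dont_care)
    """
    dont_care = [box for box, _, flag in ground_truth if flag]
    care = [(box, word) for box, word, flag in ground_truth if not flag]
    # Discard detections that fall on "do not care" regions before scoring.
    kept = [(box, word) for box, word in detections
            if not any(overlap(box, dc) > threshold for dc in dont_care)]
    used = set()
    matches = 0
    for box, word in kept:
        for index, (gt_box, gt_word) in enumerate(care):
            if index in used:
                continue  # assumption: one detection per ground truth word
            # A match needs sufficient overlap and equal words, ignoring case.
            if overlap(box, gt_box) > threshold and word.lower() == gt_word.lower():
                used.add(index)
                matches += 1
                break
    return matches, len(kept), len(care)
```

The three counts returned are enough to derive precision (matches / detections kept) and recall (matches / ground truth words that count) for an image or, after summing, for a whole set.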


  1. M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective”, International Journal of Computer Vision, 111(1), pp. 98-136, 2014.
  2. K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition”, in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1457-1464, IEEE, November 2011.
