Tasks - Text in Videos
- Text Localisation, where the objective is to localise and track all words in the video sequences.
- End-to-End, where the objective is to localise, track and recognise all words in the video sequences.
- Follow wayfinding panels walking outdoors
- Search for a shop in a shopping street
- Browse products in a supermarket
- Search for a location in a building
- Highway watch
- Train watch
- Head-mounted camera
- Mobile phone
- Hand-held camcorder
- HD camera
<?xml version="1.0" encoding="us-ascii"?>
<frames>
  <frame ID="1">
    <object Transcription="T" ID="1001" Quality="low" Language="Spanish" Mirrored="unmirrored">
      <Point x="97" y="382" />
      <Point x="126" y="382" />
      <Point x="125" y="410" />
      <Point x="97" y="411" />
    </object>
    <object Transcription="910" ID="1002" Quality="moderate" Language="Spanish" Mirrored="unmirrored">
      <Point x="607" y="305" />
      <Point x="640" y="305" />
      <Point x="639" y="323" />
      <Point x="609" y="322" />
    </object>
  </frame>
  <!-- An empty frame: no objects -->
  <frame ID="2" />
  <frame ID="3">
    <object Transcription="T" ID="1001" Quality="moderate" Language="Spanish" Mirrored="unmirrored">
      <Point x="98" y="384" />
      <Point x="127" y="384" />
      <Point x="125" y="412" />
      <Point x="97" y="413" />
    </object>
    <object Transcription="910" ID="1002" Quality="high" Language="Spanish" Mirrored="unmirrored">
      <Point x="609" y="307" />
      <Point x="642" y="307" />
      <Point x="641" y="325" />
      <Point x="611" y="324" />
    </object>
  </frame>
</frames>
- <frames> is the root tag.
- <frame ID=""> identifies a frame inside the video. ID is the index of the frame in the video.
- <object Transcription="" ID="" Language="" Mirrored="" Quality=""> represents each of the objects (words) in the frame.
  - Transcription is the textual transcription of the word.
  - ID is a unique identifier of an object; all occurrences of the same object have the same ID.
  - Language defines the language the word is written in.
  - Mirrored is a boolean value that defines whether the word is seen through a mirrored surface or not.
  - Quality is the quality of the text, which can be one of these values: low, moderate or high. The low value is special, as it is used to define text areas that are unreadable. During the evaluation, such areas will not be taken into account: a method will not be penalised if it does not detect these words, while a method that detects them will not get any better score.
- <Point x="" y="" /> represents a point of the word bounding box in the image. Bounding boxes always comprise 4 points. See more information about the ground truthing protocol here.
If no objects exist in a particular frame, the frame tag is created empty.
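Ground-truth files in this format can be read with standard-library tools. The following is an illustrative sketch, not official tooling; the tag and attribute names follow the example above, and the sample frame IDs are placeholders:

```python
import xml.etree.ElementTree as ET

def load_ground_truth(xml_bytes):
    """Parse a <frames> ground-truth document into a dict mapping
    frame ID -> list of (object ID, transcription, quality, [(x, y), ...])."""
    root = ET.fromstring(xml_bytes)
    frames = {}
    for frame in root.iter("frame"):
        objects = []
        for obj in frame.iter("object"):
            points = [(int(p.get("x")), int(p.get("y"))) for p in obj.iter("Point")]
            objects.append((obj.get("ID"), obj.get("Transcription"),
                            obj.get("Quality"), points))
        frames[frame.get("ID")] = objects  # empty list for an empty frame
    return frames

# A minimal document in the format shown above (frame IDs are illustrative)
sample = b"""<?xml version="1.0" encoding="us-ascii"?>
<frames>
  <frame ID="1">
    <object Transcription="T" ID="1001" Quality="low" Language="Spanish" Mirrored="unmirrored">
      <Point x="97" y="382" /><Point x="126" y="382" />
      <Point x="125" y="410" /><Point x="97" y="411" />
    </object>
  </frame>
  <frame ID="2" />
</frames>"""

gt = load_ground_truth(sample)
print(len(gt["1"]), len(gt["2"]))  # prints: 1 0
```

Note that the XML is passed as bytes so that the `us-ascii` encoding declaration is honoured by the parser.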
Participants are required to automatically localise the words in the images and return affine bounding boxes in the same XML format. In the XML files submitted by participants, only the ID attribute is expected for each object; any other attributes will be ignored.
A single compressed (zip or rar) file should be submitted containing all the result files for all the videos of the test set. If your method fails to produce any results for a particular video, simply include no XML file for that video.
The evaluation of the results will be based on an adaptation of the CLEAR-MOT evaluation framework (Bernardin and Stiefelhagen, 2008). The basis for our evaluation is the code provided by Bagdanov et al. (2012). For more detail, see Karatzas et al. (2013).
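For reference, the standard CLEAR-MOT metrics (Bernardin and Stiefelhagen, 2008) are defined as follows, where $FN_t$, $FP_t$ and $IDSW_t$ are the misses, false positives and identity switches at frame $t$, $g_t$ is the number of ground-truth objects in frame $t$, $d_{i,t}$ is the localisation error of matched object $i$, and $c_t$ is the number of matches in frame $t$; the competition's adaptation of these definitions is described in the referenced papers:

```latex
MOTA = 1 - \frac{\sum_t \left( FN_t + FP_t + IDSW_t \right)}{\sum_t g_t},
\qquad
MOTP = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t}
```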
The objective of this task is to recognise words in the video as well as localise them in terms of time and space. The localisation in time means identifying when a word appears and disappears in the video. The localisation in space means identifying where a word exists in a single frame. The task requires that correctly recognised words are also correctly localised in every frame and tracked correctly over the video sequence.
In this task, the same dataset and ground truth as in Task 3.1 are used, along with certain vocabularies provided.
Apart from the transcription and location ground truth, we provide a generic vocabulary, a vocabulary of all words in the training set, and per-image vocabularies of 200 words comprising all proper words in the corresponding image as well as distractor words selected from the training set vocabulary, following the setup of Wang et al. (2011).
Along with the training set sequences and ground truth, we provide a generic training vocabulary of about 90k words as extra material that can be used for building statistical language models. Authors are free to incorporate other vocabularies / text corpora during their training to enhance their language models, in which case they will be requested to indicate so at submission time to facilitate the analysis of results.
All vocabularies provided contain words of 3 characters or longer comprising only letters.
Vocabularies do not contain alphanumeric structures that correspond to prices, URLs, times, dates, emails, etc. Such structures, when deemed readable, are tagged in the images, and an end-to-end method should be able to recognise them, although the vocabularies provided do not include them explicitly.
Words were stripped of any preceding or trailing symbols and punctuation marks before they were added to the vocabulary. Words that still contained any symbols or punctuation marks (with the exception of hyphens) were filtered out as well. So for example "e-mail" is a valid vocabulary entry, while "rrc.cvc.uab.es" is a non-word and is not included.
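The filtering rules above can be sketched as follows. This is an illustrative implementation, not the organisers' actual script, and the exact set of stripped symbols is an assumption:

```python
import string

def vocabulary_entry(token):
    """Strip leading/trailing symbols and punctuation, then accept the word
    only if it is 3+ characters and contains nothing but letters and hyphens.
    Returns the cleaned word, or None if the token is filtered out."""
    word = token.strip(string.punctuation)  # edge symbols only; inner ones stay
    if len(word) < 3:
        return None
    # reject anything still containing digits or symbols other than hyphens
    if any(not ch.isalpha() and ch != "-" for ch in word):
        return None
    return word

print(vocabulary_entry('"e-mail,"'))       # e-mail
print(vocabulary_entry("rrc.cvc.uab.es"))  # None
```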
- (1) Strongly Contextualised: per-sequence vocabularies of 200 words including all words (3 characters or longer, letters only) that appear in the sequence as well as a number of distractor words chosen at random from the test set vocabulary,
- (2) Weakly Contextualised: all words (3 characters or longer, letters only) that appear in the entire test set, and
- (3) Generic: any vocabulary can be used; a 90k word vocabulary is provided.
For each of the above variants, participants can make use of the corresponding vocabulary given to guide the end-to-end word detection and recognition process.
Participants will be able to submit end-to-end results for these variants in a single submission step. Variant (1) will be obligatory, while variants (2) and (3) will be optional.
The evaluation is made in two steps: (1) whether each word sequence is correctly identified is evaluated, and (2) the F-measure at the word sequence level is calculated. For (1), word localisation and recognition are evaluated separately. Word localisation performance is evaluated with the same scheme as Task 3.1. Word recognition performance is evaluated simply by whether a word recognition result is completely correct.
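The F-measure in step (2) is the standard harmonic mean of precision and recall over correctly identified word sequences:

```latex
F = \frac{2 \cdot P \cdot R}{P + R},
\qquad
P = \frac{\#\,\text{correct sequences}}{\#\,\text{detected sequences}},
\qquad
R = \frac{\#\,\text{correct sequences}}{\#\,\text{ground-truth sequences}}
```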
Word recognition evaluation is case-insensitive and accent-insensitive. Words with fewer than 3 characters are not taken into account for evaluation. Similarly, words containing non-alphanumeric characters are not taken into account, with the following exceptions: the hyphen and apostrophe are always allowed; exclamation marks, question marks, dots, and commas are allowed only at the beginning/end of a word, and a method is not penalised if the transcription does not include them.
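A sketch of the case- and accent-insensitive comparison described above, with the tolerated edge punctuation ignored; the organisers' exact normalisation may differ:

```python
import unicodedata

EDGE_PUNCT = "!?.,"  # tolerated only at the beginning/end of a word

def normalise(word):
    """Drop tolerated edge punctuation, strip accents, and lower-case."""
    word = word.strip(EDGE_PUNCT)
    # decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def words_match(gt_word, recognised):
    return normalise(gt_word) == normalise(recognised)

print(words_match("Información", "informacion!"))  # True
```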
To perform the evaluation described above, for each video the participants are required to submit an XML file in the same format as Task 3.1, and a text file containing the word recognition results. The text file should follow the following format:
where the first field contains the object ID and the second field contains the corresponding recognition result.
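As a sketch, assuming one `ID,transcription` pair per line (the comma delimiter is an assumption for illustration, not the specified layout), such a file could be parsed as:

```python
def load_recognition_results(lines):
    """Parse result lines of the form '<object ID>,<recognised word>' into a
    dict.  The comma delimiter is an assumption; the competition's actual
    layout may differ."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj_id, transcription = line.split(",", 1)
        results[obj_id] = transcription
    return results

res = load_recognition_results(["1001,Gracias", "1002,910"])
print(res["1001"])  # prints: Gracias
```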
Note that frame-level recognitions of the same word object may differ between frames, due to the word entering the frame, exiting the frame, or occlusions. Despite these obstructions, the sequence-level recognition results should be COMPLETE words. For example, the word "Gracias" should be reported as "Gracias", even though it may be seen as "Gra" while entering and as "cias" while exiting the scene.
Keni Bernardin and Rainer Stiefelhagen, "Evaluating multiple object tracking performance: the CLEAR MOT metrics", J. Image Video Process. 2008, Article 1 (January 2008), 10 pages. DOI=10.1155/2008/246309
 A.D. Bagdanov, A. Del Bimbo, F. Dini, G. Lisanti, and I. Masi, "Compact and efficient posterity logging of face imagery for video surveillance", IEEE Multimedia, 2012
D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez, S. Robles, J. Mas, D. Fernandez, J. Almazan, L.P. de las Heras, "ICDAR 2013 Robust Reading Competition", in Proc. 12th International Conference on Document Analysis and Recognition, 2013, IEEE CPS, pp. 1115-1124
K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition", in Proc. IEEE International Conference on Computer Vision (ICCV), 2011, pp. 1457-1464