Robust Reading Competition
Challenges

Tasks - Text in Videos

The Challenge will comprise two tasks:
  • Text Localisation, where the objective is to localise and track all words in the video sequences.
  • End to End, where the objective is to localise, track and recognise all words in the video sequences.
 
A training set of 25 videos (13450 frames in total) and a test set of 24 videos (14374 frames in total) are available through the downloads section.
 
The dataset was collected by the organisers in different countries, in order to include text in different languages. The video sequences correspond to 7 high-level tasks, selected to represent typical real-life applications and to cover both indoor and outdoor scenarios. Four different cameras were used across the sequences, so that the dataset also covers a variety of capture hardware.
 
The dataset is summarised in the following table (for the meaning of the numbers in the Task and Camera type columns, please refer to the naming convention of the training set videos described below).
 
Training set

Video ID   Task   Camera type   No. of Frames   Duration
2          1      2             452             0:15
4          4      2             662             0:22
7          5      1             264             0:11
8          5      1             240             0:10
10         1      1             336             0:14
13         4      1             576             0:24
16         3      2             301             0:10
18         3      1             198             0:08
19         5      1             408             0:17
21         5      1             960             0:40
25         5      2             1860            1:02
26         5      2             387             0:16
28         1      1             432             0:18
33         2      3             480             0:20
36         2      3             336             0:14
37         2      3             456             0:19
40         2      3             408             0:17
41         2      3             528             0:22
42         2      3             504             0:21
45         6      4             780             0:26
46         6      4             722             0:30
47         6      4             780             0:26
51         7      4             450             0:15
52         7      4             450             0:15
54         7      4             480             0:16
Test set

Video ID   Task   Camera type   No. of Frames   Duration
1          1      2             602             0:20
5          3      2             542             0:18
6          3      2             162             0:05
9          1      1             528             0:22
11         4      1             312             0:13
15         4      1             216             0:09
17         3      1             264             0:11
20         5      1             912             0:38
22         5      1             648             0:27
23         5      2             1020            0:34
24         5      2             1050            0:35
30         2      3             264             0:11
32         2      3             312             0:13
34         2      3             384             0:16
35         2      3             336             0:14
38         2      3             684             0:28
39         2      3             438             0:18
43         6      4             1020            0:34
44         6      4             1980            1:06
48         6      4             510             0:17
49         6      4             900             0:30
50         7      4             360             0:12
53         7      4             450             0:15
55         3      2             480             0:20
 
 
The naming convention of the videos is as follows:
 
video_[UID]_[task]_[camera].mp4
 
where [UID] is a unique ID number of the video, [task] is a number that indicates the task depicted in the video, and [camera] is the type of camera used to record it. The task IDs are as follows:
  1. Follow wayfinding panels walking outdoors
  2. Search for a shop in a shopping street
  3. Browse products in a supermarket
  4. Search for a location in a building
  5. Driving
  6. Highway watch
  7. Train watch
The camera IDs are as follows:
  1. Head-mounted camera
  2. Mobile Phone
  3. Hand-held camcorder
  4. HD camera
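For illustration, the naming convention above can be parsed with a small regular expression; a sketch (the helper name is our own choice):

```python
import re

def parse_video_filename(filename):
    """Split a filename of the form video_[UID]_[task]_[camera].mp4
    into its three numeric components (UID, task ID, camera ID)."""
    m = re.fullmatch(r"video_(\d+)_(\d+)_(\d+)\.mp4", filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    uid, task, camera = (int(g) for g in m.groups())
    return uid, task, camera
```

For example, "video_2_1_2.mp4" is training video 2, recorded while following wayfinding panels (task 1) with a mobile phone (camera 2).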

 

Task 3.1: Text Localisation in Video

The objective of this task is to obtain the location of words in the video in terms of their affine bounding boxes. The task requires that words are both localised correctly in every frame and tracked correctly over the video sequence.
 
All the videos will be provided as MP4 files.
 
Ground truth will be provided as a single XML file per video. The format of the ground truth file will follow the structure of the example below.
 
<?xml version="1.0" encoding="us-ascii"?>
<frames>
  <frame ID="1">
    <object Transcription="T" ID="1001" Quality="low" Language="Spanish" Mirrored="unmirrored">
      <Point x="97" y="382" />
      <Point x="126" y="382" />
      <Point x="125" y="410" />
      <Point x="97" y="411" />
    </object>
    <object Transcription="910" ID="1002" Quality="moderate" Language="Spanish" Mirrored="unmirrored">
      <Point x="607" y="305" />
      <Point x="640" y="305" />
      <Point x="639" y="323" />
      <Point x="609" y="322" />
    </object>
  </frame>
  <frame ID="2">
    <!-- Represents an empty frame -->
  </frame>
  <frame ID="3">
    <object Transcription="T" ID="1001" Quality="moderate" Language="Spanish" Mirrored="unmirrored">
      <Point x="98" y="384" />
      <Point x="127" y="384" />
      <Point x="125" y="412" />
      <Point x="97" y="413" />
    </object>
    <object Transcription="910" ID="1002" Quality="high" Language="Spanish" Mirrored="unmirrored">
      <Point x="609" y="307" />
      <Point x="642" y="307" />
      <Point x="641" y="325" />
      <Point x="611" y="324" />
    </object>
  </frame>
</frames>

 
where <frames> is the root tag.
 
<frame ID="num_frame"> identifies the frame inside the video; ID is the index of the frame in the video.
 
<object Transcription="transcription" ID="num_id" Language="language" Mirrored="mirrored/unmirrored" Quality="low/moderate/high"> represents each of the objects (words) in the frame.
  • Transcription is the textual transcription of the word.
  • ID is a unique identifier of an object; all occurrences of the same object have the same ID.
  • Language defines the language the word is written in.
  • Mirrored is a boolean value that defines whether the word is seen through a mirrored surface.
  • Quality is the quality of the text, one of: low, moderate or high. The low value is special, as it marks text areas that are unreadable. During the evaluation, such areas are not taken into account: a method is not penalised if it does not detect these words, while a method that does detect them gets no better score.
<Point x="000" y="000" /> represents a point of the word bounding box in the image. Bounding boxes always comprise 4 points. See more information about the ground truthing protocol here.
 
If no objects exist in a particular frame the frame tag is created empty.
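For reference, the ground truth format can be read with Python's standard xml.etree.ElementTree module; a minimal sketch (the function name and the returned structure are our own choices):

```python
import xml.etree.ElementTree as ET

def load_ground_truth(xml_text):
    """Parse a ground-truth file into {frame_id: [word, ...]}, where each
    word is a tuple (object_id, transcription, quality, [(x, y), ...]).
    Empty frames map to an empty list."""
    frames = {}
    root = ET.fromstring(xml_text)
    for frame in root.iter("frame"):
        words = []
        for obj in frame.iter("object"):
            points = [(int(p.get("x")), int(p.get("y")))
                      for p in obj.iter("Point")]
            words.append((obj.get("ID"), obj.get("Transcription"),
                          obj.get("Quality"), points))
        frames[int(frame.get("ID"))] = words
    return frames
```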
 
Participants are required to automatically localise the words in the frames and return affine bounding boxes in the same XML format. In the submitted XML files, only the ID attribute is expected for each object; any other attributes will be ignored.
 
A single compressed (zip or rar) file should be submitted, containing the result files for all the videos of the test set. If your method fails to produce any results for a particular video, simply omit the XML file for that video.
 
The evaluation of the results will be based on an adaptation of the CLEAR-MOT evaluation framework [1]. The basis for our evaluation is the code provided by Bagdanov et al. [2]. For more detail, see [3].
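At the heart of CLEAR-MOT is the MOTA score [1], which folds misses, false positives and identity switches into a single accuracy number. A minimal sketch (our own helper name; the full framework also matches detections to ground truth frame by frame, which is omitted here):

```python
def mota(misses, false_positives, id_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy: 1 minus the combined error
    rate, with all counts summed over every frame of the sequence.
    num_gt_objects is the total number of ground-truth objects
    across all frames."""
    return 1.0 - (misses + false_positives + id_switches) / num_gt_objects
```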
 
 

Task 3.4: End to End

The objective of this task is to recognise words in the video as well as localise them in terms of time and space. The localisation in time means identifying when a word appears and disappears in the video. The localisation in space means identifying where a word exists in a single frame. The task requires that correctly recognised words are also correctly localised in every frame and tracked correctly over the video sequence.

In this task, the same dataset and ground truth as in Task 3.1 are used, along with a number of provided vocabularies.

Vocabularies

Apart from the transcription and location ground truth, we provide a generic vocabulary, a vocabulary of all words in the training set, and per-sequence vocabularies of 200 words comprising all proper words in the corresponding sequence as well as distractor words selected from the training set vocabulary, following the setup of Wang et al. [4].

Along with the training set sequences and ground truth, we provide a generic training vocabulary of about 90k words as extra material that can be used for building statistical language models. Authors are free to incorporate other vocabularies / text corpora during their training to enhance their language models, in which case they will be requested to indicate so at submission time, to facilitate the analysis of results.

All vocabularies provided contain words of 3 characters or longer comprising only letters.

Vocabularies do not contain alphanumeric structures that correspond to prices, URLs, times, dates, emails etc. Such structures, when deemed readable, are tagged in the frames, and an end-to-end method should be able to recognise them, although the vocabularies provided do not include them explicitly.

Words were stripped of any preceding or trailing symbols and punctuation marks before being added to the vocabulary. Words that still contained symbols or punctuation marks (with the exception of hyphens) were filtered out as well. So, for example, "e-mail" is a valid vocabulary entry, while "rrc.cvc.uab.es" is a non-word and is not included.
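The filtering just described can be sketched in a few lines of Python (the helper name is our own, and corner cases not specified by the rules, such as leading hyphens, are only approximated here):

```python
import string

def vocabulary_entry(token):
    """Strip leading/trailing punctuation and symbols, then keep the
    word only if it is at least 3 characters long and contains
    nothing but letters and (internal) hyphens; otherwise return None."""
    word = token.strip(string.punctuation)
    if len(word) >= 3 and all(c.isalpha() or c == "-" for c in word):
        return word
    return None
```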

Submission Stage

For the test phase, we will provide a set of test sequences along with three specific types of vocabularies:
  1. Strongly Contextualised: per-sequence vocabularies of 200 words, including all words (3 characters or longer, letters only) that appear in the sequence as well as a number of distractor words chosen at random from the test set vocabulary;
  2. Weakly Contextualised: all words (3 characters or longer, letters only) that appear in the entire test set; and
  3. Generic: any vocabulary may be used; a 90k-word vocabulary is provided.

For each of the above variants, participants can make use of the corresponding vocabulary given to guide the end-to-end word detection and recognition process.

Participants will be able to submit end-to-end results for these variants in a single submission step. Variant (1) is obligatory, while variants (2) and (3) are optional.

Evaluation

The evaluation proceeds in two steps: (1) whether each word sequence is correctly identified is evaluated, and (2) the F-measure at word sequence level is calculated. In step (1), word localisation and recognition are evaluated separately. Word localisation performance is evaluated with the same scheme as Task 3.1; word recognition performance is evaluated simply by whether the recognition result is completely correct.
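Step (2) reduces to the standard F-measure over word sequences; a minimal sketch (the function name and argument names are our own):

```python
def f_measure(num_correct, num_detected, num_ground_truth):
    """Harmonic mean of precision (correct / detected) and recall
    (correct / ground truth) at word sequence level."""
    if num_correct == 0:
        return 0.0
    precision = num_correct / num_detected
    recall = num_correct / num_ground_truth
    return 2 * precision * recall / (precision + recall)
```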

Word recognition evaluation is case-insensitive and accent-insensitive. Words with fewer than 3 characters are not taken into account for evaluation. Similarly, words containing non-alphanumeric characters are not taken into account, with the following exceptions: the hyphen and the apostrophe are always allowed, while exclamation/question marks, dots and commas are allowed only at the beginning/end of a word, and a method is not penalised if the transcription does not include them.
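These comparison rules can be sketched in Python as follows (a simplification of our own; the official evaluation code remains authoritative):

```python
import unicodedata

def normalise(word):
    """Normalise a transcription for comparison: strip the optional
    leading/trailing '!?.,' characters, fold case, and remove accents
    via Unicode decomposition. Hyphens and apostrophes are kept."""
    word = word.strip("!?.,")
    decomposed = unicodedata.normalize("NFKD", word.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def match(result, ground_truth):
    """A recognition result counts only if it matches completely."""
    return normalise(result) == normalise(ground_truth)
```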

So as to perform the evaluation mentioned above, for each video, the participants are required to submit an XML file in the same format as Task 3.1, and a text file containing word recognition results. The text file should follow the following format:
--------------------------------
"910","estomago"
"1001","Gracias"
"1002","Usted"
--------------------------------
where the first field contains the word object ID and the second field contains the corresponding recognition result.
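Since each line is a pair of quoted, comma-separated fields, the file can be read with Python's csv module; a sketch (the function name is our own):

```python
import csv
import io

def read_recognition_results(text):
    """Parse result lines of the form "ID","transcription" into an
    {ID: transcription} dict; csv handles the quoting."""
    reader = csv.reader(io.StringIO(text))
    return {row[0]: row[1] for row in reader if row}
```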

Note that frame-level transcriptions of the same word object may differ across frames as the word enters the frame, exits the frame, or is occluded. Despite these obstructions, the sequence-level recognition results should be COMPLETE words. For example, the word "Gracias" should be reported as "Gracias", even though it may be seen as "Gra" while entering the scene and as "cias" while exiting it.

References

[1] Keni Bernardin and Rainer Stiefelhagen, "Evaluating multiple object tracking performance: the CLEAR MOT metrics", J. Image Video Process. 2008, Article 1 (January 2008), 10 pages. DOI=10.1155/2008/246309
 
[2] A.D. Bagdanov, A. Del Bimbo, F. Dini, G. Lisanti, and I. Masi, "Compact and efficient posterity logging of face imagery for video surveillance", IEEE Multimedia, 2012

[3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez, S. Robles, J. Mas, D. Fernandez, J. Almazan, L.P. de las Heras , "ICDAR 2013 Robust Reading Competition", In Proc. 12th International Conference of Document Analysis and Recognition, 2013, IEEE CPS, pp. 1115-1124

[4] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition", in Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1457-1464), IEEE, November 2011

Important Dates