Tasks - ICDAR 2017 Robust Reading Challenge on Omnidirectional Video

The ICDAR 2017 Robust Reading Challenge on Omnidirectional Video is organised in two modes, Video mode and Still image mode, with the following tasks:

Video mode:

In this mode, we regard the DOST dataset [1] as a video dataset and treat it in essentially the same manner as the "Text in Videos" Challenge of ICDAR 2013/2015 RRC. We organize the following two tasks:

Task V1: Localisation: The objective of this task is the correct localisation and tracking of all words (excluding "do not care" ones) in the sequence.
Task V2: End-to-end: This task aims to assess End-to-End system performance that combines correct localisation and tracking with correct recognition.

Still image mode:

In this mode, we regard the DOST dataset [1] as a set of still images and treat it in essentially the same manner as the "Incidental Scene Text" Challenge of ICDAR 2015 RRC. We organize the following three tasks:

Task I1: Localisation: The objective of this task is the correct localisation of all words (excluding "do not care" ones) in the image.
Task I2: Cropped word recognition: This task aims to evaluate recognition performance over a set of pre-localised word regions.
Task I3: End-to-end: This task aims to assess End-to-End system performance that combines correct localisation and correct recognition.

Video mode

Dataset

In the video mode, we provide the DOST dataset [1] (images and ground truth (GT)) in a similar way to the "Text in Videos" Challenge of ICDAR 2013/2015 RRC. The current DOST dataset consists of “sequences,” each of which comprises consecutive images captured with a single camera. An example is given in the figure below. For each sequence, a movie made from the consecutive images and an XML file are provided. The format of the XML file is similar to that of the "Text in Videos" Challenge. The only difference is that the GT of the DOST dataset has a tag representing the script (Latin or Japanese) instead of a tag representing the language.

Participants may submit results for Latin only, Japanese only, or both, and each option is evaluated. To be precise, a participant submits a single file at a time, which is automatically evaluated in three modalities (Japanese only, Latin only, and both).

In addition to the DOST dataset, we will provide images of Japanese characters in multiple fonts for training. Participants may use any training samples and are requested to state in the submission which samples were used for training. For the ground truthing policy, see “Ground Truthing Policy” in the DOST paper [1].

Vocabulary

We will NOT provide specific word lists (vocabularies) along with the test images. For Latin text, since most words are expected to be English, you may use the same generic vocabulary file as the one provided for the ICDAR 2017 Robust Reading Challenge on COCO-Text. Authors are free to incorporate other vocabularies or text corpora in their training process to enhance their language models, in which case they will be asked to indicate so at submission time to facilitate the analysis of results.

Evaluation

In the localisation task (Task V1), the evaluation of the results will be based on an adaptation of the CLEAR-MOT evaluation framework [3]. The basis for our evaluation is the code provided by Bagdanov et al. [4]. For more detail, see [5].
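For orientation, the two headline CLEAR-MOT scores [3] can be sketched as follows. This is a minimal sketch of the textbook definitions only, not the adapted evaluation code used by the challenge; the function names and the per-frame count inputs are illustrative assumptions, and MOTP is expressed here as a mean overlap score rather than a distance.

```python
def mota(misses, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy.

    Each argument is a list with one count per frame (assumed input format).
    MOTA = 1 - (misses + false positives + identity switches) / ground-truth objects.
    """
    total_gt = sum(num_gt)
    if total_gt == 0:
        raise ValueError("no ground-truth objects in the sequence")
    errors = sum(misses) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt


def motp(match_scores):
    """Multiple Object Tracking Precision as the mean score of all matched
    (ground truth, hypothesis) pairs over all frames. Here the score is
    assumed to be a bounding-box overlap in [0, 1]."""
    if not match_scores:
        raise ValueError("no matched pairs")
    return sum(match_scores) / len(match_scores)


# Two frames with 5 ground-truth words each, one miss and one false positive.
print(mota([1, 0], [0, 1], [0, 0], [5, 5]))
print(motp([0.8, 0.6]))
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.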

In the end-to-end task (Task V2), the evaluation is based on the harmonic mean (F-measure) of the word recognition rate and the temporal word detection rate. The temporal word detection rate is based on the number of correctly identified words: a word is regarded as a true positive if the range of frames in which it appears in the video substantially overlaps with that of its ground truth. For more detail, see [6].
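The combination of the two rates can be illustrated with a short sketch. The harmonic mean follows directly from the description above; the temporal overlap measure (intersection over union of frame ranges) and the 0.5 threshold are hypothetical choices made only for illustration, since the text above says "substantially overlaps" without fixing a value. See [6] for the actual protocol.

```python
def harmonic_mean(a, b):
    """F-measure of two rates in [0, 1]."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def temporal_overlap(pred_range, gt_range):
    """Intersection over union of two inclusive frame ranges (first, last).
    Using IoU as the overlap measure is an assumption of this sketch."""
    inter = min(pred_range[1], gt_range[1]) - max(pred_range[0], gt_range[0]) + 1
    inter = max(inter, 0)
    len_pred = pred_range[1] - pred_range[0] + 1
    len_gt = gt_range[1] - gt_range[0] + 1
    return inter / (len_pred + len_gt - inter)


def is_temporal_match(pred_range, gt_range, threshold=0.5):
    """True if the ranges overlap 'substantially'.
    The threshold value is hypothetical, not taken from the challenge."""
    return temporal_overlap(pred_range, gt_range) > threshold


print(harmonic_mean(0.5, 1.0))
print(is_temporal_match((10, 19), (12, 21)))
```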

In both tasks, detecting or missing words marked as “do not care” will not affect (positively or negatively) the results. Any detections overlapping more than 50% with “do not care” ground truth regions will be discarded from the submitted results before evaluation takes place, and evaluation will not take into account ground truth regions marked as “do not care”.
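The "do not care" filtering step can be sketched as below. This is a simplified illustration under stated assumptions: boxes are axis-aligned tuples (x1, y1, x2, y2), and "overlapping more than 50%" is interpreted as the fraction of the detection's own area covered by a "do not care" region. The function names are hypothetical and the official evaluation code may differ.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def intersection_area(a, b):
    """Area of the intersection of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def drop_dont_care(detections, dont_care_regions, threshold=0.5):
    """Keep only detections that are not mostly covered by a 'do not care' region.
    Coverage is measured relative to the detection area (an assumption)."""
    kept = []
    for det in detections:
        det_area = box_area(det)
        covered = det_area > 0 and any(
            intersection_area(det, dc) / det_area > threshold
            for dc in dont_care_regions
        )
        if not covered:
            kept.append(det)
    return kept


detections = [(0, 0, 10, 10), (100, 100, 110, 110)]
dont_care = [(0, 0, 10, 8)]
print(drop_dont_care(detections, dont_care))  # the first detection is discarded
```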


A sample image of the DOST dataset. (a) Original image. (b) Visualisation of the text localisation ground truth: blue bounding boxes represent legible Latin text, green boxes legible Japanese text, and red boxes illegible text (i.e., “do not care” regions).

Still image mode

Dataset

In the still image mode, we provide datasets consisting of a part of the DOST dataset [1], regarded as a set of images. These datasets are created in a similar way to the "Incidental Scene Text" Challenge of ICDAR 2015 RRC. The only difference is that ground truth (GT) files are prepared separately for the Latin-only mode and the Japanese-only mode. For the localisation task (Task I1) and the end-to-end task (Task I3), the datasets consist of frame images sampled every 10 frames from the video sequences. For the cropped word recognition task (Task I2), cropped word images are provided.

Vocabulary

Same as the video mode. See above.

Evaluation

Performance evaluation follows that of the "Incidental Scene Text" Challenge of ICDAR 2015 RRC [6], as described below.

In the localisation task (Task I1), performance evaluation is based on a single Intersection-over-Union criterion, with a threshold of 50%, in accordance with standard practice in object recognition [7]. Any detections overlapping by more than 50% with “do not care” ground truth regions are filtered out before evaluation takes place, while ground truth regions marked as “do not care” are not taken into account at the time of evaluation.
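The Intersection-over-Union criterion can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2), which is a simplification chosen for brevity, and the function names are illustrative rather than part of the official evaluation code.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(detection, ground_truth, threshold=0.5):
    """A detection counts as correct if its IoU with the ground truth box
    exceeds the threshold (50% in the text above)."""
    return iou(detection, ground_truth) > threshold


gt = (0, 0, 10, 10)
print(iou((0, 0, 10, 8), gt))                   # overlap of 80 over a union of 100
print(is_correct_detection((0, 0, 10, 8), gt))  # True
print(is_correct_detection((5, 5, 15, 15), gt)) # False: IoU is 25 / 175
```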

In the cropped word recognition task (Task I2), the evaluation protocol is based on a standard edit distance metric, with equal costs for additions, deletions and substitutions [5]. For each word we calculate the edit distance normalized by the length of the ground truth transcription. The comparison is case sensitive. Statistics on the percentage of correctly recognised words are also provided.
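A minimal sketch of this metric is given below: a standard Levenshtein distance with unit costs, divided by the length of the ground truth transcription. The function names are illustrative, and details such as how the official script aggregates scores over words are not covered here.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insertion, deletion and
    substitution. The comparison is case sensitive."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted, ground_truth):
    """Edit distance divided by the length of the ground truth transcription."""
    if not ground_truth:
        raise ValueError("empty ground truth transcription")
    return edit_distance(predicted, ground_truth) / len(ground_truth)


print(normalized_edit_distance("osaka", "osaka"))  # 0.0
print(normalized_edit_distance("Osaka", "osaka"))  # 0.2, case differences count
```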

In the end-to-end task (Task I3), the evaluation protocol proposed by Wang et al. [2] will be used, which considers a detection a match if it overlaps a ground truth bounding box by more than 50% and the words match, ignoring case. That is, correct localisation is assessed in the same way as in the localisation task. Subsequently, the recognition output for each correctly localised word is compared to the ground truth transcription, and a perfect match is required.
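The matching rule can be sketched as follows. The sketch assumes that the overlap between every detection and every ground truth box has already been computed and is passed in as a matrix, and it uses a simple greedy one-to-one assignment; both the input format and the greedy strategy are assumptions of this illustration, not a description of the official script.

```python
def count_end_to_end_matches(overlaps, det_words, gt_words, threshold=0.5):
    """Count detections that match a ground truth word.

    overlaps[i][j] is the overlap of detection i with ground truth j.
    A pair matches if the overlap exceeds the threshold and the
    transcriptions are equal, ignoring case. Each ground truth word can be
    matched at most once (greedy assignment, an assumption of this sketch).
    """
    matched_gt = set()
    matches = 0
    for i, det_word in enumerate(det_words):
        for j, gt_word in enumerate(gt_words):
            if j in matched_gt:
                continue
            if overlaps[i][j] > threshold and det_word.lower() == gt_word.lower():
                matched_gt.add(j)
                matches += 1
                break
    return matches


overlaps = [[0.8, 0.0],
            [0.1, 0.6]]
print(count_end_to_end_matches(overlaps, ["OSAKA", "exit"], ["Osaka", "Exit"]))  # 2
print(count_end_to_end_matches(overlaps, ["OSAKA", "exlt"], ["Osaka", "Exit"]))  # 1
```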

References

[1] Iwamura, M., Matsuda, T., Morimoto, N., Sato, H., Ikeda, Y., Kise, K.: Downtown Osaka Scene Text Dataset. ECCV 2016 International Workshop of Robust Reading, pp. 440–455 (2016)
[2] Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proceedings of ICCV, pp. 1457–1464 (2011)
[3] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, Article 1, 10 pages (2008)
[4] Bagdanov, A.D., Bimbo, A. Del, Dini, F., Lisanti, G., Masi, I.: Compact and efficient posterity logging of face imagery for video surveillance. IEEE Multimedia, 19(4), pp. 48–59 (2012)
[5] Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez i Bigorda, L., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Proceedings of ICDAR, pp. 1115–1124 (2013)
[6] Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E.: ICDAR 2015 robust reading competition. In: Proceedings of ICDAR, pp. 1156–1160 (2015)
[7] Everingham, M., Eslami, S. A., Gool, L. Van, Williams, C. K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. International Journal of Computer Vision, 111(1), pp. 98–136 (2014)

Important Dates

April 15: Initial training data available

May 27: More training data available

June 10: Test data available / Submissions open

June 30: Submission of results deadline

November 10-15: Results presentation