Overview - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering

The report of the competition [1] is now available for download. Detailed quantitative and qualitative results can be found in the Results tab. A full description of the dataset and baseline methods can be found in [2].

 

The ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering focuses on a specific type of Visual Question Answering task, in which understanding the textual information in a scene is necessary in order to give an answer [2].

 

[Figure 1 images: (a) STVQA_Overview_1a.jpg, (b) STVQA_Overview_1b.jpg, (c) STVQA_Overview_1c.png]

Figure 1. Recognising and interpreting textual content is essential for most everyday tasks.
(Bus image (b) reproduced from the MS COCO dataset [3, 4]; shop image (c) reproduced from [5].)

Which is the cheapest rice milk on the shelf in Figure 1a? Where does the blue bus in Figure 1b go? What kind of business is shown in Figure 1c?

Textual content in human environments conveys important high-level semantic information that is not available in any other form in the scene. Interpreting written information in human environments is essential for performing most everyday tasks like making a purchase, using public transportation, finding a place in the city, etc.

Text appears in about 50% of the images in large-scale datasets like MS Common Objects in Context [3, 4], and the percentage rises sharply in urban environments. Yet current automated scene interpretation models, such as visual question answering systems, have a serious limitation: they disregard scene text content.

The ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering (ST-VQA) proposes a Visual Question Answering task specifically designed to explore these limitations [1]. The ST-VQA challenge is organised around a dataset of images and corresponding questions that require understanding the textual information in a scene in order to be answered properly.
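Because the answers in such a task are open-ended strings read from the image, predictions are usually scored with a soft string match rather than an exact one. The sketch below is purely illustrative (the function names are ours; the official evaluation protocol is the one described in the competition report [1]): it computes a normalized Levenshtein similarity between a predicted answer and a ground-truth answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_similarity(pred: str, gold: str) -> float:
    """Similarity in [0, 1]: 1.0 is an exact match after normalisation."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
```

For example, a prediction of "Bus 42" against a ground truth of "bus 42" scores 1.0 after case normalisation, while increasingly divergent strings score closer to 0.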

The dataset on which the competition is based comprises images sourced from standard datasets that contain scene text, such as COCO-Text [4], VizWiz [6] and ICDAR 2015 [7], as well as images from generic datasets such as ImageNet [8] and Visual Genome [9] that contain at least two text instances. The questions and answers have been collected through Amazon Mechanical Turk. Details about the dataset and baseline methods are given in [2].

The final report of the competition [1] reflects the submissions received until May 2019. For more up-to-date results, please see the Results tab.

 

References

[1] Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C.V. Jawahar, Ernest Valveny, Dimosthenis Karatzas, "ICDAR 2019 Competition on Scene Text Visual Question Answering", arXiv:1907.00490 [cs.CV], 2019

[2] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C.V. Jawahar, Dimosthenis Karatzas, "Scene Text Visual Question Answering", arXiv:1905.13648 [cs.CV], 2019

[3] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C.L. Zitnick. “Microsoft COCO: Common Objects in Context.” ECCV (2014).

[4] A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie. “COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images”. arXiv preprint arXiv:1601.07140 (2016).

[5] Y. Movshovitz-Attias, Q. Yu, M.C. Stumpe, V. Shet, S. Arnoud, L. Yatziv. “Ontological supervision for fine grained classification of street view storefronts". CVPR (2015).

[6] D. Gurari, Q. Li, A.J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, J.P. Bigham. "VizWiz Grand Challenge: Answering Visual Questions from Blind People." CVPR (2018).

[7] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, D. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V.R. Chandrasekhar, A. Lu, F. Shafait, S. Uchida, E. Valveny. “ICDAR 2015 Robust Reading Competition”. ICDAR (2015).

[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database”. CVPR (2009).

[9] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein, L. Fei-Fei. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. arXiv preprint arXiv:1602.07332 (2016).

 

Important Dates

12 February 2019: Web site online

8 March 2019: Training set available

15 April 2019: Test set available

30 April 2019: Submission of results deadline

10 May 2019: Deadline for providing short descriptions of the participating methods

20-25 September 2019: Results presentation