Overview - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering

The ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering focuses on a specific type of Visual Question Answering task, where understanding the textual information in a scene is necessary in order to give an answer.

[Figure 1: panels (a), (b) and (c)]

Figure 1. Recognising and interpreting textual content is essential for most everyday tasks. (Bus image (b) reproduced from the MS COCO dataset [1, 2]; shop image (c) reproduced from [3].)

Which is the cheapest rice milk on the shelf in Figure 1a? Where does the blue bus in Figure 1b go? What kind of business is shown in Figure 1c?

Textual content in human environments conveys important high-level semantic information that is not available in any other form in the scene. Interpreting this written information is essential for performing most everyday tasks, such as making a purchase, using public transportation, or finding a place in the city.

There is text in about 50% of the images in large-scale datasets such as MS Common Objects in Context [1, 2], and the percentage goes up sharply in urban environments. Current automated scene interpretation models, such as visual question answering systems, present serious limitations because they disregard scene text content.

The ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering (ST-VQA) proposes a Visual Question Answering task specifically designed to explore these limitations. ST-VQA is organised around a dataset of images and corresponding questions, which require understanding the textual information in the scene in order to be answered correctly.

The dataset on which we base the competition is a collection of images coming from different standard datasets that contain scene text, such as COCO-Text [2], VizWiz [4] and ICDAR 2015 [5], as well as images from generic datasets such as ImageNet [6] and Visual Genome [7] that contain at least two text instances. The questions and answers have been collected through Amazon Mechanical Turk.
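As a rough illustration of the selection criterion mentioned above, the sketch below filters a generic image collection down to images with at least two text instances using pre-existing text annotations. This is not the organisers' actual pipeline; the annotation file layout and field names are assumptions made for the example.

```python
# Hypothetical sketch: keep only images with at least two annotated text instances.
# The JSON layout (image id -> list of text regions with a "text" field) is an
# assumption, not the official ST-VQA or source-dataset annotation format.
import json


def select_images_with_text(annotation_file, min_instances=2):
    with open(annotation_file) as f:
        annotations = json.load(f)  # assumed: {"image_id": [{"text": "..."}, ...], ...}

    selected = []
    for image_id, text_regions in annotations.items():
        # Count only regions with a non-empty transcription.
        legible = [r for r in text_regions if r.get("text", "").strip()]
        if len(legible) >= min_instances:
            selected.append(image_id)
    return selected


if __name__ == "__main__":
    kept = select_images_with_text("text_annotations.json")
    print(f"{len(kept)} images contain at least two text instances")
```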


References

[1] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C.L. Zitnick. “Microsoft COCO: Common Objects in Context.” ECCV (2014).

[2] A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie. “COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images”. arXiv preprint arXiv:1601.07140 (2016).

[3] Y. Movshovitz-Attias, Q. Yu, M.C. Stumpe, V. Shet, S. Arnoud, L. Yatziv. “Ontological Supervision for Fine Grained Classification of Street View Storefronts”. CVPR (2015).

[4] D. Gurari, Q. Li, A.J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, J.P. Bigham. "VizWiz Grand Challenge: Answering Visual Questions from Blind People." CVPR (2018).

[5] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, D. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V.R. Chandrasekhar, A. Lu, F. Shafait, S. Uchida, E. Valveny. “ICDAR 2015 Robust Reading Competition”. ICDAR (2015).

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database”. CVPR (2009).

[7] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein, L. Fei-Fei. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. arXiv preprint arXiv:1602.07332 (2016).


Important Dates

12 February 2019: Web site online

8 March 2019: Training set available

15 April 2019: Test set available

30 April 2019: Submission of results deadline

10 May 2019: Deadline for providing short descriptions of the participating methods

20-25 September 2019: Results presentation