Overview - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering
The ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering focuses on a specific type of Visual Question Answering task in which understanding the textual information in a scene is necessary to give an answer.
Figure 1. Recognising and interpreting textual content is essential for most everyday tasks.
(Bus image (b) reproduced from the MSCOCO dataset [1, 2]; shop image (c) reproduced from )
Which is the cheapest rice milk on the shelf of Figure 1a? Where does the blue bus of Figure 1b go to? What kind of business is this in Figure 1c?
Textual content in human environments conveys important high-level semantic information that is not available in any other form in the scene. Interpreting written information in human environments is essential for performing most everyday tasks like making a purchase, using public transportation, finding a place in the city, etc.
Text appears in about 50% of the images in large-scale datasets such as MS Common Objects in Context [1, 2], and the percentage rises sharply in urban environments. Current automated scene interpretation models, such as visual question answering models, have serious limitations because they disregard scene text content.
The ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering (ST-VQA) proposes a Visual Question Answering task specifically designed to explore these limitations. ST-VQA is organised around a dataset of images and corresponding questions that require understanding the textual information in a scene in order to be answered properly.
The dataset on which we will base the competition is a collection of images drawn from different standard datasets that contain scene text, such as COCO-Text, VizWiz and ICDAR 2015, as well as images from generic datasets such as ImageNet [**] and Visual Genome [**] that contain at least two text instances. The questions and answers have been collected through Amazon Mechanical Turk.
[1] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C.L. Zitnick. "Microsoft COCO: Common Objects in Context". ECCV (2014).
[2] A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie. "COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images". arXiv preprint arXiv:1601.07140 (2016).
[3] Y. Movshovitz-Attias, Q. Yu, M.C. Stumpe, V. Shet, S. Arnoud, L. Yatziv. "Ontological Supervision for Fine Grained Classification of Street View Storefronts". CVPR (2015).
[4] D. Gurari, Q. Li, A.J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, J.P. Bigham. "VizWiz Grand Challenge: Answering Visual Questions from Blind People". CVPR (2018).
[5] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, D. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V.R. Chandrasekhar, A. Lu, F. Shafait, S. Uchida, E. Valveny. "ICDAR 2015 Robust Reading Competition". ICDAR (2015).
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei. "ImageNet: A Large-Scale Hierarchical Image Database". CVPR (2009).
[7] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein, L. Fei-Fei. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations". https://arxiv.org/abs/1602.07332
12 February 2019: Web site online
8 March 2019: Training set available
15 April 2019: Test set available
30 April 2019: Submission of results deadline
10 May 2019: Deadline for providing short descriptions of the participating methods
20-25 September 2019: Results presentation