Tasks - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering

The Challenge is structured around three tasks, all of them new for the 2019 edition of the competition:

  • Strongly Contextualised, where the complete list of words that appear in the image, plus some distractors, is provided.
  • Weakly Contextualised, where the participants will have at hand the full list of possible answers for the complete dataset.
  • End-to-end, where no predefined list of possible answers is given, and the correct answer has to be generated automatically by processing the image context and reading and understanding the textual information in the image.

Dataset and Tools

The SceneText-VQA dataset comprises over 15,000 images with at least three question/answer pairs per image. Train, validation and test splits are provided. An example of the type of questions and answers to be expected is given in Figure 1.

Figure 1. A possible question/answer pair for this image might be:
(Q) Which soda brand appears in the bottom of the image? (A) Coca-Cola.


Along with the dataset, we offer a set of utility functions and scripts for the evaluation and visualisation of submitted results, both through the RRC online platform, and as stand-alone code and utilities that can be used offline (the latter provided after the competition has finished).

Task 1 - Strongly Contextualised

In this first task, the participants will be provided with a different list of possible answers for each image. The list will comprise some of the words that appear within the image, plus some extra words acting as distractors. As such, for each image a relatively small but different set of possible answers will be provided. For the example image above, the participant would be given a list including the words below, plus distractors:

[ Public, Market, Center, Coca-Cola, Farmers, Enjoy, … ]
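Given such a per-image candidate list, a Task 1 method only needs to rank the candidates and return the best one. The sketch below illustrates this selection step; the scoring function and the stub scores are purely hypothetical stand-ins for the output of a trained VQA model, and are not part of the challenge definition.

```python
def answer(question, candidates, score):
    """Return the candidate that the scoring function ranks highest."""
    return max(candidates, key=lambda c: score(question, c))

# Stub scores standing in for a model's confidence on the example question.
stub_scores = {"Public": 0.10, "Market": 0.05, "Center": 0.02,
               "Coca-Cola": 0.90, "Farmers": 0.01, "Enjoy": 0.30}

question = "Which soda brand appears in the bottom of the image?"
best = answer(question, list(stub_scores), lambda q, c: stub_scores[c])
print(best)  # → Coca-Cola
```

In a real submission, the scoring function would combine the visual context with the text read from the image; the argmax over the provided candidate list is what distinguishes Task 1 from the open-ended Task 3.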

Task 2 - Weakly Contextualised

In this task, the participants will be provided with the full list of possible answers for the complete dataset, complemented with some distractor words. Although this list of possible answers will be the same (a static list) for all images in the dataset, it is considerably longer than the per-image lists of the previous task.

Task 3 - End-to-end

The end-to-end task is the most generic and challenging one, since no set of answers is provided a priori. The submitted methods for this task should be able to generate the correct answers by analysing the image's visual context and by reading and understanding all textual information contained in the image.

Evaluation Metric

In all three tasks, the evaluation metric will be based on an accuracy measure: the fraction of question/answer pairs for which the proposed method delivers a correct answer.
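The accuracy computation described above can be sketched as follows. Note this is a minimal illustration assuming exact, case-insensitive string matching between predicted and ground-truth answers; the official evaluation scripts on the RRC platform define the authoritative comparison.

```python
def accuracy(predictions, ground_truth):
    """Fraction of questions whose predicted answer matches the ground truth.

    predictions  -- dict mapping question id to the predicted answer string
    ground_truth -- dict mapping question id to the correct answer string
    Matching here is exact after whitespace/case normalisation (an assumption).
    """
    correct = sum(
        predictions.get(qid, "").strip().lower() == ans.strip().lower()
        for qid, ans in ground_truth.items()
    )
    return correct / len(ground_truth)

gt = {"q1": "Coca-Cola", "q2": "Market"}
pred = {"q1": "coca-cola", "q2": "Center"}
print(accuracy(pred, gt))  # → 0.5
```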

Important Dates

12 February 2019: Web site online

28 February 2019: Training and Validation set available

15 April 2019: Test set available

30 April 2019: Submission of results deadline

10 May 2019: Deadline for providing short descriptions of the participating methods

20-25 September 2019: Results presentation