Tasks - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering

The Challenge comprises three tasks, all of which are new for the 2019 edition of the competition:

  • Strongly Contextualised, where a dictionary is provided per image, containing the words that appear in the answers defined for the questions on that image, along with a series of distractors.
  • Weakly Contextualised, where the participants will have at hand a single dictionary of 30,000 words for all the dataset's images, formed by collecting all the ground-truth words plus distractors.
  • End-to-end, where no predefined list of possible answers is given, and the correct answer has to be generated automatically by processing the image's visual context and by reading and understanding the textual information in the image.


Dataset and Tools

The SceneText-VQA dataset comprises 23,000 images with up to three question/answer pairs per image. Train and test splits are provided. The train set consists of 19,000 images with 26,000 questions, while the test set consists of 3,000 images with 4,000 questions per task. An example of the type of questions and answers to be expected is given in Figure 1.

[Figure 1: STVQA_Tasks_Example.png]
Figure 1. A possible question/answer pair for this image might be:
(Q) Which soda brand appears at the bottom of the image? (A) Coca-Cola.

Along with the dataset, we offer a set of utility functions and scripts for the evaluation and visualisation of submitted results, both through the RRC online platform, and as stand-alone code and utilities that can be used offline (the latter provided after the competition has finished).

Task 1 - Strongly Contextualised

In this first task, the participants will be provided with a different list of possible answers for each image. The list will comprise some of the words that appear within the image, plus some extra dictionary words. As such, each image will have a relatively small but distinct set of possible answers. For the example image above, the participant would be given a list including the words below, plus some dictionary words:

[ Public, Market, Center, Coca-Cola, Farmers, Enjoy, … ]


Task 2 - Weakly Contextualised

In this task, the participants will be provided with the full list of possible answers for the complete dataset, complemented with some dictionary words. Although the list of possible answers is the same (a static list) for all the images within the dataset, it is considerably larger than the per-image lists of the previous task. The dictionary comprises 30,000 words, formed by collecting all the 22k ground-truth words plus an 8k generated vocabulary.
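In both contextualised tasks, a method ultimately has to pick one word from a candidate list. As a purely illustrative baseline (not an official or provided method), one could rank the dictionary words by string similarity to whatever raw string a model produces; the function select_answer and the sample OCR output below are assumptions for illustration only:

    from difflib import SequenceMatcher

    def select_answer(raw_output, candidates):
        """Pick the candidate word most similar to the model's raw string."""
        return max(candidates,
                   key=lambda w: SequenceMatcher(None, w.lower(),
                                                 raw_output.lower()).ratio())

    # Per-image candidate list from the Task 1 example above:
    candidates = ["Public", "Market", "Center", "Coca-Cola", "Farmers", "Enjoy"]
    print(select_answer("Coca-Cofa", candidates))  # hypothetical OCR output -> Coca-Cola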

Task 3 - Open Dictionary

The end-to-end task is the most generic and challenging one, since no set of answers is provided a priori. The methods submitted for this task should be able to generate the correct answers by analysing the image's visual context and by reading and understanding all the textual information contained in the image.

Evaluation Metric

In all three tasks, the evaluation metric will be the Average Normalized Levenshtein Similarity (ANLS). ANLS smoothly captures OCR mistakes: a response that is correctly intended but badly recognized receives only a slight penalization rather than a score of zero. It also makes use of a threshold of 0.5: the score for an answer is its normalized Levenshtein similarity if that value is equal to or greater than 0.5, and 0 otherwise. The purpose of this threshold is to distinguish the case where the answer has been correctly selected but not properly recognized from the case where a wrong text has been selected from the options and given as an answer.

More formally, the ANLS between the network output and the ground-truth answers is given by Equation 1, where N is the total number of questions, M the total number of ground-truth answers per question, a_{ij} the ground-truth answers with i = {0, ..., N} and j = {0, ..., M}, and o_{q_i} the network's answer to the i-th question q_i:

    ANLS = \frac{1}{N} \sum_{i=0}^{N} \max_j \, s(a_{ij}, o_{q_i})    (1)

    s(a_{ij}, o_{q_i}) =
      \begin{cases}
        1 - NL(a_{ij}, o_{q_i}) & \text{if } NL(a_{ij}, o_{q_i}) < \tau \\
        0                       & \text{if } NL(a_{ij}, o_{q_i}) \ge \tau
      \end{cases}

where NL(a_{ij}, o_{q_i}) is the Levenshtein distance between the two strings, normalized by the length of the longer one, and the threshold \tau = 0.5.

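For concreteness, the following is a minimal Python sketch of the metric under the definitions above. It is not the official evaluation code (that is provided through the RRC platform), and the function names and edge-case handling are assumptions:

    def levenshtein(s, t):
        """Edit distance between strings s and t (dynamic programming)."""
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (cs != ct)))   # substitution
            prev = curr
        return prev[-1]

    def similarity(gt, pred):
        """Normalized Levenshtein similarity 1 - NL; case-insensitive."""
        gt, pred = gt.lower(), pred.lower()
        if not gt and not pred:
            return 1.0
        return 1.0 - levenshtein(gt, pred) / max(len(gt), len(pred))

    def anls(gt_answers, predictions, tau=0.5):
        """Average over questions of the best per-answer similarity,
        with similarities below tau truncated to 0."""
        scores = []
        for answers, pred in zip(gt_answers, predictions):
            best = max(similarity(a, pred) for a in answers)
            scores.append(best if best >= tau else 0.0)
        return sum(scores) / len(scores)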

The metric is not case sensitive, but it is space sensitive. For example:

[Image: Coca-Cola_example.jpg]

  Q: What soft drink company name is on the red disk?

  Possible different answers:

  • a_{i1}: Coca Cola

  • a_{i2}: Coca Cola Company

[Image: ANLSSampleResults.png]
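Using the similarity sketch above, the effect of case and space sensitivity on this example can be illustrated as follows (the model outputs are hypothetical):

    gts = ["Coca Cola", "Coca Cola Company"]
    for out in ["Coca Cola", "coca cola", "CocaCola", "Pepsi"]:
        best = max(similarity(a, out) for a in gts)
        print(out, best if best >= 0.5 else 0.0)
    # "Coca Cola" and "coca cola" both score 1.0 (case is ignored);
    # "CocaCola" is slightly penalized for the missing space (~0.89);
    # "Pepsi" falls below the 0.5 threshold and scores 0.0.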


Submission Format

The submission should be a single file per task, formatted as a JSON file containing a list of dictionaries with two keys, "question_id" and "answer". The "question_id" key is the unique id of the question, while the "answer" key should be the model's output. As an example, the result file for Task 1 might be named result_task1.json and would contain a list similar to:

[
    {"answer": "Coca", "question_id": 1},
    {"answer": "stop", "question_id": 2},
    {"answer": "delta", "question_id": 3},
    ...
]
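A file in this format could be produced with a few lines of Python; the predictions mapping below is a placeholder for a model's actual outputs:

    import json

    predictions = {1: "Coca", 2: "stop", 3: "delta"}  # question_id -> answer

    submission = [{"answer": ans, "question_id": qid}
                  for qid, ans in sorted(predictions.items())]

    with open("result_task1.json", "w") as f:
        json.dump(submission, f)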



Important Dates

12 February 2019: Web site online

8 March 2019: Training set available

15 April 2019: Test set available

30 April 2019: Submission of results deadline

10 May 2019: Deadline for providing short descriptions of the participating methods

20-25 September 2019: Results presentation