Tasks - Document Visual Question Answering

The challenge comprises three tasks. The first, Single Document VQA, is a typical VQA task: natural language questions are defined over single documents, and an answer must be produced by interpreting the document image. No list of pre-defined responses is given, so the problem cannot be easily treated as an n-way classification task. The second, Document Collection VQA (DocCVQA), is a retrieval-answering task: given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering it, as well as the answer itself. Finally, the new Infographics VQA is also a standard VQA task where natural language questions are defined over single images, but this time the images are infographics about different topics, where visual information is more relevant to answering the posed questions.

 

Task 1 - Single Document VQA

The objective of this task is to answer questions asked on a document image. The images provided are sourced from the documents hosted at the Industry Documents Library, maintained by the UCSF. The documents contain a mix of printed, typewritten and handwritten content, and a wide variety of document types is used for this task, including letters, memos, notes, reports, etc.

The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.

There might be more than one valid answer per question. In such cases, a list of possible correct answers is given in the training set. In a typical supervised training setting one might want to use only one answer per question; in that case we suggest using the first answer in the list.

Ground Truth Format

The subset packages provided (see Downloads section) contain a JSON file with the ground truth annotations and a folder with the document images. The JSON file, called "docvqa_train_vX.X", has the following format (explanations in italics):

{
    "dataset_name": "docvqa",  The name of the dataset, should be invariably "docvqa"
    "dataset_split": "train",  The subset (either "train" or "test")
    "dataset_version": "0.1",  The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

The "data" element is a list of dictionary entries with the following structure

{
    "questionId": 52212,  A unique ID number for the question
    "question": "Whose signature is given?",   The question string - natural language asked question
    "image": "documents/txpn0095_1.png",   The image filename corresponding to the document page which the question is defined on. The images are provided in the /documents folder
    "docId": 1968,  A unique ID number for the document
    "ucsf_document_id": "txpn0095",   The UCSF document id number
    "ucsf_document_page_no": "1",  The page number within the UCSF document that is used here
    "answers": ["Edward R. Shannon", "Edward Shannon"],  A list of correct answers provided by annotators
    "data_split": "train"  The dataset split this question pertains to
}

Submissions Format

Results are expected to be submitted as a single JSON file (extension .json) containing a list of dictionaries with two keys: "questionId" and "answer". The "questionId" key is the unique id of the question, while the "answer" key should hold the model's output. As an example, the result file might be named result_task1.json and will contain a list similar to:

[
    {"answer": "TRANSMIT CONFIRMATION REPORT", "questionId": 10285},
    {"answer": "12/15/88", "questionId": 18601},
    {"answer": "Dear Dr. Lobo", "questionId": 16734},
    ...,
    ...,
]

Evaluation Metric

We will be using Average Normalized Levenshtein Similarity (ANLS) as the evaluation metric. For more details on the metric, please see the metric used for Task 3 of the Scene Text VQA challenge. Please note that we are considering including other evaluation metrics that are popular in VQA and Reading Comprehension tasks.

  • Answers are not case sensitive
  • Answers are space sensitive
  • Answers, or the tokens comprising answers, are not limited to a fixed-size dictionary; the answer can be any word/token present in the document.
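A minimal, unofficial sketch of the metric, assuming the usual ST-VQA formulation (similarity threshold τ = 0.5, case-insensitive but space-sensitive comparison, best match over the list of ground-truth answers); the official evaluation script is authoritative:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions:
    per question, take the best normalized similarity against any
    ground-truth answer, zero out scores below tau, then average."""
    total = 0.0
    for pred, gts in zip(predictions, ground_truths):
        best = 0.0
        for gt in gts:
            p, g = pred.lower(), gt.lower()  # case-insensitive comparison
            nls = 1.0 - levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, nls)
        total += best if best >= tau else 0.0
    return total / len(predictions)
```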

 

Baseline code for Task 1 can be found in this GitHub repository.

 

Task 2 - Document Collection VQA

In this task the questions are posed over a whole collection of document images, and the objective is to provide not only the answer to each question but also the positive evidences. Positive evidences are the documents in which the answer to the question can be found. For example, for the first question (id 0), "In which years did Anna M. Rivers run for the State senator office?", the documents considered positive evidence are those whose candidate is Anna M. Rivers running for the State senator office. Thus, this task can be seen as a retrieval-answering task.

The images provided are sourced from Public Disclosure Commission (PDC) documents. The collection consists of 14,362 different instances of the same template document.

During the CVPR 2020 edition of the competition, the ranking of the models in this task was based solely on the correctness of the provided evidences, disregarding the question answers. However, participants could optionally provide the question answers, which were evaluated just to show the performance of the models on the answering part.

For the ICDAR 2021 edition of this task the ranking has been changed to take into account both the evidences provided and the question answering performance of participating methods.

Since this is a retrieval task, the data to be processed and indexed does not change between "training" and "testing". In fact, there is no "training" as such in this scenario; training reduces to preprocessing and indexing the documents as well as possible, so that unknown queries can be answered efficiently. Hence we provide a few "validation queries", which we name "sample queries", that aim to get participants accustomed to the type of queries expected while they design their retrieval systems. Then, at "test time", we provide new, never-before-seen queries, named "test queries", that participants' systems have to respond to.

Ground Truth Format

{
    "dataset_name":"docvqa_task2", The name of the dataset, should be invariably "docvqa_task2"
    "dataset_split": "sample", The subset(either "sample" or "test")
    "dataset_version":"0.1", The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

The "data" element is a list of dictionary entries with the following structure:

{
    "question_id": 0, A unique ID number for the question
    "questions": In which years did Anna M. Rivers run for the State senator office? The question string - Natural Language asked question
    "answers": [2016, 2020], The answer to the question
    "evidence": [454, 10901], The Doc IDs where the answer can be found
    "ground_truth": [0, 0, 0, 1, 0 ..... 0] List of dimension equal to the number of documents in the dataset. Values are 0 and 1, where 1 means the document is considered as a positive evidence for the question.
    "data_split": "sample" The dataset split this question belongs to.
}
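As an illustration, the positive-evidence document IDs can be recovered from the binary "ground_truth" vector, assuming the vector is indexed by document ID (an assumption on our part; the helper below is our own):

```python
def evidence_ids(ground_truth):
    """Recover positive-evidence document IDs from the binary
    "ground_truth" vector; assumes the vector position is the document ID,
    so the result should match the "evidence" field."""
    return [doc_id for doc_id, label in enumerate(ground_truth) if label == 1]
```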

Submission Format

Results are expected to be submitted as a single JSON file (extension .json) containing a list of dictionaries with three keys: "question_id", "evidence" and "answer". The "question_id" key is the unique id of the question, while the "evidence" and "answer" keys should correspond to the model's output. The evidence can be interpreted as the documents in which the answer to the given question can be found, i.e. the positive evidences. It consists of a list of relevance scores, one per document, representing how confident the model is that the document contains the answer to the question. These scores will be used to sort the documents for the Mean Average Precision (MAP) evaluation metric.

Optionally, the participant can also submit the answer to the given question. This answer consists of a list of values considered correct, where each value corresponds to a value extracted from one of the pages considered as positive evidence. That is, if the model considers that there are 2 positive evidences (documents) for a given question, the answer field will contain a list of 2 values, each one from its corresponding document. If answers are provided, precision and recall metrics will be reported along with the MAP from the evidences. If the participant does not want to provide answers, we recommend still including the "answer" key, even if it is empty.

As an example, the result file might be named: result_task2.json and will contain a list similar to:

[
    {"question_id": 0, "evidence": [0.01, 0.95, 0.12.....], "answer": [___, ___]},
    {"question_id": 1, "evidence": [0.32, 0.17, 0.86.....], "answer": [___, ___]},
    {"question_id": 2, "evidence": [0.73, 0.76, 0.09.....], "answer": [___, ___]},
    ...,
    ...,
]

Evaluation Metrics

Evidence Evaluation Metric:

In the CVPR 2020 edition the methods were ranked according to the correctness of the evidences provided, evaluated through the Mean Average Precision (MAP). The scores of the evidences were used to rank the relevance of the documents and therefore could be in any range (see example 1). Note that we forced positive evidences that were equally scored to be placed at the end of the ranking among those equally scored documents (see example 2). This ensured that the ranking was consistent and did not depend on the default order or on the way the score was evaluated.

Example 1: Ground truth is ranked according to the submitted evidence relevance scores.

  • Evidence: [200.0, 120.0, 1000.0, 0.0, -0.1, 1]
  • Ground Truth: [0, 1, 0, 0, 0, 1]
  • Ranked Ground Truth: [0, 0, 1, 1, 0, 0]
  • MAP: 0.29

Example 2: Evidence scores list is submitted with all scores at 0. Then, the positive document is ranked at the end.

  • Evidence: [0.0, 0.0, 0.0, 0.0, 0.0]
  • Ground Truth: [0, 1, 0, 0, 0]
  • Ranked Ground Truth: [0, 0, 0, 0, 1]
  • MAP: 0.20
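For illustration, here is a standard (non-interpolated) average precision with the tie-breaking rule described above, i.e. positives ranked last among equally scored documents. This is an unofficial sketch; the challenge's official scorer may differ in details:

```python
def average_precision(scores, labels):
    """Average precision after sorting documents by descending relevance
    score; among tied scores, positive documents are deliberately ranked
    last, mirroring the tie-breaking rule described above."""
    # Sort by (-score, label): for equal scores, label 0 comes before label 1.
    ranked = [l for _, l in sorted(zip(scores, labels),
                                   key=lambda t: (-t[0], t[1]))]
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked, 1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(hits, 1)
```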

If the submission included the answers to the questions, they were also evaluated, and precision and recall metrics were provided. However, those metrics were not used to rank the methods in the CVPR 2020 competition.

 

Answer Evaluation Metric:

For this new ICDAR 2021 edition the methods are ranked according to their question answering performance, leaving the retrieval of the evidences as a robust explanation of where the answer has been inferred from. To evaluate them we have designed a new metric called Average Normalized Levenshtein Similarity for Lists (ANLSL), which adapts the ANLS metric, a standard metric for reading-based Question Answering tasks introduced in the ST-VQA challenge. This adaptation can evaluate an itemized list of answers for which the order is not relevant, while preserving the smooth penalization of OCR recognition errors that ANLS provides.
The metric is formally described in equation 1. Given a question Q, the ground truth list of answers G = {g1, g2 . . . gM} and a model's list of predicted answers P = {p1, p2 . . . pN}, ANLSL performs the Hungarian matching algorithm to obtain K pairs U = {u1, u2 . . . uK}, where K is the minimum of the lengths of the ground truth and predicted answer lists. The Hungarian matching (Ψ) is performed according to the Normalized Levenshtein Similarity (NLS) between each ground truth element gj ∈ G and each prediction pi ∈ P. Once the matching is performed, the NLS scores of all pairs uz ∈ U are summed and divided by the maximum of the two list lengths. Therefore, if there are more or fewer ground truth answers than predicted ones, the method is penalized.

ANLSL(G, P) = (1 / max(M, N)) · Σ_{z=1..K} NLS(u_z)    (1)
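An unofficial sketch of ANLSL for a single question; for short answer lists, a brute-force search over matchings can stand in for the Hungarian algorithm, since both find the optimal one-to-one assignment:

```python
from itertools import permutations

def nls(a, b):
    """Normalized Levenshtein similarity between two answer strings."""
    a, b = str(a).lower(), str(b).lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)

def anlsl(ground_truth, predicted):
    """ANLSL for one question: find the best one-to-one matching between
    ground-truth and predicted answers (brute force over permutations here,
    in place of the Hungarian algorithm), sum the matched NLS scores and
    divide by max(M, N) so extra or missing answers are penalized."""
    if not ground_truth or not predicted:
        return 0.0
    shorter, longer = sorted([ground_truth, predicted], key=len)
    best = max(sum(nls(s, l) for s, l in zip(shorter, perm))
               for perm in permutations(longer, len(shorter)))
    return best / max(len(ground_truth), len(predicted))
```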

Nevertheless, the MAP of the evidence retrieval will still be displayed to show the model's retrieval performance.

 

Task 3 - Infographics VQA

The objective of this task is to answer questions asked on an infographic image. An infographic (or information graphic) is a visual representation of information or data in the form of charts, diagrams, etc., designed to be easy for humans to understand. Therefore, visual information is much more relevant here than in the previous tasks.

Unlike DocVQA Task 1, which is a pure "extractive QA" task, Infographics VQA allows answers that are not explicitly extracted from the given image. The answer for a question falls into one of the following four types:

  • Answer is a piece of contiguous text from the image (a single span of text)
  • Answer is a list of "items", where each item is a piece of text from the image (multiple spans). In such cases your model/method is expected to output an answer where the items are separated by a comma and a space. For example, if the question is "What are the three common symptoms of COVID-19?", the answer must be in the format "fever, dry cough, tiredness". "and" should not be used to connect the last item and the penultimate item, and a space after the comma is required so that your answer matches the ground truth exactly. In the case of unordered lists like the example above, our ground truth contains all possible permutations, so the order in which the items are listed in your answer does not matter.
  • Answer is a contiguous piece of text from the question itself (a span from the question)
  • Answer is a number (for example "2", "2.5", "2%", "2/3", etc.). For example, there are questions asking for the count of something, or cases where the answer is the sum of two values given in the image.

In short, the task is mostly an extractive QA task (a span or multiple spans from the image or the question). The only case where an answer not directly extracted from either the image or the question is allowed is when the answer is numeric.
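Formatting a multi-span prediction then reduces to joining the items with a comma and a space (a one-line helper of our own):

```python
def format_list_answer(items):
    """Join multi-span answer items with a comma and a space, without
    "and" before the last item, as the format above requires."""
    return ", ".join(items)
```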

There might be more than one valid answer per question. In such cases, a list of possible correct answers is given in the training set. In a typical supervised training setting one might want to use only one answer per question; in that case we suggest using the first answer in the list.

Ground Truth Format

In the Downloads section, we provide download links to a JSON file that contains the URLs of the images and the question-answer annotations. The file has the following format:

{
    "dataset_name": "infographicVQA",  The name of the dataset, should be invariably "docvqa"
    "dataset_split": "train",  The subset (either "train" , "val" or "test")
    "dataset_version": "0.1",  The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

 

The "data" element is a list of dictionary entries with the following structure

{
    "questionId": 65882,  A unique ID number for the question
    "question": "What is the value of Tourism GVA?",   The question string - natural language asked question
    "image_local_name": "34306.jpeg",
    "image_url": "https://www.tra.gov.au/Images/UserUploadedImages/209/ILoad703___Thumb.jpg",   url for the image
    "answers": ["$43.4 Billion", "43.4 Billion"],  A list of correct answers provided by annotators
    "data_split": "train"  The dataset split this question belongs to
}

Submissions Format

Results are expected to be submitted as a single JSON file (extension .json) containing a list of dictionaries with two keys: "questionId" and "answer". The "questionId" key is the unique id of the question, while the "answer" key should hold the model's output. As an example, the result file might be named result_task3.json and will contain a list similar to:

[
    {"answer": "TRANSMIT CONFIRMATION REPORT", "questionId": 10285},
    {"answer": "12/15/88", "questionId": 18601},
    {"answer": "Dear Dr. Lobo", "questionId": 16734},
    ...,
    ...,
]

Evaluation Metric

The metric used to evaluate and rank the methods in this task will be the Average Normalized Levenshtein Similarity (ANLS). The evaluation scheme will be the same as the one used for Task 1.

  • Answers are not case sensitive.
  • Answers are space sensitive.
  • For multi span answers each "item" must be separated by a comma (",") and a space. Order of items in a multi span answer does not matter.

Important Dates

ICDAR 2021 edition

5-10 September 2021: Presentation at the Document VQA workshop at ICDAR 2021

30 April 2021: Results available online

10 April 2021: Deadline for Competition submissions

11 February 2021: Test set available

23 December 2020: Release of full training data for Infographics VQA.

10 November 2020: Release of training data subset for new task "Infographics VQA"

 

CVPR 2020 edition

16-18 June 2020: CVPR workshop

15 May 2020 (23:59 PST): Submission of results

20 April 2020: Test set available

19 March 2020: Text Transcriptions for Train_v0.1 Documents available

16 March 2020: Training set v0.1 available