Tasks - Document Visual Question Answering

The challenge comprises three different tasks. Task 1 is a typical VQA task, where natural language questions are defined over single documents and an answer needs to be generated by interpreting the document image. No list of pre-defined responses is given, hence the problem cannot be easily treated as an n-way classification task. Task 2 is a retrieval-style task where, given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering the question, as well as the answer itself. Finally, the new Task 3 is also a typical VQA task where natural language questions are defined over single images, but this time the images are infographics about different topics, where visual information is more relevant to answering the posed questions.

 

Task 1 - Single Document VQA

The objective of this task is to answer questions asked on a document image. The images provided are sourced from the documents hosted at the Industry Documents Library, maintained by UCSF. The documents contain a mix of printed, typewritten and handwritten content. A wide variety of document types is used for this task, including letters, memos, notes, reports, etc.

The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.

There might be more than one valid answer per question. In such a case, a list of possible correct answers is given in the training set. In a typical supervised training setting one might want to use only one answer per question. In such a case we suggest using the first answer in the list of answers.

Ground Truth Format

The subset packages provided (see the Downloads section) contain a JSON file with the ground truth annotations and a folder with the document images. The JSON file, called "docvqa_train_vX.X", has the following format (explanations in italics):

{
    "dataset_name": "docvqa",  The name of the dataset, should be invariably "docvqa"
    "dataset_split": "train",  The subset (either "train" or "test")
    "dataset_version": "0.1",  The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

The "data" element is a list of dictionary entries with the following structure

{
    "questionId": 52212,  A unique ID number for the question
    "question": "Whose signature is given?",   The question string - natural language asked question
    "image": "documents/txpn0095_1.png",   The image filename corresponding to the document page which the question is defined on. The images are provided in the /documents folder
    "docId": 1968,  A unique ID number for the document
    "ucsf_document_id": "txpn0095",   The UCSF document id number
    "ucsf_document_page_no": "1",  The page number within the UCSF document that is used here
    "answers": ["Edward R. Shannon", "Edward Shannon"],  A list of correct answers provided by annotators
    "data_split": "train"  The dataset split this question pertains to
}
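For illustration, a minimal Python sketch of how the ground truth file can be loaded and iterated; the filename shown is an assumption and depends on the dataset version you download:

import json

# Minimal sketch: load the Task 1 ground truth and iterate over the questions.
# The filename is an assumption; it depends on the dataset version you downloaded.
with open("docvqa_train_v0.1.json", "r", encoding="utf-8") as f:
    gt = json.load(f)

print(gt["dataset_name"], gt["dataset_split"], gt["dataset_version"])

for record in gt["data"]:
    question = record["question"]
    image_path = record["image"]    # relative path inside the provided /documents folder
    answers = record["answers"]     # list of accepted answers
    target = answers[0]             # e.g. take the first answer for supervised training
    # ... feed (image_path, question, target) to your training pipeline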

Submissions Format

Results are expected to be submitted as a single JSON file (extension .json) that contains a list of dictionaries with two keys: "questionId" and "answer". The "questionId" key is the unique id of the question, while the "answer" key should contain the model's output. As an example, the result file might be named result_task1.json and will contain a list similar to:

[
    {"answer": "TRANSMIT CONFIRMATION REPORT", "questionId": 10285},
    {"answer": "12/15/88", "questionId": 18601},
    {"answer": "Dear Dr. Lobo", "questionId": 16734},
    ...,
    ...,
]
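For illustration, a minimal Python sketch of how such a file can be produced, assuming a hypothetical predict() function standing in for your model's inference:

import json

def predict(image_path, question):
    # Hypothetical placeholder for your model's inference call.
    return "TRANSMIT CONFIRMATION REPORT"

# "data" entries of the test split JSON (a single dummy record is shown here).
test_data = [{"questionId": 10285, "question": "...", "image": "documents/xxxx.png"}]

results = [{"questionId": record["questionId"],
            "answer": predict(record["image"], record["question"])}
           for record in test_data]

with open("result_task1.json", "w", encoding="utf-8") as f:
    json.dump(results, f)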

Evaluation Metric

We will be using Average Normalized Levenshtein Similarity (ANLS) as the evaluation metric. For more details on the metric, please see the metric used for Task 3 of the Scene Text VQA (ST-VQA) challenge. Please note that we are considering including other evaluation metrics that are popular in VQA and Reading Comprehension tasks. We will update the details here before final submissions.

  • Answers are not case sensitive
  • Answers are space sensitive
  • Answers or tokens comprising answers are not limited to a fixed-size dictionary; they can be any word/token present in the document.
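For illustration, a minimal sketch of how ANLS is typically computed, assuming the threshold τ = 0.5 used in the ST-VQA challenge; the official evaluation script may differ in implementation details:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def anls(predictions, ground_truths, threshold=0.5):
    # predictions: {questionId: predicted answer string}
    # ground_truths: {questionId: list of accepted answer strings}
    total = 0.0
    for qid, gt_answers in ground_truths.items():
        pred = predictions.get(qid, "").lower()          # answers are not case sensitive
        best = 0.0
        for gt in gt_answers:
            gt = gt.lower()
            nl = levenshtein(pred, gt) / max(len(pred), len(gt), 1)
            if nl < threshold:                           # normalized distances of τ or more score 0
                best = max(best, 1.0 - nl)
        total += best
    return total / len(ground_truths)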

 

Baseline code for Task 1 can be found in this GitHub repository.

 

Task 2 - Document Collection VQA

The objective of this task is to find the positive evidence for questions asked over a document image collection, casting the task as a retrieval problem. Positive evidence refers to the documents in which the answer to the question can be found. For example, for the first question (id 0), "In which years did Anna M. Rivers run for the State senator office?", the documents considered as positive evidence are those whose candidate is Anna M. Rivers and whose office is State senator.

The images provided are sourced from Public Disclosure Commission (PDC) documents. The collection consists of 14,362 different instances of the same template document.

For the CVPR 2020 edition of the competition, the ranking of the models in this task was based solely on the correctness of the provided evidence, disregarding the question answers. However, participants could optionally provide the answers, which were evaluated only to show the performance of the models on the answering part.

For the ICDAR 2021 edition of this task, the ranking will be changed to take into account both the provided evidence and the question answering performance of participating methods.

Since this is a retrieval task, the data to be processed and indexed does not change between "training" and "testing". As a matter of fact, there is no such thing as "training" in this scenario; training reduces to preprocessing and indexing the documents in the most optimal way, so that unknown queries can be answered efficiently. Hence, we provide a few "validation queries", which we name "sample queries", that aim to get participants accustomed to the type of queries expected while they design their retrieval systems. Then, at "test time", we provide new, never seen before queries, named "test queries", that participants' systems have to respond to.

Ground Truth Format

{
    "dataset_name":"docvqa_task2", The name of the dataset, should be invariably "docvqa_task2"
    "dataset_split": "sample", The subset(either "sample" or "test")
    "dataset_version":"0.1", The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

The "data" element is a list of dictionary entries with the following structure:

{
    "question_id": 0, A unique ID number for the question
    "questions": In which years did Anna M. Rivers run for the State senator office? The question string - Natural Language asked question
    "answers": [2016, 2020], The answer to the question
    "evidence": [454, 10901], The Doc IDs where the answer can be found
    "ground_truth": [0, 0, 0, 1, 0 ..... 0] List of dimension equal to the number of documents in the dataset. Values are 0 and 1, where 1 means the document is considered as a positive evidence for the question.
    "data_split": "sample" The dataset split this question belongs to.
}

Submission Format

Results are expected to be submitted as a single JSON file (extension .json) that contains a list of dictionaries with three keys: "question_id", "evidence" and "answer". The "question_id" key is the unique id of the question, while the "evidence" and "answer" keys should correspond to the model's output. The evidence can be interpreted as the documents in which the answer to the given question can be found, i.e. the positive evidence. It consists of a list of relevance scores, one per document, representing how confident the model is that the document contains the answer to the question. These scores will be used to sort the documents for the Mean Average Precision (MAP) evaluation metric.

Optionally, participants can also submit the answer to the given question. This answer consists of a list of values considered correct; each value in the list is expected to correspond to a value extracted from a page considered as positive evidence. That is, if the model considers that there are 2 positive evidence documents for a given question, the answer field will contain a list of 2 values, each one taken from its corresponding document. If the answers are provided, precision and recall metrics will be reported along with the MAP computed from the evidence. If participants do not want to provide the answer, we recommend still including the "answer" key, even if it is empty.

As an example, the result file might be named: result_task2.json and will contain a list similar to:

[
    {"question_id": 0, "evidence": [0.01, 0.95, 0.12.....], "answer": [___, ___]},
    {"question_id": 1, "evidence": [0.32, 0.17, 0.86.....], "answer": [___, ___]},
    {"question_id": 2, "evidence": [0.73, 0.76, 0.09.....], "answer": [___, ___]},
    ...,
    ...,
]
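For illustration, a minimal Python sketch of how such a submission could be assembled, assuming hypothetical score_document() and extract_answer() functions, and assuming that position i of the evidence list corresponds to document i of the collection:

import json

NUM_DOCS = 14362   # number of documents in the PDC collection

def score_document(question, doc_id):
    # Hypothetical relevance scorer: higher means the document more likely holds the answer.
    return 0.0

def extract_answer(question, doc_id):
    # Hypothetical answer extractor for a single document.
    return ""

# "data" entries of the test queries file (a single dummy query is shown here).
test_queries = [{"question_id": 0,
                 "questions": "In which years did Anna M. Rivers run for the State senator office?"}]

submission = []
for query in test_queries:
    scores = [score_document(query["questions"], d) for d in range(NUM_DOCS)]
    positives = [d for d, s in enumerate(scores) if s > 0.5]   # assumed decision threshold
    answers = [extract_answer(query["questions"], d) for d in positives]
    submission.append({"question_id": query["question_id"],
                       "evidence": scores,
                       "answer": answers})   # the "answer" list may be left empty

with open("result_task2.json", "w", encoding="utf-8") as f:
    json.dump(submission, f)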

Evaluation Metrics

Evidence Evaluation Metric:

In the CVPR 2020 edition the methods were ranked according to the correctness of the provided evidence, evaluated through Mean Average Precision (MAP). The scores of the evidence were used to rank the relevance of the documents and therefore could be in any range (see Example 1). Note that positive evidence documents that were equally scored were forced to be at the end of the ranking among those documents (see Example 2). This was to ensure that the ranking was consistent and did not depend on the default order or on the way the score was evaluated.

Example 1: Ground truth is ranked according to the submitted evidence relevance scores.

  • Evidence: [200.0, 120.0, 1000.0, 0.0, -0.1, 1]
  • Ground Truth: [0, 1, 0, 0, 0, 1]
  • Ranked Ground Truth: [0, 0, 1, 1, 0, 0]
  • MAP: 0.29

Example 2: The evidence scores list is submitted with all scores at 0; the positive document is then ranked at the end.

  • Evidence: [0.0, 0.0, 0.0, 0.0, 0.0]
  • Ground Truth: [0, 1, 0, 0, 0]
  • Ranked Ground Truth: [0, 0, 0, 0, 1]
  • MAP: 0.17
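For illustration, a minimal Python sketch of the ranking step described above, including the pessimistic tie-breaking for positive documents; it reproduces the ranked ground truth of both examples, although the official evaluation script may differ in implementation details:

def rank_ground_truth(evidence_scores, ground_truth):
    # Sort documents by descending relevance score; among equal scores,
    # positive documents (ground_truth == 1) are forced to the end.
    order = sorted(range(len(evidence_scores)),
                   key=lambda i: (-evidence_scores[i], ground_truth[i]))
    return [ground_truth[i] for i in order]

# Example 1
print(rank_ground_truth([200.0, 120.0, 1000.0, 0.0, -0.1, 1], [0, 1, 0, 0, 0, 1]))
# -> [0, 0, 1, 1, 0, 0]

# Example 2: all scores are tied at 0, so the single positive document ends up last.
print(rank_ground_truth([0.0, 0.0, 0.0, 0.0, 0.0], [0, 1, 0, 0, 0]))
# -> [0, 0, 0, 0, 1]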

If the submission included the answers to the questions, they were also evaluated and precision and recall metrics were reported. However, these metrics were not used to rank the methods in the CVPR 2020 competition.

 

Answer Evaluation Metric:

For this new ICDAR 2021 edition, the ranking of the methods will be according to the question answering performance of the participating methods. To evaluate it, we have designed a new metric called Average Normalized Levenshtein Similarity for Lists (ANLSL), which adapts the ANLS metric previously used in the ST-VQA challenge, as well as in Tasks 1 and 3 of the DocVQA challenge, to evaluate an itemized list of answers for which the order is not relevant.
The metric is formally described in Equation 1. Given a question Q, the ground truth list of answers G = {g1, g2, ..., gM} and a model's list of predicted answers P = {p1, p2, ..., pN}, ANLSL performs Hungarian matching to obtain k pairs U = {u1, u2, ..., uk}, where k is the minimum of the lengths of the ground truth and predicted answer lists. The Hungarian matching is performed according to the Normalized Levenshtein Similarity (NLS) between each ground truth element gj ∈ G and each prediction pi ∈ P. Once the matching is performed, the NLS scores of all pairs uz ∈ U are summed and divided by the maximum of the two list lengths. Therefore, if there are more or fewer ground truth answers than predicted ones, the method is penalized.

ANLSL(G, P) = (1 / max(M, N)) · Σ_{z=1..k} NLS(u_z)     (1)
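For illustration, a minimal Python sketch of ANLSL for a single question, using scipy's linear_sum_assignment for the Hungarian matching; the official evaluation script may differ in implementation details:

import numpy as np
from scipy.optimize import linear_sum_assignment

def nls(pred, gt):
    # Normalized Levenshtein Similarity between two strings (case-insensitive).
    pred, gt = pred.lower(), gt.lower()
    prev = list(range(len(gt) + 1))
    for i, cp in enumerate(pred, start=1):
        cur = [i]
        for j, cg in enumerate(gt, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cp != cg)))
        prev = cur
    return 1.0 - prev[-1] / max(len(pred), len(gt), 1)

def anlsl(ground_truth, predictions):
    # Hungarian matching on the NLS similarities, normalized by max(M, N).
    if not ground_truth or not predictions:
        return 0.0
    sim = np.array([[nls(p, g) for p in predictions] for g in ground_truth])
    rows, cols = linear_sum_assignment(sim, maximize=True)   # k = min(M, N) matched pairs
    return sim[rows, cols].sum() / max(len(ground_truth), len(predictions))

print(anlsl(["2016", "2020"], ["2016", "2020", "2019"]))   # penalized for the extra prediction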

 

Nevertheless, the MAP of the evidence retrieval will still be reported to show the models' retrieval performance.

 

Task 3 - Infographics VQA

The objective of this task is to answer questions asked on an infographic image. An infographic (or information graphic) is a visual representation of information or data in the form of charts, diagrams, etc., designed to be easy for humans to understand. Therefore, visual information is much more relevant here than in the previous tasks.

Unlike DocVQA Task 1, which is a pure "extractive QA" task, Infographics VQA allows answers that are not explicitly extracted from the given image. The answer to a question falls into one of the following four types:

  • Answer is a contiguous piece of text from the image (a single span of text)
  • Answer is a list of "items", where each item is a piece of text from the image (multiple spans). In such cases your model/method is expected to output an answer where each item is separated by a comma and a space; for example, if the question is "What are the three common symptoms of COVID-19?", the answer must be in the format "fever, dry cough, tiredness". "and" should not be used to connect the last and the penultimate items, and a space after the comma is required so that your answer matches the ground truth exactly (see the sketch after this list). For unordered lists like the example above, our ground truth contains all possible permutations, so the order in which the items are listed in your answer does not matter.
  • Answer is a contiguous piece of text from the question itself (a span from the question)
  • Answer is a number (for example "2", "2.5", "2%", "2/3", etc.). For example, there are questions asking for the count of something, or cases where the answer is the sum of two values given in the image.

In short, the task is mostly an extractive QA task (a span or multiple spans from the image or the question). The only case where an answer that is not directly extracted from either the image or the question is allowed is when the answer is numeric.
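For illustration, a one-line Python sketch of serializing a multi-span answer into the expected format (the item strings are simply the example above):

items = ["fever", "dry cough", "tiredness"]   # spans produced by a hypothetical model
answer = ", ".join(items)                     # comma + space, no "and" before the last item
print(answer)                                 # fever, dry cough, tiredness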

There might be more than one valid answer per question. In such a case, a list of possible correct answers is given in the training set. In a typical supervised training setting one might want to use only one answer per question. In such a case we suggest using the first answer in the list of answers.

Ground Truth Format

In the Downloads section, we provide download links to a JSON file that contains the URLs to the images and the question-answer annotations. The file has the following format.

{
    "dataset_name": "infographicVQA",  The name of the dataset, should be invariably "docvqa"
    "dataset_split": "train",  The subset (either "train" , "val" or "test")
    "dataset_version": "0.1",  The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

 

The "data" element is a list of dictionary entries with the following structure

{
    "questionId": 65882,  A unique ID number for the question
    "question": "What is the value of Tourism GVA?",   The question string - natural language asked question
    "image_local_name": "34306.jpeg",
    "image_url": "https://www.tra.gov.au/Images/UserUploadedImages/209/ILoad703___Thumb.jpg",   url for the image
    "answers": ["$43.4 Billion", "43.4 Billion"],  A list of correct answers provided by annotators
    "data_split": "train"  The dataset split this question belongs to
}
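Since the images are referenced by URL, the following minimal sketch downloads them locally; it assumes the third-party requests package and a version-dependent filename for the annotations file:

import json
import os
import requests

with open("infographicVQA_train_v0.1.json", "r", encoding="utf-8") as f:   # assumed filename
    gt = json.load(f)

os.makedirs("images", exist_ok=True)
for record in gt["data"]:
    local_path = os.path.join("images", record["image_local_name"])
    if os.path.exists(local_path):
        continue                                    # already downloaded
    response = requests.get(record["image_url"], timeout=30)
    response.raise_for_status()
    with open(local_path, "wb") as out:
        out.write(response.content)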

Submissions Format

Results are expected to be submitted as a single JSON file (extension .json) that contains a list of dictionaries with two keys: "questionId" and "answer". The "questionId" key is the unique id of the question, while the "answer" key should contain the model's output. As an example, the result file might be named result_task3.json and will contain a list similar to:

[
    {"answer": "TRANSMIT CONFIRMATION REPORT", "questionId": 10285},
    {"answer": "12/15/88", "questionId": 18601},
    {"answer": "Dear Dr. Lobo", "questionId": 16734},
    ...,
    ...,
]

Evaluation Metric

The evaluation scheme will be the same as the evaluation metric used for Task 1.

We will be using Average Normalized Levenshtein Similarity (ANLS) as the evaluation metric. 

  •  Answers are not case sensitive
  • Answers are space sensitive
  • For multi-span answers, each "item" must be separated by a comma (",") and a space. The order of items in a multi-span answer does not matter.

 

Important Dates

ICDAR 2021 edition

10 November 2020: Release of training data subset for new Task 3 on "Infographics VQA"

23 December 2020: Release of full  training data for  Task 3 on "Infographics VQA"

11 February 2021: Test set available

10 April 2021: Deadline for Competition submissions

30 April 2021: Results available online

5 -10 September 2021: Presentation at the Document VQA workshop at ICDAR 2021

 

CVPR 2020 edition

16 March 2020: Training set  v0.1 available

19 March 2020 : Text Transcriptions for Train_v0.1 Documents available

20 April 2020: Test set available

15 May 2020 (23:59 PST): Submission of results

16-18 June 2020: CVPR workshop