Tasks - Document Visual Question Answering

The challenge will comprise two tasks. On one hand, a typical VQA task, where natural language questions are defined over single documents and an answer must be generated by interpreting the document image. No list of pre-defined responses is given, so the problem cannot be easily treated as an n-way classification task. On the other hand, a retrieval-style task where, given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering it.

 

Task 1

The objective of this task is to answer questions asked on a document image. The images provided are sourced from the documents hosted at the Industry Documents Library, maintained by UCSF. The documents contain a mix of printed, typewritten and handwritten content. A wide variety of document types is used for this task, including letters, memos, notes, reports, etc.

The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a contiguous sequence of text tokens present in the document.

There might be more than one valid answer per question; in such cases, a list of possible correct answers is given in the training set. In a typical supervised training setting one might want to use only a single answer per question; we suggest using the first answer in the list.

Ground Truth Format

The subset packages provided (see the Downloads section) contain a JSON file with the ground truth annotations and a folder with the document images. The JSON file, called "docvqa_train_vX.X", has the following format (explanations in italics):

{
    "dataset_name": "docvqa",  The name of the dataset, should be invariably "docvqa"
    "dataset_split": "train",  The subset (either "train" or "test")
    "dataset_version": "0.1",  The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

The "data" element is a list of dictionary entries with the following structure

{
    "questionId": 52212,  A unique ID number for the question
    "question": "Whose signature is given?",   The question string - natural language asked question
    "image": "documents/txpn0095_1.png",   The image filename corresponding to the document page which the question is defined on. The images are provided in the /documents folder
    "docId": 1968,  A unique ID number for the document
    "ucsf_document_id": "txpn0095",   The UCSF document id number
    "ucsf_document_page_no": "1",  The page number within the UCSF document that is used here
    "answers": ["Edward R. Shannon", "Edward Shannon"],  A list of correct answers provided by annotators
    "data_split": "train"  The dataset split this question pertains to
}
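
For illustration, the following minimal Python sketch loads such a ground truth file and iterates over the question entries. The filename is an assumption and should be replaced with the actual file from the downloaded package:

import json

# Assumed filename; use the actual file from the Downloads package.
with open("docvqa_train_v0.1.json") as f:
    gt = json.load(f)

print(gt["dataset_name"], gt["dataset_split"], gt["dataset_version"])

for entry in gt["data"]:
    question = entry["question"]
    image_path = entry["image"]        # relative path into the /documents folder
    answers = entry["answers"]         # list of accepted answers
    first_answer = answers[0]          # single-answer supervised training (see above)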

Submission Format

Results are expected to be submitted as a single JSON file (extension .json) containing a list of dictionaries, each with two keys: "questionId" and "answer". The "questionId" key is the unique ID of the question, while the "answer" key should hold the model's output. As an example, the result file might be named result_task1.json and would contain a list similar to the following (a sketch for producing such a file is given after the example):

[
    {"answer": "TRANSMIT CONFIRMATION REPORT", "questionId": 10285},
    {"answer": "12/15/88", "questionId": 18601},
    {"answer": "Dear Dr. Lobo", "questionId": 16734},
    ...,
    ...,
]
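
As a minimal sketch of how such a file could be produced with Python — the predict_answer function and the test-set filename are placeholders, not part of the challenge kit:

import json

# Placeholder for the participant's model.
def predict_answer(question, image_path):
    raise NotImplementedError

with open("docvqa_test_vX.X.json") as f:    # assumed test-set filename
    test = json.load(f)

results = [{"questionId": e["questionId"],
            "answer": predict_answer(e["question"], e["image"])}
           for e in test["data"]]

with open("result_task1.json", "w") as f:
    json.dump(results, f)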

Evaluation Metric

We will be using Average Normalized Levenshtein Similarity (ANLS) as the evaluation metric. For more details, please see the metric used for Task 3 of the Scene Text VQA challenge. Please note that we are considering including other evaluation metrics that are popular in VQA and Reading Comprehension tasks; we will update the details here before final submissions. A minimal sketch of the ANLS computation is given after the list below.

  • Answers are not case sensitive
  • Answers are space sensitive
  • Answers, or the tokens comprising them, are not limited to a fixed-size dictionary; they can be any word/token present in the document.
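
As a rough reference, the following Python sketch computes ANLS following the Scene Text VQA definition, assuming the 0.5 threshold used there and case-insensitive matching; the official evaluation script remains the authoritative implementation:

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truth, tau=0.5):
    # predictions: {questionId: predicted answer string}
    # ground_truth: {questionId: list of accepted answer strings}
    scores = []
    for qid, answers in ground_truth.items():
        pred = predictions.get(qid, "").lower()          # answers are not case sensitive
        best = 0.0
        for ans in answers:
            ans = ans.lower()
            if not ans and not pred:
                best = 1.0
                continue
            nl = levenshtein(ans, pred) / max(len(ans), len(pred))
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0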

 

Task 2

The objective of this task is to find the positive evidence for questions asked over a document image collection, framing the task as a retrieval problem. Positive evidence documents are those in which the answer to the question can be found. For example, for the first question (id 0), "In which years did Anna M. Rivers run for the State senator office?", the documents considered as positive evidence are those in which the candidate is Anna M. Rivers and the office she ran for is State senator.

The images provided are sourced from Public Disclosure Commission (PDC) documents. The collection consists of 14,362 different instances of the same document template.

The ranking of the models in this task will be based on the correctness of the provided evidence. However, participants can optionally provide the answers to the questions, which will be evaluated only to show the performance of the models on the answering part. To be clear: the answers are optional and are therefore not used to rank the participants' models.

Since this is a retrieval task, the data to be processed and indexed does not change between "training" and "testing". In fact, there is no real "training" in this scenario; training reduces to preprocessing and indexing the documents in the most effective way, so that unseen queries can be answered efficiently. Hence, a few "validation queries", which we name "sample queries", are provided; they are intended to get participants accustomed to the type of queries expected when designing their retrieval systems. Then, at "test time", we provide new, never-before-seen queries, named "test queries", that participants' systems have to respond to.

Ground Truth Format

{
    "dataset_name":"docvqa_task2", The name of the dataset, should be invariably "docvqa_task2"
    "dataset_split": "sample", The subset(either "sample" or "test")
    "dataset_version":"0.1", The version of the dataset. A string in the format of major.minor version
    "data": [{...}]
}

The "data" element is a list of dictionary entries with the following structure:

{
    "question_id": 0, A unique ID number for the question
    "questions": In which years did Anna M. Rivers run for the State senator office? The question string - Natural Language asked question
    "answers": [2016, 2020], The answer to the question
    "evidence": [454, 10901], The Doc IDs where the answer can be found
    "ground_truth": [0, 0, 0, 1, 0 ..... 0] List of dimension equal to the number of documents in the dataset. Values are 0 and 1, where 1 means the document is considered as a positive evidence for the question.
    "data_split": "sample" The dataset split this question belongs to.
}
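
The relationship between the "evidence" and "ground_truth" fields can be checked with a small Python sketch. The filename is an assumption, as is the premise that positions in "ground_truth" are indexed by document ID:

import json

# Assumed filename for the sample-queries file.
with open("docvqa_task2_sample_v0.1.json") as f:
    gt = json.load(f)

for q in gt["data"]:
    # Positions marked 1, assuming the list is indexed by document ID.
    relevant = [doc_id for doc_id, rel in enumerate(q["ground_truth"]) if rel == 1]
    assert sorted(relevant) == sorted(q["evidence"])
    print(q["question_id"], q["questions"], q["answers"], relevant)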

Submission Format

Results are expected to be submitted as a single JSON file (extension .json) containing a list of dictionaries, each with three keys: "question_id", "evidence" and "answer". The "question_id" key is the unique ID of the question, while the "evidence" and "answer" keys should correspond to the model's output. The evidence can be interpreted as the documents in which the answer to the given question can be found, i.e. the positive evidence. It consists of a list of relevance scores, one per document, representing how confident the model is that the document contains the answer to the question. These scores will be used to sort the documents for the Mean Average Precision (MAP) evaluation metric.

Optionally, participants can also submit the answer to the given question. This answer consists of a list of values considered correct, where each value corresponds to a value extracted from a page considered as positive evidence. That is, if the model considers that there are 2 positive evidence documents for a given question, the "answer" field should contain a list of 2 values, each one from its corresponding document. If answers are provided, precision and recall metrics will be reported along with the MAP computed from the evidence. If a participant does not want to provide answers, we recommend still including the "answer" key, even if it is empty.

As an example, the result file might be named result_task2.json and would contain a list similar to the following (a sketch for producing such a file follows the example):

[
    {"question_id": 0, "evidence": [0.01, 0.95, 0.12, ...], "answer": [___, ___]},
    {"question_id": 1, "evidence": [0.32, 0.17, 0.86, ...], "answer": [___, ___]},
    {"question_id": 2, "evidence": [0.73, 0.76, 0.09, ...], "answer": [___, ___]},
    ...,
    ...,
]
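
A minimal sketch of how such a file could be assembled is shown below. The score_document function and the queries filename are placeholders standing in for the participant's own retrieval system and the released query file:

import json

NUM_DOCS = 14362   # size of the PDC document collection

# Placeholder for the participant's retrieval model: returns a relevance
# score for a (question, document) pair.
def score_document(question, doc_id):
    raise NotImplementedError

with open("docvqa_task2_test_vX.X.json") as f:    # assumed queries filename
    queries = json.load(f)

results = []
for q in queries["data"]:
    scores = [score_document(q["questions"], d) for d in range(NUM_DOCS)]
    results.append({"question_id": q["question_id"],
                    "evidence": scores,    # one relevance score per document
                    "answer": []})         # optional; key kept even when empty

with open("result_task2.json", "w") as f:
    json.dump(results, f)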

Evaluation Metrics

The methods will be ranked according to the correctness of the provided evidence, evaluated using Mean Average Precision (MAP). The submitted evidence scores are used to rank the documents by relevance and can therefore be in any range (see Example 1). For equal relevance scores, positive evidence documents are ranked last (see Example 2, and the sketch after the examples below).

Example 1: Ground truth is ranked according to the submitted evidence relevance scores.

  • Evidence: [200.0, 120.0, 1000.0, 0.0, -0.1, 1]
  • Ground Truth: [0, 1, 0, 0, 0, 1]
  • Ranked Ground Truth: [0, 0, 1, 1, 0, 0]
  • MAP: 0.29

Example 2: The evidence list is submitted with all scores set to 0. The positive document is then ranked last.

  • Evidence: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • Ground Truth: [0, 1, 0, 0, 0, 0]
  • Ranked Ground Truth: [0, 0, 0, 0, 0, 1]
  • MAP: 0.17
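
The ranking rule used in these examples can be sketched as follows: sort by descending score, breaking ties so that positive documents come last. This is only an illustrative sketch; the official evaluation script is the reference implementation.

def rank_ground_truth(scores, ground_truth):
    # Sort document indices by descending score; for equal scores,
    # negative documents (0) are placed before positive ones (1).
    order = sorted(range(len(scores)),
                   key=lambda i: (-scores[i], ground_truth[i]))
    return [ground_truth[i] for i in order]

# Example 1: rank_ground_truth([200.0, 120.0, 1000.0, 0.0, -0.1, 1],
#                              [0, 1, 0, 0, 0, 1]) -> [0, 0, 1, 1, 0, 0]
# Example 2: with all scores equal, the positive document is ranked last.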

If the submission contains the answers to the questions, they will also be evaluated, and precision and recall metrics will be provided. However, these metrics will not be used to rank the methods in the competition.

Important Dates

16 March 2020: Training set v0.1 available

19 March 2020: Text transcriptions for Train_v0.1 documents available

20 April 2020: Test set available

15 May 2020 (23:59 PST): Submission of results

16-18 June 2020: CVPR workshop