Tasks - Document Visual Question Answering

The challenge will comprise two tasks. The first is a typical VQA task, in which natural language questions are defined over single documents and an answer must be generated by interpreting the document image. No list of pre-defined responses will be given, so the problem cannot easily be treated as an n-way classification task. The second is a retrieval-style task in which, given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering it.


Task 1

The objective of this task is to answer questions asked on a document image. The images provided are sourced from the documents hosted at the Industry Documents Library, maintained by UCSF. The documents contain a mix of printed, typewritten and handwritten content. A wide variety of document types is used for this task, including letters, memos, notes and reports.

The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.
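
As an illustration of what a contiguous span means in practice, here is a minimal sketch. It assumes the document text is already available as a list of tokens in reading order; the token list and the helper name find_answer_span are hypothetical and not part of the dataset tools. Matching is lower-cased here purely as a convenience.

```python
def find_answer_span(doc_tokens, answer):
    """Return (start, end) token indices of the first span equal to `answer`.

    The end index is exclusive. Returns None when the answer does not occur
    as a run of adjacent tokens.
    """
    answer_tokens = answer.lower().split()
    lowered = [token.lower() for token in doc_tokens]
    n = len(answer_tokens)
    if n == 0:
        return None
    for start in range(len(lowered) - n + 1):
        if lowered[start:start + n] == answer_tokens:
            return start, start + n
    return None


# Hypothetical token sequence standing in for a transcribed document page.
doc_tokens = ["Signed", "by", "Edward", "R.", "Shannon", "on", "12/15/88"]
span = find_answer_span(doc_tokens, "Edward R. Shannon")
answer_tokens = doc_tokens[span[0]:span[1]] if span else []
```

A string whose words appear in the document but not next to each other would return None under this check.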

There might be more than one valid answer per question. In such a case, a list of possible correct answers is given in the training set. In a typical supervised training setting one might want to use only one answer per question; in that case we suggest using the first answer in the list.
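
A minimal sketch of that suggestion, assuming each annotation record is a dictionary with "question" and "answers" fields as in the ground truth file. The helper name to_training_pairs is hypothetical.

```python
def to_training_pairs(records):
    """Map each record to (question, first listed answer).

    Records without any answer are skipped rather than raising an error.
    """
    pairs = []
    for record in records:
        answers = record.get("answers") or []
        if answers:
            pairs.append((record["question"], answers[0]))
    return pairs


records = [
    {
        "question": "Whose signature is given?",
        "answers": ["Edward R. Shannon", "Edward Shannon"],
    },
    {"question": "A question with no annotated answer", "answers": []},
]
pairs = to_training_pairs(records)
```

The remaining answers in each list can still be kept aside for validation, since any of them counts as correct.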

Ground Truth Format

The subset packages provided (see the Downloads section) contain a JSON file with the ground truth annotations and a folder with the document images. The JSON file, called "docvqa_train_vX.X", has the following format (each field is followed by its explanation):

    "dataset_name": "docvqa",  The name of the dataset, should be invariably "docvqa"
    "dataset_split": "train",  The subset (either "train" or "test")
    "dataset_version": "0.1",  The version of the dataset. A string in the format of major.minor version
    "data": [{...}]

The "data" element is a list of dictionary entries with the following structure:

    "questionId": 52212,  A unique ID number for the question
    "question": "Whose signature is given?",   The question string - natural language asked question
    "image": "documents/txpn0095_1.png",   The image filename corresponding to the document page which the question is defined on. The images are provided in the /documents folder
    "docId": 1968,  A unique ID number for the document
    "ucsf_document_id": "txpn0095",   The UCSF document id number
    "ucsf_document_page_no": "1",  The page number within the UCSF document that is used here
    "answers": ["Edward R. Shannon", "Edward Shannon"],  A list of correct answers provided by annotators
    "data_split": "train"  The dataset split this question pertains to

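The structure above can be read with the standard json module. The sketch below is illustrative only: it assumes the top-level fields are wrapped in a single JSON object, uses an inline string in place of the real annotation file, and the names load_annotations and REQUIRED_KEYS are hypothetical.

```python
import json

# Inline stand-in for the annotation file, built from the example values above.
raw = """
{
  "dataset_name": "docvqa",
  "dataset_split": "train",
  "dataset_version": "0.1",
  "data": [
    {
      "questionId": 52212,
      "question": "Whose signature is given?",
      "image": "documents/txpn0095_1.png",
      "docId": 1968,
      "ucsf_document_id": "txpn0095",
      "ucsf_document_page_no": "1",
      "answers": ["Edward R. Shannon", "Edward Shannon"],
      "data_split": "train"
    }
  ]
}
"""

# Fields this sketch expects in every training entry (an assumption).
REQUIRED_KEYS = {"questionId", "question", "image", "docId", "answers"}


def load_annotations(text):
    """Parse the annotation JSON and check that each entry has the expected keys."""
    ground_truth = json.loads(text)
    if ground_truth["dataset_name"] != "docvqa":
        raise ValueError("unexpected dataset_name")
    for entry in ground_truth["data"]:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry is missing keys: {sorted(missing)}")
    return ground_truth


ground_truth = load_annotations(raw)
# Index the entries by question ID for quick lookup.
by_question = {entry["questionId"]: entry for entry in ground_truth["data"]}
```

With a real file, the same function could be called on the text read from disk.
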
Submissions Format

Results are expected to be submitted as a single JSON file (extension .json) that contains a list of dictionaries, each with two keys: "questionId" and "answer". The "questionId" key holds the unique ID of the question, while the "answer" key holds the model's output. As an example, the result file might be named result_task1.json and contain a list similar to:

    [
        {"questionId": 10285, "answer": "TRANSMIT CONFIRMATION REPORT"},
        {"questionId": 18601, "answer": "12/15/88"},
        {"questionId": 16734, "answer": "Dear Dr. Lobo"}
    ]
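
A results file in this layout could be produced along the following lines. This is a sketch under assumptions: predictions are held in a dictionary keyed by question ID, the helper name build_submission is hypothetical, and the ID key defaults to "questionId" as stated in the description above (it is a parameter in case the evaluation server expects a different spelling).

```python
import json


def build_submission(predictions, id_key="questionId"):
    """Convert {question id: answer string} into a list of result dictionaries."""
    return [
        {id_key: int(question_id), "answer": str(answer)}
        for question_id, answer in predictions.items()
    ]


# Hypothetical model outputs keyed by question ID.
predictions = {
    10285: "TRANSMIT CONFIRMATION REPORT",
    18601: "12/15/88",
}
submission = build_submission(predictions)

# json.dumps emits double-quoted strings, so the result is valid JSON.
payload = json.dumps(submission, indent=2)

# To write the file itself:
# with open("result_task1.json", "w", encoding="utf-8") as handle:
#     handle.write(payload)
```

Serializing with json rather than printing a Python list avoids single-quoted strings, which JSON parsers reject.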

Evaluation Metric

We will use Average Normalized Levenshtein Similarity (ANLS) as the evaluation metric. For more details, please see the metric used for Task 3 of the Scene Text VQA challenge. Please note that we are considering including other evaluation metrics that are popular in VQA and reading comprehension tasks. We will update the details here before final submissions.

  • Answers are not case sensitive
  • Answers are space sensitive
  • Answers, and the tokens comprising them, are not limited to a fixed-size dictionary. An answer can contain any word or token present in the document.
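
For readers who want to score their own validation runs, the following is a rough sketch of ANLS, not the official evaluation script. It assumes the commonly used form of the metric: per question, the score is 1 minus the normalized Levenshtein distance to the closest ground truth answer, set to 0 when that distance reaches a threshold. The threshold value of 0.5 and the function names are assumptions not stated on this page. Strings are lower-cased (answers are not case sensitive) and whitespace is left untouched (answers are space sensitive).

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average the best per-question similarity over all ground truth questions.

    predictions: {question id: answer string}
    ground_truths: {question id: list of accepted answer strings}
    """
    if not ground_truths:
        return 0.0
    total = 0.0
    for question_id, answers in ground_truths.items():
        predicted = predictions.get(question_id, "").lower()
        best = 0.0
        for answer in answers:
            expected = answer.lower()
            longest = max(len(predicted), len(expected))
            distance = levenshtein(predicted, expected) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(ground_truths)


ground_truths = {52212: ["Edward R. Shannon", "Edward Shannon"]}
exact = anls({52212: "edward shannon"}, ground_truths)
near = anls({52212: "Edward Shanon"}, ground_truths)
wrong = anls({52212: "12/15/88"}, ground_truths)
```

Under these assumptions a case-only difference scores as a full match, a one-character slip scores slightly lower, and an unrelated string scores 0.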


Important Dates

16 March 2020: Training set v0.1 available

19 March 2020: Text transcriptions for Train_v0.1 documents available

15 April 2020: Test set available

30 April 2020: Submission of results

16-18 June 2020: CVPR workshop