Overview - Document Visual Question Answering

The CVPR 2020 "Document Visual Question Answering" (DocVQA) challenge focuses on a specific type of Visual Question Answering task, where visually understanding the information on a document image is necessary in order to provide an answer. This goes over and above passing a document image through OCR, and involves understanding all types of information conveyed by a document. Textual content (handwritten or typewritten), non-textual elements (marks, tick boxes, separators, diagrams), layout (page structure, forms, tables), and style (font, colours, highlighting), to mention just a few, are pieces of information that may be necessary for answering the question at hand.

The challenge is organised in the context of the CVPR 2020 Workshop on Text and Documents in the Deep Learning Era.

DocVQA_ex1_lyph0227_1.png
What is the amount of total due? 56.62
What is the Invoice Number? 834 SC-R
What is the description of the quantity? 5x7 Glossy Prints of Mr- Robert Owen

DocVQA_ex2_imagesc_c_l_l_cll11d00_522856817-6818.png
How old is the sender? 26
Who sent this letter? Darline Hurt

DocVQA_ex3.2_qjvj0224_2.png
What is the issue at the top of the pyramid? Retailer calls/other issues
Which is the least critical issue for live rep support? Retailer calls/other issues
Which is the most critical issue for live rep support? Product quality/liability issues

Figure 1. Example documents from DocVQA with their questions and answers.

 

Contemporary Document Analysis and Recognition (DAR) research tends to focus on generic information extraction tasks (character recognition, table extraction, word spotting), largely disconnected from the final purpose the extracted information is used for. The DocVQA challenge seeks to inspire a “purpose-driven” point of view in Document Analysis and Recognition research, where document content is extracted and used to respond to high-level tasks defined by the human consumers of this information. In this sense, DocVQA provides a high-level task that should dynamically drive information extraction algorithms to conditionally interpret document images.

On the other hand, Visual Question Answering (VQA), as it is currently applied to real-scene images, is vulnerable to learning coincidental correlations in the data without forming a deeper understanding of the scene. In the case of DocVQA, deeper relations must be established between the aims of the question (as expressed in natural language) and the document image content (which needs to be extracted and understood).

A large-scale dataset of document images reflecting real-world document variety, along with question and answer pairs, will be released according to the schedule below.
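For concreteness, the sketch below shows how such question-answer annotations might be consumed; the file name and the field names ("data", "image", "question", "answers") are illustrative assumptions, not the official release format.

import json

# Illustrative sketch only: the file name and the field names used here
# ("data", "image", "question", "answers") are assumptions about the
# annotation layout, not the official DocVQA release format.
with open("train_v0.1.json") as f:
    annotations = json.load(f)

for qa in annotations["data"]:
    print(f"{qa['image']}: {qa['question']} -> {qa['answers']}")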

The challenge will comprise two tasks. Task 1 is a typical VQA-style task, where natural language questions are defined over single documents, and an answer needs to be generated by interpreting the document image. No list of pre-defined responses will be given; hence, the problem cannot easily be treated as an n-way classification task.
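Because answers are open-ended strings rather than indices into a fixed label set, a Task 1 system produces free-form text for each question. The minimal sketch below illustrates this; the model stub, the question records, and the prediction file layout are hypothetical, not the official submission format.

import json

def answer_question(document_image_path: str, question: str) -> str:
    """Hypothetical model stub: read the document image and generate a
    free-form answer string (there is no fixed set of answer classes)."""
    return "placeholder answer"

# Hypothetical question records; the field names are illustrative only.
questions = [
    {"questionId": 1,
     "image": "documents/invoice_001.png",
     "question": "What is the amount of total due?"},
]

# Each prediction pairs a question id with a generated answer string.
predictions = [
    {"questionId": q["questionId"],
     "answer": answer_question(q["image"], q["question"])}
    for q in questions
]

with open("task1_predictions.json", "w") as f:
    json.dump(predictions, f)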

Citation for Task 1

@misc{mathew2020docvqa,
    title={{DocVQA}: A Dataset for VQA on Document Images},
    author={Minesh Mathew and Dimosthenis Karatzas and R. Manmatha and C. V. Jawahar},
    year={2020},
    eprint={2007.00398},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

 

Task 2 is a retrieval-style task where, given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering it.
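As a rough illustration of this retrieval setting (not the official baseline or evaluation protocol), one could rank documents by the similarity between the question and each document's OCR transcription; the toy texts below stand in for a real collection.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for OCR transcriptions of a large document collection.
doc_texts = {
    "doc_001": "invoice total due 56.62 glossy prints of mr robert owen",
    "doc_002": "letter sender darline hurt age 26",
}

# Rank every document by TF-IDF cosine similarity to the question,
# so relevant documents can be retrieved from the top of the list.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(doc_texts.values())

question = "What is the amount of total due?"
scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]

ranked = sorted(zip(doc_texts, scores), key=lambda kv: kv[1], reverse=True)
print(ranked)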

Important Dates

16 March 2020: Training set v0.1 available

19 March 2020: Text transcriptions for Train_v0.1 documents available

20 April 2020: Test set available

15 May 2020 (23:59 PST): Submission of results

16-18 June 2020: CVPR workshop