method: HyperDQA_V4 (2020-05-16)

Authors: Anisha Gunjal, Vipul Gupta, Moinak Bhattacharya, Digvijay Singh

Affiliation: HyperVerge, Inc.

Description: Our method uses transformer models as the backbone to solve the challenge task for three reasons:
1. Transformer models have recently achieved state-of-the-art results on most natural language tasks across benchmark datasets.
2. Question answering is a seamless downstream extension of transformer models, compared with the traditional approach of modelling questions with LSTMs and similar recurrent models.
3. The final task is to predict the span of answer tokens (start and end) rather than to generate an answer sentence, which is a natural fit for the one-to-one, per-token predictions made by transformer models (a sketch of this span-prediction setup follows after this description).
Hence, we extend two different types of transformer models, BERT [1] and LayoutLM [2]. While BERT is a suitable pick due to its natural-language understanding, LayoutLM, a transformer model with 2-D positional embeddings, is an even better choice given the layout complexity of the documents in the provided dataset.
Overall, LayoutLM works very well where layout understanding takes precedence, especially for forms and tabular data, which make up a sizeable fraction of the dataset.
BERT complemented LayoutLM in cases where the latter failed because strong language-context understanding was required.
Apart from the document layouts mentioned above, our method also inherently learns to find answers when the same entity information appears in different formats, such as addresses, titles, headings, salutations and other similar notations.
For our final submission, we experiment with a few techniques to ensemble the two models.
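
A minimal sketch of the span-prediction setup described above, using a BERT question-answering head from the Hugging Face transformers library. The checkpoint name, example text, and decoding logic are illustrative assumptions, not the HyperDQA_V4 pipeline:

```python
# Illustrative sketch of extractive (span-based) QA with a BERT backbone.
# The checkpoint and example are assumptions, not the submitted system.
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizerFast.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "What is the invoice date?"
context = "Invoice No. 1234. Date: 16 May 2020. Total due: $120.00."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model emits one start logit and one end logit per input token;
# the answer is the span between the argmax start and argmax end positions.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end])
print(answer)  # e.g. "16 may 2020"
```

The same span-prediction head sits on top of LayoutLM; only the input embeddings (which additionally encode token positions on the page) differ.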

method: Dessurt (2022-05-04)

Authors: Brian Davis, Bryan Morse, Brian Price, Chris Tensmeyer, Curtis Wigington, Vlad Morariu

Affiliation: Brigham Young University, Adobe

Email: hero.bd@gmail.com

Description: Dessurt: Document End-to-end Self-Supervised Understanding and Recognition Transformer

This model does not use external OCR results.

Learns recognition (OCR) and understanding tasks in a single end-to-end model. Uses a Swin-like visual transformer with cross-attention to a textual autoregressive transformer. Pretrained on IIT-CDIP and extensive synthetic data.
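
As a rough illustration of the shape of such a model (a visual transformer encoder whose features are cross-attended by an autoregressive text decoder), here is a minimal PyTorch sketch; all module choices, sizes, and names are assumptions for illustration, not the Dessurt implementation:

```python
# Rough sketch of an OCR-free encoder/decoder document model:
# image patches -> visual transformer encoder -> memory tokens,
# which an autoregressive text decoder attends to via cross-attention.
# Dimensions and modules are illustrative assumptions only.
import torch
import torch.nn as nn

class TinyDocModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, patch=16):
        super().__init__()
        # Patch embedding + plain transformer encoder stands in for the
        # Swin-like visual backbone of the actual model.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Each decoder layer self-attends over text tokens and
        # cross-attends over the visual memory.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, token_ids):
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N_patches, d)
        memory = self.encoder(feats)
        tgt = self.tok_embed(token_ids)
        t = token_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # next-token logits for autoregressive decoding

model = TinyDocModel()
logits = model(torch.randn(1, 3, 256, 256), torch.randint(0, 1000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 1000])
```

Because the text is generated rather than selected from OCR tokens, no external OCR results are needed at inference time.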

method: LayoutLMv2, Tesseract OCR eval (dataset OCR trained) (2022-04-27)

Authors: Brian Davis

Affiliation: Brigham Young University

Email: hero.bd@gmail.com

Description: LayoutLMv2 trained with the dataset-provided OCR and evaluated with Tesseract's OCR.
This is to demonstrate the importance of good OCR for DocVQA.
It was trained for 6 epochs with a batch size of 5 using the code at https://github.com/herobd/layoutlmv2 (a rough evaluation sketch follows below).
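
As a rough sketch of this kind of evaluation, the Hugging Face LayoutLMv2Processor can run Tesseract OCR on the page image itself (apply_ocr is enabled by default); the checkpoint name, image path, and question below are placeholders, and this is not the linked training code:

```python
# Rough sketch: run a LayoutLMv2 QA model on a page image using Tesseract OCR,
# which the processor applies automatically when apply_ocr=True (the default).
# Requires detectron2 and pytesseract; checkpoint and paths are placeholders.
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("page.png").convert("RGB")
question = "What is the total amount?"

# The processor runs Tesseract on the image to get words and bounding boxes,
# then tokenizes the question together with the OCR'd words.
encoding = processor(image, question, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**encoding)

start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0][start:end + 1])
print(answer)
```

Swapping the OCR source between the dataset annotations and Tesseract at train and evaluation time is what produces the two LayoutLMv2 rows in the ranking table below.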

Ranking Table

Date       | Method                                                  | Score  | Figure/Diagram | Form   | Table/List | Layout | Free_text | Image/Photo | Handwritten | Yes/No | Others
2020-05-16 | HyperDQA_V4                                             | 0.6893 | 0.3874         | 0.7792 | 0.6309     | 0.7478 | 0.7187    | 0.4867      | 0.5630      | 0.4138 | 0.5685
2022-05-04 | Dessurt                                                 | 0.6187 | 0.2867         | 0.7874 | 0.6286     | 0.6462 | 0.4793    | 0.2708      | 0.5680      | 0.4104 | 0.4380
2022-04-27 | LayoutLMv2, Tesseract OCR eval (dataset OCR trained)    | 0.4961 | 0.2544         | 0.5523 | 0.4177     | 0.5495 | 0.5914    | 0.2888      | 0.1361      | 0.2069 | 0.4187
2022-03-29 | LayoutLMv2, Tesseract OCR eval (Tesseract OCR trained)  | 0.4815 | 0.2253         | 0.5440 | 0.4216     | 0.5207 | 0.5709    | 0.2430      | 0.1353      | 0.3103 | 0.3859
2021-02-08 | seq2seq                                                 | 0.1081 | 0.0758         | 0.1283 | 0.0829     | 0.1332 | 0.0822    | 0.0786      | 0.0779      | 0.4828 | 0.1052

Ranking Graphic