method: Applica.ai TILT (2021-04-11)

Authors: Applica.ai Research Team

Affiliation: Applica.ai

Email: rafal.powalski@applica.ai, dawid.jurkiewicz@applica.ai

Description: The TILT neural network architecture simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on an encoder-decoder architecture. Results were obtained from a single TILT-Large model pre-trained as described in a paper. The model was fine-tuned on the challenge training set.

method: IG-BERT (single model) (2021-04-09)

Authors: Ryota Tanaka, Kyosuke Nishida

Affiliation: NTT Media Intelligence Laboratories, NTT Corporation

Email: ryouta.tanaka.rg@hco.ntt.co.jp

Description: IG-BERT is a vision-and-language (V+L) model pre-trained on large-scale infographic-text pairs. The model was initialized from BERT-large and trained on the training and validation data. We extracted icon visual features using Faster R-CNN trained on Visually29K. In the preprocessing stage, we used the Google Vision API to extract OCR text.

method: NAVER CLOVA (2021-04-11)

Authors: Jungjun Kim, Teakgyu Hong, Hyungmin Lee, Junbum Cha, Sungrae Park

Affiliation: NAVER Corp.

Email: teakgyu.hong@navercorp.com

Description: We used CLOVA OCR to obtain OCR results for the images, then preprocessed those results so the task could be solved with an extractive QA method. For preprocessing, we followed HyperDQA's approach. To train the extractive QA model, we first pre-trained a BROS [1] model (slightly modified to share parameters between the projection matrices in self-attention) on the IIT-CDIP dataset. We then performed additional pre-training on the SQuAD and WikitableQa datasets. Finally, we fine-tuned the model on the DocVQA dataset to obtain the answers.

[1] https://openreview.net/pdf?id=punMXQEsPr0
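Framing the task as extractive QA means each training answer has to be located as a token span inside the OCR output. The exact preprocessing used by HyperDQA or NAVER CLOVA is not given here, so the following is only a minimal sketch of that general idea: the function names `normalize` and `find_answer_span`, and the lenient lowercase/punctuation matching, are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def normalize(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation for lenient matching."""
    return token.lower().strip(".,;:!?\"'()[]")


def find_answer_span(ocr_tokens: List[str], answer: str) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) token indices of the first match, or None.

    Hypothetical helper: a real pipeline may also need fuzzy matching to
    tolerate OCR errors, which this sketch does not attempt.
    """
    target = [t for t in (normalize(w) for w in answer.split()) if t]
    if not target:
        return None
    tokens = [normalize(t) for t in ocr_tokens]
    n = len(target)
    # Slide a window of the answer's length over the OCR tokens.
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target:
            return start, start + n - 1
    return None


ocr_tokens = ["Total", "revenue:", "42", "million", "USD"]
span = find_answer_span(ocr_tokens, "42 million")
```

Questions whose answer cannot be found as a span (the function returns `None`) cannot be answered by a purely extractive model, which is consistent with the lower scores such models show in the "Non span" column.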

Ranking Table

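Column groups: Answer type (Image span, Question span, Multiple spans, Non span); Evidence (Table/List, Textual, Visual object, Figure, Map); Operation (Comparison, Arithmetic, Counting).

| Date | Method | Score | Image span | Question span | Multiple spans | Non span | Table/List | Textual | Visual object | Figure | Map | Comparison | Arithmetic | Counting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-04-11 | Applica.ai TILT | 0.6120 | 0.6765 | 0.6419 | 0.4391 | 0.3832 | 0.5917 | 0.7916 | 0.4545 | 0.5654 | 0.4480 | 0.4801 | 0.4958 | 0.2652 |
| 2021-04-09 | IG-BERT (single model) | 0.3854 | 0.4181 | 0.4481 | 0.2197 | 0.2849 | 0.3373 | 0.5016 | 0.3013 | 0.3706 | 0.3347 | 0.2939 | 0.3564 | 0.2000 |
| 2021-04-11 | NAVER CLOVA | 0.3219 | 0.3997 | 0.2317 | 0.1064 | 0.1068 | 0.2653 | 0.4488 | 0.1878 | 0.3095 | 0.3231 | 0.2020 | 0.1480 | 0.0695 |
| 2021-04-10 | Ensemble LM and VLM | 0.2853 | 0.3337 | 0.4181 | 0.0748 | 0.1169 | 0.2439 | 0.3649 | 0.2331 | 0.2645 | 0.2845 | 0.2580 | 0.1628 | 0.0647 |
| 2021-04-05 | BERT fuzzy search | 0.2078 | 0.2625 | 0.2333 | 0.0739 | 0.0259 | 0.1852 | 0.2995 | 0.0896 | 0.1942 | 0.1709 | 0.1805 | 0.0160 | 0.0436 |
| 2021-04-10 | BERT | 0.1678 | 0.2149 | 0.2117 | 0.0126 | 0.0152 | 0.1479 | 0.2450 | 0.1054 | 0.1505 | 0.1768 | 0.1578 | 0.0158 | 0.0185 |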
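
The ordering by overall score does not hold in every category. As a small illustration, the sketch below re-ranks the methods by a single column, using only the overall and "Question span" values copied from the table; the dictionary layout and the `rank_by` helper are assumptions made for this example.

```python
# Overall score and "Question span" score for each method, copied from the table.
results = {
    "Applica.ai TILT": {"score": 0.6120, "question_span": 0.6419},
    "IG-BERT (single model)": {"score": 0.3854, "question_span": 0.4481},
    "NAVER CLOVA": {"score": 0.3219, "question_span": 0.2317},
    "Ensemble LM and VLM": {"score": 0.2853, "question_span": 0.4181},
    "BERT fuzzy search": {"score": 0.2078, "question_span": 0.2333},
    "BERT": {"score": 0.1678, "question_span": 0.2117},
}


def rank_by(table, column):
    """Return method names sorted by the given column, best first."""
    return sorted(table, key=lambda method: table[method][column], reverse=True)


overall = rank_by(results, "score")
by_question_span = rank_by(results, "question_span")
```

Here NAVER CLOVA is third overall but fifth on "Question span", where both Ensemble LM and VLM and BERT fuzzy search score higher.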

Ranking Graphic
