method: DocGptVQA (2023-04-20)

Authors: Ren Zhou, Qiaoling Deng, Xinfeng Chang, Luyan Wang, Xiaochen Hu, Hui Li, Yaqiang Wu

Affiliation: Lenovo Research

Description: We integrated the prediction outputs of the UDOP model and BLIP-2 to enhance our results, and we optimized the image encoder and included page-number features to address the challenge of multi-page documents. We also used GPT to generate Python-like modular programs.
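The submission does not say how the two models' outputs are merged. Below is a minimal sketch, assuming each model returns an answer with a confidence score and the more confident answer wins; the "answer"/"confidence" field names and the tie-breaking rule are our assumptions, not the authors' method.

# Hypothetical prediction-level ensemble of UDOP and BLIP-2 outputs.
# The field names and the keep-the-more-confident rule are assumptions;
# the submission does not specify the merging scheme.
def ensemble_answers(udop_pred: dict, blip2_pred: dict) -> str:
    if udop_pred["confidence"] >= blip2_pred["confidence"]:
        return udop_pred["answer"]
    return blip2_pred["answer"]

# Toy usage with made-up predictions:
udop = {"answer": "March 2021", "confidence": 0.82}
blip2 = {"answer": "2021", "confidence": 0.64}
print(ensemble_answers(udop, blip2))  # -> March 2021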

method: DocBlipVQA (2023-04-16)

Authors: Ren Zhou, Qiaoling Deng, Xinfeng Chang, Luyan Wang, Xiaochen Hu, Hui Li, Yaqiang Wu

Affiliation: Lenovo Research

Description: We integrated the prediction outputs of the UDOP model and BLIP-2 to enhance our results, and we optimized the image encoder and included page-number features to address the challenge of multi-page documents.
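The page-number features are not described further. One plausible reading, sketched below under our own assumptions (the module name, feature sizes, and additive combination are all hypothetical), is a learned per-page embedding added to each page's visual features so the encoder can distinguish pages:

# A guessed sketch of page-number features for a multi-page document:
# a learned embedding per page index is broadcast over that page's
# visual tokens. Sizes and names are assumptions, not the authors' code.
import torch
import torch.nn as nn

class PageAwareEncoder(nn.Module):
    def __init__(self, feat_dim: int = 768, max_pages: int = 64):
        super().__init__()
        self.page_embed = nn.Embedding(max_pages, feat_dim)

    def forward(self, page_feats: torch.Tensor, page_ids: torch.Tensor) -> torch.Tensor:
        # page_feats: (num_pages, num_tokens, feat_dim); page_ids: (num_pages,)
        return page_feats + self.page_embed(page_ids).unsqueeze(1)

# Toy usage: 3 pages with 16 visual tokens each.
feats = torch.randn(3, 16, 768)
out = PageAwareEncoder()(feats, torch.arange(3))
print(out.shape)  # torch.Size([3, 16, 768])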

method: Multi-Modal T5 VQA (2023-04-20)

Authors: Hyunbyung Park

Affiliation: Upstage KR

Description: In this work, we used a T5 multi-modal Visual Question Answering (VQA) model to address the challenges in document understanding. Our approach combines pretraining, fine-tuning, and prediction techniques to improve performance on DUDE2023. We leveraged the following datasets for training and evaluation: ScienceQA, VQAonBD2023, HotpotQA, MPDocVQA, and DUDE2023. The methodology involves two primary pretraining steps, followed by a final fine-tuning phase:

Single-Page VQA Pretraining: In the first pretraining step, we used a combination of the ScienceQA, VQAonBD2023, HotpotQA, MPDocVQA, and DUDE2023 datasets. The model was pretrained with two objectives: Masked Language Modeling (MLM) and VQA.

Multi-Page VQA Pretraining: The second pretraining step trained the model on the MPDocVQA and DUDE2023 datasets with three objectives: MLM, Page Order Matching, and VQA. This step aimed to enhance the model's ability to understand and process multi-page documents effectively.

After completing the pretraining steps, we fine-tuned the model using the single VQA objective on the DUDE2023 dataset only. For the prediction stage, we processed each page of the input PDF documents separately and obtained individual predictions; the answers from all pages were then combined to generate the final output (a sketch of this stage follows below).
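A minimal sketch of the prediction stage, assuming the model returns a confidence per page and that "combined" means keeping the highest-confidence answer; run_vqa_model is a hypothetical stand-in for the fine-tuned T5 model, with a toy heuristic inside it only so the sketch runs end to end.

def run_vqa_model(page_text: str, question: str) -> tuple[str, float]:
    """Stand-in for the fine-tuned T5 VQA model; returns (answer, confidence)."""
    # Toy heuristic: pages mentioning the last word of the question
    # get a high confidence. A real implementation runs the model here.
    topic = question.lower().rstrip("?").split()[-1]
    if topic in page_text.lower():
        return page_text, 0.9
    return "", 0.1

def answer_document(pages: list[str], question: str) -> str:
    """Answer each page independently, then keep the best-scoring answer."""
    candidates = [run_vqa_model(page, question) for page in pages]
    best_answer, _ = max(candidates, key=lambda c: c[1])
    return best_answer

pages = ["Total revenue in 2022: 4.2M USD", "Appendix: glossary of terms"]
print(answer_document(pages, "What was the revenue?"))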

Ranking Table

Answer Calibration is reported as ECE and AURC, OOD Detection as AUROC, and the last four columns break ANLS down per answer type.

Date | Method | ANLS | ECE | AURC | AUROC | Extractive | Abstractive | List of answers | Unanswerable
2023-04-20 | DocGptVQA | 0.5002 | 0.2240 | 0.4210 | 0.8744 | 0.5186 | 0.4832 | 0.2822 | 0.6204
2023-04-16 | DocBlipVQA | 0.4762 | 0.3065 | 0.4860 | 0.7829 | 0.5069 | 0.4631 | 0.3073 | 0.5522
2023-04-20 | Multi-Modal T5 VQA | 0.3790 | 0.5931 | 0.5931 | 0.5000 | 0.4155 | 0.4024 | 0.2021 | 0.3467
2023-04-19 | Multi-Modal T5 VQA | 0.3789 | 0.5931 | 0.5931 | 0.5000 | 0.4154 | 0.4022 | 0.2031 | 0.3467
2023-04-18 | Hi-VT5-beamsearch | 0.3574 | 0.6104 | 0.6104 | 0.5000 | 0.2831 | 0.3298 | 0.1060 | 0.6290
2023-04-21 | Hi-VT5-beamsearch with token type embeddings | 0.3559 | 0.2803 | 0.4603 | 0.4876 | 0.3095 | 0.3515 | 0.1176 | 0.5250

Ranking Graphic
