method: DocGptVQA (2023-04-20)
Authors: Ren Zhou, Qiaoling Deng, Xinfeng Chang, Luyan Wang, Xiaochen Hu, Hui Li, Yaqiang Wu
Affiliation: Lenovo Research
Description: We integrated the prediction outputs from the UDOP model and BLIP-2 to enhance our results, and we optimized the image encoder and included page-number features to address the challenge of multi-page documents. We also used GPT to generate Python-like modular programs.
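The submission does not detail how the UDOP and BLIP-2 outputs are integrated; a common approach is confidence-based fusion, sketched below. The function name, prediction format, and scores are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of fusing per-question answers from two models by
# reported confidence. The dicts map question id -> (answer, confidence);
# this format is an assumption for illustration only.

def fuse_predictions(pred_a, pred_b):
    """For each question, keep the answer with the higher confidence."""
    fused = {}
    for qid in pred_a.keys() | pred_b.keys():
        answer_a = pred_a.get(qid, ("", 0.0))
        answer_b = pred_b.get(qid, ("", 0.0))
        fused[qid] = answer_a[0] if answer_a[1] >= answer_b[1] else answer_b[0]
    return fused

# Toy placeholder outputs standing in for UDOP and BLIP-2 predictions.
udop_preds = {"q1": ("Paris", 0.9), "q2": ("42", 0.4)}
blip2_preds = {"q1": ("paris", 0.6), "q2": ("43", 0.7)}
fused = fuse_predictions(udop_preds, blip2_preds)
assert fused == {"q1": "Paris", "q2": "43"}
```

More elaborate schemes (e.g., answer voting or a learned reranker) follow the same shape: collect candidate answers per question, then select one by a scoring rule.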
method: DocBlipVQA (2023-04-16)
Authors: Ren Zhou, Qiaoling Deng, Xinfeng Chang, Luyan Wang, Xiaochen Hu, Hui Li, Yaqiang Wu
Affiliation: Lenovo Research
Description: We integrated the prediction outputs from the UDOP model and BLIP-2 to enhance our results, and we optimized the image encoder and included page-number features to address the challenge of multi-page documents.
method: Multi-Modal T5 VQA (2023-04-20)
Authors: Hyunbyung Park
Affiliation: Upstage KR
Description: In this work, we used a T5 multi-modal Visual Question Answering (VQA) model to address the challenges in document understanding. Our approach combines pretraining, fine-tuning, and prediction techniques to improve performance on DUDE2023. We leveraged the following datasets for training and evaluation: ScienceQA, VQAonBD2023, HotpotQA, MPDocVQA, and DUDE2023.

The methodology involves two pretraining steps followed by a final fine-tuning phase:

1. Single-page VQA pretraining: In the first pretraining step, we used a combination of the ScienceQA, VQAonBD2023, HotpotQA, MPDocVQA, and DUDE2023 datasets. The model was pretrained with two objectives: Masked Language Modeling (MLM) and VQA.
2. Multi-page VQA pretraining: The second pretraining step used the MPDocVQA and DUDE2023 datasets with three objectives: Masked Language Modeling (MLM), Page Order Matching, and VQA. This step aimed to enhance the model's ability to understand and process multi-page documents effectively.

After pretraining, we fine-tuned the model with the single VQA objective on the DUDE2023 dataset only. For the prediction stage, we processed each page of the input PDF documents separately and obtained individual predictions. The answers from all pages were then combined to generate the final output.
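The per-page prediction stage described above can be sketched as running a QA model on each page independently and keeping the highest-scoring answer. The combination rule and the toy model below are assumptions for illustration; the submission does not specify how per-page answers are merged.

```python
# Hedged sketch of per-page prediction on a multi-page document:
# query each page independently, then keep the highest-confidence answer.
# `model` is any callable (page_text, question) -> (answer, score);
# the real system uses a multi-modal T5, not this stand-in.

def answer_document(pages, question, model):
    """Return the best answer across all pages of a document."""
    best_answer, best_score = "", float("-inf")
    for page in pages:
        answer, score = model(page, question)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer

def toy_model(page, question):
    # Crude stand-in: score a page by keyword overlap with the question.
    words = question.lower().strip("?").split()
    hits = sum(1 for w in words if w in page.lower())
    return (page, float(hits))

pages = ["invoice total 120", "shipping address london"]
assert answer_document(pages, "what is the total?", toy_model) == "invoice total 120"
```

A max-score rule is the simplest merge; alternatives include answer voting across pages or concatenating top-scoring pages into a single context.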
Results: ANLS measures answer quality; ECE and AURC measure calibration; AUROC measures OOD detection; the last four columns report ANLS per answer type.

| Date | Method | ANLS | ECE | AURC | AUROC | Extractive | Abstractive | List of answers | Unanswerable |
|---|---|---|---|---|---|---|---|---|---|
| 2023-04-20 | DocGptVQA | 0.5002 | 0.2240 | 0.4210 | 0.8744 | 0.5186 | 0.4832 | 0.2822 | 0.6204 |
| 2023-04-16 | DocBlipVQA | 0.4762 | 0.3065 | 0.4860 | 0.7829 | 0.5069 | 0.4631 | 0.3073 | 0.5522 |
| 2023-04-20 | Multi-Modal T5 VQA | 0.3790 | 0.5931 | 0.5931 | 0.5000 | 0.4155 | 0.4024 | 0.2021 | 0.3467 |
| 2023-04-19 | Multi-Modal T5 VQA | 0.3789 | 0.5931 | 0.5931 | 0.5000 | 0.4154 | 0.4022 | 0.2031 | 0.3467 |
| 2023-04-18 | Hi-VT5-beamsearch | 0.3574 | 0.6104 | 0.6104 | 0.5000 | 0.2831 | 0.3298 | 0.1060 | 0.6290 |
| 2023-04-21 | Hi-VT5-beamsearch with token type embeddings | 0.3559 | 0.2803 | 0.4603 | 0.4876 | 0.3095 | 0.3515 | 0.1176 | 0.5250 |