method: Multi-Modal T5 VQA (2023-04-20)

Authors: Hyunbyung Park

Affiliation: Upstage KR

Description: In this work, we used a multi-modal T5 Visual Question Answering (VQA) model to address the challenges of document understanding. Our approach combines pretraining, fine-tuning, and a page-wise prediction strategy to improve performance on DUDE 2023. We leveraged the following datasets for training and evaluation: ScienceQA, VQAonBD2023, HotpotQA, MPDocVQA, and DUDE2023.

The methodology involves two pretraining steps followed by a final fine-tuning phase:

1. Single-Page VQA Pretraining: In the first step, the model was pretrained on a combination of the ScienceQA, VQAonBD2023, HotpotQA, MPDocVQA, and DUDE2023 datasets with two objectives: Masked Language Modeling (MLM) and VQA.

2. Multi-Page VQA Pretraining: In the second step, the model was trained on the MPDocVQA and DUDE2023 datasets with three objectives: MLM, Page Order Matching, and VQA. This step aimed to enhance the model's ability to understand and process multi-page documents.

After completing the pretraining steps, we fine-tuned the model with the single VQA objective on the DUDE2023 dataset only.

For the prediction stage, we processed each page of the input PDF documents separately and obtained individual per-page predictions. The answers from all pages were then combined to generate the final output. Illustrative sketches of the pretraining objectives and the prediction stage are given below.
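The description does not specify the exact multi-modal T5 architecture or how the two single-page pretraining objectives are mixed, so the following is a minimal text-only sketch, assuming OCR'd page text as input and a Hugging Face T5 checkpoint ("t5-base" is a placeholder, not the authors' model). MLM is approximated with T5-style sentinel masking, and VQA is cast in the usual text-to-text form.

```python
import random
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Placeholder checkpoint; the actual multi-modal backbone is not specified.
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def mlm_example(page_text, mask_prob=0.15):
    # Replace random tokens with T5 sentinel tokens; the target reconstructs them.
    # (Real T5 span corruption merges consecutive spans; this is simplified, and
    # the degenerate case of zero masked tokens is not handled here.)
    source, target, sentinel = [], [], 0
    for tok in page_text.split():
        if random.random() < mask_prob and sentinel < 100:
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}> {tok}")
            sentinel += 1
        else:
            source.append(tok)
    return " ".join(source), " ".join(target)

def vqa_example(question, page_text, answer):
    # Cast VQA as text-to-text: question + page text in, answer string out.
    return f"question: {question} context: {page_text}", answer

def training_step(source, target):
    batch = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=64).input_ids
    loss = model(**batch, labels=labels).loss
    return loss  # backward pass and optimizer step omitted

# A pretraining loop would alternate between mlm_example and vqa_example
# batches, e.g. with a fixed mixing ratio per step.
```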
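Page Order Matching is not further described, so the sketch below shows one plausible text-to-text formulation under that assumption: sample two adjacent pages of the same document, optionally swap them, and train the model to predict whether they appear in their original order. The prompt and label strings are hypothetical.

```python
import random

def page_order_example(pages):
    # pages: OCR text of a document's pages, in their true order (len >= 2).
    i = random.randrange(len(pages) - 1)
    first, second = pages[i], pages[i + 1]
    in_order = random.random() < 0.5
    if not in_order:
        first, second = second, first  # present the pair swapped
    source = f"page order: first: {first} second: {second}"
    target = "in order" if in_order else "swapped"
    return source, target  # fed through the same seq2seq loss as the other objectives
```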
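For the prediction stage, the description states only that each PDF page is processed separately and the per-page answers are combined; the combination rule is not given. The sketch below assumes one simple choice: keep the answer with the highest beam-search sequence score across pages (model and tokenizer as in the pretraining sketch).

```python
import torch

@torch.no_grad()
def answer_document(question, page_texts, model, tokenizer):
    # Run the fine-tuned model on every page and keep the highest-scoring answer.
    best_answer, best_score = "", float("-inf")
    for page_text in page_texts:
        batch = tokenizer(
            f"question: {question} context: {page_text}",
            return_tensors="pt", truncation=True, max_length=512,
        )
        out = model.generate(
            **batch,
            num_beams=4,
            max_new_tokens=32,
            return_dict_in_generate=True,
            output_scores=True,
        )
        score = out.sequences_scores[0].item()  # length-normalized beam score
        if score > best_score:
            best_score = score
            best_answer = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return best_answer
```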