- Task 4 - MP-DocVQA
method: qwen2vl-2b ensemble (2024-12-19)
Authors: Wang KeLong
Description: Qwen2VL-2B is trained for the MP-DocVQA page-classification task, and Qwen2VL-2B is trained for the SP-DocVQA VQA task. The results from the four models are integrated through Qwen2VL-2B.
method: mPLUG-DocOwls (2025-01-15)
Authors: Anwen Hu, Haiyang Xu†, Liang Zhang, Jiabo Ye, Ming Yan†, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
Affiliation: Alibaba Group, Renmin University of China
Description: This is an unofficial test, and the model was not fine-tuned on this dataset. Since the model does not localize the answer to a specific page, the page prediction is reported as the starting page.
method: (OCR-Free) Retrieval-based Baseline (2023-10-03)
Authors: Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas
Affiliation: Computer Vision Center (CVC)
Description: Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Due to this multi-modal nature, Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.

The state-of-the-art single-page Document VQA methods show impressive performance, yet they struggle in multi-page scenarios: they have to concatenate all pages into one large page for processing, demanding substantial GPU resources even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach utilizes a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance on documents of nearly 800 pages, compared to a maximum of 20 pages in the MP-DocVQA dataset.
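The page-retrieval step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Pix2Struct encoder and the learned self-attention scoring head are replaced here by plain dot-product attention over hypothetical per-page embedding vectors.

```python
import math

def page_relevance_scores(page_embeddings, query_vec):
    # One attention logit per page via a dot product against a query vector
    # (a hypothetical stand-in for the paper's self-attention scoring head).
    logits = [sum(p * q for p, q in zip(page, query_vec))
              for page in page_embeddings]
    m = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Softmax-normalized relevance scores, one per page, summing to 1.
    return [w / total for w in weights]

def retrieve_top_page(page_embeddings, query_vec):
    # Retrieve the most relevant page; the single-page VQA model would
    # then answer from that page alone.
    scores = page_relevance_scores(page_embeddings, query_vec)
    return scores.index(max(scores))
```

Because scoring is linear in the number of pages and only the retrieved page is passed to the answering model, this design keeps memory flat even for documents far longer than the 20-page maximum of MP-DocVQA.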
In the table below, *ANLS* scores the answers, *Accuracy* (%) scores the page prediction, and the *Page 0* through *Page 19* columns report ANLS broken down by the page position of the answer.

| Date | Method | ANLS | Accuracy (%) | Page 0 | Page 1 | Page 2 | Page 3 | Page 4 | Page 5 | Page 6 | Page 7 | Page 8 | Page 9 | Page 10 | Page 11 | Page 12 | Page 13 | Page 14 | Page 15 | Page 16 | Page 17 | Page 18 | Page 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-12-19 | qwen2vl-2b ensemble | 0.8501 | 85.9534 | 0.8937 | 0.8363 | 0.7980 | 0.7965 | 0.7735 | 0.7903 | 0.6698 | 0.8718 | 0.8133 | 0.7004 | 0.7618 | 0.7229 | 0.8027 | 0.7433 | 0.7207 | 0.7031 | 0.7790 | 0.7050 | 0.8846 | 1.0000 |
| 2025-01-15 | mPLUG-DocOwls | 0.6932 | 50.7870 | 0.7618 | 0.6636 | 0.6403 | 0.6219 | 0.5282 | 0.5507 | 0.5361 | 0.5960 | 0.6388 | 0.6015 | 0.6342 | 0.5000 | 0.5922 | 0.4351 | 0.4612 | 0.4938 | 0.5152 | 0.5333 | 0.6368 | 0.7105 |
| 2023-10-03 | (OCR-Free) Retrieval-based Baseline | 0.6199 | 81.5501 | 0.6755 | 0.5954 | 0.5802 | 0.5611 | 0.4986 | 0.4989 | 0.5760 | 0.4991 | 0.6062 | 0.6652 | 0.5665 | 0.3438 | 0.4470 | 0.4171 | 0.3713 | 0.5909 | 0.4321 | 0.2575 | 0.7308 | 0.9605 |
| 2023-03-28 | Hi-VT5 | 0.6184 | 79.6374 | 0.6571 | 0.6055 | 0.5907 | 0.5450 | 0.5259 | 0.5431 | 0.6747 | 0.6113 | 0.5971 | 0.7997 | 0.5291 | 0.3694 | 0.5466 | 0.3373 | 0.4144 | 0.3879 | 0.4835 | 0.4001 | 0.6187 | 1.0000 |
| 2023-02-14 | (Baseline) Longformer base concat | 0.5287 | 71.1696 | 0.6293 | 0.4746 | 0.4495 | 0.4371 | 0.3966 | 0.3889 | 0.4451 | 0.3883 | 0.4805 | 0.5049 | 0.2860 | 0.1888 | 0.0861 | 0.1600 | 0.1726 | 0.2448 | 0.1486 | 0.1912 | 0.1154 | 0.6625 |
| 2023-02-14 | (Baseline) T5 base concat | 0.5050 | 0.0000 | 0.7122 | 0.4390 | 0.2567 | 0.2081 | 0.1498 | 0.1533 | 0.2186 | 0.1415 | 0.1301 | 0.3135 | 0.1108 | 0.0829 | 0.0866 | 0.0774 | 0.0873 | 0.0481 | 0.1648 | 0.2240 | 0.0000 | 0.3875 |
| 2023-02-14 | (Baseline) BigBird ITC base concat | 0.4929 | 67.5433 | 0.6506 | 0.4529 | 0.3729 | 0.2883 | 0.1890 | 0.1726 | 0.1681 | 0.1962 | 0.1887 | 0.2957 | 0.1802 | 0.0800 | 0.0829 | 0.0595 | 0.0238 | 0.1993 | 0.0778 | 0.1400 | 0.0769 | 0.2375 |
| 2023-02-14 | (Baseline) LayoutLMv3 base - concat | 0.4538 | 51.9426 | 0.6624 | 0.3962 | 0.2020 | 0.1105 | 0.1609 | 0.0494 | 0.1165 | 0.0467 | 0.0596 | 0.3198 | 0.0980 | 0.0800 | 0.0433 | 0.1131 | 0.0000 | 0.0455 | 0.0978 | 0.1467 | 0.0385 | 0.2105 |
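ANLS, the headline metric in the table, is the Average Normalized Levenshtein Similarity: each prediction is scored by 1 minus its normalized edit distance to the closest ground-truth answer, and any score whose normalized distance reaches the threshold τ = 0.5 is zeroed so that badly wrong answers earn no partial credit. A minimal sketch (the case-insensitive, whitespace-stripped comparison is an assumption of this sketch, though it matches common DocVQA practice):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    # Average Normalized Levenshtein Similarity.
    # `gold_answers` is a list of lists: each question may accept
    # several valid answers, and the best match counts.
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            # Zero out scores at or beyond the threshold tau.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

For example, `anls(["repot"], [["report"]])` yields 1 − 1/6 ≈ 0.833: one edit over a six-character reference, well under the 0.5 cutoff.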