Method: qwen2vl-2b ensemble (2024-12-19)

Authors: Wang KeLong

Description: Qwen2VL-2B is trained for the MP-DocVQA page classification task, and Qwen2VL-2B is trained for the SP-DocVQA VQA task. The results from the four models are integrated through Qwen2VL-2B.
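
The description leaves the exact composition of the four models open. Below is a minimal sketch of one plausible two-stage ensemble, assuming hypothetical callables classify_page, answer_on_page, and integrate that would wrap the separately fine-tuned Qwen2VL-2B checkpoints; it is not the authors' actual pipeline.

    from typing import Callable, List, Tuple

    def ensemble_answer(
        question: str,
        page_images: List[object],
        classify_page: Callable[[str, object], float],   # hypothetical MP-DocVQA page classifier
        answer_on_page: Callable[[str, object], str],    # hypothetical SP-DocVQA answerer
        integrate: Callable[[str, List[Tuple[str, int]]], Tuple[str, int]],  # final integration pass
        k: int = 4,
    ) -> Tuple[str, int]:
        # Stage 1: score every page of the document for relevance to the question.
        scores = [classify_page(question, img) for img in page_images]
        top = sorted(range(len(page_images)), key=scores.__getitem__, reverse=True)[:k]
        # Stage 2: answer the question independently on each retained page.
        candidates = [(answer_on_page(question, page_images[i]), i) for i in top]
        # Stage 3: let the integration model pick or merge the final answer and its page.
        return integrate(question, candidates)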

Method: mPLUG-DocOwls (2025-01-15)

Authors: Anwen Hu, Haiyang Xu†, Liang Zhang, Jiabo Ye, Ming Yan†, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

Affiliation: Alibaba Group, Renmin University of China

Description: This is an unofficial test, and the model was not fine-tuned on this dataset. Since the model does not localize the answer to a specific page, the page prediction is set to the starting page.

Method: (OCR-Free) Retrieval-based Baseline (2023-10-03)

Authors: Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas

Affiliation: Computer Vision Center (CVC)

Description: Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Due to this multi-modal nature, Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities. State-of-the-art single-page Document VQA methods show impressive performance, yet they struggle in multi-page scenarios: they have to concatenate all pages into one large page for processing, demanding substantial GPU resources even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach uses a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance on documents of nearly 800 pages, compared to the maximum of 20 pages in the MP-DocVQA dataset.
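
As a rough illustration of the page-scoring idea described above, the sketch below scores pre-computed per-page embeddings with self-attention and retrieves the top page. The Pix2Struct encoder, question conditioning, and training details are omitted, and all names and dimensions are illustrative rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class PageScorer(nn.Module):
        """Toy self-attention page-relevance scorer (illustrative only)."""

        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.score_head = nn.Linear(dim, 1)

        def forward(self, page_embeddings: torch.Tensor) -> torch.Tensor:
            # page_embeddings: (batch, num_pages, dim), one pooled vector per page
            ctx, _ = self.attn(page_embeddings, page_embeddings, page_embeddings)
            return self.score_head(ctx).squeeze(-1)  # (batch, num_pages) relevance scores

    scorer = PageScorer()
    pages = torch.randn(1, 20, 768)    # dummy features for a 20-page document
    scores = scorer(pages)             # one relevance score per page
    top_page = scores.argmax(dim=-1)   # index of the page passed to the single-page VQA model

Because only the retrieved page is processed by the answering model, GPU cost stays roughly constant as the document grows, which is what allows evaluation on documents far longer than those in MP-DocVQA.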

Ranking Table

Answer quality is scored with ANLS and page prediction with Accuracy; the Page 0-19 columns report ANLS broken down by the position of the page containing the answer.

Date | Method | ANLS | Accuracy | Page 0 | Page 1 | Page 2 | Page 3 | Page 4 | Page 5 | Page 6 | Page 7 | Page 8 | Page 9 | Page 10 | Page 11 | Page 12 | Page 13 | Page 14 | Page 15 | Page 16 | Page 17 | Page 18 | Page 19
2024-12-19 | qwen2vl-2b ensemble | 0.8501 | 85.9534 | 0.8937 | 0.8363 | 0.7980 | 0.7965 | 0.7735 | 0.7903 | 0.6698 | 0.8718 | 0.8133 | 0.7004 | 0.7618 | 0.7229 | 0.8027 | 0.7433 | 0.7207 | 0.7031 | 0.7790 | 0.7050 | 0.8846 | 1.0000
2025-01-15 | mPLUG-DocOwls | 0.6932 | 50.7870 | 0.7618 | 0.6636 | 0.6403 | 0.6219 | 0.5282 | 0.5507 | 0.5361 | 0.5960 | 0.6388 | 0.6015 | 0.6342 | 0.5000 | 0.5922 | 0.4351 | 0.4612 | 0.4938 | 0.5152 | 0.5333 | 0.6368 | 0.7105
2023-10-03 | (OCR-Free) Retrieval-based Baseline | 0.6199 | 81.5501 | 0.6755 | 0.5954 | 0.5802 | 0.5611 | 0.4986 | 0.4989 | 0.5760 | 0.4991 | 0.6062 | 0.6652 | 0.5665 | 0.3438 | 0.4470 | 0.4171 | 0.3713 | 0.5909 | 0.4321 | 0.2575 | 0.7308 | 0.9605
2023-03-28 | Hi-VT5 | 0.6184 | 79.6374 | 0.6571 | 0.6055 | 0.5907 | 0.5450 | 0.5259 | 0.5431 | 0.6747 | 0.6113 | 0.5971 | 0.7997 | 0.5291 | 0.3694 | 0.5466 | 0.3373 | 0.4144 | 0.3879 | 0.4835 | 0.4001 | 0.6187 | 1.0000
2023-02-14 | (Baseline) Longformer base concat | 0.5287 | 71.1696 | 0.6293 | 0.4746 | 0.4495 | 0.4371 | 0.3966 | 0.3889 | 0.4451 | 0.3883 | 0.4805 | 0.5049 | 0.2860 | 0.1888 | 0.0861 | 0.1600 | 0.1726 | 0.2448 | 0.1486 | 0.1912 | 0.1154 | 0.6625
2023-02-14 | (Baseline) T5 base concat | 0.5050 | 0.0000 | 0.7122 | 0.4390 | 0.2567 | 0.2081 | 0.1498 | 0.1533 | 0.2186 | 0.1415 | 0.1301 | 0.3135 | 0.1108 | 0.0829 | 0.0866 | 0.0774 | 0.0873 | 0.0481 | 0.1648 | 0.2240 | 0.0000 | 0.3875
2023-02-14 | (Baseline) BigBird ITC base concat | 0.4929 | 67.5433 | 0.6506 | 0.4529 | 0.3729 | 0.2883 | 0.1890 | 0.1726 | 0.1681 | 0.1962 | 0.1887 | 0.2957 | 0.1802 | 0.0800 | 0.0829 | 0.0595 | 0.0238 | 0.1993 | 0.0778 | 0.1400 | 0.0769 | 0.2375
2023-02-14 | (Baseline) LayoutLMv3 base - concat | 0.4538 | 51.9426 | 0.6624 | 0.3962 | 0.2020 | 0.1105 | 0.1609 | 0.0494 | 0.1165 | 0.0467 | 0.0596 | 0.3198 | 0.0980 | 0.0800 | 0.0433 | 0.1131 | 0.0000 | 0.0455 | 0.0978 | 0.1467 | 0.0385 | 0.2105
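
For reference, the ANLS column follows the standard DocVQA definition: each prediction is compared against every accepted ground-truth answer using normalized Levenshtein similarity, scores whose normalized distance reaches the 0.5 threshold are zeroed, and the best match per question is averaged over all questions. The sketch below is a minimal reference computation, not the official evaluation script, which may differ in answer normalization details.

    def levenshtein(a: str, b: str) -> int:
        """Edit distance via dynamic programming (single rolling row)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def anls(predictions, references, tau: float = 0.5) -> float:
        """Mean over questions of the best per-ground-truth similarity score."""
        total = 0.0
        for pred, gts in zip(predictions, references):
            best = 0.0
            for gt in gts:
                p, g = pred.strip().lower(), gt.strip().lower()
                nl = levenshtein(p, g) / max(len(p), len(g), 1)
                best = max(best, 1.0 - nl if nl < tau else 0.0)
            total += best
        return total / max(len(predictions), 1)

    # Example: anls(["paris"], [["Paris", "paris, france"]]) == 1.0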

Ranking Graphic
