Method: Human Performance (2020-06-13)

Authors: DocVQA Organizers

Affiliation: CVIT, IIIT Hyderabad, CVC-UAB, Amazon

Description: Human performance on the test set.
A small group of volunteers was asked to enter an answer for each given question and image.

Authors: Qwen Team

Affiliation: Alibaba Group

Description: QwenVL
1. One single model, no ensembling.
2. End-to-end model, no OCR pipeline.
3. Generalist model, no specialist fine-tuning.
Give it a go with our model at https://tongyi.aliyun.com/qianwen and API at https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start/
Follow us at https://github.com/QwenLM/Qwen-VL

Ranking Table

| Date | Method | Score | Figure/Diagram | Form | Table/List | Layout | Free_text | Image/Photo | Handwritten | Yes/No | Others |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-06-13 | Human Performance | 0.9811 | 0.9756 | 0.9825 | 0.9780 | 0.9845 | 0.9839 | 0.9740 | 0.9717 | 0.9974 | 0.9828 |
| 2024-01-24 | qwenvl-max (single generalist model) | 0.9307 | 0.8491 | 0.9474 | 0.9195 | 0.9403 | 0.9380 | 0.8652 | 0.8922 | 0.8621 | 0.9341 |
| 2024-04-27 | InternVL 1.5 Plus (generalist) | 0.9234 | 0.8354 | 0.9556 | 0.9123 | 0.9397 | 0.9032 | 0.8313 | 0.9064 | 0.9655 | 0.9098 |
| 2023-12-07 | qwenvl-plus (single generalist model) | 0.9141 | 0.8146 | 0.9464 | 0.8999 | 0.9277 | 0.9265 | 0.8419 | 0.8776 | 0.9310 | 0.8667 |
| 2024-04-20 | InternVL 1.5 (generalist) | 0.9085 | 0.8185 | 0.9416 | 0.8940 | 0.9306 | 0.8877 | 0.8373 | 0.8830 | 0.7931 | 0.8698 |
| 2023-11-15 | SMoLA-PaLI-X Specialist Model | 0.9084 | 0.7790 | 0.9416 | 0.8934 | 0.9262 | 0.9188 | 0.7911 | 0.8508 | 0.8966 | 0.8456 |
| 2023-12-07 | SMoLA-PaLI-X Generalist Model | 0.9055 | 0.7757 | 0.9381 | 0.8924 | 0.9187 | 0.9179 | 0.8364 | 0.8483 | 0.7446 | 0.8609 |
| 2024-05-01 | Snowflake Arctic-TILT 0.8B (fine-tuned) | 0.9020 | 0.7198 | 0.9398 | 0.9152 | 0.9015 | 0.9042 | 0.6860 | 0.8415 | 0.6897 | 0.8604 |
| 2022-10-08 | BAIDU-DI | 0.9016 | 0.6823 | 0.9186 | 0.9139 | 0.9138 | 0.9234 | 0.6841 | 0.7949 | 0.6181 | 0.8344 |
| 2024-04-02 | InternLM-XComposer2-4KHD-7B | 0.9002 | 0.8041 | 0.9400 | 0.8965 | 0.9143 | 0.8618 | 0.7845 | 0.8264 | 0.8621 | 0.8298 |
| 2024-02-10 | ScreenAI 5B | 0.8988 | 0.7297 | 0.9419 | 0.8928 | 0.9158 | 0.8873 | 0.7722 | 0.8160 | 0.8966 | 0.8551 |
| 2024-05-01 | Snowflake Arctic-TILT 0.8B (zero-shot) | 0.8881 | 0.6826 | 0.9311 | 0.9011 | 0.8867 | 0.8917 | 0.6534 | 0.8219 | 0.6897 | 0.8515 |
| 2022-03-31 | Tencent Youtu | 0.8866 | 0.7576 | 0.9470 | 0.8932 | 0.8821 | 0.8654 | 0.6680 | 0.8877 | 0.4828 | 0.8413 |
| 2022-01-13 | ERNIE-Layout 2.0 | 0.8841 | 0.6434 | 0.9177 | 0.8996 | 0.8899 | 0.9010 | 0.6223 | 0.7836 | 0.6124 | 0.8118 |
| 2023-12-10 | DocFormerv2 (Single Model with 750M Parameters) | 0.8784 | 0.6680 | 0.9382 | 0.9076 | 0.8676 | 0.8555 | 0.5840 | 0.8123 | 0.8276 | 0.8070 |
| 2021-11-26 | Mybank-DocReader | 0.8755 | 0.6682 | 0.9233 | 0.8763 | 0.8896 | 0.8713 | 0.6290 | 0.8047 | 0.5805 | 0.7804 |
| 2021-09-06 | ERNIE-Layout 1.0 | 0.8753 | 0.6586 | 0.8972 | 0.8864 | 0.8902 | 0.8943 | 0.6392 | 0.7331 | 0.5434 | 0.8115 |
| 2021-02-12 | Applica.ai TILT | 0.8705 | 0.6082 | 0.9459 | 0.8980 | 0.8592 | 0.8581 | 0.5508 | 0.8139 | 0.6897 | 0.7788 |
| 2023-05-31 | PaLI-X (Google Research; Single Generative Model) | 0.8679 | 0.6971 | 0.8992 | 0.8400 | 0.8955 | 0.8925 | 0.7589 | 0.7209 | 0.8966 | 0.8468 |
| 2020-12-22 | LayoutLM 2.0 (single model) | 0.8672 | 0.6574 | 0.8953 | 0.8769 | 0.8791 | 0.8707 | 0.7287 | 0.6729 | 0.5517 | 0.8103 |
| 2024-01-24 | nnrc_vary | 0.8631 | 0.6689 | 0.9174 | 0.8354 | 0.8876 | 0.8761 | 0.6891 | 0.8269 | 0.6207 | 0.7696 |
| 2023-12-10 | 54_nnrc_zephyr | 0.8560 | 0.6170 | 0.8924 | 0.8603 | 0.8546 | 0.9020 | 0.6083 | 0.8142 | 0.7488 | 0.8386 |
| 2020-08-16 | Alibaba DAMO NLP | 0.8506 | 0.6650 | 0.8809 | 0.8552 | 0.8733 | 0.8397 | 0.6758 | 0.7691 | 0.5492 | 0.7526 |
| 2020-05-16 | PingAn-OneConnect-Gammalab-DQA | 0.8484 | 0.6059 | 0.9021 | 0.8463 | 0.8730 | 0.8337 | 0.5812 | 0.7692 | 0.5172 | 0.7289 |
| 2024-01-21 | Spatial LLM v1.2 | 0.8443 | 0.6300 | 0.8917 | 0.8180 | 0.8644 | 0.8877 | 0.6106 | 0.7390 | 0.6897 | 0.8097 |
| 2023-02-21 | LayoutLMv2_star_seg_large | 0.8430 | 0.7008 | 0.8737 | 0.8389 | 0.8536 | 0.8498 | 0.6872 | 0.7823 | 0.6181 | 0.8252 |
| 2024-01-12 | Spatial LLM v1.1 | 0.8406 | 0.6128 | 0.8872 | 0.8127 | 0.8615 | 0.8991 | 0.6406 | 0.7404 | 0.6897 | 0.8083 |
| 2023-06-30 | LATIN-Prompt + Claude (Zero shot) | 0.8336 | 0.6601 | 0.8553 | 0.8584 | 0.8169 | 0.8726 | 0.6021 | 0.6774 | 0.7126 | 0.8258 |
| 2023-12-01 | nnrc mplugowl2_9k | 0.8281 | 0.5780 | 0.8949 | 0.7860 | 0.8662 | 0.8631 | 0.6302 | 0.8054 | 0.5517 | 0.7867 |
| 2024-01-10 | Spatial LLM v1 | 0.8244 | 0.5842 | 0.8708 | 0.7949 | 0.8457 | 0.8986 | 0.6095 | 0.7167 | 0.6207 | 0.8082 |
| 2023-11-27 | 36_nnrc_llama2 | 0.8239 | 0.5404 | 0.8787 | 0.7958 | 0.8475 | 0.8813 | 0.5995 | 0.7991 | 0.6897 | 0.7922 |
| 2024-01-11 | nnrc_udop_224_6ds | 0.8227 | 0.5909 | 0.8706 | 0.8352 | 0.8335 | 0.8086 | 0.5972 | 0.6835 | 0.5862 | 0.7472 |
| 2023-05-06 | Docugami-Layout | 0.8031 | 0.5176 | 0.8875 | 0.7902 | 0.8214 | 0.8026 | 0.5089 | 0.7753 | 0.4224 | 0.7022 |
| 2024-03-01 | Vary | 0.7916 | 0.7415 | 0.7949 | 0.7378 | 0.8475 | 0.8101 | 0.6671 | 0.6552 | 0.7471 | 0.7888 |
| 2022-01-07 | LayoutLMV2-large on Textract | 0.7873 | 0.4924 | 0.8771 | 0.8218 | 0.7726 | 0.7661 | 0.4820 | 0.7276 | 0.3793 | 0.6983 |
| 2023-01-29 | LayoutLMv2_star_seg | 0.7859 | 0.5328 | 0.8406 | 0.7859 | 0.8128 | 0.7909 | 0.4879 | 0.6468 | 0.3644 | 0.6953 |
| 2023-05-25 | YoBerDaV2 Single-page | 0.7749 | 0.4737 | 0.8894 | 0.7586 | 0.7962 | 0.7398 | 0.4763 | 0.7173 | 0.7586 | 0.6976 |
| 2020-05-14 | Structural LM-v2 | 0.7674 | 0.4931 | 0.8381 | 0.7621 | 0.7924 | 0.7596 | 0.4756 | 0.6282 | 0.5517 | 0.6549 |
| 2022-09-18 | pix2struct-large | 0.7656 | 0.4424 | 0.8827 | 0.7702 | 0.7774 | 0.7085 | 0.5383 | 0.6320 | 0.7586 | 0.6536 |
| 2022-12-28 | Submission_ErnieLayout_base_finetuned_on_DocVQA_en_train_dev_textract_word_segments_ck-14000 | 0.7599 | 0.4313 | 0.8678 | 0.7726 | 0.7641 | 0.7330 | 0.4598 | 0.6957 | 0.4828 | 0.6097 |
| 2024-02-13 | instructblip | 0.7429 | 0.5158 | 0.7918 | 0.7019 | 0.7751 | 0.8088 | 0.5765 | 0.5892 | 0.5172 | 0.7062 |
| 2020-05-15 | QA_Base_MRC_2 | 0.7415 | 0.4854 | 0.8015 | 0.6738 | 0.7943 | 0.8136 | 0.5740 | 0.5831 | 0.5287 | 0.7161 |
| 2020-05-15 | QA_Base_MRC_1 | 0.7407 | 0.4890 | 0.7984 | 0.6675 | 0.7936 | 0.8131 | 0.5854 | 0.6099 | 0.4943 | 0.7384 |
| 2020-05-15 | QA_Base_MRC_4 | 0.7348 | 0.4735 | 0.8040 | 0.6647 | 0.7838 | 0.8043 | 0.5618 | 0.5810 | 0.4598 | 0.7332 |
| 2020-05-15 | QA_Base_MRC_3 | 0.7322 | 0.4852 | 0.7958 | 0.6562 | 0.7842 | 0.8044 | 0.5679 | 0.5730 | 0.4511 | 0.7171 |
| 2024-01-22 | OCRF-ALT-c30 | 0.7285 | 0.3822 | 0.8695 | 0.7234 | 0.7508 | 0.6717 | 0.3656 | 0.6748 | 0.6897 | 0.5507 |
| 2020-05-15 | QA_Base_MRC_5 | 0.7274 | 0.4858 | 0.7877 | 0.6550 | 0.7754 | 0.8047 | 0.5405 | 0.5619 | 0.4598 | 0.7084 |
| 2022-09-18 | pix2struct-base | 0.7213 | 0.4111 | 0.8386 | 0.7253 | 0.7503 | 0.6407 | 0.4211 | 0.5753 | 0.6552 | 0.5822 |
| 2024-04-02 | MiniCPM-V-2 | 0.7187 | 0.6012 | 0.8062 | 0.6312 | 0.7880 | 0.6753 | 0.6834 | 0.6789 | 0.7586 | 0.6464 |
| 2023-01-27 | LayoutLM-base+GNN | 0.6984 | 0.4747 | 0.7973 | 0.6848 | 0.7322 | 0.6323 | 0.4398 | 0.5599 | 0.5431 | 0.5388 |
| 2021-12-05 | Electra Large Squad | 0.6961 | 0.4485 | 0.7703 | 0.6348 | 0.7364 | 0.7644 | 0.4594 | 0.5438 | 0.5172 | 0.6470 |
| 2023-05-25 | YoBerDaV1 Multi-page | 0.6904 | 0.3481 | 0.8335 | 0.6411 | 0.7253 | 0.6854 | 0.4191 | 0.6299 | 0.5517 | 0.6129 |
| 2020-05-16 | HyperDQA_V4 | 0.6893 | 0.3874 | 0.7792 | 0.6309 | 0.7478 | 0.7187 | 0.4867 | 0.5630 | 0.4138 | 0.5685 |
| 2020-05-16 | HyperDQA_V3 | 0.6769 | 0.3876 | 0.7774 | 0.6167 | 0.7332 | 0.6961 | 0.4296 | 0.5373 | 0.4138 | 0.5650 |
| 2023-07-06 | GPT3.5 | 0.6759 | 0.4741 | 0.7144 | 0.6524 | 0.7036 | 0.6858 | 0.5385 | 0.5038 | 0.5954 | 0.6660 |
| 2020-05-16 | HyperDQA_V2 | 0.6734 | 0.3818 | 0.7666 | 0.6110 | 0.7332 | 0.6867 | 0.4834 | 0.5560 | 0.3793 | 0.5902 |
| 2020-05-09 | HyperDQA_V1 | 0.6717 | 0.4013 | 0.7693 | 0.6197 | 0.7167 | 0.6922 | 0.3598 | 0.5596 | 0.4138 | 0.5504 |
| 2023-08-15 | LATIN-Tuning-Prompt + Alpaca (Zero-shot) | 0.6687 | 0.3732 | 0.7529 | 0.6545 | 0.6615 | 0.7463 | 0.5439 | 0.4941 | 0.3481 | 0.6831 |
| 2023-07-14 | donut_base | 0.6590 | 0.3960 | 0.8407 | 0.6604 | 0.6987 | 0.4630 | 0.2969 | 0.6964 | 0.0345 | 0.5057 |
| 2023-12-04 | ViTLP | 0.6588 | 0.3880 | 0.8220 | 0.6705 | 0.6962 | 0.4670 | 0.2973 | 0.6307 | 0.4483 | 0.4910 |
| 2023-12-21 | DocVQA: A Dataset for VQA on Document Images | 0.6566 | 0.3569 | 0.7645 | 0.5775 | 0.7000 | 0.7205 | 0.4220 | 0.4802 | 0.4483 | 0.6108 |
| 2022-09-22 | BROS_BASE (WebViCoB 6.4M) | 0.6563 | 0.3780 | 0.7757 | 0.6681 | 0.6557 | 0.6175 | 0.3497 | 0.5782 | 0.4224 | 0.5754 |
| 2023-09-24 | Layoutlm_DocVQA+Token_v2 | 0.6562 | 0.3935 | 0.7764 | 0.6228 | 0.6737 | 0.6711 | 0.3385 | 0.5109 | 0.5086 | 0.5515 |
| 2023-07-21 | donut_half_input_imageSize | 0.6536 | 0.3930 | 0.8366 | 0.6548 | 0.6950 | 0.4609 | 0.2486 | 0.6940 | 0.0345 | 0.4941 |
| 2021-12-04 | Bert Large | 0.6447 | 0.3502 | 0.7535 | 0.5488 | 0.6920 | 0.7266 | 0.4171 | 0.5254 | 0.5517 | 0.6076 |
| 2022-05-23 | Dessurt | 0.6322 | 0.3164 | 0.8058 | 0.6486 | 0.6520 | 0.4852 | 0.2862 | 0.5830 | 0.3793 | 0.4365 |
| 2024-03-17 | DOLMA | 0.6205 | 0.3336 | 0.7625 | 0.6009 | 0.6553 | 0.5347 | 0.3283 | 0.4656 | 0.5172 | 0.4913 |
| 2024-01-09 | dolma | 0.6196 | 0.4003 | 0.7642 | 0.5805 | 0.6609 | 0.5247 | 0.3958 | 0.5596 | 0.5690 | 0.4972 |
| 2020-05-09 | bert fulldata fintuned | 0.5900 | 0.4169 | 0.6870 | 0.4269 | 0.6710 | 0.7315 | 0.5124 | 0.4900 | 0.4483 | 0.5907 |
| 2020-05-01 | bert finetuned | 0.5872 | 0.2986 | 0.7011 | 0.4849 | 0.6359 | 0.6933 | 0.4622 | 0.4751 | 0.4483 | 0.4895 |
| 2020-04-30 | HyperDQA_V0 | 0.5715 | 0.3131 | 0.6780 | 0.4732 | 0.6630 | 0.5716 | 0.3623 | 0.4351 | 0.3793 | 0.4941 |
| 2023-09-26 | LayoutLM_Docvqa+Token_v0 | 0.4980 | 0.2319 | 0.6035 | 0.4320 | 0.5684 | 0.4779 | 0.2768 | 0.3081 | 0.1293 | 0.4178 |
| 2022-04-27 | LayoutLMv2, Tesseract OCR eval (dataset OCR trained) | 0.4961 | 0.2544 | 0.5523 | 0.4177 | 0.5495 | 0.5914 | 0.2888 | 0.1361 | 0.2069 | 0.4187 |
| 2022-03-29 | LayoutLMv2, Tesseract OCR eval (Tesseract OCR trained) | 0.4815 | 0.2253 | 0.5440 | 0.4216 | 0.5207 | 0.5709 | 0.2430 | 0.1353 | 0.3103 | 0.3859 |
| 2023-07-26 | donut_large_encoderSize_finetuned_20_epoch | 0.4673 | 0.2236 | 0.6691 | 0.4581 | 0.5026 | 0.2665 | 0.1356 | 0.4983 | 0.5734 | 0.3430 |
| 2020-04-27 | bert | 0.4557 | 0.2233 | 0.5259 | 0.2633 | 0.5113 | 0.7775 | 0.4859 | 0.3565 | 0.0345 | 0.5778 |
| 2020-05-16 | UGLIFT v0.1 (Clova OCR) | 0.4417 | 0.1766 | 0.5600 | 0.3178 | 0.5340 | 0.4520 | 0.2253 | 0.3573 | 0.4483 | 0.3356 |
| 2022-10-21 | Finetuning LayoutLMv3_Base | 0.3596 | 0.2102 | 0.4498 | 0.3858 | 0.3262 | 0.3496 | 0.1552 | 0.3404 | 0.0345 | 0.2706 |
| 2023-09-19 | testtest | 0.3569 | 0.3018 | 0.3407 | 0.2748 | 0.4693 | 0.3186 | 0.2682 | 0.2753 | 0.6207 | 0.3356 |
| 2020-05-14 | Plain BERT QA | 0.3524 | 0.1687 | 0.4489 | 0.2029 | 0.4321 | 0.4812 | 0.3517 | 0.3096 | 0.0345 | 0.3747 |
| 2020-05-16 | Clova OCR V0 | 0.3489 | 0.0977 | 0.4855 | 0.2670 | 0.3811 | 0.3958 | 0.2489 | 0.2875 | 0.0345 | 0.3062 |
| 2020-05-01 | HDNet | 0.3401 | 0.2040 | 0.4688 | 0.2181 | 0.4710 | 0.1916 | 0.2488 | 0.2736 | 0.1379 | 0.2458 |
| 2020-05-16 | CLOVA OCR | 0.3296 | 0.1246 | 0.4612 | 0.2455 | 0.3622 | 0.3746 | 0.1692 | 0.2736 | 0.0690 | 0.3205 |
| 2023-07-21 | donut_small_encoderSize_finetuned_20_epoch | 0.3157 | 0.1935 | 0.4417 | 0.2912 | 0.3400 | 0.2075 | 0.1495 | 0.2658 | 0.3103 | 0.2644 |
| 2020-04-29 | docVQAQV_V0.1 | 0.3016 | 0.2010 | 0.3898 | 0.3810 | 0.2933 | 0.0664 | 0.1842 | 0.2736 | 0.1586 | 0.1695 |
| 2020-04-26 | docVQAQV_V0 | 0.2342 | 0.1646 | 0.3133 | 0.2623 | 0.2483 | 0.0549 | 0.2277 | 0.1856 | 0.1034 | 0.1635 |
| 2021-02-08 | seq2seq | 0.1081 | 0.0758 | 0.1283 | 0.0829 | 0.1332 | 0.0822 | 0.0786 | 0.0779 | 0.4828 | 0.1052 |
| 2024-01-23 | lixiang-vlm-7b-handled | 0.0990 | 0.0478 | 0.0798 | 0.0348 | 0.1648 | 0.0863 | 0.1309 | 0.1395 | 0.5517 | 0.1191 |
| 2024-01-24 | lixiang-vlm-7b | 0.0631 | 0.0313 | 0.0693 | 0.0272 | 0.0894 | 0.0639 | 0.0122 | 0.1145 | 0.5517 | 0.0826 |
| 2024-01-21 | lixiang-vlm handled | 0.0536 | 0.0243 | 0.0272 | 0.0097 | 0.1084 | 0.0400 | 0.0605 | 0.0395 | 0.1034 | 0.0568 |
| 2024-01-21 | lixiang-vlm | 0.0264 | 0.0176 | 0.0123 | 0.0045 | 0.0502 | 0.0262 | 0.0078 | 0.0291 | 0.1034 | 0.0273 |
| 2020-06-16 | Test Submission | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
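The Score columns are ANLS values (Average Normalized Levenshtein Similarity), the standard DocVQA metric: each prediction is compared against every accepted ground-truth answer, matches with normalized edit distance at or above a 0.5 threshold score zero, and the best per-question similarity is averaged over all questions. A minimal sketch of that computation (the function names and lowercasing details here are illustrative, not the official evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance between a[:i] and b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            # dp[j]: deletion; dp[j-1]: insertion; prev: substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1,
                                     dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[n]

def anls(predictions, ground_truths, tau=0.5):
    """ANLS over all questions.

    predictions: one answer string per question.
    ground_truths: a list of accepted answer strings per question.
    """
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            # Similarities below the threshold tau count as zero.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

For example, an exact match scores 1.0, a one-character slip over a five-character answer scores 0.8, and an unrelated answer scores 0.0, which is why near-miss OCR errors still earn partial credit on this leaderboard.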

Ranking Graphic: (interactive score-over-time chart not reproduced)