- Task 1 - Single Page Document VQA
- Task 2 - Document Collection VQA
- Task 3 - Infographics VQA
- Task 4 - MP-DocVQA
method: InternVL2.5-78B-MPO (generalist)
Date: 2024-12-24
Authors: InternVL team
Affiliation: Shanghai AI Laboratory & Tsinghua University
Email: wangweiyun@pjlab.org.cn
Description: InternVL2.5-MPO: Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
method: InternVL2-Pro (generalist)
Date: 2024-06-30
Authors: InternVL team
Affiliation: Shanghai AI Laboratory & Sensetime & Tsinghua University
Email: czcz94cz@gmail.com
Description: InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-4V
Demo: https://internvl.opengvlab.com/
Code: https://github.com/OpenGVLab/InternVL
Model: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5
method: Molmo-72B
Date: 2024-09-25
Authors: Molmo Team
Affiliation: Allen Institute for Artificial Intelligence
Description: The 72B member of the Molmo family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs, and have open-source weights, training data, and training recipe.
Date | Method | Score | Answer type: Image span | Answer type: Question span | Answer type: Multiple spans | Answer type: Non span | Evidence: Table/List | Evidence: Textual | Evidence: Visual object | Evidence: Figure | Evidence: Map | Operation: Comparison | Operation: Arithmetic | Operation: Counting
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2024-12-24 | InternVL2.5-78B-MPO (generalist) | 0.8428 | 0.8765 | 0.8753 | 0.6977 | 0.7357 | 0.8313 | 0.9247 | 0.8398 | 0.8229 | 0.7338 | 0.7280 | 0.8812 | 0.5865
2024-06-30 | InternVL2-Pro (generalist) | 0.8334 | 0.8681 | 0.8929 | 0.7350 | 0.6969 | 0.8335 | 0.9260 | 0.7757 | 0.8093 | 0.7186 | 0.7301 | 0.8584 | 0.5368
2024-09-25 | Molmo-72B | 0.8186 | 0.8513 | 0.8827 | 0.6821 | 0.7041 | 0.8184 | 0.9136 | 0.8062 | 0.7945 | 0.6960 | 0.7054 | 0.8188 | 0.5930
2025-01-10 | VideoLLaMA3-7B | 0.7893 | 0.8269 | 0.8358 | 0.6845 | 0.6447 | 0.7936 | 0.9165 | 0.7446 | 0.7499 | 0.6661 | 0.6411 | 0.7785 | 0.5179
2024-12-13 | DeepSeek-VL2 | 0.7814 | 0.8189 | 0.8010 | 0.6989 | 0.6363 | 0.7935 | 0.9041 | 0.7371 | 0.7434 | 0.6327 | 0.6206 | 0.7282 | 0.5326
2024-04-27 | InternVL-1.5-Plus (generalist) | 0.7574 | 0.7989 | 0.8124 | 0.6425 | 0.5987 | 0.7544 | 0.8733 | 0.7306 | 0.7234 | 0.6216 | 0.6065 | 0.7386 | 0.4623
2024-11-01 | MLCD-Embodied-7B: Multi-label Cluster Discrimination for Visual Representation Learning | 0.6998 | 0.7330 | 0.7930 | 0.5955 | 0.5564 | 0.6951 | 0.8271 | 0.6654 | 0.6614 | 0.5495 | 0.5523 | 0.6350 | 0.4905
2024-04-02 | InternLM-XComposer2-4KHD-7B | 0.6855 | 0.7336 | 0.7570 | 0.5151 | 0.5124 | 0.6643 | 0.8240 | 0.6598 | 0.6471 | 0.5241 | 0.5120 | 0.6636 | 0.3610
2024-05-21 | PaliGemma-3B (finetune, 896px) | 0.4775 | 0.5214 | 0.5372 | 0.3301 | 0.3220 | 0.4500 | 0.6057 | 0.4252 | 0.4377 | 0.3690 | 0.3742 | 0.3924 | 0.2507
2024-05-21 | PaliGemma-3B (finetune, 448px) | 0.4047 | 0.4275 | 0.5801 | 0.2560 | 0.3007 | 0.4010 | 0.4853 | 0.3898 | 0.3742 | 0.3178 | 0.3530 | 0.3360 | 0.2517
2024-05-21 | PaliGemma-3B (finetune, 224px) | 0.2846 | 0.2888 | 0.5024 | 0.1567 | 0.2425 | 0.2675 | 0.3206 | 0.3164 | 0.2609 | 0.2406 | 0.2979 | 0.2025 | 0.2730