- Task 1 - Single Page Document VQA
- Task 2 - Document Collection VQA
- Task 3 - Infographics VQA
- Task 4 - MP-DocVQA
method: InternVL2-Pro (generalist)
Date: 2024-06-30
Authors: InternVL team
Affiliation: Shanghai AI Laboratory & SenseTime & Tsinghua University
Email: czcz94cz@gmail.com
Description: InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites - A Pioneering Open-Source Alternative to GPT-4V
Demo: https://internvl.opengvlab.com/
Code: https://github.com/OpenGVLab/InternVL
Model: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5
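For reference, a minimal single-image inference sketch against the checkpoint above. This is a sketch under assumptions, not the official recipe: it relies on the custom `chat()` interface documented on the Hugging Face model card (hence `trust_remote_code=True`), replaces the card's dynamic-tiling `load_image()` helper with a single 448x448 tile, and uses a placeholder image path and question.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL-Chat-V1-5"
# trust_remote_code loads the repo's custom modeling code, which
# (per the model card) exposes a chat() convenience method.
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Single 448x448 tile with ImageNet normalization; the model card's
# load_image() helper additionally performs dynamic multi-tile cropping.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = "What is the highest value shown in the chart?"  # placeholder
generation_config = dict(num_beams=1, max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```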
method: Molmo-72B
Date: 2024-09-25
Authors: Molmo Team
Affiliation: Allen Institute for Artificial Intelligence
Description: The 72B member of the Molmo family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs, and are released with open weights, training data, and training recipe.
method: VideoLLaMA3-7B
Date: 2025-01-10
Authors: VideoLLaMA3 Team
Affiliation: DAMO Academy, Alibaba Group
Description: VideoLLaMA 3 is a series of multimodal foundation models built for both image and video understanding. The models are designed to integrate textual and visual information, extract insights from sequential video data, and perform high-level reasoning over both dynamic and static visual scenes.
Per-category scores group as follows: Answer type (Image span, Question span, Multiple spans, Non span), Evidence (Table/List, Textual, Visual object, Figure, Map), and Operation (Comparison, Arithmetic, Counting).

Date | Method | Score | Image span | Question span | Multiple spans | Non span | Table/List | Textual | Visual object | Figure | Map | Comparison | Arithmetic | Counting
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2024-06-30 | InternVL2-Pro (generalist) | 0.8334 | 0.8681 | 0.8929 | 0.7350 | 0.6969 | 0.8335 | 0.9260 | 0.7757 | 0.8093 | 0.7186 | 0.7301 | 0.8584 | 0.5368
2024-09-25 | Molmo-72B | 0.8186 | 0.8513 | 0.8827 | 0.6821 | 0.7041 | 0.8184 | 0.9136 | 0.8062 | 0.7945 | 0.6960 | 0.7054 | 0.8188 | 0.5930
2025-01-10 | VideoLLaMA3-7B | 0.7893 | 0.8269 | 0.8358 | 0.6845 | 0.6447 | 0.7936 | 0.9165 | 0.7446 | 0.7499 | 0.6661 | 0.6411 | 0.7785 | 0.5179
2024-04-27 | InternVL-1.5-Plus (generalist) | 0.7574 | 0.7989 | 0.8124 | 0.6425 | 0.5987 | 0.7544 | 0.8733 | 0.7306 | 0.7234 | 0.6216 | 0.6065 | 0.7386 | 0.4623
2024-05-31 | GPT-4 Vision Turbo + Amazon Textract OCR | 0.7191 | 0.7575 | 0.7795 | 0.6591 | 0.5553 | 0.7183 | 0.8201 | 0.6696 | 0.6904 | 0.6926 | 0.5815 | 0.6759 | 0.4281
2024-11-01 | MLCD-Embodied-7B: Multi-label Cluster Discrimination for Visual Representation Learning | 0.6998 | 0.7330 | 0.7930 | 0.5955 | 0.5564 | 0.6951 | 0.8271 | 0.6654 | 0.6614 | 0.5495 | 0.5523 | 0.6350 | 0.4905
2023-11-15 | SMoLA-PaLI-X Specialist Model | 0.6621 | 0.7166 | 0.7252 | 0.5838 | 0.4292 | 0.6448 | 0.8261 | 0.6714 | 0.6110 | 0.5065 | 0.5238 | 0.5054 | 0.3506
2024-02-10 | ScreenAI 5B | 0.6590 | 0.7162 | 0.7247 | 0.5734 | 0.4140 | 0.6525 | 0.8315 | 0.5968 | 0.6020 | 0.4467 | 0.4815 | 0.5303 | 0.3000
2021-04-11 | Applica.ai TILT | 0.6120 | 0.6765 | 0.6419 | 0.4391 | 0.3832 | 0.5917 | 0.7916 | 0.4545 | 0.5654 | 0.4480 | 0.4801 | 0.4958 | 0.2652
2024-07-22 | Snowflake Arctic-TILT 0.8B | 0.5695 | 0.6274 | 0.6074 | 0.4123 | 0.3653 | 0.5478 | 0.7530 | 0.4204 | 0.5109 | 0.4410 | 0.4350 | 0.5042 | 0.2238
2023-08-20 | PaLI-X (Google Research, Single Generative Model) | 0.5477 | 0.5940 | 0.6950 | 0.4122 | 0.3534 | 0.5145 | 0.6891 | 0.6373 | 0.5040 | 0.4013 | 0.4290 | 0.4053 | 0.3091
2022-03-03 | InfographicVQA paper model | 0.2720 | 0.3278 | 0.2386 | 0.0450 | 0.1371 | 0.2400 | 0.3626 | 0.1705 | 0.2551 | 0.2205 | 0.1836 | 0.1559 | 0.1140
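The Score columns report ANLS (Average Normalized Levenshtein Similarity), the official metric for the InfographicsVQA task, so all values lie in [0, 1]. A minimal self-contained sketch of the metric, assuming the challenge's usual threshold τ = 0.5 and simple lowercase/whitespace normalization (function names are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def anls(predictions, gold_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    Per question, the score is 1 - NL(pred, gold) maximized over the
    gold answers, and zeroed when NL exceeds the threshold tau
    (0.5 in the challenge).
    """
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        p = " ".join(pred.lower().split())
        best = 0.0
        for gold in golds:
            g = " ".join(gold.lower().split())
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


# e.g. anls(["19 million"], [["19 million", "19m"]]) == 1.0
```

The thresholding means a prediction more than half-wrong at the character level contributes nothing, while near-misses (OCR slips, minor spelling variants) still earn partial credit.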