- Task 1 - Single Page Document VQA
- Task 2 - Document Collection VQA
- Task 3 - Infographics VQA
- Task 4 - MP-DocVQA
method: MiMo-VL-7B-RL (2025-06-04)
Authors: Xiaomi LLM-Core
Affiliation: Xiaomi
Description: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models
delivering state-of-the-art performance in both general visual understanding and multimodal
reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and
scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding
applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized
models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with
Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify
the importance of incorporating high-quality reasoning data with long Chain-of-Thought into
pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain
optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote
reproducibility and advance the field. The model checkpoints and full evaluation suite are
available at https://github.com/XiaomiMiMo/MiMo-VL.
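As a quick illustration of how a released checkpoint might be queried on a document image, here is a minimal inference sketch. It assumes the RL checkpoint is published on the Hugging Face Hub as `XiaomiMiMo/MiMo-VL-7B-RL` (repo id inferred from the GitHub link above, not confirmed here) and that it loads through the generic transformers image-text-to-text Auto classes; the image file, question, and generation settings are placeholders. Consult the official repository for the exact, supported usage.

```python
# Minimal inference sketch (assumptions: Hub repo id, generic Auto classes).
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "XiaomiMiMo/MiMo-VL-7B-RL"  # assumed Hub repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

image = Image.open("infographic.png")  # placeholder document/infographic image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What percentage is reported for 2020?"},
    ],
}]

# Build the chat prompt, then pack text + image into model inputs.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```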
method: InternVL2-Pro (generalist) (2024-06-30)
Authors: InternVL team
Affiliation: Shanghai AI Laboratory & Sensetime & Tsinghua University
Email: czcz94cz@gmail.com
Description: InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-4V
Demo: https://internvl.opengvlab.com/
Code: https://github.com/OpenGVLab/InternVL
Model: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5
method: Molmo-72B (2024-09-25)
Authors: Molmo Team
Affiliation: Allen Institute for Artificial Intelligence
Description: The 72B member of the Molmo family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs, and come with open-source weights, training data, and training recipes.
Scores are broken down by answer type (Image span, Question span, Multiple spans, Non span), evidence (Table/List, Textual, Visual object, Figure, Map), and operation (Comparison, Arithmetic, Counting).

| Date | Method | Score | Image span | Question span | Multiple spans | Non span | Table/List | Textual | Visual object | Figure | Map | Comparison | Arithmetic | Counting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-06-04 | MiMo-VL-7B-RL | 0.8806 | 0.9005 | 0.8949 | 0.8446 | 0.8092 | 0.9008 | 0.9332 | 0.8373 | 0.8577 | 0.7298 | 0.8264 | 0.8913 | 0.7439 |
| 2024-06-30 | InternVL2-Pro (generalist) | 0.8334 | 0.8681 | 0.8929 | 0.7350 | 0.6969 | 0.8335 | 0.9260 | 0.7757 | 0.8093 | 0.7186 | 0.7301 | 0.8584 | 0.5368 |
| 2024-09-25 | Molmo-72B | 0.8186 | 0.8513 | 0.8827 | 0.6821 | 0.7041 | 0.8184 | 0.9136 | 0.8062 | 0.7945 | 0.6960 | 0.7054 | 0.8188 | 0.5930 |
| 2025-01-10 | VideoLLaMA3-7B | 0.7893 | 0.8269 | 0.8358 | 0.6845 | 0.6447 | 0.7936 | 0.9165 | 0.7446 | 0.7499 | 0.6661 | 0.6411 | 0.7785 | 0.5179 |
| 2024-04-27 | InternVL-1.5-Plus (generalist) | 0.7574 | 0.7989 | 0.8124 | 0.6425 | 0.5987 | 0.7544 | 0.8733 | 0.7306 | 0.7234 | 0.6216 | 0.6065 | 0.7386 | 0.4623 |
| 2024-05-31 | GPT-4 Vision Turbo + Amazon Textract OCR | 0.7191 | 0.7575 | 0.7795 | 0.6591 | 0.5553 | 0.7183 | 0.8201 | 0.6696 | 0.6904 | 0.6926 | 0.5815 | 0.6759 | 0.4281 |
| 2024-11-01 | MLCD-Embodied-7B: Multi-label Cluster Discrimination for Visual Representation Learning | 0.6998 | 0.7330 | 0.7930 | 0.5955 | 0.5564 | 0.6951 | 0.8271 | 0.6654 | 0.6614 | 0.5495 | 0.5523 | 0.6350 | 0.4905 |
| 2023-11-15 | SMoLA-PaLI-X Specialist Model | 0.6621 | 0.7166 | 0.7252 | 0.5838 | 0.4292 | 0.6448 | 0.8261 | 0.6714 | 0.6110 | 0.5065 | 0.5238 | 0.5054 | 0.3506 |
| 2024-02-10 | ScreenAI 5B | 0.6590 | 0.7162 | 0.7247 | 0.5734 | 0.4140 | 0.6525 | 0.8315 | 0.5968 | 0.6020 | 0.4467 | 0.4815 | 0.5303 | 0.3000 |
| 2021-04-11 | Applica.ai TILT | 0.6120 | 0.6765 | 0.6419 | 0.4391 | 0.3832 | 0.5917 | 0.7916 | 0.4545 | 0.5654 | 0.4480 | 0.4801 | 0.4958 | 0.2652 |
| 2024-07-22 | Snowflake Arctic-TILT 0.8B | 0.5695 | 0.6274 | 0.6074 | 0.4123 | 0.3653 | 0.5478 | 0.7530 | 0.4204 | 0.5109 | 0.4410 | 0.4350 | 0.5042 | 0.2238 |
| 2023-08-20 | PaLI-X (Google Research, Single Generative Model) | 0.5477 | 0.5940 | 0.6950 | 0.4122 | 0.3534 | 0.5145 | 0.6891 | 0.6373 | 0.5040 | 0.4013 | 0.4290 | 0.4053 | 0.3091 |
| 2022-03-03 | InfographicVQA paper model | 0.2720 | 0.3278 | 0.2386 | 0.0450 | 0.1371 | 0.2400 | 0.3626 | 0.1705 | 0.2551 | 0.2205 | 0.1836 | 0.1559 | 0.1140 |
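The leaderboard labels the metric only as "Score", but the DocVQA-series challenges report Average Normalized Levenshtein Similarity (ANLS), which gives partial credit for near-miss strings down to a similarity threshold of 0.5. The sketch below is an illustrative re-implementation of that definition, not the official evaluation kit; the helper names `levenshtein` and `anls` are made up here.

```python
# Illustrative ANLS sketch (assumption: the Score column follows the standard
# DocVQA/InfographicVQA ANLS protocol with threshold tau = 0.5).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions: list[str], gold_answers: list[list[str]], tau: float = 0.5) -> float:
    """Average over questions of the best similarity to any ground-truth answer."""
    total = 0.0
    for pred, answers in zip(predictions, gold_answers):
        best = 0.0
        for ans in answers:
            p, g = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)   # normalized distance
            best = max(best, 1.0 - nl if nl < tau else 0.0)   # hard cutoff at tau
        total += best
    return total / max(len(predictions), 1)


# Example: one question with two accepted ground-truth answers.
print(anls(["42%"], [["42 %", "42%"]]))  # -> 1.0 (exact match with the second answer)
```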