Method: MiMo-VL-7B-RL

Date: 2025-06-04

Authors: Xiaomi LLM-Core

Affiliation: Xiaomi

Description: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models
delivering state-of-the-art performance in both general visual understanding and multimodal
reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and
scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding
applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized
models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with
Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify
the importance of incorporating high-quality reasoning data with long Chain-of-Thought into
pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain
optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote
reproducibility and advance the field. The model checkpoints and full evaluation suite are
available at https://github.com/XiaomiMiMo/MiMo-VL.

Ranking Table

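Column groups: Answer type (Image span, Question span, Multiple spans, Non span); Evidence (Table/List, Textual, Visual object, Figure, Map); Operation (Comparison, Arithmetic, Counting).

| Date | Method | Score | Image span | Question span | Multiple spans | Non span | Table/List | Textual | Visual object | Figure | Map | Comparison | Arithmetic | Counting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-06-04 | MiMo-VL-7B-RL | 0.8806 | 0.9005 | 0.8949 | 0.8446 | 0.8092 | 0.9008 | 0.9332 | 0.8373 | 0.8577 | 0.7298 | 0.8264 | 0.8913 | 0.7439 |
| 2024-06-30 | InternVL2-Pro (generalist) | 0.8334 | 0.8681 | 0.8929 | 0.7350 | 0.6969 | 0.8335 | 0.9260 | 0.7757 | 0.8093 | 0.7186 | 0.7301 | 0.8584 | 0.5368 |
| 2024-09-25 | Molmo-72B | 0.8186 | 0.8513 | 0.8827 | 0.6821 | 0.7041 | 0.8184 | 0.9136 | 0.8062 | 0.7945 | 0.6960 | 0.7054 | 0.8188 | 0.5930 |
| 2025-01-10 | VideoLLaMA3-7B | 0.7893 | 0.8269 | 0.8358 | 0.6845 | 0.6447 | 0.7936 | 0.9165 | 0.7446 | 0.7499 | 0.6661 | 0.6411 | 0.7785 | 0.5179 |
| 2024-04-27 | InternVL-1.5-Plus (generalist) | 0.7574 | 0.7989 | 0.8124 | 0.6425 | 0.5987 | 0.7544 | 0.8733 | 0.7306 | 0.7234 | 0.6216 | 0.6065 | 0.7386 | 0.4623 |
| 2024-05-31 | GPT-4 Vision Turbo + Amazon Textract OCR | 0.7191 | 0.7575 | 0.7795 | 0.6591 | 0.5553 | 0.7183 | 0.8201 | 0.6696 | 0.6904 | 0.6926 | 0.5815 | 0.6759 | 0.4281 |
| 2024-11-01 | MLCD-Embodied-7B: Multi-label Cluster Discrimination for Visual Representation Learning | 0.6998 | 0.7330 | 0.7930 | 0.5955 | 0.5564 | 0.6951 | 0.8271 | 0.6654 | 0.6614 | 0.5495 | 0.5523 | 0.6350 | 0.4905 |
| 2023-11-15 | SMoLA-PaLI-X Specialist Model | 0.6621 | 0.7166 | 0.7252 | 0.5838 | 0.4292 | 0.6448 | 0.8261 | 0.6714 | 0.6110 | 0.5065 | 0.5238 | 0.5054 | 0.3506 |
| 2024-02-10 | ScreenAI 5B | 0.6590 | 0.7162 | 0.7247 | 0.5734 | 0.4140 | 0.6525 | 0.8315 | 0.5968 | 0.6020 | 0.4467 | 0.4815 | 0.5303 | 0.3000 |
| 2021-04-11 | Applica.ai TILT | 0.6120 | 0.6765 | 0.6419 | 0.4391 | 0.3832 | 0.5917 | 0.7916 | 0.4545 | 0.5654 | 0.4480 | 0.4801 | 0.4958 | 0.2652 |
| 2024-07-22 | Snowflake Arctic-TILT 0.8B | 0.5695 | 0.6274 | 0.6074 | 0.4123 | 0.3653 | 0.5478 | 0.7530 | 0.4204 | 0.5109 | 0.4410 | 0.4350 | 0.5042 | 0.2238 |
| 2023-08-20 | PaLI-X (Google Research, Single Generative Model) | 0.5477 | 0.5940 | 0.6950 | 0.4122 | 0.3534 | 0.5145 | 0.6891 | 0.6373 | 0.5040 | 0.4013 | 0.4290 | 0.4053 | 0.3091 |
| 2022-03-03 | InfographicVQA paper model | 0.2720 | 0.3278 | 0.2386 | 0.0450 | 0.1371 | 0.2400 | 0.3626 | 0.1705 | 0.2551 | 0.2205 | 0.1836 | 0.1559 | 0.1140 |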
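
To make the gap between the two highest-ranked entries easier to read, the short Python sketch below subtracts the InternVL2-Pro (generalist) row from the MiMo-VL-7B-RL row, column by column. The numbers are copied from the table above; the names `COLUMNS`, `margins`, and `extremes` are illustrative helpers introduced here and are not part of the leaderboard or of any MiMo-VL tooling.

```python
# Column names in leaderboard order (overall score first).
COLUMNS = [
    "Score", "Image span", "Question span", "Multiple spans", "Non span",
    "Table/List", "Textual", "Visual object", "Figure", "Map",
    "Comparison", "Arithmetic", "Counting",
]

# Values copied from the two highest-ranked rows of the ranking table.
MIMO_VL_7B_RL = [0.8806, 0.9005, 0.8949, 0.8446, 0.8092, 0.9008, 0.9332,
                 0.8373, 0.8577, 0.7298, 0.8264, 0.8913, 0.7439]
INTERNVL2_PRO = [0.8334, 0.8681, 0.8929, 0.7350, 0.6969, 0.8335, 0.9260,
                 0.7757, 0.8093, 0.7186, 0.7301, 0.8584, 0.5368]


def margins(top, runner_up):
    """Return {column: top - runner_up}, rounded to 4 decimal places."""
    return {name: round(a - b, 4) for name, a, b in zip(COLUMNS, top, runner_up)}


def extremes(diffs):
    """Return the column names with the largest and smallest margin."""
    largest = max(diffs, key=diffs.get)
    smallest = min(diffs, key=diffs.get)
    return largest, smallest


if __name__ == "__main__":
    diffs = margins(MIMO_VL_7B_RL, INTERNVL2_PRO)
    for name, delta in diffs.items():
        print(f"{name:15s} {delta:+.4f}")
    print("largest / smallest margin:", extremes(diffs))
```

Under these assumptions the overall margin is 0.0472. The widest gap between these two rows is in the Counting column and the narrowest is in Question span. This compares only the top two rows by overall score, not each column's best competitor.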
