Method: VideoLLaMA3-7B

Date: 2025-01-10

Authors: VideoLLaMA3 Team

Affiliation: DAMO Academy, Alibaba Group

Description: VideoLLaMA 3 is a state-of-the-art series of multimodal foundation models for image and video understanding. The models process and interpret visual content across a variety of contexts. They are designed to address complex multimodal challenges, such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.

Ranking Table

| Date | Method | Score | Answer type: Image span | Answer type: Question span | Answer type: Multiple spans | Answer type: Non span | Evidence: Table/List | Evidence: Textual | Evidence: Visual object | Evidence: Figure | Evidence: Map | Operation: Comparison | Operation: Arithmetic | Operation: Counting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-06-30 | InternVL2-Pro (generalist) | 0.8334 | 0.8681 | 0.8929 | 0.7350 | 0.6969 | 0.8335 | 0.9260 | 0.7757 | 0.8093 | 0.7186 | 0.7301 | 0.8584 | 0.5368 |
| 2024-09-25 | Molmo-72B | 0.8186 | 0.8513 | 0.8827 | 0.6821 | 0.7041 | 0.8184 | 0.9136 | 0.8062 | 0.7945 | 0.6960 | 0.7054 | 0.8188 | 0.5930 |
| 2025-01-10 | VideoLLaMA3-7B | 0.7893 | 0.8269 | 0.8358 | 0.6845 | 0.6447 | 0.7936 | 0.9165 | 0.7446 | 0.7499 | 0.6661 | 0.6411 | 0.7785 | 0.5179 |
| 2024-04-27 | InternVL-1.5-Plus (generalist) | 0.7574 | 0.7989 | 0.8124 | 0.6425 | 0.5987 | 0.7544 | 0.8733 | 0.7306 | 0.7234 | 0.6216 | 0.6065 | 0.7386 | 0.4623 |
| 2024-05-31 | GPT-4 Vision Turbo + Amazon Textract OCR | 0.7191 | 0.7575 | 0.7795 | 0.6591 | 0.5553 | 0.7183 | 0.8201 | 0.6696 | 0.6904 | 0.6926 | 0.5815 | 0.6759 | 0.4281 |
| 2024-11-01 | MLCD-Embodied-7B: Multi-label Cluster Discrimination for Visual Representation Learning | 0.6998 | 0.7330 | 0.7930 | 0.5955 | 0.5564 | 0.6951 | 0.8271 | 0.6654 | 0.6614 | 0.5495 | 0.5523 | 0.6350 | 0.4905 |
| 2023-11-15 | SMoLA-PaLI-X Specialist Model | 0.6621 | 0.7166 | 0.7252 | 0.5838 | 0.4292 | 0.6448 | 0.8261 | 0.6714 | 0.6110 | 0.5065 | 0.5238 | 0.5054 | 0.3506 |
| 2024-02-10 | ScreenAI 5B | 0.6590 | 0.7162 | 0.7247 | 0.5734 | 0.4140 | 0.6525 | 0.8315 | 0.5968 | 0.6020 | 0.4467 | 0.4815 | 0.5303 | 0.3000 |
| 2021-04-11 | Applica.ai TILT | 0.6120 | 0.6765 | 0.6419 | 0.4391 | 0.3832 | 0.5917 | 0.7916 | 0.4545 | 0.5654 | 0.4480 | 0.4801 | 0.4958 | 0.2652 |
| 2024-07-22 | Snowflake Arctic-TILT 0.8B | 0.5695 | 0.6274 | 0.6074 | 0.4123 | 0.3653 | 0.5478 | 0.7530 | 0.4204 | 0.5109 | 0.4410 | 0.4350 | 0.5042 | 0.2238 |
| 2023-08-20 | PaLI-X (Google Research, Single Generative Model) | 0.5477 | 0.5940 | 0.6950 | 0.4122 | 0.3534 | 0.5145 | 0.6891 | 0.6373 | 0.5040 | 0.4013 | 0.4290 | 0.4053 | 0.3091 |
| 2022-03-03 | InfographicVQA paper model | 0.2720 | 0.3278 | 0.2386 | 0.0450 | 0.1371 | 0.2400 | 0.3626 | 0.1705 | 0.2551 | 0.2205 | 0.1836 | 0.1559 | 0.1140 |

Ranking Graphic