Method: VideoLLaMA3-7B

Date: 2025-01-10

Authors: VideoLLaMA3 Team

Affiliation: DAMO Academy, Alibaba Group

Description: VideoLLaMA 3 is a series of multimodal foundation models designed for both image and video understanding. The models target complex multimodal tasks such as integrating textual and visual information, extracting insights from sequential video data, and reasoning over both dynamic and static visual scenes.

Ranking Table

Per-category columns are grouped as follows: Answer type (Image span, Question span, Multiple spans, Non span), Evidence (Table/List, Textual, Visual object, Figure, Map), and Operation (Comparison, Arithmetic, Counting).

| Date | Method | Score | Image span | Question span | Multiple spans | Non span | Table/List | Textual | Visual object | Figure | Map | Comparison | Arithmetic | Counting |
|------|--------|-------|------------|---------------|----------------|----------|------------|---------|---------------|--------|-----|------------|------------|----------|
| 2024-06-30 | InternVL2-Pro (generalist) | 0.8334 | 0.8681 | 0.8929 | 0.7350 | 0.6969 | 0.8335 | 0.9260 | 0.7757 | 0.8093 | 0.7186 | 0.7301 | 0.8584 | 0.5368 |
| 2024-09-25 | Molmo-72B | 0.8186 | 0.8513 | 0.8827 | 0.6821 | 0.7041 | 0.8184 | 0.9136 | 0.8062 | 0.7945 | 0.6960 | 0.7054 | 0.8188 | 0.5930 |
| 2025-01-10 | VideoLLaMA3-7B | 0.7893 | 0.8269 | 0.8358 | 0.6845 | 0.6447 | 0.7936 | 0.9165 | 0.7446 | 0.7499 | 0.6661 | 0.6411 | 0.7785 | 0.5179 |
| 2024-04-27 | InternVL-1.5-Plus (generalist) | 0.7574 | 0.7989 | 0.8124 | 0.6425 | 0.5987 | 0.7544 | 0.8733 | 0.7306 | 0.7234 | 0.6216 | 0.6065 | 0.7386 | 0.4623 |
| 2024-11-01 | MLCD-Embodied-7B: Multi-label Cluster Discrimination for Visual Representation Learning | 0.6998 | 0.7330 | 0.7930 | 0.5955 | 0.5564 | 0.6951 | 0.8271 | 0.6654 | 0.6614 | 0.5495 | 0.5523 | 0.6350 | 0.4905 |
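The ranking in the table above can be reproduced from the overall Score column with a few lines of plain Python. This is only an illustrative sketch (the row data is copied from the table; no official leaderboard API is implied):

```python
# Leaderboard rows as (date, method, overall_score), taken from the table above.
rows = [
    ("2024-06-30", "InternVL2-Pro (generalist)", 0.8334),
    ("2024-09-25", "Molmo-72B", 0.8186),
    ("2025-01-10", "VideoLLaMA3-7B", 0.7893),
    ("2024-04-27", "InternVL-1.5-Plus (generalist)", 0.7574),
    ("2024-11-01", "MLCD-Embodied-7B", 0.6998),
]

# Sort by overall score, highest first.
ranked = sorted(rows, key=lambda r: r[2], reverse=True)

# 1-based rank of VideoLLaMA3-7B among these entries.
rank = [r[1] for r in ranked].index("VideoLLaMA3-7B") + 1
print(rank)  # → 3
```

Note that the submission date does not determine rank: VideoLLaMA3-7B (2025-01-10) is the newest entry shown but places third by overall score.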

Ranking Graphic