Authors: Qwen Team

Affiliation: Alibaba Group

Description: QwenVL
1. One single model, no ensemble.
2. End-to-end model, no OCR pipeline.
3. Generalist model, no specialist finetuning.
Give it a go with our model at and API at
Follow us at

Authors: SMoLA PaLI Team

Affiliation: Google Research

Description: Omni-SMoLA uses the Soft MoE approach to (softly) mix many multimodal low-rank experts. The specialist model is further LoRA-tuned on the InfoVQA task, starting from the SMoLA-PaLI-X generalist model.
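
As an illustration of what "softly mixing low-rank experts" can mean, the sketch below adds a gate-weighted sum of low-rank updates to a base linear layer. It is a simplified per-token soft-gating reading, not the Omni-SMoLA implementation and not the full Soft MoE dispatch/combine scheme. The function name `soft_lowrank_mixture`, the router and every shape are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_lowrank_mixture(x, base_w, experts, router_w):
    """Hypothetical layer: base output plus softly gated low-rank updates.

    x        : input vector of size d_in
    base_w   : d_out x d_in base weight matrix
    experts  : list of (down, up) pairs, down is r x d_in, up is d_out x r
    router_w : n_experts x d_in routing matrix (an assumption of this sketch)
    """
    y = matvec(base_w, x)
    # Every expert contributes; the gates are soft weights, not a top-k choice.
    gates = softmax(matvec(router_w, x))
    for gate, (down, up) in zip(gates, experts):
        delta = matvec(up, matvec(down, x))  # rank-r update for this expert
        y = [yi + gate * di for yi, di in zip(y, delta)]
    return y, gates


# Toy usage: two rank-1 experts on a 2-dimensional input.
x = [1.0, 2.0]
base_w = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # copies x[0] into output 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # copies x[1] into output 1
]
router_w = [[0.0, 0.0], [0.0, 0.0]]  # zero router gives uniform gates
y, gates = soft_lowrank_mixture(x, base_w, experts, router_w)
```

With a zero router, both experts receive a gate of 0.5, so each low-rank update is added at half strength.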

method: ScreenAI 5B (2024-02-10)

Authors: ScreenAI Team

Affiliation: Google Research

Description: Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction.
We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding.
Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets.
At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements.
We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale.
We run ablation studies to demonstrate the impact of these design choices.
At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size.
Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
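
The flexible patching mentioned in the description can be sketched as follows: instead of resizing every image to a fixed square, choose the largest grid of fixed-size patches that roughly preserves the aspect ratio while staying within a patch budget. This is a minimal sketch of that general idea, as commonly described for pix2struct. The function name `patch_grid` and the values for `patch_size` and `max_patches` are hypothetical and are not ScreenAI's settings.

```python
import math


def patch_grid(image_height, image_width, patch_size=16, max_patches=1024):
    """Return (rows, cols) of an aspect-ratio-preserving patch grid.

    The image is conceptually rescaled so that rows * cols <= max_patches,
    with rows / cols close to image_height / image_width.
    """
    # Scale factor at which the rescaled image holds about max_patches patches.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


# A 600 x 800 (height x width) screenshot with a budget of 1024 patches.
rows, cols = patch_grid(600, 800, patch_size=16, max_patches=1024)
```

A wide screenshot therefore gets more columns than rows, rather than being squashed into a square grid.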

Ranking Table

| Date | Method | Score | Answer type: Image span | Answer type: Question span | Answer type: Multiple spans | Answer type: Non span | Evidence: Table/List | Evidence: Textual | Evidence: Visual object | Evidence: Figure | Evidence: Map | Operation: Comparison | Operation: Arithmetic | Operation: Counting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-01-24 | qwenvl-max (single generalist model) | 0.7341 | 0.7756 | 0.8083 | 0.6035 | 0.5717 | 0.7291 | 0.8856 | 0.6708 | 0.6892 | 0.5967 | 0.6009 | 0.7152 | 0.4388 |
| 2023-11-15 | SMoLA-PaLI-X Specialist Model | 0.6621 | 0.7166 | 0.7252 | 0.5838 | 0.4292 | 0.6448 | 0.8261 | 0.6714 | 0.6110 | 0.5065 | 0.5238 | 0.5054 | 0.3506 |
| 2024-02-10 | ScreenAI 5B | 0.6590 | 0.7162 | 0.7247 | 0.5735 | 0.4140 | 0.6525 | 0.8315 | 0.5968 | 0.6021 | 0.4467 | 0.4815 | 0.5303 | 0.3000 |
| | TILT | 0.6120 | 0.6765 | 0.6419 | 0.4391 | 0.3832 | 0.5917 | 0.7916 | 0.4545 | 0.5654 | 0.4480 | 0.4801 | 0.4958 | 0.2652 |
| 2023-08-20 | PaLI-X (Google Research, Single Generative Model) | 0.5477 | 0.5940 | 0.6950 | 0.4122 | 0.3534 | 0.5145 | 0.6891 | 0.6373 | 0.5040 | 0.4013 | 0.4290 | 0.4053 | 0.3091 |
| 2021-04-09 | IG-BERT (single model) | 0.3854 | 0.4181 | 0.4481 | 0.2197 | 0.2849 | 0.3373 | 0.5016 | 0.3013 | 0.3706 | 0.3347 | 0.2939 | 0.3564 | 0.2000 |
| 2021-04-11 | NAVER CLOVA | 0.3219 | 0.3997 | 0.2317 | 0.1064 | 0.1068 | 0.2653 | 0.4488 | 0.1878 | 0.3095 | 0.3231 | 0.2020 | 0.1480 | 0.0695 |
| 2021-04-10 | Ensemble LM and VLM | 0.2853 | 0.3337 | 0.4181 | 0.0748 | 0.1169 | 0.2439 | 0.3649 | 0.2331 | 0.2645 | 0.2845 | 0.2580 | 0.1628 | 0.0647 |
| 2022-03-03 | InfographicVQA paper model | 0.2720 | 0.3278 | 0.2386 | 0.0450 | 0.1371 | 0.2400 | 0.3626 | 0.1705 | 0.2551 | 0.2205 | 0.1836 | 0.1559 | 0.1140 |
| 2021-04-05 | BERT fuzzy search | 0.2078 | 0.2625 | 0.2333 | 0.0739 | 0.0259 | 0.1852 | 0.2995 | 0.0896 | 0.1942 | 0.1709 | 0.1805 | 0.0160 | 0.0436 |

Ranking Graphic