- Task 1 - Single Page Document VQA
- Task 2 - Document Collection VQA
- Task 3 - Infographics VQA
- Task 4 - MP-DocVQA
Method: qwen2vl-2b ensemble (2024-12-19)
Authors: Wang KeLong
Description: Qwen2VL-2B is trained for the MP-DocVQA classification task, and Qwen2VL-2B is trained for the SP-DocVQA VQA task; the results from the four models are integrated through Qwen2VL-2B.
Method: Snowflake Arctic-TILT 0.8B (2024-08-21)
Authors: Snowflake Document AI team
Affiliation: Snowflake
Description: An improved Applica.ai TILT model: better text-vision modality fusion, long-context support, and an improved training procedure. We submitted results from a single model.
Method: ScreenAI 5B (2024-02-18)
Authors: ScreenAI Team
Affiliation: Google Research
Description: Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction.
We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding.
Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets.
At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements.
We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale.
We run ablation studies to demonstrate the impact of these design choices.
At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF, and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size.
Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
Results: answer ANLS, page-prediction accuracy, and ANLS per answer page position.

Date | Method | ANLS | Accuracy (%) | Page 0 | Page 1 | Page 2 | Page 3 | Page 4 | Page 5 | Page 6 | Page 7 | Page 8 | Page 9 | Page 10 | Page 11 | Page 12 | Page 13 | Page 14 | Page 15 | Page 16 | Page 17 | Page 18 | Page 19
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2024-12-19 | qwen2vl-2b ensemble | 0.8501 | 85.9534 | 0.8937 | 0.8363 | 0.7980 | 0.7965 | 0.7735 | 0.7903 | 0.6698 | 0.8718 | 0.8133 | 0.7004 | 0.7618 | 0.7229 | 0.8027 | 0.7433 | 0.7207 | 0.7031 | 0.7790 | 0.7050 | 0.8846 | 1.0000
2024-08-21 | Snowflake Arctic-TILT 0.8B | 0.8122 | 50.7870 | 0.8639 | 0.7967 | 0.7551 | 0.7312 | 0.7105 | 0.7837 | 0.6916 | 0.7239 | 0.7793 | 0.6648 | 0.7817 | 0.6445 | 0.7003 | 0.6393 | 0.7202 | 0.6364 | 0.7355 | 0.5650 | 0.6923 | 1.0000
2024-02-18 | ScreenAI 5B | 0.7711 | 77.8840 | 0.8304 | 0.7394 | 0.7261 | 0.7407 | 0.6100 | 0.7213 | 0.6454 | 0.6389 | 0.6573 | 0.7500 | 0.7262 | 0.7429 | 0.6295 | 0.5147 | 0.5932 | 0.6818 | 0.5383 | 0.5900 | 0.6154 | 0.9605
2025-01-15 | mPLUG-DocOwls | 0.6932 | 50.7870 | 0.7618 | 0.6636 | 0.6403 | 0.6219 | 0.5282 | 0.5507 | 0.5361 | 0.5960 | 0.6388 | 0.6015 | 0.6342 | 0.5000 | 0.5922 | 0.4351 | 0.4612 | 0.4938 | 0.5152 | 0.5333 | 0.6368 | 0.7105
2023-10-03 | (OCR-Free) Retrieval-based Baseline | 0.6199 | 81.5501 | 0.6755 | 0.5954 | 0.5802 | 0.5611 | 0.4986 | 0.4989 | 0.5760 | 0.4991 | 0.6062 | 0.6652 | 0.5665 | 0.3438 | 0.4470 | 0.4171 | 0.3713 | 0.5909 | 0.4321 | 0.2575 | 0.7308 | 0.9605
2023-03-28 | Hi-VT5 | 0.6184 | 79.6374 | 0.6571 | 0.6055 | 0.5907 | 0.5450 | 0.5259 | 0.5431 | 0.6747 | 0.6113 | 0.5971 | 0.7997 | 0.5291 | 0.3694 | 0.5466 | 0.3373 | 0.4144 | 0.3879 | 0.4835 | 0.4001 | 0.6187 | 1.0000
2023-02-14 | (Baseline) Longformer base concat | 0.5287 | 71.1696 | 0.6293 | 0.4746 | 0.4495 | 0.4371 | 0.3966 | 0.3889 | 0.4451 | 0.3883 | 0.4805 | 0.5049 | 0.2860 | 0.1888 | 0.0861 | 0.1600 | 0.1726 | 0.2448 | 0.1486 | 0.1912 | 0.1154 | 0.6625
2023-02-14 | (Baseline) T5 base concat | 0.5050 | 0.0000 | 0.7122 | 0.4390 | 0.2567 | 0.2081 | 0.1498 | 0.1533 | 0.2186 | 0.1415 | 0.1301 | 0.3135 | 0.1108 | 0.0829 | 0.0866 | 0.0774 | 0.0873 | 0.0481 | 0.1648 | 0.2240 | 0.0000 | 0.3875
2023-02-14 | (Baseline) BigBird ITC base concat | 0.4929 | 67.5433 | 0.6506 | 0.4529 | 0.3729 | 0.2883 | 0.1890 | 0.1726 | 0.1681 | 0.1962 | 0.1887 | 0.2957 | 0.1802 | 0.0800 | 0.0829 | 0.0595 | 0.0238 | 0.1993 | 0.0778 | 0.1400 | 0.0769 | 0.2375
2023-02-14 | (Baseline) LayoutLMv3 base - concat | 0.4538 | 51.9426 | 0.6624 | 0.3962 | 0.2020 | 0.1105 | 0.1609 | 0.0494 | 0.1165 | 0.0467 | 0.0596 | 0.3198 | 0.0980 | 0.0800 | 0.0433 | 0.1131 | 0.0000 | 0.0455 | 0.0978 | 0.1467 | 0.0385 | 0.2105
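The ANLS column above is the Average Normalized Levenshtein Similarity used throughout the DocVQA challenges: for each question the prediction is compared against every ground-truth answer, the best similarity is kept, scores below the 0.5 threshold are zeroed out, and the result is averaged over all questions. A minimal sketch (case-insensitive matching and the standard 0.5 threshold are assumptions based on the usual DocVQA protocol, not details stated on this page):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """predictions: list of strings; ground_truths: list of lists of
    acceptable answer strings, aligned with predictions."""
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            # Scores with normalized distance >= tau count as zero.
            s = 1.0 - nl if nl < tau else 0.0
            best = max(best, s)
        total += best
    return total / len(predictions) if predictions else 0.0
```

The thresholding means near-misses (e.g. a one-character OCR error in a long answer) still earn partial credit, while clearly wrong answers score zero rather than accumulating small similarities.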