method: GRAM (2024-01-16)

Authors: Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Shahar Tsiper, Elad Ben Avraham, Aviad Aberdam, Ron Litman

Affiliation: AWS AI Labs and Technion, Israel

Description: GRAM model based on DocFormerv2, trained on the Multi-Page DocVQA dataset.

method: GRAM C-Former (2024-01-16)

Authors: Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Shahar Tsiper, Elad Ben Avraham, Aviad Aberdam, Ron Litman

Affiliation: AWS AI Labs and Technion, Israel

Description: GRAM model with C-Former, based on DocFormerv2, trained on the Multi-Page DocVQA dataset.

method: ScreenAI 5B (2024-02-18)

Authors: ScreenAI Team

Affiliation: Google Research

Description: Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction.
We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding.
Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets.
At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements.
We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale.
We run ablation studies to demonstrate the impact of these design choices.
At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size.
Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

Ranking Table

Columns: answer ANLS, answer-page prediction accuracy (%), and ANLS broken down by answer page position (Page 0 to Page 19).

Date | Method | ANLS | Accuracy | Page 0 | Page 1 | Page 2 | Page 3 | Page 4 | Page 5 | Page 6 | Page 7 | Page 8 | Page 9 | Page 10 | Page 11 | Page 12 | Page 13 | Page 14 | Page 15 | Page 16 | Page 17 | Page 18 | Page 19
2024-01-16 | GRAM | 0.8032 | 19.9841 | 0.8380 | 0.7854 | 0.7528 | 0.7908 | 0.7452 | 0.7922 | 0.7459 | 0.7229 | 0.7464 | 0.7102 | 0.8120 | 0.6905 | 0.7589 | 0.6473 | 0.5714 | 0.5909 | 0.7454 | 0.6367 | 0.8846 | 1.0000
2024-01-16 | GRAM C-Former | 0.7812 | 19.9841 | 0.8152 | 0.7659 | 0.7363 | 0.7569 | 0.7164 | 0.7238 | 0.7407 | 0.7180 | 0.7587 | 0.8003 | 0.7624 | 0.6771 | 0.7713 | 0.6772 | 0.5798 | 0.6172 | 0.6394 | 0.5664 | 0.8327 | 0.9250
2024-02-18 | ScreenAI 5B | 0.7711 | 77.8840 | 0.8304 | 0.7394 | 0.7261 | 0.7407 | 0.6100 | 0.7213 | 0.6454 | 0.6389 | 0.6573 | 0.7500 | 0.7262 | 0.7429 | 0.6295 | 0.5147 | 0.5932 | 0.6818 | 0.5383 | 0.5900 | 0.6154 | 0.9605
2023-10-03 | (OCR-Free) Multi-Page DocVQA Method | 0.6199 | 81.5501 | 0.6755 | 0.5954 | 0.5802 | 0.5611 | 0.4986 | 0.4989 | 0.5760 | 0.4991 | 0.6062 | 0.6652 | 0.5665 | 0.3438 | 0.4470 | 0.4171 | 0.3713 | 0.5909 | 0.4321 | 0.2575 | 0.7308 | 0.9605
2023-03-28 | Hi-VT5 | 0.6184 | 79.6374 | 0.6571 | 0.6055 | 0.5907 | 0.5450 | 0.5259 | 0.5431 | 0.6747 | 0.6113 | 0.5971 | 0.7997 | 0.5291 | 0.3694 | 0.5466 | 0.3373 | 0.4144 | 0.3879 | 0.4835 | 0.4001 | 0.6187 | 1.0000
2023-02-14 | (Baseline) Longformer base concat | 0.5287 | 71.1696 | 0.6293 | 0.4746 | 0.4495 | 0.4371 | 0.3966 | 0.3889 | 0.4451 | 0.3883 | 0.4805 | 0.5049 | 0.2860 | 0.1888 | 0.0861 | 0.1600 | 0.1726 | 0.2448 | 0.1486 | 0.1912 | 0.1154 | 0.6625
2023-02-14 | (Baseline) T5 base concat | 0.5050 | 0.0000 | 0.7122 | 0.4390 | 0.2567 | 0.2081 | 0.1498 | 0.1533 | 0.2186 | 0.1415 | 0.1301 | 0.3135 | 0.1108 | 0.0829 | 0.0866 | 0.0774 | 0.0873 | 0.0481 | 0.1648 | 0.2240 | 0.0000 | 0.3875
2023-02-14 | (Baseline) BigBird ITC base concat | 0.4929 | 67.5433 | 0.6506 | 0.4529 | 0.3729 | 0.2883 | 0.1890 | 0.1726 | 0.1681 | 0.1962 | 0.1887 | 0.2957 | 0.1802 | 0.0800 | 0.0829 | 0.0595 | 0.0238 | 0.1993 | 0.0778 | 0.1400 | 0.0769 | 0.2375
2023-02-14 | (Baseline) LayoutLMv3 base concat | 0.4538 | 51.9426 | 0.6624 | 0.3962 | 0.2020 | 0.1105 | 0.1609 | 0.0494 | 0.1165 | 0.0467 | 0.0596 | 0.3198 | 0.0980 | 0.0800 | 0.0433 | 0.1131 | 0.0000 | 0.0455 | 0.0978 | 0.1467 | 0.0385 | 0.2105
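The ANLS scores above follow the standard Average Normalized Levenshtein Similarity metric used for DocVQA-style evaluation: for each question, take the best normalized edit-distance similarity over all ground-truth answers, zero it out if it falls below a threshold (typically 0.5), and average over questions. A minimal sketch of that computation, with hypothetical function names of our choosing:

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def nls(pred: str, gt: str) -> float:
    """Normalized Levenshtein similarity between a prediction and one answer."""
    p, g = pred.strip().lower(), gt.strip().lower()
    if not p and not g:
        return 1.0
    return 1.0 - levenshtein(p, g) / max(len(p), len(g))

def anls(preds: List[str], gts: List[List[str]], tau: float = 0.5) -> float:
    """ANLS: best NLS over the ground-truth answers per question,
    zeroed below the threshold tau, averaged over questions."""
    scores = []
    for pred, answers in zip(preds, gts):
        best = max(nls(pred, a) for a in answers)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)
```

The threshold keeps near-miss OCR variants (e.g. one wrong character in a long answer) partially rewarded while scoring clearly wrong answers as zero. The per-page columns in the table apply the same metric restricted to questions whose answer lies on that page position.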

Ranking Graphic
