Results - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering

method: ssbaseline

Date: 2020-09-09

Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Affiliation: Northwestern Polytechnical University

Description: We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking about multi-modality encoder design.

https://arxiv.org/abs/2012.05153

Source code

method: TIG

Date: 2020-08-15

Authors: Xiangpeng Li

Description: Text-Instance Graph: We build an OCR-Obj graph using the overlap relationships between OCR token texts and visual instances in the image. A question-conditioned multi-step graph attention network is then adopted to extend the perception of each node, so that each node is described by its neighboring nodes.

https://www.sciencedirect.com/science/article/pii/S0031320321006312

Source code

method: M4C (single model)

Date: 2019-11-02

Authors: Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Affiliation: Facebook AI Research (FAIR); University of California, Berkeley

Email: ronghang.hu@gmail.com

Description: We propose a novel model for the TextVQA task based on a multimodal transformer architecture with iterative answer prediction and rich feature representations for OCR tokens, largely outperforming previous work on three datasets.

R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. arXiv preprint arXiv:1911.06258, 2019 (to appear in CVPR 2020)

Source code

Ranking Table

Date	Method	Score
2020-09-09	ssbaseline	0.5500
2020-08-15	TIG	0.5051
2019-11-02	M4C (single model)	0.4621
2019-04-29	Focus: A bottom-up approach for Scene Text VQA	0.0882
