method: ssbaseline

Date: 2020-09-09

Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Affiliation: Northwestern Polytechnical University


Description: We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking about multi-modality encoder design.

method: TIG

Date: 2020-08-15

Authors: Xiangpeng Li

Description: Text-Instance Graph: We build an OCR-Obj graph from the overlap relationships between OCR tokens and visual instances in the image. A question-conditioned multi-step graph attention network is then applied to extend the perception of each node, so that each node is described by its neighboring nodes.
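
The graph-construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes axis-aligned boxes in (x1, y1, x2, y2) form and treats any positive-area intersection as an "overlap", which the description does not specify. The names boxes_overlap and build_ocr_obj_graph are hypothetical, and the question-conditioned graph attention step is not shown.

```python
from typing import Dict, List, Tuple

# Assumed box format (not given in the description): (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def build_ocr_obj_graph(ocr_boxes: List[Box], obj_boxes: List[Box]) -> Dict[int, List[int]]:
    """Map each OCR token index to the indices of the object boxes it overlaps.

    Hypothetical helper: the overlap criterion is an assumption made for illustration.
    """
    return {
        i: [j for j, obj in enumerate(obj_boxes) if boxes_overlap(ocr, obj)]
        for i, ocr in enumerate(ocr_boxes)
    }


# Toy data: the first OCR token lies inside both objects, the second touches none.
ocr_boxes = [(10, 10, 40, 20), (200, 200, 230, 210)]
obj_boxes = [(0, 0, 100, 100), (30, 15, 60, 60)]
graph = build_ocr_obj_graph(ocr_boxes, obj_boxes)
```

In this sketch each OCR token becomes a node whose edges point to the visual instances it overlaps; an attention network could then aggregate features along those edges.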

method: M4C (single model)

Date: 2019-11-02

Authors: Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Affiliation: Facebook AI Research (FAIR); University of California, Berkeley


Description: We propose a novel model for the TextVQA task based on a multimodal transformer architecture with iterative answer prediction and rich feature representations for OCR tokens, largely outperforming previous work on three datasets.

Ranking Table

Date        Method                                          Score
2019-11-02  M4C (single model)                              0.4621
2019-04-29  Focus: A bottom-up approach for Scene Text VQA  0.0882

Ranking Graphic