method: ssbaseline (2020-09-09)

Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Affiliation: Northwestern Polytechnical University


Description: We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking about multi-modality encoder design.

Authors: Shailza Jolly* (TU Kaiserslautern & DFKI, Kaiserslautern), Shubham Kapoor* (Fraunhofer IAIS, Germany), Andreas Dengel (TU Kaiserslautern & DFKI, Kaiserslautern) [*equal contribution]

Description: We propose a novel scene-text Visual Question Answering architecture called Focus. The architecture uses a bottom-up attention mechanism, via Faster R-CNN (with ResNet-101), to extract visual features from multiple regions of interest (ROIs). Top-down attention over these ROIs is computed using the question embedding from a GRU encoder network. The attended visual features are fused with the question embedding to produce a joint representation of image and question. Finally, a GRU-based decoder generates open-ended answer sequences conditioned on the joint representation.
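The top-down attention and fusion steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The description does not state the scoring function or the fusion operator, so the dot-product scoring and element-wise product used here are assumptions, as are the function names `top_down_attention` and `fuse`. Toy vectors stand in for the Faster R-CNN ROI features and the GRU question embedding, and the GRU decoder is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_down_attention(roi_features, question_embedding):
    """Weight each ROI feature vector by its relevance to the question.

    Assumption: relevance is a plain dot product between an ROI feature
    and the question embedding (the source does not specify the scorer).
    """
    scores = [
        sum(f * q for f, q in zip(roi, question_embedding))
        for roi in roi_features
    ]
    weights = softmax(scores)
    dim = len(question_embedding)
    # Attended feature = attention-weighted sum of the ROI features.
    attended = [
        sum(w * roi[d] for w, roi in zip(weights, roi_features))
        for d in range(dim)
    ]
    return weights, attended


def fuse(attended, question_embedding):
    """Joint representation via element-wise product (assumed operator)."""
    return [a * q for a, q in zip(attended, question_embedding)]


if __name__ == "__main__":
    # Three toy ROI vectors and a toy question embedding (hypothetical values).
    rois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    question = [2.0, 0.0]
    weights, attended = top_down_attention(rois, question)
    print(weights, fuse(attended, question))
```

In this sketch, ROIs whose features align with the question embedding receive larger weights, so they dominate the joint representation that a decoder would be conditioned on.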
