method: ssbaseline (2020-09-09)
Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu
Affiliation: Northwestern Polytechnical University
Email: zephyrzhuqi@gmail.com
Description: We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking on multi-modality encoder design.
method: SMA (2020-03-06)
Authors: Anonymous
Affiliation: Anonymous
Description: Structured Multimodal Attentions
method: VTA (2019-04-30)
Authors: Fengren Wang, iFLYTEK, frwang@iflytek.com; Jinshui Hu, iFLYTEK, jshu@iflytek.com; Jun Du, USTC, jundu@ustc.edu.cn; Lirong Dai, USTC, lrdai@ustc.edu.cn; Jiajia Wu, iFLYTEK, jjwu@iflytek.com
Description: An encoder-decoder (ED) model for ST-VQA:
1. We use OCR and object detection models to extract text and objects from images.
2. We then use BERT to encode the extracted text and the QA pairs.
3. Finally, we use a model similar to Bottom-Up and Top-Down Attention [1] to process the image and question inputs and produce the answer.
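The top-down attention step referenced in [1] can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy 3-dimensional vectors are hypothetical stand-ins for the BERT question encoding and the detected object/OCR region features.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_down_attention(question_vec, region_vecs):
    # Score each image/OCR region against the question embedding
    # (dot product), then pool region features by their softmax weights.
    scores = [sum(q * r for q, r in zip(question_vec, region))
              for region in region_vecs]
    weights = softmax(scores)
    dim = len(region_vecs[0])
    pooled = [sum(w * region[d] for w, region in zip(weights, region_vecs))
              for d in range(dim)]
    return weights, pooled

# Toy inputs: a "question embedding" and three "region features".
question = [1.0, 0.0, 1.0]
regions = [[1.0, 0.0, 1.0],   # region aligned with the question
           [0.0, 1.0, 0.0],   # unrelated region
           [0.5, 0.5, 0.5]]   # partially related region

weights, fused = top_down_attention(question, regions)
```

The pooled vector `fused` weights each region by its relevance to the question; in the full model this fused representation would feed the answer decoder.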
Date | Method | Score
---|---|---
2020-09-09 | ssbaseline | 0.5490
2020-03-06 | SMA | 0.5081
2019-04-30 | VTA | 0.5063
2021-08-15 | ss1.0 | 0.5045
2020-05-22 | RUArt | 0.4817
2019-04-29 | USTB-TQA | 0.4553
2019-04-29 | Focus: A bottom-up approach for Scene Text VQA | 0.2959
2019-04-30 | Visual Question Answering via deep multimodal learning | 0.1411
2019-04-29 | USTB-TVQA | 0.1243
2019-04-29 | TRAN MINH TRIEU | 0.0545