Authors: Panfeng Cao, Jian Wu
Affiliation: University of Michigan, University of Science and Technology of China
Email: firstname.lastname@example.org, email@example.com
Description: Our model uses a transformer encoder and a graph convolutional module to combine textual, visual, and relative positional features from visually rich documents (VRDs). The combined features are fed into a BiLSTM-CRF module that decodes the IOB tags. The model is lightweight, with only 60-70M parameters, and can be trained within a few hours on a mid-range GPU.
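To illustrate what decoding IOB tags means at the character level, here is a minimal sketch that groups per-character tags into labelled text spans. It is not our actual decoding code: the function name `decode_iob` and the tag format `B-<label>` / `I-<label>` / `O` are assumptions made for this example.

```python
from typing import List, Tuple


def decode_iob(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Group character-level IOB tags into (label, text) spans.

    Hypothetical helper: assumes tags look like "B-total", "I-total" or "O".
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    label, buf = None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new span starts; flush any span that is still open.
            if label is not None:
                entities.append((label, "".join(buf)))
            label, buf = tag[2:], [ch]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the currently open span.
            buf.append(ch)
        else:
            # "O", or an I- tag that does not match the open span:
            # close the open span and drop this character.
            if label is not None:
                entities.append((label, "".join(buf)))
            label, buf = None, []
    if label is not None:
        entities.append((label, "".join(buf)))
    return entities
```

For example, tagging the last four characters of "TOTAL 9.00" as `B-total, I-total, I-total, I-total` yields a single span with label `total` and text `9.00`.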
Because our model is character-based, we need to post-process the prediction results. For example, the CRF may skip some characters in a text segment, so we use the original OCR transcripts to restore those characters and produce the complete final result. The official OCR transcripts also contain many errors, which we corrected. In addition, some total values in the website evaluation scripts are randomly prefixed with 'RM'; we fixed this as well.
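One way the character restoration could look is sketched below. This is an assumption-laden illustration, not our exact post-processing: the helper names `is_subsequence` and `restore_from_transcript` are hypothetical, and the sketch assumes that a prediction with skipped characters is an in-order subsequence of exactly the OCR segment it came from.

```python
from typing import List


def is_subsequence(short: str, long: str) -> bool:
    """True if every character of `short` appears in `long` in the same order."""
    it = iter(long)
    # `ch in it` advances the iterator, so order is respected.
    return all(ch in it for ch in short)


def restore_from_transcript(predicted: str, ocr_segments: List[str]) -> str:
    """Replace a prediction with gaps by the OCR segment that contains it.

    Hypothetical helper: picks the shortest OCR segment of which `predicted`
    is a subsequence, and falls back to `predicted` if no segment matches.
    """
    if not predicted:
        return predicted
    candidates = [seg for seg in ocr_segments if is_subsequence(predicted, seg)]
    if not candidates:
        return predicted
    return min(candidates, key=len)
```

Choosing the shortest matching segment keeps the restored text as tight as possible. Note that this sketch returns the whole segment, so it only suits entities that span a full OCR text segment; an entity that covers part of a segment would need span-level alignment instead.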