Authors: Zening Lin, Teng Li, Wenhui Liao, Jiapeng Wang, Songxuan Lai, Lianwen Jin
Affiliation: South China University of Technology; Huawei Cloud
1. Take segment-level OCR results as input, then use XY-Cut and an NER model built on a pre-trained model to extract entities.
2. Use an entity-level RE model, also built on a pre-trained model, to extract entity pairs.
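The XY-Cut step can be sketched as follows. This is a minimal illustration only, assuming XY-Cut is used to put the OCR segments into reading order before NER (the description does not state its exact role). The function names, the (x0, y0, x1, y1) box format, and the fallback ordering are assumptions, not the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1)


def _split(indices: Sequence[int], boxes: Sequence[Box], axis: int) -> List[List[int]]:
    """Group boxes into runs separated by an empty gap along one axis (0 = x, 1 = y)."""
    order = sorted(indices, key=lambda i: boxes[i][axis])
    groups = [[order[0]]]
    reach = boxes[order[0]][axis + 2]  # farthest end coordinate seen so far
    for i in order[1:]:
        if boxes[i][axis] >= reach:
            groups.append([i])  # projection gap found: start a new group
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][axis + 2])
    return groups


def xy_cut(boxes: Sequence[Box], indices: Optional[Sequence[int]] = None) -> List[int]:
    """Return box indices in reading order using recursive XY-Cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    for axis in (1, 0):  # try to cut into rows first, then into columns
        groups = _split(indices, boxes, axis)
        if len(groups) > 1:
            return [i for g in groups for i in xy_cut(boxes, g)]
    # No clean cut exists: fall back to top-to-bottom, left-to-right order.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


segments = [(20, 10, 30, 15), (0, 0, 10, 5), (20, 0, 30, 5), (0, 10, 10, 15)]
reading_order = xy_cut(segments)
```

The returned indices can then be used to concatenate segment texts in order before they are passed to the NER model.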
1. All strings are converted to half-width before being sent to the NER model.
2. Spaces generated by the tokenizer are discarded in the postprocessing step using a string comparison algorithm.
3. Box position jittering is applied when training the RE model.
4. For nested-key sorting, we first use regular expressions to find line numbers and move them to the first place. For the remaining entities, we pick out the keys located in the left table header and move them to the front.
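The first three tricks above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, the character-by-character comparison used to drop tokenizer spaces, the (x0, y0, x1, y1) box format, and the jitter range are all hypothetical, since the description does not give these details.

```python
import random
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1)


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!'..'~' -> ASCII
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


def drop_tokenizer_spaces(original: str, decoded: str) -> str:
    """Remove spaces in `decoded` that are absent from `original`.

    Assumed comparison scheme: walk both strings in parallel and skip any
    space in the decoded text that does not line up with the original.
    """
    out = []
    i = 0
    for ch in decoded:
        if i < len(original) and ch == original[i]:
            out.append(ch)
            i += 1
        elif ch != " ":
            out.append(ch)  # non-space mismatch: keep it and move on
            i += 1
        # otherwise: a space inserted by the tokenizer, so drop it
    return "".join(out)


def jitter_box(box: Box, max_shift: int = 2,
               rng: Optional[random.Random] = None) -> Box:
    """Shift every coordinate by a random integer in [-max_shift, max_shift]."""
    rng = rng or random
    x0, y0, x1, y1 = (c + rng.randint(-max_shift, max_shift) for c in box)
    x0, x1 = sorted((x0, x1))  # keep the box well-formed after jittering
    y0, y1 = sorted((y0, y1))
    return (x0, y0, x1, y1)


normalized = to_half_width("ＡＢＣ１２３　：")
cleaned = drop_tokenizer_spaces("No.123 A", "No . 123 A")
jittered = jitter_box((10, 20, 110, 60), max_shift=3, rng=random.Random(0))
```

In this sketch the jitter would be applied to entity boxes only at training time, so the RE model sees slightly perturbed layouts while evaluation uses the original coordinates.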