method: Pre-trained model based fullpipe pair extraction (opti_v2, no inf_aug) 2023-03-16

Authors: Zening Lin, Teng Li, Wenhui Liao, Jiapeng Wang, Songxuan Lai, Lianwen Jin

Affiliation: South China University of Technology; Huawei Cloud

Description: Model
1. Take segment-level OCR results as input, then use XYCut and a pre-trained-model-based NER model to extract entities.
2. Use an entity-level, pre-trained-model-based RE model to extract pairs.
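
The reading-order step in the pipeline above can be illustrated with a plain recursive XY-cut. This is only a minimal sketch of the textbook algorithm, not the optimized variant used in this submission. The box format (x0, y0, x1, y1) and the names xy_cut and _split are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


def _split(indices: Sequence[int], boxes: Sequence[Box], axis: int) -> List[List[int]]:
    """Group boxes into runs separated by an empty gap along axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # farthest extent covered so far
    for i in ordered[1:]:
        if boxes[i][lo] > reach:
            groups.append([i])  # projection gap found: start a new group
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut(boxes: Sequence[Box], indices: Optional[Sequence[int]] = None, axis: int = 1) -> List[int]:
    """Return box indices in reading order using recursive XY-cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    groups = _split(indices, boxes, axis)
    if len(groups) == 1:
        # No cut along this axis, so try the other one.
        groups = _split(indices, boxes, 1 - axis)
        if len(groups) == 1:
            # No cut possible at all: fall back to a top-to-bottom, left-to-right sort.
            return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
        axis = 1 - axis
    order: List[int] = []
    for group in groups:
        order.extend(xy_cut(boxes, group, 1 - axis))
    return order


# Two columns whose rows are vertically misaligned: no horizontal cut exists,
# so the page is first cut into columns, then each column is read top to bottom.
columns = [(0, 0, 10, 8), (20, 5, 30, 12), (0, 10, 10, 18), (20, 14, 30, 20)]
print(xy_cut(columns))
```

Starting with horizontal cuts (axis=1) and alternating axes is one common convention. Lines inside a single entity often overlap in both projections, which this plain version only handles with the sort fallback.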

Details
1. All strings are converted to half-width characters before being sent to the NER model.
2. Spaces introduced by the tokenizer are discarded in the post-processing step using a string comparison algorithm.
3. Box position jittering is applied when training the RE model.
4. For nested-key sorting, we use several rule-based methods to determine the order.
5. The XYCut algorithm is optimized to handle the ordering of lines inside an entity.
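
Details 1 to 3 can be sketched as follows. This is a hedged illustration under stated assumptions, not the submitted implementation. The function names, the max_shift parameter, the uniform noise distribution, and the assumption that the decoded text differs from the original only by inserted spaces are all hypothetical.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:  # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))  # fixed offset to the ASCII range
        else:
            out.append(ch)
    return "".join(out)


def drop_tokenizer_spaces(decoded: str, original: str) -> str:
    """Remove spaces in decoded that have no counterpart in original.

    Assumes decoded equals original apart from extra spaces.
    """
    out = []
    j = 0  # position in original
    for ch in decoded:
        if j < len(original) and ch == original[j]:
            out.append(ch)
            j += 1
        elif ch == " ":
            continue  # space not present in the original: drop it
        else:
            out.append(ch)  # unexpected mismatch: keep the character
    return "".join(out)


def jitter_box(box: Box, max_shift: float = 2.0, rng=random) -> Box:
    """Shift each coordinate by independent uniform noise in [-max_shift, max_shift]."""
    x0, y0, x1, y1 = box
    dx0, dy0, dx1, dy1 = (rng.uniform(-max_shift, max_shift) for _ in range(4))
    nx0, nx1 = sorted((x0 + dx0, x1 + dx1))  # keep x0 <= x1 after jittering
    ny0, ny1 = sorted((y0 + dy0, y1 + dy1))  # keep y0 <= y1 after jittering
    return (nx0, ny0, nx1, ny1)


print(to_half_width("ＡＢＣ１２３："))
print(drop_tokenizer_spaces("No . 123 - A", "No.123-A"))
print(jitter_box((10, 10, 50, 20), rng=random.Random(0)))
```

Jittering would be applied only to training inputs of the RE model. Passing a seeded random.Random instance as rng keeps the augmentation reproducible.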