method: Pre-trained model based fullpipe pair extraction (opti_v3, no inf_aug) 2023-03-16

Authors: Zening Lin, Teng Li, Wenhui Liao, Jiapeng Wang, Songxuan Lai, Lianwen Jin

Affiliation: South China University of Technology; Huawei Cloud

Description: Model
1. Take segment-level OCR results as input, then use XYCut and a pre-trained-model-based named entity recognition (NER) model to extract entities.
2. Use an entity-level, pre-trained-model-based relation extraction (RE) model to extract pairs.
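
For orientation, the reading-order step can be illustrated with a plain recursive XY cut. This is a minimal sketch, not the optimized variant used in this submission: the names `xy_cut` and `_split`, the `(x0, y0, x1, y1)` box format, and the fallback ordering are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1)


def _split(indices: Sequence[int], boxes: Sequence[Box], axis: str) -> List[List[int]]:
    """Group box indices that are separated by an empty gap along one axis."""
    lo, hi = (1, 3) if axis == "y" else (0, 2)
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # farthest extent covered by the current group
    for i in ordered[1:]:
        if boxes[i][lo] >= reach:
            groups.append([i])  # empty gap found: start a new group
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut(boxes: Sequence[Box], indices: Optional[Sequence[int]] = None) -> List[int]:
    """Return box indices in reading order using a plain recursive XY cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    # Try a horizontal cut (split along y) first, then a vertical cut (along x).
    for axis in ("y", "x"):
        groups = _split(indices, boxes, axis)
        if len(groups) > 1:
            return [i for group in groups for i in xy_cut(boxes, group)]
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


segments = [(100, 0, 180, 10), (0, 0, 80, 10), (0, 20, 180, 30)]
order = xy_cut(segments)
```

Here the two segments on the first row are ordered left to right before the segment on the second row. Cutting along y before x is one possible choice and favors row-by-row reading.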

Details
1. All strings are converted to half-width before sending to the NER model.
2. Spaces introduced by the tokenizer are discarded in the postprocessing step using a string comparison algorithm.
3. Box position jittering is applied when training the RE model.
4. For nested-key sorting, we use several rule-based methods to determine the order.
5. The XYCut algorithm is optimized to handle the ordering of lines inside an entity.
6. Rules are added for keys that contain a colon.
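
Details 1 to 3 can be illustrated with a short sketch. It is only one plausible reading of those steps, not the code used in this submission: the function names `to_half_width`, `strip_tokenizer_spaces`, and `jitter_box`, the `max_shift` parameter, and the `(x0, y0, x1, y1)` box format are all hypothetical, and the space-removal routine assumes the decoded text differs from the source text only by extra spaces.

```python
import random
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1)


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:  # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:  # full-width '!' through '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def strip_tokenizer_spaces(original: str, decoded: str) -> str:
    """Drop spaces in `decoded` that have no counterpart in `original`."""
    out = []
    i = 0  # position in the original string
    for ch in decoded:
        if i < len(original) and ch == original[i]:
            out.append(ch)
            i += 1
        elif ch == " ":
            continue  # space absent from the source text: tokenizer artifact
        else:
            out.append(ch)  # assumed substitution: keep it and stay aligned
            i += 1
    return "".join(out)


def jitter_box(box: Box, max_shift: int = 2, rng: Optional[random.Random] = None) -> Box:
    """Shift each coordinate by a random integer offset in [-max_shift, max_shift]."""
    rng = rng or random.Random()
    x0, y0, x1, y1 = (v + rng.randint(-max_shift, max_shift) for v in box)
    # Reorder so the box stays well-formed even if the offsets cross over.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


normalized = to_half_width("ＡＢＣ１２３：")
cleaned = strip_tokenizer_spaces("Date: 2023-03-16", "Date : 2023 - 03 - 16")
jittered = jitter_box((10, 10, 50, 20), max_shift=2, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the jitter reproducible across training runs. The two-pointer comparison preserves spaces that really occur in the source text, such as the one after the colon in the example.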