method: LayoutLMV3&StrucText (2023-03-21)

Authors: Minhui Wu(伍敏慧),Mei Jiang(姜媚),Chen Li(李琛),Jing Lv(吕静),Qingxiang Lin(林庆祥),Fan Yang(杨帆)

Affiliation: TencentOCR

Description: Our methods are mainly based on the LayoutLMv3 and StrucTextv1 architectures. All models are fine-tuned from large pretrained LayoutLM and StrucText checkpoints. During training and testing, we apply preprocessing that merges and splits badly detected boxes. Since the entity labels of kv-pair boxes are ignored, we use a model trained on task1 images to predict kv relations among the text boxes of the task2 training/testing images. We then add two extra label classes (question/answer) and remap the original labels onto them (other -> question/answer) to ease training. Likewise, during testing we use the kv-prediction model to filter out text boxes involved in kv relations, and a model trained on task2 to predict entity labels for the remaining boxes. Finally, we combine the predictions of the different models based on scores and rules, and apply postprocessing that merges texts sharing the same entity label to produce the final output.
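A minimal sketch of the two-stage inference described above, assuming hypothetical `kv_model` and `entity_model` wrappers and simple box objects (this is an illustration under those assumptions, not the authors' actual code):

```python
from collections import defaultdict

def predict_entities(boxes, kv_model, entity_model):
    # Stage 1: the task1-trained model marks boxes that take part in a
    # kv relation ("question" or "answer"); other boxes map to None.
    kv_labels = kv_model.predict(boxes)  # {box_id: "question" | "answer" | None}

    # Stage 2: only boxes without a kv relation are passed to the
    # task2-trained entity classifier.
    remaining = [b for b in boxes if kv_labels.get(b.id) is None]
    entity_labels = entity_model.predict(remaining)  # {box_id: entity_label}

    # Postprocessing: merge texts that share an entity label into one field.
    merged = defaultdict(list)
    for box in remaining:
        merged[entity_labels[box.id]].append(box.text)
    return {label: " ".join(texts) for label, texts in merged.items()}
```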

method: LayoutLM&StrucText (2023-03-20)

Authors: Minhui Wu(伍敏慧),Mei Jiang(姜媚),Chen Li(李琛),Jing Lv(吕静),Qingxiang Lin(林庆祥),Fan Yang(杨帆)

Affiliation: TencentOCR

Description: Identical to the description of the LayoutLMV3&StrucText (2023-03-21) submission above.

method: task1 transfer learning LiLT + task3 transfer learning LiLT + LiLT + LayoutLMv3 ensemble (2023-03-21)

Authors: Hengguang Zhou, Zeyin Lin, Xingjian Zhao, Yue Zhang, Dahyun Kim, Sehwan Joo, Minsoo Khang, Teakgyu Hong

Affiliation: Deep SE x Upstage HK

Email: hengguangzhou0@gmail.com

Description: For OCR, we use a cascaded approach in which the pipeline is split into text detection followed by text recognition (sketched below). For text detection, we use the CRAFT architecture with the backbone replaced by EfficientUNet-b3. For text recognition, we use the ParSeq architecture with the visual feature extractor replaced by SwinV2.
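A minimal sketch of that detection-then-recognition cascade; the `craft_detector` and `parseq_recognizer` wrappers are hypothetical stand-ins, not real library APIs:

```python
def run_ocr(image, craft_detector, parseq_recognizer):
    # Stage 1: the CRAFT-based detector proposes word-level boxes.
    # Stage 2: the ParSeq-based recognizer reads each cropped region.
    words = []
    for box in craft_detector.detect(image):
        crop = image.crop(box)                    # cut out the word region
        text = parseq_recognizer.recognize(crop)  # decode its text
        words.append((box, text))
    return words
```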
Regarding the parsing models, we trained both LiLT and LayoutLMv3 on the task2 dataset. For LiLT, we also performed transfer learning on either task1 or task3 before fine-tuning on task2. Finally, we take an ensemble of these four models to obtain the final predictions (a voting sketch follows).
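The submission does not specify how the four models are combined; the sketch below assumes a simple per-box majority vote for illustration:

```python
from collections import Counter

def ensemble_vote(models, boxes):
    # Collect one label per box from each of the four fine-tuned parsers
    # (LiLT, LiLT w/ task1 transfer, LiLT w/ task3 transfer, LayoutLMv3).
    votes = {box.id: Counter() for box in boxes}
    for model in models:
        for box_id, label in model.predict(boxes).items():
            votes[box_id][label] += 1
    # Keep the most-voted label for each box (ties broken arbitrarily).
    return {box_id: c.most_common(1)[0][0] for box_id, c in votes.items()}
```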

Ranking Table

Date | Method | Score1 | Score2 | Score
2023-03-21 | LayoutLMV3&StrucText | 57.78% | 55.32% | 57.29%
2023-03-20 | LayoutLM&StrucText | 55.65% | 52.99% | 55.12%
2023-03-21 | task1 transfer learning LiLT + task3 transfer learning LiLT + LiLT + LayoutLMv3 ensemble | 45.70% | 40.20% | 44.60%
2023-03-21 | EXO-brain for KIE | 44.02% | 39.63% | 43.14%
2023-03-21 | Ex-brain for KIE | 44.00% | 39.46% | 43.09%
2023-03-21 | Ex-brain for KIE | 44.00% | 39.46% | 43.09%
2023-03-21 | Ex-brain for KIE | 43.66% | 39.30% | 42.79%
2023-03-20 | Aaaa | 42.03% | 37.14% | 41.05%
2023-03-20 | Ant-FinCV | 41.61% | 35.98% | 40.48%
2023-03-21 | Ex-brain for KIE | 41.38% | 35.14% | 40.13%

Ranking Graphic