Results - Document Information Localization and Extraction

method: GraphDoc+Classify+Merge (2023-05-24)

Authors: Yan Wang, Jiefeng Ma, Zhenrong Zhang, Pengfei Hu, Jianshu Zhang, Jun Du

Affiliation: University of Science and Technology of China (USTC), iFLYTEK AI Research

Description: We pre-trained several GraphDoc models on the provided unlabelled documents under different configurations, then fine-tuned them on the training set for 200-500 epochs. After classifying OCR boxes into categories, we use our proposed Merger module to aggregate them. We also applied pre- and post-processing based on the text content and the distances between OCR boxes. Finally, we used model ensembling to further improve system performance.
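The description does not say how the Merger module works internally. As a rough illustration of the distance-based processing mentioned above, the following sketch groups neighbouring OCR boxes that share a predicted category. The `OcrBox` type, the `merge_boxes` helper, the `max_gap` threshold and the field labels are all hypothetical and are not taken from the submission.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    text: str
    label: str  # predicted category (hypothetical field name)
    x0: float
    y0: float
    x1: float
    y1: float


def horizontal_gap(a: OcrBox, b: OcrBox) -> float:
    """Gap between two boxes along x; 0 if they overlap horizontally."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))


def same_line(a: OcrBox, b: OcrBox) -> bool:
    """True if the vertical extents of the two boxes overlap."""
    return min(a.y1, b.y1) - max(a.y0, b.y0) > 0


def merge_boxes(boxes, max_gap=15.0):
    """Greedily merge boxes with the same label that sit close together on a line.

    Assumption: a simple rule-based stand-in, not the authors' Merger module.
    """
    merged = []
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.label == box.label
            and same_line(last, box)
            and horizontal_gap(last, box) <= max_gap
        ):
            # Extend the previous group with the new box.
            merged[-1] = OcrBox(
                text=f"{last.text} {box.text}",
                label=last.label,
                x0=min(last.x0, box.x0),
                y0=min(last.y0, box.y0),
                x1=max(last.x1, box.x1),
                y1=max(last.y1, box.y1),
            )
        else:
            merged.append(box)
    return merged


example = [
    OcrBox("ACME", "vendor_name", 10, 10, 50, 20),
    OcrBox("Corp", "vendor_name", 55, 10, 90, 20),
    OcrBox("42", "amount_total", 300, 10, 320, 20),
]
fields = merge_boxes(example)
```

Here the two `vendor_name` boxes are 5 units apart and are merged into one field, while the distant `amount_total` box stays separate.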

Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Source code

method: YOLOv8X+Grid (2023-05-08)

Authors: Jakub Straka

Affiliation: University of West Bohemia, Department of Cybernetics

Description: The KILE task can be solved in many different ways. We approached it as object detection, treating each field in the document as an object. We used YOLOv8X as the detection model, which is based on a convolutional neural network. One of the advantages of this model is its speed and small size. We also incorporated methods from Chargrid [1].

1. Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. Chargrid: Towards understanding 2D documents. arXiv preprint arXiv:1809.08799, 2018.
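How the Chargrid ideas were combined with YOLOv8X is not stated in the description. For orientation, a chargrid represents a page as a 2D grid in which every cell covered by a character's bounding box holds that character's index. The sketch below builds such a grid in plain Python; the `build_chargrid` helper, the toy vocabulary and the choice of index 1 for unknown characters are assumptions made for illustration.

```python
def build_chargrid(chars, width, height, vocab):
    """Rasterize character boxes into a 2D grid of character indices.

    chars: list of (character, x0, y0, x1, y1) in integer grid coordinates,
           with x1 and y1 exclusive.
    vocab: maps a character to a positive integer index.
    Index 0 is background; index 1 is used for unknown characters (assumption).
    """
    grid = [[0] * width for _ in range(height)]
    for ch, x0, y0, x1, y1 in chars:
        idx = vocab.get(ch, 1)
        # Clip the box to the grid and fill it with the character index.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = idx
    return grid


toy_vocab = {"a": 2, "b": 3}
toy_chars = [("a", 0, 0, 2, 2), ("b", 2, 0, 4, 2)]
toy_grid = build_chargrid(toy_chars, width=5, height=3, vocab=toy_vocab)
```

A grid like this can be one-hot encoded along a channel axis and passed to a convolutional model as an additional input alongside, or instead of, the page image.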

Source code

method: Baseline+Ensemble+Pseudo+Post-Processing (2023-05-16)

Authors: UIT@AICLUB_TAB

Affiliation: UIT - University of Information Technology - VNUHCM

Email: 22520121@gm.uit.edu.vn

Description: Our approach is based on the baseline checkpoint with some improvements. We trained or used the following models:
1. RoBERTa-base trained from scratch with FGM and the Lion optimizer on synthetic data for 30 epochs, then trained on the annotated data.
2. Our RoBERTa model (checkpoint) with the Lion optimizer.
3. RoBERTa-base (checkpoint).

We then ensembled them by taking the union of the words that each model marked as one of the 55 field types, followed by post-processing.
Next, we used the ensembled model to predict labels for the unlabeled data, pre-trained the 3 models on the resulting pseudo-labeled data, and then trained them on the annotated data.
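The union step can be sketched as follows, assuming each model outputs a mapping from word index to predicted field type and leaves unlabeled words out. The `union_ensemble` helper, the example field names and the rule that earlier models win when two models disagree are hypothetical; the description does not say how conflicts were resolved.

```python
def union_ensemble(predictions):
    """Combine word-level predictions from several models by union.

    predictions: list of dicts, one per model, mapping word index -> field type.
    A word is labeled in the result if at least one model labeled it.
    On disagreement, the model listed first wins (assumption).
    """
    merged = {}
    for model_pred in predictions:
        for word_idx, field in model_pred.items():
            merged.setdefault(word_idx, field)
    return merged


model_a = {0: "vendor_name", 3: "amount_total"}
model_b = {3: "amount_due", 5: "date_issue"}
model_c = {}
ensembled = union_ensemble([model_a, model_b, model_c])
```

A union tends to favour recall over precision, since any single model can add a word to a field.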

Pipeline: https://ibb.co/4MWcXgb

Source code

Ranking Table


Date	Method	AP	F1	Precision	Recall
2023-05-24	GraphDoc+Classify+Merge	71.25%	74.25%	71.41%	77.31%
2023-05-08	YOLOv8X+Grid	67.99%	74.66%	73.51%	75.85%
2023-05-16	Baseline+Ensemble+Pseudo+Post-Processing	61.24%	65.22%	61.13%	69.90%
2023-05-02	baseline - RoBERTa-base with synthetic pre-training	53.90%	66.38%	65.86%	66.92%
2023-05-02	baseline - RoBERTa-base	53.45%	66.42%	65.80%	67.05%
2023-05-02	baseline - LayoutLMv3 with unsupervised and synthetic pre-training	51.22%	65.47%	66.20%	64.76%
2023-05-02	baseline - LayoutLMv3 with unsupervised pre-training	50.68%	63.86%	63.58%	64.15%
2023-05-25	SRCB Submission on Key Information Localization and Extraction	45.31%	65.98%	66.82%	65.17%
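As a sanity check on the table, the F1 column can be compared against the harmonic mean of the Precision and Recall columns, $F_1 = 2PR/(P+R)$. The sketch below assumes this standard definition applies here; the `f1_score` helper is illustrative, and small differences are expected because the published values are rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, reported F1, precision, recall), copied from the ranking table.
rows = [
    ("GraphDoc+Classify+Merge", 74.25, 71.41, 77.31),
    ("YOLOv8X+Grid", 74.66, 73.51, 75.85),
    ("Baseline+Ensemble+Pseudo+Post-Processing", 65.22, 61.13, 69.90),
    ("baseline - RoBERTa-base with synthetic pre-training", 66.38, 65.86, 66.92),
    ("baseline - RoBERTa-base", 66.42, 65.80, 67.05),
    ("baseline - LayoutLMv3 with unsupervised and synthetic pre-training", 65.47, 66.20, 64.76),
    ("baseline - LayoutLMv3 with unsupervised pre-training", 63.86, 63.58, 64.15),
    ("SRCB Submission on Key Information Localization and Extraction", 65.98, 66.82, 65.17),
]

# Absolute difference between the recomputed and the reported F1 per method.
deviations = {
    name: abs(f1_score(p, r) - reported) for name, reported, p, r in rows
}
max_deviation = max(deviations.values())
```

The rows are ordered by AP rather than F1, which is why YOLOv8X+Grid ranks second despite having the highest F1.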
