method: StrucTexT (2021-11-24)

Authors: Baidu-OCR

Affiliation: Baidu

Description: 1. StrucTexT is a joint segment-level and token-level representation enhancement model for understanding document images such as PDFs, invoices, and receipts.
2. The StrucTexT large model is pre-trained on 50 million Chinese and English document images.
3. We fine-tune the single large pre-trained model on the SROIE dataset.

method: GraphDoc (2022-03-18)

Authors: Zhenrong Zhang, Jiefeng Ma, Jun Du

Affiliation: National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China.


Description: 1. GraphDoc is a multi-modal graph attention-based model for various Document Understanding tasks.
2. GraphDoc is pretrained on the RVL-CDIP training dataset, which contains only 320k document images.
4. Following the same evaluation rules as other submissions, OCR mismatch errors are excluded from the submission.

Authors: Njoyim Tchoubith Peguy Calusha

Affiliation: University of Fribourg, Switzerland


Description: Here is a simple neural language model (NLM) that relies only on character-level inputs. This model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a bidirectional long short-term memory (BLSTM) recurrent neural network language model (RNN-LM).

Unlike previous works that utilize subword information via morphemes, this model does not require morphological tagging as a pre-processing step. And, unlike the recent line of work which combines input word embeddings with features from a character-level model, this model does not utilize word embeddings at all in the input layer. Given that most of the parameters in NLMs are from the word embeddings, the proposed model has significantly fewer parameters than previous NLMs, making it attractive for applications where model size may be an issue (e.g. cell phones).
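The highway network mentioned in the description can be illustrated with a minimal sketch. It assumes the standard highway formulation, y = t * relu(W_h x + b_h) + (1 - t) * x with gate t = sigmoid(W_t x + b_t). The function and parameter names (`highway`, `w_h`, `b_h`, `w_t`, `b_t`) are hypothetical and this is not the author's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    # Computes W x + b, where `weights` is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway(x, w_h, b_h, w_t, b_t):
    # Candidate transform with a ReLU non-linearity (assumed here).
    h = [max(0.0, v) for v in affine(w_h, b_h, x)]
    # Transform gate in (0, 1): near 0 the input is carried through unchanged,
    # near 1 the transformed value is used.
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Illustrative values only.
x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
carried = highway(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
transformed = highway(x, identity, [0.0, 0.0], zeros, [50.0, 50.0])
```

With a strongly negative gate bias the layer behaves almost like the identity, and with a strongly positive one it behaves almost like a plain ReLU layer. This is why the input and output dimensions of a highway layer must match.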

To adapt this model to scanned receipts, the following modifications have been made:

- Unlike the original predictions made at word-level, the predictions are made at entity-level.
- The two LSTM layers are bidirectional (BiLSTM).
- A batch norm layer is added before the highway layer(s).
- The initialization of parameters is different for BiLSTM, and it is based on this paper:

Following the website evaluation procedure, the OCR mismatches are removed and the discrepancies in the total amount, which is randomly prefixed by "RM", are fixed to allow a fair comparison with other participants.
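One way to picture the "RM" fix is a normalization step applied to both the predicted and the reference total before they are compared. The sketch below is only an assumption about what such a step could look like. The helper names `normalize_total` and `totals_match` are hypothetical, and the actual procedure used for the submission is not given in the description.

```python
import re


def normalize_total(value):
    # Drop an optional leading "RM" currency prefix and surrounding spaces
    # (assumed normalization, not the author's documented rule).
    return re.sub(r"^\s*RM\s*", "", value).strip()


def totals_match(predicted, ground_truth):
    # Compare totals after normalization, so "RM 12.50" and "12.50" agree.
    return normalize_total(predicted) == normalize_total(ground_truth)
```

Normalizing both sides keeps the comparison symmetric: it does not matter whether the prefix appears in the prediction, in the annotation, or in both.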

Ranking Table

Date | Method | Score 1 | Score 2 | Score 3
2022-04-15 | Character-Aware CNN + Highway + BiLSTM 2.0 | 98.20% | 98.48% | 98.34%
 | Lambert 2.0 + Excluding OCR Errors + Fixing total entity | 96.83% | 99.56% | 98.17%
 | TILT + Excluding OCR Errors + Fixing total entity | 96.83% | 99.41% | 98.10%
2020-12-24 | LayoutLM 2.0 (single model) | 96.61% | 99.04% | 97.81%
 | Lambert 2.0 + Excluding OCR Mismatch | 96.40% | 99.11% | 97.74%
2021-10-25 | Character-Aware CNN + Highway + BiLSTM 1.0 | 96.18% | 97.45% | 96.81%
2020-04-15 | PICK-PAPCIC & XZMU | 95.46% | 96.79% | 96.12%
2019-05-04 | H&H Lab | 89.63% | 89.63% | 89.63%
2019-05-02 | CLOVA OCR | 89.05% | 89.05% | 89.05%
2019-04-28 | A Simple Method for Key Information Extraction as Character-wise Classification with LSTM | 75.58% | 75.58% | 75.58%
2019-05-02 | With receipt framing | 63.04% | 63.54% | 63.29%
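The three percentages in each ranking row appear to be related: the third agrees, to within rounding, with the harmonic mean of the first two. The short check below is a sketch over a few rows copied from the table. The names `hmean`, `rows`, and `max_gap` are illustrative, and the table itself does not label its score columns.

```python
def hmean(a, b):
    # Harmonic mean of two scores; defined as 0 when both are 0.
    return 2 * a * b / (a + b) if (a + b) else 0.0


# (first score, second score, reported third score), copied from the table.
rows = [
    (98.20, 98.48, 98.34),
    (96.83, 99.56, 98.17),
    (96.83, 99.41, 98.10),
    (96.61, 99.04, 97.81),
    (96.18, 97.45, 96.81),
    (95.46, 96.79, 96.12),
    (89.63, 89.63, 89.63),
    (63.04, 63.54, 63.29),
]

# Largest absolute difference between the recomputed and the reported value.
max_gap = max(abs(hmean(a, b) - reported) for a, b, reported in rows)
print(f"largest gap: {max_gap:.4f}")
```

Small gaps are expected because the published scores are already rounded to two decimals before the harmonic mean is recomputed from them.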

Ranking Graphic