Overview - ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images
Abstract
Structured text extraction is one of the most valuable and challenging research directions in the field of Document AI. However, the scenarios covered by past benchmarks are limited, and the corresponding evaluation protocols usually focus on individual submodules of the structured text extraction pipeline. To address these problems, we set up two tracks for the Structured text extraction from Visually-Rich Document images (SVRD) competition:
- Track 1: HUST-CELL aims to evaluate the end-to-end performance of Complex Entity Linking and Labeling.
- Track 2: Baidu-FEST focuses on evaluating the end-to-end performance and generalization of Few-shot Structured Text extraction.
Compared to current document benchmarks, our two competition tracks greatly enrich the covered scenarios and contain more than 50 types of visually-rich document images (mainly from actual enterprise applications). In addition, our task settings not only include complex end-to-end entity linking and labeling in Track 1, but also provide zero-shot and few-shot tracks to objectively evaluate the performance and generalization of the submitted schemes. We believe that our competition will attract many researchers from the fields of CV and NLP, and bring new thoughts to the field of Document AI. There are four main tasks in this competition, which are detailed in the Tasks tab.
Benchmark Description
Track 1: HUST-CELL
Our proposed HUST-CELL goes beyond previous datasets in complexity in four distinct aspects. First, we provide 30 categories of documents with more than 4k documents, 2 times larger than existing English and Chinese datasets, including SROIE (973) [1], CORD (1,000) [2], EATEN (1,900) [3], FUNSD (199) [4], XFUND (1,393) [5], and EPHOIE (1,494) [6]. Second, HUST-CELL contains 400+ diverse keys and values. Third, HUST-CELL covers keys that are more challenging than those in other datasets, for instance, nested keys, fine-grained key-value pairs, multi-line keys/values, and long-tailed key-value pairs, as shown in Figure 1. Current state-of-the-art Key Information Extraction (KIE) techniques [7-9] fail to handle such cases, which are essential for a robust KIE system in the real world. Fourth, our dataset comprises real-world documents reflecting the real-life diversity of content and complexity of background, e.g., different fonts, noise, blur, and seals.
In this regard, considering the importance and huge application value of KIE, we set up Track 1 of the competition on complex entity linking and labeling.
The documents of HUST-CELL were collected from public websites and cover a variety of scenarios, e.g., receipts, certificates, and licenses from various industries. The language of the documents is mainly Chinese, along with a small portion of English. The number of images collected for each specific scenario varies, ranging from 10 to 300 with a long-tail distribution, which avoids introducing bias towards specific real application scenarios. Due to the complexity of the data sources, the diversity of this dataset is guaranteed. To permit public use, the data was collected from open websites, and images containing private information were removed for privacy protection. Some examples are shown in Figure 1.
Figure 1. Samples of HUST-CELL collected from various scenarios.
The dataset is split into a training set and a test set. The training set consists of 2,000 images, which will be available to the participants along with OCR and KIE annotations. The test set consists of 2,000 images, whose OCR and KIE annotations will not be released.
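The concrete annotation schema for Track 1 is defined by the files released with the training set. Purely as an illustration of what entity linking and labeling involves, the sketch below parses a hypothetical FUNSD-style record in which each entity carries a transcript, a semantic label, and a list of key-value links; the field names and sample record are assumptions for illustration, not the official SVRD format.

```python
import json

# Hypothetical FUNSD-style annotation (NOT the official SVRD schema): a list of
# entities, each with an id, a transcript, a label ("key" / "value" / "other"),
# and "linking" pairs [key_id, value_id] connecting a key entity to its value.
SAMPLE = """
{
  "entities": [
    {"id": 0, "text": "Name:",   "label": "key",   "linking": [[0, 1]]},
    {"id": 1, "text": "Li Hua",  "label": "value", "linking": [[0, 1]]},
    {"id": 2, "text": "Receipt", "label": "other", "linking": []}
  ]
}
"""

def extract_key_value_pairs(record: dict) -> list[tuple[str, str]]:
    """Resolve linking pairs into (key text, value text) tuples."""
    by_id = {e["id"]: e for e in record["entities"]}
    pairs = set()
    for entity in record["entities"]:
        for key_id, value_id in entity["linking"]:
            pairs.add((by_id[key_id]["text"], by_id[value_id]["text"]))
    return sorted(pairs)

if __name__ == "__main__":
    record = json.loads(SAMPLE)
    for key, value in extract_key_value_pairs(record):
        print(f"{key} -> {value}")  # e.g. "Name: -> Li Hua"
```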
Track 2: Baidu-FEST
Our proposed Baidu-FEST benchmark comes from practical scenarios, mainly covering finance, insurance, logistics, customs inspection, and other fields. Different applications have different requirements for the text fields of interest. In addition, the data collection methods in different scenarios may be affected by different cameras and environments; thus the benchmark is relatively rich and challenging.
Specifically, the benchmark contains about 11 kinds of synthetic business documents for training and 10 types of real visually-rich document images for testing. The document formats mainly consist of cards, receipts, and forms. Each type of document provides about 60 images.
Each image in the dataset is annotated with text-field bounding boxes (bboxes) and the transcript, class name, and class ID of each text bbox. Locations are annotated as rectangles with four vertices, given in clockwise order starting from the top. The annotations for an image are stored in a text file with the same filename prefix. Some examples of images and the corresponding annotations are shown in Figure 2.
Figure 2. Some visually-rich document samples of Baidu-FEST.
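The exact text-file layout of the Baidu-FEST annotations is specified in the released data; as a minimal sketch only, the code below assumes one comma-separated line per text field, consisting of the four clockwise vertices (eight coordinates), the transcript, the class name, and the class ID. The line format and file naming here are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class TextField:
    vertices: list[tuple[int, int]]  # four (x, y) points, clockwise from the top
    transcript: str
    class_name: str
    class_id: int

def parse_annotation_file(path: Path) -> list[TextField]:
    """Parse a hypothetical annotation file with lines of the form
    x1,y1,x2,y2,x3,y3,x4,y4,transcript,class_name,class_id."""
    fields = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split(",")
        coords = list(map(int, parts[:8]))
        # The transcript may itself contain commas, so rejoin the middle part.
        transcript = ",".join(parts[8:-2])
        class_name, class_id = parts[-2], int(parts[-1])
        fields.append(TextField(
            vertices=list(zip(coords[0::2], coords[1::2])),
            transcript=transcript,
            class_name=class_name,
            class_id=class_id,
        ))
    return fields

# Usage (assuming image_001.jpg is paired with image_001.txt):
# fields = parse_annotation_file(Path("image_001.txt"))
# print(fields[0].class_name, fields[0].transcript)
```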
References
[1] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019.
[2] Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019, 2019.
[3] He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, and Errui Ding. EATEN: Entity-aware attention for single shot visual text extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 254–259. IEEE, 2019.
[4] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. In ICDARW, volume 2, pages 1–6, 2019.
[5] Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, and Furu Wei. XFUND: A benchmark dataset for multilingual visually rich form understanding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3214–3224, 2022.
[6] Jiapeng Wang, Chongyu Liu, Lianwen Jin, Guozhi Tang, Jiaxin Zhang, Shuaitao Zhang, Qianying Wang, Yaqiang Wu, and Mingxiang Cai. Towards robust visual information extraction in real world: new dataset and novel solution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2738–2745, 2021.
[7] Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, and Rong Xiao. PICK: processing key information extraction from documents using improved graph learning-convolutional networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4363–4370. IEEE, 2020.
[8] Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, and Fei Wu. TRIE: end-to-end text reading and information extraction for document understanding. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1413–1422, 2020.
[9] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200, 2020.
[10] Haoyu Cao, Jiefeng Ma, Antai Guo, Yiqing Hu, Hao Liu, Deqiang Jiang, Yinsong Liu, and Bo Ren. GMN: Generative Multi-modal Network for Practical Document Information Extraction. NAACL, 2022.
[11] Haoyu Cao, Xin Li, Jiefeng Ma, Deqiang Jiang, Antai Guo, Yiqing Hu, Hao Liu, Yinsong Liu, and Bo Ren. Query-driven Generative Network for Document Information Extraction in the Wild. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4261–4271, 2022.
Challenge News
- 03/20/2023 SVRD: Task 4 few-shot examples available
- 03/15/2023 SVRD: Task 2 Test set available
- 03/06/2023 SVRD: Task 3 Test set available
- 02/11/2023 SVRD: Task 1&2 Training set Updated
- 01/31/2023 SVRD: Task 3 Training set Updated
- 01/12/2023 SVRD: Training set available
Important Dates
Note: The time zone of all deadlines is UTC-12. The cut-off time for all dates is 11:59 PM.
December 30, 2022
Website ready
January 10-12, 2023
1) Task 1&2 training dataset available
2) Task 3&4 training dataset available
March 10, 2023
1) Test set of task 1 available, submission open
March 15, 2023
1) Task 1 submission deadline
2) Test set of task 2 available and submission open
March 20, 2023
1) Task 2 submission deadline
------------------------------------------------------
March 6, 2023
1) Test set of task 3 available
March 10, 2023
1) Task 3 submission open
March 17, 2023
1) Task 3 submission deadline
2) Test set of task 4 available and submission open
3) Few-shot training examples of task 4 available
March 24, 2023
1) Task 4 submission deadline
March 25, 2023
Submit a reproducible script and a short description of the method for Tasks 1-4. (Detailed instructions will be uploaded.)
March 27, 2023
The notification for reproducible script submission has been sent to the top-5 participants via email.
Note: The Task 1&2 submission date has been extended.
Note: The Task 3&4 submission date has been extended.