Tasks - ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

Our proposed competition has two tracks totaling four main tasks:

Track 1: HUST-CELL

          Task-1: E2E Complex Entity Linking

          Task-2: E2E Complex Entity Labeling

Track 2: Baidu-FEST

          Task-3: E2E Zero-shot Structured Text Extraction

          Task-4: E2E Few-shot Structured Text Extraction

Task-1: E2E Complex Entity Linking

Task Description: This task aims to extract key-value pairs (entity linking) from the given images only, and save the key-value pairs of each image into a JSON file. For the training set, both KIE annotation files and human-checked OCR annotation files are provided, so the OCR annotation is clean and can be used as ground truth for the OCR task. The test set of Task 1 provides only images, without any annotation for either OCR or KIE. Methods are therefore required to accomplish both the OCR and KIE tasks in an end-to-end manner.

Ground Truth Format: The subset packages (see Downloads section) contain a folder with the document images and a JSON file with the ground truth annotations. The JSON file, called "train_label.json", has the following format:

{
  "0.jpg": {
    "Keys": {
      "0": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...],   # A unique ID number for every key: its corresponding coordinates and content
      "1": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...],   # A unique ID number for every key: its corresponding coordinates and content
      "2": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...],   # A unique ID number for every key: its corresponding coordinates and content
      "3": [{"Coords": [x1,y1,...,xn,yn], "Content": "->transcription"}, ...], # The prefix -> in the Content means this key is not an explicit key (it is invisible). The corresponding kv-pair is ignored when evaluating.
      ...
    },
    "KV-Pairs": [
      {
        "Keys": [0],      # The IDs of the entries in the Keys object above that this kv-pair is associated with
        "Values": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...],
        "row_index": None # The row of this pair inside a sheet; only used for sheet data, None for other document types
      },
      {
        "Keys": [1, 2],   # The IDs of the entries in the Keys object above that this kv-pair is associated with
        "Values": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...],
        "row_index": 1    # The row of this pair inside a sheet; only used for sheet data, None for other document types
      },
      {
        "Keys": [],       # An empty list means this pair has no explicit key. This situation is ignored when evaluating.
        "Values": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...],
        "row_index": 2    # The row of this pair inside a sheet; only used for sheet data, None for other document types
      },
      {
        "Keys": [3],
        "Values": [{"Coords": [x1,y1,...,xn,yn], "Content": "###"}, ...],  # ### means this pair has invisible values (an empty box). This situation is ignored when evaluating.
        "row_index": 2    # The row of this pair inside a sheet; only used for sheet data, None for other document types
      },
      ...
    ],
    "Backgrounds": [{"Coords": [x1,y1,...,xn,yn], "Content": "transcription"}, ...]
  }
}

In the JSON file, each [image_id].jpg corresponds to an image sample, where every sample contains all key-value pairs and background text instances, together with their bounding box coordinates and transcriptions. If a key or value consists of multiple lines or multiple instances, the corresponding Keys or Values list in the JSON contains multiple objects.
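For concreteness, the following is a minimal Python sketch of reading this ground-truth file and enumerating the kv-pairs that count for evaluation (the file path is an assumption; the skipping and merging rules follow the format description above and FAQ Q3/Q4):

import json

# Minimal sketch (the file path is an assumption): load the Task 1 ground truth and
# enumerate the kv-pairs that count for evaluation, following the rules described above.
with open("train_label.json", "r", encoding="utf-8") as f:
    labels = json.load(f)

for image_name, sample in labels.items():
    keys = sample["Keys"]  # key ID (string) -> list of {"Coords", "Content"} instances
    for pair in sample["KV-Pairs"]:
        key_ids = pair["Keys"]
        if not key_ids:  # no explicit key: ignored when evaluating
            continue
        # A "->" prefix marks an implicit (invisible) key: the pair is ignored when evaluating.
        if any(inst["Content"].startswith("->") for kid in key_ids for inst in keys[str(kid)]):
            continue
        if all(inst["Content"] == "###" for inst in pair["Values"]):  # invisible values: ignored
            continue
        # Multi-line key instances are merged directly; nested keys are joined with \t (FAQ Q3).
        key_text = "\t".join("".join(inst["Content"] for inst in keys[str(kid)]) for kid in key_ids)
        # Multi-line values are merged without any separator (FAQ Q4).
        value_text = "".join(inst["Content"] for inst in pair["Values"])
        print(image_name, repr(key_text), "=>", value_text)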

Submission Format: Participants will be asked to submit a single JSON file containing results for all test images, with the following format:

{
  "0.jpg": {
    "KV-Pairs": [
      {
        "Keys": [{"Content": "transcription"}],
        "Values": [{"Content": "transcription"}]
      },
      ...
    ]
  }
}

where the key of the JSON file is the file name of the corresponding test image. Participants need to extract all key-value pairs from the given images. Note that key-value pairs can be one-to-one, one-to-many, or many-to-many. If the predicted Keys are nested, participants need to merge them into one string instance using a \t separator in reading order before submitting (refer to FAQ Q3 below). If not merged, we will merge the Content values into one string according to the list order of Keys/Values when evaluating. If a value spans multiple lines, it needs to be merged into one string instance without any separator, in reading order. An example is shown in the following figure:

(Figure: SVRD_task1_submit_demo2.png — example of the Task 1 submission format)
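For illustration, below is a minimal Python sketch of writing predictions into this submission format. The predictions structure, the build_task1_submission helper, and the output file name are hypothetical; the \t and no-separator merge rules follow the description above.

import json

# Minimal sketch of writing predictions into the Task 1 submission format.
# `predictions` and `build_task1_submission` are hypothetical names; the merge rules
# (\t between nested keys, no separator between multi-line value fragments) follow the text above.
def build_task1_submission(predictions, out_path="task1_submit.json"):
    # predictions: {image_name: [(list_of_key_fragments, list_of_value_fragments), ...]}
    result = {}
    for image_name, pairs in predictions.items():
        kv_list = []
        for key_fragments, value_fragments in pairs:
            kv_list.append({
                "Keys": [{"Content": "\t".join(key_fragments)}],
                "Values": [{"Content": "".join(value_fragments)}],
            })
        result[image_name] = {"KV-Pairs": kv_list}
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

# Dummy usage: one image with a single predicted kv-pair.
build_task1_submission({"0.jpg": [(["key text"], ["value text"])]})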

Task-2: E2E Complex Entity Labeling

Task Description: The end-to-end complex entity labeling task is to extract the texts of a number of predefined key text fields from the given images (entity labeling), and save the texts for each image in a JSON file with the required format. Task 2 has 13 predefined entities. For the training set, both KIE annotation files and human-checked OCR annotation files are provided, so the OCR annotation is clean and can be used as ground truth for the OCR task. The test set of Task 2 provides only images, without any annotation for either OCR or KIE. Methods are therefore required to accomplish both the OCR and KIE tasks in an end-to-end manner.

Ground Truth Format: The subset packages (see Downloads section) contain a folder with the document images and a JSON file with the ground truth annotations. The JSON file, called "train_label.json", has the following format:

{
  "info": {
    "doc_id": -1,   # The type id of the document; -1 means a mixture of document types
    "entity": [     # The predefined entities
      {"caption_en": "The description of the entity in English",   # The caption_en in GT is not allowed to be modified.
       "caption_ch": "The description of the entity in Chinese",   # The caption_ch in GT is not allowed to be modified.
       "entity_id": 0   # The id of the entity; -1 means background text
      }, ... ]
  },
  "image_items": [{...}]
}

The "image_items" element is a list of dictionary entries with the following structure:

{
  "image_name": "0.jpg",   # The image filename corresponding to the document image
  "ocr_instances": [
    [{"entity_id": -1,            # The entity id of the text instance
      "bbox": [x1,y1,...,xn,yn],  # The coordinates of the text instance
      "text": "transcription",    # The text label of the text instance
      "sub_idx": 0                # The multi-instance id of the corresponding entity
     }],
    [{"entity_id": 0,             # The entity id of the text instance
      "bbox": [x1,y1,...,xn,yn],  # The coordinates of the text instance
      "text": "transcription",    # The text label of the text instance
      "sub_idx": 0                # The multi-instance id of the corresponding entity
     },
     {"entity_id": 0,             # The entity id of the text instance
      "bbox": [x1,y1,...,xn,yn],  # The coordinates of the text instance
      "text": "transcription",    # The text label of the text instance
      "sub_idx": 1                # The multi-instance id of the corresponding entity
     }],
    ... ]
}

Note: Due to the small number of samples for entity id 12, we have removed the corresponding data from the training set and test set in the updated annotation version.
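As a reading aid, the following minimal Python sketch (file path assumed) loads this ground truth and collects, per image, the text fragments of each labeled entity, keeping multi-instance entities (different sub_idx) apart:

import json
from collections import defaultdict

# Minimal sketch (file path assumed): load the Task 2 ground truth and collect, per image,
# the text of each labeled entity, keeping multi-instance entities (different sub_idx) apart.
with open("train_label.json", "r", encoding="utf-8") as f:
    gt = json.load(f)

entity_names = {e["entity_id"]: e["caption_en"] for e in gt["info"]["entity"]}

for item in gt["image_items"]:
    per_entity = defaultdict(lambda: defaultdict(list))  # entity_id -> sub_idx -> text fragments
    for group in item["ocr_instances"]:  # each group is a list of text instances
        for inst in group:
            if inst["entity_id"] == -1:  # background text, not a predefined entity
                continue
            per_entity[inst["entity_id"]][inst["sub_idx"]].append(inst["text"])
    for entity_id, subs in per_entity.items():
        for sub_idx, fragments in sorted(subs.items()):
            print(item["image_name"], entity_names.get(entity_id), sub_idx, "".join(fragments))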

Submission Format: Participants will be asked to submit a single JSON file containing results for all test images, with the following format:

{
  "0.jpg": [
    {"entity_id": 0, "text": "predicted_text"}, ... ],   # The value of text can be None if the entity does not exist.
  "1.jpg": [
    {"entity_id": 0, "text": "predicted_text"}, ... ]    # The value of text can be None if the entity does not exist.
}

where the key of the JSON file is the file name of the corresponding test image. Participants need to extract the texts of a number of predefined key text fields from the given images. Note that the 13 predefined keys of Task 2 may not all appear in a single image at the same time. For predefined key text fields that do not exist in the given image, the corresponding text value of the entity_id can be set to None. Besides, each entity_id accepts only one string instance; if a predefined key field spans multiple lines, participants need to merge it into one string instance. Tasks 3 and 4 also follow this protocol.
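A minimal Python sketch of producing such a file is given below; the predictions structure and helper name are hypothetical, while the use of None for missing entities and the single merged string per entity_id follow the description above.

import json

# Minimal sketch of writing Task 2 results; `predictions` and the helper name are hypothetical.
# Missing entities get text None, and each entity_id carries exactly one merged string,
# as required above. In practice the entity id list comes from the "info"/"entity" block of the GT.
def build_task2_submission(predictions, entity_ids, out_path="task2_submit.json"):
    # predictions: {image_name: {entity_id: [text fragments in reading order]}}
    result = {}
    for image_name, entities in predictions.items():
        result[image_name] = [
            {"entity_id": eid, "text": "".join(entities[eid]) if entities.get(eid) else None}
            for eid in entity_ids
        ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)  # None is written as JSON null

# Dummy usage with two entity ids, one of which is absent from the image.
build_task2_submission({"0.jpg": {0: ["predicted_", "text"]}}, entity_ids=[0, 1])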

Task-3: E2E Zero-shot Structured Text Extraction

Task Description: The zero-shot structured text extraction task is to extract the texts of a number of key fields from the given images, and save the texts for each image in a JSON file with the required format. Different from Task 2, there is no intersection between the scenarios of the provided training set and those of the provided test set. The training data consists of the real data provided by Track 1 and synthetic data generated officially. The caption_en and caption_ch in the GT can be used as prompts to assist extraction, but they are not allowed to be modified.

Ground Truth Format: The subset packages (see Downloads section) contain 11 folders with different types of document images and corresponding JSON files with the ground truth annotations. The JSON file, called "label.json", has a similar format to that of Task 2, except for the format of the ocr_instances items:

"ocr_instances":[

                        {"entity_id": -1, # The entity id of the text instance

                        "bbox": [x1,y1,...,xn,yn], # The coordinate of the text instance

                         "text": "transcription"   # The text label of the text instance

                         }

                         ... ] .

Submission Format: Participants will be asked to submit a ZIP file containing 10 JSON files, where the file name of each JSON file is the corresponding document id. The JSON format used for evaluation is the same as that of Task 2, as described above. In addition, the top 5 contestants will be invited to submit executable programs to verify the authenticity of their results.
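A minimal packaging sketch (folder and file names are assumptions) could look like this:

import zipfile
from pathlib import Path

# Minimal packaging sketch (folder and file names are assumptions): collect the per-document
# JSON result files, one per document id as described above, into a single ZIP for submission.
def pack_task3_submission(result_dir="task3_results", out_path="task3_submit.zip"):
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for json_file in sorted(Path(result_dir).glob("*.json")):
            zf.write(json_file, arcname=json_file.name)

pack_task3_submission()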

Task-4: E2E Few-shot Structured Text Extraction

Task Description: The few-shot structured text extraction task is to extract the texts of a number of key fields from the given images, and save the texts for each image in a JSON file with the required format. Different from Task 2, localization information and transcripts are provided, but the provided training set contains no more than five images for each scenario of the provided test set. The caption_en and caption_ch in the GT can be used as prompts to assist extraction, but they are not allowed to be modified.

Ground Truth Format: The ground truth of Task 4 has the same format as Task 3. Task 4 will provide few-shot training examples according to the competition schedule.

Submission Format: The submission format is the same as Task 3. In addition, the top 5 contestants will be invited to submit executable programs and reproducible procedures for at least three scenarios to verify the authenticity of their results.

Evaluation Protocol

Task 1 Evaluation

For Task 1, the evaluation metrics include two parts:

Normalized edit distance. For each predicted kv-pair that is matched with a GT kv-pair in the given image, the normalized edit distance (NED) between the predicted kv-pair s1 and the ground-truth kv-pair s2 is calculated as follows:

(Formula image: svrd_task1_ned_new.png — Task 1 NED score)

where n denotes the number of matched kv-pairs (a pair is matched when the normalized edit distances of both the key and the value are within the matching thresholds simultaneously; for details refer to the Matching Protocol below). s1_k/s2_k and s1_v/s2_v denote the content of the key and the value of the kv-pair s1/s2, respectively. Note that for predicted kv-pairs that are not matched in the GT of the given image, the edit distance is calculated between the predicted kv-pair and the empty string.
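The formula itself is given by the image above. For reference, the normalized edit distance between two strings a and b is conventionally defined as ed(a, b) / max(|a|, |b|), i.e. the raw edit distance normalized by the length of the longer string; this standard definition (an assumption on our part, not taken from the competition image) is what the matching sketch below uses.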

Matching Protocol: Given a predicted kv-pair s1 and a ground-truth kv-pair s2, the matching criterion is calculated as follows:

(Formula image: svrd_task1_match_new.png — Task 1 matching protocol)

where ed denotes the edit distance function. The thresholds factor_k and factor_v are both set to 0.15.
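A minimal Python sketch of this matching criterion, assuming the standard normalized edit distance defined above (the authoritative criterion is the competition's formula image), could look as follows:

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def normalized_ed(a: str, b: str) -> float:
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))

def is_matched(pred_key, pred_value, gt_key, gt_value, factor_k=0.15, factor_v=0.15):
    # A predicted kv-pair matches a GT kv-pair when both the key and the value are within
    # their normalized-edit-distance thresholds simultaneously.
    return (normalized_ed(pred_key, gt_key) <= factor_k
            and normalized_ed(pred_value, gt_value) <= factor_v)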

F-score. Considering all the predicted kv-pairs and all the GT kv-pairs, the F-score is calculated as follows:

(Formula image: SVRD_f1.png — F-score)
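A hedged reconstruction of this formula, using the counts defined just below (the authoritative expression is the image above), is the standard precision/recall/F1 combination:

Precision = N3 / N2,   Recall = N3 / N1,   score2 = 2 * Precision * Recall / (Precision + Recall)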

where N1 denotes the number of kv-pairs that exist in the given image, N2 denotes the number of predicted kv-pairs, and N3 denotes the number of perfectly matched kv-pairs (both the key and the value satisfy the matching protocol simultaneously; specifically, factor_k and factor_v in the matching protocol are set to 0). The final score is the weighted combination of score1 and score2:

(Formula image: SVRD_task1_eval_score.png — Task 1 final weighted score)

The final weighted score is used for ranking submissions in Task 1.

Task 2-4 Evaluation

For Task 2, Task 3, and Task 4, the evaluation metrics include two parts:

Normalized edit distance. For each predefined key text field that exists in the given image, the normalized edit distance (NED) between the predicted text s1 and the ground-truth text s2 is calculated as follows:

(Formula image: SVRD_ned.png — Task 2-4 NED score)

where n denotes the number of perfectly matched key text fields (both entity_id and text are predicted correctly). Note that for predicted key text fields that do not exist in the given image, the edit distance is calculated between the predicted text and the empty string.
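One common formulation consistent with this description (an assumption on our part; the authoritative expression is the formula image above) is

score1 = 1 - (1 / N) * Σ_i  ed(s1_i, s2_i) / max(|s1_i|, |s2_i|)

where N is the number of evaluated key text fields and the edit distance of a field absent from the prediction or the GT is taken against the empty string.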

F-score. Considering all the predicted key text fields and all the predefined key text fields, the F-score is calculated as follows:

(Formula image: SVRD_f1.png — F-score, as in Task 1)

where N1 denotes the number of key text fields that exist in the given image, N2 denotes the number of predicted key text fields, and N3 denotes the number of perfectly matched key text fields (both entity_id and text are predicted correctly). The final score is the weighted combination of score1 and score2:

(Formula image: SVRD_task2-4_final_score.png — Task 2-4 final weighted score)

The final weighted score is used for ranking submissions in Task 2, Task 3, and Task 4.

FAQ

Q1. For Task 2, some values belonging to predefined entities in the data are not labeled.

A1. The predefined entities of Task 2 are invisible keys. Even if a value matches a predefined category, it will not be labeled if it has an obvious kv-pair relationship in the image.

Q2. For Task 2, entity Institution (id 5) and entity Company (id 10) have similar meanings.

A2. When evaluating, we will merge these similar entity categories.

Q3. For Task 1, what is the merge rule for keys?

A3. a) If a key spans multiple lines or fractured OCR bounding boxes, the fragments should be merged directly to form a complete entity.

      b) If keys are nested, merge them into one string instance using a \t separator in reading order. Besides, if the document is a form/table, the order should follow the left-side table head first, then the top-side table head.

Q4. For Task 1, what is the merge rule for values?

A4. a) If a value spans multiple lines or fractured OCR bounding boxes, the fragments should be merged directly to form a complete value.

       b) If one key has multiple values (e.g., a table head with multi-row values), these values do not need to be merged and are output independently.

Q5. How can OCR results be obtained for the test set?

A5. There are two options:

      a) The organizers will provide an internal OCR API that generates OCR results, and participants can choose to use it. However, participants are not allowed to use any other open OCR API.

      b) Participants can train an OCR model using a publicly available dataset and obtain detection and recognition results for the test set.

 

Important Dates

Note: The time zone of all deadlines is UTC-12. The cut-off time for all dates is 11:59 PM.

December 30, 2022

Website ready

January 10-12, 2023

1) Task 1&2 training dataset available

2) Task 3&4 training dataset available

March 10, 2023

1) Test set of task 1 available, submission open

March 15, 2023

1) Task 1 submission deadline

2) Test set of task 2 available and submission open

March 20, 2023

1) Task 2 submission deadline

------------------------------------------------------

March 6, 2023

1) Test set of task 3 available

March 10, 2023

1) Task 3 submission open

March 17, 2023

1) Task 3 submission deadline

2) Test set of task 4 available and submission open

3) Few-shot training examples of task 4 available

March 24, 2023

1) Task 4 submission deadline

March 25, 2023

Submit reproducible scripts and a short description of the method for Tasks 1-4. (Detailed instructions will be uploaded.)

March 27, 2023

The notification for reproducible script submission has been sent to top-5 participants via email.

Note: Task 1&2  submission data has been extended.

Note: Task 3&4  submission data has been extended.