Tasks - Information Extraction in Historical Handwritten Records

NEW: The report of the ICDAR 2017 competition is available here.

 

The objective is to extract information from the records: to detect the named entities and assign each of them a semantic category, such as family name, place or occupation.

For this competition, we have manually labelled the marriage records with semantic information at word level. The lines and the records in this dataset have also been manually annotated, so each line is associated with its corresponding record.

The training and test sets are composed of:

  • Training set: 100 pages, containing 968 marriage records.
  • Test set: 25 pages, containing 253 marriage records.

To make participation easier for as many research teams as possible, we will provide:

  • Images of segmented text lines.
  • Images of segmented words.
  • Text files with the corresponding transcription.
  • Text files with the corresponding categories. The categories can be: name, surname, occupation, location, state.
  • Text files with the corresponding person: husband, husband’s father, husband’s mother, wife, wife’s father, wife’s mother, other_person (a different person from the ones mentioned before).
  • A CSV file with the list of transcriptions, categories and associated persons, with the following format:
    •   transcription_word1, category_word1, person_word1
    •   transcription_word2, category_word2, person_word2
    •   transcription_word3, category_word3, person_word3
    •   etc.
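
As an illustration of the three-column layout above, the following is a minimal Python sketch for loading such a file. It assumes the file has no header row and is comma-separated; the names `Entity` and `read_record_csv` are hypothetical and not part of the competition material.

```python
import csv
import io
from typing import NamedTuple


class Entity(NamedTuple):
    """One relevant word: its transcription, category and person."""
    transcription: str
    category: str
    person: str


def read_record_csv(stream) -> list[Entity]:
    """Parse rows of the form: transcription, category, person.

    Assumption: no header row; spaces after commas are ignored.
    Rows that do not have exactly three fields are skipped.
    """
    reader = csv.reader(stream, skipinitialspace=True)
    return [
        Entity(*(field.strip() for field in row))
        for row in reader
        if len(row) == 3
    ]


# Hypothetical sample content, built from entity names used on this page.
sample = io.StringIO("Juan, name, husband\nfolch, surname, husband\n")
entities = read_record_csv(sample)
```

For a real record, the `io.StringIO` object would be replaced by a file opened with `newline=""`, as the `csv` module recommends.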

 

The dataset is grouped into records. There is one folder per record, composed of:

  • Folder “lines”: Line Images, transcriptions, categories, persons.
  • Folder “words”: Word Images, transcriptions, categories, persons.
  • CSV file.

All TXT files are provided in correspondence, that is, each word in the marriage record has an associated category (exactly one category per word). This information has been manually checked to avoid inconsistencies, but take into account that some names and locations are composed of several words. For non-relevant words (e.g. conjunctions, prepositions, verbs, etc.) the category will be other and the person will be none.

For example:

MRecord_GT.png
 
In contrast, the CSV file will only contain relevant words. This means that only words with an associated category (e.g. names, locations, etc.) will appear in the CSV file. Note that the final goal is to simulate the filling in of a database.

Example of CSV file of the record shown above:

MRecord_tableCSV.png
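
The relation between the word-aligned TXT annotations and the CSV content can be sketched as a simple filter: every word labelled other is dropped and the rest are kept. The snippet below is only an illustration. The function name `relevant_words`, the in-memory list representation, the 0-based word index and the sample words are all assumptions.

```python
def relevant_words(transcriptions, categories, persons):
    """Keep only the words whose category is not 'other'.

    The three lists are assumed to be aligned word by word.
    Returns (word_index, transcription, category, person) tuples,
    where word_index is the 0-based position in the line (an assumption).
    """
    if not (len(transcriptions) == len(categories) == len(persons)):
        raise ValueError("the three lists must be aligned word by word")
    aligned = zip(transcriptions, categories, persons)
    return [
        (index, word, category, person)
        for index, (word, category, person) in enumerate(aligned)
        if category != "other"
    ]


# Hypothetical aligned annotations for a short line fragment.
words = ["Juan", "Batista", "folch", "de", "Sabadell"]
cats = ["name", "name", "surname", "other", "location"]
pers = ["husband", "husband", "husband", "none", "husband"]
kept = relevant_words(words, cats, pers)
```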

 

EVALUATION

Participants must provide, for each record, the CSV file with the transcription of the relevant words (i.e. named entities) with their semantic category. However, providing the person associated with each category is optional. Therefore, participants can decide in which track they would like to participate, either:

  1. Track 1 (Basic): The CSV must contain the transcription and the semantic category (name, surname, occupation, etc.).
  2. Track 2 (Complete): The CSV must contain the transcription, the semantic category and the person (husband, wife, etc.).


Track 1 - Basic.

Provide a compressed ZIP file containing ONE CSV file per record.

EVALUATION AT WORD LEVEL
In the CSV file, please provide the identifier of the word as follows:

CSV Format:

        File name: "idPageX_RecordY_output.csv"

        idPageX_RecordY_LineZ_Word1,transcription_word1,category_word1
        idPageX_RecordY_LineZ_Word2,transcription_word2,category_word2

Example of CSV format:

        File "idPage10406_Record2_output.csv"

        idPage10406_Record2_Line0_Word4,Juan,name
        idPage10406_Record2_Line0_Word5,Batista,name
        idPage10406_Record2_Line0_Word6,folch,surname
        idPage10406_Record2_Line0_Word7,pages,occupation
        idPage10406_Record2_Line0_Word9,Sabadell,location
        etc.
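
A minimal Python sketch of how such a file could be generated is given below. The helper names `output_filename` and `write_submission` are hypothetical, and the `with_person` switch (which appends the person column used in Track 2) is an illustrative design choice, not something prescribed by the organizers.

```python
import csv
import io


def output_filename(page_id, record):
    """Build the file name following the pattern idPageX_RecordY_output.csv."""
    return f"idPage{page_id}_Record{record}_output.csv"


def write_submission(stream, page_id, record, entities, with_person=False):
    """Write one row per relevant word.

    entities: iterable of (line, word, transcription, category, person).
    with_person=False produces the three-column Track 1 layout;
    with_person=True appends the person column.
    """
    writer = csv.writer(stream, lineterminator="\n")
    for line, word, transcription, category, person in entities:
        identifier = f"idPage{page_id}_Record{record}_Line{line}_Word{word}"
        row = [identifier, transcription, category]
        if with_person:
            row.append(person)
        writer.writerow(row)


# Entities taken from the example above.
example = [
    (0, 4, "Juan", "name", "husband"),
    (0, 5, "Batista", "name", "husband"),
    (0, 6, "folch", "surname", "husband"),
]
buffer = io.StringIO()
write_submission(buffer, 10406, 2, example)
track1_text = buffer.getvalue()
```

When writing to disk, the stream would be a file opened with `newline=""` under the name returned by `output_filename`.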

 

EVALUATION AT LINE LEVEL
If you have only used the images of segmented text lines (not the segmented words), provide the identifier of the line instead:

        File "idPage10406_Record2_output.csv"

        idPage10406_Record2_Line0,Juan,name
        idPage10406_Record2_Line0,Batista,name
        idPage10406_Record2_Line0,folch,surname
        etc.
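
Since identifiers may refer either to a word or to a whole line, a small validator can help catch malformed rows before submission. The sketch below assumes that the page, record, line and word numbers are plain decimal integers, as in the examples; `ID_PATTERN` and `parse_identifier` are hypothetical names.

```python
import re

# Word-level identifiers end in _WordN; line-level identifiers omit it.
ID_PATTERN = re.compile(r"^idPage(\d+)_Record(\d+)_Line(\d+)(?:_Word(\d+))?$")


def parse_identifier(identifier):
    """Split an identifier into its numeric parts.

    Returns a dict with keys page, record, line and word;
    word is None for line-level identifiers.
    """
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"malformed identifier: {identifier!r}")
    page, record, line, word = match.groups()
    return {
        "page": int(page),
        "record": int(record),
        "line": int(line),
        "word": None if word is None else int(word),
    }


word_level = parse_identifier("idPage10406_Record2_Line0_Word4")
line_level = parse_identifier("idPage10406_Record2_Line0")
```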

 

Track 2 - Complete.

Provide a compressed ZIP file containing ONE CSV file per record.

EVALUATION AT WORD LEVEL
In the CSV file, please provide the identifier of the word as follows:

CSV Format:

        File name: "idPageX_RecordY_output.csv"

        idPageX_RecordY_LineZ_Word1,transcription_word1,category_word1,person_word1
        idPageX_RecordY_LineZ_Word2,transcription_word2,category_word2,person_word2

Example of CSV format:

        File "idPage10406_Record2_output.csv"

        idPage10406_Record2_Line0_Word4,Juan,name,husband
        idPage10406_Record2_Line0_Word5,Batista,name,husband
        idPage10406_Record2_Line0_Word6,folch,surname,husband
        idPage10406_Record2_Line0_Word7,pages,occupation,husband
        idPage10406_Record2_Line0_Word9,Sabadell,location,husband
        etc.


EVALUATION AT LINE LEVEL
If you have only used the images of segmented text lines (not the segmented words), provide the identifier of the line instead:

        File "idPage10406_Record2_output.csv"

        idPage10406_Record2_Line0,Juan,name,husband
        idPage10406_Record2_Line0,Batista,name,husband
        idPage10406_Record2_Line0,folch,surname,husband
        etc.
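
Once the per-record CSV files exist, they have to be packed into a single ZIP file. The following sketch uses Python's standard `zipfile` module. The function name `pack_submission` and the flat layout inside the archive (no subfolders) are assumptions, as the required internal layout is not specified here.

```python
import tempfile
import zipfile
from pathlib import Path


def pack_submission(csv_dir, zip_path):
    """Add every idPage*_Record*_output.csv found in csv_dir to a ZIP file.

    Files are stored at the top level of the archive (an assumption).
    Returns the list of file names that were packed.
    """
    csv_files = sorted(Path(csv_dir).glob("idPage*_Record*_output.csv"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in csv_files:
            archive.write(path, arcname=path.name)
    return [path.name for path in csv_files]


# Small self-contained demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    record_file = Path(tmp) / "idPage10406_Record2_output.csv"
    record_file.write_text(
        "idPage10406_Record2_Line0_Word4,Juan,name,husband\n", encoding="utf-8"
    )
    zip_file = Path(tmp) / "submission.zip"
    packed = pack_submission(tmp, zip_file)
    with zipfile.ZipFile(zip_file) as archive:
        names_in_zip = archive.namelist()
```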

 

METRICS

The evaluation will be done at marriage record level. Since the focus of the competition is on information extraction, the semantic label will be prioritized. This means that if the semantic category of a word is incorrect, the transcription is not taken into account. Conversely, if the semantic category has been correctly detected, the Character Error Rate (CER) is used to evaluate the transcription. Finally, the mean of the scores will be computed.


Track 1 - Basic.

For each relevant word in the marriage record, the score will be computed as follows:

  • Score at category level:

    • 0 if the category is incorrect.

    • If the category is correct, the score lies in the range [0, 1], depending on the CER of the transcription.
       

Track 2 - Complete.

For each relevant word in the marriage record, two scores will be computed as follows:

  • Score at category level:

    • 0 if the category is incorrect.
    • If the category is correct, the score lies in the range [0, 1], depending on the CER of the transcription.

  • Score at person level:

    • 0 if the person or the category is incorrect.

    • If both the person and the category are correct, the score lies in the range [0, 1], depending on the CER of the transcription.

 

Note that the score at category level in both tracks is the same, so they are directly comparable. This means that participants in Track 2 are also participating in Track 1.
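
The scoring rules above can be approximated with the following Python sketch. It is not the official evaluation tool. In particular, the mapping from CER to a score (here 1 - CER, clipped at 0) is an assumption, since the text only states that the score lies in [0, 1] and depends on the CER. The sketch also assumes that each predicted word has already been matched to its ground-truth word. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def word_scores(truth, predicted):
    """Return (category_score, person_score) for one relevant word.

    truth and predicted are (transcription, category, person) tuples.
    Assumption: a correct label is scored as max(0, 1 - CER).
    """
    ref_text, ref_category, ref_person = truth
    hyp_text, hyp_category, hyp_person = predicted
    if hyp_category != ref_category:
        return 0.0, 0.0
    transcript_score = max(0.0, 1.0 - cer(ref_text, hyp_text))
    person_score = transcript_score if hyp_person == ref_person else 0.0
    return transcript_score, person_score


# One substituted character out of four, correct category, wrong person.
category_score, person_score = word_scores(
    ("Juan", "name", "husband"), ("Joan", "name", "wife")
)
```

In this usage example the category score is 0.75 and the person score is 0, because the person label does not match.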


 


Important Dates

Registration: February 15 - May 15, 2017

Training set available: February 15, 2017

Test set available: May 20, 2017

Submission of results: until July 9, 2017 (final extension).