Overview - Information Extraction in Historical Handwritten Records
Extracting relevant information from historical handwritten document collections is a key step in making these manuscripts accessible and searchable. In this context, the objective is to move beyond pure transcription towards document understanding. Concretely, the aim is to detect the named entities and assign each of them a semantic category, such as family names, places, occupations, etc.
A typical application scenario of named entity recognition is demographic documents, since they contain people's names, birthplaces, occupations, etc. In this scenario, extracting the key contents and storing them in databases gives access to those contents and makes it possible to envision innovative services based on genealogical, social or demographic searches.
Lately, the interest of the document image analysis community in document understanding, named entity recognition and semantic categorization has been growing, and some techniques based on HMMs, BLSTMs and CNNs have been proposed. With this competition, we aim to foster research in this field and offer a benchmark for the research community.
The Esposalles dataset
For this competition we will use 125 pages of the Esposalles database, a marriage license book conserved at the Archives of the Cathedral of Barcelona. The corpus is written in old Catalan by a single writer in the 17th century. Each marriage record contains information about the husband’s occupation, place of origin, husband’s and wife’s former marital status, parents’ occupation, place of residence, geographical origin, etc.
The structure of the marriage record tends to follow a regular expression. Some anchor words (in bold) separate the different persons, as follows:
In some cases, other persons may appear in the record. For example, when a widow gets married again, the record may include information on the former husband/wife. In those cases, the information on the wife’s parents usually disappears:
Note that the above structures are usually followed. However, in some cases, the marriage records show variations.
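As a rough illustration of how anchor words can separate the persons in a record, the sketch below splits a transcribed line on a small list of anchors. The anchor words, the function name split_on_anchors and the sample record are all assumptions made up for this example; they are not the official anchor list or a line from the dataset, and real records may deviate from this pattern.

```python
import re

# Hypothetical anchor words, assumed for illustration only; the actual
# anchors used in the records are not listed in this text.
ANCHORS = ["fill de", "filla de", "ab"]


def split_on_anchors(record, anchors=ANCHORS):
    """Split a transcribed record into (anchor, segment) pairs.

    The text before the first anchor is returned with anchor None.
    """
    # Try longer anchors first so a longer anchor wins over a shorter one.
    ordered = sorted(anchors, key=len, reverse=True)
    alternatives = "|".join(re.escape(a) for a in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b")

    # With one capturing group, re.split alternates text and matched anchor.
    parts = pattern.split(record)

    segments = []
    head = parts[0].strip()
    if head:
        segments.append((None, head))
    for anchor, text in zip(parts[1::2], parts[2::2]):
        segments.append((anchor, text.strip()))
    return segments


# Invented record, not taken from the dataset.
sample = ("rebere de Joan Pla pages fill de Pere Pla pages "
          "ab Maria donsella filla de Antoni Roca sastre")

for anchor, segment in split_on_anchors(sample):
    print(anchor, "|", segment)
```

Each resulting segment would then still need its words labelled with semantic categories (name, occupation, place, and so on); the split only narrows down which person a segment describes.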
Registration: February 15 - May 15, 2017
Training set available: February 15, 2017
Test set available: May 20, 2017
Submission of results: until July 9, 2017 (final extension).