Tasks - Out of Vocabulary Scene Text Understanding

End-to-End Text Recognition

[Image: mlt.jpg]

In this task, the participants will be provided with images and expected to localize and recognize all the text in the image at word granularity. For example, given the above image, we expect the methods to output "TARONGA", "ZOOLOGICAL", and "PARK" along with their 4-point bounding box in the image.

The results will be evaluated and ranked in terms of their performance on the subset of Out of Vocabulary (OOV) words, as well as their balanced performance between OOV and In Vocabulary (IV) words. In particular, we will report performance over different word subsets (1) on OOV words only, which in this case includes only "TARONGA", (2) on IV words only, which in this case includes "ZOOLOGICAL" and "PARK", and (3) a balanced metric that is the average of the performance between the OOV and IV words. The evaluation is CASE-SENSITIVE.
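The reporting scheme above can be sketched in a few lines. The function below is only an illustration of how the three numbers relate, not the official evaluation tool; the tuple representation of per-word results is an assumption made for the example.

```python
# Illustrative sketch: derive the OOV-only, IV-only, and balanced scores
# from per-word results. Each result is a (is_oov, correct) pair; this
# representation is hypothetical, not the official evaluation format.
def split_scores(results):
    def accuracy(subset):
        return sum(c for _, c in subset) / len(subset) if subset else 0.0
    oov = [r for r in results if r[0]]
    iv = [r for r in results if not r[0]]
    oov_score = accuracy(oov)
    iv_score = accuracy(iv)
    # The balanced metric is the plain average of the two subset scores.
    return oov_score, iv_score, (oov_score + iv_score) / 2

# Example mirroring the image above: "TARONGA" is OOV, the other two are IV.
print(split_scores([(True, True), (False, True), (False, False)]))
# → (1.0, 0.5, 0.75)
```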

We consider as top performers in the competition both methods that top the ranking on OOV words only, and methods that display a balanced performance: they yield good results on OOV words without undermining their performance on IV words. Top-performing methods will be recognized in both categories.

The evaluation protocol proposed by Wang et al. [2] will be used, which considers a detection a match if it has an Intersection over Union (IoU) above 0.5 with any ground truth bounding box (same as [1] and Task 4.1) and the word transcriptions match, case-sensitively.
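The matching rule can be sketched as follows. Note that the actual protocol operates on 4-point polygons; for brevity this sketch assumes axis-aligned boxes, so it is a simplified illustration rather than the official matcher.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
    Simplification: the real protocol uses 4-point polygons."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_match(det_box, det_text, gt_box, gt_text):
    # A detection matches a ground truth when IoU exceeds 0.5 and the
    # transcriptions are identical, case-sensitive.
    return iou(det_box, gt_box) > 0.5 and det_text == gt_text
```

Case sensitivity means, for example, that a prediction of "taronga" against a ground truth of "TARONGA" does not count as a match even with perfect localization.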

The alphabet is limited to the following characters:

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '€', '£', '¥', '°', '₹']
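A convenient way to work with this alphabet: the ASCII portion of the list is exactly Python's digits, letters, and punctuation, with the five extra symbols appended. The helper below is an informal sanity check, not an official tool.

```python
import string

# Reconstructing the task alphabet: the ASCII part matches Python's
# built-in character sets, plus the five extra symbols listed above.
ALPHABET = set(string.digits + string.ascii_letters + string.punctuation + "€£¥°₹")

def is_in_alphabet(word):
    """True if every character of a transcription is in the task alphabet."""
    return all(ch in ALPHABET for ch in word)
```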

 

Instructions for downloading the images and annotation data can be found in the Downloads tab.

Cropped Word Text Recognition

[Image: taronga.png]

In the cropped word text recognition challenge, we focus only on the recognition of previously extracted words. We ask the participants to recognize all cropped words extracted from the images of the test set, including OOV and IV words. As an example, the participants will be asked to recognize the text given in the above image, and the accepted transcription will be "TARONGA". As before, the evaluation is CASE-SENSITIVE.

We report the total Edit Distance (ED) and the percentage of Correctly Recognised Words (CRW) for both OOV and IV words, as well as a balanced metric that is the average CRW between the OOV and IV words. Similarly as in Task 1, we consider as top performers in the competition both methods that top the ranking on OOV words only, and methods that display a balanced performance: they yield good results on OOV words without undermining their performance on IV words. Top-performing methods will be recognized in both categories.
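The two quantities reported here are standard and can be sketched directly. This is an illustrative implementation under the usual definitions (Levenshtein distance, exact-match word accuracy), not the official scoring script.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def crw(predictions, ground_truths):
    """Percentage of Correctly Recognised Words, case-sensitive."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)
```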

Instructions for downloading the images and annotation data can be found in the Downloads tab.

 

Submission Format

Participants can submit results for the end-to-end and cropped word recognition tasks as a single JSON file in the formats described below.

End-To-End

Submissions to the end-to-end task have to be uploaded as a single JSON file. This file needs to contain a list of dictionaries (one for each image), each with the following fields:

[
    {
        "image_id": the unique identifier of the image (must be a string),
        "text": [
            {
                "transcription": the transcription of the detection (must be a string),
                "confidence": the confidence of the detection (must be a float),
                "vertices": the bounding box of the detection as a list of 4 clockwise vertices (each one a list of two x and y coordinates) in the format [[x1, y1], [x2, y2], [x3, y3], [x4, y4]] (the vertices can be integers or floats; they will be cast to float)
            },
            ...
        ]
    },
    ...
]
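A submission in this format can be produced with the standard `json` module. The sketch below uses a placeholder image ID and made-up coordinates; only the field names and nesting follow the format above.

```python
import json

# Hypothetical detection results: "img_00001", the coordinates, and the
# confidence are placeholders, not real values from the dataset.
submission = [
    {
        "image_id": "img_00001",
        "text": [
            {
                "transcription": "TARONGA",
                "confidence": 0.97,
                "vertices": [[10, 20], [210, 20], [210, 60], [10, 60]],
            }
        ],
    }
]

# ensure_ascii=False keeps non-ASCII characters (e.g. "€") readable.
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False)
```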

In the validation split, the ID of each image can be found in the "image_id" field of the dictionaries in the JSON validation ground-truth file. In the test split, each image is named with its ID followed by the image's file extension.

Cropped Word Recognition

Submissions to the cropped word recognition task have to be uploaded as a single JSON file. This file needs to contain a list of dictionaries (one for each cropped word), each with the following fields:

[
    {
        "text_id": the unique identifier of the text instance crop, must be a string (this is different from the image_id),
        "transcription": the transcription of the crop (must be a string)
    },
    ...
]

Each cropped word in the validation and test splits is named by its ID followed by the file extension.
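Before uploading, it may be worth sanity-checking the file against the format above. The helper below is an illustrative check written for this page, not the official validator.

```python
import json

def validate_cropped_submission(path):
    """Lightweight sanity check of a cropped-word submission file.
    Illustrative only; the server may apply stricter checks."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    assert isinstance(entries, list), "top level must be a list"
    for entry in entries:
        assert isinstance(entry["text_id"], str), "text_id must be a string"
        assert isinstance(entry["transcription"], str), "transcription must be a string"
    return len(entries)
```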

References

[1] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective", International Journal of Computer Vision, 111(1), 98-136, 2014.

[2] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition", in Proc. IEEE International Conference on Computer Vision (ICCV), pp. 1457-1464, November 2011.

Important Dates

11 May 2022: Web site online

15 June 2022: Test set available

12 July 2022: Important: Test set was updated to include more diverse data. Please download the new test set.

20 July 2022: Submission of results deadline

25 July 2022: Announcement of results and winners

October 2022: Results presentation at the TiE Workshop @ ECCV 2022