Tasks - RoadText Competition on Video Text Detection, Tracking and Recognition

The challenge comprises a single task: accurately identify and locate text in a video in both time and space. This includes identifying when each text instance appears and disappears in the video, as well as locating it in every individual frame. Correctly recognized words must also be accurately localized in every frame and tracked consistently throughout the entire video sequence. The annotations in the dataset are at the line level, meaning that words on the same line that are intended to be read together should be tracked within a single bounding box.

 

Ground Truth Format

The ground truth will be a JSON file in the following format:

{
  "0": {                         // Video number
    "1": {                       // Frame number
      "labels": [                // Array of all text objects in the frame; null if no objects are present
        {
          "box2d": {             // Bounding box coordinates in Pascal VOC format
            "x1": 560.1225114854518,
            "x2": 582.1745788667688,
            "y1": 208.7595824260624,
            "y2": 218.68301274765506
          },
          "category": "English", // Category of the object: "English", "Non_English_Legible" or "Illegible"
          "id": 0,               // Object identifier; all instances of the same object in a video share the same id
          "ocr": "one way"       // Textual transcription of the text line
        }
      ]
    }
  }
}

 

Submission Format

Submit a single JSON file in the following format:

{
  "tracking": {
    "701": {                     // Video number
      "1": {                     // Frame number
        "labels": [              // Array of all detected text objects in the frame; null if no objects are present
          {
            "box2d": {           // Bounding box coordinates in Pascal VOC format
              "x1": 560,
              "x2": 582,
              "y1": 208,
              "y2": 218
            },
            "id": 0              // Object identifier; unique per tracked object within a video
          }
        ]
      }
    }
  },
  "recognition": {
    "701": {                     // Video number
      "0": "one way"             // Key is the object id of the tracked object; value is the recognized text, or null if illegible
    },
    "702": {}                    // {} when there are no recognitions in the video
  }
}
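
A minimal Python sketch for assembling a submission of this form is given below; the detections and texts dictionaries are hypothetical results used only for illustration:

import json

# Hypothetical per-video tracking results and recognized texts (illustration only)
detections = {"701": {"1": [{"box2d": {"x1": 560, "x2": 582, "y1": 208, "y2": 218}, "id": 0}]}}
texts = {"701": {"0": "one way"}, "702": {}}

submission = {"tracking": {}, "recognition": texts}
for video_id, frames in detections.items():
    submission["tracking"][video_id] = {
        frame_id: {"labels": labels if labels else None}  # null when no objects are present
        for frame_id, labels in frames.items()
    }

with open("submission.json", "w") as f:
    json.dump(submission, f)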

Please note that areas of Illegible and Non_English_Legible text will not be taken into account during evaluation: a method will not be penalized if it fails to detect these words, nor will a method that does detect them receive a higher score.

Evaluation

Unlike most scene text datasets, which annotate individual words (typically split at spaces), annotations in RoadText-1K are at the line level. (Disclaimer: some of the data in the train and validation sets may have word-level annotations.) Evaluation will therefore be carried out at the line level.

Evaluation of text localisation and tracking will be based on the evaluation framework of the CVPR19 MOTChallenge (Multiple Object Tracking) benchmark [1]. In particular, we make use of the publicly available py-motmetrics library (https://github.com/cheind/py-motmetrics). The CVPR19 MOTChallenge evaluation, following recent trends for quantitative evaluation of multiple target tracking, is based on two sets of measures: the CLEAR-MOT metrics [2] and ID metrics [3,4].

The MOTChallenge evaluation framework offers a number of different performance measures. The methods will be ranked based on the MOTA metric.
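
As a reference point, the sketch below shows how MOTA can be computed with py-motmetrics on a toy single-frame example; the box values are made up for illustration, and the only RoadText-specific step is converting the Pascal VOC corners (x1, y1, x2, y2) to the (x, y, width, height) format expected by iou_matrix:

import numpy as np
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)

# Toy data: one ground-truth and one hypothesis box as (x, y, width, height)
gt_ids = [0]
gt_boxes = np.array([[560.1, 208.8, 22.1, 9.9]])   # (x1, y1, x2 - x1, y2 - y1)
hyp_ids = [0]
hyp_boxes = np.array([[560.0, 208.0, 22.0, 10.0]])

# Distance is 1 - IoU; pairs with IoU below 0.5 are marked as non-matchable
dist = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(gt_ids, hyp_ids, dist)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "motp", "idf1"], name="demo")
print(summary)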

Text line recognition performance is evaluated by whether the recognized transcription is correct. The comparison is case-insensitive and accent-insensitive, and recognition performance is computed only for the English category. A predicted text line is classified as a true positive if its intersection over union with a ground-truth line is greater than 0.5 and its transcription is correct. Recognition is ignored for "without recognition" results.
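
The case- and accent-insensitive matching can be reproduced with a short Python sketch like the one below; the exact normalization is an assumption (Unicode NFD decomposition with combining marks removed):

import unicodedata

def normalize(text):
    # Case-insensitive, accent-insensitive form: casefold, decompose (NFD),
    # then drop combining accent marks (assumed normalization)
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def iou(a, b):
    # Intersection over union of two Pascal VOC boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as a true positive if IoU > 0.5 and the
# normalized transcriptions match
def is_true_positive(pred_box, pred_text, gt_box, gt_text):
    return iou(pred_box, gt_box) > 0.5 and normalize(pred_text) == normalize(gt_text)

print(is_true_positive((560, 208, 582, 218), "One Way",
                       (560.1, 208.8, 582.2, 218.7), "one way"))  # True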

References

[1] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K. & Leal-Taixé, L. CVPR19 Tracking and Detection Challenge: How crowded can it get? arXiv preprint arXiv:1906.04567, 2019.

[2] Bernardin, K. & Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing, 2008(1):1-10, 2008.

[3] Ristani, E., Solera, F., Zou, R., Cucchiara, R. & Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In ECCV workshop on Benchmarking Multi-Target Tracking, 2016.

[4] Li, Y., Huang, C. & Nevatia, R. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

Important Dates

24-31 December 2022: Initial website launch

24-31 December 2022: Initial training data release

15 February 2023: Full training data along with test data release

1 March 2023: Submission site open

20 March 2023: Deadline for competition submissions

27 March 2023: Extended deadline for competition submissions

1 May 2023: Initial submission of competition report

21-26 August 2023: Result announcement and presentation

 

All deadlines are in the Anywhere on Earth (AoE) time zone.