F.A.Q. - Focused Scene Text

General

Why do I need to register?

Registering is important for us as it gives us an indication of the possible participation in the competition, and also a way to contact potential participants in case we need to communicate useful information about the competition (and only about the competition). You will need to be registered in order to get access to the "Downloads" and "Submit Results" sections.

 

Am I obliged to participate if I register?

No, registration is only meant to be an expression of interest, and it will give you access to the "Downloads" and "Submit Results" sections.

 

Do I have to participate in all of the tasks of the Challenge?

No. You can participate in any of the tasks, and in as many of them as you wish.

 

I noticed there are five Challenges organised under the “ICDAR 2017 Robust Reading Competitions” site. Do I have to participate in other challenges as well?

No, you do not have to. But we would really appreciate it if you did!

We have strived to structure all challenges in a similar way where possible, so if you have a system that can be trained and produce results for one of them, then it should be easy to adapt it to some (if not all) of the rest! So, the additional effort is minimal in some cases. Not to mention that you get more chances to win :)

 

Why have you organised different challenges on static images? Challenges 1, 2 and 4 seem to be very similar.

There are crucial differences between the Born-Digital (Challenge 1) and Real Scene (Challenges 2 and 4) text. Real scene images are captured by high-resolution cameras and might suffer from illumination problems, occlusions and shadows. Born-digital images, on the other hand, are designed directly on the computer: the text is designed in situ and might suffer from compression or anti-aliasing artefacts, the fonts used are very small, and the resolution is 72 dpi, as these images are designed to be transferred online. There are more differences to list, but the main point here is that algorithms that work well in one domain will not necessarily work well in the other. The idea of hosting different challenges and addressing both domains in parallel is to try to qualify and quantify the similarities and the differences, and to establish the state of the art in both domains.

Between Challenge 2 and Challenge 4 there are crucial differences as well. Challenge 2 comprises what could be called "focused text", referring to the fact that the textual content is the focus of the capture. This scenario is typical of translation applications, for example, where the user points the camera at the text of interest and puts in reasonable effort to capture a good-quality photo. Text in such scenarios is typically well focused and follows horizontal layouts. Challenge 4 covers the completely different, and more challenging, scenario of "incidental text", referring to text that appears in the scene without the user having taken any specific prior action to cause its appearance or improve its positioning or quality in the frame. Incidental scene text covers a wide range of applications linked to wearable cameras or massive urban captures, where the capture is difficult or undesirable to control.

 

How is Challenge 3 (videos) different from Challenges 2 / 4 (static images)? Isn't the case of video equivalent to running the text extraction algorithm on all frames one by one?

A key aspect of video text extraction is the ability of the algorithm to track the text box over different frames. We therefore expect solutions that can demonstrate this ability, and our evaluation framework penalises algorithms with faults in the tracking part.

 

I found a mistake in the ground truth! What can I do?

Please let us know by sending us a note at rrc@cvc.uab.es or robustreadingcompetition@gmail.com. If we receive your note early, we will strive to correct any errors found in the ground truth and include your correction in an updated version of the dataset that will be released about 10 days before the submission deadline. Before you report any errors, please review the protocol we have used for ground truthing to make sure you are reporting a real error, and not the result of a conscious decision of our ground truthers. We really appreciate your help!

Challenges 1 and 2

Your "Text Localisation" ground truth seems to be at the level of words, but my algorithm is made to locate whole text lines! Are you going to penalise my algorithm during evaluation?

We will do our best not to penalise such behaviour. This was actually one of the few issues reported by authors after past Robust Reading competitions. For the evaluation of this task we have implemented the methodology described in [2]. This methodology addresses the problem of one-to-many and many-to-one correspondences of detected areas in a satisfactory way, and algorithms that are not designed to work at the word level should not be penalised.

 

I see that not every piece of text in the images is ground truthed. Is this an error?

We aim to ground truth every bit of text in the images; there are, however, cases where we consciously do not include certain text in the ground truth description. These are the following:

  • Characters that are partially cut (see for example the cut line at the bottom of Figure 1a, which is not included in the ground truth). Cut text usually appears when a large image is split into a collage of many smaller ones; traditionally this practice was used to speed up the download of Web pages, but it is not encountered much nowadays.
  • Text that was not meant to be read but appears in the image accidentally as part of the photographic content (see for example the names of the actors on the "The Ugly Truth" DVD in Figure 1b). The text there can only be inferred from the context; it was never meant to be read. On the contrary, we do include text that is part of the photographic content when its presence in the image is not accidental (for example, the names of the movies in Figure 1b are indeed included in the ground truth).
  • Text that we cannot read in general. This can be because of very low resolution, for example, but there are other cases as well. See for example the image in Figure 1c: the word "twitter" seems to be used as the background, behind "follow". This is treated as background and is not included in the ground truth.

In any other case, we probably have made a mistake, so please let us know!

 

Why are there two evaluation protocols ("ICDAR 2013" and "DetEval") for Text Localisation?

The "ICDAR 2013" evaluation protocol for the text localization task is as described in the report of the competition [1], and is based on the framework described in [2]. The "ICDAR 2013" evaluation protocol is a custom implementation, tightly integrated to the competition Web portal in order to enable the advanced evaluation services offered through the competition Web, and as such it is not making use of the DetEval tool (code offered by the authors of [2]).

Over time, it has come to our attention that slight differences exist between the results of the ICDAR 2013 evaluation protocol and those obtained by using DetEval. These are due to a number of heuristics that are not documented in the paper [2]. These include the following:

  • The DetEval tool implements two-pass matching for one-to-one matches: even if a one-to-one match is found in the beginning (according to the overlapping thresholds set), it is still considered as a possible one-to-many or many-to-one match if it overlaps with more regions. The decision as to what type of match to consider is taken at the end. This heuristic makes intuitive sense and in many cases produces results that are easier to interpret, especially for methods that consistently over- or under-segment. The ICDAR 2013 implementation considers the one-to-one matching rule first (as described in [2]) and does not consider any alternative interpretations if a one-to-one match is found.
  • The DetEval tool looks for many-to-one matches before one-to-many matches. The ICDAR 2013 implementation follows the order described in [2] and looks for one-to-many matches before many-to-one matches. This actually has minimal impact on the results.

To ensure compatibility and to assist authors who make parallel use of the DetEval framework offline, we have implemented an alternative evaluation protocol which has been tested to be consistent with the DetEval tool and takes into account all the undocumented heuristics. Any method submitted to Task 1 of Challenge 1 or 2 will be automatically evaluated using both evaluation frameworks, while results and ranking tables can be visualised for either.

Note that the final numerical results produced by either protocol are very similar, and the ranking of methods rarely changes between the two evaluation protocols.

 

What are the parameter values you are using for the text localisation evaluation algorithm?

Both ICDAR 2013 and DetEval schemes make use of the same parameter values.

Two thresholds on the area precision (tp) and area recall (tr) control the way matches between ground truth and detected rectangles are determined. We use the default values suggested in [2] for these thresholds, namely tr = 0.8 and tp = 0.4.
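
As an illustration, here is a minimal sketch (in Python, not the actual evaluation code) of the per-pair match test that these thresholds define, assuming axis-aligned rectangles given as (x_min, y_min, x_max, y_max):

    def intersection_area(g, d):
        # overlap between ground truth rectangle g and detection d
        w = min(g[2], d[2]) - max(g[0], d[0])
        h = min(g[3], d[3]) - max(g[1], d[1])
        return max(w, 0) * max(h, 0)

    def rect_area(r):
        return max(r[2] - r[0], 0) * max(r[3] - r[1], 0)

    def is_match(g, d, tr=0.8, tp=0.4):
        # a ground truth rectangle and a detection match when both the
        # area recall and the area precision thresholds are satisfied
        if rect_area(g) == 0 or rect_area(d) == 0:
            return False
        inter = intersection_area(g, d)
        area_recall = inter / rect_area(g)     # fraction of the ground truth covered
        area_precision = inter / rect_area(d)  # fraction of the detection that is useful
        return area_recall >= tr and area_precision >= tp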

For calculating the overall object Precision and Recall, the method considers all matches over the whole collection.

One-to-many and many-to-one matches can be assigned different weights, allowing the uneven penalisation of different behaviours of the method under evaluation. In this implementation, we give a lower weight to one-to-many matches than to the rest. The rationale behind this decision is that we make use of the word level of our ground truth for evaluation; therefore, although we want to penalise methods that produce multiple rectangles for a single ground truth word, we do not want to penalise methods designed to produce results at the text-line level, detecting many ground truth words of the same line with a single rectangle. Hence, for many-to-one matches we do not inflict any penalty, while for one-to-many matches we use the suggested fixed weight of 0.8. We encourage the interested reader to review [2] for further details.
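
As a rough, simplified sketch of how these weights enter the final scores (assumed for illustration, not the official code): each ground truth rectangle and each detection contributes a match value according to the type of match it participates in, and Recall and Precision are the averages of these values over the whole collection.

    ONE_TO_MANY_WEIGHT = 0.8   # a ground truth word split across several detections
    MANY_TO_ONE_WEIGHT = 1.0   # several ground truth words covered by one detection: no penalty

    MATCH_VALUE = {
        "one_to_one": 1.0,
        "one_to_many": ONE_TO_MANY_WEIGHT,
        "many_to_one": MANY_TO_ONE_WEIGHT,
        "unmatched": 0.0,
    }

    def recall(gt_match_types):
        # one entry per ground truth rectangle over the whole collection
        return sum(MATCH_VALUE[t] for t in gt_match_types) / len(gt_match_types)

    def precision(det_match_types):
        # one entry per detected rectangle over the whole collection
        return sum(MATCH_VALUE[t] for t in det_match_types) / len(det_match_types)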

 

Challenge 3

Your ground truth seems to be at the level of words, but my algorithm is made to locate whole text lines or paragraphs of text. Are you going to penalise my algorithm during evaluation?

It is very difficult, if not impossible, to decide what the right level for the ground truth should be in the case of real scenes, be it videos or static images. We have decided to create ground truth at the level of words, because they are the smallest common denominator.

The current evaluation framework we are using (based on CLEARMOT [3]) cannot be easily adapted to take into account granularity differences between the ground truth and the reported results. Therefore, your method will be penalised if results are not given at the level of words.
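
For intuition, here is a minimal sketch (simplified and assumed, not the actual evaluation code) of the MOTA score at the core of the CLEAR MOT metrics [3]: misses, false positives and identity switches over all frames are counted as errors, which is also why detections that do not match the word-level ground truth consistently over time end up counted as misses or false positives.

    def mota(frames):
        # frames: one dict per frame with the counts
        #   'misses', 'false_positives', 'id_switches', 'num_gt_objects'
        errors = sum(f["misses"] + f["false_positives"] + f["id_switches"]
                     for f in frames)
        gt_total = sum(f["num_gt_objects"] for f in frames)
        if gt_total == 0:
            return 0.0
        return 1.0 - errors / gt_total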

 

How did you create your ground truth? Did you follow a particular protocol?

You can download the protocol we followed to create the ground truth for Challenge 3 from here. Important aspects to note are that the ground truth is made at the level of words, and that the quality attribute is used to control whether a region is good enough to be taken into account during the evaluation; if it is not, it is treated as a "don't care" region during the evaluation.

 

Challenge 4

I have made a submission to this Challenge, but I cannot see the evaluation of my submitted results online

This is normal. We have to respect the ICDAR guidelines, which do not allow us to make the results of the competition public before the conference. We are therefore obliged to keep the performance of submitted algorithms on the test set secret until the conference takes place. In the meantime, your method will appear as not yet evaluated. This is normal behaviour and does not mean that there is anything wrong with your submission.

As soon as ICDAR 2015 is over, we will switch to continuous mode, as we do every time; from that point on, the evaluation of newly submitted versions of your method will be available online at the time of submission.

 

Is it right that for Challenge 4 you have changed the evaluation protocol for localisation?

This is correct. As explained in the task description for task 4.1, contrary to Challenges 1 and 2, the evaluation of the results is based on a single Intersection-over-Union criterion with a threshold of 50%, in line with standard practice in object recognition and the Pascal VOC challenge.
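
For reference, here is a minimal sketch of the Intersection-over-Union test with the 50% threshold, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) for simplicity (the actual Challenge 4 ground truth uses quadrilaterals):

    def iou(a, b):
        # intersection-over-union of two axis-aligned boxes
        iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
        ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1])
                 - inter)
        return inter / union if union > 0 else 0.0

    def is_correct(detection, ground_truth, threshold=0.5):
        # a detection is counted as correct when IoU >= 50%
        return iou(detection, ground_truth) >= threshold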

 

The new text localisation evaluation protocol does not make any provision for tackling the granularity difference between detections and the ground truth, does it?

No, it does not. Using the standard intersection-over-union metric as is does not take care of granularity differences between the ground truth and the reported results. To do so, we would have to manage one-to-many and many-to-one relationships between ground truth bounding boxes and detected bounding boxes. Therefore, if a method is tuned to detect text lines or text blocks instead of words, it will be penalised.

Tackling this granularity difference is quite important when dealing with text, and it is standard practice in the document image analysis community. Researchers approaching text localisation from an object recognition point of view, however, see tackling this granularity difference as a distraction, as the results can be more difficult to interpret. Over the past four years we have received quite a few questions about this, along with requests to make the evaluation simpler.

There is no easy solution to the problem. We want to encourage participation from the wider computer vision community, not only the document image analysis part, so we have decided to use the intersection-over-union metric for 2015. An extra reason for our decision is to make the results of tasks 4.1 and 4.4 compatible, so that the effect of the recognition stage in a typical end-to-end pipeline can be easily measured.

Our intention is to eventually provide text localisation results using both the DetEval metric (which provides mechanisms to deal with this granularity difference) and the intersection-over-union metric for Challenges 1, 2 and 4. This dual evaluation will allow performance to be compared directly between Challenges 1 and 2 (currently using DetEval) and Challenge 4 (currently using IoU).

 

You say that the ground truth is defined at the Word granularity level, but I have seen that some bounding quadrilaterals include multiple words or even multiple text lines. 

The ground truth of the “CARE” words of Challenge 4 is defined at the word level.

There are cases (e.g. if the resolution is very low and the text is unreadable, or if the script is not Latin) where we mark areas of text as "DO NOT CARE" regions. These "DO NOT CARE" regions are not taken into account during evaluation: a method that does not localise them will not be penalised, and if they are localised, the corresponding detections are simply ignored. It is only in these cases that we allow ground truth to be produced at the text line or text block level, as these are areas marked to be ignored.

These areas are marked with "###" as the transcription in the ground truth files, so that authors can take them into account during training if they want to, or ignore them (the suggested way of treating them).
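
As an example, here is a minimal sketch (assuming the plain-text ground truth format of one region per line, "x1,y1,x2,y2,x3,y3,x4,y4,transcription") of separating "CARE" regions from the "DO NOT CARE" regions marked with "###":

    def load_regions(gt_path):
        care, dont_care = [], []
        with open(gt_path, encoding="utf-8-sig") as f:   # tolerate a possible BOM
            for line in f:
                line = line.strip()
                if not line:
                    continue
                parts = line.split(",", 8)               # keep commas inside the transcription
                coords = [int(v) for v in parts[:8]]
                transcription = parts[8]
                if transcription == "###":
                    dont_care.append(coords)             # ignored during evaluation
                else:
                    care.append((coords, transcription))
        return care, dont_care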

If you have found a "CARE" region (that is, a region that does NOT have "###" as its transcription) that comprises more than a single word, please let us know so we can correct the ground truth, as this is a mistake.

 

References

1. D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez, S. Robles, J. Mas, D. Fernandez, J. Almazan and L.P. de las Heras, "ICDAR 2013 Robust Reading Competition", in Proc. 12th International Conference on Document Analysis and Recognition (ICDAR), IEEE CPS, 2013, pp. 1115-1124.

2. C. Wolf and J.M. Jolion, "Object Count / Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms", International Journal on Document Analysis and Recognition (IJDAR), vol. 8, no. 4, pp. 280-296, 2006.

3. K. Bernardin and R. Stiefelhagen, "Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics", EURASIP Journal on Image and Video Processing, 2008.
