Tasks - Comics Understanding

The challenge comprises different tasks. In ICDAR 2025, we will host Task 1.

On one hand, the ICDAR 2025 competition task is called ``Pick a Panel``, an extension of the cloze tasks from the 2017 COMICS dataset: the output is still a classification, but the input is composed of two sets: the context, a sequence of panels (or a description), and the options, a set of candidate panels. As our objective is to distribute the dataset directly, without forcing participants to use our framework to obtain images from datasets we do not have the rights to share (e.g. Manga109, PopManga, eBDtheque), in this task the images are panels cropped from copyright-free comics.

On the other hand, ``Multi-task single-page Comics Understanding`` is an ensemble of tasks applied to single pages of multi-style comics (French and American comics, as well as manga). The tasks are: Object detection, Speaker identification, Character re-identification, Character naming, and Dialog generation. All the material for these tasks (the validation set and evaluation code, as well as the code to obtain the private dataset's images) is temporarily provided in the CoMix repository.

Task 1: Pick a Panel (ComPAP)

This task is composed of three skill groups: Sequence Filling, Closure, and Caption Relevance.

  1. Sequence Filling
    (Figure: sequence_filling.png)

    This skill is presented as choosing the right panel from a set of options. Three elements are provided to the model: the context, the index, and the options. The context is an ordered sequence of panels. The index indicates the position at which the missing panel must be inserted so that the sequence is complete. Finally, a set of option panels is also provided, whose size varies from sample to sample.
    Note that i=0 means that the context is already a contiguous sequence and the missing panel belongs in the first position, shifting all the other panels one place to the right; likewise, i=(N-1) means that the context is a contiguous sequence and the missing panel is appended at the end of the sequence (see the sketch after this list).

  2. Closure
    This is the well-known cloze (closure) task initially proposed in COMICS, with the text-cloze task reframed for (i) panel-options input and (ii) text-options input. In this case, the context length may vary, as may the number of options. The panel to fill is always the "next panel of the input sequence", so these skills can be seen as a restricted variant of the previous skill. However, here the text is rendered directly in the panel balloons, so the model is not biased to choose based on aesthetics alone. As in the previous skill, the inputs are the context and the options, and the index is always N-1 (or len(context)-1).

    1. Character coherence

      (Figure: char_coherence.png)
    2. Visual closure

      (Figure: visual_closure.png)
    3. Text closure

      (Figure: text_closure.png)
  3. Caption relevance
    (Figure: caption_relevance.png)
    This skill builds on the previous ones but adds a further layer of complexity: the context is given solely as a detailed textual description of the previous panel. For this skill, the sample has an empty context list, but it has a previous_panel_caption key containing the caption of the previous panel. The goal is to choose the correct panel as a continuation of that caption.
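
As a purely illustrative sketch (not part of the official code), the insertion semantics described for Sequence Filling can be written as follows, using the field names of the Ground Truth format below; the helper name is hypothetical, while the example values come from the ground-truth sample shown later.

```python
# Illustrative helper (hypothetical name, not part of the official toolkit):
# insert the chosen panel id at position `index` within the ordered context.
def reconstruct_sequence(context_ids, index, chosen_id):
    sequence = list(context_ids)
    sequence.insert(index, chosen_id)  # index = 0 prepends; index = len(context_ids) appends
    return sequence

# With the ground-truth example below: index = 0 places the correct panel first.
print(reconstruct_sequence([178, 561, 3673, 288, 32], 0, 10901))
# -> [10901, 178, 561, 3673, 288, 32]
```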

Ground Truth Format

Independently of the skill, the dataset is provided as the following JSON object:

```json

{
    "dataset_name": "comics_task1",  // The name of the dataset, always "comics_task1"
    "dataset_split": "val",          // The subset (either "val" or "test")
    "dataset_version": "0.1",        // The version of the dataset, a string in major.minor format
    "data": [{...}]                  // The list of samples (see below)
}

```

The "data" element is a list of dictionary entries with the following structure:

```json
{
    "data_split": "val",            // The dataset split this sample belongs to
    "skill": "seq_fill",            // The skill, one of ["seq_fill", "clos_1", "clos_2", "clos_3", "cap_rel"]
    "sample_id": 0,                 // A unique ID number for the question
    "context_ids": [178, 561, 3673, 288, 32], // The context panel indexes; an empty list if "skill"="cap_rel"
    "previous_panel_caption": "",   // The previous panel caption, e.g. "This is a panel from a comic book. The setting ..."; an empty string if "skill"!="cap_rel"
    "index": 0,                     // The position of the missing panel; -1 if "skill"="cap_rel"
    "options_ids": [454, 10901, 1282, 3355], // The indexes of the option panels (one is correct, the others are distractors)
    "correct_id": 10901             // The index of the correct option. Note: this is the global panel index, not a position within "options_ids"
}

```
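
For reference, a minimal Python sketch for loading the ground-truth file and grouping its samples by skill might look as follows; the file name "comics_task1_val.json" is an assumption and depends on the actual release.

```python
import json
from collections import defaultdict

# Assumed file name: adapt it to the actual release.
with open("comics_task1_val.json", "r", encoding="utf-8") as f:
    gt = json.load(f)

print(gt["dataset_name"], gt["dataset_split"], gt["dataset_version"])

# Group the samples by skill so that each skill can be handled separately.
samples_by_skill = defaultdict(list)
for sample in gt["data"]:
    samples_by_skill[sample["skill"]].append(sample)

for skill, samples in samples_by_skill.items():
    print(f"{skill}: {len(samples)} samples")
```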

Submission Format

Results are expected to be submitted as a single JSON file (extension .json) containing a list of dictionaries with four keys: "sample_id", "skill", "prediction", and "full_prediction". The "sample_id" key is the unique id of the test-set sample, while the "prediction" key should correspond to the model's output: the index of the correct solution. In the case of MLLMs, the model typically first produces a text answer, from which the prediction has to be extracted. In this case, we also expect "full_prediction", containing the full MLLM output; we will (optionally) parse it with regexes to extract the chosen index and check whether participants extracted it correctly. Lastly, every sample corresponds to exactly one skill, so the "skill" field is used to report performance on each skill separately.

As an example, the result file might be named "result_test.json" and contain a list similar to:
```json
[
    { "sample_id" : 0, "prediction" : 1, "skill": "sequence_filling", "full_prediction" : "The first panel seems to not fit the .... Thus the correct panel option is the fourth."},
    { "sample_id" : 1, "prediction" : 3, "skill": "text_closure", "full_prediction" : "..."},
    { "sample_id" : 2, "prediction" : 0, "skill": "char_coherence", "full_prediction" : "..."},
    ...,
]
```
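
A minimal sketch of how such a file could be produced is shown below; the file names and the predict stub are placeholders, and the interpretation of "prediction" as the position of the chosen option within options_ids is an assumption to be checked against the challenge definition.

```python
import json

# Assumed test-set file name: adapt it to the actual release.
with open("comics_task1_test.json", "r", encoding="utf-8") as f:
    test_samples = json.load(f)["data"]

def predict(sample):
    """Placeholder model: return (chosen option position, raw text output)."""
    return 0, "..."  # always picks the first option; replace with real inference

results = []
for sample in test_samples:
    position, full_text = predict(sample)
    results.append({
        "sample_id": sample["sample_id"],
        "skill": sample["skill"],
        "prediction": position,        # assumed: position of the chosen option in options_ids
        "full_prediction": full_text,  # raw MLLM output, used only for optional regex checks
    })

with open("result_test.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=4)
```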

Evaluation Metrics

Classification Metric

In the ICDAR 2025 edition, the skills are treated as n-way classification tasks, so the index of the correct option is the only expected answer. However, as we expect this challenge to be a starting point for assessing whether MLLMs can understand intricate sequences of comic panels, the index of the correct answer can also be extracted from a text output with specific regexes; if you need support to do so, please contact the organizers at the address below. In any case, we expect the "prediction" key to be filled and will compute accuracy on that key only. Methods are ranked according to their accuracy on the five skills, and the global rank is obtained by averaging the per-skill scores, weighted by the number of samples per skill.
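
As a rough, unofficial sketch of this ranking scheme, per-skill accuracy and the sample-weighted global score could be computed as follows; it assumes predictions are given as positions within options_ids (as in the submission sketch above), so the ground-truth correct_id is first mapped to its position.

```python
from collections import defaultdict

def score(gt_samples, predictions):
    """gt_samples: sample_id -> ground-truth entry; predictions: sample_id -> predicted option position."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample_id, sample in gt_samples.items():
        skill = sample["skill"]
        total[skill] += 1
        gold = sample["options_ids"].index(sample["correct_id"])  # position of the correct option
        if predictions.get(sample_id) == gold:
            correct[skill] += 1
    per_skill = {s: correct[s] / total[s] for s in total}
    n = sum(total.values())
    global_score = sum(per_skill[s] * total[s] for s in total) / n
    return per_skill, global_score
```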

Contacts

If you have any questions, please contact the organizers.

Challenge News

Important Dates

ICDAR 2025 Edition

17-21/09/2025: Results presentation

30/04/2025: Camera-ready of competition report

20/04/2025: Initial submission of competition report

15/04/2025: Deadline for Competition submissions

25/02/2025: Benchmark model and Dev sets v0.1

19/02/2025: Test/Val sets v0.1 available

10/01/2025: Tasks have been updated for the ICDAR 2025 COMICS competition at https://www.icdar2025.com/program/competitions


CoMix Benchmark

November 2024: The dataset & evaluation repository CoMix (https://github.com/emanuelevivoli/CoMix) has been released