Tasks - Document UnderstanDing of Everything 😎
Dataset and annotations
Dataset. The dataset will consist of 5K PDF files, each annotated with questions of different types, including extractive, abstractive, list, and non-answerable. It is split into a training/validation set (“trainval”) and a test set (“test”). The “trainval” set consists of 18.7K question-answer pairs related to 3.7K multi-page documents. The “test” set, scheduled to be published a few weeks before the submission deadline, consists of 1.3K documents. Additionally, we will provide a sample training/validation set ("sample") before the start of the competition to ensure everyone can properly set up and test their systems for competing on DUDE.
Annotations. The annotation format is similar to the popular DocVQA format, with some new additions.
Please refer to the examples below, including the DUDE_example.pdf file (displayed as a grid of pages) and others in Downloads/demo_annotations.json.
Task 1: DUDE
Document Understanding comprises a large set of skills, including holistically consuming textual and visual elements structured according to rich layouts. While the "understanding" part is often described loosely, in DUDE, we specify it as the capability to reason with compositional information extracted from a visually-rich document (VRD).
DUDE is formulated as an instance of Document Question Answering (DocQA) to evaluate how well current solutions deal with multi-page documents, whether they can navigate and reason over the layout, and whether they can generalize these skills to different document types and domains. Since we cannot provide question-answer pairs about, e.g., ticked checkboxes, for each document instance or document type, the challenge presented by DUDE can equally be characterized as a Multi-Domain Long-Tailed Recognition problem [1].
When presented with a question (of one of the types extractive, abstractive, list, or non-answerable) and an input PDF document, we expect a participant system to provide the following:
"answers": ["Moscow Sheremet, Russia - Terminal E - International"],
"answer_confidence": [0.9298765], a list with an answer confidence score (1 value), ideally encoded as a 64-bit float between 0 and 1.
"answer_abstain": False, a boolean value for flagging documents from an unseen domain, only relevant during evaluation phase 2.
To make a fair comparison when employing OCR-based pipelines, we also provide starter OCR files for all PDFs in the dataset under Downloads/demo_OCR.json.
If you use a different OCR system, please report this together with the submission.
The first evaluation phase assumes only i.i.d. data (the same mixture of document types and question-answer types) across the train, validation, and test splits.
To score all possible answer types, the evaluation metric will be a modified Average Normalized Levenshtein Similarity (ANLS) metric. For lists, the metric is made invariant to the order of provided answers.
Whenever fewer values are returned than expected, the shorter list will be padded with empty strings. This modified ANLS metric can be seen as a fuzzy variant of the F1 metric, in contrast to the original ANLS (a fuzzy accuracy), and will therefore be referred to as F1ANLS. For non-answerable questions, the score is 0 if the method provides any answer.
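The per-answer similarity and the order-invariant list handling described above can be sketched as follows. This is a minimal illustration assuming the common ANLS threshold of 0.5 and a brute-force matching for small lists; the official implementation in the DUDEeval repository is authoritative.

```python
from itertools import permutations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def nls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity, thresholded as in ANLS."""
    if not pred and not gold:
        return 1.0
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= tau else 0.0

def f1_anls(preds: list, golds: list) -> float:
    """Order-invariant list score: pad the shorter list with empty strings,
    then take the best one-to-one matching (brute force, small lists only)."""
    n = max(len(preds), len(golds))
    preds = preds + [""] * (n - len(preds))
    golds = golds + [""] * (n - len(golds))
    return max(
        sum(nls(p, g) for p, g in zip(perm, golds)) / n
        for perm in permutations(preds)
    )
```

Note how padding with empty strings makes answering a non-answerable question score 0, since any non-empty prediction has zero similarity to the empty string.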
To score answers_confidence, the evaluation metric will be the Expected Calibration Error (ECE) [3,4]. An alternative that deals with the absence of exact matches is to report ECE at different edit-distance thresholds, following [5]. For consistency in calibration evaluation, non-answerable questions and list answers both require a single answers_confidence value (regardless of the number of answers).
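As an illustration, a standard equal-width binned ECE (following Guo et al.) can be computed as below; for DUDE, the binary "correct" labels would come from the thresholded ANLS match rather than exact-match accuracy. This is a hedged sketch, not the official scorer.

```python
def ece(confidences, correct, n_bins: int = 10) -> float:
    """Expected Calibration Error with equal-width confidence bins:
    the coverage-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each confidence falls in exactly one bin; 0.0 goes in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total
```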
In the second evaluation phase, we will open submission to a second holdout set, which contains a mix of documents from seen and unseen domains.
Here, we expect a DUDE competition system to detect questions from the unseen domains, for which it should either lower its confidence (answers_confidence) or abstain from giving an answer (answers_abstain).
To score how gracefully a system deals with unseen-domain data, the evaluation metric will be the area under the risk-coverage curve (AURC), following work on selective question answering under domain shift [6].
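A minimal sketch of AURC, assuming per-question risk is defined as 1 minus the question's score: cover predictions from most to least confident and average the selective risk over all coverage levels (lower is better). The official DUDEeval scripts remain the reference implementation.

```python
def aurc(confidences, risks) -> float:
    """Area under the risk-coverage curve: average selective risk over all
    coverage levels, admitting the most confident predictions first."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    total_risk, area = 0.0, 0.0
    for k, i in enumerate(order, 1):
        total_risk += risks[i]
        area += total_risk / k  # selective risk at coverage k / n
    return area / len(order)
```

This rewards systems that assign low confidence to high-risk (unseen-domain) questions, since their errors are then pushed to the high-coverage end of the curve.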
Standalone evaluation scripts are provided with metrics implementations: https://github.com/Jordy-VL/DUDEeval
Competing for subtasks
Additionally, during training we provide the answer_type to allow competitors to train and submit results for a specific answer_type only.
For example, the subtask of extractive questions allows encoder-only architectures to compete in DUDE; this will be represented in a separate results column.
On the other hand, abstractive questions might require some generative decoder architecture to properly answer questions.
Another important subtask is building a calibrated document QA system that does not hallucinate on non-answerable questions and that lowers answers_confidence when unsure about its answers.
1. Is it allowed to use external data to fine-tune or pre-train our models in the competition? For example, publicly available QA datasets or other internally labeled datasets?
2. Are pre-trained models permissible for use in the competition?
3. Can we use our own OCR service besides the one provided by the organizers?
Regarding 1, 2, and 3: You can use anything you want: pretrained models, external data, or your own OCR service.
You only need to provide these details in the method description upon submission.
4. Is it required that participants submit their code together with the submission?
Regarding 4: You do not have to upload the code, only the results output by your methods.
Open-source code is always appreciated and you can link your GitHub repository in the details of the submission.
But it is not mandatory, since some submissions will come from companies that want to benchmark their proprietary models.
5. Is there a real-time leaderboard that updates upon submission to see the rankings during the competition?
Regarding 5: Until the end of the competition [1 April 2023], the results will not be public, to prevent participants from tuning hyperparameters on the test set.
Instead, we would like to encourage you to propose novel contributions that make a significant impact on the task.
You can still search for the best hyperparameters on the validation set, but you will need to find a good trade-off that is not overfit to this split.
To help DUDE participants, we have provided standalone evaluation scripts so that they can know how well they are doing on the validation split.
[1] Yang, Y., Wang, H. and Katabi, D., 2022. On Multi-Domain Long-Tailed Recognition, Generalization and Beyond. European Conference on Computer Vision (ECCV) 2022.
[2] Jaeger, P.F., Lüth, C.T., Klein, L. and Bungert, T.J., 2022. A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification. arXiv preprint arXiv:2211.15259.
[3] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017. On Calibration of Modern Neural Networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
[4] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[5] Slossberg, R., Anschel, O., Markovitz, A., Litman, R., Aberdam, A., Tsiper, S., Mazor, S., Wu, J. and Manmatha, R., 2020. On Calibration of Scene-Text Recognition Models. arXiv preprint arXiv:2012.12643.
[6] Kamath, A., Jia, R. and Liang, P., 2020. Selective Question Answering under Domain Shift. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5684-5696).
Registration open: Christmas 2022
Competition Q&A period: 20 December - 10 January
Sample dataset available: 6 January 2023
Training/validation dataset available: 30 January 2023
Test set submissions open (Task 1 & evaluation phase 2): 9 March 2023
General submission deadline: 20 April 2023
Method description submissions deadline: 20 April 2023
Notification to authors: 1 May 2023
All dates are 23:59 AoE and subject to change.
Note on the registration for the DUDE challenge:
There is no need to register explicitly for the DUDE challenge. As long as you are registered on the RRC portal, you will be able to submit your results once submissions are open.
Any questions, contact the DUDEs at firstname.lastname@example.org