
Datasets & Benchmarks

Resources released by the CVML group

Egocentric & Assistive

EgoBlind NeurIPS 2025
egocentric video QA assistive AI accessibility
The first egocentric VideoQA dataset collected from blind and visually impaired individuals. Questions reflect genuine in-situation needs for visual assistance. State-of-the-art MLLMs reach about 60%, compared with 87.4% for humans.
1,392 egocentric videos  ·  5,311 QA pairs  ·  avg. 3 reference answers per question
EgoIntention ICCV 2025
egocentric grounding visual intention affordance
The first dataset for egocentric visual intention grounding. Models must localise objects in first-person views based on implicit intention queries, e.g. "somewhere to sit" rather than "chair". It challenges models on affordance understanding and contextual reasoning.
Built on PACO-Ego4D  ·  multiple intention sentences per object  ·  context and uncommon query types
EgoTextVQA CVPR 2025
egocentric video QA scene text text grounding
A benchmark for egocentric scene-text aware video question answering. Models must read and reason about text appearing naturally in first-person videos — signs, labels, displays — to answer questions about the wearer's environment and activities.

Video Understanding

DeVE-QA SIGIR 2025
dense video events long video QA grounding
A benchmark for question-answering on dense video events, requiring models to answer and ground questions about multiple events in long videos. Challenges MLLMs to faithfully comprehend and reason across extended time periods with multiple overlapping events.
78K questions  ·  26K events  ·  10.6K long videos
ViTXT-GQA TMM 2025
scene-text grounding text-based video QA spatio-temporal
A benchmark for scene-text grounding in text-based video question answering. Models must read scene text in videos and localise the spatio-temporal evidence for their answers. Current MLLMs achieve only 28% grounding accuracy vs. 77% for humans.
52K scene-text bounding boxes  ·  2.2K temporal segments  ·  2K questions  ·  729 videos
NExT-GQA CVPR 2024 highlight
grounded video QA temporal localisation interpretability
Extends NExT-QA by requiring models to localise the temporal evidence for their answers — not just answer correctly, but show where in the video. Exposes hallucination and shortcut learning in video-language models.
8,911 QA pairs  ·  1,557 videos  ·  10,531 annotated temporal segments
Assembly101-Mistakes CVIU 2025 · arXiv 2023
mistake detection procedural activity ordering errors
A new annotation layer on Assembly101 providing coarse-level action labels based on part positioning, enabling the study of ordering mistake detection in assembly procedures.
328 annotated sequences  ·  ordering / wrong position / unnecessary action labels
Assembly101 CVPR 2022
procedural activity multi-view video hand pose mistake detection
A large-scale multi-view dataset of people assembling and disassembling 101 take-apart toy vehicles without fixed instructions. The first dataset with simultaneous static (8) and egocentric (4) recordings. Awarded the EgoVis 2022/2023 Distinguished Paper Award. Now available on Hugging Face.
4,321 videos  ·  513 hours  ·  1M+ fine-grained action segments  ·  18M 3D hand poses  ·  12 camera views
NExT-QA CVPR 2021
video QA causal reasoning temporal reasoning
A video question-answering benchmark pushing models to reason about why and how events happen, rather than just describing them. Widely adopted as a standard benchmark for video-language models.
5,440 videos  ·  52,044 QA pairs  ·  causal / temporal / descriptive question types
TASTY Videos ICCV 2019
instructional video zero-shot anticipation procedural
A collection of cooking recipe videos for zero-shot activity anticipation. Each video is paired with an ingredient list and step-wise instructions, enabling anticipation from structured procedural knowledge for activities with no training examples.
2,511 recipe videos  ·  ingredient lists  ·  step-wise annotations

3D Vision & Reconstruction

CoarseLiDAR-GS Dataset ICCV 2025 highlight
3D reconstruction LiDAR + RGB Gaussian splatting SLAM
Real-world indoor and outdoor scenes captured with a custom multi-modal SLAM rig (64-channel LiDAR, 4 wide-angle RGB cameras, IMU) for benchmarking 3D Gaussian Splatting from coarsely posed images and noisy LiDAR point clouds, without SfM initialisation.
Custom device: Ouster OS1-64 LiDAR  ·  4× Decxin AR0234 cameras  ·  360° coverage
HandSynthesis CVPR 2025
synthetic hand data 3D hand pose domain gap
A synthetic hand image dataset and synthesis pipeline for studying the synthetic-to-real domain gap in 3D hand pose estimation. It identifies the key components of the gap (forearm, image frequency, hand pose, object occlusions) and demonstrates that synthetic data can match real-data performance once these are addressed.