Angela Yao — Datasets & Benchmarks

Egocentric & Assistive

EgoBlind NeurIPS 2025

egocentric video QA assistive AI accessibility

The first egocentric VideoQA dataset collected from blind and visually impaired individuals. Questions reflect genuine in-situation needs for visual assistance. State-of-the-art MLLMs reach roughly 60%, compared with 87.4% for humans.

1,392 egocentric videos · 5,311 QA pairs · avg. 3 reference answers per question

paper github

EgoIntention ICCV 2025

egocentric grounding visual intention affordance

The first dataset for egocentric visual intention grounding. Models must localise objects in first-person views based on implicit intention queries — e.g. "somewhere to sit" rather than "chair". Challenges models on affordance understanding and contextual reasoning.

Built on PACO-Ego4D · multiple intention sentences per object · context and uncommon query types

paper github
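For grounding tasks like the one above, a predicted box is often scored by its overlap with the annotated box. The sketch below is a generic intersection-over-union check, not EgoIntention's official evaluation code; the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 threshold are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred_box, gt_box, threshold=0.5):
    """Count a prediction as correct if its IoU meets the (assumed) threshold."""
    return box_iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    # Hypothetical boxes for an intention query such as "somewhere to sit".
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(is_hit((0, 0, 2, 2), (0, 0, 2, 2)))
```

The threshold is a parameter so that stricter or looser matching criteria can be compared without changing the overlap computation.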

EgoTextVQA CVPR 2025

egocentric video QA scene text text grounding

A benchmark for egocentric scene-text aware video question answering. Models must read and reason about text appearing naturally in first-person videos — signs, labels, displays — to answer questions about the wearer's environment and activities.

paper project github

Video Understanding

DeVE-QA SIGIR 2025

dense video events long video QA grounding

A benchmark for question answering on dense video events, requiring models to answer and ground questions about multiple events in long videos. Challenges MLLMs to faithfully comprehend and reason across extended time periods with multiple overlapping events.

78K questions · 26K events · 10.6K long videos

paper github

ViTXT-GQA TMM 2025

scene-text grounding text-based video QA spatio-temporal

A benchmark for scene-text grounding in text-based video question answering. Models must read scene text in videos and localise the spatio-temporal evidence for their answers. Current MLLMs achieve only 28% grounding accuracy vs. 77% for humans.

52K scene-text bounding boxes · 2.2K temporal segments · 2K questions · 729 videos

paper github

NExT-GQA CVPR 2024 highlight

grounded video QA temporal localisation interpretability

Extends NExT-QA by requiring models to localise the temporal evidence for their answers — not just answer correctly, but show where in the video. Exposes hallucination and shortcut learning in video-language models.

8,911 QA pairs · 1,557 videos · 10,531 annotated temporal segments

paper github
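Temporal grounding of the kind described above is commonly scored by comparing a predicted time span with an annotated one. The following is a minimal sketch of two overlap measures, intersection-over-union and intersection-over-prediction, assuming segments are `(start, end)` pairs in seconds. The function names are illustrative, and this is not the benchmark's released evaluation script.

```python
def _overlap(pred, gt):
    """Length of the overlap between two (start, end) segments."""
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def temporal_iou(pred, gt):
    """Overlap divided by the combined extent of both segments."""
    inter = _overlap(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_iop(pred, gt):
    """Overlap divided by the predicted segment's length.

    Unlike IoU, this does not penalise a short prediction that lies
    entirely inside the annotated segment.
    """
    inter = _overlap(pred, gt)
    length = pred[1] - pred[0]
    return inter / length if length > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical predicted and annotated segments, in seconds.
    pred, gt = (2.0, 6.0), (4.0, 8.0)
    print(temporal_iou(pred, gt), temporal_iop(pred, gt))
```

Reporting both measures separates predictions that are too long (low IoP) from those that are well placed but do not cover the full annotated span (high IoP, lower IoU).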

Assembly101-Mistakes CVIU 2025 · arXiv 2023

mistake detection procedural activity ordering errors

A new annotation layer on Assembly101 providing coarse-level action labels based on part positioning, enabling the study of ordering mistake detection in assembly procedures.

328 annotated sequences · ordering / wrong position / unnecessary action labels

paper github

Assembly101 CVPR 2022

procedural activity multi-view video hand pose mistake detection

A large-scale multi-view dataset of people assembling and disassembling 101 take-apart toy vehicles without fixed instructions. The first dataset with simultaneous static (8) and egocentric (4) recordings. Awarded the EgoVis 2022/2023 Distinguished Paper Award. Now available on Hugging Face.

4,321 videos · 513 hours · 1M+ fine-grained action segments · 18M 3D hand poses · 12 camera views

project paper download huggingface github leaderboard

NExT-QA CVPR 2021

video QA causal reasoning temporal reasoning

A video question-answering benchmark pushing models to reason about why and how events happen, rather than just describing them. Widely adopted as a standard benchmark for video-language models.

5,440 videos · 52,044 QA pairs · causal / temporal / descriptive question types

project paper github leaderboard

TASTY Videos ICCV 2019

instructional video zero-shot anticipation procedural

A collection of cooking recipe videos for zero-shot activity anticipation. Each video is paired with an ingredient list and step-wise instructions, enabling anticipation from structured procedural knowledge for activities with no training examples.

2,511 recipe videos · ingredient lists · step-wise annotations

project & download paper

3D Vision & Reconstruction

CoarseLiDAR-GS Dataset ICCV 2025 highlight

3D reconstruction LiDAR + RGB Gaussian splatting SLAM

Real-world indoor and outdoor scenes captured with a custom multi-modal SLAM rig (64-channel LiDAR, 4 wide-angle RGB cameras, IMU) for benchmarking 3D Gaussian Splatting from coarsely posed images and noisy LiDAR point clouds, without SfM initialisation.

Custom device: Ouster OS1-64 LiDAR · 4× Decxin AR0234 cameras · 360° coverage

paper github

HandSynthesis CVPR 2025

synthetic hand data 3D hand pose domain gap

A synthetic hand image dataset and synthesis pipeline for studying the synthetic-to-real domain gap in 3D hand pose estimation. Identifies the key components of the gap (forearm, image frequency, hand pose, object occlusions) and demonstrates that synthetic data can match real-data performance when these are addressed.

paper github