grounded video QA
temporal localisation
interpretability
Extends NExT-QA by requiring models to localise the temporal evidence for their answers: a model must not only answer correctly but also show where in the video the answer is supported. Exposes hallucination and shortcut learning in video-language models.
8,911 QA pairs · 1,557 videos · 10,531 annotated temporal segments
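
To make "answer correctly and show where" concrete, here is a minimal sketch of how a grounded prediction might be scored. It assumes temporal intersection-over-union (IoU) as the overlap measure and a threshold of 0.5. Both are illustrative assumptions, not the benchmark's official evaluation protocol, and the names temporal_iou and is_grounded_correct are hypothetical. Since there are more annotated segments than QA pairs, the sketch accepts a match against any of a question's segments.

```python
from typing import List, Tuple

# A temporal segment as (start_sec, end_sec). This representation is an
# assumption for illustration, not the dataset's actual annotation format.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def is_grounded_correct(
    pred_answer: str,
    gold_answer: str,
    pred_seg: Segment,
    gold_segs: List[Segment],
    iou_threshold: float = 0.5,  # assumed threshold, chosen for illustration
) -> bool:
    """True only if the answer is right AND the predicted segment overlaps
    at least one annotated segment by at least `iou_threshold`."""
    if pred_answer != gold_answer:
        return False
    return any(temporal_iou(pred_seg, g) >= iou_threshold for g in gold_segs)
```

Under this scoring, a model that picks the right answer while pointing at an unrelated part of the video gets no credit, which is how shortcut answers would surface.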