When Transformers Can (or Can’t) Generalize Compositionally? A Data-Distribution Perspective

Aug 20, 2025
Yao Tong
,
Jiayuan Ye
,
Anastasia Borovykh
,
Reza Shokri
Abstract
Someone who learns to find shortest paths in New York can, upon receiving a map of Paris, immediately apply the same rule to navigate there, despite never having practiced in Paris. This ability to apply known rules to novel input combinations exemplifies compositional generalization (CG), a hallmark of human cognition. While transformer models have shown both successes and failures on CG tests, our understanding of when CG can (or can’t) occur, particularly from the perspective of the training-data distribution, remains limited. In this work, we investigate how training-data distributional properties (e.g., coverage, diversity, and sampling biases) jointly shape CG, via quantitative analysis on controlled map-navigation tasks.
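To make the setup concrete, here is a minimal sketch (in Python, with hypothetical names; not the paper’s actual pipeline) of the kind of controlled map-navigation task described above: shortest-path problems are generated on a grid, and a `coverage` parameter controls what fraction of (start, goal) combinations the training split sees, so that held-out pairs test generalization to novel input combinations.

```python
import random
from collections import deque

def bfs_shortest_path(grid_size, start, goal):
    """Breadth-first search on a grid_size x grid_size lattice;
    returns one shortest path from start to goal as a list of cells."""
    prev = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            break
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < grid_size and 0 <= ny < grid_size and (nx, ny) not in prev:
                prev[(nx, ny)] = (x, y)
                queue.append((nx, ny))
    # Walk predecessor links back from the goal to recover the path.
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def make_split(grid_size=8, coverage=0.5, seed=0):
    """Hypothetical coverage knob: the training split sees only a
    `coverage` fraction of all (start, goal) pairs; the rest are
    held out as novel input combinations for testing CG."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    pairs = [(s, g) for s in cells for g in cells if s != g]
    rng.shuffle(pairs)
    cut = int(coverage * len(pairs))
    return pairs[:cut], pairs[cut:]

if __name__ == "__main__":
    train, test = make_split(coverage=0.3)
    s, g = test[0]  # a (start, goal) combination never seen in training
    print(bfs_shortest_path(8, s, g))
```

Diversity and sampling biases could be varied analogously, e.g., by skewing the random choice of pairs rather than sampling them uniformly.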
Publication
Accepted to the NeurIPS 2025 Workshop on What Can(’t) Transformers Do?

This work is driven by the results of my previous paper, accepted to the NeurIPS 2025 Workshop on What Can(’t) Transformers Do?