When Transformers Can (or Can’t) Generalize Compositionally? A Data-Distribution Perspective
Aug 20, 2025 · 1 min read

Yao Tong
Jiayuan Ye
Anastasia Borovykh
Reza Shokri
Abstract
Someone who learns to find shortest walking paths in New York can, upon receiving a map of Paris, immediately apply the same rule to navigate, despite never having practiced in Paris. This ability to apply known rules to novel input combinations exemplifies compositional generalization (CG), a hallmark of human cognition. While transformer models have shown both successes and failures on CG tests, our understanding of when CG can (or can’t) occur, particularly from the perspective of the training-data distribution, remains limited. In this work, we investigate how training-data distributional properties (e.g., coverage, diversity, and sampling biases) jointly shape CG via quantitative analysis on controlled map-navigation tasks.
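For intuition, here is a minimal, self-contained sketch of how such a controlled map-navigation task might be constructed. This is an illustrative assumption, not the paper’s actual data pipeline: the graph shapes, the function names (`grid_graph`, `shortest_path`, `sample_examples`), and the train/test split are all hypothetical. Shortest-path examples are generated on one “training” map, and CG would be probed by evaluating the same rule on an unseen map.

```python
# Hypothetical illustration of a controlled map-navigation CG task.
# Not the paper's setup; graphs, splits, and names are assumptions.
from collections import deque
import random


def grid_graph(width, height):
    """4-connected grid: nodes are (x, y), edges link orthogonal neighbors."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    adj = {n: [] for n in nodes}
    for (x, y) in nodes:
        for nb in ((x + 1, y), (x, y + 1)):
            if nb in adj:
                adj[(x, y)].append(nb)
                adj[nb].append((x, y))
    return adj


def shortest_path(adj, start, goal):
    """BFS shortest path: the underlying 'rule' a model would have to learn."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nb in adj[node]:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return None


def sample_examples(adj, k, rng):
    """Draw (start, goal, path) triples from a map.

    Which node pairs the sampler is allowed to draw is where coverage,
    diversity, and sampling biases of the training data would be controlled.
    """
    nodes = list(adj)
    return [
        (s, g, shortest_path(adj, s, g))
        for s, g in (rng.sample(nodes, 2) for _ in range(k))
    ]


rng = random.Random(0)
train_map = grid_graph(5, 5)   # "New York": seen during training
test_map = grid_graph(7, 3)    # "Paris": novel map, same underlying rule
train = sample_examples(train_map, 1000, rng)
test = sample_examples(test_map, 100, rng)
print(train[0], test[0], sep="\n")
```

In a setup like this, varying which (start, goal) pairs the training sampler covers, and how evenly it covers them, gives a direct handle on the coverage, diversity, and sampling-bias axes that the abstract refers to.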
Publication
Accepted to the NeurIPS 2025 Workshop on What Can(’t) Transformers Do?
This work is driven by the results of my previous paper, also accepted to the same workshop.