When Transformers Can (or Can’t) Generalize Compositionally? A Data-Distribution Perspective

Aug 20, 2025
Yao Tong
,
Jiayuan Ye
,
Anastasia Borovykh
,
Reza Shokri
Abstract
Someone who learns to find shortest paths in New York can, upon receiving a map of Paris, immediately apply the same rule to navigate there, despite never having practiced in Paris. This ability to apply known rules to novel input combinations exemplifies compositional generalization (CG), a hallmark of human cognition. While transformer models have shown both successes and failures on CG tests, our understanding of when CG can (or can’t) occur, particularly from the perspective of the training-data distribution, remains limited. In this work, we investigate how training-data distributional properties (e.g., coverage, diversity, and sampling biases) jointly shape CG, via quantitative analysis on controlled map-navigation tasks.
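To make the setup concrete, here is a minimal sketch (in Python, with hypothetical names; not the paper’s actual pipeline) of the kind of controlled map-navigation task described above: shortest-path problems are generated on a grid, and a `coverage` parameter controls what fraction of (start, goal) combinations the training split sees, so that held-out pairs test generalization to novel input combinations.

```python
import random
from collections import deque

def bfs_shortest_path(grid_size, start, goal):
    """Breadth-first search on a grid_size x grid_size lattice;
    returns one shortest path from start to goal as a list of cells."""
    prev = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            break
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < grid_size and 0 <= ny < grid_size and (nx, ny) not in prev:
                prev[(nx, ny)] = (x, y)
                queue.append((nx, ny))
    # Walk predecessor links back from the goal to recover the path.
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def make_split(grid_size=8, coverage=0.5, seed=0):
    """Hypothetical coverage knob: the training split sees only a
    `coverage` fraction of all (start, goal) pairs; the rest are
    held out as novel input combinations for testing CG."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    pairs = [(s, g) for s in cells for g in cells if s != g]
    rng.shuffle(pairs)
    cut = int(coverage * len(pairs))
    return pairs[:cut], pairs[cut:]

if __name__ == "__main__":
    train, test = make_split(coverage=0.3)
    s, g = test[0]  # a (start, goal) combination never seen in training
    print(bfs_shortest_path(8, s, g))
```

Diversity and sampling biases could be varied analogously, e.g., by skewing the random choice of pairs rather than sampling them uniformly.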
Publication
Accepted to the NeurIPS 2025 Workshop on What Can(’t) Transformers Do?

This work is driven by the results of my previous paper, accepted to the NeurIPS 2025 Workshop on What Can(’t) Transformers Do?