Comments on Limits of Transformers on Compositionality
The first AI Safety Tokyo benkyoukai (study group) of 2024 covered Limits of Transformers on Compositionality, presented by Blaine Rogers. Here is my synthesis of how our group understood the paper within its larger context.
The paper under discussion frames itself as a counterpoint to the causal probing paper. In particular, its authors argue that LLMs lean more heavily toward the “memorize everything” end of the spectrum than the “true understanding” end, making it another entry in the larger stochastic parrot debate.
Personally, I believe that reading is overly broad. The team comes up with a specific operationalization of what it means for an LLM to generalize a problem out of distribution, which boils down to a particular incarnation of mathematical induction: train on instances up to some size n and check whether the model can handle size n+1. It is a genuinely elegant operationalization: mathematically precise, easy to work with, and a natural match for the intuition of “generalizing to a larger domain”.
Concretely, the research team picks three standard toy problems that each scale along a single, discrete size parameter: multiplication of n-digit numbers, Einstein’s puzzle with n conditions, and a dynamic programming problem whose complexity grows as O(n).
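To make the setup concrete, here is a minimal sketch of what a size-parameterized problem generator could look like, using the multiplication task as the example. The function name and prompt format are my own illustration, not the paper’s actual data pipeline.

```python
import random

def multiplication_problem(n_digits: int) -> tuple[str, str]:
    """Generate one n-digit-by-n-digit multiplication problem and its answer."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return f"What is {a} * {b}?", str(a * b)

# The same pattern applies to the other two tasks: an Einstein's-puzzle
# generator parameterized by the number of clues, and a DP task whose
# input length is the size parameter. In every case a single integer n
# controls how compositionally deep the instance is, which is what makes
# the train-on-n / test-on-(n+1) comparison clean.
```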
The team then spends roughly $90,000 fine-tuning GPT-3 and GPT-4 on each problem up to size n, training both with and without step-by-step solutions (think of what your math teacher wanted when they asked you to “show your work”). The punchline lands in Figure 5 on page 6: prediction accuracy stays high on problems up to size n and then drops precipitously for n+1 and larger. This can be read as the models simply memorizing the training distribution.
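A rough sketch of the kind of evaluation behind that figure: measure exact-match accuracy at each problem size and compare sizes seen during fine-tuning against larger ones. Here `model_answer` is a hypothetical stand-in for however the fine-tuned model is queried, and `make_problem` is a generator like the one sketched above; the paper’s actual harness is more involved.

```python
def accuracy_by_size(model_answer, make_problem, sizes, trials=200):
    """Exact-match accuracy per problem size.

    model_answer: prompt -> the model's final answer string (a stand-in
        for however the fine-tuned model is queried).
    make_problem: size -> (prompt, gold answer), e.g. the
        multiplication_problem sketch above.
    """
    accuracy = {}
    for n in sizes:
        correct = sum(
            model_answer(prompt).strip() == gold
            for prompt, gold in (make_problem(n) for _ in range(trials))
        )
        accuracy[n] = correct / trials
    return accuracy

# Fine-tune on sizes 1..4, say, then look at accuracy over sizes 1..6:
# the Figure 5 pattern is near-ceiling accuracy for trained sizes and a
# sharp drop at 5 and 6.
```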
My interpretation: This paper adds significant evidence that current models don’t encode some generalized induction pattern, at least not one that’s readily invoked with reasonable prompts. However, that’s the limit of what it says. The limitations are threefold: (a) it doesn’t address other forms of inductive reasoning, (b) it doesn’t give insight into grokking, and (c) the GPT-3-based models are quite weak compared to frontier ones.
That said, one really cool bonus analysis shows up in Section 3.2.3, where they present evidence about why LLMs might have trouble with deeply chained reasoning such as large multiplication problems. Figure 7 shows that a significant portion of correct answers results from internally flawed reasoning, with mistakes that just happen to “cancel each other out”.
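The underlying idea is simple enough to sketch: grade the intermediate steps in the model’s scratchpad separately from the final answer, so that “correct by luck” becomes visible as its own category. The step format and function below are my own illustration; the paper does this with full computation graphs rather than flat step lists.

```python
def grade_with_steps(model_steps, gold_steps, model_final, gold_final):
    """Classify one solution more finely than right vs. wrong.

    model_steps / gold_steps: lists of intermediate results (e.g. the
    partial products in long multiplication) parsed from the model's
    scratchpad and from a reference solution.
    """
    steps_ok = len(model_steps) == len(gold_steps) and all(
        m == g for m, g in zip(model_steps, gold_steps)
    )
    if model_final == gold_final:
        # The Figure 7 case: right answer, but the intermediate errors
        # happened to cancel out.
        return "fully correct" if steps_ok else "correct by luck"
    return "wrong"
```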
One could view consistently correct answers built on flawed reasoning as a kind of non-deductive learning distinct from memorization. At the very least, the authors have unlocked a very nice tool that offers higher resolution than simply grading answers as right vs. wrong.