Comments on Evaluating the World Model Implicit in a Generative Model
Vafa et al. released a paper earlier this year offering a particularly lucid operationalization of the question, “Do LLMs embed a world model?” On the one hand, we have an LLM that literally generates interactive 3D worlds, dropped by DeepMind earlier this week. On the other hand, LLMs are notorious for wildly unrealistic behavior on out-of-distribution inputs.
The central insight is that something called the Myhill-Nerode theorem lets us reconstruct a world model simply by observing token sequences. Slightly more precisely, it gives sharp conditions under which, just from output behavior, we know that a given computational system, such as an LLM, must be equivalent to a canonical description. In other words, in this special case, no matter how inscrutable the matrices may be, as long as the conditions are met, the matrices must be isomorphic to a standard, human-style solution.
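To make this concrete, here is a minimal Python sketch of the Myhill-Nerode idea on a toy automaton of my own invention (the parity-of-“b”s language, nothing from the paper): two prefixes lead to the same state exactly when no continuation separates them.

```python
from itertools import product

# Toy DFA over the tokens "a" and "b": it accepts exactly the sequences that
# contain an even number of "b"s.  States and transitions are hypothetical,
# purely for illustration.
TRANSITIONS = {("even", "a"): "even", ("even", "b"): "odd",
               ("odd", "a"): "odd",  ("odd", "b"): "even"}
START, ACCEPTING = "even", {"even"}

def accepts(seq):
    """Run the DFA over a token sequence and report acceptance."""
    state = START
    for tok in seq:
        state = TRANSITIONS[(state, tok)]
    return state in ACCEPTING

def distinguishable(p1, p2, max_len=4):
    """Truncated Myhill-Nerode test: is there a continuation accepted after
    one prefix but not the other?  If not, the prefixes share a state."""
    for n in range(max_len + 1):
        for c in product("ab", repeat=n):
            if accepts(p1 + "".join(c)) != accepts(p2 + "".join(c)):
                return True
    return False

print(distinguishable("ab", "ba"))  # False: no continuation separates them
print(distinguishable("a", "b"))    # True: the empty continuation already does
```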
The details are absolutely fascinating, and I highly encourage you to read the original paper. Here I would like to poke at the edges of the argument and its particular implementation by Vafa et al.
Not only is the application of the theorem inspired, but because it applies to deterministic finite automata, it nicely restricts the domain of “world models” to a class of (computational) outputs that is particularly tractable; cf. DFAs and the Chomsky hierarchy.
And the golden goose egg is in section 2.6, where the authors define their evaluation metrics, boundary precision and boundary recall. It takes a bit of work to unwind the math, but the two metrics boil down to the observation that an LLM can be wrong about world states in precisely two ways: 1) it can be missing information about a state; or 2) it can encode counterfactual information about a state.
After unwinding the definitions, it turns out that boundary recall measures the former and boundary precision the latter.
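To pin down my reading of those metrics, here is a deliberately crude sketch of the bookkeeping. The predicates `true_valid` and `model_valid` and the fixed pool of candidate suffixes are stand-ins of mine, not the paper’s construction; the real definitions live in section 2.6.

```python
def boundary_metrics(prefix_a, prefix_b, suffixes, true_valid, model_valid):
    """Boundary precision/recall for one pair of prefixes (my crude reading).

    true_valid(p, s) / model_valid(p, s): stand-in predicates for whether
    suffix s is a legal continuation of prefix p under the true world model
    and under the LLM, respectively."""
    # A suffix sits on the boundary between the two prefixes' states when it
    # is a valid continuation of exactly one of them.
    true_boundary  = {s for s in suffixes
                      if true_valid(prefix_a, s) != true_valid(prefix_b, s)}
    model_boundary = {s for s in suffixes
                      if model_valid(prefix_a, s) != model_valid(prefix_b, s)}
    overlap = true_boundary & model_boundary

    # Recall: how much of the true boundary the model reproduces.  Missing
    # boundary suffixes correspond to missing information about the states.
    recall = len(overlap) / len(true_boundary) if true_boundary else 1.0
    # Precision: how much of the model's boundary is real.  Spurious boundary
    # suffixes correspond to counterfactual information about the states.
    precision = len(overlap) / len(model_boundary) if model_boundary else 1.0
    return precision, recall
```

The 1.0 fall-backs are only there to keep the toy function total; the paper treats the edge cases more carefully than I do here.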
The magic of Myhill-Nerode, however, is that it lets us completely ignore the content of a world state, which is mostly unobservable in LLMs right now, and focus purely on how one state can or cannot transition to another, i.e., their connectivity. Specifically, that connectivity is encoded in a DFA’s transition graph or, for an LLM, in its set of all possible token sequences.
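Here is what that buys you in miniature, reusing the parity toy language from the earlier sketch (and truncating continuations at a small length, which is my own simplification): the connectivity falls out purely from which continuations remain valid after which prefixes.

```python
from itertools import product

# The even-number-of-"b"s toy language again: a token sequence is "valid" iff
# it contains an even count of "b"s.  This stands in for an LLM's set of all
# possible token sequences.
def valid(seq):
    return seq.count("b") % 2 == 0

ALPHABET = "ab"
MAX_SUFFIX = 3  # truncation: in practice only finitely many continuations get checked

def signature(prefix):
    """A prefix's observable behaviour: which short continuations remain valid.
    Prefixes with identical signatures are treated as the same state."""
    return frozenset(
        c for n in range(MAX_SUFFIX + 1) for c in product(ALPHABET, repeat=n)
        if valid(prefix + "".join(c))
    )

def transition_graph(max_prefix_len=3):
    """Reconstruct connectivity, state --token--> state, from behaviour alone."""
    prefixes = ["".join(p) for n in range(max_prefix_len + 1)
                for p in product(ALPHABET, repeat=n)]
    states = {}    # signature -> state id
    edges = set()  # (source state, token, target state)
    for p in prefixes:
        src = states.setdefault(signature(p), len(states))
        for tok in ALPHABET:
            dst = states.setdefault(signature(p + tok), len(states))
            edges.add((src, tok, dst))
    return states, edges

states, edges = transition_graph()
print(len(states), sorted(edges))  # 2 states and the familiar even/odd parity transitions
```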
The authors then use this nicely assembled toolkit to evaluate models they trained themselves. This is where the paper unfortunately loses some of its water-tightness. One obstacle is that we have no way of obtaining the set of “all possible token sequences” for an LLM and must resort to stochastic sampling.
The paper finds that the reconstructed world models of their LLMs score poorly on the recall and precision metrics. However, the initial definitions don’t explore how the expected values of recall and precision behave under stochastic sampling, and I don’t have a clear idea of what those expected values should be. Coarsely, I believe we should a priori expect to underestimate recall and overestimate precision, but the size of this effect, especially relative to their particular problems and sampling methods, remains to be investigated.
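For what it’s worth, here is the kind of toy harness I have in mind; every choice in it (a no-“bb” language, a per-query error rate, an observation probability of 0.5) is arbitrary and mine, not the paper’s. It estimates boundary precision and recall once with full access to a stand-in model’s validity judgments, and once from a random subsample of its valid continuations, so the two can be compared.

```python
import random
from itertools import product

random.seed(0)
ALPHABET = "ab"
SUFFIXES = ["".join(c) for n in range(1, 5) for c in product(ALPHABET, repeat=n)]
PREFIX_A, PREFIX_B = "aa", "ab"   # two prefixes that occupy genuinely different states

# Toy ground truth: a sequence is well-formed iff it never contains "bb".
def true_valid(p, s):
    return "bb" not in (p + s)

# Stand-in "LLM": matches the ground truth except for a small per-query error rate.
ERR = 0.1
llm = {(p, s): true_valid(p, s) ^ (random.random() < ERR)
       for p in (PREFIX_A, PREFIX_B) for s in SUFFIXES}

def metrics(model_valid):
    """Boundary precision/recall for this single pair of prefixes."""
    true_b  = {s for s in SUFFIXES if true_valid(PREFIX_A, s) != true_valid(PREFIX_B, s)}
    model_b = {s for s in SUFFIXES if model_valid(PREFIX_A, s) != model_valid(PREFIX_B, s)}
    hit = true_b & model_b
    precision = len(hit) / len(model_b) if model_b else 1.0
    recall    = len(hit) / len(true_b)  if true_b  else 1.0
    return precision, recall

# Exhaustive estimate: we get to query the stand-in model on every suffix.
exhaustive = metrics(lambda p, s: llm[(p, s)])

# Sampled estimate: we only observe continuations the model happened to generate,
# so any unobserved suffix is treated as an invalid continuation of that prefix.
def run_trial(observe_prob=0.5):
    observed = {k for k, v in llm.items() if v and random.random() < observe_prob}
    return metrics(lambda p, s: (p, s) in observed)

trials = [run_trial() for _ in range(500)]
avg = tuple(sum(t[i] for t in trials) / len(trials) for i in (0, 1))
print("exhaustive (precision, recall):", exhaustive)
print("sampled    (precision, recall):", avg)
```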
A more parochial issue is that only one of the examined problems was evaluated across a breadth of LLMs. In particular, the initial taxi-route problem was investigated with only a single, small self-trained model.
Such a good set of scalpels deserves much broader application, IMHO!