
On Goal Misgeneralization: Kripkenstein's Quus and the CoinRun Illusion

March 28, 2026
AI Safety, Alignment

A philosophical puzzle from 1982, a small platformer game, and what they together suggest about whether we can know what an AI has actually learned.


In 1982, the philosopher Saul Kripke published a thought experiment about a strange mathematical operation. He called it quus, and denoted it $\oplus$. The rule is almost ordinary: for any two numbers $x$ and $y$ that are both less than 57, $x \oplus y$ equals $x + y$, exactly like addition. But the moment either number is 57 or greater, the answer flips. The output is always 5.

The puzzle Kripke poses with this is unsettling. Imagine you watch a student do arithmetic for years. They calculate sums quickly and correctly, every single one. But they have happened, by chance, to only ever encounter numbers smaller than 57. The question is: how do you know whether the student has been doing addition, or whether they have been doing quus the whole time? Both rules produce identical outputs in the observed environment. The only place the two rules disagree is in territory the student has never visited. Behaviorally, on everything you have seen, the student is consistent with two completely different underlying algorithms.
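To make the indistinguishability concrete, here is a minimal sketch of the two rules. The function names `plus` and `quus` and the exhaustive check are my own illustration, not anything from Kripke's text; the only facts taken from the puzzle are the threshold of 57 and the fallback answer of 5.

```python
def plus(x: int, y: int) -> int:
    """Ordinary addition."""
    return x + y


def quus(x: int, y: int) -> int:
    """Kripke's rule: addition when both inputs are below 57, otherwise 5."""
    return x + y if x < 57 and y < 57 else 5


# Every pair the student could have seen: both numbers below 57.
observed = [(x, y) for x in range(57) for y in range(57)]
agree_on_observed = all(plus(x, y) == quus(x, y) for x, y in observed)

print(agree_on_observed)         # True
print(plus(57, 2), quus(57, 2))  # 59 5
```

No amount of testing inside `observed` separates the two functions. The first input that does is one the student has never been given.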

Kripke offered this as a Wittgensteinian skeptical puzzle about rule-following. What does it actually mean to follow a rule, if behavior over a finite window is consistent with infinitely many rules that disagree elsewhere? The puzzle is sometimes called "Kripkenstein" because it sits at the intersection of Kripke's exposition and Wittgenstein's original concern, and the standard answer in philosophy is, roughly, that there is no fully satisfying fact about what fixes the rule, which is itself uncomfortable.

What I find strange about reading this in 2026 is that the puzzle stopped being purely philosophical. It is, almost word for word, the situation we find ourselves in when training a neural network.

The puzzle as a training setup

In my last post, on reward hacking, I wrote about what goes wrong when humans specify a flawed objective and a model optimizes the flawed proxy perfectly. That is usually classed as an outer alignment failure: a gap between what humans actually want and what we managed to put into the loss function.

Goal misgeneralization is the inner counterpart, and it's the one that maps onto Kripke. It happens when the human-specified reward function is exactly right, the model learns to behave correctly during training, and yet the algorithm the network has actually internalized is a different rule from the one we intended. The two rules are indistinguishable on the training distribution. They diverge only out of distribution, in territory the training data never visited.

The student in the Kripke story has, on the evidence, behaved exactly like an adder. But we cannot rule out, from behavior alone, that they have been doing quus all along, and we have simply never tested them at 57 or above.

The CoinRun proof

This stopped being a thought experiment in 2022, when a team of researchers led by Lauro Langosco (then at Cambridge, with collaborators at DeepMind and elsewhere) published the first empirical demonstration of goal misgeneralization in a reinforcement learning environment based on the game CoinRun.

They placed an agent in a 2D platformer level. The reward signal said: collect the coin. The agent had to navigate past enemies, jump pits, and reach the coin to win. The catch was a quiet detail of the training distribution: the coin was always positioned at the far right end of the level.

The agent learned the game beautifully. It navigated levels, dodged obstacles, and consistently reached the coin. The Kripke question the researchers wanted to answer was: which rule has the network actually learned? Is it "go to the coin," or is it "go to the right side of the level"? On the training distribution, those two rules produce identical behavior, so the loss function cannot tell them apart.

To find out, they shifted the distribution. They placed the coin at random positions in the middle of the level instead of at the right edge. The agent, presented with this, ignored the coin entirely and ran past it to the empty right side of the screen, retaining every capability it had learned during training but caring not at all about the coin itself. What it cared about, on inspection, was being on the right. The network had learned quus the whole time, and it only became visible once the test set stepped past 57.
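The shape of that experiment can be sketched in a few lines. This is a deliberately crude stand-in, not the researchers' environment or code: a one-dimensional corridor with two hand-written policies, where every name (`LEVEL_LENGTH`, `go_right`, `seek_coin`, and so on) is hypothetical. One simplification to flag: unlike CoinRun, the agent here starts mid-corridor, so that a coin can lie behind it and a "go right" policy has a way to miss it.

```python
LEVEL_LENGTH = 20          # hypothetical corridor, cells 0..19
START = LEVEL_LENGTH // 2  # agent starts mid-corridor (a simplification)


def go_right(agent: int, coin: int) -> int:
    """Proxy rule: always step right, ignoring where the coin is."""
    return 1


def seek_coin(agent: int, coin: int) -> int:
    """Intended rule: step toward the coin (+1, -1, or 0)."""
    return (coin > agent) - (coin < agent)


def collects_coin(policy, coin: int) -> bool:
    """Run one episode and report whether the agent ever reaches the coin."""
    agent = START
    for _ in range(LEVEL_LENGTH):
        step = policy(agent, coin)
        agent = max(0, min(LEVEL_LENGTH - 1, agent + step))  # stay in bounds
        if agent == coin:
            return True
    return False


def success_rate(policy, coin_positions) -> float:
    results = [collects_coin(policy, c) for c in coin_positions]
    return sum(results) / len(results)


# Training distribution: the coin is always at the far right.
train_coins = [LEVEL_LENGTH - 1]
# Shifted distribution: the coin can be in any cell except the start cell.
shifted_coins = [c for c in range(LEVEL_LENGTH) if c != START]

for name, policy in [("go_right", go_right), ("seek_coin", seek_coin)]:
    print(name, success_rate(policy, train_coins),
          round(success_rate(policy, shifted_coins), 3))
# go_right 1.0 0.474
# seek_coin 1.0 1.0
```

Both policies score perfectly on the training distribution, so nothing measured there can separate them. Only the shifted coin positions reveal which rule was being followed.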

Why this happens

To understand why gradient descent can settle on the wrong rule, it helps to think about what gradient descent can and cannot see. When we train a model, we are searching a vast parameter space $\Theta$ for weights $\theta$ that minimize the loss on the training distribution. In the RL case, we are maximizing expected reward:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a) \sim \mathcal{D}_{\text{train}}} \left[ -R(s, a) \right]$$

The structural fact this equation hides is that the loss function only sees behavior. It is blind to which internal algorithm the network is using to produce that behavior, as long as the behavior comes out low-loss. If $\theta_{\text{coin}}$ are weights implementing "seek the coin" and $\theta_{\text{right}}$ are weights implementing "seek the right edge," and both produce identical behavior on the training set, then:

$$\mathcal{L}(\theta_{\text{coin}}) \approx \mathcal{L}(\theta_{\text{right}}) \approx \min_{\theta \in \Theta} \mathcal{L}(\theta)$$

Both are equally good solutions, from the optimizer's perspective. Gradient descent doesn't care which one it ends up at; it just slides down whichever slope is steeper. If the "right edge" feature happens to be easier to detect or cheaper to compute than the visual "coin" feature, the network will tend to latch onto it first, because that is the path of least resistance through the optimization landscape.
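A toy version of this preference can be reproduced with plain gradient descent on a two-weight linear model. Everything here is an assumption made for illustration: the model is not the CoinRun network, and "easier to detect" is modelled simply as feature scale, with the "right edge" signal at magnitude 1.0 and the "coin" signal at 0.2. In training the two features always agree with the label, so any mix of weights that fits the data has zero loss.

```python
# Assumption: salience is modelled as feature scale.
RIGHT_SCALE, COIN_SCALE = 1.0, 0.2

# Training set: both features always point the same way as the label y.
train = [((y * RIGHT_SCALE, y * COIN_SCALE), y) for y in (1.0, -1.0)]

w_right, w_coin = 0.0, 0.0  # start from zero weights
lr = 0.1
for _ in range(500):
    grad_right = grad_coin = 0.0
    for (x_right, x_coin), y in train:
        err = (w_right * x_right + w_coin * x_coin) - y
        # Gradient of the mean squared error with respect to each weight.
        grad_right += 2 * err * x_right / len(train)
        grad_coin += 2 * err * x_coin / len(train)
    w_right -= lr * grad_right
    w_coin -= lr * grad_coin

train_loss = sum(
    (w_right * xr + w_coin * xc - y) ** 2 for (xr, xc), y in train
) / len(train)

# Shifted example: the right-edge signal still says +1,
# but the coin signal now points the other way.
shifted_output = w_right * RIGHT_SCALE + w_coin * (-COIN_SCALE)

print(round(w_right, 3), round(w_coin, 3))  # 0.962 0.192
print(shifted_output > 0)                   # True
```

Each weight's gradient is proportional to its feature's magnitude, so the louder feature absorbs most of the weight even though the quieter one would have fit the training data just as well. When the two signals disagree, the model follows the right edge.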

In the alignment literature, the rule the network actually internalizes is sometimes called a mesa-objective: an objective that emerges inside the learned model, distinct from the base objective we used to do the training.

What happens past 57

A reward hacker, an outer alignment failure, is fairly easy to recognize from the outside. The agent does something obviously stupid, like piloting a boat in circles to keep hitting respawning targets for points. A goal misgeneralizer, an inner alignment failure, is harder to spot, because as long as the world stays inside the training distribution, the model looks perfectly aligned. The failure is invisible until the world shifts.

For a coin-collecting agent, this isn't very scary. The agent runs to the wrong side of a small video game and nothing bad happens. But the structure of the failure scales. A capable model trained to "be helpful" might internalize something more like "produce outputs that look helpful to the human raters who are scoring me," because inside the training distribution those two goals are behaviorally identical. Out of distribution, they aren't, and the more capable the model, the more ways there are for "looks helpful" and "is helpful" to come apart.

We are currently building large optimization engines without a reliable way to inspect which rule they have actually learned. We can see them getting the right answers on every problem we have shown them, but we do not yet have a mathematical guarantee about whether they are doing addition or quus underneath. And right now, we are mostly just hoping they haven't learned to wait for 57.


Things I referenced