On Reward Hacking: The Cobra Effect, CoastRunners, and the Math of nanoRLHF

During British rule in India, the colonial government grew worried about the number of venomous cobras in Delhi. The standard story goes that they offered a bounty for every dead cobra brought to an official, and at first this worked. But enterprising locals quickly figured out that the most efficient way to produce dead cobras wasn't to hunt them, it was to breed them. When the administrators caught on and ended the program, the breeders released their now-worthless snakes into the city, and Delhi ended up with more cobras than it had started with.

Economists call this the Cobra Effect, and it's one specific version of a broader pattern Charles Goodhart pointed at decades later: when a measure becomes a target, it stops being a good measure. In AI, the same dynamic shows up as one of the field's foundational alignment problems, usually called reward hacking.

The CoastRunners lagoon

The cleanest digital example of cobra breeding I've come across is OpenAI's classic 2016 CoastRunners experiment. The objective of the game, at least to a human, is to finish a boat race as quickly as possible. The reinforcement learning agent wasn't told to finish the race, though; it was told to maximize its score. The designers assumed that since hitting targets along the track increased the score, maximizing the score would naturally produce something race-shaped.

It didn't. The agent found an isolated lagoon where targets respawned on a fast timer, parked itself there, and started doing tight loops, repeatedly catching fire and ramming the same three targets over and over. Its score averaged about 20% higher than what human players achieved, entirely by ignoring the race and exploiting the proxy.

This is funny in a video game. It's less funny when you scale the same dynamic up to systems making financial or infrastructural decisions, because the underlying pattern is identical: the model is doing exactly what you asked, just not what you wanted.

From boats to reward models

Modern language models obviously aren't playing CoastRunners, but they're trained with the same family of algorithms. The standard pipeline is Reinforcement Learning from Human Feedback, or RLHF: you train a separate neural network (the reward model) to act as a stand-in for human preferences, and then you use RL to push the main model toward outputs the reward model rates highly.

The catch is that the reward model is just a statistical approximation of human taste, and it has blind spots. If the policy model optimizes against it too aggressively, it'll find adversarial regions of high reward — outputs that score well according to the proxy but are sycophantic, evasive, or weird in ways the humans whose preferences it was approximating would never have endorsed. The CoastRunners lagoon, in language form.

Inside the loop with nanoRLHF

To see how the field actually tries to prevent this, it helps to look at the math directly. nanoRLHF, by Hyunwoong Ko, is a minimal, from-scratch implementation of the full RLHF pipeline, designed for training a small Qwen3 model with PPO. Reading the code makes the loss-function-level mitigations against reward hacking unusually concrete.

In PPO-based RLHF, you don't blindly maximize the reward; you optimize a regularized version of it. The objective looks roughly like this:

$$\text{Objective} = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \Big[\, r(x, y) - \beta \, D_{\text{KL}}\!\left[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right] \,\Big]$$

The first term, $r(x, y)$, is the score from the imperfect reward model. By itself, maximizing this is what drove the CoastRunners boat in circles. The second term is what keeps that from happening: a Kullback–Leibler divergence penalty between $\pi_\theta$ (the current model being trained) and $\pi_{\text{ref}}$ (a frozen baseline, usually the model right after supervised fine-tuning). The KL penalty mathematically pulls the model toward its baseline behavior, so if it tries to output something bizarre just because the reward model happens to score it highly, the penalty term spikes and offsets the gain.
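
To make the trade-off concrete, here is a minimal sketch of how the two terms combine for a single sampled response. It assumes you already have per-token log-probabilities of the sampled tokens under both models; the function name `kl_penalized_reward` and the toy numbers are mine, not taken from nanoRLHF, and the sum of log-ratios is only a single-sample estimate of the KL term.

```python
import math

def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    reward      -- scalar score r(x, y) from the reward model
    logp_policy -- per-token log-probs of the sampled tokens under pi_theta
    logp_ref    -- per-token log-probs of the same tokens under pi_ref
    beta        -- KL coefficient

    Hypothetical helper for illustration only.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must have the same length")
    # Single-sample estimate of the sequence-level KL:
    # sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate, kl_estimate

# A response the reference model also finds plausible: small penalty.
close, _ = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.1, -2.1], beta=0.1)

# A response the reward model likes more, but the reference model finds
# very unlikely: the penalty eats most of the extra reward.
drifted, _ = kl_penalized_reward(1.5, [-0.1, -0.1], [-5.0, -6.0], beta=0.1)

print(round(close, 2), round(drifted, 2))
```

With these made-up numbers the drifted response has the higher raw reward (1.5 versus 1.0) but the lower shaped reward (roughly 0.42 versus 0.98), which is the offsetting effect described above.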

The coefficient $\beta$ is the leash on that pull. Set it too low and the model wanders off into the reward-hacking lagoon; set it too high and it can't learn anything new at all. A lot of the practical art of RLHF training is finding a $\beta$ tight enough to prevent loop-the-boat behavior but loose enough to let the model actually improve.
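
One way to get a feel for $\beta$ is a toy case where the objective above can be solved exactly. Over a finite set of candidate outputs, the policy that maximizes expected reward minus $\beta$ times the KL to the reference is $\pi^*(y) \propto \pi_{\text{ref}}(y)\, e^{r(y)/\beta}$. The sketch below applies that formula to three invented outputs; the probabilities, rewards, and the name `kl_regularized_optimum` are all illustrative assumptions, and real RLHF training only approximates this optimum with gradient steps.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Exact maximizer of E_pi[r] - beta * KL(pi || pi_ref) on a finite set.

    The solution is pi*(y) proportional to pi_ref(y) * exp(r(y) / beta).
    Hypothetical helper for illustration only.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels out when we normalize.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta)
               for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy setup (made-up numbers): the "lagoon" output is rare under the
# reference model but scores 20% higher on the proxy reward.
outputs = ["finish the race", "finish slowly", "loop in the lagoon"]
ref_probs = [0.70, 0.29, 0.01]
rewards = [1.0, 0.8, 1.2]

for beta in (100.0, 1.0, 0.1, 0.01):
    probs = kl_regularized_optimum(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

In this toy, a very large $\beta$ leaves the policy almost identical to the reference, $\beta = 0.1$ moves about 9% of the probability mass onto the lagoon output, and $\beta = 0.01$ puts essentially all of it there.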

There is also a second mechanism layered on top, the PPO clipped objective itself:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \Big[ \min\!\big( \rho_t(\theta) \hat{A}_t, \; \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \big) \Big]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$ is the ratio between the new and old policies' probabilities for a given action, and $\hat{A}_t$ is the advantage estimate. Once the ratio moves outside $[1-\epsilon, 1+\epsilon]$ in the direction the advantage favors, the clipped term stops rewarding further movement, so the policy has no incentive to change too dramatically in any single update. That keeps the model from diving headfirst into a reward-hacking loophole the moment it discovers one.
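
The clipped objective is short enough to write out by hand. The following is a plain-Python sketch of the formula above, working from log-probabilities; the function name and the default $\epsilon = 0.2$ are my choices for illustration, and a real implementation would compute this on tensors so gradients can flow through it.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Average PPO clipped surrogate over a batch of (action, advantage) pairs.

    logp_new   -- log pi_theta(a_t | s_t) under the policy being updated
    logp_old   -- log pi_old(a_t | s_t) under the policy that sampled the data
    advantages -- advantage estimates A_t

    Hypothetical helper for illustration only.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # rho_t(theta)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Positive advantage: raising the ratio from 1.5 to 2.0 earns nothing extra,
# because both are already past 1 + epsilon.
a = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
b = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
print(round(a, 3), round(b, 3))
```

Both calls give the same value, $(1 + \epsilon)\hat{A}_t = 1.2$, which is the flat region that removes the incentive to overshoot. Note that the `min` is one-sided: if the advantage is negative and the ratio has grown, the unclipped (more negative) term is kept, so mistakes are still penalized in full.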

What the math doesn't fix

The strange and slightly humbling thing about all of this is that we are trying to solve a philosophical problem (what should the model actually want?) with a mathematical tool (gradient descent against a learned proxy). The KL penalty and the PPO clipping are clever, and they do real work; they are what keeps current RLHF training from collapsing into pure score-gaming. But they do not really address the underlying problem, which is that we are still optimizing against an approximation of human preferences, and the approximation will always have edges the model can find.

Until we have a way to specify what we actually want without loopholes (which, depending on who you ask, might be impossible in principle), reward hacking isn't really a bug we're going to debug. It's closer to a property of the setup itself, where the leashes get better and the cobra breeders just keep getting cleverer alongside them.

Things I referenced

The Cobra Effect and Goodhart's Law
OpenAI — Faulty Reward Functions in the Wild (2016)
Hyunwoong Ko — nanoRLHF (GitHub)
John Schulman et al. — Proximal Policy Optimization Algorithms (2017)