On Scalable Oversight: Meno's Paradox and Weak-to-Strong Generalization

In Plato's Meno, Socrates gets cornered with one of philosophy's stickier traps: how can you search for something if you don't know what it is? And if you do find it, how will you know it's the thing you were looking for? It's a paradox about the limits of inquiry, about what you can and can't get to from the position of not-knowing.

More than two millennia later, Edwin Abbott's 1884 novella Flatland gives the same problem a geometric shape. A two-dimensional Square is visited by a three-dimensional Sphere. The Sphere tries to explain "up" (not north, but out of the plane entirely), and the Square has no way to evaluate the Sphere's claims. He doesn't have the dimensions for it. The math is sound; he just can't see it.

Combine those two ideas, the impossibility of inquiry into what you don't already know and the impossibility of evaluating something operating above your own dimensional ceiling, and you have the alignment problem people call scalable oversight.

The ceiling of human evaluation

The dominant way we currently steer language models is RLHF (reinforcement learning from human feedback): humans rate or rank model outputs, those judgments train a reward model, and the language model is then optimized against that learned reward. This works because, in most domains, the human is still the more capable judge.

That ceiling is already starting to bend in specific places. In some domains, models write code dense enough that no human reviewer can realistically audit it line by line, or generate medical and legal analysis that takes a domain specialist to grade reliably. RLHF only scales as long as humans are still the stronger evaluator, and the moment a model's outputs outrun a reviewer's ability to grade them, we become the Square trying to grade the Sphere. Alignment hits a ceiling exactly where human supervision becomes the bottleneck.

If we can't reliably evaluate a model's outputs, how do we align the model?

The weak-to-strong analogy

Because we don't yet have models that are categorically beyond human evaluation across the board, researchers at OpenAI (Burns et al., 2023) ran a clever empirical proxy. Instead of waiting for the gap to grow, what if you use a weak AI model to supervise a strong AI model?

They took a small model (roughly GPT-2 level) and fine-tuned it on ground-truth labels for a task. It performed poorly, making lots of mistakes. They then used that weak model to generate labels for new, held-out data, and fine-tuned a much more capable model (GPT-4 level) exclusively on those flawed, noisy labels.

The expected result was that the strong model would faithfully reproduce its supervisor's errors. If your teacher tells you 2 + 2 = 5 and grades you accordingly, you learn to say 5.

That isn't what happened. The strong model consistently outperformed its weak supervisor. It didn't just absorb the noise and cap out at the weak model's accuracy; it generalized the underlying concept the weak model was groping toward, and quietly ignored most of its errors. Burns et al. call this weak-to-strong generalization.
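One way to put a number on "outperformed its supervisor" is the performance gap recovered (PGR) metric that Burns et al. report: the fraction of the gap between the weak supervisor and a strong model trained on ground truth that the weakly supervised student manages to close. A minimal sketch follows; the function name and the accuracies plugged into it are hypothetical and only illustrate the arithmetic.

```python
def performance_gap_recovered(
    weak_acc: float,
    weak_to_strong_acc: float,
    strong_ceiling_acc: float,
) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak_acc:            accuracy of the weak supervisor
    weak_to_strong_acc:  accuracy of the strong model trained on weak labels
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed the weak supervisor's accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


# Hypothetical accuracies, not results from the paper.
example_pgr = performance_gap_recovered(
    weak_acc=0.60, weak_to_strong_acc=0.76, strong_ceiling_acc=0.80
)
print(f"PGR: {example_pgr:.2f}")
```

A PGR of 0 means the student merely matched its teacher; a PGR of 1 means weak supervision did as well as ground-truth labels would have.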

Why the strong model ignores its bad teacher

One plausible account is that large neural networks have strong inductive biases toward simple representations. When a capable model tries to fit noisy data, it has a choice: memorize the specific mistakes of its weak supervisor, or find a clean underlying rule that explains the data minus the noise. The clean rule is, in a sense, an easier thing to learn. It lives in a wide, smooth basin of the loss landscape, while the specific noise pattern of the weak supervisor's errors lives in jagged, narrow regions the optimizer has to work harder to reach.
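A toy illustration of that intuition, and emphatically not the paper's setup: below, the "weak supervisor" is simulated as the true labels with 30% flipped at random, and the "student" is a plain logistic regression trained only on those noisy labels. Every name and number is an assumption made for the sketch. A linear model is the extreme case of a bias toward simple rules, since it cannot represent the noise at all, and real weak-model errors are more systematic than independent coin flips.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task: the true rule is a linear boundary in 5 dimensions.
dim, n_train, n_test = 5, 4000, 2000
true_w = rng.normal(size=dim)
x_train = rng.normal(size=(n_train, dim))
x_test = rng.normal(size=(n_test, dim))
y_train = (x_train @ true_w > 0).astype(float)
y_test = (x_test @ true_w > 0).astype(float)

# Simulated weak supervisor: ground truth with 30% of labels flipped at random.
flip = rng.random(n_train) < 0.30
weak_labels = np.where(flip, 1.0 - y_train, y_train)
weak_label_acc = float(np.mean(weak_labels == y_train))

# Student: logistic regression fit by gradient descent on the weak labels only.
w = np.zeros(dim)
learning_rate = 0.5
for _ in range(500):
    probs = 1.0 / (1.0 + np.exp(-(x_train @ w)))
    grad = x_train.T @ (probs - weak_labels) / n_train
    w -= learning_rate * grad

# The student is scored against the true labels, which it never saw.
student_acc = float(np.mean((x_test @ w > 0).astype(float) == y_test))

print(f"weak label accuracy:   {weak_label_acc:.3f}")
print(f"student test accuracy: {student_acc:.3f}")
```

Because the flips carry no consistent pattern, the best simple rule available to the student is the true one, so it ends up more accurate than the labels it was trained on. How much of this carries over to large models and non-random supervisor errors is exactly the empirical question the paper probes.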

To push the effect further, Burns et al. added what they called an auxiliary confidence loss. The default training signal, cross-entropy against the weak supervisor's labels, punishes the strong model for disagreeing with the teacher. The confidence loss softens that. It tells the strong model, in effect: try to match the supervisor, but if you are confident the supervisor is wrong, you don't have to comply. With that one modification, fine-tuning GPT-4 on GPT-2-level supervision recovered close to GPT-3.5-level performance on NLP tasks. Most of the gap between supervisor and supervisee was closed with no stronger signal than noisy labels and permission to disagree.
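As I understand the idea, the confidence loss mixes two cross-entropy terms: one against the weak supervisor's labels and one against the strong model's own predictions, hardened to 0 or 1. The sketch below is a simplified binary-classification version in NumPy. The function names, the fixed mixing weight alpha, and the fixed 0.5 threshold are assumptions for illustration; the paper's exact weighting schedule and thresholding are not reproduced here.

```python
import numpy as np


def binary_cross_entropy(pred, target, eps=1e-7):
    """Element-wise binary cross-entropy between probabilities and targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))


def confidence_loss(strong_probs, weak_labels, alpha=0.5, threshold=0.5):
    """Mix 'match the teacher' with 'trust your own confident prediction'.

    alpha = 0 recovers plain cross-entropy against the weak labels.
    """
    strong_probs = np.asarray(strong_probs, dtype=float)
    weak_labels = np.asarray(weak_labels, dtype=float)
    # Hardened self-predictions; in a real training loop these would be
    # treated as constants (no gradient flows through the thresholding).
    hardened = (strong_probs > threshold).astype(float)
    match_teacher = binary_cross_entropy(strong_probs, weak_labels)
    trust_self = binary_cross_entropy(strong_probs, hardened)
    return float(np.mean((1.0 - alpha) * match_teacher + alpha * trust_self))


# The student is 95% sure the answer is 1; the weak supervisor says 0.
plain = confidence_loss([0.95], [0.0], alpha=0.0)
softened = confidence_loss([0.95], [0.0], alpha=0.5)
print(f"plain cross-entropy: {plain:.3f}")
print(f"with confidence term: {softened:.3f}")
```

When the student confidently disagrees with its teacher, the second term is small, so the overall penalty for disagreeing shrinks as alpha grows. That is the "permission to disagree" in loss-function form.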

Escaping Flatland

This is a strange result because it suggests the teacher does not actually need to be smarter than the student for supervision to work. If neural networks naturally prefer clean generalization over memorizing noise, then weak signals (our values, our laws, our preferences, expressed imperfectly) might be enough to point a stronger model in the right direction.

The paper notes that the technique doesn't yet work well on ChatGPT preference data, where the "right answer" is fuzzier and the inductive bias toward a simple underlying rule has less to bite on. But it's the most empirically promising thing I've read in a while on the question of how to align systems whose outputs we can't fully grade. We might not be able to see the Sphere, but we can still teach it which way is up.

Things I referenced

Plato — Meno (ca. 380 BCE)
Edwin A. Abbott — Flatland (1884)
Collin Burns et al. (OpenAI) — Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (2023)