
On Attribution Graphs: Slime Molds, Heptapod B, and the Basal Ganglia

February 28, 2026
AI Safety, Alignment

Notes on Anthropic's attribution-graph paper, and the strange ways artificial networks keep rediscovering biological ones.


I spent this week reading Anthropic's paper "On the Biology of a Large Language Model," which uses something called attribution graphs to map the internal wiring of Claude 3.5 Haiku. The idea is to trace how distinct features inside the model interact to form actual functional circuits, the same way a neuroscientist might trace which neurons fire together to produce a particular behavior. The paper leans hard into a biology metaphor, treating features like cells, and once you start reading it that way, it's hard to stop noticing how much else from biology the analogy quietly reaches for.

The Slime Mold in the Datacenter

The first thing the paper reminded me of was an old experiment I'd read about back in high school, on Physarum polycephalum, a brainless slime mold. Researchers placed it in a dish with oat flakes arranged to mirror the cities around Tokyo, and over roughly a day the organism pruned its inefficient branches into a network that closely resembled the real Tokyo rail system. The slime mold doesn't have a brain, or any concept of human infrastructure. It just follows local routing rules that somehow add up to globally intelligent behavior.

The attribution graphs in this paper feel a lot like looking through a microscope at a similar kind of emergent vascular system, except inside an LLM. The clearest case is the paper's multi-hop reasoning example. When the model is prompted with "the capital of the state containing Dallas is...", it doesn't leap straight from "Dallas" to "Austin." The graph shows the "Dallas" features triggering a cluster of "Texas" features, while the word "capital" activates a separate "say a capital city" cluster, and the two combine to upweight "Austin." It's doing a kind of invisible chain-of-thought inside a single forward pass.
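To make the two-cluster picture concrete, here is a toy sketch. Everything in it is my own assumption: the lookup tables, the weights, and the function name `predict_capital` are made up for illustration and are not taken from the paper or from any real model. It only shows how an intermediate "Texas" step and a separate "say a capital" signal could add up to "Austin."

```python
# Toy illustration only: the feature names and weights are made up,
# not values read off the paper's attribution graphs.

# Hop 1: a city token activates a cluster of features for its state.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}

# Evidence a state cluster lends to city tokens associated with that state.
STATE_TO_CITY_EVIDENCE = {
    "Texas": {"Austin": 2.0, "Dallas": 1.5, "Houston": 1.5},
    "California": {"Sacramento": 2.0, "Oakland": 1.5, "Los Angeles": 1.5},
}

# Evidence the "say a capital city" cluster lends to capital-city tokens.
SAY_A_CAPITAL_EVIDENCE = {"Austin": 2.0, "Sacramento": 2.0}


def predict_capital(city, state_override=None):
    """Return the output token with the most combined evidence."""
    # state_override mimics an intervention that swaps the intermediate step.
    state = state_override or CITY_TO_STATE[city]
    scores = dict(STATE_TO_CITY_EVIDENCE[state])
    # Hop 2: the two clusters add their evidence on the same output tokens.
    for token, boost in SAY_A_CAPITAL_EVIDENCE.items():
        scores[token] = scores.get(token, 0.0) + boost
    return max(scores, key=scores.get)


print(predict_capital("Dallas"))                               # Austin
print(predict_capital("Dallas", state_override="California"))  # Sacramento
```

Swapping the hidden "Texas" step for "California" changes the answer even though the prompt still says Dallas. That is the same spirit as the intervention experiments the paper uses to check that the intermediate step is actually doing the work.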

Heptapod B and the Poetry Circuit

The section I kept rereading is the one on poetry. When Claude is asked to write a rhyming poem, the attribution graphs show that it settles on candidate rhyming words before it generates the line that leads up to them. The model chooses the destination, holds it as a set of active "planned word" features at the end of the previous line, and then writes the rest of the line so that it lands cleanly on that word.

This reminded me of Heptapod B, the alien written language from Ted Chiang's "Story of Your Life" (the novella that the film Arrival was based on). Because the heptapods experience time non-linearly, they have to know exactly how a sentence ends before they draw the first stroke of one of their logograms. It's a teleological way of operating, where the final state exerts a kind of gravitational pull on the present.

Claude writing poetry looks weirdly heptapod-like. It isn't just blindly predicting the next word one at a time; in this case it's being pulled forward by a future constraint.
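A toy way to picture "destination first, line second" is below. This is a sketch under my own assumptions, not the model's mechanism: the rhyme groups, the `filler_by_target` argument, and both function names are hypothetical. The only point is the ordering, where the last word is fixed before any of the words in front of it.

```python
# Toy illustration only: a hand-written rhyme table, not anything from the paper.
RHYME_GROUPS = [{"habit", "rabbit"}, {"night", "light", "bright"}]


def rhyme_candidates(word):
    """Return the other words in the same rhyme group, if any."""
    for group in RHYME_GROUPS:
        if word in group:
            return sorted(group - {word})
    return []


def write_second_line(first_line_end, filler_by_target):
    """Pick the final word first, then fill in the words that lead to it."""
    # Step 1: choose the destination among rhymes we know how to reach.
    reachable = [w for w in rhyme_candidates(first_line_end) if w in filler_by_target]
    if not reachable:
        return None
    target = reachable[0]
    # Step 2: back-fill the line so it ends on the chosen target.
    return f"{filler_by_target[target]} {target}"


print(write_second_line("night", {"light": "and then we saw the"}))
```

A purely word-by-word writer would have no `target` variable at all. It would just hope the line happened to end somewhere that rhymes.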

Disinhibition and the Basal Ganglia

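The third place biology shows up is in how the model handles refusals and hallucinations. According to the paper, a "can't answer" circuit is on by default, so if you ask Claude about a name it doesn't recognize, that circuit wins and the model declines. But if the prompt contains a known entity, a separate recognition circuit activates, and its main job is to actively inhibit the default refusal. Hallucinations, on this account, are what happens when the recognition circuit misfires: a name feels familiar enough to switch off the refusal even though the model doesn't actually know the answer.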

That maps surprisingly cleanly onto how the basal ganglia work in human neurobiology. Their output nuclei keep up a constant inhibitory "no" signal on the thalamus, which in turn holds back the motor cortex and keeps unwanted movements from starting. When you decide to move, your brain doesn't shout "move"; it sends a signal that temporarily switches that inhibition off. You move by disinhibiting stillness. Claude's factual recall, according to the paper, operates on something like the same architecture: a default state of frozen refusal, and knowledge only surfaces when a recognized concept successfully disinhibits the "don't answer" pathway.
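The disinhibition idea fits in a few lines. This is a minimal sketch with made-up numbers: the function name `respond`, the weights, and the threshold are all my own assumptions, chosen only to show the shape of a default-on refusal that recognition has to suppress.

```python
def respond(known_entity_activation, default_refusal=1.0,
            inhibition_weight=1.5, threshold=0.5):
    """Toy gate: refusal is on by default, and recognition can only subtract from it."""
    # There is no separate "go ahead and answer" drive in this sketch.
    # Answering is simply what is left when the refusal is suppressed.
    refusal = max(0.0, default_refusal - inhibition_weight * known_entity_activation)
    return "decline" if refusal >= threshold else "answer"


print(respond(0.0))  # unrecognized name: refusal stays on
print(respond(1.0))  # well-known entity: refusal is inhibited
print(respond(0.4))  # vaguely familiar name: refusal is also inhibited
```

The last call is the worrying case. A moderate sense of familiarity is enough to switch the refusal off, which is roughly the misfire the paper links to hallucination.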

Convergent Architectures

What I find really cool about this paper is that we're essentially doing cognitive neuroscience on matrices of floating-point numbers. And the artificial networks seem to be converging, on their own, on the same kinds of architectural solutions biological ones use, much like the slime mold independently arriving at something close to a human transit map.

There's also the harder question lurking underneath the cool one. If a forward pass already contains hidden intermediate steps the model never says out loud (the "Texas" between Dallas and Austin, the rhyme decided before the line is written), then a lot of what we treat as the model's "thinking" is invisible to us by default. The paper is mostly hopeful on that front, because attribution graphs are a real method for surfacing some of those hidden steps. But it's a useful reminder that "the model said X" and "the model did X internally for the reason it said" are two different claims, and the gap between them is exactly the territory this kind of work is mapping.

Even with that caveat, this was a great read, and one of the more vivid examples I've come across of why looking inside the box is so much more interesting than just watching what comes out of it.


Things I referenced