Neural Networks Don’t Learn Features. They Learn Directions.

Aryanjotwani — Tue, 26 May 2026 09:55:47 GMT

A deep look at representational geometry — and why it quietly breaks everything you thought you knew about how neural networks store knowledge.

There is a story we tell about neural networks. It goes like this: a network trained on images learns edges in the first layer, then textures, then parts, then whole objects. Each layer builds on the last. Each neuron is a detector for some clean, nameable concept. It is a good story. It is tidy. It is largely wrong.

The reality of what a trained neural network actually does with information is stranger, more geometric, and in some ways more unsettling than the clean hierarchy we like to imagine. Understanding it properly requires abandoning the neuron as the fundamental unit of analysis — and replacing it with something far more abstract: a direction in high-dimensional space.

This is a piece about representational geometry. It is about what neural networks actually encode, where they encode it, and why the answer to both questions is not what most people think.

The activation space nobody talks about

Consider a layer in a neural network with n neurons. For any given input, those neurons produce a vector of n real numbers: an activation pattern. Every input maps to some point in this n-dimensional space. The network learns by shaping how inputs land in this space.

Now here is the thing almost never said explicitly: the individual axes of this space — the neurons themselves — are not privileged. They are a basis, yes, but not a meaningful one. The neuron activations are just coordinates. The information is not stored in neurons. It is stored in the geometry of how inputs distribute across the space those neurons span.

Key insight: A single neuron’s activation value tells you almost nothing on its own. What matters is the pattern — the vector — across many neurons simultaneously. The neuron is the coordinate axis. The concept lives in the direction.
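One way to see why the axes carry no special meaning for anything that reads the layer linearly is to rotate them. The sketch below uses made-up activations and a random orthogonal change of basis (both are assumptions for illustration, not taken from any real model). The per-neuron values change completely, while the dot products between inputs stay the same. Note that an elementwise nonlinearity such as ReLU does act along the neuron axes, so this is a statement about linear readouts only.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 8                                    # hypothetical layer width
acts = rng.normal(size=(5, n))           # 5 inputs, one activation vector each

# Random orthogonal matrix (a rotation/reflection of the coordinate axes) via QR.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
rotated = acts @ Q

# Individual "neuron" values look unrelated after the change of basis...
print(np.round(acts[0, :3], 2), np.round(rotated[0, :3], 2))

# ...but every pairwise dot product between inputs (the geometry) is preserved.
gram_before = acts @ acts.T
gram_after = rotated @ rotated.T
print(np.allclose(gram_before, gram_after))
```

A downstream linear layer can absorb the rotation into its weights, which is why the pattern across neurons, rather than any single coordinate, is the thing worth measuring.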

This is not a minor semantic distinction. It has deep consequences. When you hear someone say “this neuron responds to dog faces,” they are reporting a convenient measurement, not a mechanism. They found a neuron with a high activation on dog-face inputs. But whether that neuron represents dog faces in any deep sense is a very different question.

What “direction” actually means

Let’s make this concrete. Suppose a network has learned to represent the concept of royalty somewhere in its activations. The classic finding from word embeddings, replicated many times in different forms, is that:

king − man + woman ≈ queen

The operation works not because there is a single “royalty neuron” somewhere, but because royalty corresponds to a specific direction in the embedding space — a vector you can literally add and subtract. Gender is another direction. You can navigate the space by moving along these directions, and the resulting points correspond to real concepts.

This is the core of representational geometry: concepts are directions, not locations. A concept is not stored at a particular neuron or coordinate. It is encoded in a direction that cuts across many neurons, and the strength of that concept in a given representation is the dot product of the activation vector with that direction.

Concepts as directions in a simplified 2D activation space. “Royalty” and “gender” are orthogonal directions — the parallelogram structure means king − man + woman ≈ queen holds exactly.
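The parallelogram picture can be written down directly. This is a toy sketch with hand-picked 2D vectors (the `royalty` and `gender` directions and the four embeddings are assumptions chosen to be exactly orthogonal, which real embeddings are not). It shows the arithmetic and the dot-product readout of concept strength.

```python
import numpy as np

# Hypothetical 2D concept directions, orthogonal by construction.
royalty = np.array([0.0, 1.0])
gender = np.array([1.0, 0.0])    # +1 = female, -1 = male in this toy setup

emb = {
    "man": -gender,
    "woman": gender,
    "king": royalty - gender,
    "queen": royalty + gender,
}

result = emb["king"] - emb["man"] + emb["woman"]

def nearest(vec, table):
    # Return the word whose embedding has the highest cosine similarity to vec.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(table, key=lambda w: cos(vec, table[w]))

print(nearest(result, emb))        # queen
print(emb["queen"] @ royalty)      # concept strength as a dot product: 1.0
```

In trained embeddings the directions are only approximately orthogonal, so the result lands near "queen" rather than exactly on it, and the query words themselves are usually excluded from the nearest-neighbour search.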

This is not just a curiosity of word embeddings. It shows up in vision models, in language models, in reinforcement learning agents. Wherever we have looked carefully at the geometry of activation spaces, we find this structure: concepts encoded as linear subspaces, manipulable by vector arithmetic, distributed across many neurons simultaneously.

The superposition hypothesis: one neuron, many concepts

If concepts are directions, and a network has n neurons, then naively you might think a network can only store n independent concepts. In an n-dimensional space, you can have at most n mutually orthogonal directions.

But this is where it gets genuinely strange.

In high-dimensional spaces, you can pack far more than n nearly orthogonal directions. Specifically, for any fixed tolerance ε, the number of unit vectors in an n-dimensional space whose pairwise dot products all stay below ε in absolute value grows exponentially with n, roughly as e^(cε²n) for some constant c. They are not perfectly orthogonal, but they are close enough that linear decoders can distinguish them with high accuracy.
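A quick numerical check makes the near-orthogonality claim less abstract. The sketch below samples random unit vectors (random directions are an assumption here, standing in for learned ones) and reports the worst-case overlap when there are four times more vectors than dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_dims, n_vectors):
    # Sample random unit vectors and return the largest |cosine| between any two.
    v = rng.normal(size=(n_vectors, n_dims))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    gram = v @ v.T
    np.fill_diagonal(gram, 0.0)          # ignore each vector's overlap with itself
    return np.abs(gram).max()

# Four times more vectors than dimensions in both cases.
low = max_abs_cosine(n_dims=16, n_vectors=64)
high = max_abs_cosine(n_dims=512, n_vectors=2048)
print(f"n=16:  max |cos| = {low:.2f}")
print(f"n=512: max |cos| = {high:.2f}")
```

In the low-dimensional case some pairs overlap heavily. In the higher-dimensional case even the worst pair stays far from parallel, despite the same ratio of vectors to dimensions.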

This means a network does not need a dedicated neuron per concept. It can store vastly more concepts than it has neurons — by placing each concept in a slightly non-orthogonal direction and tolerating a small amount of interference between them. This is called superposition.

The superposition hypothesis: Neural networks represent more features than they have neurons by assigning each feature a direction in activation space, and packing many such directions into the same space. Individual neurons are not feature detectors; they are components of many features simultaneously.

The trade-off is interference. When two concepts occupy non-orthogonal directions, activating one slightly activates the other. The network learns to tolerate this interference — treating it as noise — when the benefit of encoding more concepts outweighs the cost of cross-contamination.

Anthropic’s interpretability research (the Toy Models of Superposition paper, 2022) showed this concretely in small synthetic networks: when you give a network more features than neurons to represent, it does not simply discard features. It folds them into superposition, encoding them in non-orthogonal directions with small but nonzero interference. The network learns the geometry that minimizes total reconstruction error.
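To make the folding concrete, here is a minimal sketch in the spirit of that setup: five features squeezed into two neurons and read back with a ReLU. The weights and bias below are set by hand (evenly spaced directions on a circle) rather than trained, so treat it as an illustration of the geometry, not a reproduction of the paper's experiments.

```python
import numpy as np

n_features = 5

# Hand-set (not trained) weights: five unit feature directions spread evenly around a circle.
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])    # shape (2 neurons, 5 features)

# Off-diagonal entries of W.T @ W are the interference between feature directions.
interference = W.T @ W

# Negative bias chosen to cancel the largest positive interference (between neighbours).
bias = -np.cos(2 * np.pi / n_features)

def reconstruct(x):
    hidden = x @ W.T                               # compress 5 features into 2 neurons
    return np.maximum(hidden @ W + bias, 0.0)      # expand back, then ReLU filters interference

# One feature active at a time (the sparse case): each row is one input.
outputs = reconstruct(np.eye(n_features))
print(np.round(interference[0], 3))
print(np.round(outputs, 3))
```

Each single-feature input comes back with only its own entry nonzero, at reduced magnitude. The trick depends on sparsity: when several features are active at once, their interference adds up and the reconstruction degrades.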

Why individual neurons are almost always polysemantic

Superposition immediately explains something that has confused interpretability researchers for years: polysemanticity. When you look at what activates individual neurons in large networks, you almost always find that a single neuron responds strongly to multiple, seemingly unrelated concepts.

A neuron in a vision model might respond to dog ears, car tyres, and curved metal. A neuron in a language model might fire on references to legal proceedings, mathematical proofs, and formal correspondence. The response pattern feels incoherent — like the neuron is broken, or the concept is just noise.

But in superposition, this is exactly what you should expect. If a neuron is not a feature detector but a basis axis, it will participate in many different feature directions simultaneously. When any of those features activates, the neuron fires. The neuron is not confused. It is doing exactly what a basis vector in a compressed representation should do — contributing to many independent directions at once.
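The same point can be checked with a few lines of arithmetic. In this sketch, features are random unit directions (an assumption, with hypothetical sizes) and we ask how many of them move a single neuron noticeably when activated alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 20, 100      # hypothetical sizes: 5x more features than neurons

# Each row is one feature's unit direction in activation space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activating feature k alone puts the layer at directions[k],
# so neuron 0 reads out the coordinate directions[k, 0].
neuron0 = directions[:, 0]
strong = np.flatnonzero(np.abs(neuron0) > 0.3)
print(f"neuron 0 responds strongly to {strong.size} of {n_features} features")
```

The 0.3 threshold is arbitrary. Whatever cutoff is chosen, one coordinate axis ends up with a sizeable loading on many unrelated feature directions, which is what a polysemantic neuron looks like from the outside.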

Monosemanticity assumes a 1:1 mapping between neurons and features. Superposition shows that a network with just 2 neurons can represent 4 (or many more) concepts as distinct directions — with small mutual interference.

The geometry of generalization

Here is where representational geometry starts explaining something much bigger: why neural networks generalize at all.

Consider what it means for a network to generalize to unseen inputs. It means that some structured relationship in the training data has been captured in a form that correctly extrapolates. If concepts are encoded as directions, then generalization corresponds to a beautifully simple idea: new inputs that combine familiar concepts in novel ways land at the correct location in activation space because vector addition is linear.

A network that has learned direction vectors for red, round, and edible can correctly process a novel red round edible thing even if it was not in the training set — provided those directions are correctly encoded and the right combination is activated. The network does not need to have seen every combination. It needs to have learned the geometry.
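As a sketch of that argument, suppose the concept directions are orthonormal but oblique to the neuron axes (an idealized assumption, and the `metallic` direction is added here only as a control). A combination built by plain vector addition can then be read back concept by concept with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concept directions: orthonormal, but not aligned with any neuron axis.
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
concepts = {"red": Q[:, 0], "round": Q[:, 1], "edible": Q[:, 2], "metallic": Q[:, 3]}

# A combination assumed never to appear in training, built by vector addition.
novel = concepts["red"] + concepts["round"] + concepts["edible"]

# Read each concept back out with a dot product.
readout = {name: float(novel @ d) for name, d in concepts.items()}
print({name: round(value, 3) for name, value in readout.items()})
```

The readout is approximately 1 for the three concepts that are present and approximately 0 for the absent one, even though nothing in the code ever handled this particular combination. With non-orthogonal directions the same readout would pick up interference terms.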

This is why compositional generalization — the ability to handle novel combinations of known concepts — is so natural when representations are geometric, and so hard when they are not. A network that stores concepts as discrete symbols or as entangled, non-compositional patterns cannot easily compute with new combinations. A network with clean directional geometry can.

Generalization is not memorization with interpolation. It is geometry with extrapolation.

Linear probes and what they reveal

One of the most striking empirical confirmations of this picture comes from linear probing. If you take the activation vectors from an intermediate layer of a large network and train a simple linear classifier on top — no hidden layers, just a dot product and a sigmoid — you can often recover surprisingly rich information.

A linear classifier on intermediate BERT representations can detect syntactic structure, coreference, named entity type, semantic role, and more. A linear probe on a vision model’s mid-layer activations can separate scene categories, object presence, spatial layout. GPT-style models encode future tokens’ properties linearly in present-token activations.

This matters enormously. A linear probe can only succeed if the relevant information is encoded as a direction — i.e., there exists a linear subspace of the activation space where the signal lives. The success of linear probes across diverse tasks and architectures is direct empirical evidence that representational geometry is not a theoretical abstraction. It is the actual structure of what networks learn.
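A linear probe is small enough to write out in full. The sketch below uses synthetic activations in which a binary concept is written along one hidden direction (that setup, the sizes, and the learning rate are all assumptions), then trains a logistic-regression probe with plain gradient descent and checks how well the learned weights line up with the true direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_samples = 32, 2000

# Assumption: a binary concept is written along one oblique unit direction.
true_dir = rng.normal(size=n_dims)
true_dir /= np.linalg.norm(true_dir)

labels = rng.integers(0, 2, size=n_samples)
acts = rng.normal(size=(n_samples, n_dims))             # unrelated variation
acts += np.outer(2.0 * (2 * labels - 1), true_dir)      # shift by +/-2 along the concept direction

train, test = slice(0, 1500), slice(1500, None)

# Linear probe: logistic regression (a dot product and a sigmoid), trained by gradient descent.
w, b = np.zeros(n_dims), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(acts[train] @ w + b)))
    grad = p - labels[train]
    w -= 0.1 * acts[train].T @ grad / grad.size
    b -= 0.1 * grad.mean()

accuracy = ((acts[test] @ w + b > 0) == labels[test]).mean()
alignment = w @ true_dir / np.linalg.norm(w)
print(f"probe accuracy: {accuracy:.2f}, cosine with true direction: {alignment:.2f}")
```

On real models the true direction is unknown, so only the accuracy is observable. A common sanity check is to train the same probe on shuffled labels to confirm it is not simply memorizing.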

Why this matters for interpretability. If representations are linear, then understanding a network reduces to finding the right directions. This is the core bet behind mechanistic interpretability: decompose the activation space into meaningful directional components and you decompose the network’s knowledge into understandable pieces.

The residual stream and privileged bases

Transformer architectures give a particularly clean instantiation of these ideas. The residual stream in a transformer — the sequence of d_model-dimensional vectors that get updated by each attention head and MLP layer — is a communication channel between layers. Each layer reads from it, adds its contribution, and the next layer reads again.

What gets written into the residual stream? Directions. Each attention head computes a low-rank update — a small set of directions projected into the residual stream space. Each MLP neuron’s output is a direction in residual stream space scaled by that neuron’s activation. The entire computation of a transformer is, in a real sense, a sequence of directional updates to a shared high-dimensional vector.
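Both claims can be checked on random matrices. This is a stripped-down sketch that ignores biases, layer normalization, and the attention pattern itself, and the names `W_in`, `W_out`, `W_V`, `W_O` and all sizes are illustrative assumptions. It verifies that an MLP's output is a sum of fixed directions scaled by activations, and that one head's value-to-output map has rank at most the head dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, d_mlp = 64, 8, 256      # hypothetical sizes

resid = rng.normal(size=d_model)

# MLP: each hidden neuron owns one row of W_out, a fixed direction in the residual stream.
W_in = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_mlp)
hidden = np.maximum(resid @ W_in, 0.0)               # ReLU activations
mlp_update = hidden @ W_out
as_sum_of_directions = sum(a * W_out[i] for i, a in enumerate(hidden))

# Attention head (value/output path only): W_V then W_O is a map of rank <= d_head.
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
head_rank = np.linalg.matrix_rank(W_V @ W_O)

resid = resid + mlp_update                           # layers add to the stream, never overwrite it
print(np.allclose(mlp_update, as_sum_of_directions), head_rank)
```

The head can therefore only write into an 8-dimensional subspace of the 64-dimensional stream in this toy configuration, which is the sense in which its update is low-rank.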

A crucial and underappreciated point: the residual stream has no privileged basis. There is no reason the network should respect the coordinate axes corresponding to individual embedding dimensions. It is free to use any directions it finds useful. And empirically, it does: the directions that carry semantic information in large language models are almost never aligned with the standard basis axes. They are oblique, spanning many nominal dimensions at once.

This is why analyzing language models by looking at individual embedding dimensions is almost always unrevealing. You are looking at projections onto arbitrary axes. The information lives in directions the training process chose, which are invisible to dimension-wise analysis.

When geometry breaks: representation collapse and feature suppression

If clean geometry enables generalization, then failures of geometry explain failures of generalization.

Representation collapse is what happens when a self-supervised learning objective — like contrastive learning without carefully designed constraints — causes all inputs to map to nearly the same region of activation space. The geometry degenerates. Every point crowds toward a small manifold. Linear probes fail because the directional structure that would enable decoding is gone.
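Since collapse does not show up in the loss, it helps to measure it directly. One common diagnostic, not named above and used here as an assumption about how one might check, is the participation ratio of the covariance eigenvalues, which estimates how many dimensions the activations actually occupy. The data below is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_dim(acts):
    # Participation ratio of the covariance eigenvalues: (sum l)^2 / sum(l^2).
    centered = acts - acts.mean(axis=0)
    eigvals = np.linalg.eigvalsh(centered.T @ centered / len(acts))
    eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negative values
    return eigvals.sum() ** 2 / (eigvals ** 2).sum()

n_samples, n_dims = 1000, 32
healthy = rng.normal(size=(n_samples, n_dims))

# Simulated collapse: nearly all variation lies along a single direction.
one_dir = rng.normal(size=n_dims)
collapsed = (np.outer(rng.normal(size=n_samples), one_dir)
             + 0.01 * rng.normal(size=(n_samples, n_dims)))

print(f"healthy:   {effective_dim(healthy):.1f} effective dims")
print(f"collapsed: {effective_dim(collapsed):.1f} effective dims")
```

The healthy activations use close to all 32 dimensions, while the collapsed ones use about one. Tracking a number like this during training is a cheap way to notice degeneration that the loss curve hides.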

Feature suppression is subtler. In overparameterized networks trained on data with correlated features, gradient descent can choose to encode only the most predictive features and actively suppress the rest — not just ignore them, but represent inputs in subspaces orthogonal to the suppressed features’ directions. This is now understood to be one mechanism behind shortcut learning and distributional sensitivity: the network geometrically erases the directions it decided were irrelevant.
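Geometric erasure has a simple linear-algebra form: project the activations onto the subspace orthogonal to the feature's direction. The sketch below does this by hand on random data (an assumption, since a trained network would arrive at such a representation through its weights rather than an explicit projection) to show what a downstream linear readout is left with.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims = 16

suppressed = rng.normal(size=n_dims)
suppressed /= np.linalg.norm(suppressed)            # unit direction of the erased feature

acts = rng.normal(size=(500, n_dims))
feature_value = acts @ suppressed                    # what a linear readout would have seen

# Project every activation onto the subspace orthogonal to the suppressed direction.
erased = acts - np.outer(acts @ suppressed, suppressed)

print(np.abs(feature_value).mean())                  # clearly nonzero before erasure
print(np.abs(erased @ suppressed).max())             # numerically zero afterwards
```

After the projection, no linear readout along the suppressed direction can recover the feature, while every other direction is left untouched. That is why the loss on the training distribution can stay low.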

Both failures are invisible to loss curves. The model trains, the loss goes down, the validation accuracy looks fine — until you test on inputs that require the suppressed features, or that live in the collapsed region of the space. The geometry was wrong, and the loss never told you.

The unanswered question at the center of all of this

Understanding that representations are geometric raises an immediate and deeply unsettling question: which directions does the network actually use, and why?

We know the network has immense freedom in choosing its directional encoding. For any given task, there are infinitely many geometric configurations of the activation space that would achieve zero training loss. The network picks one. But the one it picks is determined by a combination of architecture, initialization, optimizer dynamics, and data distribution — not by any explicit constraint toward interpretable or clean geometry.

Sometimes the geometry is miraculously clean: we find linear subspaces that correspond to human-legible concepts like gender, number, tense, sentiment. But sometimes it is not. We find entangled directions, superimposed features with high interference, representations that are linear for some probes and nonlinear for others.

We do not yet have a theory of when networks learn clean directional geometry and when they do not. We do not know whether the geometry of a large language model’s internal representations is “mostly clean with noisy edges” or “mostly opaque with islands of linearity.” We cannot yet look at a network’s activation geometry and predict whether it will generalize well, be robust to distribution shift, or be vulnerable to adversarial perturbation.

We have learned to build extraordinarily capable geometric reasoning machines. We have not yet learned to read the geometry we built.

Why this changes how you should think about networks

The practical implications of this picture are significant.

When you ask why a network fails on out-of-distribution inputs, the geometric answer is: because the test inputs landed in a region of activation space where the learned directional structure no longer holds. The concepts encoded as directions in training-distribution space may point in different relative directions for OOD inputs. The arithmetic breaks.

When you ask why larger networks generalize better, one geometric answer is: they have higher-dimensional activation spaces, which allow more concepts to be encoded in near-orthogonal directions, reducing superposition interference and enabling cleaner arithmetic. Width is not just capacity. It is the dimensionality of the geometric workspace.
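The width argument can be illustrated numerically. This sketch fixes the number of concepts and uses random unit directions as a stand-in for learned ones (an assumption, since trained networks can arrange directions more cleverly), then measures the average overlap between distinct directions as the space gets wider.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 512                      # fixed number of concepts to store

def mean_interference(width):
    # Mean |cosine| between distinct random unit directions in a `width`-dim space.
    v = rng.normal(size=(n_features, width))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    gram = v @ v.T
    off_diag = gram[~np.eye(n_features, dtype=bool)]
    return np.abs(off_diag).mean()

widths = [64, 256, 1024]
interference = [mean_interference(w) for w in widths]
for w, value in zip(widths, interference):
    print(f"width {w:5d}: mean |cos| = {value:.3f}")
```

For random directions the average overlap shrinks roughly in proportion to one over the square root of the width, so each fourfold increase in width about halves the interference.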

When you ask why fine-tuning on a small dataset can catastrophically overwrite a large model’s knowledge, the geometric answer is: fine-tuning shifts the principal directions of the activation space. If the fine-tuning signal is strong enough relative to the pretrained geometry, it can rotate the directional structure, replacing encoded concepts with new ones without the original directions being recoverable.

And when you ask what it would mean to truly understand a neural network — not just predict its outputs, but understand its internal reasoning — the answer is geometric: find the directional decomposition of its activation space that maps onto the causal structure of the task it is solving. Every other form of understanding is incomplete.

We built networks that think in directions we cannot see, in spaces we cannot visualize, using arithmetic we can only partially decode. The geometry is always there. The question is whether we are looking at the right projections to find it.

The neuron was never the right unit. It was just the only handle we had.

Elhage et al., “Toy Models of Superposition” (Anthropic, 2022): the clearest empirical treatment of superposition in small networks.
Mikolov et al., “Linguistic Regularities in Continuous Space Word Representations” (2013): the original word vector arithmetic paper.
Tenney et al., “BERT Rediscovers the Classical NLP Pipeline” (2019): systematic layer-by-layer probing of BERT representations.
Goh et al., “Multimodal Neurons in Artificial Neural Networks” (Distill, 2021): on polysemanticity in vision models.
