Understanding “Scaling Monosemanticity” in AI Models: A Comprehensive Analysis

Milani Mcgraw · The Deep Hub · Jul 10, 2024

A comprehensive analysis of “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” by Templeton, A. et al.

Extracting Features from Production Models: The image showcases the sycophantic praise feature, highlighting dataset examples and a prompt completion influenced by this feature. This demonstrates how sparse autoencoders can identify and steer specific behaviors in language models.

In the rapidly evolving landscape of AI, understanding the intricacies of large language models (LLMs) is crucial. One aspect of this is monosemanticity: the property that individual components of an AI system (such as neurons or learned features) respond to a single, specific idea or feature. Monosemanticity makes AI systems easier to understand and interpret. Within the last year, Anthropic’s interpretability team successfully extracted high-quality features from Claude 3 Sonnet, Anthropic’s medium-sized production model. This work aims to understand the model’s complex representations, focusing on its internal mechanics and safety considerations. In earlier work, the team had shown that sparse autoencoders could recover monosemantic features from a small, one-layer transformer.

However, there’s a gap in understanding how increasing the size of AI models impacts their specialization and interpretability.

The main question the researchers wanted to answer is whether the methods that worked for smaller models would also be effective for larger models like Claude 3 Sonnet.

A recent study takes on this very challenge by employing sparse autoencoders (SAEs) to extract interpretable features from the Claude 3 Sonnet model. The paper, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” describes an approach to understanding Claude 3 Sonnet through the lens of dictionary learning, implemented as sparse autoencoders. Its successful application to Claude 3 Sonnet demonstrates that sparse autoencoders are feasible for far more complex models.

By training a two-layer SAE on the model’s activations, the researchers were able to uncover hidden patterns and structures within Claude 3 Sonnet, shedding light on how it processes information. They used scaling-laws analysis to guide the training process, ensuring that the SAEs could handle the scale of data processed by such a large model.

Key Results

  • Interpretable Features in Large Models: Sparse autoencoders can produce interpretable features even in large models like Claude 3 Sonnet.
  • Safety Concerns: A broad range of safety-relevant features were observed, including those related to deception, sycophancy, bias, and dangerous content.
  • Scaling Laws for Training: The study shows that scaling laws can guide the effective training of sparse autoencoders.
  • Highly Abstract Features: The features extracted are highly abstract, generalizing across languages, modes (text and images), and between concrete and abstract references.
  • Systematic Relationship: There is a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
  • Steering Large Models: The identified features can be used to influence the behavior of large models.
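To make the last point concrete, here is a minimal, hypothetical sketch of the general idea behind feature steering: encode an activation into feature space, clamp one feature to a chosen value, and decode back into the residual stream. The tensor names and sizes are illustrative stand-ins; the paper’s actual intervention is applied during the model’s forward pass and also accounts for the reconstruction error, which this sketch omits.

```python
# Minimal sketch of feature steering: clamp one learned feature to a chosen
# value and rebuild the residual-stream activation from the SAE decomposition.
# All tensors here are random stand-ins for real model weights and activations.
import torch

d_model, n_features = 64, 512                  # toy sizes, far smaller than the paper's
W_enc = torch.randn(n_features, d_model) * 0.02
b_enc = torch.zeros(n_features)
W_dec = torch.randn(d_model, n_features) * 0.02
b_dec = torch.zeros(d_model)

def steer(x: torch.Tensor, feature_idx: int, clamp_value: float) -> torch.Tensor:
    """Return a steered activation with one feature clamped to clamp_value."""
    f = torch.relu(W_enc @ x + b_enc)          # feature activations
    f[feature_idx] = clamp_value               # clamp the chosen feature
    return b_dec + W_dec @ f                   # reconstruct the activation

x = torch.randn(d_model)                       # stand-in residual-stream activation
x_steered = steer(x, feature_idx=42, clamp_value=10.0)
print(x_steered.shape)                         # torch.Size([64])
```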

So, let’s get into some background information.

The primary objective of this work is to decompose the activations of Claude 3 Sonnet into more interpretable components. This is achieved by training a sparse autoencoder (SAE) on the model’s activations.

Sparse Autoencoder: A type of autoencoder neural network where the hidden layer is constrained to be sparse, meaning that only a few neurons are active at a time.

Structure and Function of the SAE: Two-Layer Structure

  • Encoder Layer: Maps the model’s activations to a higher-dimensional space using linear transformation and ReLU nonlinearity.
  • Decoder Layer: Attempts to reconstruct the original model activations from the high-dimensional features.
  • ReLU (Rectified Linear Unit): Activation function used in artificial neural networks that introduces nonlinearity in models to help the network learn complex patterns in data
ReLU function
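To make this structure concrete, below is a minimal PyTorch sketch of a sparse autoencoder with the encoder/decoder layout described above. The class name, dimensions, and initialization are illustrative assumptions, not the paper’s exact implementation.

```python
# A minimal sparse autoencoder: a ReLU encoder that maps model activations into
# a larger feature space, and a linear decoder that reconstructs them.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # W_enc, b_enc
        self.decoder = nn.Linear(n_features, d_model)   # W_dec, b_dec

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Feature activations; sparsity is encouraged by the training penalty,
        # so only a few of these should be nonzero for any given input.
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Reconstruct the original activation from the feature activations.
        return self.decoder(f)

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f

# Toy sizes; the paper's SAEs reach ~34M features on a production model.
sae = SparseAutoencoder(d_model=512, n_features=8192)
x = torch.randn(8, 512)                 # a batch of stand-in activations
x_hat, f = sae(x)
print(x_hat.shape, f.shape)             # torch.Size([8, 512]) torch.Size([8, 8192])
```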

“SAEs are an instance of a family of ‘sparse dictionary learning’ algorithms that seek to decompose data into a weighted sum of sparsely active components” (Templeton et al.). Dictionary learning has recently been shown to be effective for transformer language models, particularly when implemented as sparse autoencoders.

  • Dictionary Learning: A standard method for learning a set of basis vectors such that any input can be represented as a sparse combination of these basis vectors.

The approach is based on two key hypotheses: the linear representation hypothesis and the superposition hypothesis. These hypotheses provide a theoretical foundation for the methods used in the study.

  • Linear Representation Hypothesis: suggests that neural networks represent meaningful features as directions in their activation space
  • Superposition Hypothesis: extends the linear representation hypothesis by proposing that neural networks use almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions
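A small toy experiment helps build intuition for the superposition hypothesis: random unit vectors in a high-dimensional space are nearly orthogonal, so far more candidate feature directions than dimensions can coexist with only minor interference. The sizes below are arbitrary.

```python
# Toy illustration of the superposition hypothesis: random directions in a
# high-dimensional space are nearly orthogonal, so many more "features" than
# dimensions can be represented with only small interference between them.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 4096                    # 8x more features than dimensions

directions = rng.standard_normal((n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Cosine similarity between every pair of feature directions.
cos = directions @ directions.T
off_diag = cos[~np.eye(n_features, dtype=bool)]

print(f"max |cosine| between distinct features:  {np.abs(off_diag).max():.3f}")
print(f"mean |cosine| between distinct features: {np.abs(off_diag).mean():.3f}")
# Typically the maximum is around 0.2-0.25 and the mean around 0.035:
# highly correlated pairs are rare, so the directions barely interfere.
```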

In Claude 3 Sonnet, as in other deep models, features such as specific concepts or patterns may not be confined to a single layer. Instead, these features could be represented in a distributed manner across several layers. This cross-layer superposition makes it difficult to interpret and isolate individual features, as their representation is smeared across layers.

  • Cross-Layer Superposition: when features are not confined to a single layer but are distributed across multiple layers, making their representations harder to isolate and interpret.

To explore how scaling affects monosemanticity, the researchers used a systematic approach with the following techniques:

  • Activation Maximization: This technique finds what kind of input maximizes a neuron’s activity, helping to understand what concept or feature the neuron is most responsive to.
  • Interpretability Tools: These tools help visualize and interpret the patterns of neuron activity, making it easier to identify specialized neurons.
  • Scaling Models: The researchers experimented with AI models of different sizes, increasing the number of layers and parameters to see how it affects the development of monosemantic neurons.
  • Criteria for Monosemantic Neurons: Neurons were considered monosemantic if they consistently responded to a single concept or feature.

Golden Gate Bridge Feature: This image illustrates how the feature strongly activates on English descriptions and associated concepts of the Golden Gate Bridge. It also shows activation in multiple other languages and relevant images, demonstrating the feature’s ability to generalize across different languages and modalities.
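One simple way to characterize what a feature represents, in the spirit of the dataset examples shown in figures like the one above, is to collect the inputs on which it activates most strongly. The sketch below assumes a hypothetical set of precomputed activations and encoder weights; it is not the paper’s actual pipeline.

```python
# Sketch: characterize a feature by the dataset examples that activate it most.
# The encoder weights and the "dataset" below are random, hypothetical stand-ins.
import heapq
import torch

def top_activating_examples(acts, texts, W_enc, b_enc, feature_idx, k=5):
    """Return the k (activation, text) pairs where the feature fires hardest.

    acts:  (n_examples, d_model) residual-stream activations
    texts: the corresponding text snippets
    """
    f = torch.relu(acts @ W_enc.T + b_enc)[:, feature_idx]
    return heapq.nlargest(k, zip(f.tolist(), texts), key=lambda pair: pair[0])

# Toy usage with random stand-ins:
d_model, n_features, n_examples = 64, 256, 1000
W_enc = torch.randn(n_features, d_model) * 0.02
b_enc = torch.zeros(n_features)
acts = torch.randn(n_examples, d_model)
texts = [f"snippet {i}" for i in range(n_examples)]
for score, text in top_activating_examples(acts, texts, W_enc, b_enc, feature_idx=7):
    print(f"{score:.3f}  {text}")
```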

The study found a wide variety of highly abstract features within Claude 3 Sonnet. These features not only respond to abstract behaviors but can also be used to causally influence them. Importantly, the presence of a feature does not necessarily imply that the model will exhibit the corresponding harmful behavior.

A few examples include:

  • Famous people
  • Countries and cities
  • Type signatures in code

Abstract and Concrete Instantiations

Features can represent both abstract ideas and their concrete examples (e.g., code with security vulnerabilities and discussions about security vulnerabilities). This includes features that are multilingual and multimodal, which indicates that the models can generalize concepts across different languages and modes of data.

  • Multilingual: Some features respond to the same concept across different languages.
  • Multimodal: Certain features respond to the same concept in both text and images.

Safety-Relevant Features

The presence of features related to safety concerns highlights the potential risks associated with advanced AI models. Further investigation is required to understand how these features impact real-world behavior.

The research highlighted several features of particular interest due to their potential implications for AI safety:

  • Security Vulnerabilities and Backdoors: Features related to security flaws in code.
  • Bias: Features capturing both overt slurs and subtle biases.
  • Lying, Deception, and Power-Seeking: Includes features related to deceptive behaviors and treacherous actions.
  • Sycophancy: Features indicating behaviors that involve excessive flattery.
  • Dangerous/Criminal Content: Features associated with producing harmful content, such as bioweapons.

Systematic Relationships and Scaling Laws

The study identified a systematic relationship between the frequency of concepts and the size of the dictionary needed to resolve them. This finding can help optimize the design of sparse autoencoders for different applications.

Methodology Overview (SAE Experiments)

The methodology described in this work provides a detailed framework for applying SAEs to large models, demonstrating their scalability and effectiveness. The training steps were optimized using scaling laws analysis to minimize training loss.

Training Objective

The training of an SAE involves minimizing a combination of reconstruction error and an L1 regularization penalty on the feature activations, promoting sparsity. Once trained, the SAE provides an approximate decomposition of the model’s activations into a linear combination of “feature directions” with coefficients equal to the feature activations. This sparsity ensures that for many inputs, only a small fraction of features will be active, making the model’s activations more interpretable.

In the experiments with Claude 3 Sonnet, SAEs were trained on residual stream activations at the middle layer of the model.

Three SAEs of varying sizes were trained:

  1. 1,048,576 Features (~1M)
  2. 4,194,304 Features (~4M)
  3. 33,554,432 Features (~34M)

Preprocessing

Normalization: Apply scalar normalization to model activations to ensure their average squared L2 norm matches the residual stream dimension, 𝐷.
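A minimal sketch of this normalization step, assuming the activations have already been gathered into a single tensor:

```python
# Scalar normalization: rescale activations so that the average squared L2 norm
# over the dataset equals the residual stream dimension D.
import torch

def normalize_activations(acts: torch.Tensor) -> torch.Tensor:
    """acts: (n_samples, D) residual-stream activations."""
    D = acts.shape[-1]
    mean_sq_norm = acts.pow(2).sum(dim=-1).mean()      # average ||x||^2
    scale = (D / mean_sq_norm).sqrt()                  # one scalar for the whole dataset
    return acts * scale

acts = 3.7 * torch.randn(10_000, 4096)                 # stand-in activations
normed = normalize_activations(acts)
print(normed.pow(2).sum(dim=-1).mean().item())         # ~4096.0
```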

Decomposition of Normalized Activations

The normalized activations are then decomposed using a specified number of features. The process involves adjusting the encoder and decoder weights to minimize the loss function, which combines the L2 penalty on reconstruction loss and an L1 penalty on feature activations.

Vector Decomposition: The SAE approximates each normalized activation vector $x$ as a sparse linear combination of learned feature directions:

$$\hat{x} = b^{dec} + \sum_{i=1}^{F} f_i(x)\, W^{dec}_{\cdot, i}$$

Encoder Output: Each feature activation is obtained from the model activations via a linear map followed by a ReLU nonlinearity:

$$f_i(x) = \mathrm{ReLU}\left(W^{enc}_{i, \cdot}\, x + b^{enc}_i\right)$$

Feature Vectors and Activations: The columns of the decoder weight matrix, $W^{dec}_{\cdot, i}$, are the “feature directions,” and the scalars $f_i(x)$ are the corresponding feature activations.

Loss Function

By normalizing the activations and applying a combination of L2 and L1 penalties, this methodology ensures that the learned features are interpretable and the model maintains a balance between reconstruction accuracy and feature sparsity.

Combination of Penalties: The training loss combines the squared (L2) reconstruction error with an L1 penalty on the feature activations, weighted by the norms of the corresponding decoder columns:

$$\mathcal{L} = \mathbb{E}_x\left[\, \lVert x - \hat{x} \rVert_2^2 + \lambda \sum_i f_i(x)\, \lVert W^{dec}_{\cdot, i} \rVert_2 \right]$$

Once the SAE is trained, it decomposes the model’s activations into a linear combination of “feature directions” with coefficients corresponding to feature activations. This makes the activations easier to interpret.
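Putting the pieces together, here is a hedged sketch of one training step implementing this objective: squared reconstruction error plus an L1 penalty on the feature activations, weighted by the decoder column norms. The sizes, learning rate, and sparsity coefficient are toy values, not the paper’s settings.

```python
# Sketch of one SAE training step: reconstruction error plus a sparsity penalty
# on feature activations, weighted by the norms of the decoder columns.
import torch
import torch.nn as nn

d_model, n_features, lam = 256, 2048, 5.0              # toy sizes and coefficient
W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
b_enc = nn.Parameter(torch.zeros(n_features))
W_dec = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
b_dec = nn.Parameter(torch.zeros(d_model))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-4)

def training_step(x: torch.Tensor) -> float:
    f = torch.relu(x @ W_enc.T + b_enc)                # feature activations (B, F)
    x_hat = f @ W_dec.T + b_dec                        # reconstruction (B, D)
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()      # L2 reconstruction term
    dec_norms = W_dec.norm(dim=0)                      # ||W_dec[:, i]|| per feature
    sparsity = (f * dec_norms).sum(dim=-1).mean()      # weighted L1 penalty
    loss = recon + lam * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x = torch.randn(32, d_model)                           # a batch of normalized activations
print(training_step(x))
```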

Feature Activation and Reconstruction

  • Active Features: For all three SAEs, the average number of active features on a given token was fewer than 300.
  • Variance Explained: The SAE reconstruction explained at least 65% of the variance of the model activations.
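Both quantities can be estimated roughly as sketched below; the paper’s exact definitions may differ slightly, so treat this as one plausible formulation with stand-in data.

```python
# Sketch: average number of active features per token (L0) and the fraction of
# variance in the model activations explained by the SAE reconstruction.
import torch

def l0_and_variance_explained(x, x_hat, f):
    """x, x_hat: (n_tokens, D) activations and reconstructions; f: (n_tokens, F)."""
    avg_active = (f > 0).float().sum(dim=-1).mean()        # mean nonzero features per token
    resid_var = (x - x_hat).pow(2).sum()
    total_var = (x - x.mean(dim=0)).pow(2).sum()
    var_explained = 1.0 - resid_var / total_var
    return avg_active.item(), var_explained.item()

# Toy usage with random stand-ins:
x = torch.randn(1000, 256)
x_hat = x + 0.5 * torch.randn_like(x)                      # imperfect reconstruction
f = torch.relu(torch.randn(1000, 2048) - 2.0)              # sparse-ish feature activations
print(l0_and_variance_explained(x, x_hat, f))
```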

Dead Features

Dead features are learned features that never activate (i.e., their activation values remain zero) over a substantial sample of input data.

The proportion of dead features increased with the size of the SAE, suggesting that improvements to the training procedure could reduce the number of dead features in future experiments.

Proportions of dead features in different SAEs:

  • 1M SAE: Approximately 2% dead features
  • 4M SAE: Approximately 35% dead features
  • 34M SAE: Approximately 65% dead features
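A simple way to estimate these proportions is to run a large sample of activations through the encoder and count the features that never fire. The weights and data below are random stand-ins, so the exact numbers are meaningless; only the counting logic matters.

```python
# Sketch: estimate the fraction of "dead" features -- features that never
# activate across a large sample of tokens.
import torch

def dead_feature_fraction(act_batches, W_enc, b_enc) -> float:
    n_features = W_enc.shape[0]
    ever_fired = torch.zeros(n_features, dtype=torch.bool)
    for x in act_batches:                                  # iterate over activation batches
        f = torch.relu(x @ W_enc.T + b_enc)                # (B, F) feature activations
        ever_fired |= (f > 0).any(dim=0)                   # mark features that fired at all
    return 1.0 - ever_fired.float().mean().item()

d_model, n_features = 256, 2048
W_enc = torch.randn(n_features, d_model) * 0.02
b_enc = -1.5 * torch.rand(n_features)                      # random negative biases: some features rarely fire
batches = [torch.randn(512, d_model) for _ in range(20)]
print(f"dead feature fraction: {dead_feature_fraction(batches, W_enc, b_enc):.2%}")
```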

Results & Analysis

The research demonstrates that sparse autoencoders can scale to larger models like Claude 3 Sonnet, producing high-quality, interpretable features, and provides a foundation for future research aimed at improving AI interpretability and safety.

The experiments provided valuable insights into how scaling impacts monosemanticity:

  1. Emergence of Monosemantic Neurons: Larger AI models tended to develop more monosemantic neurons, meaning more neurons specialized in responding to specific, single concepts.
  2. Improved Interpretability: With more monosemantic neurons, larger models became easier to understand, as each neuron had a clearer, specific role.
  3. Efficiency: Monosemantic neurons made the models more efficient by reducing overall complexity and enhancing processing capabilities.

Graphs and tables illustrated these findings, showing the relationship between model size and the number of monosemantic neurons.

What does this mean for safety?

The understanding of safety-relevant features is still in its early stages and expected to evolve rapidly, along with all things AI.

For future research, these are some key points to think about:

  • Activation of features: When and why do safety-relevant features activate?
  • Shortcomings and Methodological Cautions: Issues like messy feature splitting and divergence between expected and actual downstream effects.

Refer to “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”, Templeton et al. for an extensive list of questions and future research of interest.

Interpretability can act as a “test set for safety,” helping to ensure that models are safe during training and remain safe during deployment. Some observations made during the experiments gave the researchers reason for optimism.

  1. Despite being trained solely on activations from text, the Sparse Autoencoder (SAE) features successfully generalize to image activations, indicating strong performance even on off-distribution data.
  2. Features often respond both to abstract discussions of a concept and to concrete instances of it. This suggests that training on abstract discussions of safety concerns may help the model identify and understand concrete safety issues.

Some limitations, challenges, and open problems encountered are as follows:

  • Data Distribution: Current dictionary learning is performed on a text-only dataset.
  • Inability to Evaluate: The current objective (reconstruction accuracy and sparsity) is a proxy for interpretability.
  • Cross-Layer Superposition: Features may be represented across multiple layers, complicating interpretation.
  • Incomplete Feature Discovery: Likely far from discovering all features in Sonnet.
  • Other Barriers: Addressing challenges like attention superposition and interference weights.

Examples of Interpretable Features

The paper discusses the variety of features in depth; for more in-depth information, refer to “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (Templeton et al.).

The feature activation distributions show that higher activation levels, although less frequent, correspond to more specific and significant inputs. The density and conditional distribution plots indicate how a feature responds to varying degrees of relevance, with examples illustrating the text and images that activate the feature within the outlined intervals. We can also observe the remarkable generalization capabilities of SAE features, which bridge the gap between textual and visual data.

Their research includes a rubric the authors constructed to score how well a feature’s description relates to the text on which it fires.

  • 0 — The feature is completely irrelevant throughout the context (relative to the base distribution of the internet).
  • 1 — The feature is related to the context, but not near the highlighted text or only vaguely related.
  • 2 — The feature is only loosely related to the highlighted text or related to the context near the highlighted text.
  • 3 — The feature cleanly identifies the activating text.

I’ve gathered a few key examples of the distribution of feature activations to check out below:

Feature activation distributions for ‘The Golden Gate Bridge’ (F#34M/31164353)
Feature activation distributions for ‘Popular Tourist Attractions’ (F#1M/887839)
Feature activation distributions for ‘Brain Sciences’ (F#34M/9493533)
Feature activation distributions for ‘Transit Infrastructure’ (F#1M/3)

So let’s wrap it up.

This research on monosemanticity in AI models like Claude 3 Sonnet highlights how we can make these systems more understandable and safer. Using sparse autoencoders, the researchers were able to extract high-quality, interpretable features that reveal the inner workings of these complex models. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (Templeton et al.) demonstrates that sparse autoencoders can scale to larger models like Claude 3 Sonnet while continuing to produce high-quality, interpretable features. This not only enhances our understanding but also helps in identifying and mitigating potential risks associated with AI.

The study’s findings emphasize the importance of ongoing research to refine these methods and address the challenges of scaling and feature discovery. As we continue to explore and improve AI interpretability, we move closer to developing AI systems that are not only powerful but also transparent and safe for widespread use.
