Claude 3: Extracting Interpretable Features

Vishal Rajput · Published in AIGuys · 8 min read · Jun 10, 2024


Anyone who has worked with deep learning models knows that neural networks are black boxes by nature, and this opacity has long troubled researchers. Trust is essential for bringing technology into critical areas like medicine and autonomous driving. In life-and-death situations, we do not want to relinquish our freedom of choice to a machine; we still trust humans more. The reason is not that humans are better decision-makers, but that they can be held accountable and penalized when something goes wrong.

The ability to interpret and steer large language models is an important topic now that we encounter LLMs daily. Anthropic, one of the leaders in AI safety, takes one of its latest models, Claude 3 Sonnet, and explores the model's internal representations. Let's discover how certain features relate to different concepts in the real world. So, without further ado, let's dive into LLM interpretability.

Table of Contents

  • What is Mechanistic Interpretability?
  • What is Monosemanticity?
  • Sparse Autoencoders (SAE)
  • Assessing Feature Interpretability
  • Feature Neighborhood
  • Conclusion

I highly recommend reading the first three parts on Mechanistic Interpretability before starting this one.
