Claude 3: Extracting Interpretable Features

Vishal Rajput · Published in AIGuys · 8 min read · Jun 10, 2024


Anyone who has worked with deep learning models knows that neural networks are black boxes by nature, and this opacity has long troubled researchers. Trust is essential for bringing technology into critical areas like medicine and autonomous driving. In life-and-death situations, we do not want to relinquish our freedom of choice to a machine; we still trust humans more. The reason is not that humans are better decision-makers, but that they can be held accountable and penalized when something goes wrong.

The ability to interpret and steer large language models is an important topic now that we encounter LLMs daily. Anthropic, one of the leaders in AI safety, takes one of its latest models, Claude 3 Sonnet, and explores the model's internal representations. Let's discover how certain features relate to different concepts in the real world. So, without further ado, let's dive into LLM interpretability.

Table of Contents

  • What is Mechanistic Interpretability?
  • What is Monosemanticity?
  • Sparse Autoencoders (SAE)
  • Assessing Feature Interpretability
  • Feature Neighborhood
  • Conclusion

I highly recommend reading the first three parts on Mechanistic Interpretability before starting this one.
