Mechanistic Interpretability — first look

Stephen Jonany
2 min read · Mar 2, 2024

I was recently looking into mechanistic interpretability. Here are some initial thoughts from my readings.

What is MI? Explaining neural network behavior in terms of its internal mechanisms (connection weights and neuron activations). See Chris Olah’s intuitions.

Why is it cool? I’m personally interested in it because some of the lessons we learn along the way might be building blocks for understanding the human brain.

What’s a cool result so far? On GPT-2, Wang et al., 2022 used MI techniques to derive an interpretable algorithm that the network uses to solve an NLP task (indirect object identification). They argued that the algorithm is faulty, and showed that running adversarial samples does cause the network to produce the expected wrong results. See Neel Nanda’s Twitter summary.
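To give a feel for the task, here is a minimal sketch, entirely mine rather than code from the paper, assuming the HuggingFace transformers library and GPT-2 small: the model should prefer the indirect object (“ Mary”) over the repeated subject (“ John”).

```python
# A minimal sketch (mine, not code from Wang et al., 2022), assuming the
# HuggingFace transformers library: check which name GPT-2 prefers on an
# indirect-object-identification-style prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

for name in [" Mary", " John"]:
    token_id = tokenizer.encode(name)[0]
    print(f"{name!r}: {logits[token_id].item():.2f}")
# GPT-2 small assigns the higher logit to " Mary", the indirect object.
```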

My interest

What kinds of explanations do I want to see come out of this field? I don’t expect there to be just one kind of useful explanation. My guess is we will end up with a favorite explanation “shape” for each big cluster of use cases. Here are some sub-questions.

Explanation shapes. Chan et al., 2022 mentioned that the explanations we get might just be defeasible reasoning: “we expect that in the context of interpretability, we need to accept arguments that might be overturned by future arguments”. We also have many fuzzy terms in the field. It’s unclear to me which explanation shapes will eventually stick around.

Criteria for good explanations. What makes some explanations “better” than others? Wang et al., 2022 proposed faithfulness, completeness and minimality. Rauker et al., 2023 list more.

From techniques to inferences. What exactly can you conclude from applying MI techniques? E.g., if you find that intervention A on component B causes a performance degradation of C% on behavior D, then what can you conclude? It’s unclear to me what shapes of conclusions you can draw. Do we say there’s an “X% probability that B is relevant for behavior D”? How would you quantify “relevance”? … and so on. A concrete sketch of such an intervention follows the bullets below.

  • Relevant quote from Rauker et al., 2023: “Mistaking hypotheses for conclusions is a pervasive problem in the interpretability literature”
  • I like how causal scrubbing at least solves the “ad hoc techniques” problem, but there’s still a gap in solving the “what can I conclude after running this technique” problem.
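To make the A/B/C/D template above concrete, here is a hypothetical sketch, again mine and again assuming the HuggingFace transformers library: zero-ablate one attention layer of GPT-2 (the layer index is an arbitrary choice for illustration) and measure how much the logit difference on the indirect-object task drops.

```python
# A hypothetical sketch of "intervention A on component B degrades behavior D
# by C%": zero-ablate one attention layer in GPT-2 and measure the drop in the
# logit difference on the indirect-object prompt. Layer 9 is an arbitrary pick.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
mary_id = tokenizer.encode(" Mary")[0]
john_id = tokenizer.encode(" John")[0]

def logit_diff():
    # Behavior D: prefer the indirect object (" Mary") over the subject (" John").
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return (logits[mary_id] - logits[john_id]).item()

baseline = logit_diff()

# Intervention A on component B: zero the output projection of one attention
# layer, so that layer contributes nothing to the residual stream.
layer = 9
hook = model.transformer.h[layer].attn.c_proj.register_forward_hook(
    lambda module, args, output: torch.zeros_like(output)
)
ablated = logit_diff()
hook.remove()

print(f"logit diff: baseline={baseline:.2f}, ablated={ablated:.2f}")
# Even with the numbers in hand, the question above stands: what exactly does
# this drop license you to conclude about the layer's role in the behavior?
```

Zero-ablating a whole layer is much blunter than the head-level ablations used in the cited work, but it is enough to show where the inference question bites: the number alone does not tell you what role the component plays.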

If you want to learn more

Here are some of my favorite resources:

