# Disentangling Neuron Representations with Concept Vectors — Key Takeaways

In this post, we present the key takeaways from our XAI4CV CVPR 2023 Workshop paper (CVF open access version) titled “Disentangling Neuron Representations with Concept Vectors” (arXiv preprint). The code is available on GitHub.

Breaking down the model into interpretable units allows us to better understand how models store representations. However, the occurrence of polysemantic neurons, or neurons that respond to multiple unrelated features [2], makes interpreting individual neurons challenging. This has led to the search for meaningful directions, known as concept vectors, in activation space instead of looking at individual neurons.

We present a method that disentangles polysemantic neurons into concept vectors: linear combinations of neurons that each encapsulate a distinct feature.

We consider a trained convolutional neural network. As summarised in the figure, for a given neuron in hidden layer *l* we cluster the layer-*l* activations of the images that maximally activate that neuron. The distance between images within the same cluster is lower than the distance between images in different clusters, as shown below. From these clusters, we calculate concept vectors that point in directions in activation space that are not aligned with a single neuron.
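The clustering step can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration rather than the paper's implementation: the activations are random vectors instead of real network activations, the clustering is a plain average-linkage agglomerative procedure stopped at a hypothetical `distance_threshold`, and each concept vector is taken to be the normalised mean of its cluster.

```python
import numpy as np

def agglomerative_clusters(acts, distance_threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(acts))]
    # Pairwise Euclidean distances between all activation vectors.
    dists = np.linalg.norm(acts[:, None, :] - acts[None, :, :], axis=-1)
    while len(clusters) > 1:
        best_a, best_b, best_d = 0, 1, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between members of the two clusters.
                d = dists[np.ix_(clusters[a], clusters[b])].mean()
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        if best_d > distance_threshold:
            break  # the closest pair is too far apart to merge
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return clusters

def concept_vectors(acts, clusters):
    """Assumed definition: unit-norm mean activation of each cluster."""
    means = np.stack([acts[idx].mean(axis=0) for idx in clusters])
    return means / np.linalg.norm(means, axis=1, keepdims=True)

# Synthetic stand-in for a polysemantic neuron: two unrelated features
# both load on neuron 3 but differ along other axes.
rng = np.random.default_rng(0)
dim, neuron = 8, 3
feature_a = np.zeros(dim)
feature_a[[0, neuron]] = [2.0, 1.0]
feature_b = np.zeros(dim)
feature_b[[5, neuron]] = [2.0, 1.0]
acts = np.vstack([
    feature_a + 0.05 * rng.standard_normal((20, dim)),
    feature_b + 0.05 * rng.standard_normal((20, dim)),
])

fine = agglomerative_clusters(acts, distance_threshold=1.0)
coarse = agglomerative_clusters(acts, distance_threshold=5.0)
vecs = concept_vectors(acts, fine)
print(len(fine), len(coarse), vecs.shape)
```

In this toy setting the threshold controls granularity: a small value keeps the two features apart, while a large value merges them into one cluster. The resulting vectors mix several axes rather than pointing along neuron 3 alone.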

These directions now encode clean features corresponding to the multiple features originally entangled in the polysemantic neuron: the images with maximal projection along each concept vector cleanly represent one of the originally entangled features. We also applied feature visualisation [1] to visualise what kind of image activates highly along the concept vectors.

Below is a UMAP [4] depiction of the latent space activations of the maximally activating images kept after clustering, for a few neurons from layer *Mixed 7* of the Inception V3 convolutional neural network. We see separate clusters for the running example of polysemantic neuron 35, which activates for both apples and sports. Applying our method yields two distinct clusters, one for apples and one for sports, as also illustrated above. For neuron 16, which activates highly for elliptical shapes, we get a single cluster and a single concept vector.

Neuron 1 activates highly for underwater images. On closer inspection, the single category of underwater images that maximally activate it can be broken down into two subcategories: scuba divers and general underwater images such as coral. In this case the method yields two clusters. However, by changing a hyperparameter of the method, we can obtain more coarse-grained concepts, which leads to one concept here. This demonstrates how the method can be tuned to produce finer- or coarser-grained concepts as desired. It also gives insight into how the model encodes related or multifaceted concepts.

When we compared the elements of a cluster with their corresponding concept vector and with the neuron direction of interest, we found that the cluster had a much higher projection along the concept vector direction than along the neuron direction. The cosine similarities likewise show that the latent space representations are much more similar to the discovered concept vectors than to the neuron directions.
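This comparison can be sketched in a few lines on synthetic data. The cluster below is a made-up set of activations, not the paper's data, and the concept vector is assumed to be the normalised cluster mean. The helper names `mean_projection` and `mean_cosine` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, neuron = 8, 3

# One-hot direction for the neuron of interest.
neuron_dir = np.zeros(dim)
neuron_dir[neuron] = 1.0

# Synthetic cluster: a feature that loads on the neuron and on another axis.
feature_dir = np.zeros(dim)
feature_dir[[0, neuron]] = [2.0, 1.0]
cluster = feature_dir + 0.05 * rng.standard_normal((20, dim))

# Assumed concept vector: unit-norm mean of the cluster's activations.
concept_vec = cluster.mean(axis=0)
concept_vec /= np.linalg.norm(concept_vec)

def mean_projection(acts, direction):
    """Average scalar projection of each activation onto a unit direction."""
    return float((acts @ direction).mean())

def mean_cosine(acts, direction):
    """Average cosine similarity between each activation and a direction."""
    norms = np.linalg.norm(acts, axis=1) * np.linalg.norm(direction)
    return float(((acts @ direction) / norms).mean())

proj_concept = mean_projection(cluster, concept_vec)
proj_neuron = mean_projection(cluster, neuron_dir)
cos_concept = mean_cosine(cluster, concept_vec)
cos_neuron = mean_cosine(cluster, neuron_dir)
print(proj_concept, proj_neuron, cos_concept, cos_neuron)
```

In this toy case the neuron axis captures only part of the cluster's activation, so both the projection and the cosine similarity are larger for the concept vector than for the neuron direction.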

In our analysis, we found that monosemantic regions exist in activation space, and that features are not axis-aligned. Our results suggest that exploring directions, rather than individual neurons, may lead us toward finding coherent fundamental units. We hope this work helps bridge the gap between understanding the fundamental units of models, an important goal of mechanistic interpretability, and concept discovery.

# References

[1] Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. *Distill*, *2*(11), e7.

[2] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., … & Olah, C. (2022). Toy Models of Superposition. *arXiv preprint arXiv:2209.10652*.

[3] Lim Swee Kiat. (2021). Lucent: Lucid library adapted for PyTorch.

[4] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. *arXiv preprint arXiv:1802.03426*.