Unmasking Generative AI: Understanding Explainability Techniques

Rosemary J Thomas, PhD
Published in Version 1
Aug 13, 2024

Generative AI models are often referred to as ‘black boxes’, and Explainable AI has long been a point of discussion in the traditional AI space. Have you wondered what Explainable AI means in the GenAI space? Does it help us understand and interpret the complex decision-making processes of these models?

Created using Microsoft Bing Image Creator

Let’s dig in. In the existing literature, there are four primary categories of GenAI explainability techniques. First, feature-based explainability techniques focus on the importance of individual features in the model’s predictions, highlighting the aspects of the data the model considers most meaningful. Second, sample-based explainability uses representative samples to explain the model’s behaviour. Third, the mechanistic approach examines the inner workings of the model itself, often creating simpler, interpretable models that approximate the behaviour of more complex ones. Lastly, probing-based explainability involves examining the model’s responses to various inputs, providing insights into its internal representations and decision-making processes. Each of these approaches offers a unique perspective, and using them together can lead to a broader understanding of GenAI models.

Feature-based

Feature attribution plays a vital role by providing a relevance score for each input feature, such as a word or pixel. This helps in understanding which features are most influential in the model’s decision-making process. Another approach is perturbation, which partially changes inputs, for example by removing or altering features, and then examines the output. This can provide insights into how sensitive the model’s predictions are to changes in specific features. The gradient-based method is illustrated by techniques like Grad-CAM (Gradient-weighted Class Activation Mapping), which requires a backward pass from outputs to inputs to obtain derivatives, offering a way to visualise the contribution of each feature.
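To make the gradient-based idea concrete, here is a minimal sketch of an input-times-gradient attribution for a text classifier. The model name, the example sentence and the exact attribution recipe are assumptions chosen for illustration, not a prescribed implementation:

```python
# Minimal input-x-gradient attribution sketch (assumed model and sentence).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

text = "The service was slow but the food was excellent."
inputs = tokenizer(text, return_tensors="pt")

# Embed the tokens ourselves so we can take gradients with respect to the embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"])
predicted = outputs.logits.argmax(dim=-1).item()
outputs.logits[0, predicted].backward()  # backward pass from the predicted class score

# Input x gradient: one relevance score per token.
scores = (embeddings.grad * embeddings).sum(dim=-1).squeeze(0)
for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores.tolist()):
    print(f"{token:>12s}  {score:+.4f}")
```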

The surrogate model approach uses much simpler models, such as LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), EAC (Explain Any Concept), SAM and SAM 2 (Segment Anything Model) and others, to understand individual predictions. These models approximate the complex model’s behaviour, providing a more interpretable explanation. The decomposition approach, including techniques like LRP (Layer-wise Relevance Propagation), focuses on attributing relevance from outputs back towards inputs, or on decomposing vectors. It can also refer to decomposing the reasoning process and attributing outputs to specific reasons. Lastly, the attention-based approach provides an importance score for inputs, where the ‘inputs’ are not necessarily the inputs to the network but may be those of a prior layer. The scores are commonly computed between all input token pairs for a single attention layer and can be visualised using a heatmap or a bipartite graph. This approach is particularly useful in the context of neural networks, where it can provide insights into the model’s internal representations.
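To illustrate the surrogate idea, below is a small, self-contained LIME-style sketch: the input is perturbed by dropping words, a hypothetical black box (`black_box_predict`, invented here purely as a stand-in) scores each perturbation, and a weighted linear model is fitted as the local surrogate:

```python
# LIME-style local surrogate, sketched from scratch with scikit-learn.
import numpy as np
from sklearn.linear_model import Ridge

def black_box_predict(texts):
    # Hypothetical opaque model: pretend positive sentiment hinges on "excellent".
    return np.array([0.9 if "excellent" in t else 0.2 for t in texts])

text = "the food was excellent but the service was slow"
words = text.split()
rng = np.random.default_rng(0)

# Perturb the input: randomly drop words and record which ones were kept.
n_samples = 500
masks = rng.integers(0, 2, size=(n_samples, len(words)))
masks[0, :] = 1  # keep the original instance in the sample set
perturbed = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
preds = black_box_predict(perturbed)

# Weight samples by similarity to the original (more words kept -> higher weight).
weights = masks.mean(axis=1)

# Fit a simple, interpretable linear surrogate on the binary keep/drop features.
surrogate = Ridge(alpha=1.0)
surrogate.fit(masks, preds, sample_weight=weights)

for word, coef in sorted(zip(words, surrogate.coef_), key=lambda x: -abs(x[1])):
    print(f"{word:>10s}  {coef:+.3f}")
```

The surrogate’s coefficients give a local, human-readable account of which words push the black box towards its prediction for this one instance.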

Sample-based

Sample-based techniques examine how the outputs change for different inputs. The focus is on the sample as a whole, to understand the relationship between various inputs and their corresponding outputs.

Two key techniques in this category are the influence of training data and adversarial samples. The former measures the impact of a specific training sample on the model. The latter involves input changes that are small and imperceptible to people, yet lead to a change in the outputs.
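As a concrete illustration of adversarial samples, here is a hedged sketch of the classic fast gradient sign method (FGSM) on an image classifier; the torchvision model, the random stand-in image and the epsilon value are assumptions, and with a random input the prediction is not guaranteed to flip:

```python
# FGSM adversarial-sample sketch (assumed model, random stand-in image).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224)   # stand-in for a real, preprocessed image
image.requires_grad_(True)

logits = model(image)
label = logits.argmax(dim=-1)        # treat the current prediction as the target label

# Perturb the input in the direction that most increases the loss.
loss = F.cross_entropy(logits, label)
loss.backward()
epsilon = 0.03
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("original prediction:   ", label.item())
with torch.no_grad():
    print("adversarial prediction:", model(adversarial).argmax(dim=-1).item())
```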

The other two techniques are counterfactual explanations and contrastive explanations. Counterfactual explanations identify the smallest changes to an input that flip the model’s output from one class to another specific class. The classic example is a loan approval model: a counterfactual explanation might show that increasing the applicant’s income by a certain amount would change the model’s decision from ‘deny’ to ‘approve’.
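A toy version of that loan example might look like the sketch below, where a stand-in model is trained on synthetic data and the applicant’s income is nudged upward until the decision flips; the features, thresholds and greedy search are all invented for illustration:

```python
# Toy counterfactual search for a synthetic loan-approval model.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a stand-in loan model on synthetic data: features are [income (k), debt (k)].
rng = np.random.default_rng(0)
X = rng.uniform([20, 0], [120, 60], size=(500, 2))
y = (X[:, 0] - X[:, 1] > 40).astype(int)          # approve if income comfortably exceeds debt
model = LogisticRegression().fit(X, y)

applicant = np.array([[45.0, 30.0]])
print("initial decision:", model.predict(applicant)[0])   # expected: 0 (deny)

# Greedy counterfactual: nudge income upward until the prediction flips.
counterfactual = applicant.copy()
while model.predict(counterfactual)[0] == 0:
    counterfactual[0, 0] += 1.0                   # +1k income per step

print(f"deny -> approve if income rises from {applicant[0, 0]:.0f}k "
      f"to {counterfactual[0, 0]:.0f}k")
```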

Contrastive explanations, in contrast, explore why a model made a certain prediction ‘A’ instead of another prediction ‘B’. These techniques provide a deeper understanding of the model’s decision-making process. For example, in a text generation model, a contrastive explanation might explore why the model chose to output one word over another, considering aspects such as part of speech, tense and semantics.
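A minimal starting point for that kind of contrastive question is simply to compare the model’s scores for the two candidate words. The sketch below does this for GPT-2, which is an assumed example model, as are the prompt and the candidate tokens:

```python
# Compare the model's next-token scores for two candidate words ("was" vs "were").
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The committee"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # scores for the next token

for candidate in [" was", " were"]:
    token_id = tokenizer.encode(candidate)[0]
    print(f"{candidate!r}: logit = {logits[token_id].item():.2f}")
```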

Mechanistic approach

The mechanistic approach is a method that studies neurons and their interconnections within a model. Primarily, it reverse-engineers the components of a model into algorithms that are understandable to people. In other words, it provides a deeper understanding of the inner workings of AI models, making them more transparent and trustworthy.

The first technique in this approach is circuit discovery. Typically, this is a manual workflow that includes creating the model, defining metrics and building the dataset. The goal is to discover how different parts of the model, or circuits, interact and contribute to the final output.

The second is causal tracing, which calculates the impact of intermediate activations on the output. In other words, it traces the effect of certain neurons or groups of neurons in the network on the final prediction, helping to understand which parts of the model are most influential in decision-making.
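In the spirit of causal tracing, the rough activation-patching sketch below caches one layer’s hidden state from a ‘clean’ prompt, patches it into a run on a ‘corrupted’ prompt, and compares the output logits. GPT-2, the prompts, the layer index and the patched position are all illustrative assumptions:

```python
# Rough activation-patching sketch (assumed model, prompts and layer choice).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = model.transformer.h[6]                       # an arbitrary middle block

clean = tokenizer("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tokenizer("The Colosseum is in the city of", return_tensors="pt")
paris = tokenizer.encode(" Paris")[0]

# 1) Cache the clean run's hidden state at the chosen layer.
cache = {}
def save_hook(module, inputs, output):
    cache["clean"] = output[0].detach()
handle = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Re-run the corrupted prompt, overwriting that layer's last-position state.
def patch_hook(module, inputs, output):
    output[0][:, -1, :] = cache["clean"][:, -1, :]
    return output
handle = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits[0, -1]
handle.remove()

with torch.no_grad():
    corrupt_logits = model(**corrupt).logits[0, -1]

print("logit for ' Paris' without patch:", corrupt_logits[paris].item())
print("logit for ' Paris' with patch:   ", patched_logits[paris].item())
```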

Lastly, the vocabulary lens technique relates internal representations to the vocabulary space. This technique focuses on understanding how the model processes and uses different words or tokens. By examining these relationships, we can gain insights into how the model understands and generates language.
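A hedged sketch of this idea, in the style of the well-known ‘logit lens’, projects each intermediate hidden state through GPT-2’s final layer norm and unembedding matrix to see which token the model is leaning towards at each depth; the model and prompt are assumptions:

```python
# Logit-lens style sketch: read intermediate states through the vocabulary space.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

for depth, hidden in enumerate(outputs.hidden_states):
    last = hidden[0, -1]                              # hidden state at the final position
    with torch.no_grad():
        logits = model.lm_head(model.transformer.ln_f(last))
    top = tokenizer.decode(logits.argmax().item())
    print(f"layer {depth:2d} -> {top!r}")
```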

Probing-based

Probing-based methods help us understand the knowledge that a Large Language Model (LLM) has captured, by querying it. A common approach is to train a classifier, often referred to as a probe, on the model’s activations to distinguish different types of inputs and outputs.
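As a small illustration, the sketch below trains a logistic-regression probe on GPT-2 hidden states to predict a toy sentiment label; the example sentences and labels are invented, and a real probe would of course be evaluated on held-out data rather than its own training set:

```python
# Toy probe: a logistic-regression classifier on GPT-2 hidden states.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

texts = ["I loved this film", "An absolute delight", "Truly wonderful acting",
         "I hated this film", "A complete disaster", "Truly awful acting"]
labels = [1, 1, 1, 0, 0, 0]                           # invented toy sentiment labels

def last_hidden_state(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].numpy()       # final layer, final token

X = [last_hidden_state(t) for t in texts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe accuracy on its own training set:", probe.score(X, labels))
```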

There are a few types of probing methods. First, knowledge-based probing involves training a classifier on the model’s outputs with the goal of identifying the presence of output properties or abilities that emerge from the inputs. For example, if the LLM is trained to translate languages, a knowledge-based probe might look for the presence of correct grammar in the translated output.

Concept-based probing provides relevance scores for a given set of concepts within the inputs. This method must be designed carefully, as it primarily examines interactions among input variables.

Lastly, neuron activation-based probing seeks to understand neurons using their activations for inputs. This method provides insights into how individual neurons contribute to the model’s overall performance.

Each of these methods offers a unique perspective on the inner workings of an LLM.

Click here to learn more about Version 1’s AI Webinar Series.

About the Author:

Rosemary J Thomas, PhD, is a Senior Technical Researcher at the Version 1 AI Labs. You can read more of her publications here.
