Feedzai Techblog

Welcome to Feedzai Techblog, a compilation of tales on how we fight villainous villains through data science, AI and engineering.

CAUSAL CONCEPT-BASED EXPLANATIONS

11 min read · Oct 3, 2025



Introduction

Over the years, we have evolved from using simple, often rule-based algorithms to sophisticated machine learning models. These models are incredibly good at finding patterns in large datasets, but, due to their complexity, it is frequently challenging for a human to understand why a certain input leads to its respective output. This is especially problematic in areas where high-stakes decisions are being made and where human-AI collaboration is critical.

This is why model explainability has gained traction in recent years. The aim of explainability methods is to shed light on what properties of the data contribute to a machine learning model’s output.

Ideally, such explanations should be similar to how a human would explain their decisions to a peer. In particular, humans make use of high-level concepts that are easy for other people to grasp. For example, a human would explain, “there was a sudden burst of transactions in a short time” as opposed to, “the average delta time feature is much smaller in a recent short time-window compared to a long time-window, and the count of transactions feature increased substantially in the same recent time-window.”

Moreover, when explaining choices and decisions, humans are free to reflect on these decisions by reasoning about hypothetical alternatives. For example, we could ask, “if this email address were considered suspicious, would I still trust this request?” Current explainability methods fail to incorporate both properties: the explanations are often much more complex than the typical concepts humans would use, and they don’t allow for this kind of reasoning. These limitations impede the usefulness of current explainability methods in many practical applications.

In this blog post, we will describe a new explainability method that addresses these issues. We’ll try to refrain from discussing too many technical details and focus on the intuition behind the method, but readers interested in the technical justifications are referred to our publication of this work at the 2024 CLeaR conference.

Background

Feature-based explanations

The vast majority of explainability methods can be grouped under the “feature attribution” umbrella. Feature attribution techniques assign an importance score to each feature in the input, where a higher score typically reflects a feature having a large contribution to the model output. In this way, human analysts can gain insight into the model decision-making process. We previously wrote about the application of these methods at Feedzai in an earlier blog post.

However, sometimes the feature attributions themselves are not very helpful. Especially when the number of features is large and there is little knowledge of how they relate to higher-level concepts, attributions may be spread over many features, and it becomes difficult for a human to meaningfully understand and correct model decisions.

A motivating example

Consider a hypothetical machine learning model tasked with predicting the probability of a person having a heart attack in the following year. The model considers many features, such as blood exam data, tracking data from a health app, etc. An explainability method based on feature attributions, such as SHAP (Lundberg and Lee, NeurIPS, 2017), would produce results like those shown in Figure 1. In this example, LDL cholesterol, BMI, and blood glucose levels are seen as the most important features for the model, but many other features contribute with slightly lower scores. The more features there are, the less clear it is for a human how to interpret their impact.

Figure 1. Example of global feature-based explanations. In this dummy example, the feature importance is given by the mean absolute SHAP value, where larger values signify more important features. While these types of explanations give some insight into which features are contributing most to model scores, they do not provide easy human interpretation when the number of features is large and when features themselves are more abstract. Moreover, the interactions between the features are not apparent, making reasoning impossible.
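To make this concrete, here is a minimal sketch of how such a global ranking could be computed with the shap package on a toy tabular model. The feature names, data, and model are made up purely for illustration, not taken from the example above.

```python
# Minimal sketch: global feature importance as mean absolute SHAP values,
# assuming a tree-based model and the `shap` package. All names/data are illustrative.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical tabular health data: each column is one low-level feature.
feature_names = ["ldl_cholesterol", "bmi", "blood_glucose", "resting_hr", "daily_steps"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
y = (X["ldl_cholesterol"] + 0.5 * X["bmi"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# SHAP values per instance and feature; averaging their absolute values
# gives the kind of global ranking shown in Figure 1.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
global_importance = pd.Series(np.abs(shap_values).mean(axis=0), index=feature_names)
print(global_importance.sort_values(ascending=False))
```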

Especially when human-AI collaboration is time sensitive, it is important to provide more interpretable explanations. One step in that direction is to offer explanations based on human-understandable, high-level concepts instead of on individual features.

Concept-based explanations

As discussed in the previous section, explainability methods that assign importance to individual features can be opaque themselves. What if, instead of basing explanations on features, we directly assign attributions to higher-level concepts? In other words, we are interested in developing a method that assigns importance to human-defined concepts, where a higher importance indicates a stronger contribution to the model output. We therefore move from feature-based explanations to concept-based explanations.

While concept-based explanations address some of the issues with feature-based methods, new problems emerge. First of all, we need to define the concepts, which is typically a task for domain experts but may eventually be automated as concept extraction methods become more powerful. Secondly, we need to connect the inputs, concepts, and model outputs in our explainer.

One recent approach achieves this by designing a specific neural network architecture, known as the Concept Bottleneck Model (CBM, Koh et al, ICML, 2020). The CBM essentially connects the inputs to an intermediate layer where each unit is forced to represent a concept (the bottleneck layer), from which connections flow to the outputs. Since information has to flow through the bottleneck layer representing the concepts by construction, one can simply read out the activation of each unit of this layer to know how much a concept is ‘active’ for a certain input. Returning to our hypothetical example of the model to predict cardiac arrest, such a CBM could be schematically depicted as in Figure 2. In this network, the bottleneck layer represents concepts such as “smoking,” “drinking,” etc.

Figure 2. Representation of a Concept Bottleneck Model. These types of neural networks have an intermediate layer where each neuron encodes a specific concept. The networks are then trained using two objectives simultaneously, namely learning the correct concept(s) present at each instance and learning to classify the instances correctly from the concept layer.
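For readers who prefer code, a minimal PyTorch sketch of this kind of architecture might look as follows. The layer sizes, loss weighting, and data are illustrative and not the exact setup from the CBM paper.

```python
# Minimal sketch of a Concept Bottleneck Model in PyTorch (names and sizes are illustrative).
# Inputs are mapped to a bottleneck whose units are trained to match concept labels,
# and the task prediction is made from the bottleneck alone.
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    def __init__(self, n_features: int, n_concepts: int):
        super().__init__()
        self.input_to_concepts = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_concepts)
        )
        self.concepts_to_output = nn.Linear(n_concepts, 1)

    def forward(self, x):
        concept_logits = self.input_to_concepts(x)          # bottleneck layer
        y_logit = self.concepts_to_output(torch.sigmoid(concept_logits))
        return concept_logits, y_logit

# Joint training objective: concept loss + task loss. Their relative weight
# controls the accuracy trade-off discussed below.
model = ConceptBottleneckModel(n_features=20, n_concepts=6)
x = torch.randn(32, 20)
concept_labels = torch.randint(0, 2, (32, 6)).float()
task_labels = torch.randint(0, 2, (32, 1)).float()

concept_logits, y_logit = model(x)
loss = nn.BCEWithLogitsLoss()(concept_logits, concept_labels) \
     + nn.BCEWithLogitsLoss()(y_logit, task_labels)
loss.backward()
```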

However, there are still issues with this method. Firstly, since the concepts may not be perfect predictors of the model’s objective, there is typically a trade-off between the accuracy on the original task and the accuracy on the concept predictions. In other words, a CBM that is better at predicting concept labels will be worse at predicting cardiac arrest, and vice versa.

Secondly, because the concepts may not be independent of each other, it is not possible to reason about alternative outcomes by simply “intervening” in the bottleneck layer. For example, we may be tempted to activate the “Exercise” neuron maximally in the bottleneck layer to answer the question, “How would my likelihood of cardiac arrest be impacted if I exercised more?” Doing so, however, fails to take into account that exercise may also affect the “Weight” and “Cholesterol” concepts.

To address the first problem, we will next explain what we mean by post-hoc explanations. After that, we will turn to the causality aspect, which addresses the second.

Post-hoc explanations

As we saw in the previous section, the CBM learns two tasks simultaneously: predicting concepts from inputs and predicting outputs from concepts.

This typically incurs a trade-off in performance between the two tasks. Instead of building explanations into the model itself, one can develop post-hoc (Latin for “after the event”) methods, which are applied after the model is trained. In other words, we have a first model that learns the mapping from inputs to outputs, and after that model is trained, we construct a second model to explain the first.

In this way, we do not interfere with the performance of the first model. We typically do not require any specific knowledge of how the first model is constructed, as long as we can pass it an input and receive the respective output. The first model is therefore usually referred to as the black-box model.

One way to achieve this is to train the explainability model to mimic the black-box model’s input-output relationship. We can then impose any constraint on the explainability model without affecting the performance of the original black-box model. The technique of using a new model to learn the behavior of a black-box model is known as “model distillation.” Importantly, to distill a black-box model, we only need access to input-output pairs; no further details of the black box are needed. The explainer is therefore free to use any other algorithm or architecture.
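As a rough illustration, a distillation loop only needs to query the black box for scores. Everything below (the stand-in black_box_predict function, the explainer architecture, and the loss) is a hypothetical minimal sketch, not the actual setup used in our work.

```python
# Minimal sketch of model distillation: the explainer ("student") only needs
# input-output pairs from the black box ("teacher"), queried via black_box_predict.
import torch
import torch.nn as nn

def black_box_predict(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for the opaque model; in practice this could be any trained
    # model we can only query for scores.
    return torch.sigmoid(x.sum(dim=1, keepdim=True))

explainer = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(explainer.parameters(), lr=1e-3)

for _ in range(100):
    x = torch.randn(64, 10)                 # sample inputs (or draw from the training data)
    with torch.no_grad():
        y_black_box = black_box_predict(x)  # teacher scores
    y_explainer = explainer(x)              # student scores
    loss = nn.functional.mse_loss(y_explainer, y_black_box)  # match the black box
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```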

Now that you have learned about post hoc methods and model distillation, let’s look into why causality is important when we want to reason about our model predictions.

Causal explanations

As explored in the previous sections, post-hoc, concept-based explanations can make models more interpretable without sacrificing their performance. However, we’d like to add one more property to our explainer: the ability to reason about alternative situations.

For example, a model could predict that a patient has a high probability of a cardiac arrest because of high cholesterol, increased weight, and high alcohol consumption. We could ask the question: what would the model predict if we lowered the alcohol consumption data? In a CBM, all concepts are assumed to be independent; hence, if we manually lower the alcohol consumption concept, we obtain an output that ignores the interactions between alcohol consumption and other concepts such as weight. This type of reasoning about alternative situations is called “counterfactual reasoning.”

The solution to this is to learn a Structural Causal Model (SCM) of the concepts, which learns how much each concept causally depends on another. In the language of SCMs, a causal dependency of B on A is denoted as

A → B

When we connect multiple concepts through their causal dependencies, the resulting graph must satisfy two properties. First, since cause and effect are directed, every edge between two dependent concepts points in a single direction, from cause to effect. Second, since a concept can never be a cause of itself (even through intermediate concepts), following the edge directions one can never return to the starting point. The first property makes the graph “directed,” and the second makes it “acyclic” (it contains no cycles). Hence, an SCM has a Directed Acyclic Graph (DAG) at its heart.
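As a small illustration, the concept dependencies of our running example could be encoded and checked as a DAG with networkx. The edges below are made up for this example, not a graph taken from our work.

```python
# Minimal sketch: encoding illustrative concept dependencies as a directed graph
# and verifying the two properties (directed, acyclic) with networkx.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("Drinking", "Weight"),
    ("Drinking", "Cholesterol"),
    ("Exercise", "Weight"),
    ("Exercise", "Cholesterol"),
    ("Weight", "Cardiac arrest"),
    ("Cholesterol", "Cardiac arrest"),
])

assert nx.is_directed_acyclic_graph(dag)  # acyclic: no path returns to its starting concept
print(list(nx.topological_sort(dag)))     # a valid causal ordering of the concepts
```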

Given the DAG encoding the causal dependencies between concepts, we can train a model to learn the complete SCM, extending the DAG with concept attributions (how much each concept is present for each input) and attributions for the connections (how strongly the concepts depend on each other). Without going into too many details, the training involves casting the SCM into neural-network form, after which these attributions are learned from a training dataset. The SCM then allows us to perform actual counterfactual reasoning. Returning to our example above, let’s consider the SCM in Figure 3.

Figure 3. Example of an SCM-based explanation. In this dummy example, the instance explanation is given in the form of an SCM. Each concept represents a node in the SCM, and causal relations are encoded via the directed edges. A higher likelihood of a concept being present is denoted by a darker color of the respective node, while a positive or negative causal relation is denoted by a blue or red directed edge, respectively.

Here, one can see how, for example, alcohol consumption causes both higher cholesterol and higher weight. If we want to test the counterfactual statement “what if we lower our alcohol consumption,” we would manually lower the concept attribution for “drinking,” which in turn would affect the “weight” and “cholesterol” concepts before changing our prediction of “cardiac arrest.” This reasoning, which takes into account the dependencies between concepts, is completely lost in the previously described bottleneck models.
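To see how such a change propagates, consider a toy linear SCM over these concepts. The coefficients below are invented purely for illustration; in practice they would be learned from data, as described above.

```python
# Toy linear SCM over the concepts of Figure 3 (coefficients are made up).
# Changing the "drinking" concept propagates through weight and cholesterol
# before affecting the predicted risk.
def predict_risk(drinking: float) -> float:
    weight = 0.6 * drinking                             # drinking raises weight
    cholesterol = 0.7 * drinking + 0.3 * weight         # drinking (and weight) raise cholesterol
    cardiac_arrest = 0.5 * weight + 0.5 * cholesterol   # risk driven by downstream concepts
    return cardiac_arrest

factual = predict_risk(drinking=0.9)         # observed drinking level
counterfactual = predict_risk(drinking=0.2)  # "what if we lowered alcohol consumption?"
print(factual, counterfactual)               # the intervention flows through the graph
```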

DiConStruct Explainer

In the previous sections, we discussed the properties that an ideal explainer should possess. We would like it to be:

  1. concept-based, to provide better interpretability to humans;
  2. post-hoc, to not affect the black-box model performance. As discussed, we will use model distillation to train our explainer;
  3. causal, to allow us to reason about counterfactual situations (e.g., alternatives to what we observed). This will take the form of a Structural Causal Model (SCM) within our explainer.

From these three properties, it also becomes clear why our method is named DiConStruct (Distillation, Concept-Based, Structural Causal Model).

On a high level, the DiConStruct explainer is organized as depicted in Figure 4.

Figure 4. Schematic representation of the DiConStruct method. The required inputs are a DAG containing the concepts and an instance to be explained (left). Then, the black-box model is used to produce the output score Yᵦ. The exogenous model is tasked with predicting concept-specific weights, which are then used with the DAG to produce the causal graph for that instance. Finally, given the causal graph, one can extract concept attributions representing the importance of each of the concepts in producing the score Ŷᵦ.

The inputs are the data (X) and a DAG of the concepts (which we assume to be created by domain experts and/or extracted from data using specific methods). The DiConStruct explainer then contains two components. The first component, the Exogenous Model, is a neural network that was trained to predict concept-specific weights from the inputs. These weights encode the extent to which certain concepts are present, without considering the causal contributions from other concepts.

They are then combined with the DAG into the second component, the Concept Distillation SCM, to incorporate the causal interactions between concepts. Given the SCM, we can extract causal explanations in the form of a causal graph and concept attributions. It is important to note that the mapping from inputs to concepts is learned in a supervised manner, meaning that data annotated with concept labels is necessary.

Finally, the inputs are also fed to the black-box model, giving rise to its output (Yᵦ). The SCM in our DiConStruct explainer also contains a prediction Ŷᵦ, which is trained to mimic the black-box output Yᵦ.
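To give a flavor of how these pieces fit together, here is a highly simplified sketch with two concepts and a hard-coded DAG (concept A → concept B → Ŷ). The module names, sizes, and structural equations are illustrative and not the actual DiConStruct implementation; the real explainer learns the edge weights from data annotated with concept labels and the black-box scores.

```python
# Highly simplified sketch of the two DiConStruct components described above.
import torch
import torch.nn as nn

class ExogenousModel(nn.Module):
    """Predicts a per-concept exogenous weight from the input instance."""
    def __init__(self, n_features: int, n_concepts: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                 nn.Linear(32, n_concepts))
    def forward(self, x):
        return self.net(x)

class ConceptDistillationSCM(nn.Module):
    """Combines exogenous weights with a fixed toy DAG: A -> B, A -> y_hat, B -> y_hat."""
    def __init__(self):
        super().__init__()
        self.edge_ab = nn.Parameter(torch.zeros(1))  # causal weight concept A -> concept B
        self.edge_ay = nn.Parameter(torch.zeros(1))  # concept A -> y_hat
        self.edge_by = nn.Parameter(torch.zeros(1))  # concept B -> y_hat
    def forward(self, exogenous):
        concept_a = torch.sigmoid(exogenous[:, 0:1])
        concept_b = torch.sigmoid(exogenous[:, 1:2] + self.edge_ab * concept_a)
        y_hat = torch.sigmoid(self.edge_ay * concept_a + self.edge_by * concept_b)
        return concept_a, concept_b, y_hat  # y_hat would be trained to mimic the black box

exogenous_model = ExogenousModel(n_features=20, n_concepts=2)
scm = ConceptDistillationSCM()
x = torch.randn(8, 20)
concept_a, concept_b, y_hat = scm(exogenous_model(x))
```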

Results/Examples

We tested the DiConStruct explainer on two real-world datasets: a publicly available dataset for classifying bird images (CUB-200–2011) and an in-house fraud detection dataset (Merchant Fraud).

On both datasets, we train feedforward neural networks as our black-box models. In Table 1, we report the performance of multiple variations of DiConStruct on both datasets, compared to several baselines. We report the main classification performance (which we chose to be recall at a 5% false positive rate), the concept performance, i.e., how accurately we predict the concepts for each instance (quantified by the average accuracy over the concepts), and the fidelity, i.e., how well DiConStruct mimics the black-box model decisions (quantified by 1 − MAE, where MAE is the mean absolute error between the explainer and black-box scores).
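For reference, the two evaluation quantities can be computed as in the following sketch; the variable names and data are illustrative stand-ins.

```python
# Minimal sketch of the evaluation metrics mentioned above: recall at a fixed 5%
# false-positive rate, and fidelity as 1 - MAE between explainer and black-box scores.
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, scores, target_fpr=0.05):
    fpr, tpr, _ = roc_curve(y_true, scores)
    return np.interp(target_fpr, fpr, tpr)   # recall (TPR) at the target FPR

def fidelity(black_box_scores, explainer_scores):
    return 1.0 - np.mean(np.abs(black_box_scores - explainer_scores))

# Example with random placeholder scores:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
black_box_scores = rng.random(1000)
explainer_scores = np.clip(black_box_scores + rng.normal(scale=0.02, size=1000), 0, 1)
print(recall_at_fpr(y_true, black_box_scores), fidelity(black_box_scores, explainer_scores))
```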

Since DiConStruct does not affect the black-box model, the task performance is equal to the black-box performance. Moreover, the fidelity of the DiConStruct methods is always very high, meaning that they learn input-output relations very similar to those of the black-box model. Finally, the concept performance of DiConStruct is on par with the state-of-the-art baselines, with the added advantage that our method provides causal explanations.

Table 1. Experimental results. The performance of DiConStruct and various baselines on the test set for the CUB-200–2011 dataset (top) and the Merchant Fraud dataset (bottom). CBM denotes a Concept Bottleneck Model baseline where the classification task and the concept task receive equal weight. Task Baseline and Concept Baseline denote models that were trained on a single task only. The Task Baseline corresponds to the black-box model.
Figure 5. Example explanation from the DiConStruct method. (a) Learned SCM for an instance of the Merchant Fraud dataset. The blue edge color denotes positive interactions, while the red edge color denotes negative interactions; the intensity represents the interaction’s strength. A positive/negative interaction increases/decreases the value of the destination node, respectively. Concept likelihood (nodes) is encoded from white (low) to black (high). (b) Concept attribution plot for the same instance.

In Figure 5, we show one example of a causal explanation from our DiConStruct method, for an instance in the Merchant Fraud dataset that the model predicted to be suspicious. We can see that Suspicious Device is the concept deemed most important for the decision. We can also observe that Good Customer History has a very low likelihood and therefore increases the likelihood of downstream concepts related to fraud. This is indeed expected, since Good Customer History is a concept related to legitimate events. With these causal explanations, one can not only understand which concepts are most relevant but also how they causally relate to and interact with each other.

Conclusions

In summary, we introduced a novel method for model explainability that addresses some limitations of current approaches, namely by providing explanations in terms of human-understandable concepts and by incorporating causal principles to enable counterfactual reasoning on top of the provided explanations. With our method, we aim to improve human-AI interactions by providing explanations that are more closely aligned with how humans explain decisions to each other.
