Counterfactuals and their Evaluation

Urja Pawar
5 min read · Feb 4, 2022


If you are reading this, you are probably interested in the domain of Causal Inference or xAI/iML (explainable AI / interpretable ML) and want to know about counterfactuals. This article covers exactly that, and we will also talk about some evaluation metrics discussed in the related research for assessing how good or bad a counterfactual example is. Let us begin!

Basics — What are counterfactuals?

The term counterfactual actually came from the domain of Causal Inference.

In brief, causal inference is about causation: identifying the factors that actually cause a certain outcome, as opposed to factors that merely "appear" to be causes because of spurious correlation (correlation induced by third factors). This is done by conducting studies and analysing the resulting data (e.g., giving patients a certain drug and observing the outcomes). A counterfactual is an imaginary example (counter + factual) that represents a situation of the form: what would have happened if factor X had been different? If the causal factors are properly identified, we can answer such counterfactual questions.

In the domain of machine learning, we can provide explainability in a similar way. We change a row/record by some minimal amount to create potential counterfactual examples, and since we can get the answer for any counterfactual record just by querying the model, we can make statements like: if feature X had value B instead of the original value A, the classification would have been M instead of N. In this way, people can understand how the model differentiates between classes M and N.
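As a minimal sketch, with a hand-written toy "model" standing in for a trained classifier (the feature names and threshold here are illustrative assumptions), checking a counterfactual candidate is just a matter of copying the record, changing a feature, and querying the model again:

```python
# Minimal sketch: probing a model with a counterfactual candidate.
# The model here is a toy stand-in; any trained classifier with a
# predict-style interface works the same way.

def model(record):
    # Toy loan model: class "M" (approve) if income is high enough, else "N".
    return "M" if record["income"] >= 50_000 else "N"

original = {"income": 40_000, "age": 30}
counterfactual = dict(original, income=55_000)  # change feature X: income

print(model(original))        # classification of the original record
print(model(counterfactual))  # classification after the feature change
```

If the two classifications differ, the modified record is a counterfactual example for the original instance.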

How to generate/construct Counterfactual explanations?

The classic way to generate counterfactuals is given by Wachter et al., which aims to minimise the following loss function:

L(x, x′, y′, λ) = λ · (f(x′) − y′)² + d(x, x′)  (Equation 1)

where f is the model, x is the original instance (to be explained counterfactually), x′ is the potential counterfactual instance, y′ is the desired output, and d is a distance metric measuring some sort of similarity between the original and counterfactual instances. The weight λ balances reaching the desired output against staying close to the original.

Now, depending upon the domain and the type of data involved, the ways we can optimise the above equation or otherwise generate counterfactuals differ.
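To make Equation 1 concrete, here is a rough sketch that minimises a Wachter-style loss with plain random search (the original paper uses gradient-based optimisation and an adaptive λ). The toy model f, the fixed λ, and all function names are illustrative assumptions, not a real library API:

```python
# Rough sketch of Wachter-style counterfactual search (Equation 1)
# via simple random hill-climbing.
import random

def f(x):
    # Toy model: a score in [0, 1]-ish territory from two features.
    return 0.5 * x[0] + 0.5 * x[1]

def d(x, x_prime):
    # L1 distance between the original and the candidate counterfactual.
    return sum(abs(a - b) for a, b in zip(x, x_prime))

def wachter_loss(x, x_prime, y_prime, lam):
    # lam * (f(x') - y')^2 pushes toward the desired output y';
    # d(x, x') keeps the candidate close to the original instance.
    return lam * (f(x_prime) - y_prime) ** 2 + d(x, x_prime)

def find_counterfactual(x, y_prime, lam=10.0, steps=5000, seed=0):
    rng = random.Random(seed)
    best, best_loss = list(x), wachter_loss(x, x, y_prime, lam)
    for _ in range(steps):
        cand = [v + rng.gauss(0, 0.1) for v in best]
        loss = wachter_loss(x, cand, y_prime, lam)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

x = [0.2, 0.2]                            # f(x) = 0.2
x_cf = find_counterfactual(x, y_prime=0.8)
print(f(x_cf))  # pushed toward 0.8, while d(x, x_cf) stays small
```

Note that with a fixed λ the search settles on a compromise between the two loss terms; Wachter et al. instead increase λ until f(x′) is within a tolerance of y′.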

The techniques to generate counterfactuals can be categorised as follows:

  1. Instance-based:

These approaches try to find counterfactual instances by applying perturbation techniques to features such that the result stays close to the original instance. Equation 1, given by Wachter et al., also falls under this category.

2. Probability-based:

Here we generate counterfactuals using probabilistic approaches such as Markov chain sampling and variational autoencoders. These learn the underlying data encodings, either through unsupervised learning or probabilistic graphical models, and then generate counterfactuals from the learned clusters or graphical models.

3. Constraint-based:

These are very generalisable approaches that frame counterfactual generation as a constraint satisfaction problem, and they can be used to satisfy different properties of a counterfactual (discussed in the next section).

4. Feature-importance based:

Techniques to generate feature importances range from the game-theory-based approach SHAP to local surrogate models like LIME. If we use these feature importance techniques to score features by their impact on the model's output, we can tweak the most important feature first, then the second most important, and so on, until we get a counterfactual instance.
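A rough sketch of this idea, assuming the importance ranking has already been computed elsewhere (e.g. by SHAP or LIME; here it is simply hard-coded) and using a toy linear classifier. All names, weights, and step sizes are illustrative assumptions:

```python
# Sketch: tweak features in importance order until the class flips.

def predict(x):
    # Toy binary classifier over three features.
    return "M" if (2 * x["f1"] + 1 * x["f2"] + 0.5 * x["f3"]) > 3 else "N"

def flip_by_importance(x, ranking, step=1.0, max_steps=10):
    target = "M" if predict(x) == "N" else "N"
    cand = dict(x)
    for feat in ranking:                 # most important feature first
        for _ in range(max_steps):
            if predict(cand) == target:
                return cand              # class flipped: counterfactual found
            cand[feat] += step if target == "M" else -step
    return cand if predict(cand) == target else None

x = {"f1": 0.5, "f2": 1.0, "f3": 0.0}    # predicted "N"
ranking = ["f1", "f2", "f3"]             # assumed importance ranking
cf = flip_by_importance(x, ranking)
print(predict(x), "->", predict(cf))     # N -> M
```

Because the most important feature moves first, the resulting counterfactual tends to be sparse: here only f1 changes.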

Properties/Evaluation Metrics for Counterfactuals

So far, the following properties have been discussed in the literature:

1. Diversity:

This refers to the number of distinct counterfactual explanations possible for a given instance's classification. Now, the more ways I have to tell you why a certain thing was classified as… say, a pig!, the more understanding you will gain about the model and its learned knowledge.

2. Sparsity:

This refers to the number of features that had their values changed in the counterfactual explanation. As per the classic definition, the fewer features changed, the better. But we can't always rely on this metric, right? What if the counterfactual suggests one big feature change instead of minor changes across multiple features? Also, what exactly is a good number for sparsity? Should it always be 1? 2? It's debatable and somewhat ambiguous.
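Measuring sparsity itself is straightforward: count the features that differ between the original instance and its counterfactual. A minimal sketch over dict-based records (the helper name and tolerance are assumptions):

```python
# Sparsity metric: how many features changed in the counterfactual?
# Fewer changes = sparser (classically "better") explanation.

def sparsity(original, counterfactual, tol=1e-9):
    return sum(
        1 for k in original
        if abs(original[k] - counterfactual[k]) > tol
    )

x  = {"income": 40_000, "age": 30, "debt": 5_000}
cf = {"income": 55_000, "age": 30, "debt": 4_000}
print(sparsity(x, cf))  # 2 features changed (income and debt)
```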

3. Plausibility:

This metric represents the possibility that a counterfactual instance is actually realistic. Say that for some classification problem, features like age and gender are shown as changed in the counterfactual explanations. This won't make sense when counterfactual explanations are used to suggest changes for the future.

4. Feasibility:

This is similar to plausibility but relates more to how feasible a feature change is. The counterfactual instance should not reflect changes that are either difficult to achieve or not strongly representative of the desired class (the classification we want from the counterfactual instance).

5. Proximity:

This represents the classic property of a counterfactual explanation: being close to the original instance. Proximity is optimised using different distance metrics such as the L1/L2-norm.
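A minimal sketch of proximity under the L1 and L2 norms; in practice features are usually normalised first (e.g. by median absolute deviation) so that no single feature's scale dominates the distance:

```python
# Proximity: distance between the original instance and the counterfactual.
import math

def l1(x, x_prime):
    # L1-norm: sum of absolute per-feature differences.
    return sum(abs(a - b) for a, b in zip(x, x_prime))

def l2(x, x_prime):
    # L2-norm: Euclidean distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))

x, x_cf = [0.2, 0.5, 1.0], [0.2, 0.9, 0.7]
print(round(l1(x, x_cf), 2))  # 0.7
print(round(l2(x, x_cf), 2))  # 0.5
```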

6. Stability:

This metric is discussed under different names in the literature, such as robustness or stability. It essentially asks: how robust are the explanations to slight input perturbations? Explanations should not change drastically or irregularly on a slight modification of the input, so if explanations are very sensitive, there might be a problem with the way they are generated.
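One simple way to probe this: explain an input and a slightly perturbed copy of it, then compare the two explanations. The `explain` function below is a toy stand-in for a real explanation method, and the names and threshold are assumptions:

```python
# Rough stability probe: a large explanation gap for a tiny input change
# signals an unstable explanation method.

def explain(x):
    # Toy "explainer": per-feature contributions for a linear model.
    weights = [2.0, -1.0, 0.5]
    return [w * v for w, v in zip(weights, x)]

def explanation_gap(x, eps=1e-3):
    x_pert = [v + eps for v in x]          # slightly perturbed input
    e1, e2 = explain(x), explain(x_pert)
    return max(abs(a - b) for a, b in zip(e1, e2))

gap = explanation_gap([0.3, 0.7, 1.2])
print(gap < 0.01)  # True: this linear explainer moves only slightly
```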

My take on counterfactuals

Although counterfactual explanations can be very useful for recommending changes to get a desired output from a model, they can also be used simply to understand the ML model better, without caring whether or not the suggested feature changes are plausible/feasible. They quantify the degree of change required in a feature, and so can surface granular insights into what the model learned, rather than only high-level insights such as feature importance.

The evaluation criteria discussed in the literature so far relate to functional evaluation. Obviously, the usefulness of an explanation is the first and foremost metric to consider, but as we have different domains with varying datasets and end-users, the definition of usefulness varies. Conducting user studies is always a good way to assess usefulness, but can we go a step further and try to quantify it in some sense? After all, user studies will only give us discrete answers ;)

Recommended reading —


