Data Science Basics — (3) Decision Trees and the Simpson’s Paradox

Mamdouh Refaat
5 min read · Mar 16, 2024

Simpson’s paradox is a well-known problem, present in many datasets, in which a trend seen in the whole dataset reverses once the data are split into groups. The result is conflicting interpretations of what the data show. Decision trees do not detect this problem by default, but their visual nature makes it easy to inspect and demonstrate.

We will explore the problem of Simpson’s Paradox using a simple example and we will use the decision trees in Altair Knowledge Studio to explain it.

The dataset we will use to demonstrate the paradox contains the records of 900 extraterrestrial patients who contracted a viral infection while visiting our planet (Earth). We (humans) tried to treat them using the drugs we have. Some of the aliens accepted the treatment, some did not. Because some of those aliens were from Mars and some were from the Moon, their survival rates varied.

The objective of our investigation of this dataset is to determine whether the treatment had a positive impact on the rate of survival.

Figure 1 shows the root node of a decision tree for this dataset. It indicates that 50.5% of the alien patients survived the infection (the portion of the node shaded in red).

Figure 1: root node of the decision tree of the Alien survival data

When we split the data using the variable representing whether the alien patients were treated, we get the split shown in figure 2.

Figure 2: the alien survival data split by the variable “Treated”

Figure 2 shows that node 3 (the treated segment) has a survival rate of 55%, which is higher than the 41.6% of node 2 (the untreated patients). Taken at face value, this picture makes us believe that the treatment increased the chance of survival of the aliens from 41.6% to 55%. From the point of view of the entire population of infected aliens, it would therefore seem beneficial to treat all of them with the drugs we have.

Let’s now try a different approach. Let’s first split the dataset using the “Origin” variable, which represents whether the alien patient is from Mars or the Moon. In the second level of splits, we will then split each of the two groups by the “Treated” variable. This is shown in Figure 3.

Figure 3: A tree with splits on Origin, then Treated variable

The tree in Figure 3 shows that, at the second-level splits, the treatment reduces the survival rate in both groups of aliens.

The result of figure 3 poses the important question: if in both groups of aliens, either from Mars or the Moon, the treatment with the drug reduced the chance of survival, how could the treatment help the entire population as shown in figure 2?

Figures 2 and 3 clearly lead to contradictory conclusions. This is Simpson’s paradox.
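To make the contradiction concrete, here is a minimal sketch in plain Python. The counts are hypothetical: they were chosen only to reproduce the same pattern with 900 patients, and they are not the numbers behind the figures. The names `counts` and `survival_rate` are likewise just for illustration.

```python
# Hypothetical counts chosen to reproduce the pattern; NOT the article's dataset.
# (origin, treated) -> (patients, survivors)
counts = {
    ("Mars", True): (500, 300),
    ("Mars", False): (100, 70),
    ("Moon", True): (100, 20),
    ("Moon", False): (200, 60),
}


def survival_rate(cells):
    """Pooled survival rate over a list of (patients, survivors) cells."""
    patients = sum(n for n, _ in cells)
    survivors = sum(s for _, s in cells)
    return survivors / patients


# Split on Treated only (the kind of split shown in Figure 2).
overall = {
    treated: survival_rate([v for (_, t), v in counts.items() if t == treated])
    for treated in (True, False)
}

# Split on Origin first, then on Treated (the kind of tree shown in Figure 3).
by_origin = {
    origin: {
        treated: survival_rate([counts[(origin, treated)]])
        for treated in (True, False)
    }
    for origin in ("Mars", "Moon")
}

print(f"All aliens: treated {overall[True]:.1%}, untreated {overall[False]:.1%}")
for origin, rates in by_origin.items():
    print(f"{origin}: treated {rates[True]:.1%}, untreated {rates[False]:.1%}")
```

With these made-up counts, the treated aliens survive more often overall (53.3% against 43.3%), yet less often within each origin (60% against 70% for Mars, 20% against 30% for the Moon). The reversal is possible because, in this toy data, most treated patients come from the group with the higher survival rate, so the pooled comparison mixes the effect of the treatment with the effect of the origin.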

Simpson’s paradox has been studied for a long time, and there are many explanations and interpretations of what it means. My take on it is that the distribution of survival with respect to the treatment is only borderline stable, so it can be interpreted either way. In other words, we cannot robustly and clearly infer the effect of the treatment on the survival rate.

An important observation about datasets that exhibit the paradox is that the groups in question behave very differently in terms of the distribution of the variable of interest. For example, the first split in Figure 3 shows a big difference in survival rates between the groups in nodes 2 and 3. Therefore, whenever we see such a disparity in distribution between groups, we should check for this data issue.

How common/important is the Simpson’s Paradox?

I have been a data scientist for a long time, and I have encountered this problem several times in real-life data. As a rule, we should suspect its presence when the data contain many categorical variables and some of them lead to very different distributions of the dependent variable.
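One way to act on this rule is to scan the categorical variables and flag any whose groups all show the opposite effect to the pooled data. The sketch below is only one possible check, written in plain Python; the function names, the `Shift` variable, and the counts are hypothetical and are not taken from the dataset or from Knowledge Studio.

```python
from collections import defaultdict


def rate_difference(rows, outcome, factor):
    """Outcome rate where `factor` is true minus the rate where it is false.

    Returns None when either arm has no rows.
    """
    totals = {True: [0, 0], False: [0, 0]}  # arm -> [rows, positive outcomes]
    for row in rows:
        arm = totals[bool(row[factor])]
        arm[0] += 1
        arm[1] += int(bool(row[outcome]))
    if totals[True][0] == 0 or totals[False][0] == 0:
        return None
    return totals[True][1] / totals[True][0] - totals[False][1] / totals[False][0]


def find_reversals(rows, outcome, factor, candidates):
    """Return the candidate variables for which every level shows an effect
    with the opposite sign to the effect seen in the pooled data."""
    pooled = rate_difference(rows, outcome, factor)
    if pooled is None or pooled == 0:
        return []
    flagged = []
    for var in candidates:
        groups = defaultdict(list)
        for row in rows:
            groups[row[var]].append(row)
        diffs = [rate_difference(g, outcome, factor) for g in groups.values()]
        diffs = [d for d in diffs if d is not None]
        if diffs and all(d * pooled < 0 for d in diffs):
            flagged.append(var)
    return flagged


# Hypothetical data: (origin, treated) -> (patients, survivors).
cells = {
    ("Mars", True): (500, 300),
    ("Mars", False): (100, 70),
    ("Moon", True): (100, 20),
    ("Moon", False): (200, 60),
}
rows = []
for (origin, treated), (patients, survivors) in cells.items():
    for i in range(patients):
        rows.append({
            "Origin": origin,
            "Treated": treated,
            "Survived": i < survivors,
            # "Shift" is an invented variable unrelated to survival.
            "Shift": "day" if i % 2 == 0 else "night",
        })

flagged = find_reversals(rows, "Survived", "Treated", ["Origin", "Shift"])
print(flagged)  # prints ['Origin']
```

The check is deliberately strict: it flags a variable only when all of its levels reverse the pooled trend. A softer version could report partial reversals as well, which may be more useful when a variable has many levels.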

Discovering the problem early matters because we develop predictive models for two purposes: (1) prediction, and (2) understanding trends. By trends I mean that we use predictive models to explain how a dependent variable is associated with the values of a certain independent variable. In the case of the extraterrestrial patients, it was the relationship between the survival rate and the treatment. This is where the paradox could pop up. The last thing we want is to develop a predictive model and then discover this issue when we try to use the model to demonstrate a specific trend!

How about Regression Problems?

Simpson’s paradox can also appear in regression problems. Take, for example, the data shown in Figure 4, where we fitted a simple regression model relating the weight of the extraterrestrials to their age (normalized to Earth years!).

Figure 4: regression model for weight vs age for extraterrestrials

Figure 4 shows that the weight of aliens increases with age. However, if we separate the two clusters, which happen to correspond to the aliens from Mars and those from the Moon, and fit a regression line to each of them, we get the results shown in Figure 5.

Figure 5: regression lines for each of the two groups of aliens

Figure 5 clearly shows that the trend of weight vs. age reverses direction within both groups: in each group, weight decreases as the aliens get older. Again, this is a case of Simpson’s paradox, where segmenting the data by some variable reverses the trend observed in the whole dataset.
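The same reversal can be reproduced with a few lines of plain Python. The ages and weights below are invented for illustration (they are not the points plotted in Figures 4 and 5), and `ols_slope` is a small helper written for this sketch that computes the ordinary least-squares slope of a simple regression.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (age, weight) observations per origin; NOT the article's data.
groups = {
    "Mars": ([1, 2, 3, 4, 5], [50, 49, 48, 47, 46]),
    "Moon": ([6, 7, 8, 9, 10], [60, 59, 58, 57, 56]),
}

# One regression over all aliens together (the kind of fit shown in Figure 4).
all_ages = [a for ages, _ in groups.values() for a in ages]
all_weights = [w for _, weights in groups.values() for w in weights]
pooled_slope = ols_slope(all_ages, all_weights)

# One regression per origin (the kind of fit shown in Figure 5).
group_slopes = {
    origin: ols_slope(ages, weights) for origin, (ages, weights) in groups.items()
}

print(f"Pooled slope: {pooled_slope:.2f}")
for origin, slope in group_slopes.items():
    print(f"{origin} slope: {slope:.2f}")
```

In this toy data the pooled slope is positive (about 1.27) while the slope inside each group is -1. The pooled line mostly picks up the gap between the two clusters rather than the trend within either of them.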

Conclusions

In this article, we explored Simpson’s paradox using the decision tree as a visualization tool. I hope I managed to show how a decision tree makes this data problem easy to demonstrate and understand. We also showed that the paradox can appear in regression problems, where it likewise leads to contradictory trends when we try to explain the results of a model.

Mamdouh Refaat

I am the Chief Data Scientist at Altair Engineering. I have been working in ML and AI for over 25 years in business and engineering applications.