Modeling customer conversion with causality

Raymond Kwok
Published in Analytics Vidhya · 5 min read · Mar 18, 2021

Exploring causality in feature selection

The curse of dimensionality is a well-known issue when modeling datasets with many features, particularly for models that require statistical significance. Features, however, need not be our enemy: they describe, for example, the customers, and so enrich our understanding of a business. A reasonable way to select features for a particular task therefore becomes the question discussed in this article.

Popular dimension reduction tools

Auto-encoders and Principal Component Analysis (PCA) are popular methods that capture the essence of the original set of features and produce a new set with fewer features. However, they fall short on interpretation: each of the derived features becomes a mixture of the original ones and thus loses its physical meaning. The causal approach here, by contrast, does not “compress” features; it offers a graphical view of the relationships between the features and the predicted variable, so that one can select the features that directly affect the predicted variable as modeling predictors.

Causality

A causal model has a directed-graph representation that describes the causal relations between the features, such as the one below:

A graphical causal model between 4 features

If such a model is established with rich enough statistics, then one might argue that instead of modeling “income” with all three other features, using only “experience” and “education” would be sufficient, as they are the factors that directly cause the predicted variable “income”.

Building a causal graph with the IC* algorithm

Basic building blocks of the graph

While the details can be found in Pearl’s book [1], the IC* algorithm begins with a graph in which all features are inter-connected, and then disconnects two features whenever a conditional independence between them is discovered, leaving a partially connected, undirected graph.
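To make this skeleton phase concrete, here is a minimal sketch of my own (not taken from the article). It starts from a complete graph and removes an edge whenever a conditional independence is found, remembering the separating set for later. The `cond_indep(data, a, b, s)` callable is a hypothetical placeholder for whatever independence test you choose, and capping the conditioning-set size is a practical simplification of the full algorithm, which searches over all subsets.

```python
from itertools import combinations
import networkx as nx

def build_skeleton(data, features, cond_indep, max_cond_size=2):
    """Phase 1 of IC*: start fully connected, drop edge A - B whenever A and B
    are conditionally independent given some set S of other features, and
    remember S (the "separating set") because the orientation steps need it.

    cond_indep(data, a, b, s) -> bool is a hypothetical test supplied by the
    caller; max_cond_size caps |S| for tractability (a simplification)."""
    g = nx.complete_graph(features)
    sepset = {}
    for a, b in combinations(features, 2):
        others = [f for f in features if f not in (a, b)]
        for size in range(max_cond_size + 1):
            separated = False
            for s in combinations(others, size):
                if cond_indep(data, a, b, set(s)):
                    g.remove_edge(a, b)
                    sepset[frozenset((a, b))] = set(s)
                    separated = True
                    break
            if separated:
                break
    return g, sepset
```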

A connection alone does not say which feature causes which, so the algorithm next discovers the causal directions.

First, whenever three features are connected as (A — B — C) but A and C are conditionally independent given a set of features excluding B, we can turn the connections into arrows, (A -> B <- C), i.e. a collider.

After discovering all colliders in this first step, we keep iterating the following two rules until no more new arrows can be drawn (a code sketch of the collider step and these rules follows the list).

  • whenever there is an (A -> B — C) structure, change it into (A -> B => C), because it cannot be a collider; otherwise it would have been discovered in the first step. The symbol => is distinguished from -> to indicate that B => C is a genuinely directed relationship, whereas B -> C could mean either a genuinely directed relationship (i.e. B => C) or the possibility of a latent variable U (not observed in the dataset) such that B and C are related as B <- U -> C, which is a fork structure.
  • whenever there is a genuinely directed path from A to B, i.e. A => F1 => F2 … => B, and A is connected with B (i.e. A — B), then that connection should be changed into A -> B.
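Here is a small networkx-based sketch of the collider step and the two rules above. The edge “marks” (“maybe” for ->, “sure” for =>) are my own naming for illustration; the separating sets come from the skeleton phase sketched earlier.

```python
import networkx as nx

def orient_colliders(skeleton, sepset):
    """Turn every A - B - C into A -> B <- C when A and C are non-adjacent and
    B was not in the set that separated them."""
    marks = {}  # (a, b): "maybe" means a -> b, "sure" means a => b
    for b in skeleton.nodes:
        nbrs = list(skeleton.neighbors(b))
        for i, a in enumerate(nbrs):
            for c in nbrs[i + 1:]:
                if (not skeleton.has_edge(a, c)
                        and b not in sepset.get(frozenset((a, c)), set())):
                    marks[(a, b)] = "maybe"
                    marks[(c, b)] = "maybe"
    return marks

def propagate(skeleton, marks):
    """Apply the two rules from the list above until nothing new is oriented."""
    changed = True
    while changed:
        changed = False
        # Rule 1: A -> B - C (with A, C non-adjacent) becomes A -> B => C.
        for (a, b) in list(marks):
            for c in skeleton.neighbors(b):
                undirected = (b, c) not in marks and (c, b) not in marks
                if c != a and undirected and not skeleton.has_edge(a, c):
                    marks[(b, c)] = "sure"
                    changed = True
        # Rule 2: a genuinely directed (=>) path from A to B plus an
        # undirected edge A - B forces A -> B.
        sure = nx.DiGraph([(u, v) for (u, v), m in marks.items() if m == "sure"])
        for a, b in skeleton.edges:
            for x, y in ((a, b), (b, a)):
                undirected = (x, y) not in marks and (y, x) not in marks
                if (undirected and sure.has_node(x) and sure.has_node(y)
                        and nx.has_path(sure, x, y)):
                    marks[(x, y)] = "maybe"
                    changed = True
    return marks
```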

Having mentioned latent factors, there is a possibility (quite a large one, in fact) that the produced graph is not unique. This should make sense, because we may simply not be able to include every decisive factor in the model. The existence of a latent variable itself offers a hint about where to look and what to plan for in data collection.

Lastly, for those features that remain connected but undirected, the relationship stays undetermined and would probably require more data to resolve.

Demonstration with a telemarketing dataset

A bank telemarketing dataset [2] containing 20 features for each of its 41,188 customers was used for the demo. The processing flow is illustrated below.

Dataset processing flow
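To ground the processing flow, here is a minimal pandas sketch of the loading and encoding steps. This is my own reconstruction: the article does not spell out the exact encoding that yields its 54 “basic” columns, and one-hot encoding is one plausible step of that flow.

```python
import pandas as pd

# bank-additional-full.csv from the UCI repository [2]: semicolon-separated,
# 41,188 rows, 20 input columns plus the conversion label "y" (yes/no).
df = pd.read_csv("bank-additional-full.csv", sep=";")

y = (df["y"] == "yes").astype(int)

# One-hot encode the categorical inputs into binary columns.
X_basic = pd.get_dummies(df.drop(columns="y"), drop_first=True)
```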

Causal models were then constructed with the IC* algorithm. We had more than one model, for the reason discussed in the last section. However, focusing on the predicted variable (conversion, in this telemarketing dataset), we could find 9 features sitting closest to it, though not necessarily directly connected. Since this is not a business I can intervene in to clarify the relationships, I chose those 9 features (labelled scm) under a naive Markov assumption as my modeling predictors.
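The article does not say which conditional-independence test was used. As an illustration only, the sketch below wires together the helper functions from the previous section with a Fisher-z test of zero partial correlation, a crude stand-in for the richer tests that IC* permits (off-the-shelf implementations, such as the open-source causality Python package, offer alternatives), and then inspects the neighbourhood of the conversion variable.

```python
import numpy as np
from scipy import stats

def partial_corr_indep(data, a, b, s, alpha=0.01):
    """Fisher-z test of zero partial correlation between columns a and b given
    the columns in s; returns True when they look conditionally independent."""
    cols = [a, b] + sorted(s)
    sub = data[cols].to_numpy(dtype=float)
    prec = np.linalg.pinv(np.corrcoef(sub, rowvar=False))
    r = np.clip(-prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1]), -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(sub) - len(s) - 3)
    return 2 * stats.norm.sf(abs(z)) > alpha

# NOTE: exhaustive CI testing over 50+ columns is slow; subsample rows or
# restrict the column set in practice.
data = X_basic.assign(y=y)
features = list(data.columns)

skeleton, sepset = build_skeleton(data, features, partial_corr_indep)
marks = propagate(skeleton, orient_colliders(skeleton, sepset))

# Features adjacent to the conversion variable are candidates for the "scm" set.
print(sorted(skeleton.neighbors("y")))
```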

To compare with a modern dimension-reduction technique, PCA was also applied to create a set of 9 embedded features (labelled pca). Together with the original set of 54 features (labelled basic), there were three datasets for comparison.
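A minimal scikit-learn sketch of this baseline, assuming standardisation before PCA (the article does not state its exact preprocessing):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise first so that no single column dominates the components, then
# keep 9 components to match the size of the "scm" feature set.
X_pca = PCA(n_components=9).fit_transform(StandardScaler().fit_transform(X_basic))
```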

In the comparison below, the features from the causal model (red line) performed consistently better than those from PCA (green line).

Scores achievement comparison

On the other hand, the causal-model features (red line) also endured variation of the training dataset better than the PCA features (green line): the green line dropped much more quickly as more data was hidden from the training process.

Stability against visible data in training
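The article does not name the classifier or metric behind these curves. As an illustration only, the sketch below uses logistic regression and ROC-AUC and shrinks the visible slice of training data to mimic this kind of “hidden data” comparison; scm_features in the usage comment is a placeholder for the 9 columns chosen from the causal graph.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def stability_curve(X, y, fractions=(1.0, 0.5, 0.2, 0.1, 0.05), seed=0):
    """Score a model trained on progressively smaller slices of the training
    data against a fixed held-out set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    scores = {}
    for frac in fractions:
        n = max(int(len(X_tr) * frac), 100)
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
        scores[frac] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return scores

# e.g. stability_curve(X_basic[scm_features], y)  # scm_features: your chosen 9 columns
#      stability_curve(X_pca, y)                  # PCA baseline
```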

Conclusion

While there is no guarantee that one reduction technique always performs better than the others, the causal model provides an interpretable way to reason about our data while still allowing feature selection. On the telemarketing dataset, different combinations of 9 features were tested, and the combination produced by the causal model ranked in the top 3%, which, at least on this dataset, was not mere coincidence.

Reference

  1. Pearl, J. (2000). Causality: models, reasoning, and inference. Cambridge University Press.
  2. Moro, S., Cortez, P., and Rita, P. (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22–31. Retrieved from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
