Automating interpretable feature engineering for predicting CLV

Will a user become a top-tier customer?

Jan Osolnik
Inside BUX
12 min read · Nov 20, 2018


Source: Pixabay

Introduction

While doing the MSc in Data Science program at the University of Amsterdam, I was thinking about how to apply the newly acquired knowledge in industry. Considering how much I had previously enjoyed working in a startup environment in various roles, I knew that I would like to continue on a similar path. Fast-forward a couple of months, and I was warmly welcomed by the BUX team to do an internship on using machine learning to predict customer lifetime value. This also resonated with my interest in using data science to understand some aspect of human behavior and to make better decisions in a business.

This post provides the intuition behind what I worked on during the 3-month internship, so code is kept to a few illustrative sketches rather than the full implementation. First I will write about the business value of customer lifetime value prediction and extend it into the classification of users into top-tier and non-top-tier customers, with a definition of each. This is followed by the machine learning methodology and the results of how well the pipeline performed on this specific problem.

The project largely focused on applying the Deep Feature Synthesis (DFS) algorithm as a method to automate the feature engineering process. If you haven't yet, please read this post by William Koehrsen, which explains DFS, before continuing. If you are only interested in the problem and the results, you can simply skip the methodology section. All of the code used in the project can be found on GitHub.

Problem definition

Having a clear picture of how well a business is performing is crucial and can be measured in various ways.

Two of the basic metrics used to measure that in a SaaS business are CLV (customer lifetime value) and CAC (customer acquisition cost), which, when combined, are a useful heuristic for determining the profitability of a business.

CLV measures how much revenue a particular customer will generate for the business over their entire lifetime, while CAC measures how much the business spends to acquire that customer. CAC is easier to measure, as it becomes quantifiable in most businesses at the beginning of the funnel (the Acquisition stage). CLV becomes clear only after a certain period of time further down the funnel (in the Revenue stage). For example, a year of usage can be a proxy for how much the customer will spend in their entire lifetime.

The general rule of thumb is that CAC shouldn't exceed 1/3 of CLV, but there is little general consensus on that. It's a business decision: if the focus is profitability, then lowering the threshold (to 1/4 or less) is a priority, relying more on organic growth (which lowers the CAC). On the other hand, if the focus is high growth, a company may accept a ratio of 1/1 or even higher.
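
As a quick worked example of that heuristic (the numbers below are made up for illustration, not BUX figures):

```python
# Hypothetical numbers purely for illustration (not BUX figures)
clv = 300.0  # expected lifetime revenue per customer, e.g. in EUR
cac = 90.0   # average cost to acquire one customer

ratio = cac / clv
print(f"CAC is {ratio:.0%} of CLV")  # 30%, within the 1/3 rule of thumb
```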

The revenue generated across the user base of a SaaS business often follows a power-law distribution: a small percentage of users (defined as top-tier customers) generate the majority of the revenue (the Pareto principle).

Power-law distribution of generated revenue per user
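
A minimal sketch of how such a concentration can be checked, using simulated heavy-tailed data rather than the actual BUX revenue data:

```python
import numpy as np
import pandas as pd

# Simulated, heavy-tailed revenue per user (illustrative only)
rng = np.random.default_rng(0)
revenue = pd.Series(rng.pareto(a=1.2, size=10_000) * 10)

cutoff = revenue.quantile(0.99)  # 99th-percentile revenue threshold
share = revenue[revenue >= cutoff].sum() / revenue.sum()
print(f"Top 1% of users generate {share:.0%} of total revenue")
```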

So how can we predict a customer’s lifetime value? Or to go even further, how can we predict which user will become a top-tier customer?

Methodology

That's the problem I worked on for my master's thesis, written during an internship at BUX, which has built a mobile app that enables simple and affordable trading. BUX faces the same problem as many other SaaS businesses: predicting customer lifetime value based on the behavioral data captured while customers/users interact with the product.

As mentioned above, the pipeline was applied to the users at BUX. There are two stages of using the product. In the first, funBUX stage, a user trades virtual money in order to learn how to trade. When a user converts into the seriousBUX stage, they become a customer and start trading with real money. As the business generates revenue through commissions and financing fees on seriousBUX trades, the conversion also acts as a precondition for a user becoming a top-tier customer. This is the distinction between a user and a customer: only a customer generates revenue, and when users start using the product it is not known whether they will eventually convert and become customers. This will be useful in the results section when interpreting the features built by Deep Feature Synthesis.

The main idea behind the thesis was to build a generalizable machine learning pipeline that automates interpretable feature engineering. So let’s decompose this into two components:

  • Generalizable machine learning pipeline: the pipeline can be reused on other problems with the same data structure (more on that below, described as an entity set).
  • Automation of interpretable feature engineering: a large number of features is built automatically based on the underlying data tables and the relations between them. Two algorithms are used so that the predictions can be interpreted both at the model level and at the level of individual predictions, rather than using a black-box algorithm whose predictions are much less interpretable.

When all of the features are built, the most relevant ones are chosen to avoid overfitting (to minimize out-of-sample error). When different questions arise in the business (such as predicting churn, predicting conversion, etc.), the target values can be changed, and the pipeline then selects a different set of features accordingly, as the sketch below illustrates. In this case, this is the main value proposition of a generalizable pipeline.
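
A hypothetical sketch of that idea (not the thesis code): the same DFS feature matrix is reused for any binary target, and the most relevant features are re-selected per target. The function name, the number of features and the model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_for_target(feature_matrix, labels, n_features=20):
    """Reuse one DFS feature matrix for any binary target (top-tier
    customer, churn, conversion, ...) by re-selecting the most relevant
    features for that target and refitting the classifier."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(feature_matrix, labels)

    # Keep the n_features columns with the highest importance for this target
    top = np.argsort(rf.feature_importances_)[::-1][:n_features]
    selected = feature_matrix.iloc[:, top]

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(selected, labels)
    return selected.columns.tolist(), model
```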

A problem that data scientists often face nowadays is not being able to answer all the business questions. This way, the energy and time invested in answering “Can we predict this using our data?” can be minimized, as there is less human labor involved in the process. It is far from replacing data scientists; it makes them more productive in their work. When it's clear that the underlying data can answer a business question, additional time can be invested in building features manually, with the intuition that is not embedded in the features built by DFS. The authors of the algorithm go much more in-depth into this in their paper.

It is often more valuable to the business to answer 5 questions with 80% accuracy as opposed to answering 1 question with 95% accuracy

This depends, of course, on the business case, but sometimes even simple heuristics might work (see Rule #1). There are often diminishing returns on additional time invested in optimizing the pipeline, combined with the opportunity cost of not investing that energy and time in other questions. Still, there are also exceptions to this rule.

In order to build the features automatically, the data was structured into three entities (a required input into DFS): Cohorts, Users and Transactions.

Entity set (input into Deep Feature Synthesis)

These entities define the problem space, which can be described as the prediction of human behavior. That is clearly a vague definition, and the system can solve only a subset of behavior-prediction problems. For the system to be generalizable, the data needs to be structured this way; otherwise, a different entity structure needs to be built (which is not that difficult to change, as the DFS documentation shows).
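
As an illustration, here is a minimal sketch of how such an entity set can be assembled with the Featuretools library (using the pre-1.0 API that was current at the time; newer versions use add_dataframe and target_dataframe_name instead). The toy dataframes and column names are assumptions, not the actual BUX schema:

```python
import pandas as pd
import featuretools as ft

# Tiny toy dataframes mirroring the three entities (not real BUX data)
cohorts_df = pd.DataFrame({"cohort_id": [1, 2]})
users_df = pd.DataFrame({
    "user_id": [10, 11, 12],
    "cohort_id": [1, 1, 2],
    "signup_time": pd.to_datetime(["2018-01-02", "2018-01-05", "2018-02-01"]),
})
transactions_df = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103],
    "user_id": [10, 10, 11, 12],
    "transaction_time": pd.to_datetime(
        ["2018-01-03", "2018-01-04", "2018-01-06", "2018-02-02"]),
    "amount": [25.0, 40.0, 10.0, 5.0],
})

es = ft.EntitySet(id="clv")
es = es.entity_from_dataframe("cohorts", cohorts_df, index="cohort_id")
es = es.entity_from_dataframe("users", users_df, index="user_id",
                              time_index="signup_time")
es = es.entity_from_dataframe("transactions", transactions_df,
                              index="transaction_id",
                              time_index="transaction_time")

# Cohorts -> Users -> Transactions (parent -> child relationships)
es = es.add_relationship(ft.Relationship(es["cohorts"]["cohort_id"],
                                         es["users"]["cohort_id"]))
es = es.add_relationship(ft.Relationship(es["users"]["user_id"],
                                         es["transactions"]["user_id"]))

# Deep Feature Synthesis with a maximum feature depth of 2,
# producing one row of features per user
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="users",
                                      max_depth=2)
print(feature_matrix.columns.tolist())
```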

The main underlying algorithms used were Deep Feature Synthesis (DFS) and Local Interpretable Model-Agnostic Explanations (LIME). While DFS focuses on automating feature engineering and producing interpretable features, LIME focuses on creating interpretable explanations of individual predictions.

For DFS, the two sources mentioned above provide sufficient intuition. For LIME, there is a post providing an explanation, and you can also find the original paper. LIME is model-agnostic and provides a local approximation: instead of trying to understand the whole model, it only explains the prediction for an individual instance.

Explaining individual predictions to a human decision-maker (source)

The Random Forest algorithm was used as the predictor, as there was little difference in performance compared to more complex tree-based algorithms such as gradient-boosted trees and XGBoost.

There are two parts to the system: a generalizable one and a custom one. Whereas the generalizable part can be used on any problem with the entity set structure described above, the custom part makes it possible to change the source of the data, change the target values, add manual features, and so on. Both parts are implemented as utility functions, building blocks used to assemble the entire end-to-end pipeline. There are quite a few more details in the full implementation, so the thesis is available here for more insight.

The features built by DFS were extracted from 3 weeks of behavioral data to predict customer value after 6 months of usage, which is used as a proxy for the actual customer lifetime value.

The 6-month time span was chosen instead of a longer one because this way more recent customers can be included in the learning process (every customer only needs 6 months of history, not more). The Pearson correlation coefficient between 6-month customer value and 1-year customer value is 0.95, which shows that a 6-month period provides enough information to serve as a proxy for CLV.
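
For reference, that kind of check is a one-liner; the sketch below uses synthetic stand-in data, not the actual per-customer revenue:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in for per-customer revenue after 6 and 12 months
rng = np.random.default_rng(1)
revenue_6m = rng.pareto(a=1.5, size=5_000) * 10
revenue_12m = revenue_6m * rng.normal(loc=2.0, scale=0.3, size=revenue_6m.size)

r, _ = pearsonr(revenue_6m, revenue_12m)
print(f"Pearson r between 6-month and 12-month value: {r:.2f}")
```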

Results

Performance evaluation

The pipeline was tested on the classification of top-tier customers, the customers in the 99th percentile of generated revenue. As is often the case with classification problems, this means a high class imbalance. Accuracy is therefore not a useful metric to evaluate the pipeline (a useless model that classifies every customer as non-top-tier achieves 99% accuracy).

That's why precision (the fraction of users classified as top-tier customers who actually are top-tier customers), recall (the fraction of all top-tier customers who are correctly identified) and the F1 score (the harmonic mean of the two) are used instead.

Based on the features built, the model produces a probability of a user becoming a top-tier customer. The default decision boundary for that is 0.5. This means that if the probability of a user becoming a top-tier customer is 0.6, the user is classified as a top-tier customer; if it's 0.4, the user is classified as a non-top-tier customer.

The pipeline was evaluated with a 5-fold cross-validation to also take into account the variance of the performance over different folds in the data.
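
A sketch of what that evaluation step can look like, using an imbalanced toy dataset in place of the 20 selected DFS features and the top-tier labels (the model settings are illustrative, not necessarily those used in the thesis):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced toy data: ~1% positives, standing in for top-tier customers
X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.99],
                           random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(model, X, y, cv=cv,
                        scoring=["roc_auc", "f1", "precision", "recall"])
for metric in ("roc_auc", "f1", "precision", "recall"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.2f} +- {values.std():.2f}")
```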

These metrics are:

  • AUC score: 0.83
  • F1 score: 0.56 ± 0.03
  • Precision: 0.73 ± 0.05
  • Recall: 0.48 ± 0.05

The small variance across the different performance metrics shows that the pipeline performs equally well over different folds.

Confusion matrix before thresholding (threshold = 0.5)

The nature of the business problem places more importance on identifying as many top-tier customers as possible, so minimizing the false negative rate (classifying a top-tier customer as non-top-tier) matters more than minimizing the false positive rate (classifying a non-top-tier customer as top-tier).

In order to do that, we prioritize recall. Using a thresholding function, the decision threshold was lowered to 0.2, which produced a recall of 0.65 and a precision of 0.5. Intuitively, this means we correctly identify 65% of top-tier customers, and half of the users we classify as top-tier actually are.
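
The mechanics of trading precision for recall look roughly like this, again on toy data, with out-of-fold probabilities so the evaluation is not done on training data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Same kind of imbalanced toy data as above (not the BUX dataset)
X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.99],
                           random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities of the positive (top-tier) class
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

for threshold in (0.5, 0.2):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, pred):.2f}, "
          f"recall={recall_score(y, pred):.2f}")
```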

This is definitely good performance, considering that the behavioral data is not particularly granular (daily summaries of in-product interactions) and that the vast majority of the features were built automatically. Besides that, the full capability of DFS is not explored: the maximum feature depth is 2 (as there are 3 entities), while there are many examples of more complex entity sets that can explore a much broader feature space.

Confusion matrix after thresholding (threshold = 0.2)

Interpretation

When it comes to the interpretability of the pipeline, there are two parts:

  • Interpretation of the features (built with DFS)
  • Interpretation of individual predictions (explained with LIME)

292 features are built with the pipeline, of which the 20 most relevant are chosen, and the final predictions are made based on these (20 is an arbitrary number that can be changed as a parameter of the pipeline).

DFS enables all of the features to be interpreted in natural language, as shown in the example below for the 5 most relevant features:

5 most relevant features for predicting whether a user will become a top-tier customer

As we can see, the features are explainable in natural language and incorporate intuition that would otherwise have to be built manually by a data scientist, often with the help of a domain expert. The only manually built feature in this list is Conversion_Completed_hours_till_event, as this was a clear piece of low-hanging fruit that wasn't captured by DFS.

The time to a certain event was one group of manually built features. There were also additional manually built features on the Cohorts entity describing the environment in which the users started using the product, mainly the volatility of the markets for different product types (such as currencies, indices or stocks). On the Users entity, the manual features described the trading segment of the user, such as crypto trader or forex trader (determined by the trades in various product types). While these features were initially expected to provide valuable information gain, the aggregate DFS features provided the most value.

Such clear explainability of the features hasn't (yet) been possible with black-box algorithms such as neural networks, which often perform better. LIME and SHAP are among the methods aiming to change that. The argument for using a technique like DFS holds when speed of execution, interpretability and lower computational complexity outweigh the importance of additional performance.

Besides the interpretation of the features, the output of the pipeline also provides an interpretation of the probabilities of individual predictions. For example, below we can see a user who was given a probability of 0.74 of becoming a top-tier customer. The output also includes the contribution of individual features, such as the level of dispersion of trade values, the maximum amount traded and the aggregate invested amount.

Interpretation of an individual prediction (using LIME)
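
An explanation like the one above can be produced with the lime library roughly as follows; the toy data, feature names and class names below are placeholders for the selected DFS feature matrix and the top-tier labels:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the 20 selected features and the top-tier labels
X, y = make_classification(n_samples=5_000, n_features=20, weights=[0.99],
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names,
    class_names=["non-top-tier", "top-tier"], mode="classification")

# Explain one user's prediction in terms of the 5 most influential features
explanation = explainer.explain_instance(X[0], model.predict_proba,
                                         num_features=5)
print(explanation.as_list())
```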

Conclusion

Considering that the vast majority of features were built using DFS, this shows that it is possible to quickly answer the question "Can we predict this using our data?". If the answer is yes, the pipeline can then be applied to solve the problem.

As already mentioned, the target value the pipeline is applied to classify depends on the business use case. Whereas here it was applied to classify top-tier customers, it can also be used on any dimension of the behavioral data (such as which user will churn, which user will purchase a particular product, etc.). This use case was particularly interesting because of the business dynamics of power-law-distributed revenue generation in the user base (which causes a high class imbalance in the target values of the dataset).

To score new incoming users on a regular basis, the pipeline scripts can be scheduled as a job and the predictions loaded into a database for visualization.
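
A hypothetical sketch of such a scoring job (the function, table name and connection string are made-up examples, not part of the actual BUX setup):

```python
import pandas as pd
from sqlalchemy import create_engine

def score_new_users(model, feature_matrix, engine):
    """Run the trained model over freshly built DFS features and write
    the probabilities to a table that dashboards can read from."""
    predictions = pd.DataFrame({
        "user_id": feature_matrix.index,
        "p_top_tier": model.predict_proba(feature_matrix)[:, 1],
        "scored_at": pd.Timestamp.utcnow(),
    })
    predictions.to_sql("top_tier_predictions", engine,
                       if_exists="append", index=False)

# Scheduled daily, e.g. via cron or Airflow:
# engine = create_engine("postgresql://user:password@host/dbname")
# score_new_users(trained_model, latest_feature_matrix, engine)
```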

It's worth mentioning that it took 6 weeks to build the entire pipeline, which would definitely not have been possible without the open-source libraries for DFS and LIME, both of which provide great examples of how to use them on a variety of datasets. So I'd like to thank the authors for all the value they're generating in the open-source community.

I would also like to thank the whole BUX team, especially the Business Intelligence team, for the support I've been given during the internship. It has been a fun ride.
