Predictive analytics for customer acquisition with MADE.com

Nicolas LISCH
MADE.COM Engineering and Data Science
8 min read · Aug 28, 2019

How can predictive analytics support customer acquisition in digital channels?

I’d like to share our latest experience at MADE.com…

As a Data Scientist at MADE.com, I am passionate about web analytics: I use Data and Machine Learning to improve customer conversion rates which in turn helps increase revenue.

If you are a marketer or just digitally savvy, this article is for you.

I want to demonstrate how we incorporated predictive revenue as the main metric to drive qualified traffic on our website and lower customer acquisition costs.

As a marketer, Return On Investment or Return On Ad Spend is probably the most important metric you use to judge the effectiveness of your digital campaigns and adjust your budget per channel accordingly.

Cumulated first transaction share for customers over 90 days since the first visit

However, in most cases you have to wait 30–90 days to start to see which channel brings in high-value customers, especially if you have a long sales cycle like MADE.com’s (only 35% of transactions happen on the day of a new visitor’s first session).

That being said, predicting accurate long-term ROAS is a key challenge for marketing teams, as they have to work against daily performance budgets and targets. Indeed, 45% of spending doesn’t lead to immediate revenue.

Of course, some proxies can be used to evaluate the quality of new visitors (e.g. bounce rate or attribution modelling), but time is precious, and by the time these signals arrive it’s already too late to make accurate real-time decisions.

This is why long term revenue needs to be predicted.

How did the Analytics team tackle this challenge?

Start with your proof of concept …

Before you can test your predictive revenue model and justify to stakeholders that it is worth building, you might have to visualise or hypothesise its expected outcome beforehand.

In that case, I strongly recommend starting with a proof of concept based on Google’s Session Quality metric.

This turnkey metric will allow you to predict overall revenue, but its accuracy declines as the granularity of analysis increases.

For example, Session Quality does not take into account important traffic features like the landing page, channel or promotion markdown. This is why you will need to build your own predictive model to predict at session level.

When it comes to predicting SEM or Direct revenue, you need to learn in-depth patterns from new visitors’ first sessions.

You will have to shift from standard Google historical data (e.g. bounce rate, time on site) to more advanced in-house visitor data (e.g. cities, landing pages) and aggregate data to fill in gaps in your model.

We defined predicted revenue as the conversion probability multiplied by the expected revenue of a new visitor (the Average Order Value):

Predicted Revenue = P(conversion) × Average Order Value

Proof of concept

If Google Analytics is your web analytics solution, as it is at MADE.com, you’re in luck: you already have a native KPI called ‘Session Quality’ available in Google Analytics and BigQuery.

This KPI provides you with a session quality indicator, with a score from 0 to 100.

Built on a Google ML model, Session Quality seems to be based on the most common web analytics metrics such as ‘hits’ or ‘time on site’, as you can observe on the correlation matrix below. It does not correlate with custom internal tracking points such as the one we developed (‘newsletter subscription’).

Correlation matrix to check session quality

This KPI, which of course has its limits, is an ideal starting point to implement your baseline model and support your digital growth performance.

Predicted revenue for a given session is calculated as follows:

Predicted Revenue = α × Session Quality × AOV (90 days)

Where:

α is the adjustment coefficient that calibrates Session Quality to the transactions measured over a 90-day period:

α = (transactions in 90 days) / (sum of Session Quality scores)

Session Quality is the session conversion probability KPI (scored 0 to 100).

AOV (90 days) is the 90-day average order value:

AOV (90 days) = (revenue in 90 days) / (transactions in 90 days)
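Under those definitions, the baseline can be sketched in a few lines of Python (all numbers are hypothetical placeholders, not real MADE.com data):

```python
import numpy as np

# Hypothetical per-session data: Session Quality scores (0-100) and, for
# sessions that converted within 90 days, the measured order values.
session_quality = np.array([5, 80, 33, 12, 95, 60, 7, 41])
converted = np.array([0, 1, 0, 0, 1, 1, 0, 0])           # 1 = transaction within 90 days
order_values = np.array([0, 520, 0, 0, 310, 790, 0, 0])  # revenue per session (0 if none)

# Adjustment coefficient: calibrate Session Quality so that the summed
# pseudo-probabilities match the number of measured 90-day transactions.
alpha = converted.sum() / session_quality.sum()

# 90-day average order value over converting sessions.
aov_90d = order_values.sum() / converted.sum()

# Baseline predicted revenue per session.
predicted_revenue = alpha * session_quality * aov_90d
print(predicted_revenue.round(2))
```

A nice property of this calibration is that, by construction, the predicted revenue summed over all sessions equals the total measured revenue; the baseline only redistributes it across sessions in proportion to Session Quality.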

Let’s now have a quick look at results from our first model!

The trend in Predicted Revenue Versus Measured Revenue Over Time

It looks promising, as it captures the reality of sales pretty well.

This model provides you with a better vision of predicted Return On Ad Spend:

But it reaches its limits when it comes to predicting revenue accurately at session level.

Custom Predictive Model

Can we challenge Google ML session quality to improve performance?

OK, if you don’t have Google’s resources, don’t forget that you’re an expert in your own business and that you have many more data points to understand your new customers!

Now remember our initial equation:

Predicted Revenue = P(conversion in 90 days) × AOV (90 days)

To solve it, we implemented two models:

1. Classification model to predict conversion probability

2. Regression model to predict AOV for potential customers.

1. Classification model: Will your new sessions convert?

This model is binary and aims to predict ‘1’ (converted within 90 days) or ‘0’ (did not convert within 90 days). The main idea is to train the classifier and then extract the predicted conversion probability for each new session.

For illustrative purposes, let’s assume only ~2.5% of new visitors converted. In other words, we’re facing a dataset with an unbalanced class distribution.

Dataset Distribution

We trained different models and plotted ROC and Precision-Recall curves as the main error KPIs to validate our predicted probabilities.

ROC curves show how the number of correctly classified positive examples varies with the number of incorrectly classified negative examples.

Precision-Recall curves are particularly interesting in our case, as we have to deal with an unbalanced dataset.
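To illustrate why both curves matter, here is a sketch on synthetic data with a ~2.5% positive rate, using scikit-learn (a hypothetical stand-in for our real session dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the session dataset: ~2.5% positive class,
# mirroring the conversion rate mentioned above.
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.975], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC AUC can look flattering on unbalanced data; average precision
# (the area under the Precision-Recall curve) is the harsher KPI.
print(f"ROC AUC: {roc_auc_score(y_te, proba):.3f}")
print(f"PR  AUC: {average_precision_score(y_te, proba):.3f}")
```

On a dataset this skewed, two models with near-identical ROC AUC can have very different PR AUC, which is exactly the gap between the Random Forest and the Google model below.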

In the results below, the Random Forest and the Google model seem comparable in ROC space, but we can see a clear improvement for the Random Forest when looking at the Precision-Recall curves:

AUC for ROC curves and Precision-Recall curves

The LightGBM model has many benefits: it handles categorical features natively, maximises the area under both curves, and trains on a huge amount of data in only 5 minutes. As a comparison, it took CatBoost more than one hour to obtain similar results…

Training a complex model on an unbalanced dataset can lead to uncalibrated probabilities, and we want the best results from our model!

We applied beta calibration to our predicted probabilities to align the results with the total measured transactions.
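For the curious, beta calibration can be sketched as a logistic regression on ln(p) and −ln(1−p) (a simplified fit of the method in Kull et al., without the parameter constraints of the full algorithm, on made-up data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def beta_calibrate(p_raw, y):
    """Fit a beta calibration map: a logistic regression on ln(p) and
    -ln(1-p), which traces out the beta-family calibration curves."""
    eps = 1e-6
    p = np.clip(p_raw, eps, 1 - eps)
    feats = np.column_stack([np.log(p), -np.log(1 - p)])
    lr = LogisticRegression(C=1e6).fit(feats, y)

    def apply(q):
        q = np.clip(q, eps, 1 - eps)
        f = np.column_stack([np.log(q), -np.log(1 - q)])
        return lr.predict_proba(f)[:, 1]
    return apply

# Hypothetical overconfident raw probabilities vs. actual 90-day outcomes.
rng = np.random.default_rng(0)
true_p = rng.uniform(0, 0.1, 10000)      # real conversion rates are low
y = (rng.uniform(size=10000) < true_p).astype(int)
p_raw = np.sqrt(true_p)                  # the raw model systematically overshoots

calibrate = beta_calibrate(p_raw, y)
p_cal = calibrate(p_raw)
print(f"mean raw: {p_raw.mean():.3f}  mean calibrated: {p_cal.mean():.3f}  "
      f"actual rate: {y.mean():.3f}")
```

After calibration, the mean predicted probability lines up with the observed conversion rate, which is what makes the summed predictions match the total measured transactions.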

Probabilities adjusted for the LightGBM model, trained until 29/11/18, with predictions on March ’19

Let’s have a deeper look at the model’s performance.

LightGBM, a black-box model?

Complex models can lead to transparency and governance issues, because the function they learn from the data is hard to explain. Nevertheless, new methods help provide insights and improve your algorithm.

Applying such a method will also provide you with rich customer behaviour journey insights.

I used Shapley Additive exPlanations (SHAP) as a way to understand the impact of the model features.

This approach uses SHAP values, which you can interpret as additive impact coefficients for each feature of the model.

You can first retrieve a robust view of the model’s feature importance in the left plot below.

The right chart indicates the positive or negative impact of each feature for each session. As the model is not linear, you can expect different SHAP values for each session because of interactions with other features.

How can we interpret this in business terms?

Red points represent high values for a given observation.

The farther a point sits to the right, the more positive its impact on conversion.

Based on the chart, numerical features such as time on site, add to cart, newsletter subscription or search use have a positive impact on conversion. In other words, a session with a high time on site is more likely to convert.

Shapley visualisation allowed us to focus on outliers and exclude them from the dataset where necessary. For example, we identified sessions where conversion happened in just one click due to a tracking issue.

Note: Shapley visualisation is not useful for categorical features as they are label encoded.
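In practice you would run SHAP (e.g. `shap.TreeExplainer`) on the trained model; as a self-contained sketch of the additive property SHAP relies on, here are exact Shapley values for a toy scoring function (feature names echo the chart above, but all coefficients are made up):

```python
from itertools import combinations
from math import factorial

# Toy "model": a conversion score from three session features.
def f(features):
    time_on_site, add_to_cart, newsletter = features
    return 0.02 + 0.0001 * time_on_site + 0.05 * add_to_cart + 0.01 * newsletter

# Exact Shapley values for one session, by enumerating all feature
# coalitions; missing features are replaced by a baseline value.
baseline = (120.0, 0.0, 0.0)   # an "average" session
session = (600.0, 1.0, 1.0)    # the session we want to explain
n = 3

def coalition_value(S):
    x = [session[i] if i in S else baseline[i] for i in range(n)]
    return f(x)

shap_values = []
for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for size in range(n):
        for S in combinations(others, size):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi += w * (coalition_value(set(S) | {i}) - coalition_value(set(S)))
    shap_values.append(phi)

# Additivity: the values sum to prediction minus baseline prediction.
print(shap_values, sum(shap_values), f(session) - f(baseline))
```

This additivity is what lets the summary plot decompose each session’s score into per-feature contributions; `TreeExplainer` computes the same quantities efficiently for tree ensembles.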

Let’s now move to the second step to build our predictive model.

2. Regression model: What is the order revenue for future customers?

This model aims to predict 90-day revenue for new sessions and is trained on 90-day purchasers. We reached a positive R-squared score with only 5 features, such as promotion markdown, average product price viewed on product pages, or average list impressions (at first-session level).

3. Predicted revenue: Generate value for the business

You can now multiply the 90-day conversion probability by the 90-day expected revenue to get the predicted revenue for each new session on your website.

Predicted revenue for 2 different users

It is critical to measure the performance of the model via an error KPI. We use the Mean Absolute Error:

Suggested Mean Absolute Error (in local currency)
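Putting the two models together, with hypothetical numbers:

```python
import numpy as np

# Combine the two models for a few hypothetical new sessions:
# calibrated 90-day conversion probability x predicted 90-day order value.
conversion_proba = np.array([0.010, 0.180, 0.045])   # from the classifier
expected_revenue = np.array([420.0, 650.0, 510.0])   # from the regressor
predicted_revenue = conversion_proba * expected_revenue
print(predicted_revenue)

# Mean Absolute Error against revenue actually measured after 90 days.
measured_revenue = np.array([0.0, 650.0, 0.0])
mae = np.mean(np.abs(predicted_revenue - measured_revenue))
print(f"MAE: {mae:.2f}")
```

Note that session-level MAE is naturally dominated by the rare converting sessions, since most sessions generate zero measured revenue; it is best tracked as a trend over time rather than judged in isolation.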

Finally, we can predict 90-day revenue on a daily basis:

Revenue performance for new visitors from the first day of acquisition

Conclusion:

Now it is time to apply this model to the execution of digital campaigns.

All these findings provide us with a better understanding of the onsite user experience. Marketing teams can therefore benefit from predicting the likelihood of a user converting by eliminating or reducing the budget allocated to inefficient acquisition campaigns (see the new campaign performance distribution below, scale adjusted for illustrative purposes).

They can also achieve major performance management improvements by using this algorithm to automate bidding strategies or expand audiences based on each visitor’s potential value.

Predictive analytics: Future and improvements

Phase two of the project will involve improving the model’s accuracy by adding Customer Lifetime Value estimates, to learn how profitable new customers are and to prevent high volatility.

I hope you enjoyed this article, any feedback or suggestions will be greatly appreciated!

Resources

1. Session Quality: https://support.google.com/analytics/answer/7303153?hl=en

2. LightGBM / CatBoost: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db

3. ROC / Precision-Recall curves: https://www.biostat.wisc.edu/~page/rocpr.pdf

4. Shapley Additive exPlanations: https://arxiv.org/pdf/1802.03888.pdf

5. Beta calibration: https://projecteuclid.org/download/pdfview_1/euclid.ejs/1513306867
