Customer Lifetime Value Prediction Using Embeddings
This paper was presented at the RE·WORK Deep Learning in Retail & Advertising Summit (London). It describes the Customer Lifetime Value (CLTV) prediction system deployed by ASOS.com, an online fashion retailer. For e-commerce companies, being able to predict CLTV more accurately delivers a huge benefit to the business. This paper provides a detailed explanation of related work, the system architecture, and model improvements achieved with embedding-based feature learning.
What is CLTV?
CLTV, or Customer Lifetime Value, is a prediction of the net profit attributed to the entire future relationship with a customer. In other words, it represents how much each customer is worth in monetary terms. This information can be used to judge the appropriate cost of customer acquisition, as well as retention spending on existing customers.
The concept of CLTV can be defined differently depending on business needs. At ASOS, CLTV is defined as net spend, which is sales minus returns, over one year. Predicting CLTV over a one-year span delivers actionable insight for the business, and this defines the prediction problem. The training and prediction timescales for CLTV are established as follows:
Fig 1. Training and prediction time-scales for CLTV.
The model is retrained every day using customer data from the past two years. Labels are the net customer spend over the previous year. Model parameters are learned in the training period and used to predict CLTV from new features in the live system.
As shown in Fig. 1, the training labels are the CLTV, defined as net spend over the past year (the last 12 months). The training features are drawn from the period between 24 and 12 months ago, which is disjoint from the label period. In the live system, features from the past 12 months are used to make the predictions.
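The windowing described above can be sketched with a small pandas example. The transaction log and column names here are hypothetical, purely to illustrate the disjoint feature and label periods:

```python
import pandas as pd

# Hypothetical transaction log: one row per order, negative amounts are returns.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2016-03-01", "2017-06-15", "2016-09-10", "2017-01-20", "2017-05-05"]
    ),
    "net_amount": [50.0, -10.0, 80.0, 30.0, 25.0],
})

now = pd.Timestamp("2017-12-31")
one_year_ago = now - pd.DateOffset(years=1)
two_years_ago = now - pd.DateOffset(years=2)

# Labels: net spend (sales minus returns) over the most recent year.
labels = (
    transactions[transactions["date"] > one_year_ago]
    .groupby("customer_id")["net_amount"].sum()
    .rename("cltv_label")
)

# Features: aggregated from the disjoint year before that (-24 to -12 months).
feature_window = transactions[
    (transactions["date"] > two_years_ago) & (transactions["date"] <= one_year_ago)
]
features = feature_window.groupby("customer_id")["net_amount"].agg(
    net_spend="sum", n_orders="count"
)

train = features.join(labels, how="left").fillna({"cltv_label": 0.0})
```

At prediction time the same aggregation would simply be run over the most recent 12 months instead.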
Related Work of CLTV Modeling
Studies of customer behavior began decades ago. Early models were constrained by a lack of data and often had to fit simple parametric statistical models with strict assumptions. It was at the turn of the century, with the data provided by large-scale e-commerce platforms, that new methods were developed and tested on empirical data.
Distribution Fitting Approaches
“Buy ’Til You Die” (BTYD) models were among the first statistical models of CLTV. They use parametric distributions to model customer behavior. The well-known Pareto/NBD model assumes an exponentially distributed active duration and a Poisson-distributed purchase frequency. Two further improvements were made to make the approach more usable.
Recency-Frequency-Monetary value (RFM) models are an expansion of BTYD. They capture the time of the last purchase (recency), the number of purchases (frequency), and purchase values (monetary) to estimate CLTV. They are still based on Pareto/NBD for recency and frequency, with purchase values following an independent gamma/gamma distribution.
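The three RFM summaries are simple aggregates over a customer's order history. A minimal sketch, using a hypothetical order table and a snapshot date:

```python
import pandas as pd

# Hypothetical order history; recency/frequency/monetary summarize each customer.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2017-01-05", "2017-03-10", "2017-06-01", "2017-02-14"]),
    "value": [40.0, 25.0, 60.0, 100.0],
})

snapshot = pd.Timestamp("2017-07-01")

rfm = orders.groupby("customer_id").agg(
    recency_days=("date", lambda d: (snapshot - d.max()).days),  # time since last purchase
    frequency=("date", "count"),                                 # number of purchases
    monetary=("value", "mean"),                                  # average purchase value
)
```

The parametric models above then fit distributions over exactly these quantities, rather than feeding them into a learned model.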
Machine Learning Methods
Despite the success of distribution fitting, it is difficult to incorporate the large volumes of customer data available to modern e-commerce platforms, such as web browsing data, into RFM/BTYD models. This motivates the move to machine learning methods.
Current Model & Architecture at ASOS
Currently, ASOS deploys a random forest model built on Apache Spark. It gathers customers’ demographics, purchases, returns, and product information, and uses manually engineered features. In the machine learning pipeline, two models are trained: a churn classifier and a CLTV regressor. After calibration, the whole system delivers predictions to business stakeholders.
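The two-model setup can be sketched in a few lines. This is not the ASOS pipeline (which runs on Spark over proprietary features); it is a minimal illustration with synthetic data and scikit-learn stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical handcrafted features (demographics, purchases, returns, ...).
X = rng.normal(size=(500, 8))
net_spend = np.maximum(0.0, X[:, 0] * 50 + rng.normal(scale=10, size=500))
churned = (net_spend == 0.0).astype(int)  # no future spend -> churned

# Two models, as in the ASOS pipeline: churn classification and CLTV regression.
churn_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, churned)
cltv_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, net_spend)

churn_prob = churn_model.predict_proba(X)[:, 1]  # probability of churning
cltv_pred = cltv_model.predict(X)                # predicted net spend
```

In production, the calibrated outputs of both models would be combined before being surfaced to stakeholders.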
A recent development at ASOS is to use embeddings that capture information from web/app sessions as features in the current model. This is discussed later.
The figure below illustrates the CLTV system:
For the current random forest model, the authors share the feature importances, which revealed some interesting and surprising insights. Among the most important features are:
- the standard deviation of the order and session dates
- number of items purchased from the new collection
Even without the embedding-based features, the ASOS CLTV system already delivers good results:
- For the CLTV model, a Spearman rank-order correlation coefficient of 0.56, which assesses the monotonic relationship between predicted and actual values (higher is better; +1 indicates a perfect monotonic relation).
- An AUC of 0.795 for churn prediction.
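Both metrics are standard and easy to compute. A small sketch with made-up predictions for six customers, just to show how each score is obtained:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Hypothetical actual vs. predicted net spend for six customers.
actual_spend = np.array([0.0, 10.0, 25.0, 40.0, 80.0, 300.0])
pred_spend = np.array([5.0, 8.0, 36.0, 30.0, 90.0, 250.0])

# Spearman rank-order correlation: +1 means a perfect monotonic relation.
rho, _ = spearmanr(actual_spend, pred_spend)

# Hypothetical churn labels (1 = churned) and churn scores.
churned = np.array([1, 1, 0, 0, 0, 0])
churn_scores = np.array([0.9, 0.6, 0.7, 0.3, 0.2, 0.1])

auc = roc_auc_score(churned, churn_scores)
```

Spearman correlation is a sensible choice for CLTV because many downstream uses (e.g. ranking customers for retention spend) only need the ordering, not the exact monetary value.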
Improving CLTV Model with Feature Learning
The objective is to supplement the current handcrafted features. Automatic feature learning, such as deep learning and dimensionality reduction, helps overcome some of the limitations of handcrafted features. Two approaches were tried at ASOS:
- Applying unsupervised neural embeddings to customer product views to generate latent features, which then supplement the feature set of the random forest model.
- Training a Deep Neural Network (DNN) on top of the handcrafted features to learn higher-order feature representations.
Embeddings of Customers Using Sessions
The methodology extends a Natural Language Processing (NLP) neural embedding method: Skip-Gram with Negative Sampling (SGNS), the method used in word2vec. Several applications in related domains exist, such as item2vec, prod2vec, and bagged-prod2vec.
The intuition is that high-value customers tend to browse higher-value products, less popular products, and products that may not be at the lowest price on the market. Conversely, low-value customers tend to appear together in product sequences during sale periods, or around products priced below the market. This is where SGNS comes in: just as it captures which words appear in similar contexts, it can capture which customers appear in similar product contexts.
In practice, three key design decisions should be made:
- How to define a context
- How to generate pairs of customers from within the context
- How to generate negative samples
As shown in the diagram above, the context is the sequence of customers associated with each product. Customer pairs are generated within a context window, negative customer samples are drawn at random, and the weight matrices are then learned.
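To make the mechanics concrete, here is a minimal from-scratch SGNS sketch in NumPy (the paper's implementation details, hyperparameters, and scale are not given here; the sequences, dimensions, and learning rate below are illustrative assumptions). Each "sentence" is the list of customers associated with one product, so the learned input vectors are customer embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgns_embeddings(sequences, n_items, dim=16, window=2, n_neg=5,
                    lr=0.05, epochs=30):
    """Minimal skip-gram with negative sampling over ID sequences.

    Each sequence is the ordered list of customers who interacted with one
    product, so the learned input vectors are customer embeddings.
    """
    W_in = rng.uniform(-0.5 / dim, 0.5 / dim, size=(n_items, dim))
    W_out = np.zeros((n_items, dim))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for seq in sequences:
            for i, target in enumerate(seq):
                lo, hi = max(0, i - window), min(len(seq), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    # One positive context customer plus n_neg random negatives.
                    ids = [seq[j]] + list(rng.integers(0, n_items, n_neg))
                    labels = np.array([1.0] + [0.0] * n_neg)
                    vecs = W_out[ids]
                    preds = sigmoid(vecs @ W_in[target])
                    grad = (preds - labels)[:, None]  # logistic-loss gradient
                    W_out[ids] -= lr * grad * W_in[target]
                    W_in[target] -= lr * (grad * vecs).sum(axis=0)
    return W_in

# Hypothetical product contexts: customers 0-2 co-occur, as do 3-5.
product_sequences = [[0, 1, 2], [2, 0, 1], [3, 4, 5], [5, 3, 4]] * 10
emb = sgns_embeddings(product_sequences, n_items=6)
```

Customers who repeatedly appear in the same product contexts end up with similar vectors, which is exactly the signal the random forest then consumes as extra features.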
One problem ASOS encountered is that, because of random initialization, the embeddings learned at training time do not align with those learned in the live prediction system. To solve this, ASOS initializes the matrices differently:
- For customers present during the training period: initialization uses the training embeddings.
- For new customers: initialization draws uniform random values at a small scale relative to the training embeddings.
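The two initialization rules above can be sketched as follows (the customer IDs, dimension, and the 1% scale factor are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Embeddings from the training run, keyed by customer ID (hypothetical).
trained_emb = {cid: rng.normal(size=dim) for cid in ["a", "b", "c"]}

# Customers seen in the live period; "d" and "e" are new.
live_customers = ["a", "c", "d", "e"]

# Small scale relative to the magnitude of the trained embeddings.
scale = 0.01 * np.mean([np.abs(v).mean() for v in trained_emb.values()])

emb = np.empty((len(live_customers), dim))
for row, cid in enumerate(live_customers):
    if cid in trained_emb:
        # Known customer: warm-start from the training embedding.
        emb[row] = trained_emb[cid]
    else:
        # New customer: small uniform random values.
        emb[row] = rng.uniform(-scale, scale, size=dim)
```

Warm-starting known customers keeps the live embedding space aligned with the training space, so the downstream model's learned feature mapping remains valid.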
The following plot shows the improvement of the embeddings on random forest model:
Fig 2. Maximum area under the receiver operating characteristics curve achieved on a test set of 50,000 customers in deep feed-forward neural networks and hybrid models with different numbers of hidden layer neurons.
The error bars represent the 95% confidence interval of the sample mean. The numbers of hidden-layer neurons are recorded in the following format: [x,y] denotes a neural network with x and y neurons in the first and second hidden layers respectively; [x,y,z] denotes a neural network with x, y, and z neurons in the first, second, and third hidden layers.
Embeddings of Handcrafted Features
The motivation for replacing the random forest with a Deep Neural Network (DNN) comes from the recent successes of DNNs in vision, speech recognition, and recommender systems. However, the results indicate that while DNNs may improve performance, the monetary cost of training the model outweighs the performance benefits.
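As a rough sketch of this second approach, a feed-forward network with two hidden layers can be trained directly on the handcrafted features. The data, layer sizes, and the use of scikit-learn's MLP here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical handcrafted features and churn labels.
X = rng.normal(size=(400, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# DNNs are sensitive to feature scale, unlike random forests.
X_scaled = StandardScaler().fit_transform(X)

# Two hidden layers, i.e. [64,32] in the notation of the figure captions.
dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(X_scaled, y)

auc = roc_auc_score(y, dnn.predict_proba(X_scaled)[:, 1])
```

The hidden layers learn higher-order combinations of the input features, which is what the paper's cost/benefit comparison against the random forest is measuring.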
Fig 3. Maximum Area Under the receiver operating characteristics Curve (AUC) achieved on a test set of 50,000 customers in hybrid models against the number of neurons in the hidden layers (in log scale).
The error bars represent the 95% confidence interval of the sample mean. The bottom (green) and top (red) horizontal line represent the maximum AUC achieved by a vanilla logistic regression model (LR) and our random forest model (RF) on the same set of customers. The dashed lines in the shaded region represent different forecast scenarios for larger architectures.
Figure 3 shows the benchmark AUC of the DNN (two hidden layers) compared to logistic regression (LR) and random forest (RF). It is possible that with more neurons, the DNN could outperform the RF.
Fig.4. Mean monetary cost to train hybrid models on a training set of 100,000 customers against the number of neurons in the hidden layers (both in log scale).
The training cost shown is relative to the cost of training our random forest (RF) model. Here we only consider hybrid models with two hidden layers, each having the same number of neurons. The bottom (green) and top (red) horizontal lines represent the mean cost to train a vanilla logistic regression model (LR) and the RF model on the same set of customers.
Unfortunately, as shown in the figure above, the cost of training the DNN grows rapidly as the number of neurons increases.
CLTV modeling provides very useful insight for decision makers. ASOS makes it actionable for business stakeholders by predicting CLTV only for the next year. The machine learning approach measures CLTV without strict distributional assumptions (in contrast to distribution-fitting approaches), scales to huge quantities of data, and delivers more accurate results. Furthermore, it extends ideas from NLP embedding models (word2vec) to capture customer behavior from browsing session data. This is a very interesting paper for data science teams at online retail companies.
David C. Schmittlein, Donald G. Morrison, and Richard Colombo. 1987. Counting Your Customers: Who Are They and What Will They Do Next? Management Science 33, 1 (1987), 1–24. DOI: http://dx.doi.org/10.1287/mnsc.33.1.1
 Albert C. Bemmaor and Nicolas Glady. 2012. Modeling Purchasing Behavior with Sudden "Death": A Flexible Customer Lifetime Model. Management Science 58, 5 (5 2012), 1012–1021. DOI: http://dx.doi.org/10.1287/mnsc.1110.1461
 Peter S. Fader, Bruce G. S. Hardie, and Ka Lok Lee. 2005. Counting Your Customers the Easy Way: An Alternative to the Pareto/NBD Model. Marketing Science 24, 2 (2005), 275–284. DOI: http://dx.doi.org/10.1287/mksc.1040.0098
 Peter S. Fader, Bruce G. S. Hardie, and Ka Lok Lee. 2005. RFM and CLV: Using Iso-Value Curves for Customer Base Analysis. Journal of Marketing Research XLII, November (2005), 415–430. DOI: http://dx.doi.org/10.1509/jmkr.2005.42.4.415
 Spearman’s rank correlation coefficient: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
Author: Yi Jin | Editor: Joni Chung | Localized by Synced Global Team: Xiang Chen