Road Map For Predicting Next Purchase Day of Customers

cdumen
13 min read · Oct 8, 2020


Image source: https://www.redeye.com/insights/blog/making-the-first-to-second-purchase

With the growing impact of the digital world, e-commerce companies are building customer-centric organizational and management models. These companies want to know more about their customers and predict everything about them in order to take suitable actions, so that they do not lose customers and can increase the number of loyal ones. Predicting the next purchase day of a customer is therefore a growing need within this "predict everything about the customer" idea. In this way, companies can prepare special offers for their customers and stay a step ahead in turning them into loyal customers by offering a personalized experience.

Predicting the next purchase day is also important for identifying customers who may churn in the future, since large delays between purchases can be a sign of churn, which in turn means less revenue. Taking actions such as sending promotional mails and notifications or giving discount codes or coupons to these customers can therefore bring them back and increase loyalty.

Customer data can be divided into four groups:

  • Demographic: gender, age, marital status, occupation
  • Geographic: location, region, urban/rural
  • Behavioral: spending, consumption habits, product/service usage, previously purchased product
  • Psychographic: social status, lifestyle, personality characteristics

Since all customers can have different kinds of needs and different characteristics, it becomes really hard to understand the unique requirements of each customer. This is where customer segmentation comes into play. Customer segmentation is a method of dividing customers into groups or clusters on the basis of common characteristics. Segmenting, or basically grouping, customers according to their behavior provides a chance to take actions after predicting their next purchase day and to offer them personalized campaigns.

Customer segmentation can be done based on the RFM metrics. RFM stands for Recency, Frequency and Monetary. It groups customers on the basis of their previous purchase transactions: how recently, how often and how much a customer bought.

  • Recency (R): Who has purchased recently? The number of days since the customer's last purchase (lower means more recent).
  • Frequency (F): Who purchases frequently? The total number of purchases (higher means more frequent).
  • Monetary Value (M): Who has a high purchase amount? The total amount of money the customer has spent (higher means more valuable).

After building the RFM metrics, the customer data can be divided into groups according to these three RFM dimensions. In general, the division is done using quartiles, which means dividing the customers into four tiers for each dimension.
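
For concreteness, a quartile-based RFM scoring could look roughly like the pandas sketch below; the `transactions` dataframe and its column names are illustrative assumptions, not part of the original project.

```python
import pandas as pd

# Minimal sketch of quartile-based RFM scoring, assuming a transaction
# dataframe `transactions` with CustomerID, Date and Price columns.
transactions['Date'] = pd.to_datetime(transactions['Date'])
snapshot = transactions['Date'].max()

rfm = transactions.groupby('CustomerID').agg(
    Recency=('Date', lambda d: (snapshot - d.max()).days),
    Frequency=('Date', 'count'),
    Monetary=('Price', 'sum'))

# Four tiers per dimension (tier 1 is best): low recency is good,
# high frequency and high monetary value are good.
rfm['R_Tier'] = pd.qcut(rfm['Recency'], 4, labels=[1, 2, 3, 4])
rfm['F_Tier'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4,
                        labels=[4, 3, 2, 1])
rfm['M_Tier'] = pd.qcut(rfm['Monetary'], 4, labels=[4, 3, 2, 1])
```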

Since each of the three RFM metrics is divided into four tiers, this produces 4 x 4 x 4 = 64 distinct customer segments.

As a note, a more detailed grouping with more segments can also be used, as in [2].

Since the quartile method is the best practice for RFM, the following grouping according to the RFM tiers can be used:

  • Best Customers: Customers who are found in R-Tier-1, F-Tier-1 and M-Tier-1 (in short 1-1-1), meaning that they transacted recently, do so often and spend more than other customers. Communications with this group should make them feel valued and appreciated.
  • High-spending New Customers: Customers in segments 1-4-1 and 1-4-2, meaning that they transacted only once, but very recently, and they spent a lot. It is always a good idea to carefully “incubate” all new customers, but because these new customers spent a lot on their first purchase, it's even more important.
  • Lowest-Spending Active Loyal Customers: Customers in segments 1-1-3 and 1-1-4, meaning that they transacted recently and do so often, but spend the least. These repeat customers are active and loyal, but they are low spenders. Marketers should create campaigns for this group that make them feel valued.
  • Churned Best Customers: Customers in segments 4-1-1, 4-1-2, 4-2-1 and 4-2-2, meaning that they transacted frequently and spent a lot, but it's been a long time since they've transacted. These are valuable customers who stopped transacting a long time ago.

Up to this point, a brief introduction to the main problem has been given. Now, a methodology for predicting the next purchase day can be presented. It should be noted that the project is based on the company's historical data, which is time-series data.

As in classical data science projects, the steps of the problem are

  1. Data Wrangling
  2. Feature Engineering
  3. Selecting a Machine Learning Model
  4. Model Tuning

1.1 Data Wrangling

In the data wrangling part, suppose that the dataset consists of the features below:

  • ID: unique id of the transaction
  • CustomerID: unique id of the customer
  • ProductID: unique id of the sold product
  • Date: product sold date
  • Sold_quantity: sold quantity of the product for the customer
  • Stock: stock count of the product when the customer bought
  • Category: category of the product
  • BrandID: product brand ID
  • Price: price of the product
  • Device: the device used to place the order, such as mobile, desktop, or tablet, including iOS or Android

In order to make predictions about the next purchase day, the dataset needs to be split into two parts, where one part is used for making predictions for the second part. Let's assume the dataset covers 1 year of data; it can then be split into two parts as follows.

Using the Date feature in the dataset, we will split the data into the first 9 months (df_1) and the last 3 months (df_2). Since we will use the customer data in the df_1 dataset, it is important to find the unique customers in this dataset and identify them in the df_2 dataset, because we will perform the prediction for those customers.

Using the unique customerIDs in the df_1 dataset, we will find each customer's last transaction date there and the corresponding first transaction date in the df_2 dataset, and we will create a new dataframe consisting of the customerIDs and the corresponding day interval. If a unique customerID from df_1 does not exist in the df_2 dataset, the day interval will appear as a NaN value, which means that the customer did not purchase anything in the last 3 months. As a note, we can either drop these customers or replace the NaN values with some large value to identify those customers in the prediction process.
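
A rough pandas sketch of this split and of the day-interval target could look as follows; the dataframe name `df` and the exact cut-off logic are assumptions for illustration.

```python
import pandas as pd

# Sketch of the split and target construction, assuming a transaction
# dataframe `df` with the columns described above.
df['Date'] = pd.to_datetime(df['Date'])

# First 9 months for feature building, last 3 months for the target
cutoff = df['Date'].min() + pd.DateOffset(months=9)
df_1 = df[df['Date'] < cutoff]
df_2 = df[df['Date'] >= cutoff]

# Last purchase of each customer in df_1
last_purchase = (df_1.groupby('CustomerID')['Date'].max()
                     .reset_index()
                     .rename(columns={'Date': 'LastPurchaseDate'}))

# First purchase of each customer in df_2
first_purchase = (df_2.groupby('CustomerID')['Date'].min()
                      .reset_index()
                      .rename(columns={'Date': 'NextPurchaseDate'}))

# Day interval between the last purchase in df_1 and the first in df_2;
# customers with no purchase in df_2 get NaN
df_new = last_purchase.merge(first_purchase, on='CustomerID', how='left')
df_new['DayInterval'] = (df_new['NextPurchaseDate']
                         - df_new['LastPurchaseDate']).dt.days

# Either drop the NaN rows or flag them with a large value, e.g. 999
df_new['DayInterval'] = df_new['DayInterval'].fillna(999)
```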

1.2 Feature Engineering

Now we have a small dataset, df_new, with 2 features: the customerIDs and their day interval until the next purchase.

The next step is the segmentation of the customers in the df_1 dataset with the RFM metrics. As discussed in the RFM definition part, we will use the quartile method, which creates 4 classes based on the RFM metrics. In order to have 4 classes, we will calculate the day intervals between the next four purchases. Since we already have the day interval for the first purchase in the df_new dataset, we can identify the next three purchases of those unique customers in the df_1 dataset by matching them with the customerIDs and transaction dates in the df_2 dataset. As in the first calculation, if a customer does not make a second/third/fourth purchase in the df_2 dataset, the day interval for that purchase will be NaN.

Now, in the df_new dataset, we have the customerID and 4 day intervals for each customer. As a second step, we can calculate statistical features for these customers: the mean and the standard deviation, both computed over the 4 day intervals.

Before moving to the ML model part, we should create our final dataset by aggregating the df_1 dataset and joining it with the df_new dataset on the customerID feature. Since the final dataset will have NaN values for customers who have made fewer than four purchases, we can decide to drop them, which focuses the analysis on customers who have made at least four purchases.
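
A compact sketch of these interval statistics and of the final merge is shown below; it summarizes the gaps between consecutive purchases with their mean and standard deviation rather than keeping four separate interval columns, so it is an approximation of the step described above, reusing df_1, df_2 and df_new from the previous sketch.

```python
import pandas as pd

# Day difference between consecutive purchases of the same customer
purchases = (pd.concat([df_1, df_2])
               .sort_values(['CustomerID', 'Date']))
purchases['PrevDate'] = purchases.groupby('CustomerID')['Date'].shift(1)
purchases['DayDiff'] = (purchases['Date'] - purchases['PrevDate']).dt.days

# Mean and standard deviation of the purchase intervals per customer
agg = (purchases.groupby('CustomerID')['DayDiff']
                .agg(DayDiffMean='mean', DayDiffStd='std')
                .reset_index())

# Final training table: target and interval statistics joined on
# CustomerID; drop customers with missing interval statistics
df_final = (df_new.merge(agg, on='CustomerID', how='left')
                  .dropna())
```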

1.2.1 Scaling for categorical features

Since we have categorical features like productID, customerID and brandID, we should encode them for the ML model. Typically, learning algorithms expect the input to be numeric, which requires that non-numeric features (called categorical variables) be converted. One popular way to convert categorical variables is the one-hot encoding scheme. One-hot encoding creates a “dummy” variable for each possible category of each non-numeric feature. For example, assume someFeature has three possible entries: A, B, or C. We then encode this feature into someFeature_A, someFeature_B and someFeature_C.

Even though we could perform one-hot encoding on a large dataset, based on experience we can say that one-hot encoding is a very crude technique for features with lots of unique values. For example, if a feature has 1000 unique values, one-hot encoding will create 1000 new features, which blows up the dataset significantly.

Therefore, algorithms that can deal well with categorical features, namely tree-based algorithms such as Random Forest or Gradient Boosting methods, can be perfect candidates for such a dataset.

On the other hand, for linear or neural-network-based models, a better technique is mean encoding, which is based on grouping the observations by each unique value of the feature and substituting the feature values with the mean value of the target variable for each group. This technique loses some of the information contained in the feature, but it does not require blowing up the dataset. So it would be good to observe the results of both methodologies, with and without scaling.
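
Both encoding options can be sketched as follows, on a hypothetical training frame `data` that contains the categorical columns and the day-interval target; the column names are assumptions.

```python
import pandas as pd

# Option 1: one-hot encoding, reasonable for low-cardinality features
data = pd.get_dummies(data, columns=['Category', 'Device'])

# Option 2: mean (target) encoding for high-cardinality features such as
# BrandID: replace each value with the mean of the target in its group
brand_means = data.groupby('BrandID')['DayInterval'].mean()
data['BrandID_mean_enc'] = data['BrandID'].map(brand_means)
```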

1.2.2 Scaling for numerical features

If we decide to use tree-based algorithms such as Random Forest, we can use the features straight away without performing any scaling, because tree-based models effectively work with bins of values (for histogram-based implementations the number of bins is a hyperparameter, often around 256 by default), and all values inside a bin are treated equally. However, for features with a great number of unique values, some distinct values will end up in the same bin and hence look identical from the tree's point of view. It can therefore be better to produce the label encoding carefully, so that similar values are encoded with similar numbers.

However, if we decide to use NNs or boosting algorithms, standardization (scaling) should be beneficial, so let’s perform scaling for that option.
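
A minimal scaling sketch, assuming the numeric feature columns listed here exist in the final training frame:

```python
from sklearn.preprocessing import StandardScaler

# Standardise the numeric feature columns before using NNs or boosting;
# the column list is an illustrative assumption.
num_cols = ['DayDiffMean', 'DayDiffStd']
scaler = StandardScaler()
df_final[num_cols] = scaler.fit_transform(df_final[num_cols])
```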

1.3 ML Model

Before building the ML model, we need to decide the classes according to the next purchase day, which we calculated first in the feature engineering part. In order to decide the classes, we can use descriptive statistics such as the mean, standard deviation, minimum, 25%, 50%, 75%, and maximum values. Using these quartile values, we can create classes at first glance like:

  • Class_1: between the minimum and the 25% quartile
  • Class_2: between the 25% and 50% quartiles
  • Class_3: between the 50% and 75% quartiles
  • Class_4: greater than the 75% quartile

After deciding the classes, we will replace the day interval values with the class labels.
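
A possible sketch of this binning, using the quartiles of the day interval in `df_final` (names are illustrative):

```python
import pandas as pd

# Quartiles of the day interval: min, 25%, 50%, 75%, max
q = df_final['DayInterval'].describe()

# Replace the day interval with a class label from 1 to 4
df_final['Class'] = pd.cut(
    df_final['DayInterval'],
    bins=[-1, q['25%'], q['50%'], q['75%'], float('inf')],
    labels=[1, 2, 3, 4]).astype(int)
```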

As a first step, we can calculate the correlation matrix and observe the relationship between features and our target label which is the calculated class.
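
For example, with pandas this can be as simple as:

```python
# Correlation of the numeric features with the target class
corr = df_final.select_dtypes('number').corr()
print(corr['Class'].sort_values(ascending=False))
```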

For the selection of the ML model, we will perform training with three different models and select the best one to proceed with hyperparameter tuning. We have chosen the Decision Tree Classifier, Random Forest, and AdaBoost.

1.3.1 Decision Tree

In decision analysis, a visual and explicit representation of decisions and the decision-making procedure can be obtained by using decision trees. A decision tree uses a tree-like model of decisions: a flowchart-like tree structure in which each internal node denotes a test on a feature, each branch represents an outcome of the test, and each leaf node holds a class label. The decision tree is one of the most powerful and popular tools for classification and prediction.

Strengths

  • Decision trees are able to handle both continuous and categorical variables.
  • Decision trees provide a clear indication of which fields are most important for prediction or classification.
  • Decision trees are able to generate understandable rules.
  • Non-linear relationships between parameters do not affect tree performance.
  • They can also be used in the data exploration stage, for example to create new variables/features that have better power to predict the target variable.
  • They require less data cleaning compared to some other modeling techniques and are not influenced by outliers and missing values to a fair degree.
  • Decision trees make no assumptions about the space distribution or the classifier structure.

Weaknesses

  • Overfitting is one of the most practical difficulties for decision tree models. This problem can be addressed by setting constraints on model parameters and by pruning.
  • While working with continuous numerical variables, decision trees lose information when they discretize the variables into categories.
  • Decision trees can be unstable, because small variations in the data might result in a completely different tree being generated.
  • Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
  • Decision trees can be computationally expensive to train: at each node, each candidate splitting field must be sorted before its best split can be found.

1.3.2 Random Forest

Random Forests are usually a safe bet, as they generally achieve a high average accuracy for most cases and work well for complex classification tasks as well.

Strengths

  • Scales quickly and has the ability to deal with unbalanced and missing data.
  • Generates an internal unbiased estimate of generalization error as forest building progresses.
  • Provides an experimental way to detect variable interactions.

Weaknesses

  • Less effective on noisier, larger datasets with overlapping classes.
  • Large number of trees may lead to slow real-time prediction in some cases.

1.3.3 AdaBoost

AdaBoost, which is the abbreviation of Adaptive Boosting, is the first practical boosting algorithm, proposed by Freund and Schapire in 1996. AdaBoost is an algorithm for constructing a “strong” classifier as a linear combination of “simple”, “weak” classifiers. In other words, the main working principle is to convert a set of weak classifiers into a strong one. A weak classifier is one with less than 50% error over any distribution, and the strong classifier is a thresholded linear combination of the weak classifiers' outputs.

Strengths

  • The implementation of the algorithm is simple.
  • Feature selection is performed, resulting in a relatively simple classifier.
  • It can provide good generalization.

Weaknesses

  • The algorithm is sensitive to noisy data and outliers.
  • The provided solution can be suboptimal.
  • The algorithm may not cope well with increasing complexity.

Using AdaBoost with a Decision Tree Classifier as the base learner can improve the performance of the model, as sketched below.
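
A minimal sketch of such a combination is given below; the hyperparameter values are illustrative, and note that in scikit-learn 1.2+ the argument is named `estimator` rather than `base_estimator`.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost with a shallow decision tree as the weak learner
ada = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    learning_rate=0.1,
    random_state=42)
```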

To properly evaluate the performance of each model we've chosen, it is important to observe the following:

  • the F score on the test set when 100% of the training data is used,
  • prediction/training time
  • the algorithm’s suitability for the data.

According to these three criteria, the best model can be chosen; a comparison sketch is given below.
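
The following loop is one possible way to run the comparison, assuming a feature matrix `X` (as a DataFrame) and class labels `y` built from the final training frame; names and parameters are illustrative.

```python
import time
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a test set for the comparison
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start

    start = time.time()
    pred = model.predict(X_test)
    pred_time = time.time() - start

    score = fbeta_score(y_test, pred, beta=0.5, average='weighted')
    print(f'{name}: F-score={score:.3f}, '
          f'train={train_time:.2f}s, predict={pred_time:.2f}s')
```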

1.4 Model Tuning

In machine learning applications, we need to optimize various hyper-parameters such as the learning rate, the regularization factor, etc. In many settings, we cannot try all combinations because their number grows exponentially, so standard optimization procedures cannot be used to find the optimal points. The goal of hyper-parameter optimization is to find the set of hyper-parameter values that minimizes the validation error. For each set of hyper-parameters, we need to solve the training problem. The hyper-parameters should be selected carefully to find the combination of values that provides the minimum loss/error or the maximum accuracy for the defined problem. One of the most widely used hyper-parameter tuning methods is Grid Search.

Grid search can be defined as finding the parameters that provide the best performance. The technique is based on trying different sets of hyper-parameters to find and validate the best performance and accuracy on the dataset. Here, hyper-parameters are the parameters that are not directly learnt within the estimators. In the first step, the grid search algorithm investigates a small set of values for each hyper-parameter and then trains the model for each combination. Finally, the best parameter values are selected; if they are not satisfactory, the procedure is repeated with a refined grid. In this way, grid search can be used to find the best classifier over different combinations of parameters: for every set of hyper-parameters, it evaluates the validation error and picks the combination that gives the minimum value. There is no guarantee that the search will produce the perfect solution, as it usually only finds one by searching around the right set.

Overfitting is a major problem in machine learning. In order to prevent overfitting and obtain a model that generalizes, the dataset is split into train and test data. However, overfitting is still a risk, so a further partition, the validation data, is used. Directly splitting the train set into train and validation sets would reduce the number of samples used for learning, so instead the training dataset is split into k smaller sets, a procedure called cross validation. K-fold cross validation takes the training dataset and separates it into k groups; in each round, training and validation take place on different groups, with different regularization parameters tried to maximize the model performance, and in the end the average of the scores is used to evaluate the model. The grid search is applied within each fold rather than on the overall dataset, which makes the optimization process easier.
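
A sketch of grid search combined with k-fold cross validation is shown below, assuming the Random Forest was selected and reusing the train split from the comparison sketch above; the parameter grid is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative hyper-parameter grid for the Random Forest
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
}

# 5-fold cross-validated grid search over all combinations
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_
```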

An important task when performing supervised learning on a dataset like the one studied here is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do.
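
With a tree-based model, the importances can be read directly from the fitted estimator, for example:

```python
import pandas as pd

# Inspect feature importances of the tuned tree-based model;
# `best_model` and `X_train` come from the grid search sketch above.
importances = pd.Series(best_model.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))
```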

After tuning the model, we have the final model and are ready to make predictions about the next purchase days.

References

[1] https://www.optimove.com/resources/learning-center/rfm-segmentation

[2] https://www.putler.com/rfm-analysis/

[3] https://www.saksoft.com/predicting-customer-next-purchase-day/

[4] https://essay.utwente.nl/74808/1/seippel_MA_eemcs.pdf

[5] https://community.alteryx.com/t5/Alteryx-Designer-Discussions/predicting-when-next-saleswill-take-place/td-p/46524

[6] https://www.kdnuggets.com/2020/05/hyperparameter-optimization-machine-learningmodels.html
