Customer Loyalty Classification

Zhaoze
Feb 24, 2019

--

The third project at the Metis data science bootcamp was to identify loyal customers for a loyalty program at Elo, one of the largest credit card companies.

Almost all successful companies share the same goal: growth. Growth can be achieved in two ways: acquiring new customers or retaining existing ones. Since the cost of acquiring a new customer is generally high, the ability to retain existing customers is worth a lot of money in the long run.

To boost growth, Elo is considering a loyalty program, a reward program a company offers to reward customer loyalty. Why does a loyalty program matter? Based on the analysis, existing customers are much easier to sell to than new ones: a loyal customer can typically spend up to 10 times more on the card. Given that, the target of a loyalty program is LOYAL CUSTOMERS.

The analysis consists of four parts: data aggregation, exploratory data analysis, model selection, and visualization. The data for this project comes from Kaggle, with approximately 20 million transaction records and around 30 features.

The question I am exploring is: how do we identify loyal customers? The original data doesn't contain much beyond the user ID, merchant ID, and datetime of each transaction. Since this is a binary classification problem, I need an adequate number of features to raise the precision of my final predictions; the first challenge, therefore, is to expand the feature set.

Much of the effort went into feature engineering: a transaction profile (e.g., number of transactions, frequency of transactions per month, accumulated transactions to date) and a merchant profile.
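The transaction-profile aggregation can be sketched with a pandas `groupby`. This is a minimal illustration, not the project's actual pipeline; the column names (`card_id`, `purchase_amount`, `purchase_date`) are assumptions and the real Elo schema may differ.

```python
import pandas as pd

# Toy stand-in for the raw transaction table.
transactions = pd.DataFrame({
    "card_id": ["C1", "C1", "C1", "C2", "C2"],
    "purchase_amount": [10.0, 25.0, 5.0, 40.0, 15.0],
    "purchase_date": pd.to_datetime([
        "2018-01-05", "2018-01-20", "2018-02-03",
        "2018-01-10", "2018-03-15",
    ]),
})

# Calendar month of each transaction, used for frequency features.
transactions["month"] = transactions["purchase_date"].dt.to_period("M")

# One row per customer: counts, totals, and active months.
profile = transactions.groupby("card_id").agg(
    n_transactions=("purchase_amount", "size"),
    total_amount=("purchase_amount", "sum"),
    n_active_months=("month", "nunique"),
)

# Average transactions per active month, one of the frequency features.
profile["monthly_freq"] = profile["n_transactions"] / profile["n_active_months"]
print(profile)
```

The same pattern extends to the merchant profile by grouping on a merchant ID instead.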

I trained five models on the training set: Gaussian Naive Bayes, K-nearest neighbors, logistic regression, random forest, and SVM. Based on the ROC curves drawn from the validation results, random forest performed best: it produced the largest area under the ROC curve, meaning it did the best job of separating positive from negative cases among the five models.
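The model comparison can be sketched as follows: fit each of the five classifiers and rank them by validation ROC AUC. This uses synthetic data in place of the engineered Elo features, and default hyperparameters rather than whatever tuning the project actually used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}

# Fit each model and score it by area under the ROC curve on validation data.
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_val, scores):.3f}")
```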

Recommendations:

Knowing which days of the week or month, or even which hours of the day, customers are likely to use their credit card can help the company schedule marketing campaigns, staff up for spikes in customer service requests, or otherwise optimize operations. Here are some interesting stories from the exploratory data analysis.

1. Customers buy more on the 1st, 10th, 20th, and 30th of the month. This may be connected to the dates when people receive payroll: there is a noticeable peak in customer expenditure on paydays (some are paid semimonthly and some biweekly).

2. Shoppers love Thursday and Sunday. On average, Sundays and Thursdays are the highest-trafficked days, with the fourth day of the typical workweek producing noticeably more sales.

3. Just as knowing that Thursday and Sunday are strong shopping days should inform marketing decisions, so should understanding when consumers shop during the day. The average purchase amount starts rising as early as 6:00 am, peaks around noon, levels off until about 8:00 pm, and then falls as most shoppers turn in for the night.
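The day-of-week and hour-of-day breakdowns behind these observations come down to extracting datetime components and grouping on them. A minimal sketch, with made-up timestamps for illustration:

```python
import pandas as pd

# Toy purchase log; the real analysis runs over millions of rows.
purchases = pd.DataFrame({
    "purchase_date": pd.to_datetime([
        "2018-01-04 12:10", "2018-01-04 18:30",  # a Thursday
        "2018-01-07 11:05",                      # a Sunday
        "2018-01-02 09:45",                      # a Tuesday
    ]),
    "purchase_amount": [30.0, 20.0, 50.0, 10.0],
})

# Extract the calendar components the EDA groups on.
purchases["weekday"] = purchases["purchase_date"].dt.day_name()
purchases["hour"] = purchases["purchase_date"].dt.hour

# Total spend per weekday, average purchase amount per hour.
by_day = purchases.groupby("weekday")["purchase_amount"].sum()
by_hour = purchases.groupby("hour")["purchase_amount"].mean()
print(by_day)
print(by_hour)
```

The same grouping on day-of-month reproduces the payday-peak observation in point 1.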

Feature importance analysis identified two important features: monthly purchase frequency and monthly purchase amount. In the plot, I grouped the two non-loyal categories together for easier interpretation. The plots show that loyal customers have a higher purchase frequency and purchase amount than customers in the other categories.

The loyalty program has a limited number of winners, so I chose precision as the scoring metric: I want as many as possible of the customers I predict as loyal to actually be loyal. The results table shows that random forest has the highest precision. Besides the algorithms listed in the table, I also tried logistic regression, K-nearest neighbors, and SVM, but they were computationally expensive or hard to converge.
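Precision answers exactly the question above: of all customers flagged as loyal, how many truly are? A small hand-made example with scikit-learn's `precision_score` (the labels here are invented for illustration):

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = actually loyal
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = predicted loyal

# 4 customers predicted loyal, 3 of them correct -> precision = 3/4.
print(precision_score(y_true, y_pred))  # 0.75
```

Note that the missed loyal customer at index 4 lowers recall, not precision; with a fixed number of program winners, that trade-off is acceptable.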

To improve the final model's performance, I also tried under-sampling and oversampling, but they made only a minor difference.
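The oversampling experiment can be sketched with scikit-learn's `resample`; the original may have used a different tool (e.g., imbalanced-learn), and the 10% minority rate here is an assumption for illustration.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.array([1] * 10 + [0] * 90)  # 10% loyal: an imbalanced label

# Split by class, then resample the minority (loyal) class with
# replacement until it matches the majority class size.
X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)

# Recombine into a balanced training set.
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))  # [90 90]
```

Under-sampling is the mirror image: draw from the majority class without replacement down to the minority class size.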

Now, let me illustrate how the random forest model works. I ran the model behind a Flask app with a D3 visualization:

The tools I used were Flask, D3, JavaScript, Python, and Tableau.
