Improving Customer Retention using Machine Learning

Paul Wanyanga
Published in Analytics Vidhya
4 min read, Apr 13, 2021

Business Perspective

Having a model that helps you retain customers you have spent money acquiring is vital to the success of any business. Given that the cost of attracting a new customer is often put at five times the cost of keeping an existing one, businesses need to pay as much attention to retaining customers as they do to acquiring new ones. In fact, according to Bain & Co, increasing customer retention by just 5% increases profits by 25% to 95%.

So how do businesses improve customer retention? While there are numerous ways a business can retain customers, it can be hard to predict when a customer is likely to leave. This is because most disgruntled customers don’t speak up; they just leave. In fact, statistics show that 96% of unhappy customers don’t complain and 91% of those will simply leave and never come back. As a result, relying on outward signs of dissatisfaction to predict churn can be quite challenging. However, with advances in technology, businesses are increasingly adopting machine learning to build models that predict customer churn from customer data.

This post aims to illustrate how businesses can leverage machine learning to reduce churn by predicting which customers are likely to leave. It uses the Telco sample data from IBM to build a classification model that predicts customer churn from the given features. Several classification models are compared and the best is chosen. The target variable ‘Churn’ is binary, with two possible outcomes: ‘Churn’ and ‘No Churn’. You can access the complete code from my GitHub repository.

Exploratory Data Analysis

Tenure represents the number of months the customer has been with the company. The histogram of tenure reveals that most customers have been with the company for less than 5 months or for more than 70 months. The total amount charged to the client was skewed to the right, with most charges below 2000. There was no evidence of multicollinearity.
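One common way to look for multicollinearity is to scan the pairwise correlations between numeric features. The sketch below is illustrative only: the column names `x1`, `x2` and `x3`, the toy values, the helper `high_corr_pairs` and the 0.8 cutoff are all assumptions, not taken from the original analysis. The pair `x1`/`x2` is deliberately correlated to show what a flagged pair looks like.

```python
import pandas as pd

# Toy numeric features (hypothetical, not the Telco data).
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 12.5],   # nearly a multiple of x1
    "x3": [5, 1, 4, 2, 6, 3],       # unrelated to x1
})

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, |r|) for each column pair with |Pearson r| >= threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if r >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

pairs = high_corr_pairs(toy)
print(pairs)
```

An empty list from such a check would be consistent with the "no evidence" finding above, although pairwise correlation does not catch every form of multicollinearity.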

The majority of customers who had churned had short tenure, while those who hadn’t churned had significantly longer tenure.

With respect to payment method, accounts billed by electronic check accounted for the highest amount of churn. Among internet service types, fiber optic had the highest level of churn. Customers who were not provided with technical support also had the highest level of churn.
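Comparisons like these can be produced with a simple group-by. The sketch below uses a tiny hand-made frame; the column names `PaymentMethod` and `Churn`, the `Yes`/`No` coding and the helper `churn_rate_by` are assumptions for illustration. It reports the churn rate within each category, which may differ from the raw churn counts the post refers to.

```python
import pandas as pd

# Toy stand-in for the Telco data; names and values are assumptions.
df = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Electronic check", "Mailed check",
                      "Credit card", "Electronic check", "Mailed check"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

def churn_rate_by(frame, column):
    """Share of churned customers within each category of `column`."""
    churned = frame["Churn"].eq("Yes")          # boolean: True where churned
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)

rates = churn_rate_by(df, "PaymentMethod")
print(rates)
```

The same helper could be pointed at any other categorical column, such as internet service type or technical support, assuming those columns exist in the frame.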

Modelling

For modelling, a baseline model based on a random forest classifier was used, and it achieved an accuracy of 0.796.
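A baseline of this kind might look like the following sketch with scikit-learn. The data here is synthetic (generated with `make_classification`), and the split size, seed and number of trees are assumptions, so the printed accuracy will not match the 0.796 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared Telco features (an assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

# Hold out 25% of the rows, keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

baseline = RandomForestClassifier(n_estimators=100, random_state=0)
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```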

Accuracy did not change after using stratified cross-validation, so the analysis sought a different classifier to see whether any improvement in accuracy could be achieved.
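Stratified cross-validation keeps the share of churners roughly equal in every fold, which matters when one class is much smaller than the other. Here is a minimal sketch, assuming five folds and a synthetic data set with roughly 27% positives; both choices are assumptions, not details from the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data (an assumption standing in for the Telco set).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Positive-class share in each held-out fold; stratification keeps these close.
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print("Positive share per fold:", np.round(fold_rates, 3))

# Accuracy of the random forest on each of the five held-out folds.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=skf, scoring="accuracy")
print("Mean CV accuracy:", round(scores.mean(), 3))
```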

A gradient boosting classifier was then used with cross-validation, and an accuracy of 0.80 was obtained. This classifier was adopted for the final model.
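Comparing the two classifiers on identical folds makes the accuracy difference easier to interpret. The sketch below shows one way to do that; the synthetic data, the seeds and the dictionary keys are assumptions, and which model wins here says nothing about the Telco results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=1)

# Reuse one splitter so both models are scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "random_forest": RandomForestClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, "->", best_name)
```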

Feature Importance

A feature importance analysis was conducted. The variables contract (month to month), tenure, internet service (fiber optic), total charges and monthly charges were the most important to the model.
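Tree ensembles in scikit-learn expose impurity-based importances through the `feature_importances_` attribute, which is one plausible way a ranking like this could be obtained. The sketch below assumes that approach and uses synthetic columns with hypothetical names (`strong_signal`, `weak_signal`, `noise`), so the ranking it prints is unrelated to the Telco features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic features with hypothetical names; only the first two drive the target.
X = pd.DataFrame({
    "strong_signal": rng.normal(size=400),
    "weak_signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y = (X["strong_signal"] + 0.3 * X["weak_signal"]
     + 0.1 * rng.normal(size=400) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importances sum to 1; sort so the most influential feature comes first.
importances = pd.Series(model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour high-cardinality or correlated features, so a ranking like this is best read as a guide rather than a causal statement.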

The final model

Grid search with cross-validation was used to select the final model. The best model selected had an accuracy of 0.814.
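The post does not list the hyperparameters that were searched, so the grid in the sketch below is purely an assumption, as are the synthetic data and the three-fold split. It illustrates the general pattern with `GridSearchCV` wrapped around a gradient boosting classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=2)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=2),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=2),
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the winning model refitted on all the data passed to `fit`, which is typically what gets evaluated on a held-out test set.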

Model Evaluation

The final model was evaluated using the area under the curve (AUC) together with a classification report. The overall accuracy was 0.814 and the area under the curve was 0.89.
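Both metrics are available in `sklearn.metrics`. The sketch below assumes the curve in question is the ROC curve and computes its area from predicted churn probabilities; the synthetic data, the class labels passed as `target_names` and the variable names are assumptions, and the numbers will differ from those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (an assumption standing in for the Telco set).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3)

model = GradientBoostingClassifier(random_state=3).fit(X_train, y_train)

# ROC AUC needs scores, not hard labels: use the probability of class 1.
churn_probability = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, churn_probability)

# Per-class precision, recall and F1, plus overall accuracy.
report = classification_report(y_test, model.predict(X_test),
                               target_names=["No Churn", "Churn"])
print(f"ROC AUC: {auc:.3f}")
print(report)
```

Because accuracy can look healthy on imbalanced data even when churners are missed, the per-class recall in the report is worth reading alongside the AUC.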

Conclusion

The accuracy achieved in this write-up is about what most Kaggle competitions have achieved, and slightly below AdaBoost, which achieved an accuracy of 0.82. We can thus conclude that this would be a good baseline upon which to build churn prediction models in future.
