Predict Customer Churn with Machine Learning


Written by Chidi Prince John.

Managing customer churn is one major challenge facing companies, especially those that offer subscription-based services. Customer churn (aka customer attrition) can be defined as the loss of customers, and it is caused by a change in taste, lack of proper customer relationship strategy, change of residence and several other reasons.

In this article, I will employ the superpowers of machine learning to assist a hypothetical company in predicting customer churn. If businesses can effectively predict customer attrition, they can segment those customers that are highly likely to churn and provide better services to them. In this way, they can achieve a high customer retention rate and maximize their revenue.

Our dataset contains demographic details of customers, their total charges, and the type of service they receive from the company. In developing this model, I will use RStudio, a kernel SVM machine learning model, and some business insight.

First, let’s import and view our dataset.

R Code for Importing and Viewing Data
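The original import code appears only as an image; a minimal sketch of the same step, assuming the data lives in a CSV file (the file name churn.csv is a placeholder):

```r
# Import the customer data (file name is an assumption)
dataset <- read.csv("churn.csv", stringsAsFactors = FALSE)

# View the data to understand the variables
View(dataset)  # opens the spreadsheet-style viewer in RStudio
str(dataset)   # column types and sample values
```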

After importing the data, we can view it to understand the variables.

Dataset

Our data contains details of our customers. However, we can see that most of the variables are categorical rather than numeric, which is a problem for the computations our model will need to make. To solve this, we will apply our business insights and create numeric variables.

Before creating numeric variables, let’s check for missing values in our data. The code below checks for missing values in each column. We could easily check the entire dataset at once, but checking column by column identifies exactly which columns contain the missing values.

Identifying Missing Values
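A sketch of the column-by-column check might look like this:

```r
# Count missing values in every column individually
colSums(is.na(dataset))

# Equivalent check for a single column
sum(is.na(dataset$TotalCharges))
```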

After running the code, we will see the sum of missing values for each column.

11 Missing Values for TotalCharges

There are 11 missing values for TotalCharges and that column is the only one in our data that has missing values.

There are several ways of handling missing values in data science. Sometimes we fill the gaps with the column means. In this case, however, we will simply omit the rows with missing values. The code in the image below removes them from the dataset.

Omitting Missing Values
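A minimal sketch of this step:

```r
# Drop every row that contains at least one NA
dataset <- na.omit(dataset)

# Confirm that no missing values remain
sum(is.na(dataset))  # should print 0
```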

The code above removes all missing values from the dataset and leaves us with data devoid of NAs.

Next, we will apply our business insights to create dummy variables for most of our non-numeric variables. The image below shows the variables and their dummies. I used the ifelse function in R to create them; you could also use the factor function.

Creating Dummy Variables using ifelse in R
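The exact variables are listed in the image above; the pattern itself looks like the following sketch, where the column names and level labels (Male, Yes) are assumptions based on typical telco churn data:

```r
# Binary categorical variables become 0/1 indicators
dataset$gender       <- ifelse(dataset$gender == "Male", 1, 0)
dataset$Partner      <- ifelse(dataset$Partner == "Yes", 1, 0)
dataset$PhoneService <- ifelse(dataset$PhoneService == "Yes", 1, 0)
dataset$Churn        <- ifelse(dataset$Churn == "Yes", 1, 0)
```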

After creating the dummy variables, we might want to perform exploratory analysis on our data to check for outliers, distributions and other insights. There are several ways to do this; most data scientists use charts, but I prefer the sqldf package in R for exploratory analysis.

Exploratory Analysis using sqldf on R
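A sketch of the kind of queries sqldf makes possible; the column names here are assumptions:

```r
library(sqldf)

# Churn rate and customer count by contract type
sqldf("SELECT Contract, AVG(Churn) AS churn_rate, COUNT(*) AS n
       FROM dataset
       GROUP BY Contract")

# Spread of TotalCharges, to spot outliers
sqldf("SELECT MIN(TotalCharges) AS min_charge,
              MAX(TotalCharges) AS max_charge,
              AVG(TotalCharges) AS avg_charge
       FROM dataset")
```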

The above image shows the exploratory analysis I performed on each column of the dataset. This analysis helped me understand the variables and check for outliers and dirty data.

After performing exploratory analysis, we might want to drop unwanted columns from the dataset and encode the remaining categorical variables as dummies.

Drop Unwanted Variables and Encode Dummy Variables
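A sketch of this step, with assumed column names and levels:

```r
# Drop an identifier column that carries no predictive signal
dataset$customerID <- NULL

# Encode a remaining multi-level variable as numeric codes
dataset$Contract <- as.numeric(factor(
  dataset$Contract,
  levels = c("Month-to-month", "One year", "Two year")
))
```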

Let’s view our dataset.

Numeric Dataset

We can see that we have gotten rid of our non-numeric variables. Now, we can perform our analysis on this dataset.

Next, we can see that the ranges of values in our dataset are far apart. Some columns are encoded as binary variables while others contain large numbers. To bring every column onto a comparable scale, we will apply feature scaling and then split our data into a training set and a test set. The training set will be used to train the model while the test set will be used to evaluate it. We will split the data into training and test sets in a ratio of 3:1.

Feature Scaling and Data Splitting
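A sketch using the caTools package, assuming the Churn column sits in position 26:

```r
library(caTools)
set.seed(123)  # make the split reproducible

# 3:1 split, stratified on the Churn column
split        <- sample.split(dataset$Churn, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set     <- subset(dataset, split == FALSE)

# Scale every predictor; column 26 (Churn, assumed position) stays as-is
training_set[-26] <- scale(training_set[-26])
test_set[-26]     <- scale(test_set[-26])
```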

We can also notice that our dataset has numerous variables: 26 in all, 25 of them predictors. To reduce the dimensionality of the data while retaining as much of its information (variance) as possible, we will perform dimensionality reduction through Principal Component Analysis (PCA).

Principal Component Analysis (PCA)
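A sketch using caret’s preProcess, again assuming Churn is column 26; caret names the extracted components PC1 and PC2:

```r
library(caret)

# Fit PCA on the training predictors and keep two components
pca <- preProcess(training_set[-26], method = "pca", pcaComp = 2)

# Transform both sets; predict() returns Churn first, then the
# components, so reorder the columns to PC1, PC2, Churn
training_set <- predict(pca, training_set)[c(2, 3, 1)]
test_set     <- predict(pca, test_set)[c(2, 3, 1)]
```

Note that PCA is fitted on the training set only and then applied to the test set, so no information leaks from the test data.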

Let’s view our data with two predictors and one dependent variable.

2 Predictors and 1 Dependent Variable

Now, it’s time to fit our model. For this problem, we will train a kernel support vector machine (kernel SVM) and predict the test-set results.

Kernel-SVM Model with Prediction
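A sketch using the e1071 package, with the Gaussian (radial basis) kernel:

```r
library(e1071)

# Classification needs a factor outcome
training_set$Churn <- factor(training_set$Churn)
test_set$Churn     <- factor(test_set$Churn)

# Fit a kernel SVM on the two principal components
classifier <- svm(formula = Churn ~ .,
                  data = training_set,
                  type = "C-classification",
                  kernel = "radial")

# Predict churn on the test set (column 3 is Churn)
y_pred <- predict(classifier, newdata = test_set[-3])
```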

To ascertain the false positives and false negatives in our model, we will use a confusion matrix.

Confusion Matrix
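A minimal sketch of building the matrix:

```r
# Rows are actual outcomes, columns are predictions
cm <- table(test_set[, 3], y_pred)
cm
```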
Confusion Matrix Result

The confusion matrix shows that we predicted customer churn correctly in 1384 out of 1758 test cases, with 133 false positives and 241 false negatives. However, we should be careful not to judge the model on a single confusion matrix, since one train/test split is subject to chance. A better way to estimate the accuracy of our model is through validation; in this case, we will use k-fold cross-validation.

k-fold Cross Validation

k-fold cross-validation limits the effect of chance when measuring the accuracy of your model: instead of relying on one split, it trains and evaluates the model on k different folds and averages the results.
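A sketch using caret’s createFolds, training one kernel SVM per fold and averaging the fold accuracies:

```r
library(caret)

folds <- createFolds(training_set$Churn, k = 10)

cv <- lapply(folds, function(x) {
  training_fold <- training_set[-x, ]
  test_fold     <- training_set[x, ]
  classifier <- svm(formula = Churn ~ .,
                    data = training_fold,
                    type = "C-classification",
                    kernel = "radial")
  y_pred <- predict(classifier, newdata = test_fold[-3])
  cm <- table(test_fold[, 3], y_pred)
  (cm[1, 1] + cm[2, 2]) / sum(cm)  # accuracy on this fold
})

mean(as.numeric(cv))  # average accuracy across the 10 folds
```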

Model Accuracy ~80%

We can see that the accuracy of our model is ~80%. While this is a decent result, it shows that we can still improve the model.

The plotting code below creates a high-resolution chart that shows our false positives and false negatives.
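The original plotting code appears only as an image; a sketch of one common way to draw such a chart in base R, classifying a fine grid over the two principal components to colour the regions:

```r
set <- test_set
X1 <- seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.02)
X2 <- seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.02)
grid_set <- expand.grid(X1, X2)
colnames(grid_set) <- c("PC1", "PC2")

# Predict a class for every grid point to colour the two regions
y_grid <- predict(classifier, newdata = grid_set)

plot(set[, -3], main = "Kernel SVM (test set)",
     xlab = "PC1", ylab = "PC2",
     xlim = range(X1), ylim = range(X2))
points(grid_set, pch = ".",
       col = ifelse(y_grid == 1, "springgreen3", "tomato"))
points(set[, -3], pch = 21,
       bg = ifelse(set[, 3] == 1, "green4", "red3"))
```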

Kernel-SVM Chart

The red portion of the chart shows the region predicted as not-churn (0) while the green portion shows the region predicted as churn (1). The red dots are customers who did not churn, while the green dots are customers who did. What insights can we derive from this chart?

We can see that there are more green dots in the red area than red dots in the green area. The green dots in the red area are our false negatives: we wrongly predicted not-churn for customers who churned. On the other hand, the red dots in the green area are our false positives: we predicted churn for customers who did not churn.

The boundary separating the two regions is a curved line (almost a circle). This shows that our kernel SVM is not a linear model; for a linear model, the separating boundary would be a straight line. A non-linear model is better in this case because its decision boundary is flexible enough to follow the shape of the data.

I hope you enjoyed my article. Don’t forget to give me a clap and share your comments below.
