How to create a churn prediction model

Luis Eduardo
Published in Neuronio · 6 min read · Jun 19, 2019

Note: a Portuguese version of this article is available at “Como criar um modelo para predição de churn”.

Losing customers is not good, and if you can predict when a customer will stop using the service, you have an opportunity to act to retain them. Our goal in this article is to create and compare Churn Prediction Machine Learning models.

Data is the most important part of a Machine Learning model. A good model can't work miracles with poor data, so it is important to prepare the data for the model in order to get better results.

This article uses the Telco Customer Churn dataset, available on Kaggle. All the code for this example is in Python and is available here.

Let’s start.

The Dataset

To work with the dataset we will use the pandas library. First, let’s read the dataset.
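A minimal sketch of that step (the filename below is the one the dataset ships with on Kaggle; adjust the path to your setup):

```python
import pandas as pd

# Load the Telco Customer Churn CSV downloaded from Kaggle
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
```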

We can see some information about the dataset with the function pandas.DataFrame.info(). This function gives us an overview of the DataFrame: all the columns, the type of each column and the number of non-null values in each column.
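For example:

```python
# Column names, dtypes and non-null counts in one summary
df.info()
```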

The dataset info

We don’t need to handle missing values and we just have a few numerical features. In this dataset we already have the label we want to predict, the Churn column, but if it’s not included we need to create it based on its definition. If the definition is not available is necessary to define it before labelling.

The dataset refers to a telecommunications company. There is information about customers, like Partner, Dependents, etc., and about their contracts, like PhoneService, MonthlyCharges, etc. Some features show how long the customer has been using the service: tenure represents the time in months that the customer has had the service, and TotalCharges represents the total amount paid by the customer.

Now, let’s see the churn distribution with seaborn library.

The label is unbalanced: we have 27% Churn and 73% non-Churn in this dataset. This imbalance can make it harder to get the best results from the model.

We can use the head() function to inspect the first 5 rows of the table.
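For example:

```python
# First 5 rows of the DataFrame
print(df.head())
```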

Data Preprocessing

In some Machine Learning problems we have to do some data pre-processing. This is an important step to prepare the data so that the ML model produces better results, and it also lets you gather knowledge about the data.

Let’s apply pre-processing now.

Looking at the data we can see that some features have three unique values, one of which is redundant. For example, the feature MultipleLines has the value “No phone service”, but we already have that information in the PhoneService feature, so we can just change it to “No” and keep two unique values. We can apply the same idea to the OnlineBackup, OnlineSecurity, DeviceProtection, TechSupport, StreamingMovies and StreamingTV features.
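One way to apply that change (a sketch; note that for the internet-related features the redundant third value is “No internet service” rather than “No phone service”):

```python
# MultipleLines duplicates information already in PhoneService
df["MultipleLines"] = df["MultipleLines"].replace("No phone service", "No")

# These features duplicate information already in InternetService
internet_cols = ["OnlineBackup", "OnlineSecurity", "DeviceProtection",
                 "TechSupport", "StreamingMovies", "StreamingTV"]
for col in internet_cols:
    df[col] = df[col].replace("No internet service", "No")
```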

We also have some numerical features. We can use functions like countplot() from seaborn and hist() from pandas to get an overview of their distributions and their relation to the label, as in the sketch below.
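A sketch for tenure; the same pattern applies to MonthlyCharges and TotalCharges (note that TotalCharges is read as text in the raw CSV, so it is coerced to numeric first):

```python
import matplotlib.pyplot as plt

# TotalCharges arrives as text; coerce it to numbers for plotting
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["tenure"].hist(ax=axes[0])  # overall distribution
for label, group in df.groupby("Churn"):
    group["tenure"].hist(ax=axes[1], alpha=0.5, label=label)  # split by label
axes[1].legend()
plt.show()
```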

Tenure count on the left and tenure x Churn on the right
MonthlyCharges distribution on the left and MonthlyCharges x Churn on the right
TotalCharges distribution on the left and TotalCharges x Churn on the right

It is possible to see different patterns in the three features. These features are continuous numerical values with somewhat skewed distributions, and we can apply some processing to change that. One option is to convert them into categorical values by splitting the values into predefined intervals or quantiles; this avoids the skewed distribution of the data, as in the sketch below.
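A sketch of the quantile option; the choice of quartiles (4 bins) and the bin labels are assumptions of this example, not something fixed by the method:

```python
# Bin each continuous feature into quartiles, creating the *_cat columns
for col in ["tenure", "MonthlyCharges", "TotalCharges"]:
    df[col + "_cat"] = pd.qcut(df[col], q=4,
                               labels=["low", "mid_low", "mid_high", "high"])
```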

Tenure_cat count on the left and tenure_cat x Churn on the right
MonthlyCharges_cat count on the left and MonthlyCharges_cat x Churn on the right
TotalCharges_cat count on the left and TotalCharges_cat x Churn on the right

Now all the features are categorical. We could use the features in that format, but some algorithms accept only numbers, so to try them we need to convert the features to numerical values. To do that we have some options, like One-Hot Encoding and Label/Integer Encoding.

Integer Encoding is a simple category labelling that maps the categories to numbers, implying a natural ordered relationship. One-Hot Encoding takes a different approach for features that don't have a natural ordinal relationship: each category is separated into its own column, filled with binary values.

To apply the changes we can use LabelEncoder from sklearn to create the Integer Encoding, and get_dummies() from pandas to get the One-Hot Encoding.
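A sketch of both encodings; dropping customerID (a unique identifier with no predictive signal) and the raw numerical columns (now replaced by their *_cat versions) is an assumption of this example:

```python
from sklearn.preprocessing import LabelEncoder

# Integer-encode the binary label: No -> 0, Yes -> 1
df["Churn"] = LabelEncoder().fit_transform(df["Churn"])

# Drop the unique ID and the raw numerical columns, then one-hot
# encode all the remaining categorical features
features = df.drop(columns=["customerID", "tenure",
                            "MonthlyCharges", "TotalCharges"])
df_encoded = pd.get_dummies(features)
```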

Now we can see the relationships between the features in our cleaned data with the heatmap() function from the seaborn library.
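For example:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 10))
# Pairwise correlation between the encoded columns
sns.heatmap(df_encoded.corr())
plt.show()
```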

Feature correlation heatmap

The Models

Now that our data is prepared for the models, we just need to split it to train and validate them. For that, sklearn provides train_test_split().
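A sketch of the split; the 80/20 ratio and the fixed seed are assumptions, since the article does not state them:

```python
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=["Churn"])
y = df_encoded["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```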

Let’s create three models with Logistic Regression, Random Forest and Gradient Boosting algorithms, we will user sklearn library to get the models.

Now that the models are trained, we can validate them with each model's predict() method.
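For example:

```python
# Predicted labels for the held-out set, per model
predictions = {name: model.predict(X_test)
               for name, model in models.items()}
```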

And we can see the results.

In our case, Accuracy is not a good metric to evaluate the models. It measures overall correctness, and since the dataset is unbalanced, the majority class inflates it in a way that does not reflect how well we predict the positive class. So it's better to choose other available metrics, such as Precision and Recall.

Precision is the percentage of correct predictions for a label out of all entries predicted with that label, that is, how reliable the model's predictions of that label are. Recall is the percentage of correct predictions for a label out of all entries that actually have that label, that is, how well the model finds the entries with that label.
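A sketch that reports both metrics (plus accuracy) per class, using sklearn's classification_report:

```python
from sklearn.metrics import classification_report

# Precision and recall per class, plus overall accuracy, for each model
for name, y_pred in predictions.items():
    print(name)
    print(classification_report(y_test, y_pred))
```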

We can see that for this example we have very similar results for the Logistic Regression and Gradient Boosting algorithms, but with big datasets Gradient Boosting has advantages in performance.

Conclusion

For a prediction problem, we need to spend some time in the data pre-processing step. When analysing the dataset it's possible to make some assumptions that can improve the performance of the model. Depending on the size of the dataset this can take a long time; however, it's a very important step for the quality of the model.

The choice of the model may end up not having as much impact on the results as the pre-processing of the data. In addition, in cases where the dataset is unbalanced, it is wise to choose metrics other than Accuracy.
