Analyzing Customer Churn Data with Python

Aashiq Reza
Published in Geek Culture
7 min read · May 24, 2022


What is Customer Churn?

The term “Customer Churn” refers to the loss of customers. That is, if a customer or a client stops taking services from a company, that customer is said to have churned.

Churn is intimately connected to a company’s financial performance. The better a company understands its customers’ behavior, the better it can retain them and protect its revenue. Analyzing customer churn also helps in finding and fixing shortcomings in the services the company provides.

Collecting and cleaning the data

This experiment uses the well-known Telco customer churn dataset, which is available from several sources. You can download the data from here as well.

After downloading the data, load it into Python and get an overview of it. Before that, import all the necessary libraries into the environment.

# Import libraries
import os
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import sklearn as sk
import xgboost as xgb

# ML algorithms
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, RocCurveDisplay, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# For hyperparameter tuning
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

Now read the data and take a look at it.

# Reading data
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv', sep=',')
# Overview of the data
data.head()

Figure: Data overview

The dataset has 21 variables with 7032 observations. The first column represents customerID; I will drop this column before further analysis. I checked the missing values and data types in this dataset using the following code.

# Checking the summary of missing values
data.isnull().values.any()  # False implies there are no missing values in the data
# Checking the data types
data.dtypes

Figure: Data types

Most of the variables are of the object type. This caused problems later in the analysis, because TotalCharges is stored as an object when it should be a float. Also, although the code above reports no missing values, some values are missing in the form of blank spaces. The following code converts those blank spaces into NaN values, drops the rows that contain them, and converts the relevant columns to floating-point types.

# Removing variables we are not interested in (customerID is the first column)
data.drop(data.columns[[0]], axis=1, inplace=True)

# Missing values occurred as blank spaces in this dataset
print(data[pd.to_numeric(data.TotalCharges, errors='coerce').isnull()])

# Replace all the blank spaces with NaN and drop the affected rows
data.replace(" ", np.nan, inplace=True)
data.dropna(subset=["TotalCharges", "MonthlyCharges"], inplace=True)

# Convert the charge columns to floating-point numbers
data.MonthlyCharges = pd.to_numeric(data.MonthlyCharges)
data.TotalCharges = pd.to_numeric(data.TotalCharges)
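
To see why `isnull()` can report no missing values while blanks are still present, the following minimal sketch reproduces the situation on a small made-up frame. The values are hypothetical and not taken from the Telco data; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data: one TotalCharges entry is a blank string
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# A blank string is an ordinary value, so pandas does not count it as missing
has_nulls_before = bool(toy.isnull().values.any())

# Coercing to numeric turns anything unparseable (including " ") into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isnull().sum())

# Drop the affected rows, mirroring the cleaning step for the real dataset
toy = toy.dropna(subset=["TotalCharges"])

print(has_nulls_before, n_missing, toy["TotalCharges"].dtype, len(toy))
```

The check only sees the blank once the column has been coerced to a numeric type, which is why the type conversion and the missing-value handling belong together.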

Data Visualizations

With data collection and preliminary cleaning complete, we can visualize the dataset to extract useful insights. The figure Gender vs Churn shows that the number of churned customers is almost equal for males and females. The next figure, SeniorCitizen vs Churn, shows that younger people churn more often in absolute numbers, but senior citizens churn at a higher rate relative to the size of their group. The figure Contract length vs Churn suggests that customers on month-to-month contracts are more likely to churn. Customers who have churned have shorter tenure than those who still use the company’s services. The last figure shows the relationships between the numerical variables in the dataset.

Figure: Some visualizations

The code for generating the above figures:

# Pairplot of the numerical variables
sn.pairplot(data=data, hue='Churn')
plt.show()

# Average time to churn
sn.boxplot(x='Churn', y='tenure', data=data)
plt.title('Tenure vs Churn')
plt.show()

# Effect of contract length on customer attrition
counts = data.groupby(['Contract'])['Churn'].value_counts().rename('Count').reset_index()
sn.barplot(x="Contract", y="Count", hue="Churn", data=counts).set_title('Contract length vs Churn')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
plt.show()

# Effect of age on customer attrition
counts = data.groupby(['SeniorCitizen'])['Churn'].value_counts().rename('Count').reset_index()
sn.barplot(x="SeniorCitizen", y="Count", hue="Churn", data=counts).set_title('SeniorCitizen vs Churn')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
plt.savefig('fig4.png')
plt.show()

# Effect of gender on customer attrition
counts = data.groupby(['gender'])['Churn'].value_counts().rename('Count').reset_index()
sn.barplot(x="gender", y="Count", hue="Churn", data=counts).set_title('Gender vs Churn')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
plt.show()
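
The difference between churn counts and churn ratios in the SeniorCitizen comparison can be made explicit with a row-normalized cross-tabulation. The sketch below uses a small made-up frame; only the column names `SeniorCitizen` and `Churn` follow the dataset, and the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data: 8 non-senior customers and 4 senior customers
toy = pd.DataFrame({
    "SeniorCitizen": [0] * 8 + [1] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No",
              "Yes", "Yes", "No", "No"],
})

# Absolute counts: more non-seniors churn simply because there are more of them
counts = pd.crosstab(toy["SeniorCitizen"], toy["Churn"])

# Row-normalized shares: the churn rate within each group
rates = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")

print(counts)
print(rates.round(3))
```

In this toy example, three non-seniors churn against two seniors, yet the churn rate is 37.5% for non-seniors and 50% for seniors, which is the same pattern the bar chart describes.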

Modeling and Accuracy Test

In this section, I will fit several models to make predictions on the dataset and test their accuracy. First, the dataset is split into training and test sets: 30% of the observations are held out as test data, and the rest are used for training the models.

# Splitting into test and train sets
x = data.drop('Churn', axis=1)
y = data['Churn']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
x_train = pd.get_dummies(x_train)
x_test = pd.get_dummies(x_test)
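
One-hot encoding the train and test splits separately can produce different columns when a category happens to be missing from one split. The following minimal sketch, using a made-up `Contract` column, shows the mismatch and one possible way to align the two frames. The toy frames and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy splits: the test split happens to contain no "Two year" contracts
x_train_toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})
x_test_toy = pd.DataFrame({"Contract": ["Month-to-month", "One year"]})

train_enc = pd.get_dummies(x_train_toy)
test_enc = pd.get_dummies(x_test_toy)

# The encoded frames disagree on their columns, which would break model.predict
mismatch = list(train_enc.columns) != list(test_enc.columns)

# Reindex the test columns to the training layout; absent categories become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(mismatch, list(test_enc.columns))
```

An alternative with the same effect is to encode the full feature frame once and split afterwards.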

All the required libraries and algorithms have been loaded at the beginning of the code. In this experiment, I am considering logistic regression, random forest, decision tree, and MLP classifiers to make predictions. The models are trained on the training data and performance metrics are evaluated on the test dataset.
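
The performance metrics used in this comparison can all be derived from the confusion matrix. The sketch below works through the arithmetic on made-up labels (not results from the Telco data) and checks it against scikit-learn’s own functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 10 customers, 3 of whom actually churned
y_true = np.array(["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"])
y_pred = np.array(["No", "No", "No", "No", "No", "No", "Yes", "Yes", "No", "No"])

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all customers classified correctly
churn_recall = tp / (tp + fn)     # share of actual churners that were caught

print(cm)
print(accuracy, churn_recall)
```

Here the accuracy is 70% even though only one of the three churners is identified, so it can be worth reading the per-class recall in the classification report alongside the overall accuracy.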

  • First, the logistic regression model was trained on the training set, and it reached 80% accuracy on the test set.
logmodel = LogisticRegression()
logmodel.fit(x_train, y_train)
predictions = logmodel.predict(x_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Figure: Results from logistic regression
  • Next, I trained the MLP classifier; its accuracy on the test set is also 80%.
mlp = MLPClassifier()
mlp.fit(x_train, y_train)
predictions = mlp.predict(x_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Figure: Results from MLP classifier
  • The decision tree shows 72% accuracy on the test set.
dtree = DecisionTreeClassifier()
dtree.fit(x_train, y_train)
predictions = dtree.predict(x_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Figure: Results from the decision tree
  • Lastly, the random forest model was trained, and it reached around 79% accuracy on the test set.
rand = RandomForestClassifier()
rand.fit(x_train, y_train)
predictions = rand.predict(x_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Figure: Results from random forest
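
The four fit-and-score steps above follow the same pattern, so they can also be written as a single loop. The sketch below is one possible way to do this. It uses a synthetic dataset from `make_classification` as a stand-in for the encoded churn features, and the explicit `max_iter` and `random_state` settings are assumptions added for reproducibility, so the printed accuracies will not match the figures reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded churn features (an assumption, not the Telco data)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500, random_state=1),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1),
}

# Fit every model on the same training split and score it on the same test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from highest to lowest test accuracy
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Keeping the models in one dictionary makes it easy to add or remove a candidate without repeating the evaluation code.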

Hyperparameter tuning

Hyperparameter tuning is important for improving the predictive ability of a model. Hyperparameters are settings whose values control the learning process and thereby affect the model parameters that the algorithm learns. Tuning them means choosing the set of values that gives the best performance. There are many ways to tune hyperparameters, and grid search is one of the simplest. The following code can be used to tune hyperparameters for any model; only the parameter grid needs to change. Each model has many different parameters, and which ones to tune depends on the use case and the problem.

solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = np.logspace(-4, 4, 50)
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values, max_iter = [1000])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=logmodel, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(x_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Result:
Best: 0.807123 using {'C': 0.013257113655901081, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'liblinear'}

The improvement in accuracy is small in this case, because only a very naive approach to tuning was taken. With a more developed approach, the accuracy might rise further, possibly up to 90–95%.
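
As an illustration of changing only the parameter grid, the following sketch applies the same grid-search pattern to a random forest. The grid values, the synthetic data from `make_classification`, and the use of a plain 3-fold `StratifiedKFold` (chosen to keep the run short) are all assumptions for demonstration, not settings used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption, not the Telco data)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Hypothetical small grid; the values are illustrative only
rf_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
rf_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=1),
    param_grid=rf_grid,
    cv=cv,
    scoring="accuracy",
)
rf_search.fit(X, y)

# Number of cross-validation fits = grid combinations x CV folds
n_fits = len(rf_search.cv_results_["params"]) * cv.get_n_splits()
print(rf_search.best_params_, round(rf_search.best_score_, 3), n_fits)
```

The cost of a grid search grows quickly. The logistic regression grid above has 3 solvers × 50 values of C = 150 combinations, each evaluated on 10 × 3 = 30 folds, which amounts to 4,500 model fits.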


  • In this article, I have shown how to analyze customer churn with the Telco churn data in Python.
  • Visualizations can show some useful insights from the data. For example, we can find the influencing factors behind customer churn with the help of visualization.
  • Predictive analysis has been conducted and different machine learning algorithms have been compared for solving this particular problem.
  • Finally, hyperparameter tuning shows how to optimize the parameter values of a learning model to get the best predictive accuracy.

Notes: The full code can be downloaded from here.


