Predicting customer churn for a telco

Suhaib Ali Kamal
Jun 21 · 4 min read

Customer churn refers to customers who are lost after their purchase. The cost of acquiring a customer is significant enough that companies have good reason to analyse and understand why customers churn.

For this exercise we are going to use a dataset from a telecom company. The data can be found on the following link.

The dataset has close to 21 columns with 18 independent variables, a customer id variable and a dependent variable in the form of customer churn. The first part of the code loads the data so that we can inspect the data types of the columns.

import pandas as pd
df = pd.read_csv('Customer Churn.csv')

Almost all of the variables in the dataset are categorical, barring the total charges and the monthly charges. We can first check whether there are any missing values.
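A minimal sketch of such a check, run on a small invented frame so that it is self-contained. The column names (tenure, MonthlyCharges, TotalCharges) are assumptions modelled on the description above, and the values are made up. One caveat, offered as a possibility rather than a finding about this dataset: isnull() only counts true NaN values, so a blank string in a charges column that was read as text would go unnoticed until the column is coerced to numbers.

```python
import pandas as pd

# Toy stand-in for the telco data; names and values are invented for illustration.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # note the blank string
})

# Missing values per column: the blank string is NOT counted as missing here.
missing_raw = toy.isnull().sum()

# Coerce the text column to numbers; entries that cannot be parsed become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = toy.isnull().sum()

print(missing_raw)
print(missing_after)
```

On the real file the same two calls would be applied to df instead of the toy frame.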


After confirming that there are no missing values we will perform EDA to identify any relationships or patterns in the data. Let us first try to find the distribution of the tenure of the telco customers.

Tenure vs Churn

Let us further analyse the types of services telco customers use and how they relate to churn rates. Let us first look at the relationship between the type of Internet service and churn.

Internet connection vs churn rates

As can be seen from the chart, people with a fibre optic connection are far more likely to churn than people with a DSL connection. More importantly, we can also deduce that the type of Internet service has a significant impact on the customer churn rate. Let us see how the duration of the contract affects customer churn.

Contract duration vs churn

People with longer contracts churn less than people who sign up for month-to-month contracts.
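The comparisons above can also be expressed as numbers rather than charts. The sketch below computes the churn rate per contract type on a few invented rows; the column names Contract and Churn and the "Yes"/"No" coding are assumptions, not taken from the dataset itself.

```python
import pandas as pd

# Invented rows for illustration; column names and labels are assumptions.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No",
              "Yes", "No", "No", "No",
              "No", "No"],
})

# Share of churned customers within each contract type.
churn_rate = (toy["Churn"] == "Yes").groupby(toy["Contract"]).mean()
print(churn_rate.sort_values(ascending=False))
```

The same pattern works for the Internet service comparison by grouping on that column instead.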

Feature Engineering

For this section we will have to generate additional columns for our categorical variables in the form of dummy variables.
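A minimal sketch of dummy encoding with pandas, assuming pd.get_dummies is the tool used (the article does not show this step). The frame is invented and the column names are assumptions. For a two-level variable one of the two indicator columns is redundant, which is why one column of each such pair is dropped afterwards.

```python
import pandas as pd

# Invented rows; column names are assumptions modelled on the text.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "Yes", "No"],
})

# One indicator column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["gender", "Contract", "Churn"])

# Drop the redundant half of each two-level pair.
encoded = encoded.drop(columns=["gender_Female", "Churn_No"])
print(encoded.columns.tolist())
```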

# drop the redundant half of each binary dummy pair
df.drop(['Churn_No', 'gender_Female', 'Partner_No',
         'Dependents_No', 'PhoneService_No', 'PaperlessBilling_No'],
        axis=1, inplace=True)

We now have a total of 41 columns in our dataset. There are multiple ways to deal with data that has this many dimensions. One way is to reduce the dimensionality of the dataset through a principal component analysis. Whether a principal component analysis can be conducted on a categorical dataset is another question. This link on Stack Exchange suggests an alternative called multiple correspondence analysis (MCA).

However, for the scope of this project we are going to skip this process and proceed to building our machine learning model. The first part of the process is to divide the data into a training set and a test set.


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Once the data is split we can proceed to fit a logistic regression on the training set.
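The fitting code is not shown in the article, so the following is only a sketch of how this step might look with scikit-learn. It uses synthetic, imbalanced data from make_classification as a stand-in for the encoded telco features, and the settings (max_iter, random_state, the class weights of the synthetic data) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded telco features.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# max_iter is raised as a precaution against convergence warnings.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall and f1, plus overall accuracy.
print(classification_report(y_test, y_pred))
```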


The results from the logistic regression are shown below

              precision    recall  f1-score   support

           0       0.91      0.84      0.87      1107
           1       0.54      0.70      0.61       302

    accuracy                           0.81      1409
   macro avg       0.72      0.77      0.74      1409
weighted avg       0.83      0.81      0.81      1409

The model has an accuracy score of 81%, with a precision of 54% and a recall of 70% for the positive (churn) class. Let us try a decision tree classifier to see whether we can further improve the result.

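# predictions are passed first here, so rows are predicted labels and columns are true labels
print('Confusion Matrix with Decision Tree', confusion_matrix(y_pred, y_test))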
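The fragment above shows only the tail of the evaluation call. A self-contained sketch of the whole step might look like the following; it runs on synthetic data, since the encoded telco frame is not reproduced here, and the random_state values are assumptions. Unlike the fragment, it passes the true labels first, which is the argument order scikit-learn documents for confusion_matrix, so its rows correspond to true classes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded telco features.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print('Accuracy Score with Decision Tree', acc)
print('Confusion Matrix with Decision Tree', cm)
```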

The results are shown below

Accuracy Score with Decision Tree 0.7381121362668559
Confusion Matrix with Decision Tree [[843 194]
 [175 197]]
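As a worked check, the accuracy can be recomputed from the printed confusion matrix, along with the precision and recall for the churn class. The sketch assumes the matrix layout implied by the call shown earlier, which passes y_pred as the first argument, so that rows are predicted labels and columns are true labels.

```python
# Confusion matrix as printed above. With predictions passed first,
# rows are predicted labels and columns are true labels (an assumption
# based on the argument order in the fragment).
cm = [[843, 194],
      [175, 197]]

total = sum(sum(row) for row in cm)            # all test customers
accuracy = (cm[0][0] + cm[1][1]) / total       # correct predictions / all predictions

true_pos = cm[1][1]                            # predicted churn, actually churned
precision = true_pos / (cm[1][0] + cm[1][1])   # share of predicted churners that churned
recall = true_pos / (cm[0][1] + cm[1][1])      # share of actual churners that were caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

Under that reading the decision tree reaches roughly 53% precision and 50% recall on the churn class, both below the logistic regression figures.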

The accuracy score has fallen from 81% to about 74%. We will now use RandomizedSearchCV on a random forest classifier to see whether the score of the model can be improved.

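from sklearn.model_selection import RandomizedSearchCV
estimators = range(50, 100)
# 'auto' behaved like 'sqrt' for classifiers; recent scikit-learn releases no longer accept it
max_features = ['auto', 'sqrt']
max_depth = range(4, 12)
min_samples_split = range(2, 8)
min_samples_leaf = range(1, 8)
bootstrap = [True, False]
# entries after n_estimators are inferred from the variables defined above
random_grid = {'n_estimators': estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}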

We used RandomizedSearchCV to identify the best hyperparameters for the random forest classifier and then built a random forest classifier with those values. Let us see the results below.
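The search call itself is not shown, so here is a hedged sketch of how such a search might be run. The data is synthetic, and n_iter, cv and random_state are assumptions kept small so the example finishes quickly. The ranges are wrapped in list() for explicitness, and max_features uses 'sqrt' and 'log2' because recent scikit-learn releases no longer accept 'auto'.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the encoded telco features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

param_distributions = {
    "n_estimators": list(range(50, 100)),
    "max_features": ["sqrt", "log2"],
    "max_depth": list(range(4, 12)),
    "min_samples_split": list(range(2, 8)),
    "min_samples_leaf": list(range(1, 8)),
    "bootstrap": [True, False],
}

# n_iter and cv are kept small so the sketch runs quickly.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, cv=3, random_state=0,
)
search.fit(X_train, y_train)

# best_estimator_ is the forest refit on the full training set with the best settings.
best_forest = search.best_estimator_
print(search.best_params_)
print(best_forest.score(X_test, y_test))
```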

              precision    recall  f1-score   support

           0       0.92      0.83      0.87      1122
           1       0.51      0.70      0.59       287

    accuracy                           0.80      1409
   macro avg       0.71      0.77      0.73      1409
weighted avg       0.83      0.80      0.81      1409

The score is roughly on par with logistic regression (80% accuracy against 81%). We can also use the random forest's feature importances to find the key factors affecting customer churn.

The three most important features are

  1. Monthly Charges
  2. Type of contract (month-to-month, one-year, two-year)
  3. Online security
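A ranking like the one above can be read from a fitted random forest through its feature_importances_ attribute. The sketch below shows the mechanics on synthetic data with hypothetical feature names; it does not reproduce the telco ranking, and the forest settings are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, for illustration only.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; a larger value means the feature
# contributed more to impurity reduction across the trees.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top_three = importances.sort_values(ascending=False).head(3)
print(top_three)
```

With dummy-encoded data, each level of a categorical variable gets its own importance, so the levels of one variable may need to be summed to rank the variable as a whole.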

Strategists can focus on these factors to reduce customer churn and increase customer satisfaction.

Nerd For Tech

From Confusion to Clarification

Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit

