From Data to Decision: Gain hands-on experience predicting customer churn.

Learn to use the customer churn dataset to improve customer retention strategies.

Published in AnalyticSoul, Jun 11, 2024

Welcome back! In our previous lesson, we introduced the concept of customer churn and its various types. In this lesson, we will dive into the practical aspect of churn prediction by exploring a telecom dataset (telco_customer_churn.csv) that we’ll use for our analysis. This dataset contains a wealth of information about customers, their usage patterns, and whether they have churned. Understanding the structure and nuances of this dataset is a crucial step before building our predictive model. Let’s get familiar with the dataset.

Loading the dataset

Let’s load the telco_customer_churn.csv dataset and observe a sample of records to get a better understanding of the data.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_telco = pd.read_csv('data/telco_customer_churn.csv', header=0)

print('Dimension of the dataset:')
print(df_telco.shape)

print('\nSample records:')
df_telco.head()

Understanding churn relationships

To build an effective churn prediction model, we need to understand the relationships and patterns within the dataset. Let’s explore some key aspects and uncover inter-feature relationships:

Influence of gender and partner on customer churn

Let’s check whether Gender or Partner has any influence on customer churn. In our dataset, the Partner feature indicates whether the customer has a partner.

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
s1 = sns.countplot(data=df_telco, x='Gender', hue='Churn', ax=ax[0])
s2 = sns.countplot(data=df_telco, x='Partner', hue='Churn', ax=ax[1])

# write the count above each bar
for plot in [s1, s2]:
    for rect in plot.patches:
        height = rect.get_height()
        if height > 0:  # skip zero-height patches, there is nothing to label
            plot.text(rect.get_x() + rect.get_width() / 2, height + 20, f'{int(height)}', horizontalalignment='center', fontsize=11)


s1.legend(['No Churn', 'Churn'])
s2.legend(['No Churn', 'Churn'])
s2.set_xticks([0, 1])
s2.set_xticklabels(['With partner', 'Without partner'])
plt.show()

We draw two count plots (Gender and Partner) segmented by Churn. In the first chart, 939 of 3,488 female customers and 930 of 3,555 male customers have churned, giving churn rates of approximately 27% and 26% respectively. Gender therefore appears to have little influence on churn.

In the second chart, out of 3,402 customers with a partner, 669 have churned, which is about 20%, and out of 3,641 customers without a partner, 1,200 have churned, representing a churn rate of approximately 33%. Apparently customers without partners are more likely to churn.
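The percentages above follow directly from the counts read off the two charts. As a quick sanity check, the arithmetic can be reproduced in a few lines; the dictionary layout and variable names below are only illustrative.

```python
# (churned, total) counts as read from the two count plots above
counts = {
    'Female': (939, 3488),
    'Male': (930, 3555),
    'With partner': (669, 3402),
    'Without partner': (1200, 3641),
}

# churn rate = churned customers / all customers in the group
churn_rates = {group: churned / total for group, (churned, total) in counts.items()}

for group, rate in churn_rates.items():
    print(f'{group}: {rate:.1%}')
```

On the real DataFrame, something like df_telco.groupby('Partner')['Churn'].mean() should give the same rates, assuming Churn is stored as 0/1 (as the later filter df_telco['Churn'] == 1 suggests).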

Influence of tenure length on customer churn

Let’s see if tenure impacts customer churn.

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
s1 = sns.histplot(data=df_telco, x='Tenure', hue='Churn', multiple="stack", ax=ax[0])
s2 = sns.boxplot(data=df_telco, x='Churn', y='Tenure', ax=ax[1])

s1.set_xlabel('Tenure (months)')
s2.set_ylabel('Tenure (months)')

s1.legend(['No Churn', 'Churn'])
s2.set_xticks([0, 1])
s2.set_xticklabels(['No', 'Yes'])

plt.show()

This time we draw a histogram and a box plot showing the relationship between Churn and Tenure. The left chart (histogram) shows how many customers fall into each tenure range, split by churn status. Interestingly, newly subscribed customers are more likely to churn than customers who have held their subscriptions for longer. The right chart (box plot) presents the same information as a comparison of the tenure distributions of churned and retained customers.
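One way to put numbers on this pattern is to bucket tenure and compute the churn rate per bucket. The sketch below uses a tiny made-up frame (not rows from the dataset) with the same Tenure and Churn column names, and assumes Churn is encoded as 0/1. The bucket edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# toy data for illustration only; NOT taken from telco_customer_churn.csv
toy = pd.DataFrame({
    'Tenure': [1, 2, 5, 8, 14, 20, 30, 45, 60, 72],
    'Churn':  [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# bucket tenure (in months) into ranges; the edges are an arbitrary assumption
bins = [0, 12, 24, 48, 72]
labels = ['0-12', '13-24', '25-48', '49-72']
toy['TenureGroup'] = pd.cut(toy['Tenure'], bins=bins, labels=labels, include_lowest=True)

# the mean of a 0/1 column is the share of churned customers in each bucket
churn_by_tenure = toy.groupby('TenureGroup', observed=True)['Churn'].mean()
print(churn_by_tenure)
```

Replacing toy with df_telco would apply the same bucketing to the real data, provided the column names and the 0/1 encoding match.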

Relationship with Contract types and Monthly charges

Let’s examine whether contract type and monthly charges impact customer churn.

fig, ax = plt.subplots(1, 2, figsize=(20, 10))
s1 = sns.countplot(data=df_telco, x='Contract', hue='Churn', ax=ax[0])
s2 = sns.boxplot(data=df_telco, y='MonthlyCharges', x='Churn', ax=ax[1])

s1.legend(['No Churn', 'Churn'])
s2.set_xticks([0, 1])
s2.set_xticklabels(['No', 'Yes'])

plt.show()

The left chart shows that customers on monthly subscriptions are more likely to leave than customers on yearly contracts. Subscription fees also seem to play a role: the right chart shows that churned customers tend to pay higher monthly charges than customers who stayed.
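Raw counts can be hard to compare when the contract groups differ in size, so it can help to look at the share of churned customers within each contract type. Here is a minimal sketch using a normalized cross-tabulation. The frame is toy data, the contract labels ('Month-to-month', 'One year', 'Two year') are assumed category names, and Churn is assumed to be encoded as 0/1.

```python
import pandas as pd

# toy data for illustration only; NOT taken from telco_customer_churn.csv
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 5 + ['One year'] * 3 + ['Two year'] * 2,
    'Churn': [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# normalize='index' turns each row into proportions that sum to 1,
# so column 1 holds the churn rate for that contract type
contract_churn = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')
print(contract_churn)
```

The same call on df_telco['Contract'] and df_telco['Churn'] would give per-contract churn rates for the real data.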

Relationship with additional services

Finally, we check if customer churn is correlated with additional services. For this, we consider only the churned customers and examine how many of them subscribed to the additional services.

# consider only churned customers
df = df_telco[df_telco['Churn'] == 1]

# additional services
services = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport']
df = df[services]
# cleanup: map any value containing 'No' to 'No' and everything else to 'Yes'
df = df.apply(lambda x: x.apply(lambda y: 'No' if 'No' in y else 'Yes')).reset_index(drop=True)

# draw count plots per additional services
fig, ax = plt.subplots(1, len(services), figsize=(len(services)*8, 8))
for i, col in enumerate(services):
    sns.countplot(data=df, x=col, ax=ax[i], order=['Yes', 'No'], palette="tab10", hue=col)

plt.show()

This time, we filter out the customers who stayed, keeping only those who have churned, and select just the additional-service columns. Then we draw a count plot for each service. In every plot, most churned customers had not subscribed to the service, which suggests that customers without additional services are more prone to leaving.
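The cleanup step can also be written with vectorized string methods, and a churn rate per service value gives a complementary view to the counts among churned customers. The sketch below runs on toy data; the value 'No internet service' is an assumed example of a label containing 'No', and Churn is assumed to be encoded as 0/1.

```python
import pandas as pd

# toy data for illustration only; NOT taken from telco_customer_churn.csv
toy = pd.DataFrame({
    'TechSupport': ['No', 'Yes', 'No internet service', 'No', 'Yes', 'No'],
    'Churn': [1, 0, 0, 1, 0, 0],
})

# collapse any value containing 'No' into 'No', everything else into 'Yes'
has_no = toy['TechSupport'].str.contains('No', regex=False)
toy['TechSupport'] = has_no.map({True: 'No', False: 'Yes'})

# share of churned customers with and without the service
churn_by_support = toy.groupby('TechSupport')['Churn'].mean()
print(churn_by_support)
```

Unlike the count plots, this version keeps all customers, so the two groups can be compared by rate instead of by raw count.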

We hope you now have a good understanding of the dataset and its features. Try experimenting with the notebook in the GitHub repository.

In the upcoming lessons, we will use this dataset to build a logistic regression model to predict customer churn. By analyzing patterns and creating a robust model, we’ll be able to identify customers at risk of leaving and develop strategies to retain them.

Stay tuned!

What’s next?

Lesson 4.3 — Feature Engineering: One-Hot Encoding: Transforming categorical features into a format suitable for machine learning models.
Lesson 4.4 — Customer Churn Prediction: Implement logistic regression to predict whether a customer will churn.
Lesson 4.5 — Predicting Customer Churn with Decision Tree: Use decision tree algorithms to improve prediction accuracy and interpretability.