Unlocking the Secrets of Customer Churn: Predicting the Future for Business Success.

Isaac Rambo a.k.a Data Rambo
15 min read · Aug 20, 2023


Click here to view dashboard

Introduction

In the dynamic and ever-evolving landscape of the telecommunications industry, customer churn has become a critical challenge for service providers. The ability to predict and understand customer churn can significantly impact business success, customer retention strategies, and ultimately, the bottom line.

In this project, we embark on an exciting journey to explore and analyze customer churn within the Telecom network service using the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework. Our aim is to leverage data-driven insights to identify key factors influencing churn, build predictive models, and develop actionable recommendations that can help Vodafone proactively retain valuable customers and enhance their overall service offerings.

As we delve into the world of customer churn prediction, we will uncover the hidden patterns and trends that can make all the difference in retaining customers and ensuring long-term success in the competitive telecommunications market. Join us on this captivating journey as we unravel the secrets of customer churn and pave the way for a brighter future for Telecom companies and their valued subscribers.

Business Understanding

The telecommunications industry is grappling with a high churn rate, prompting Telco companies to utilize data mining for churn analysis. The objective is to identify vital churn factors and devise effective retention strategies. Churn can lead to significant financial losses, making customer retention crucial. Data mining enables the identification of factors like demographics, usage patterns, and satisfaction levels that influence churn. Armed with this insight, businesses can implement strategies such as discounts, enhanced customer service, and innovative features to curb churn and bolster customer loyalty.

HYPOTHESIS

Null Hypothesis (H0)

The monthly subscription cost (MonthlyCharges) has no significant effect on customer churn (Churn) from the Vodafone network service.

Alternate Hypothesis (H1)

The monthly subscription cost (MonthlyCharges) has a significant effect on customer churn (Churn) from the Vodafone network service.

ANALYTICAL QUESTIONS

UNIVARIATE:

  1. Distribution of Churn: What is the proportion of customers who have churned versus those who have not churned?
  2. Distribution of Tenure: How is the duration of customer subscriptions (tenure) distributed among all customers?
  3. Distribution of Senior Citizen Status: What is the distribution of senior citizen status among all customers?

BIVARIATE:

  1. Distribution of Monthly Charges across Churn: How do monthly charges vary between customers who have churned and those who have not?
  2. Relationship between Churn and Customer Tenure: Is there a correlation between customer tenure and the likelihood of churning?
  3. Distribution of Churn across Contract Types: How does the proportion of churn vary across different contract types (month-to-month, one year, two-year)?
  4. Impact of Online Security Service on Churn: Does the presence or absence of online security service influence customer churn rates?
  5. Impact of Online Backup Service on Churn: How does the presence or absence of online backup service affect customer churn rates?

MULTIVARIATE:

  1. Impact of Internet Service Type on Monthly Charges and Churn: How does the type of internet service (DSL, Fiber Optic, None) influence monthly charges and customer churn? Are Fiber Optic customers paying significantly higher charges compared to DSL customers, and does this affect their likelihood of churning?

IMPORTATION

To initiate my analysis, I imported the necessary libraries. Together, they provide the tools needed to connect to data sources, manipulate and clean data, visualize insights, perform statistical analysis, build and evaluate machine learning models, and handle class imbalance in my customer churn prediction project.

  1. dotenv: Helps load environment variables from a .env file to keep sensitive information secure.
  2. numpy: Provides support for arrays and mathematical operations on them, making data manipulation faster and easier.
  3. pandas: Offers data manipulation tools to analyze, clean, and transform data efficiently using dataframes.
  4. matplotlib.pyplot: A popular library for creating static, interactive, and animated visualizations in Python.
  5. plotly.express: Simplifies the creation of interactive visualizations using a high-level API.
  6. plotly.io: Provides functions to export Plotly figures to various file formats.
  7. missingno: Visualizes missing values in datasets, helping you understand the data quality.
  8. scipy.stats: Offers a wide range of statistical functions, including hypothesis testing, probability distributions, and more.
  9. sklearn.preprocessing: Contains tools for preprocessing data, such as label encoding and standard scaling.
  10. sklearn.compose: Provides tools such as ColumnTransformer for applying different preprocessing steps to different columns.
  11. sklearn.model_selection: Provides tools for splitting data into training and testing sets, as well as cross-validation.
  12. sklearn.linear_model.LogisticRegression: Implements the logistic regression algorithm for classification tasks.
  13. sklearn.ensemble.RandomForestClassifier: Implements the random forest algorithm for classification tasks.
  14. sklearn.ensemble.GradientBoostingClassifier: Implements the gradient boosting algorithm for classification tasks.
  15. sklearn.naive_bayes.GaussianNB: Implements the Gaussian Naive Bayes algorithm for classification tasks.
  16. sklearn.neighbors.KNeighborsClassifier: Implements the k-nearest neighbors algorithm for classification tasks.
  17. sklearn.tree.DecisionTreeClassifier: Implements the decision tree algorithm for classification tasks.
  18. sklearn.svm.SVC: Implements the Support Vector Machine (SVM) algorithm for classification tasks.
  19. xgboost.XGBClassifier: Implements the XGBoost algorithm for classification tasks.
  20. sklearn.metrics: Contains functions to evaluate the performance of machine learning models, including classification reports and confusion matrices.
  21. imblearn.over_sampling.SMOTE: Implements the Synthetic Minority Over-sampling Technique (SMOTE) for handling class imbalance in datasets.
  22. warnings: A built-in module in Python that helps you control how warnings are displayed or ignored.

LOADING DATASET

In this project, the dataset has been sourced from three distinct locations. The initial 3000 records are stored in a remote database, requiring remote access to retrieve the data. The subsequent 2000 records are contained in an Excel file named “Telco-churn-second-2000.xlsx,” conveniently accessible from OneDrive. Lastly, the final portion of the dataset can be located within a GitHub Repository, specifically stored in a CSV file titled “LP2_Telco-churn-last-2000.csv.” This diverse collection of data sources underscores the complexity of real-world scenarios, where data analysts must adeptly navigate multiple platforms to gather comprehensive datasets for analysis.

UNDERSTANDING THE DATA

The dataset comprises three distinct segments, each contributing to a comprehensive understanding of customer churn within the telecommunications industry. The first dataset, denoted as “Telco_1,” consists of 3000 rows and 21 columns, housing vital information that sheds light on the diverse attributes of customers and their interactions with the Vodafone network service. Similarly, “Telco_2” encompasses 2000 rows and 20 columns, focusing on key insights pertinent to customer behavior and service engagement.

Additionally, the third dataset, aptly named “LP2_Telco_3,” encapsulates a comprehensive dataset with dimensions of 2043 rows and 21 columns. This dataset showcases essential features that play a pivotal role in shaping customer behavior, preferences, and the propensity to churn. Together, these datasets form a holistic foundation for conducting an in-depth analysis aimed at uncovering insights into customer churn patterns and facilitating the formulation of effective retention strategies.

The dataset contains customer details sourced from the Telco network service, including attributes such as MonthlyCharges, Tenure, SeniorCitizen status, and a variety of service subscriptions (e.g., OnlineSecurity, OnlineBackup). The target variable, ‘Churn,’ signifies whether a customer has churned (‘Yes’) or not (‘No’). In preparation for model development, preliminary steps involve data preprocessing, addressing missing values, and mitigating class imbalance to ensure the dataset is ready for subsequent analysis.

Among the three datasets, Telco_1 exhibited the highest prevalence of missing values.

CHECKED DATA QUALITY

During the preliminary dataset exploration, several data quality issues were identified, and they are summarized below:

  1. Accuracy:
  • Most of the categorical columns in the first DataFrame use True to represent Yes and False to represent No, creating inconsistency among the three datasets.

  2. Validity:
  • In the second and third datasets, the TotalCharges column has the object data type instead of float, even though it should hold numerical values.

  3. Consistency:
  • The categorical columns contain inconsistent values across the three datasets.

  4. Completeness:
  • The Churn column (the target) and other columns in the first dataset have missing data points.
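The issues listed above can be addressed with a small harmonization step before the datasets are combined. The sketch below is illustrative only: the two tiny DataFrames and the `harmonize` helper are hypothetical stand-ins, not the project's actual data or code, and only three columns are shown.

```python
import pandas as pd

# Tiny hypothetical frames that mimic the issues described above
telco_1 = pd.DataFrame({
    "Partner": [True, False, None],
    "TotalCharges": [29.85, 1889.5, 108.15],
    "Churn": [False, True, None],
})
telco_3 = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "TotalCharges": ["151.65", " ", "820.5"],
    "Churn": ["No", "Yes", "No"],
})

# Map both boolean and string encodings onto a single Yes/No convention
YES_NO = {True: "Yes", False: "No", "Yes": "Yes", "No": "No"}


def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Yes/No categories and numeric TotalCharges."""
    out = df.copy()
    for col in ["Partner", "Churn"]:
        # Values missing from the mapping (such as None) become NaN
        out[col] = out[col].map(YES_NO)
    # Strings that cannot be parsed as numbers (such as " ") become NaN
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    return out


combined = pd.concat(
    [harmonize(telco_1), harmonize(telco_3)], ignore_index=True
)
print(combined.dtypes)
print(combined.isna().sum())
```

Coercing unparseable strings to NaN turns a validity problem into a completeness problem, which can then be handled together with the other missing values.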

HYPOTHESIS TESTING

In our hypothesis, we are interested in comparing the means of Monthly Charges between two independent groups: customers who have churned (Churn = “Yes”) and customers who have not churned (Churn = “No”).

The independent samples t-test is suitable for our analysis because we are comparing the mean Monthly Charges of two distinct groups (churned and non-churned customers) where the Monthly Charges of each customer are unrelated and independent of whether or not other customers have churned.

The t-test allows us to test if there is a significant difference in the average Monthly Charges between these two groups, providing insights into the impact of Monthly Charges on customer churn.
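As a minimal sketch of the test described above, the snippet below runs an independent samples t-test with `scipy.stats.ttest_ind`. The two arrays are synthetic stand-ins with hypothetical means and spreads, not the project's data, and `equal_var=False` (Welch's variant) is an assumption made here because the two groups need not share the same variance. The 0.05 significance level is also an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic MonthlyCharges for illustration only (hypothetical values)
churned = rng.normal(loc=75.0, scale=25.0, size=500)
retained = rng.normal(loc=60.0, scale=30.0, size=1500)

# Welch's t-test: compares the two means without assuming equal variances
t_stat, p_value = stats.ttest_ind(churned, retained, equal_var=False)

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

If the p-value falls below the chosen significance level, the null hypothesis of equal mean MonthlyCharges is rejected in favour of the alternate hypothesis.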

EXPLORATORY DATA ANALYSIS

1. Distribution of Churn: What Is the Proportion of Customers Who Have Churned Versus Those Who Have Not Churned?

The distribution of churn in the dataset shows that out of 5043 customers, 3707 (approximately 73.51%) have not churned (Churn = “No”), while 1336 customers (approximately 26.49%) have churned (Churn = “Yes”).

This indicates that the majority of customers in the dataset have not churned, with a smaller proportion of customers having churned from the service.

2. Distribution of Tenure: How Is the Duration of Customer Subscriptions (Tenure) Distributed Among All Customers?

Customer tenure in the dataset ranges from 0 to 72 months, with a mean of approximately 32.58 months and a standard deviation of approximately 24.53 months. Tenures cluster at the two ends of the range, around 0 to 20 months and 60 to 72 months; 25% of customers have tenures of 9 months or less, and 75% have tenures of 56 months or less.

3. Distribution of Senior Citizen Status: What Is the Distribution of Senior Citizen Status Among All Customers?

The bar chart shows that the distribution of senior citizen status is skewed towards non-senior citizens: 819 customers (16.2%) are senior citizens, while 4224 customers (83.8%) are not.

4. Distribution of Monthly Charges Across Churn: How Do Monthly Charges Vary Between Customers Who Have Churned and Those Who Have Not?

Overall, the box plot illustrates that the price of the monthly subscription (MonthlyCharges) plays a significant role in influencing customer churn. Customers with higher monthly charges are more likely to churn compared to those with lower charges. This discovery holds important implications for devising effective customer retention strategies and making informed pricing decisions within the Vodafone network service.

5. Relationship Between Churn and Customer Tenure: Is There A Correlation Between Customer Tenure and The Likelihood of Churning?

The graph above shows that the number of customers who churn from the Vodafone network service is highest in the first few months of their subscription. One plausible explanation is that customers are more likely to be dissatisfied in the early months, when they are still learning about the company and its offerings.

6. Distribution of Churn Across Contract Types: How Does the Proportion Of Churn Vary Across Different Contract Types (Month-To-Month, One Year, Two-Year)?

The graph shows that the churn rate is highest for customers with month-to-month contracts. Of the 2744 customers with month-to-month contracts, 1184 have churned, for a churn rate of 43.1%. The churn rate is lower for customers with one-year contracts (11.6%) and two-year contracts (2.4%).

7. Impact of Online Security Service on Churn: Does the Presence or Absence of Online Security Service Influence Customer Churn Rates?

The grouped bar graph displays the relationship between the presence of Online Security service (Yes or No) and the customer churn status (Yes or No) within the Vodafone network service. Let’s focus on examining the churn rates between customers with Online Security service enabled (OnlineSecurity = Yes) and those without it (OnlineSecurity = No).

  • For customers with Online Security service (OnlineSecurity = Yes), the graph indicates that 1242 customers did not churn (Churn = No), while 214 customers churned (Churn = Yes). This results in a churn rate of approximately 14.70%.
  • On the other hand, for customers without Online Security service (OnlineSecurity = No), the graph shows 1461 customers who did not churn (Churn = No), and 1046 customers who churned (Churn = Yes). This yields a churn rate of around 41.72%.

8. Impact of Online Backup Service on Churn: How Does the Presence or Absence of Online Backup Service Affect Customer Churn Rates?

The grouped bar graph visually represents the relationship between the presence of the Online Backup service (Yes or No) and the customer churn status (Yes or No). Let’s focus on analyzing the churn rate between customers with Online Backup service enabled (OnlineBackup = Yes) and those without the service (OnlineBackup = No).

  • For customers with Online Backup service (OnlineBackup = Yes), the graph shows that 1363 customers did not churn (Churn = No) while 369 customers churned (Churn = Yes). This results in a churn rate of approximately 21.30% for customers with Online Backup service.
  • On the other hand, for customers without Online Backup service (OnlineBackup = No), the graph displays 1340 customers who did not churn (Churn = No) and 891 customers who churned (Churn = Yes). This yields a churn rate of around 39.94% for customers without Online Backup service.
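The churn rates quoted for the Online Security and Online Backup groups follow directly from the counts: churned customers divided by all customers in the group. The short sketch below recomputes them from the reported counts; the `churn_rate` helper is a hypothetical name introduced only for illustration.

```python
def churn_rate(churned: int, retained: int) -> float:
    """Return the percentage of customers in a group who churned."""
    return 100 * churned / (churned + retained)


# (churned, retained) counts as reported in the text above
groups = {
    "OnlineSecurity = Yes": (214, 1242),
    "OnlineSecurity = No": (1046, 1461),
    "OnlineBackup = Yes": (369, 1363),
    "OnlineBackup = No": (891, 1340),
}

for name, (churned, retained) in groups.items():
    print(f"{name}: {churn_rate(churned, retained):.2f}%")
```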

9. Impact of Internet Service Type on Monthly Charges and Churn: How Does the Type of Internet Service (DSL, Fiber Optic, None) Influence Monthly Charges and Customer Churn? Are Fiber Optic Customers Paying Significantly Higher Charges Compared to DSL Customers, And Does This Affect Their Likelihood of Churning?

The grouped bar graph shows a clear difference in monthly charges between DSL and fiber optic customers. Fiber optic customers pay, on average, considerably more per month than DSL customers. This difference is likely due to the fact that fiber optic is a faster and more reliable internet service than DSL.

The graph also reveals an interesting pattern: churn rates differ markedly between the two groups, with fiber optic customers more likely to churn than DSL customers. One likely reason is that fiber optic customers pay more for their service, so they are more likely to become dissatisfied if it does not meet their expectations.

DATA PREPROCESSING

HANDLING OF MISSING VALUES

During our analysis, we identified missing values within the TotalCharges variable. To address this, we explored two different imputation approaches: mean imputation and median imputation. This allowed us to fill in the missing values and ensure the integrity of our dataset.
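To make the two imputation approaches concrete, here is a minimal sketch using pandas. The values in the series are hypothetical and chosen only to show how the two fills differ when the data are skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical TotalCharges values with gaps, for illustration only
total_charges = pd.Series([29.85, 1889.50, np.nan, 108.15, 7895.15, np.nan])

# Mean imputation: every gap receives the average of the observed values
mean_filled = total_charges.fillna(total_charges.mean())

# Median imputation: every gap receives the middle observed value
median_filled = total_charges.fillna(total_charges.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

When a few large values pull the mean upward, as in this toy series, the median gives a fill value closer to a typical customer, which is one reason to compare the two.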

HANDLING CLASS IMBALANCE

Addressing the issue of class imbalance, we utilize the Synthetic Minority Over-sampling Technique (SMOTE). This technique involves creating synthetic samples for the minority class (churned customers) to equalize the distribution of classes in the dataset.

TRAIN-TEST SPLIT

from sklearn.model_selection import train_test_split

# Split the resampled data into training and validation sets (20% held out, random_state=12)
X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=12)

# Print the shapes of the training and validation sets to check the split
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

Splitting the dataset into training and testing sets is a crucial step in building and evaluating machine learning models. The training set is used to fit the model, while the held-out set (named the validation set in the code above) is used to assess its performance on unseen data. This shows how well the model generalizes to new data and helps detect overfitting. A common split ratio, used here, is 80% for training and 20% for testing.

MODELLING

MODEL SELECTION

We consider seven (7) models for churn prediction:

  1. Logistic Regression
  2. Gaussian Naive Bayes
  3. Random Forest Classifier
  4. KNeighbors Classifier
  5. Decision Tree Classifier
  6. Gradient Boosting Classifier
  7. Support Vector Classifier (SVC)

We evaluate each model’s performance using cross-validation to ensure reliable metrics. The key evaluation metrics used are accuracy, F1-score, ROC AUC, recall, and precision.

   Model                          Accuracy   F1         ROC_AUC    Recall     Precision
0  Logistic Regression            0.779801   0.787858   0.860812   0.822161   0.756586
1  GaussianNB                     0.753160   0.776349   0.833820   0.861109   0.706820
2  Random Forest Classifier       0.851120   0.850465   0.922498   0.851272   0.850214
3  KNeighbors Classifier          0.783849   0.807185   0.858068   0.909219   0.725965
4  Decision Tree Classifier       0.780307   0.780452   0.780377   0.784550   0.776528
5  Gradient Boosting Classifier   0.845220   0.848448   0.927025   0.870592   0.827573
6  SVC                            0.807957   0.812783   0.888384   0.837735   0.789360
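A comparison table like the one above can be produced with scikit-learn's `cross_validate`, which scores one model on several metrics at once. The sketch below is illustrative: it uses synthetic data, only three of the seven models, and an assumed five folds, since the article does not state the number of folds used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the resampled training set (hypothetical)
X, y = make_classification(n_samples=600, n_features=10, random_state=12)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=12),
}
scoring = ["accuracy", "f1", "roc_auc", "recall", "precision"]

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    # cross_validate returns one array per metric under "test_<metric>"
    results[name] = {m: cv[f"test_{m}"].mean() for m in scoring}

for name, scores in results.items():
    print(name, {m: round(v, 3) for m, v in scores.items()})
```

Averaging each metric over the folds gives one row per model, which is the shape of the table above.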

Upon thorough analysis of the performance metrics, the Random Forest Classifier (RFC) and Gradient Boosting Classifier emerged as the top-performing models.

Random Forest Classifier (RFC):

With an accuracy of 85.11% and a well-balanced F1 score of 85.05%, the RFC demonstrates strong performance in terms of both recall and precision.
The ROC AUC of 92.25% reflects its excellent discriminatory power, indicating its ability to distinguish between classes effectively.
The recall of 85.13% and precision of 85.02% indicate that it identifies churn cases well while keeping false positives low.

Gradient Boosting Classifier:

Achieving an accuracy of 84.52% and an F1 score of 84.84%, the Gradient Boosting Classifier also showcases a commendable balance between recall and precision.
The high ROC AUC of 92.70% underscores its prowess in effectively discerning between different classes.
Furthermore, the recall of 87.06% and precision of 82.76% contribute to the targeted identification of churn cases while upholding precision levels.

MODEL EVALUATION

A thorough evaluation of the chosen models was conducted on the validation set to determine the best-performing model.

The evaluation primarily involved an in-depth analysis of the models’ performance using the confusion matrix. This matrix offers a detailed breakdown of the models’ predictions, enabling us to make well-informed judgments about their effectiveness.
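To show how a confusion matrix breaks predictions down, the sketch below builds one with `sklearn.metrics.confusion_matrix` and derives precision, recall, and F1 from its four cells. The labels are toy values invented for illustration, not the project's validation results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (1 = churn, 0 = not churn)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

False positives correspond to retention offers spent on customers who would have stayed, while false negatives are churners the model missed, so the two error types carry different business costs.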

MODEL IMPROVEMENT

HYPERPARAMETER TUNING FOR RANDOM FOREST CLASSIFIER

In this phase, we enhance the Random Forest Classifier model through hyperparameter tuning using GridSearchCV. By exploring a predefined parameter grid, the algorithm selects the optimal combination of hyperparameters to maximize the F1-score. This ensures a balanced trade-off between precision and recall.

              precision    recall  f1-score   support

   Not Churn       0.87      0.85      0.86       728
       Churn       0.86      0.87      0.87       755

    accuracy                           0.86      1483
   macro avg       0.86      0.86      0.86      1483
weighted avg       0.86      0.86      0.86      1483
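The tuning step for the Random Forest Classifier can be sketched as follows. The parameter grid, the three folds, and the synthetic data are all assumptions made for illustration; the article does not list the grid values it searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the resampled churn set (hypothetical)
X, y = make_classification(n_samples=500, n_features=10, random_state=12)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=12
)

# Hypothetical grid; the values actually searched are not given in the article
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=12),
    param_grid,
    scoring="f1",  # select the combination with the best cross-validated F1
    cv=3,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(
    y_val, grid.predict(X_val), target_names=["Not Churn", "Churn"]
))
```

After fitting, `GridSearchCV` refits the best combination on the full training set by default, so `grid.predict` uses the tuned model directly. The same pattern applies to other estimators by swapping the model and its grid.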

GRADIENT BOOSTING CLASSIFIER TUNING

Here, we perform hyperparameter tuning using GridSearchCV on a Gradient Boosting Classifier model. It searches over a defined parameter grid and uses the F1-score as the scoring metric to balance precision and recall.

              precision    recall  f1-score   support

   Not Churn       0.87      0.85      0.86       728
       Churn       0.86      0.87      0.87       755

    accuracy                           0.86      1483
   macro avg       0.86      0.86      0.86      1483
weighted avg       0.86      0.86      0.86      1483

After hyperparameter tuning of both models, we compared their validation results (classification report and confusion matrix). Taking into account the trade-off between accuracy and interpretability, along with the business goal of identifying churn customers with high precision, the Gradient Boosting Classifier (GBC) emerges as the optimal model for future churn predictions.

  • Accuracy: Both models showcase similar accuracy rates, with the GBC slightly ahead (86% vs. 85% for Random Forest).
  • Precision: GBC exhibits better precision in predicting churn customers (87% for GBC vs. 85% for Random Forest), indicating a reduced chance of falsely identifying non-churn customers as churners.
  • Recall: GBC demonstrates a recall of 87%, outperforming Random Forest’s 86%, which means GBC identifies a higher proportion of actual churn cases. This, coupled with the significant precision gain, ensures a more precise and targeted identification of churn customers.
  • F1-Score: GBC attains a marginally higher F1-score (87% for GBC vs. 86% for Random Forest), signifying a better balance between precision and recall.

To Sum Up

Our analysis and evaluation of various classification models for predicting customer churn within the Telco network service have yielded valuable insights. After careful consideration of accuracy, precision, recall, and the F1-score, the Gradient Boosting Classifier (GBC) emerged as the optimal model for future churn predictions. Its superior precision, recall, and F1-score demonstrate its ability to effectively identify and target potential churn customers while maintaining a balanced performance.

Based on these findings, we recommend implementing the Gradient Boosting Classifier (GBC) as the primary model for customer churn prediction. However, it is essential to continually monitor and fine-tune the model’s performance as new data becomes available. Additionally, exploring the potential impact of other external factors and incorporating them into the model could further enhance its predictive capabilities.

By deploying the GBC model and leveraging the insights gained from our analysis, Telcos can proactively identify customers at risk of churn and implement targeted retention strategies. This will not only lead to improved customer satisfaction and retention but also contribute to the company’s overall business success and bottom line.

References

Check My GitHub Page for Dataset and more:

Let's Connect on My LinkedIn Profile below:

https://www.linkedin.com/in/isaac-agbogah/

You can also reach out to me on Instagram @fantasticrambo

Special Thanks

To God Almighty for strength, and to my teammate Solomon. I would also like to express my sincere gratitude to the Azubi Africa team for their support in this project, and to thank all of my readers for taking the time to read and react to it. Your feedback has been invaluable, and I have learned a great deal from it. I appreciate you all.


Hi there! I'm Isaac, a Data Analyst, YouTuber, Python programmer, teaching assistant, web designer, and content creator. It's nice to meet you! Connect with me!