Predicting Customer Churn with Classification Models.

15 min read · Aug 20, 2023

Introduction

Customer churn is the phenomenon where customers or clients of a business end their relationship with it by ceasing to use its products or services. It is a significant financial concern for most companies, because it means lost revenue from departing customers and resource-intensive efforts to acquire new ones. For example, if a year starts with 500 customers and ends with 480, the churn rate is 20/500 = 4%. If we could predict with reasonable accuracy why and when a customer leaves, the organization could substantially improve its retention strategies.
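The arithmetic in that example can be written as a small helper. This is only a sketch: the function name is ours, and it assumes no new customers joined during the period, so every change in the customer count is a departure.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of starting customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Example from the text: 500 customers at the start, 480 at the end.
rate = churn_rate(customers_at_start=500, customers_lost=500 - 480)
print(f"{rate:.1%}")  # 4.0%
```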

Project Structure

This analysis will adhere to the CRISP-DM (Cross Industry Standard Process for Data Mining) framework.

Business Understanding

The primary objective of this project is to accurately predict the likelihood of a customer leaving a telecommunication company, identify key indicators of churn, and devise effective retention strategies to mitigate this challenge.

Data Understanding

The following describes the columns present in the data.

Gender: Whether the customer is male or female
SeniorCitizen: Whether the customer is a senior citizen or not
Partner: Whether the customer has a partner or not (Yes, No)
Dependents: Whether the customer has dependents or not (Yes, No)
Tenure: The number of months the customer has stayed with the company
PhoneService: Whether the customer has a phone service or not (Yes, No)
MultipleLines: Whether the customer has multiple lines or not
InternetService: The customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
Churn: Whether the customer churned or not (Yes, No)

Hypothesis

Null Hypothesis: The monthly charges, total charges and tenure do not have a significant impact on the churn rate.

Alternative Hypothesis: The monthly charges, total charges and tenure have a significant impact on the churn rate.

Questions

After the exploration of the data, the following questions have been derived for subsequent analysis.

Which gender exhibited the highest churn?
Is there a significant association between gender and churn?
Does the presence of dependents affect customer churn?
Which gender pays more monthly charges?
Does the presence of dependents affect monthly charges?
Do paperless billing and payment methods influence churn?
Is there a correlation between senior citizens and churn?

Exploratory Data Analysis

In this section, we examine and explore the datasets, outlining their attributes through data visualization.

EDA On Telco Churn 2000

Here is a brief overview of the structure and content of the data. There are 2043 rows and 21 columns, with no duplicates and no missing values. However, the “TotalCharges” column is currently stored as an object; it should be a float so that its values are treated as numbers. Also, the target variable (Churn) takes the values “yes” and “no.”
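One common way to make that conversion in pandas is `pd.to_numeric`. The sketch below uses a small made-up frame rather than the real data, and the blank entry is an assumption added to show what happens when a value cannot be parsed.

```python
import pandas as pd

# Hypothetical sample; the blank string stands in for an unparseable entry.
df = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

print(df["TotalCharges"].dtype)         # float64
print(df["TotalCharges"].isna().sum())  # 1
```

Any NaN values produced this way would then need the same missing-value handling as the rest of the data.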

EDA On Telco Churn 3000

Here is an overview of the structure and content of the data. There are 3000 rows and 21 columns. There are no duplicates, but some of the columns do have missing values. The target variable (Churn) takes the values “true” and “false.”

Comparing the two datasets shows that some of the data types are inconsistent. For instance, several columns are stored as booleans in the telco churn 3000 dataset but as objects in the telco churn 2000 dataset. Additionally, the “SeniorCitizen” column is an integer with values of 0 and 1 in the telco churn 2000 dataset but a boolean in the telco churn 3000 dataset. Consequently, adjustments are necessary to make the data types uniform.

Handling missing values in the telco churn 3000 dataset

Since the “telco churn 2000” dataset has no missing values, we investigated whether it could help us address the missing values in the “telco churn 3000” dataset. We first selected the columns that contain missing values, then compared them with the corresponding columns in the other dataset.

Two columns, “TotalCharges” and “Churn,” were intentionally excluded from this comparison, because we handle their missing values with a separate approach.

The outcome for the remaining columns shows a clear pattern. The value “true” in the telco churn 3000 dataset corresponds to “yes” in the telco churn 2000 dataset, and “false” corresponds to “no.” The missing values, which appear as “none,” were therefore replaced with the appropriate corresponding value.
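A mapping like this can be applied column by column. The following is a minimal sketch on a made-up frame: `df_3000`, the column subset, and especially `FILL_VALUE` are our assumptions, since the exact replacement depends on the column being filled.

```python
import pandas as pd

# Hypothetical slice of the boolean-style dataset, with one missing entry
df_3000 = pd.DataFrame({
    "Partner": [True, False, True],
    "OnlineSecurity": [True, None, False],
})

bool_to_label = {True: "Yes", False: "No"}
FILL_VALUE = "No"  # assumption: stand-in for the column-appropriate replacement

for col in ["Partner", "OnlineSecurity"]:
    # map() leaves entries that are not in the dictionary (here None) as NaN,
    # and fillna() then replaces them with the chosen label
    df_3000[col] = df_3000[col].map(bool_to_label).fillna(FILL_VALUE)

print(df_3000)
```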

A box plot was employed to visualize the “TotalCharges” column. This helps to determine the appropriate way of handling the missing values in the column.

The box plot shows no outliers, so we chose to impute the missing values with the mean, which is a reasonable summary when no extreme values distort it.

For the churn column, since there is only one missing value, we decided to fill it with the mode which represents the most frequently occurring value.
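Both imputations are one-liners in pandas. The sketch below uses a tiny made-up frame to show the idea, not the actual values from the dataset.

```python
import pandas as pd

# Made-up rows with one gap in each column
df = pd.DataFrame({
    "TotalCharges": [100.0, None, 300.0, 200.0],
    "Churn": ["No", "No", None, "Yes"],
})

# Numeric column: fill gaps with the mean of the observed values
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mean())

# Categorical column: fill gaps with the most frequent value (the mode)
df["Churn"] = df["Churn"].fillna(df["Churn"].mode()[0])

print(df)
```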

Having established a strategy to tackle the aforementioned concerns, we proceeded to concatenate the data and subsequently performed data cleaning on the merged dataset.

EDA on the test set

For the test set, only two columns were modified. The “TotalCharges” column was converted from an object type to a float, and the numeric values of 0 and 1 in the “SeniorCitizen” column were replaced with the labels “no” and “yes.”

Following the exploratory data analysis, we proceeded to conduct the univariate, bivariate, and multivariate analysis of the columns. Subsequently, we progressed to hypothesis testing.

Hypothesis Testing

The hypothesis was examined using logistic regression, a regression technique for predicting binary outcomes. It is a suitable way to examine how multiple predictor variables (independent variables) influence a binary categorical target variable (dependent variable).

The “P>|z|” column gives the p-value, which indicates the statistical significance of each predictor variable. The p-value for the “MonthlyCharges” variable is reported as 0 at the displayed precision. Being below the 5% threshold, it indicates that “MonthlyCharges” has a statistically significant relationship with churn. Similarly, the predictor “Tenure” is also significant. However, the “TotalCharges” variable has a p-value above the threshold, so it is not a significant predictor of churn once the other two variables are in the model.

The sign of the coefficient (coef) indicates the direction of the relationship between the predictor variable and the probability of churn. A positive coefficient suggests that an increase in the predictor variable is associated with a higher likelihood of churn, while a negative coefficient suggests the opposite. Since the coefficient for ‘MonthlyCharges’ is positive, the likelihood of churn increases as the monthly charges increase. The coefficient for ‘TotalCharges’ has the same sign, although it is not statistically significant. For ‘Tenure’, the coefficient shows that as the tenure increases, the likelihood of churn decreases.

The LLR (log-likelihood ratio) p-value tests the three predictors jointly by comparing the fitted model against a model with no predictors. Since this value is lower than 5%, the predictor variables taken together do have an impact on churn. With this conclusion, we reject our null hypothesis.

Answering Questions

1. Which gender exhibited the highest churn?

Heatmap showing the frequency of Churn by Gender

The results indicate that there is a slightly higher churn rate among females compared to males. This outcome prompted us to investigate further whether there is a statistically significant relationship between gender and churn behavior.

2. Is there a significant association between gender and churn?

To know whether the observed difference in churn rates between males and females is statistically significant, we use the Chi-Square test. This test helps determine if there’s a relationship between two categorical variables, in this case, gender and churn.

The Chi-Square test yields a p-value. If the p-value is below the 5% threshold, it suggests a significant association between gender and churn. If the p-value is greater than 0.05, there is not enough evidence to conclude that the two categorical variables are associated.

Since the p-value is greater than 5%, this suggests that any difference in churn rates between males and females is likely due to random variation rather than a true underlying relationship. As a result, gender is not a significant factor influencing churn rates.
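To make the test concrete, here is a hand-rolled sketch of the Pearson Chi-Square test for a 2x2 table. The counts are made up, the function name is ours, and no continuity correction is applied, so a library routine such as SciPy's `chi2_contingency` may report slightly different numbers for 2x2 tables.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: rows = gender (female, male), columns = (churned, stayed)
observed = [[130, 370], [125, 375]]
stat, p = chi_square_2x2(observed)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
print("significant" if p < 0.05 else "not significant at the 5% level")
```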

3. Does the presence of dependents affect customer churn?

We also utilize the Chi-Square test in this scenario since the question deals with checking the relationship between two categorical variables, in this case, dependents and churn.

The p-value is below the 5% threshold, which indicates a statistically significant association between the presence of dependents and customer churn.

4. Which gender pays more monthly charges?

On average, female customers pay slightly higher monthly charges in comparison to male customers. The observation made above offers valuable insight into the spending behavior of male and female customers. This information can be leveraged by the telecommunications company to tailor its marketing strategies and service offerings based on the preferences and needs of different genders. By recognizing these spending patterns, the company can enhance its customer engagement and satisfaction while potentially increasing its revenue through more effective targeting.

5. Does the presence of dependents affect monthly charges?

On average, customers without dependents pay higher monthly charges compared to customers with dependents. This information shows that people who don’t have family members to take care of tend to pay more for their phone services each month compared to those who have family members. This means that people with families might choose cheaper plans or use their phones differently. For the company, this could help them make plans and prices that suit different types of customers better. It also provides an opportunity to create targeted marketing campaigns that address the specific preferences and requirements of customers with and without dependents, potentially enhancing customer satisfaction and retention.

6. Do paperless billing and payment methods influence churn?

Churn Rate By Paperless Billing And Payment Method

Based on the graph, we can see that customers who opt for paperless billing and pay through electronic checks have a higher churn rate than those using credit cards, who have the lowest churn rate. Similarly, customers who don’t use paperless billing and pay via electronic checks also have a higher churn rate than those using credit cards. Among customers using electronic checks, those who prefer paperless billing churn more frequently than those who don’t.

In conclusion, the telecommunication company should focus on promoting paperless billing with credit cards and enhancing the electronic check experience as key measures to mitigate the issue of high churn rate.

7. Is there a correlation between senior citizens and churn?

The p-value is below the 5% threshold, which indicates a statistically significant association between being a senior citizen and customer churn.

Power BI visualization

Once we addressed the questions mentioned earlier, we moved forward to implement these graphs in Power BI. We assumed the name of the telecommunication company was Vodafone.

Feature Selection

To train a model, we collect large quantities of data to help the machine learn better. However, too much unnecessary data can make a model slow and inaccurate, because it is trained on irrelevant information. It is therefore essential to choose the features that provide value to our model. By doing so, we ensure that the model is efficient, accurate, and focused on the most relevant aspects of the data. For this selection, we used the phi-k correlation. Phi-k measures the strength of the relationship between features and the target variable on a scale from 0 to 1. It works across categorical, ordinal, and numerical features, but it does not indicate the direction of a relationship.

Based on the result above, we dropped features whose phi-k correlation with the target was below the threshold of 10% (0.10).
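The thresholding step itself is simple once the correlations are available. In the sketch below the scores are illustrative placeholders we made up, not the values obtained in this project, and `phik_with_churn` is a hypothetical name.

```python
# Hypothetical phi-k scores between each feature and Churn.
# These numbers are illustrative only, not results from the project.
phik_with_churn = {
    "Contract": 0.25,
    "Tenure": 0.47,
    "Gender": 0.01,
    "PhoneService": 0.02,
    "MonthlyCharges": 0.36,
}

THRESHOLD = 0.10  # the 10% cut-off

selected = [name for name, score in phik_with_churn.items() if score >= THRESHOLD]
dropped = [name for name, score in phik_with_churn.items() if score < THRESHOLD]

print("keep:", selected)
print("drop:", dropped)
```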

Feature Scaling

After completing feature selection, we divided our dataset into two subsets: the training set (80%) and the test set (20%). Most machine learning algorithms operate on numbers, so we used label encoding to convert the labels in y_train and y_test into numerical values. For the remaining categorical columns we used one-hot encoding, and for the numerical columns we applied the standard scaler, because some of our chosen algorithms require scaled input features.
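With scikit-learn, these steps could be sketched as follows. The frame is made up, the column subset is our own choice for illustration, and stratifying the split on the target is an assumption rather than something stated above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Tiny made-up frame with one categorical and two numeric features
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month", "One year"] * 4,
    "Tenure": [1, 12, 48, 3, 24] * 4,
    "MonthlyCharges": [70.0, 55.5, 90.1, 29.9, 64.0] * 4,
    "Churn": ["Yes", "No", "No", "Yes", "No"] * 4,
})

X = df.drop(columns="Churn")
y = df["Churn"]

# 80/20 split; stratify=y (our assumption) keeps the churn ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Label-encode the target: classes are sorted, so "No" -> 0 and "Yes" -> 1
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# One-hot encode categorical columns and standardize numeric ones
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("numeric", StandardScaler(), ["Tenure", "MonthlyCharges"]),
])

# Fit on the training data only, then reuse the fitted transformer on the test data
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the encoders and the scaler on the training portion only avoids leaking information from the test portion into the model.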

Addressing Data imbalance

The pie chart reveals that a significant majority of customers (73.5%) remain loyal, whereas a minority (26.5%) have discontinued their association with the company. This disparity between retained and churned customers means the dataset is imbalanced. Machine learning algorithms tend to be biased toward the majority class because of its larger representation, so a model can perform well on the majority class but poorly on the minority class.

To address this imbalance, we employed class weights and SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is a popular technique for classification problems in which the minority class is underrepresented. It creates new synthetic samples for the minority class by interpolating between existing minority samples and their nearest neighbors, so that the model sees the minority class about as often as the majority class. This helps the model generalize better to the minority class and reduces the bias towards the majority class. Class weights take a different route: they make errors on the minority class count more during training, which pushes the model to separate it better from the majority class.
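One widely used way to derive class weights from the class distribution is the “balanced” heuristic, where each weight is inversely proportional to the class frequency. The sketch below assumes that heuristic and uses labels that mirror the 73.5% / 26.5% split; the function name is ours.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Labels matching the 73.5% / 26.5% split described above
labels = ["No"] * 735 + ["Yes"] * 265
weights = balanced_class_weights(labels)

print({cls: round(w, 3) for cls, w in weights.items()})
# {'No': 0.68, 'Yes': 1.887}
```

The minority class ends up with a weight almost three times larger than the majority class, which is what makes mistakes on churned customers more costly during training.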

Model Processing

For our binary classification task, we will utilize models such as Logistic Regression, Decision Tree, Random Forest, Gradient Boosting (XGBoost), and Support Vector Machines (SVM).

Logistic Regression: This straightforward algorithm is suitable for tasks like predicting whether customers will churn. It offers insight into the importance of different features through a number called the “coefficient,” which tells us how much a specific feature influences the prediction.
Decision Trees: Decision trees repeatedly divide the data based on key features. They are easy to interpret and can capture complicated, non-linear patterns in the data.
Random Forest: This technique combines multiple decision trees to enhance accuracy and reduce overfitting. It handles complex relationships and patterns in the data effectively.
XGBoost: This gradient-boosting algorithm is known for its adaptability and strong performance. It often delivers good results without extensive tuning and handles missing data and complex patterns effectively.
Support Vector Machines (SVM): SVMs work well when there are many features and when the relationships between them are complex. They perform best when the classes of interest are clearly separated, in which case they are effective at telling the groups apart.

The class weights were calculated from the class distribution in the training data, and pipelines were set up for all the models, both with those class weights and with SMOTE. We then fitted the models on the training data and made predictions on our evaluation set (the 20%). After this, a classification report was generated to assess metrics such as precision, accuracy, recall, F1-score, and F2-score. This report provides insight into how well the models predict both classes and how effectively they handle imbalanced data.
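As a rough illustration of the class-weight variant, here is a sketch of one such pipeline in scikit-learn. The data is synthetic, `class_weight="balanced"` stands in for the weights we computed, and the SMOTE variant (which needs the imbalanced-learn package) is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 73% / 27%) standing in for the churn features
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.735, 0.265], random_state=42,
)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling and the weighted model live in one pipeline, so the scaler
# is fitted on the training data only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_eval)

# classification_report covers precision, recall, F1 and accuracy;
# the F2-score is computed separately with fbeta_score
print(classification_report(y_eval, y_pred))
f2 = fbeta_score(y_eval, y_pred, beta=2)
print(f"F2-score: {f2:.3f}")
```

The same structure can be repeated for the other models by swapping the final pipeline step.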

Model Selection

F2-score is an important evaluation metric for classification models. It is similar to the F1-score but weights recall more heavily than precision. This makes it useful when missing a positive case (a customer who churns) is more costly than raising a false alarm.
F1-score is widely used to balance precision and recall equally in classification tasks.
Accuracy calculates the ratio of correctly predicted outcomes to all outcomes. In imbalanced datasets, however, accuracy can be misleading.
Precision calculates the ratio of correctly predicted positive outcomes to all predicted positive outcomes. A higher precision indicates that the model is making fewer false positives, that is, cases where it predicts churn when the customer did not actually churn.
Recall measures the ability of a model to correctly identify positive instances out of all actual positive instances in the dataset. A high recall score indicates that the model is good at finding the positive instances when they exist in the data.
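These definitions translate directly into a few formulas. The sketch below uses made-up counts of true positives, false positives and false negatives, and the function names are ours.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


def f_beta(p: float, r: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Made-up counts for illustration
tp, fp, fn = 80, 40, 20

p = precision(tp, fp)      # 80 / 120
r = recall(tp, fn)         # 80 / 100
f1 = f_beta(p, r, beta=1)
f2 = f_beta(p, r, beta=2)

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
# precision=0.667 recall=0.800 F1=0.727 F2=0.769
```

Because recall is higher than precision in this example, the F2-score comes out above the F1-score.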

Based on these observed metrics, we selected the model with the highest precision and F2-score. This was the logistic regression model with class weights.

Confusion Matrix For Chosen Model

A confusion matrix is a tabular representation of the performance of a classification model. It helps in understanding the accuracy and error types of the model’s predictions by comparing the predicted labels with the actual labels.

The visualization shows a high true negative (TN) count of 523. This means the model correctly identified 523 customers who remained with the company.
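To show where the four cells of the matrix come from, here is a small sketch that counts them by hand. The labels are made up (1 = churned, 0 = stayed) and the function name is ours.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN and TP by comparing actual and predicted labels."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            counts["TP"] += 1   # churner correctly flagged
        elif actual == positive:
            counts["FN"] += 1   # churner missed
        elif predicted == positive:
            counts["FP"] += 1   # loyal customer wrongly flagged
        else:
            counts["TN"] += 1   # loyal customer correctly identified
    return counts


# Made-up labels: 1 = churned, 0 = stayed
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

print(confusion_counts(y_true, y_pred))
# {'TN': 4, 'FP': 1, 'FN': 1, 'TP': 2}
```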

K-Fold Cross Validation For Chosen Model

K-fold cross-validation is used to assess the performance and generalization ability of a model. It involves dividing the dataset into “k” subsets (folds); we used k = 10. The model is then trained and evaluated 10 times, each time using a different fold as the validation set and the remaining folds as the training set. This technique helps in estimating how well the model will perform on unseen data and provides a more reliable evaluation than a single train-test split.

The model’s average accuracy over the 10 folds of K-Fold Cross Validation is seen to be 75.11%. This average accuracy showcases how well the model predicts churn or non-churn across various portions of the training data.

The graph allows you to see how the model’s accuracy changes across different folds of the dataset. It provides insights into how steady the model’s performance is and aids in spotting any potential fluctuations in accuracy among different portions of the data.
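A 10-fold run of this kind could be sketched with scikit-learn as below. The data is synthetic, and shuffling the folds with a fixed seed is our assumption, so the accuracies printed here are unrelated to the 75.11% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic imbalanced data standing in for the training set
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.735, 0.265], random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)

# 10 folds: each fold serves once as the validation set
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.round(3))  # one accuracy per fold
print(f"mean accuracy: {scores.mean():.4f}")
```

Plotting `scores` against the fold number gives the kind of per-fold accuracy graph described above.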

Hyperparameter Tuning

Hyperparameters are configuration settings that are chosen before the training process and that influence the behavior of the algorithm. The objective of hyperparameter tuning is to find the combination of hyperparameters that leads to the best model performance on unseen data. We applied the grid search technique in this project. Grid search evaluates every combination in a predefined set of hyperparameter values for a model.

After performing the Grid search, we got the best model that will be used for our predictions. Following the prediction step, we generated a dataframe to hold the prediction results. Subsequently, we combined these results with the initial test dataset through concatenation.
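The tuning and prediction steps could look roughly like the sketch below. The grid over `C`, the F2 scorer, the synthetic data, and the `PredictedChurn` column name are all our assumptions for illustration; the grid actually searched in the project may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic data: the first 500 rows play the training set, the rest the test set
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.735, 0.265], random_state=1
)
X_train, y_train = X[:500], y[:500]
X_new = X[500:]

# Hypothetical grid over the regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid=param_grid,
    scoring=make_scorer(fbeta_score, beta=2),  # rank candidates by F2-score
    cv=5,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# Predict with the best model and attach the predictions to the test data
test_df = pd.DataFrame(
    X_new, columns=[f"feature_{i}" for i in range(X_new.shape[1])]
)
predictions = pd.DataFrame(
    {"PredictedChurn": search.best_estimator_.predict(X_new)}
)
results = pd.concat([test_df, predictions], axis=1)
print(results.head())
```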

Conclusion

In conclusion, the examination of customer churn offers insights that can steer businesses toward enhancing customer retention and overall operational success.

Reference

Here is a link to my GitHub repository to view the notebook.

Appreciation

I highly recommend Azubi Africa for its comprehensive and effective programs. Read more articles about Azubi Africa here and take a few minutes to visit this link to learn more about Azubi Africa’s life-changing programs.