Forecasting and Analyzing Customer Churn with Machine Learning and a Streamlit Application to Lower the Rate of Customers Leaving the Organization

JustinJabo
12 min read · Sep 28, 2023

1.0 Introduction (What is Customer Churn?)

Customer churn:

When clients or subscribers stop using a company or service, it’s known as customer churn.

Simply put, it’s the moment a customer stops being your customer.

According to the Telco churn data, a fictional telecom company provided home phone and Internet services to 7,043 customers in California during the third quarter. Which customers have left, stayed, or signed up for its services?

In this project, we seek to determine the probability that a client would leave the company, the primary signs of churn, and the retention tactics and strategies that may be used to prevent this issue.

1.1 The Data

This project’s data is in a CSV file format. The columns that are included in the data are described as follows.

. Gender : Whether the customer is a female or a male.

. SeniorCitizen: Whether a customer is a senior citizen or not.

. Partner: Whether the customer has a partner or not (Yes, No).

. Dependents: Whether the customer has dependents or not (Yes, No).

. Tenure: The duration of the customer’s relationship with the business in months.

. PhoneService: Whether the customer has a phone service or not (Yes, No)

. MultipleLines: Whether the customer has multiple lines or not

. InternetService: The customer’s type of internet service (DSL, Fiber optic, No)

. OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)

. OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)

. DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)

. TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)

. StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)

. StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)

. Contract: The length of the customer’s contract (Month-to-Month, One year, Two year)

. PaperlessBilling: Whether the client receives their bills electronically (Yes, No).

. PaymentMethod: The payment method used by the customer (credit card (automatic), bank transfer (automatic), mailed check, or electronic check).

. MonthlyCharges: The amount charged to the customer each month.

. TotalCharges: The total amount charged to the customer.

. Churn: Whether or not the customer churned (left) (Yes or No)

2.0 Ask Stage

Here, we list the questions that we want answered by the end of the analysis. To guide the analysis, the following hypothesis was put forward along with some questions.

2.1 Hypothesis

Null Hypothesis (H0): The sample has a Gaussian distribution in the numerical features.

Alternative Hypothesis (H1): The sample does not have a Gaussian distribution in the numerical features.

Statistical Normality Tests:

Normality tests are used to determine whether a dataset is normally distributed around its mean. Under a normal distribution, measurements are expected to fall symmetrically around the mean, with roughly equal numbers above and below it.

A Gaussian distribution is a continuous probability distribution that is symmetrical around its center. Its mean, median, and mode are equal.

Popular normality tests include Anderson-Darling and D’Agostino’s K².

This dataset has three numerical features: MonthlyCharges, Tenure, and TotalCharges.

MonthlyCharges does not follow a Gaussian distribution.

Tenure does not follow a Gaussian distribution.

TotalCharges does not follow a Gaussian distribution.

Note: Since none of the numerical features follows a Gaussian distribution, we reject the null hypothesis, which asserts that the sample has a Gaussian distribution in the numerical features.
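As a rough illustration of how such a check can be run, the sketch below applies D’Agostino’s K² test through `scipy.stats.normaltest`. The helper name `normality_report`, the 0.05 significance level, and the synthetic skewed sample are assumptions made for this example; they are not taken from the project notebook.

```python
import numpy as np
from scipy import stats


def normality_report(values, alpha=0.05):
    """Run D'Agostino's K^2 test and say whether H0 (Gaussian) is rejected."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing values before testing
    statistic, p_value = stats.normaltest(values)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        # A small p-value means the data is unlikely to come from a Gaussian.
        "reject_h0": bool(p_value < alpha),
    }


# Synthetic, strongly skewed stand-in for a feature such as TotalCharges.
rng = np.random.default_rng(42)
skewed_sample = rng.exponential(scale=2000.0, size=5000)

report = normality_report(skewed_sample)
print(report)
```

On the real data, the same helper could be called once per numerical column, for example `normality_report(df["MonthlyCharges"])`, assuming the data is loaded in a DataFrame named `df`.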

2.2 Questions

The following questions are necessary for our Exploratory Data Analysis (EDA):

  1. Does longer tenure increase churn?
  2. Is there any pattern in Customer Churn based on gender?
  3. Which type of contract keeps more customers?
  4. What’s the most profitable InternetService type?

2.3. Bivariate Analysis

Observations:

The churn is unaffected by gender.

Customers who are most likely to churn:

1. Those who don’t have a partner.

2. Those who don’t have dependents.

3. Those who have phone service.

4. Those who use fiber optic as their internet service.

5. Those who didn’t subscribe to any extra services (Online Backup, Online Security, etc.).

6. Those who have a month-to-month contract.

7. Those who chose Paperless Billing.

8. Those who pay by Electronic check.

2.4. Multivariate Analysis

Tenure is strongly correlated with TotalCharges, but not with MonthlyCharges.

MonthlyCharges and TotalCharges are also correlated, although the coefficient is below 0.8.
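A minimal sketch of how these pairwise correlations can be computed with pandas is shown here. The column names (`tenure`, `MonthlyCharges`, `TotalCharges`), the helper `numeric_correlations`, and the toy values are assumptions for illustration only.

```python
import pandas as pd


def numeric_correlations(df, columns=("tenure", "MonthlyCharges", "TotalCharges")):
    """Return the Pearson correlation matrix of the numerical features."""
    return df[list(columns)].corr(method="pearson")


# Toy data: TotalCharges is built as tenure * MonthlyCharges, which is only
# an approximation of how the real column behaves.
toy = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 48, 72],
    "MonthlyCharges": [70.0, 30.0, 85.0, 20.0, 95.0, 60.0],
})
toy["TotalCharges"] = toy["tenure"] * toy["MonthlyCharges"]

corr = numeric_correlations(toy)
print(corr.round(2))
```

Because the numerical features are not Gaussian, passing `method="spearman"` to `corr` is a reasonable alternative that relies on ranks rather than raw values.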

Observations:

Customers with shorter service subscription durations (smaller tenure) are more likely to discontinue (to churn).

Higher Monthly Charges customers are more likely to churn, while lower Monthly Charges customers are more likely to stay.

Surprisingly, churn is more likely among customers with smaller TotalCharges.

Churners have a significantly lower median tenure than non-churners.

Compared with non-churners, churners have higher median monthly charges and a significantly smaller interquartile range.

3.0 Data Preparation and Processing

We now arrange the data such that it is ready for analysis. Here, the goals are data consistency and cleanliness.

3.1 Issues with the data

  1. A few of the columns are not relevant.
  2. A few of the columns do not have the appropriate data types.
  3. The values in the payment method column are difficult to read.
  4. There are some missing values in the data.

3.2 Cleaning the Data

Since the purpose of this article is to provide an overview, I will concentrate on the main tasks completed on the DataFrames. The notebook, to which a link will be provided at the conclusion of the article, has the detailed functions.

  1. Remove unnecessary columns.
  2. Convert the affected columns to the appropriate data types.
  3. Rename the data values to improve readability.
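The three cleaning steps above can be sketched roughly as follows. This is not the notebook’s code: the helper `clean_telco`, the choice of `customerID` as the column to drop, and the assumption that `TotalCharges` is loaded as text are all illustrative.

```python
import pandas as pd


def clean_telco(df):
    """Apply the three cleaning steps to a copy of the raw DataFrame."""
    cleaned = df.copy()
    # 1. Remove columns with no predictive value (assumed here: customerID).
    cleaned = cleaned.drop(columns=["customerID"], errors="ignore")
    # 2. Convert TotalCharges to numbers; unparseable entries become NaN.
    cleaned["TotalCharges"] = pd.to_numeric(cleaned["TotalCharges"], errors="coerce")
    # 3. Shorten payment method labels so they are easier to read.
    cleaned["PaymentMethod"] = cleaned["PaymentMethod"].str.replace(
        " (automatic)", "", regex=False
    )
    return cleaned


raw = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "TotalCharges": ["29.85", " "],
    "PaymentMethod": ["Bank transfer (automatic)", "Electronic check"],
})
cleaned = clean_telco(raw)
print(cleaned)
```

Coercing with `errors="coerce"` turns blank strings into missing values, which is one way the gaps mentioned in section 3.1 can surface and then be handled at the imputation step.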

4.0 Answering the Questions

Here, I use the code and visuals to merge the “Analyze” and “Share” phases of the data analysis process.

4.1. Does longer tenure increase churn?

No, the rate of customer churn is lower for those with longer tenure.

4.2. Is there any pattern in Customer Churn based on gender?

The graphic below demonstrates how similar churn is for both genders.

4.3. Which type of contract keeps more customers?

Month-to-month contracts have a significantly higher churn rate than other contract durations.

4.4. What’s the most profitable InternetService type?

The fiber optic InternetService retains a higher number of customers.
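The answers in this section come down to comparing churn rates across groups. A small sketch of that computation, assuming a `Churn` column with `Yes`/`No` values and using a hypothetical helper `churn_rate_by` on toy data:

```python
import pandas as pd


def churn_rate_by(df, column, target="Churn", positive="Yes"):
    """Share of churned customers within each category of `column`."""
    churned = df[target].eq(positive)  # boolean Series: True where churned
    return churned.groupby(df[column]).mean().sort_values(ascending=False)


toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})
rates = churn_rate_by(toy, "Contract")
print(rates)
```

The same helper can be pointed at `gender` or `InternetService`, and the resulting Series can be passed to a bar chart to produce plots similar to the ones discussed above.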

5.0 Feature Processing & Engineering

This section contains the cleaning, dataset processing, and feature creation steps.

5.1 Drop Duplicates

5.2 Creating new features

Since some columns are related to one another, we will combine them to eliminate unnecessary features.

The original columns will then be dropped since they are no longer necessary.

5.3. Impute Missing Values

5.4 Data Imbalance Check

We now have balanced data.

5.5 Dataset Splitting

I split the training data into a training set and an evaluation (test) set.
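One common way to perform this split is scikit-learn’s `train_test_split`. In the sketch below, the 80/20 ratio, the `stratify` option, the random seed, and the toy DataFrame are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared dataset.
toy = pd.DataFrame({
    "tenure": range(1, 21),
    "MonthlyCharges": [20.0 + i for i in range(20)],
    "Churn": ["Yes", "No"] * 10,
})
X = toy.drop(columns=["Churn"])
y = toy["Churn"]

# stratify=y keeps the churn / non-churn ratio the same in both sets.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_eval.shape)
```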

5.6 Features Encoding & scaling

I used the LabelEncoder to convert the target column’s labels into numeric values that ML models can work with.

Apply the same fitted encoder to the test set.
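A minimal sketch of this step, assuming the target holds `Yes`/`No` labels: the encoder is fitted on the training labels only and then reused on the evaluation labels so that both sets share one mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values standing in for the Churn column.
y_train = ["No", "Yes", "No", "No", "Yes"]
y_eval = ["Yes", "No"]

encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)  # learn the mapping on train
y_eval_encoded = encoder.transform(y_eval)        # reuse the same mapping

print(list(encoder.classes_))
print(y_train_encoded.tolist(), y_eval_encoded.tolist())
```

`LabelEncoder` sorts the classes, so `No` maps to 0 and `Yes` maps to 1; `encoder.inverse_transform` converts predictions back to the original labels.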

6.0 Machine Learning Modeling

A major challenge in any machine learning project is choosing an algorithm, since no single algorithm works best for every problem. In general, we must evaluate a set of candidate algorithms and select the better performers for further assessment.

In this project, 6 distinct algorithms are compared. Five of them are implemented in Scikit-Learn; the XGBoost Classifier comes from the separate XGBoost library, which provides a Scikit-Learn-compatible interface.

  1. Logistic Regression
  2. RandomForest Classifier
  3. XGBoost Classifier
  4. K Nearest Neighbors
  5. Support Vector Machines
  6. DecisionTreeClassifier

6.1. Models Comparison

It is important to note that all algorithms were trained with the default hyperparameters. For many machine learning methods, the F1 score is highly sensitive to the hyperparameters selected during training. A more thorough analysis would evaluate a wider range of hyperparameters (not just the default values) before selecting a model (or models) for hyperparameter tuning. However, this is beyond the scope of this article. Here, we use only the default hyperparameters and carry forward the model with the best F1 score for further evaluation. That model is the RandomForest Classifier, which achieves an F1 score of 90%, as shown above.
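The comparison described here can be sketched as a loop over models with default hyperparameters. This is an illustration under assumptions: it uses synthetic data from `make_classification` instead of the churn dataset, and it leaves out the XGBoost Classifier so that the example only depends on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the churn features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Default hyperparameters; random_state is set only for reproducibility.
models = {
    "Logistic Regression": LogisticRegression(),
    "RandomForest Classifier": RandomForestClassifier(random_state=42),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machines": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

f1_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    f1_scores[name] = f1_score(y_eval, model.predict(X_eval))

# Print the models from best to worst F1 score.
for name, score in sorted(f1_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The scores printed for synthetic data say nothing about the churn results; only the structure of the comparison carries over.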

6.2. Evaluation of the chosen Model

Cross-Validation with k-folds.

Cross-validation is a resampling technique used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter, k, which is the number of groups into which a given data sample is split. For this reason, the procedure is commonly called k-fold cross-validation. When a specific value for k is chosen, it can be used in place of k in the name; for example, k=5 means 5-fold cross-validation.

In applied machine learning, cross-validation is mostly used to estimate a model’s skill on unseen data. That is, it uses a limited sample to estimate how the model is expected to perform in general when making predictions on data that was not used during training.

It is a popular method because it is easy to understand and because it typically produces a less biased or less optimistic estimate of model performance than other approaches, such as a simple train/test split.

As can be seen in the graph, the best-performing model, with the highest average cross-validation score, is the RandomForest Classifier.
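A short sketch of 5-fold cross-validation with scikit-learn’s `cross_val_score` follows. The synthetic data, the value k=5, the `f1` scoring choice, and the reduced `n_estimators` (to keep the example fast) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared churn features and target.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# cv=5 trains and evaluates the model 5 times, each time holding out one fold.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")

print("F1 per fold:", scores.round(3))
print("Mean F1:", round(float(scores.mean()), 3))
```

Comparing the mean score across candidate models gives the kind of ranking summarized in the graph mentioned above.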

6.3. Hyperparameter tuning.

For hyperparameter adjustments, the top three models were chosen.

a. RandomForest Classifier.

b. Support Vector Machines.

c. DecisionTree Classifier

Up to this point, we have divided our data into two sets: a testing set used to assess the model’s performance and a training set used to determine the model’s parameters. Hyperparameter tuning is the next phase in the machine learning process. The process of choosing hyperparameters involves evaluating the model’s performance with various combinations of hyperparameters, then choosing the ones that yield the best results based on a selected metric and a validation method.

To tune hyperparameters, we must split the training data once more, this time into training and validation sets. K-fold cross-validation is a widely used technique for hyperparameter optimization. The training set is divided into k equal-sized samples; k-1 samples are used to train the model and the remaining sample is used for validation. This process is repeated k times, so that each sample serves as the validation set once. The k evaluation metrics (in this example, accuracy) are then averaged to produce a single estimate.

It is crucial to emphasize that, contrary to what the picture below suggests, the validation set is utilized for hyperparameter selection rather than for assessing the overall performance of our model.
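The procedure described above maps onto scikit-learn’s `GridSearchCV`, sketched here for a decision tree. The parameter grid, the synthetic data, and k=5 are assumptions; the grids used in the project are in the notebook, not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid: every combination is scored with 5-fold cross-validation
# on the training set only.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Mean validation accuracy:", round(search.best_score_, 3))
# The held-out test set is used only once, after the selection is done.
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

Swapping `scoring="accuracy"` for `scoring="f1"` would select hyperparameters by F1 score instead, in line with the metric used elsewhere in this article.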

a. RandomForest Classifier

Upon hyperparameter adjustment, the model’s performance does not significantly change.

b. Support Vector Machines

The model’s performance does not significantly change following hyperparameter adjustment.

c. DecisionTree Classifier

After adjusting the hyperparameters, the model performs differently: once tuned, the model’s F1 score went up.

6.4. Future predictions

Using the confusion matrix and a few evaluation metrics, we check the performance of each model with its best hyperparameters.

RandomForest Classifier

Based on confusion matrix:

  1. We successfully predicted 869 customers who don’t churn and 976 who churn
  2. There are 167 customers who are predicted to churn when they actually won’t
  3. There are 50 customers who are predicted to not churn when they actually churn

Support Vector Machines

Based on confusion matrix:

  1. We successfully predicted 881 customers who don’t churn and 947 who churn
  2. There are 155 customers who are predicted to churn when they actually won’t
  3. There are 79 customers who are predicted to not churn when they actually churn

DecisionTree Classifier

Based on confusion matrix:

  1. We successfully predicted 801 customers who don’t churn and 962 who churn
  2. There are 235 customers who are predicted to churn when they actually won’t
  3. There are 64 customers who are predicted to not churn when they actually churn
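The counts listed above are enough to recompute the usual metrics by hand. The sketch below treats churn as the positive class (an assumption) and uses a hypothetical helper, `metrics_from_counts`; the counts themselves are copied from the three confusion matrices.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# tn: correctly predicted non-churners, tp: correctly predicted churners,
# fp: predicted to churn but stayed, fn: predicted to stay but churned.
reported_counts = {
    "RandomForest Classifier": dict(tn=869, fp=167, fn=50, tp=976),
    "Support Vector Machines": dict(tn=881, fp=155, fn=79, tp=947),
    "DecisionTree Classifier": dict(tn=801, fp=235, fn=64, tp=962),
}

metrics = {name: metrics_from_counts(**c) for name, c in reported_counts.items()}
for name, values in metrics.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

With churn as the positive class, these counts give the RandomForest Classifier an F1 score of about 0.90, ahead of Support Vector Machines (about 0.89) and the DecisionTree Classifier (about 0.87).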

6.5. Model Feature Importance

RandomForest Classifier

DecisionTree Classifier

7.0. Conclusions and Recommendations.

We have used the Churn dataset from Telco customers to walk through an entire end-to-end machine learning project. The data was first cleaned before being visually analyzed. Next, we used feature engineering to convert the categorical data into numerical variables so that we could create a machine learning model. We experimented with six different machine learning algorithms using the default parameters after converting the data.

Based on model findings, RandomForest Classifier achieves the best Precision on the test set with 0.89.

Given the significant data imbalance in favor of non-churners, it seems appropriate to compare F1 scores to find the best-performing model, since the F1 score combines precision and recall. With an F1 score of 0.90, this would also be the RandomForest Classifier.

It is evident from the best-performing models’ results that F1 scores are not much higher than 90%.

More optimization work needs to be done in order to improve scores and, consequently, prediction power for more business value.

Based on our exploratory data analysis, it appears that this organization is experiencing some problems with its month-to-month customers.

What kind of promotions can this business provide to clients in order to persuade them to sign one- or two-year contracts?

What changes to month-to-month contracts could make them more customer-friendly without detracting from the allure of a one- or two-year contract?

For the RandomForest Classifier, you can see that TotalCharges and some other features have a favorable impact on the predictions.

Given their detrimental effects on the target column, features such as InternetService_fibreoptics and contract_one year should be carefully examined.

Request and Recommendation.

Customers that match the following characteristics deserve our undivided attention.

  1. Contract: Month-to-month
  2. Tenure: Short tenure
  3. Internet service: Fiber optic
  4. Payment method: Electronic check

Find below a link to all the code on GitHub.

[CLICK HERE]
