Predicting Bank Customer Churn with Machine Learning

Introduction


Predicting customer churn is crucial for businesses aiming to retain customers and reduce losses. By identifying customers likely to leave, companies can implement strategies to improve retention. This project focuses on developing a predictive model using machine learning techniques to identify at-risk customers.


Objectives

The primary objectives of this project are:

  1. Identify factors that contribute to customer churn.
  2. Develop a predictive model to classify customers as churned or not churned.
  3. Evaluate the model’s performance and refine it for better accuracy.
  4. Provide actionable insights to help the business reduce churn rates.

Project Outline

  1. Data Import and Exploration
  2. Data Cleaning and Preprocessing
  3. Exploratory Data Analysis (EDA)
  4. Feature Engineering
  5. Model Selection and Evaluation
  6. Model Deployment (GUI Application)

1. Data Import and Exploration

Data Source

The dataset used in this project was sourced from Kaggle: https://www.kaggle.com/datasets/barelydedicated/bank-customer-churn-modeling

This dataset comprises 10,000 rows and 14 columns, providing insights into customer interactions with a bank. Here’s a breakdown of the columns:

1. RowNumber: Represents the row number in the dataset.
2. CustomerId: Unique identifier for each customer.
3. Surname: Customer’s last name.
4. CreditScore: Credit score of the customer.
5. Geography: Customer’s country of residence.
6. Gender: Customer’s gender.
7. Age: Customer’s age.
8. Tenure: Number of years the customer has been with the bank.
9. Balance: Account balance of the customer.
10. NumOfProducts: Number of bank products the customer uses.
11. HasCrCard: Whether the customer has a credit card (1 for yes, 0 for no).
12. IsActiveMember: Whether the customer is an active member (1 for yes, 0 for no).
13. EstimatedSalary: Estimated salary of the customer.
14. Exited: Whether the customer has churned (1 for yes, 0 for no).

Data Import

We imported the dataset using Python libraries such as Pandas and explored its structure, content, and summary statistics.

We examined the following:

  • The first 5 rows and the column names.
  • The number of rows and columns.
  • The data types of each column.
  • Null values: we checked for missing values and found none in the dataset.
  • Summary statistics for the numeric columns.
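Since the original code and output screenshots are not reproduced here, the sketch below covers each of these exploration steps. It assumes the Kaggle CSV has been downloaded locally as Churn_Modelling.csv.

```python
import pandas as pd

# Load the dataset (file name assumed from the Kaggle download)
df = pd.read_csv("Churn_Modelling.csv")

# First 5 rows and column names
print(df.head())
print(df.columns.tolist())

# Number of rows and columns
print(df.shape)

# Data types of each column
print(df.dtypes)

# Check for null values
print(df.isnull().sum())

# Summary statistics
print(df.describe())
```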

2. Data Cleaning and Preprocessing

Dropping irrelevant columns

Irrelevant columns such as RowNumber, CustomerId, and Surname were dropped to simplify the model.
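A minimal sketch of this step, assuming the DataFrame is named df:

```python
# Drop identifier columns that carry no predictive signal
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])
```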

Handling Categorical Data

Machine learning models require numerical input, so categorical data must be converted into numerical form. This was done using the one-hot encoding technique.

One-Hot Encoding

One-hot encoding converts categorical variables into a series of binary columns, each representing a unique category. For example, if the Geography column has values Germany, Spain, and France, one-hot encoding will create three new columns: Geography_Germany, Geography_Spain, and Geography_France.

But there is a small problem: when we one-hot encode this variable, we create three dummy variables: Geography_France, Geography_Spain, and Geography_Germany. If we include all three dummy variables in our model, we introduce perfect multicollinearity, because any one of these variables can be perfectly predicted from the other two. For instance, if Geography_France and Geography_Spain are both 0, then Geography_Germany must be 1.

To avoid multicollinearity, we drop one of the dummy variables. In this case, we drop the first variable.
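A sketch of the encoding with pandas.get_dummies; encoding Gender in the same call is an assumption about the original implementation:

```python
# One-hot encode categorical columns, dropping the first level of each
# to avoid perfect multicollinearity (the "dummy variable trap")
df = pd.get_dummies(df, columns=["Geography", "Gender"], drop_first=True)
```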

3. Exploratory Data Analysis (EDA)

We performed exploratory data analysis to visualize data distributions and correlations between variables.

The class distribution shows that the data is imbalanced: far more customers stayed than churned, so the imbalance has to be handled before modeling.
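A quick way to confirm the imbalance is to look at the class distribution of the target column:

```python
# Roughly 80% of customers stayed (Exited = 0) and 20% churned (Exited = 1)
print(df["Exited"].value_counts(normalize=True))
```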

Correlation

The correlation analysis shows that Age, Balance, IsActiveMember (activity), and Geography have the strongest correlations with churn.
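A sketch of the correlation check, assuming seaborn and matplotlib are used for the heatmap:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation of each numeric feature with the target
corr = df.corr(numeric_only=True)
print(corr["Exited"].sort_values(ascending=False))

# Heatmap of the full correlation matrix
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()
```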

4. Feature Engineering

The data was split into the independent variables (X) and the dependent target variable (y).
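A minimal sketch, assuming the target column is Exited:

```python
# Independent variables (features) and dependent variable (target)
X = df.drop(columns=["Exited"])
y = df["Exited"]
```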

Handling imbalanced data

SMOTE was applied to handle imbalanced data, generating synthetic samples for the minority class to balance the dataset.

What is SMOTE?

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used to increase the number of samples in the minority class by creating synthetic examples rather than by over-sampling with replacement. Here’s how it works:

  1. Identify Minority Class Instances: SMOTE first identifies the minority class instances in your dataset.
  2. Random Selection: For each instance in the minority class, SMOTE selects one or more of its nearest neighbors (usually based on Euclidean distance).
  3. Generate Synthetic Samples: Synthetic samples are generated by taking the difference between the feature vector (sample) under consideration and its nearest neighbor, multiplying this difference by a random number between 0 and 1, and adding the result to the feature vector under consideration.

After handling the imbalance, both classes contain the same number of samples.
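A sketch of the resampling step using the imbalanced-learn library; the variable names X_res and y_res match the later sections, and the final print shows the balanced class counts:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Oversample the minority (churned) class with synthetic examples
smote = SMOTE(random_state=40)
X_res, y_res = smote.fit_resample(X, y)

# Both classes should now have the same number of samples
print(Counter(y_res))
```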

Splitting the Dataset into Training and Test Sets

Splitting the dataset into training and test sets is a crucial step in evaluating the performance of a machine-learning model. By doing this, we can ensure that the model is tested on unseen data, providing an unbiased estimate of its generalization to new data.

Importing the Necessary Library

The train_test_split function from the sklearn.model_selection module is used to split the dataset into training and test sets.

Splitting the Data

  • X_res: The features after applying SMOTE.
  • y_res: The target variable after applying SMOTE.
  • test_size=0.20: This parameter sets 20% of the data aside for testing, while the remaining 80% is used for training.
  • random_state=40: This parameter ensures that the split is reproducible. Setting a seed value makes the randomness deterministic, ensuring that you get the same train-test split every time you run the code.
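A sketch of the split with these parameters:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the resampled data for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.20, random_state=40
)
```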

Standardizing the Data

Standardizing the data (scaling) is another important step. It ensures that each feature contributes equally to the distance metrics used by many machine learning algorithms.

Importing the StandardScaler

The StandardScaler from sklearn.preprocessing is used to standardize the features by removing the mean and scaling to unit variance.

Initializing the Scaler

An instance of StandardScaler is created.

Fitting and Transforming the Training Data

fit_transform(X_train): This method first fits the scaler to the training data by computing the mean and standard deviation of each feature, and then transforms the data using these parameters. The result is that the training data is standardized (mean = 0 and standard deviation = 1 for each feature).

Transforming the Test Data

transform(X_test): This method transforms the test data using the mean and standard deviation computed from the training data. It’s important to use the same parameters to ensure that the test data is scaled in the same way as the training data.

Converting Scaled Data Back to DataFrame

After scaling, the data is converted back to a data frame. This step preserves the feature names, making it easier to understand and work with the scaled data in subsequent steps.
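A sketch of the scaling steps described above, with variable names mirroring the earlier sections:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on the training data (learn mean and std), then transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data with the same parameters learned from the training data
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames to preserve the feature names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
```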

5. Model Selection and Evaluation

When building machine learning models, it’s essential to compare the performance of various algorithms to determine the best fit for your dataset. Here, we implemented and evaluated six different models on the customer churn prediction dataset:

  1. Logistic Regression
  2. Support Vector Classifier (SVC)
  3. K-Nearest Neighbors (KNN)
  4. Decision Tree Classifier
  5. Random Forest Classifier
  6. Gradient Boosting Classifier
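The original training code is not shown here; the sketch below trains and scores all six models in one loop, using default (assumed) hyperparameters, on the scaled training and test sets. The metric values reported in the following subsections come from the original run and will not be reproduced exactly by re-running this sketch.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

results = {}
for name, model in models.items():
    # Train on the scaled training set and evaluate on the held-out test set
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }
```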

1. Logistic Regression

Logistic Regression is a linear model used for binary classification tasks. It predicts the probability that a given input belongs to a particular class.

Results:

  • Accuracy: 76.62%
  • Precision: 0.7747
  • Recall: 0.7579
  • F1 Score: 0.7662

2. Support Vector Classifier (SVC)

SVC is a powerful classifier that works well with high-dimensional spaces and is effective in cases where the number of dimensions exceeds the number of samples.

Results:

  • Accuracy: 82.67%
  • Precision: 0.8561
  • Recall: 0.7902
  • F1 Score: 0.8218

3. K-Nearest Neighbors (KNN)

KNN is a simple, instance-based learning algorithm that classifies a sample based on the majority class among its k-nearest neighbors.

Results:

  • Accuracy: 81.83%
  • Precision: 0.8320
  • Recall: 0.8026
  • F1 Score: 0.8171

4. Decision Tree Classifier

Decision Trees are non-linear classifiers that partition the data into subsets based on the most significant differentiators.

Results:

  • Accuracy: 80.04%
  • Precision: 0.7993
  • Recall: 0.8082
  • F1 Score: 0.8037

5. Random Forest Classifier

Random Forest is an ensemble method that builds multiple decision trees and merges them to get a more accurate and stable prediction.

Results:

  • Accuracy: 86.22%
  • Precision: 0.8786
  • Recall: 0.8442
  • F1 Score: 0.8610

6. Gradient Boosting Classifier

Gradient Boosting is another ensemble technique that builds trees sequentially, with each new tree correcting the errors made by the previously trained trees.

Results:

  • Accuracy: 83.36%
  • Precision: 0.8587
  • Recall: 0.8032
  • F1 Score: 0.8300

Comparison of Model Performance

To compare the performance of these models, we create a data frame summarizing their accuracy, precision, recall, and F1 scores, and visualize the results using bar plots.
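A sketch of the comparison, reusing the results dictionary built in the training loop above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Summarize the metrics for all models in one table
results_df = pd.DataFrame(results).T
print(results_df)

# Bar plot comparing the models across the four metrics
results_df.plot(kind="bar", figsize=(10, 6))
plt.title("Model Performance Comparison")
plt.ylabel("Score")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```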

Accuracy Comparison

This shows that the Random Forest model had the highest accuracy score.

Precision Comparison

From the results, we observe that:

  • Random Forest Classifier performs the best in terms of accuracy (86.22%) and precision (0.8786).
  • Gradient Boosting Classifier and SVC also perform well, with accuracies of 83.36% and 82.67% respectively.
  • Logistic Regression, while simpler, shows lower performance compared to ensemble methods.
  • Decision Tree Classifier has a decent performance but is not as robust as the ensemble methods.
  • K-Nearest Neighbors performs better than Logistic Regression and Decision Trees but is outperformed by the ensemble methods.

By comparing these models, we can make an informed decision on which model to deploy based on the specific requirements of accuracy, precision, recall, and computational efficiency.

Model Saving

Once the best-performing model has been identified, in this case, the Random Forest Classifier, we can save the trained model and the scaler to disk. This allows us to load and use the model later without needing to retrain it.

Final Training and Scaling

Before saving the model, it’s a good practice to retrain it on the entire resampled dataset to ensure the model has learned from all the available data.

In these steps, we:

  1. Fit and transform the scaler on the entire resampled dataset (X_res).
  2. Retrain the Random Forest model using this scaled data.
  3. Save the trained model and the scaler using joblib.
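A sketch of these three steps using joblib (the file names churn_model.pkl and scaler.pkl are assumptions):

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# 1. Fit the scaler on the entire resampled feature set
scaler = StandardScaler()
X_res_scaled = scaler.fit_transform(X_res)

# 2. Retrain the best model (Random Forest) on all the scaled, resampled data
final_model = RandomForestClassifier()
final_model.fit(X_res_scaled, y_res)

# 3. Persist the model and the scaler to disk
joblib.dump(final_model, "churn_model.pkl")
joblib.dump(scaler, "scaler.pkl")
```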

Creating the GUI

To make the customer churn prediction model more accessible and practical, a Graphical User Interface (GUI) was built using libraries like Tkinter in Python. This GUI allows users to interact with the model by inputting customer data and receiving churn predictions.
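The original GUI code is not included in this article; the following is a heavily simplified sketch of how a Tkinter front end could wrap the saved model. The field list, file names, and layout are assumptions.

```python
import tkinter as tk

import joblib
import pandas as pd

# Load the persisted model and scaler (file names are assumptions)
model = joblib.load("churn_model.pkl")
scaler = joblib.load("scaler.pkl")

# Input fields, assumed to match the order of the training columns
FEATURES = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts",
            "HasCrCard", "IsActiveMember", "EstimatedSalary",
            "Geography_Germany", "Geography_Spain", "Gender_Male"]

root = tk.Tk()
root.title("Customer Churn Predictor")

# One labelled entry box per feature
entries = {}
for row, feature in enumerate(FEATURES):
    tk.Label(root, text=feature).grid(row=row, column=0, sticky="w")
    entry = tk.Entry(root)
    entry.grid(row=row, column=1)
    entries[feature] = entry

result_label = tk.Label(root, text="")
result_label.grid(row=len(FEATURES) + 1, column=0, columnspan=2)


def predict():
    # Read the inputs in training order, scale them, and predict
    sample = pd.DataFrame([[float(entries[f].get()) for f in FEATURES]],
                          columns=FEATURES)
    prediction = model.predict(scaler.transform(sample))[0]
    result_label.config(text="Likely to churn" if prediction == 1 else "Likely to stay")


tk.Button(root, text="Predict", command=predict).grid(row=len(FEATURES),
                                                      column=0, columnspan=2)

root.mainloop()
```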

Conclusion

This project successfully developed a Random Forest Classifier model to predict customer churn with an accuracy of 86.22%. By leveraging machine learning techniques and addressing data preprocessing challenges, the model provides actionable insights to help businesses reduce churn rates and retain valuable customers.
