Predicting Bank Customer Churn with Machine Learning
Predicting customer churn is crucial for businesses aiming to retain customers and reduce losses. By identifying customers likely to leave, companies can implement strategies to improve retention. This project focuses on developing a predictive model using machine learning techniques to identify at-risk customers.
Objectives
The primary objectives of this project are:
- Identify factors that contribute to customer churn.
- Develop a predictive model to classify customers as churned or not churned.
- Evaluate the model’s performance and refine it for better accuracy.
- Provide actionable insights to help the business reduce churn rates.
Project Outline
- Data Import and Exploration
- Data Cleaning and Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Selection and Evaluation
- Model Deployment (GUI Application)
1. Data Import and Exploration
Data Source
The dataset used in this project was sourced from Kaggle: https://www.kaggle.com/datasets/barelydedicated/bank-customer-churn-modeling
This dataset comprises 10,000 rows and 14 columns, providing insights into customer interactions with a bank. Here’s a breakdown of the columns:
1. RowNumber: Represents the row number in the dataset.
2. CustomerId: Unique identifier for each customer.
3. Surname: Customer’s last name.
4. CreditScore: Credit score of the customer.
5. Geography: Customer’s country of residence.
6. Gender: Customer’s gender.
7. Age: Customer’s age.
8. Tenure: Number of years the customer has been with the bank.
9. Balance: Account balance of the customer.
10. NumOfProducts: Number of bank products the customer uses.
11. HasCrCard: Whether the customer has a credit card (1 for yes, 0 for no).
12. IsActiveMember: Whether the customer is an active member (1 for yes, 0 for no).
13. EstimatedSalary: Estimated salary of the customer.
14. Exited: Whether the customer has churned (1 for yes, 0 for no).
Data Import
We imported the dataset using Python libraries such as Pandas and explored its structure, content, and summary statistics.
Viewing the first 5 rows and column names:
Number of rows and columns:
Data types:
Check for null values
We checked for null values and found none in the dataset.
Summary Statistics
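The exploration steps in this section (first rows, shape, data types, null check, summary statistics) can be sketched with pandas as below. This is a minimal illustration, not the project's original code: the file name in the comment is an assumption about how the Kaggle download is saved, and the small frame uses made-up values so that the snippet runs on its own.

```python
import pandas as pd

# With the real data this would be something like:
#   df = pd.read_csv("Churn_Modelling.csv")   # file name is an assumption
# A tiny stand-in frame with made-up values keeps the sketch self-contained.
df = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5, 6],
    "CustomerId": [101, 102, 103, 104, 105, 106],
    "Surname": ["A", "B", "C", "D", "E", "F"],
    "CreditScore": [620, 610, 500, 700, 850, 645],
    "Geography": ["France", "Spain", "France", "France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male", "Female", "Male", "Male"],
    "Age": [42, 41, 39, 33, 43, 44],
    "Tenure": [2, 1, 8, 1, 2, 8],
    "Balance": [0.0, 80000.0, 150000.0, 0.0, 120000.0, 110000.0],
    "NumOfProducts": [1, 1, 3, 2, 1, 2],
    "HasCrCard": [1, 0, 1, 0, 1, 1],
    "IsActiveMember": [1, 1, 0, 0, 1, 0],
    "EstimatedSalary": [100000.0, 110000.0, 115000.0, 90000.0, 80000.0, 150000.0],
    "Exited": [1, 0, 1, 0, 0, 1],
})

print(df.head())             # first 5 rows
print(df.columns.tolist())   # column names
print(df.shape)              # (number of rows, number of columns)
print(df.dtypes)             # data type of each column

null_counts = df.isnull().sum()   # null values per column
print(null_counts)

summary = df.describe()      # summary statistics for the numeric columns
print(summary)
```

On the real dataset, df.shape would be expected to report the 10,000 rows and 14 columns described above.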
2. Data Cleaning and Preprocessing
Dropping irrelevant columns
Irrelevant columns such as RowNumber, CustomerId, and Surname were dropped to simplify the model.
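As a minimal sketch, the drop can be done with the DataFrame drop method. The frame below is a made-up stand-in containing only a few of the dataset's columns.

```python
import pandas as pd

# Tiny stand-in frame (made-up values) with a few of the dataset's columns.
df = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "CreditScore": [600, 650, 700],
    "Exited": [1, 0, 0],
})

# Remove the columns the article treats as irrelevant for prediction.
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])
print(df.columns.tolist())
```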
Handling Categorical Data
Machine learning models require numerical input, so categorical data must be converted into numerical form. This was done using one-hot encoding.
One-Hot Encoding
One-hot encoding converts categorical variables into a series of binary columns, each representing a unique category. For example, if the Geography column has values Germany, Spain, and France, one-hot encoding will create three new columns: Geography_Germany, Geography_Spain, and Geography_France.
There is one catch. If we include all three dummy variables (Geography_France, Geography_Spain, and Geography_Germany) in the model, we introduce perfect multicollinearity, because any one of them can be predicted exactly from the other two. For instance, if Geography_France and Geography_Spain are both 0, then Geography_Germany must be 1.
To avoid multicollinearity, we drop one of the dummy variables. In this case, we drop the first variable.
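One way to do this, sketched here with pandas get_dummies and its drop_first option, is shown below. The small frame is made up, and encoding Gender alongside Geography is an assumption, since the article only walks through Geography.

```python
import pandas as pd

# Made-up rows with the two categorical columns and one numeric column.
df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 35, 51, 29],
})

# drop_first=True drops the first category (in sorted order) of each encoded
# column, so France and Female become the implicit baseline categories.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

A row with Geography_Germany and Geography_Spain both equal to 0 then represents a customer from France.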
3. Exploratory Data Analysis (EDA)
We performed exploratory data analysis to visualize data distributions and correlations between variables.
The target classes are imbalanced, so this has to be handled before modeling.
Correlation
This shows that Age, Balance, activity status (IsActiveMember), and Geography correlate more strongly with churn than the other features.
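Both observations in this section, the class imbalance and the correlations with churn, can also be checked numerically. The sketch below uses a made-up frame with two features; the numbers are illustrative only and do not reproduce the article's plots.

```python
import pandas as pd

# Made-up data: 8 customers, 2 of whom churned (Exited = 1).
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 55, 60],
    "IsActiveMember": [1, 1, 1, 0, 1, 0, 0, 0],
    "Exited": [0, 0, 0, 0, 0, 1, 0, 1],
})

# Share of each class in the target column; unequal shares mean imbalance.
class_share = df["Exited"].value_counts(normalize=True)
print(class_share)

# Pearson correlation of every feature with the target.
corr_with_target = df.corr()["Exited"].drop("Exited").sort_values()
print(corr_with_target)
```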
4. Feature Engineering
The data was split into the independent variables (X) and the dependent variable (y, the Exited column).
Handling imbalanced data
SMOTE was applied to handle imbalanced data, generating synthetic samples for the minority class to balance the dataset.
What is SMOTE?
SMOTE (Synthetic Minority Over-sampling Technique) is a technique used to increase the number of samples in the minority class by creating synthetic examples rather than by over-sampling with replacement. Here’s how it works:
- Identify Minority Class Instances: SMOTE first identifies the minority class instances in your dataset.
- Neighbor Selection: For each instance in the minority class, SMOTE finds its k nearest minority-class neighbors (usually based on Euclidean distance) and randomly selects one or more of them.
- Generate Synthetic Samples: Synthetic samples are generated by taking the difference between the feature vector (sample) under consideration and its nearest neighbor, multiplying this difference by a random number between 0 and 1 and adding it to the feature vector under consideration.
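The three steps above can be sketched in plain NumPy. This is an illustration of the interpolation idea only, not the implementation used in the project: in practice a library version such as the SMOTE class from imbalanced-learn (with its fit_resample method) would normally be used. The function name smote_sketch and its parameters are hypothetical, and for brevity it picks base samples at random instead of looping over every minority instance.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows from the minority-class samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # sample under consideration
        nb = neighbors[base, rng.integers(k)]  # one of its k nearest neighbors
        gap = rng.random()                     # random number in [0, 1)
        # Move from the base sample toward the neighbor by a fraction `gap`.
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Made-up minority-class points with two features.
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_rows = smote_sketch(X_minority, n_new=6, k=2)
print(new_rows)
```

Because each synthetic row lies on the line segment between two existing minority samples, the new points stay inside the region the minority class already occupies.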
The result after handling the imbalance:
Splitting the Dataset into Training and Test Sets
Splitting the dataset into training and test sets is a crucial step in evaluating the performance of a machine-learning model. By doing this, we can ensure that the model is tested on unseen data, providing an unbiased estimate of its generalization to new data.
Importing the Necessary Library
The train_test_split function from the sklearn.model_selection module is used to split the dataset into training and test sets.
Splitting the Data
- X_res: The features after applying SMOTE.
- y_res: The target variable after applying SMOTE.
- test_size=0.20: This parameter sets 20% of the data aside for testing, while the remaining 80% is used for training.
- random_state=40: This parameter ensures that the split is reproducible. Setting a seed value makes the randomness deterministic, ensuring that you get the same train-test split every time you run the code.
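A minimal sketch of the call with these parameters follows. The arrays standing in for X_res and y_res are randomly generated placeholders, not the resampled churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the resampled data: 100 rows, 4 features, balanced classes.
rng = np.random.default_rng(0)
X_res = rng.normal(size=(100, 4))
y_res = np.array([0, 1] * 50)

# 80% of the rows go to training, 20% to testing; the seed fixes the split.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.20, random_state=40
)
print(X_train.shape, X_test.shape)
```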
Standardizing the Data
Standardizing the data (scaling) is another important step. It ensures that each feature contributes equally to the distance metrics used by many machine learning algorithms.
Importing the StandardScaler
The StandardScaler from sklearn.preprocessing is used to standardize the features by removing the mean and scaling to unit variance.
Initializing the Scaler
An instance of StandardScaler is created.
Fitting and Transforming the Training Data
fit_transform(X_train): This method first fits the scaler to the training data by computing the mean and standard deviation of each feature, and then transforms the data using these parameters. The result is that the training data is standardized (mean = 0 and standard deviation = 1 for each feature).
Transforming the Test Data
transform(X_test): This method transforms the test data using the mean and standard deviation computed from the training data. It’s important to use the same parameters to ensure that the test data is scaled in the same way as the training data.
Converting Scaled Data Back to DataFrame
After scaling, the data is converted back to a data frame. This step preserves the feature names, making it easier to understand and work with the scaled data in subsequent steps.
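The scaling steps described above can be sketched as follows. The two small frames are made-up stand-ins for X_train and X_test, and the variable name sc is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the training and test features.
X_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0],
                        "Balance": [0.0, 1000.0, 2000.0, 3000.0]})
X_test = pd.DataFrame({"Age": [35.0], "Balance": [1500.0]})

sc = StandardScaler()

# Learn mean and standard deviation from the training data only, then scale it.
X_train_scaled = pd.DataFrame(
    sc.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# Reuse the training statistics for the test data (no refitting).
X_test_scaled = pd.DataFrame(
    sc.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled)
print(X_test_scaled)
```

Here the test row happens to sit exactly at the training means, so both of its scaled values come out as 0.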
5. Model Selection and Evaluation
When building machine learning models, it’s essential to compare the performance of various algorithms to determine the best fit for your dataset. Here, we implemented and evaluated six different models on the customer churn prediction dataset:
- Logistic Regression
- Support Vector Classifier (SVC)
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
1. Logistic Regression
Logistic Regression is a linear model used for binary classification tasks. It predicts the probability that a given input belongs to a particular class.
Results:
- Accuracy: 76.62%
- Precision: 0.7747
- Recall: 0.7579
- F1 Score: 0.7662
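As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values reported above.

```python
# Precision and recall as reported for Logistic Regression.
precision = 0.7747
recall = 0.7579

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7662
```

The same formula can be applied to the precision and recall of the other models in this section.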
2. Support Vector Classifier (SVC)
SVC is a powerful classifier that works well with high-dimensional spaces and is effective in cases where the number of dimensions exceeds the number of samples.
Results:
- Accuracy: 82.67%
- Precision: 0.8561
- Recall: 0.7902
- F1 Score: 0.8218
3. K-Nearest Neighbors (KNN)
KNN is a simple, instance-based learning algorithm that classifies a sample based on the majority class among its k-nearest neighbors.
Results:
- Accuracy: 81.83%
- Precision: 0.8320
- Recall: 0.8026
- F1 Score: 0.8171
4. Decision Tree Classifier
A Decision Tree is a non-linear classifier that repeatedly partitions the data into subsets, at each step splitting on the feature that best separates the classes.
Results:
- Accuracy: 80.04%
- Precision: 0.7993
- Recall: 0.8082
- F1 Score: 0.8037
5. Random Forest Classifier
Random Forest is an ensemble method that builds multiple decision trees and combines their predictions to get a more accurate and stable result.
Results:
- Accuracy: 86.22%
- Precision: 0.8786
- Recall: 0.8442
- F1 Score: 0.8610
6. Gradient Boosting Classifier
Gradient Boosting is another ensemble technique that builds trees sequentially, with each new tree correcting the errors of the trees trained before it.
Results:
- Accuracy: 83.36%
- Precision: 0.8587
- Recall: 0.8032
- F1 Score: 0.8300
Comparison of Model Performance
To compare the performance of these models, we create a data frame summarizing their accuracy, precision, recall, and F1 scores, and visualize the results using bar plots.
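A minimal sketch of such a comparison is shown below. It trains the six model types on a synthetic dataset from make_classification, so the scores it prints are not the results reported in this article; the hyperparameters are scikit-learn defaults (plus a fixed seed and a higher max_iter), which is an assumption about the original setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled churn data.
X, y = make_classification(n_samples=400, n_features=8, random_state=40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40
)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=40),
    "Random Forest": RandomForestClassifier(random_state=40),
    "Gradient Boosting": GradientBoostingClassifier(random_state=40),
}

# Fit each model and collect its test-set metrics in one row.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results.round(4))

# With matplotlib installed, a bar plot per metric could be drawn with, e.g.:
#   results["Accuracy"].plot(kind="bar")
```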
Accuracy Comparison
This shows that the Random Forest model had the highest accuracy score.
Precision Comparison
From the results, we observe that:
- Random Forest Classifier performs the best in terms of accuracy (86.22%) and precision (0.8786).
- Gradient Boosting Classifier and SVC also perform well, with accuracies of 83.36% and 82.67% respectively.
- Logistic Regression, while simpler, shows lower performance compared to ensemble methods.
- Decision Tree Classifier has a decent performance but is not as robust as the ensemble methods.
- K-Nearest Neighbors performs better than Logistic Regression and Decision Trees but is outperformed by the ensemble methods.
By comparing these models, we can make an informed decision on which model to deploy based on the specific requirements of accuracy, precision, recall, and computational efficiency.
Model Saving
Once the best-performing model has been identified, in this case, the Random Forest Classifier, we can save the trained model and the scaler to disk. This allows us to load and use the model later without needing to retrain it.
Final Training and Scaling
Before saving the model, it’s a good practice to retrain it on the entire resampled dataset to ensure the model has learned from all the available data.
In these steps, we:
- Fit and transform the scaler on the entire resampled dataset (X_res).
- Retrain the Random Forest model using this scaled data.
- Save the trained model and the scaler using joblib.
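These steps can be sketched as below. The data is a synthetic placeholder for X_res and y_res, the file names are hypothetical, and a temporary directory is used so the example leaves nothing behind in the working directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for the resampled dataset.
X_res, y_res = make_classification(n_samples=200, n_features=6, random_state=40)

# Fit the scaler and the model on the full resampled data.
sc = StandardScaler()
X_scaled = sc.fit_transform(X_res)
model = RandomForestClassifier(random_state=40)
model.fit(X_scaled, y_res)

# Save both objects; the file names here are hypothetical.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "churn_model.joblib")
scaler_path = os.path.join(out_dir, "scaler.joblib")
joblib.dump(model, model_path)
joblib.dump(sc, scaler_path)

# Later (for example inside the GUI), load them back and predict.
loaded_model = joblib.load(model_path)
loaded_scaler = joblib.load(scaler_path)
preds = loaded_model.predict(loaded_scaler.transform(X_res[:5]))
print(preds)
```

Saving the scaler together with the model matters because new inputs must be scaled with the same mean and standard deviation the model was trained on.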
Creating the GUI
To make the customer churn prediction model more accessible and practical, a Graphical User Interface (GUI) was built using libraries like Tkinter in Python. This GUI allows users to interact with the model by inputting customer data and receiving churn predictions.
Conclusion
This project successfully developed a Random Forest Classifier model to predict customer churn with an accuracy of 86.22%. By leveraging machine learning techniques and addressing data preprocessing challenges, the model provides actionable insights to help businesses reduce churn rates and retain valuable customers.