Building an ML Classification Model with Python — Telco Customer Churn Prediction

Andrew Obando
23 min read · Aug 20, 2023


Customer Churn Classification | Supervised Machine Learning

Imagine if a business could predict whether customers are likely to leave or unsubscribe from its services, and do what it can to keep them. It’s like a superpower, or being a fortune teller. They say it’s cheaper and more efficient to keep an existing customer than to obtain a new one: in the long run, acquiring a new customer can cost five times more than retaining an existing customer.

Customer churn is the loss of customers who were previously subscribed or making regular purchases and who discontinue their association with your company within a specified timeframe. The churn rate says a lot about a business and its ability to retain customers, and is a predictor of future success.

In this project, we use Supervised Machine Learning (classification) to explore the significance of churn analytics. Supervised machine learning is a type of machine learning where the algorithm learns from labeled data. Classification is a type of supervised machine learning used to predict the category or class label of a new data instance based on its input features. In classification, the target variable is categorical, meaning it takes on a finite number of discrete values or classes. Examples of classification tasks include spam detection (classifying emails as spam or not spam), sentiment analysis (classifying text as positive, negative, or neutral), and image recognition (classifying images into different categories).

The project followed the Cross Industry Standard Process for Data Mining (CRISP-DM) framework, which serves as the base for a data science process.

Table of Contents

· Business Understanding
Primary Objective
Hypothesis
Analytical Questions
· Data Understanding
Issues with data
Hypothesis Testing
· Data Cleaning
Clean First Training Dataset
Clean Second Training Dataset
Merge Final Training Dataset
· Exploratory Data Analysis
Target Class Distribution
Analytical Questions
· Feature Engineering
Feature Importance
Dataset Split
Label Encoder
One-Hot Encoding
Feature Scaling
Train set Balancing (SMOTE Algorithm)
Train-Test Split
· Modelling
· Evaluation
K-Fold Cross-Validation
Classification Report
Hyperparameter tuning
Randomized Search CV
Confusion Matrix
· References
· Appreciation

Business Understanding

The churn analytics predictive model is a data-driven solution designed to address the persistent challenge of customer churn in subscription-based industries. It can be used as a strategic tool for telecommunication companies to proactively identify potential risk factors for churn, optimize retention efforts, and cultivate lasting customer relationships.

Up to 65% of a company’s business comes from existing customers. It’s no secret that repeat clients and customers drive a huge chunk of business revenue. According to this Harvard Business School report, on average, a 5% increase in customer retention rates results in a 25% to 95% increase in profits. Customers have many options and evolving preferences, so accurately predicting churn has become important to retain and satisfy them.

This classification model aims to identify customers at risk of churn, enabling businesses to take proactive measures and implement targeted retention strategies.

Primary Objective

The objective is to create a classification model that evaluates customer data and provides insights into customer behavior, preferences, and patterns to predict the likelihood of customer churn. Model performance is assessed through metrics like accuracy, precision, recall and F1-score.

The business objective is ultimately to be able to reduce customer churn rates and retain valuable customers. Armed with the model’s insights, businesses can create personalized offers, marketing campaigns, and proactive customer support initiatives, thus improving customer satisfaction and fostering loyalty.

Hypothesis

  • Null Hypothesis (H0): There is no significant relationship between customer tenure and churn rate in the telecom company.
  • Alternative Hypothesis (Ha): There is a significant relationship between customer tenure and churn rate in the telecom company.

Analytical Questions

  1. How does customer tenure relate to churn rates? Are long-tenured customers more likely to stay with the company, and do new customers exhibit higher churn behavior?
  2. Is there a correlation between the total charges and churn rates? Do customers with higher total charges exhibit different churn behavior compared to those with lower total charges?
  3. What is the impact of contract type on churn rates? Do customers on long-term contracts have significantly lower churn rates compared to those on short-term contracts?
  4. Are there significant differences in churn behavior between customers who have device protection and those who don’t?
  5. What is the relationship between the availability of tech support and churn rates? Are customers with access to tech support more likely to remain with the company?
  6. Do streaming services play a role in customer churn? Are customers with streaming services, such as StreamingTV and StreamingMovies, more likely to stay with the company?
  7. How does the choice of payment method impact churn rates? Are customers with specific payment methods more prone to churn than others?

Data Understanding

The first step is to collect and analyze our data for the project, explore it and verify its quality and relevance. We install the necessary libraries/packages for the project and source our data from multiple sources, including a remote Microsoft SQL server. You can view the Jupyter notebook for the project to get more detailed insights here.

Our data consists of 3 datasets, two for training and one for testing:

  • data = LP2_Telco_Churn_first_3000 (the first part of our training data)
  • data_xls = Telco-churn-second-2000 (our test data, without the target variable)
  • data2 = LP2_telco-churn-last-2000 (the second part of our training data)
# View columns and rows of datasets

(data.shape, data_xls.shape, data2.shape)
Columns and rows of the 3 datasets

After using the pandas.read_csv function to load our data, here is a description of our datasets:

A majority of the columns in our dataset are categorical

Issues with data

  • Train datasets need to be merged; however, they have inconsistent data types that need to be synchronized
  • We need to remove the CustomerID column from the training data as it is not necessary for our analysis
  • Missing values need to be imputed or deleted
  • TotalCharges column needs to be changed to numeric dtype from object dtype

Hypothesis Testing

We want to see if there is a relationship/correlation between tenure and churn. Since churn is a binary outcome (Yes or No), we want to use point-biserial correlation and assess its significance using a correlation test.

from scipy import stats

# 'Churn' should be a binary variable (0 for no churn, 1 for churn)
df_hypo = Data_All.copy()
df_hypo['Churn'] = df_hypo['Churn'].replace({'No': 0, 'Yes': 1})

# Calculate the point-biserial correlation coefficient
correlation_coefficient, p_value = stats.pointbiserialr(df_hypo['Churn'], df_hypo['tenure'])

# Print the results
print(f"Point-Biserial Correlation Coefficient: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

# Set the significance level (5%)
alpha = 0.05

# Determine if the correlation is statistically significant
if p_value < alpha:
    print("There is a significant correlation between customer tenure and churn rate.")
else:
    print("There is no significant correlation between customer tenure and churn rate.")
Point-Biserial Correlation | Data Understanding

The correlation coefficient of -0.3526 suggests a moderate negative correlation. This means that as customer tenure increases, the churn rate tends to decrease. Customers who have been with the company for a longer period are less likely to churn. This actually makes sense if you think about it. Longer tenure means customer loyalty and retention.

The p-value is less than the significance level, which means there is strong statistical evidence to reject the null hypothesis (no correlation) in favor of the alternative hypothesis (significant correlation).

Data Cleaning

This step includes selecting, cleaning and formatting our data. It takes up about 80% of any machine learning project. Keep in mind that data cleaning is an iterative process as it involves multiple rounds of processing throughout a machine learning project.

Clean First Training Dataset

data.info()
Data from our first 3000 rows
  • We remove the customerID column. It serves no purpose in explaining whether or not a customer will churn; it is just a unique identifier.
del data["customerID"]
  • We convert the bool dtypes to object dtypes for easier analysis.
# Convert bool columns to object dtypes

bool_to_obj = {'Partner': object,
               'SeniorCitizen': object,
               'Dependents': object,
               'PhoneService': object,
               'PaperlessBilling': object}

data = data.astype(bool_to_obj)
  • We then format our object column values and change them (from True/False to Yes/No) for uniformity and easier legibility.
# Replace True and False values with Yes and No in object columns

replace_values = {True: 'Yes', False: 'No'}

yes_no_columns = ['Partner', 'SeniorCitizen', 'Dependents', 'PhoneService',
                  'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
                  'DeviceProtection', 'TechSupport', 'StreamingTV',
                  'StreamingMovies', 'PaperlessBilling', 'Churn']

# Use the replace() method on each column
for column in yes_no_columns:
    data[column] = data[column].replace(replace_values)
  • We have one missing value in our target variable Churn which we drop. The missing values in TotalCharges have a tenure of 0, even though there are values in the MonthlyCharges column. A customer with a tenure of 0 and 0 total charges should not have any monthly charges. We can simply drop those rows (5) to avoid confusion.
# Drop rows where tenure = 0

data.drop(labels=data[data['tenure'] == 0].index, axis=0, inplace=True)
  • For the rest of our missing categorical values, we replace them with the mode of their respective columns.
# Replace the remaining missing values with the mode of their respective columns

columns_to_replace = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup',
                      'DeviceProtection', 'TechSupport', 'StreamingTV',
                      'StreamingMovies']

for column in columns_to_replace:
    data[column].fillna(data[column].mode()[0], inplace=True)

Our data frame looks a little better now.

# Reset index after dropping rows

data.reset_index(drop=True, inplace=True)

data.info()
First 3000 rows after cleaning, with no null values

Clean Second Training Dataset

data2.info()
Data from the last 2000 rows
  • The customerID column gets dropped, and the TotalCharges column gets converted to a numeric data type.
# change Dtype of TotalCharges column from object to numeric

data2['TotalCharges'] = pd.to_numeric(data2['TotalCharges'], errors='coerce')

After applying the transformation, the TotalCharges column has some missing values (3) which we drop just like we did in our first training dataset, since they have a tenure of 0.

Merge Final Training Dataset

We merge our 2 cleaned datasets together.

Data_All = pd.concat([data, data2], ignore_index=True)
Data_All.describe(include=['object']).T
Final train dataset

Exploratory Data Analysis

Let’s explore and visualize the data to gain some insights and answer our analytical questions.

Target Class Distribution

Donut chart of churn distribution
  • The target variable (Churn) in our data is imbalanced, meaning that one class has far more instances than the other. We will have to balance the target classes so that our model does not give biased predictions in favor of the majority class (not churned).
  • The percentage of customers that did not churn (majority class) can be used as a baseline to evaluate the quality of our model. The model should match, or even outperform, this baseline to be considered for future predictions; a quick way to compute it is shown below.
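As a quick check, that baseline can be read straight off the class proportions (a minimal sketch; the exact percentages depend on the merged training data):

# Share of each Churn class; the majority-class share is the naive baseline accuracy
print(Data_All['Churn'].value_counts(normalize=True))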

Analytical Questions

1. How does customer tenure relate to churn rates?

KDE plot of churn by tenure
  • From our hypothesis test, we have already gathered that there is a negative correlation between tenure and churn. The probability density plot shows us that customers with a lower tenure are more likely to churn. As the length of tenure increases, customers who have stayed for longer are less likely to churn.

2. Is there a correlation between the total charges and churn rates?

lmplot of total charges by churn
  • The weak downward slope on the lmplot, which lets us fit a logistic regression line over a scatterplot, suggests a weak negative correlation between total charges and churn rates: the more a customer has spent with the company, the slightly lower their likelihood of churn.

3. What is the impact of contract type on churn rates?

Stacked bar chart of churn by contract duration
  • Customers with short-term (month-to-month) contracts have higher churn rates than customers with long-term (one year or more) contracts.

4. Are there significant differences in churn behavior between customers who have device protection and those who don’t?

Percentage stacked bar plot of churn by device protection
  • Customers without device protection are slightly more likely to churn, although the difference is not large enough to be conclusive.

5. What is the relationship between the availability of tech support and churn rates?

Stacked bar chart of churn by tech support
  • Customers with no tech support were more likely to churn.

6. Do streaming services play a role in customer churn?

Subplot stacked bar chart of churn by streaming services
  • There is no significant relationship between streaming services and churn rates.

7. How does the choice of payment method impact churn rates?

Stacked bar chart of churn by payment method
  • Customers who pay by electronic check were more likely to churn, compared to other payment methods.

Feature Engineering

Feature engineering is an important part of machine learning, where we remove and/or add features in our dataset, and also transform our data into acceptable forms that can be passed through a ML model. A lot of ML models need numerical values as inputs, which means that categorical values have to be converted.

Feature Importance

We have already explored some relationships between our variables and target variable with our EDA. Now we want to identify which categorical features are more informative in relation to the target variable, making them potentially valuable for predicting or understanding the Churn behavior.

In machine learning, feature importance helps us in selecting the most relevant features for training our model. By focusing on the most important features, unnecessary noise and redundancy can be eliminated, leading to a simpler, more efficient and more accurate model. One way to do that is to get Mutual Information Scores.

Mutual Information is a measure of the dependence between two variables, which quantifies how much knowing the value of one variable reduces uncertainty about the other. Pearson’s correlation coefficient only captures linear relationships, whereas mutual information detects both linear and non-linear relationships.

So in essence, a higher score means that the feature is good at predicting the target variable, and a lower score suggests that the variable is not that helpful in predicting the dependent variable. Let’s get the mutual information scores of our categorical features, in relation to our target variable (churn).

# Let's find out feature dependency on the target variable using mutual information scores

from sklearn.metrics import mutual_info_score

x_cat = Data_All.select_dtypes(include=object).drop('Churn', axis=1)
y_cat = Data_All['Churn']

mi_scores = []

# Loop to calculate the mutual information score for each categorical feature
# with respect to the 'Churn' target variable
for column in x_cat.columns:
    mi_score = mutual_info_score(x_cat[column], y_cat)
    mi_scores.append((column, mi_score))

# Sort features by their dependency on the target variable,
# with the most important ones at the top
mi_scores.sort(key=lambda x: x[1], reverse=True)

for feature, score in mi_scores:
    print(f"{feature} - {score}")
List of mutual information scores

This gives us a list of our categorical features and their scores in descending order. Instead of reading a list of numbers, it’s always good to visualize.

Horizontal bar plot of mutual information scores
  • Contract has the highest mutual information score. We saw in our earlier visualizations too that the contract type has a strong relationship with churn probability. InternetService and PaymentMethod also have high scores; the type of internet service and preferred payment method a customer has may be strongly related to churn.
  • The gender, PhoneService and MultipleLines variables have very low mutual information scores. This suggests that they have little to no predictive power or relationship with the target variable. I decided to delete them from the data frame as they may add unnecessary noise that could affect model performance.

Dataset Split

We separate the dependent variable (y) from the independent variables (X), since we are using X to predict y.

X = df_encode.drop(columns=['Churn'])
y = df_encode['Churn']

Label Encoder

The target variable is then transformed to numerical values with label encoder. Label Encoding is a technique used in machine learning to transform categorical data into numerical format, specifically integers. Label encoding assigns integer labels sequentially without considering any order among categories. It is normally used to transform target values (y), not the input (X).

# Encode the target variable (Churn) to have 0 or 1 instead of No or Yes

from sklearn.preprocessing import LabelEncoder

labelEncoder = LabelEncoder()

y = labelEncoder.fit_transform(y)

The result is a single column with integer-encoded values.
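To see the exact mapping the encoder learned, you can pair its classes with their integer codes (a small optional check; with the No/Yes labels used here it should come out as {'No': 0, 'Yes': 1}):

# Inspect the mapping learned by the label encoder
mapping = dict(zip(labelEncoder.classes_, labelEncoder.transform(labelEncoder.classes_)))
print(mapping)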

One-Hot Encoding

When we have 3 or more options for a discrete variable, like we do for some of our categorical columns, we can use the one-hot encoder to transform the categorical data to numerical data. This will create new columns for each option in a variable.

If we use label encoding for columns with 3 or more options, assigning each option an arbitrary integer (e.g. 0, 1, 2), some machine learning models can treat the order of the numbers as if it were significant, and this can cause problems. So we use the one-hot encoder.

Some of my categorical variables only have 2 options, but I still used the one-hot encoder across them for uniformity.

# One-hot encoding for categorical columns

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['SeniorCitizen', 'Partner', 'Dependents',
                       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                       'PaperlessBilling', 'PaymentMethod']

# Create a column transformer instance
transformer = make_column_transformer(
    (OneHotEncoder(sparse=False), categorical_columns)
)

transformed_data = transformer.fit_transform(X[categorical_columns])

# Transforming back to a data frame
transformed_df = pd.DataFrame(transformed_data, columns=transformer.get_feature_names_out())

# One-hot encoding removed the index. Let's put it back
transformed_df.index = X.index

# Joining tables
encoded_df = pd.concat([X, transformed_df], axis=1)

# Dropping the original categorical columns
encoded_df.drop(categorical_columns, axis=1, inplace=True)

This code transforms the categorical columns in the list into numerical values, changes the result back into a data frame, merges that data frame into our original data frame, and deletes the original categorical columns.

Remember this creates a new column for each option. Each column has a value of 1 where that option was true, and 0 where that option was false.

Snippet of the data frame after one-hot encoding
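For intuition, here is what one-hot encoding does to a single column, using a tiny made-up frame (hypothetical data, encoded with pandas.get_dummies, which produces the same kind of indicator columns as the OneHotEncoder used above):

import pandas as pd

# Hypothetical mini example, not the project data
demo = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year']})
print(pd.get_dummies(demo, columns=['Contract']))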

The issue with one-hot encoding is that it increases the dimensionality of the dataset by adding new columns. This can be a problem if you have many unique categories. Increasing the number of features makes your model more complex, which can lead to overfitting, longer training times or poor performance.

One-hot encoding can also introduce multicollinearity among the encoded features, where some features are highly correlated with each other. This can affect the stability and interpretability of the model, particularly in linear models. Always keep this in mind. In our case, we don’t have too many categories, so this should be fine.

Feature Scaling

We also have to do some transformations on our numerical columns. Feature scaling ensures that all numerical features are on a similar scale, also called normalization. This is important because many machine learning algorithms, such as gradient descent-based algorithms or distance-based algorithms like the k-nearest neighbors (KNN) are sensitive to the scale of features. Features with larger scales may dominate those with smaller scales during model training, leading to biased results. It can also lead to improved model performance by reducing the impact of varying feature scales on the model’s predictions.

We will use the StandardScaler, which scales the features so that they have a mean of 0 and a standard deviation of 1. This is also called standardization.

# Standardization for numeric values

from sklearn.preprocessing import StandardScaler

cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

scaler = StandardScaler()

encoded_scaled = scaler.fit_transform(encoded_df[cols])

# Create a new DataFrame with the scaled values
X_scaled_df = pd.DataFrame(encoded_scaled, columns=cols, index=X.index)

# Drop the original unscaled columns
encoded_df.drop(cols, axis=1, inplace=True)

# Concatenate the scaled columns with the original DataFrame
encoded_df = pd.concat([encoded_df, X_scaled_df], axis=1)

This code standardizes our numerical columns, puts them in a new data frame, drops the original unscaled columns, and then concatenates the scaled columns with the original data frame.

Train set Balancing (SMOTE Algorithm)

SMOTE (Synthetic Minority Over-sampling Technique) is a method used to address class imbalance in a binary classification problem.

Earlier we realized that our target variable has a class imbalance. One class (the minority class) has significantly fewer instances than the other class (the majority class). This imbalance can negatively impact the performance of machine learning models, as they might become biased toward the majority class.

SMOTE will aim to balance the class distribution by generating synthetic samples until the minority class has the same number of instances as the majority class. By creating synthetic samples, SMOTE helps the model better capture the patterns in the minority class and prevents it from favoring the majority class due to the imbalance.

# Apply SMOTE to the training data (oversampling)

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42, k_neighbors=5, sampling_strategy='auto')

X_resampled, y_resampled = smote.fit_resample(encoded_df, y)

Let’s draw a bar graph showing the original distribution and the balanced distribution.
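A minimal way to produce that comparison (a sketch assuming matplotlib is available, using the y and y_resampled arrays from the steps above):

import matplotlib.pyplot as plt
import pandas as pd

# Class counts before and after SMOTE, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
pd.Series(y).value_counts().plot(kind='bar', ax=axes[0], title='Before SMOTE')
pd.Series(y_resampled).value_counts().plot(kind='bar', ax=axes[1], title='After SMOTE')
plt.tight_layout()
plt.show()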

Target distribution before and after balancing

Train-Test Split

We can now split our data into training and evaluation sets. By splitting the dataset into separate training and testing subsets, we can train the model on one subset (training set) and evaluate its performance on the other subset (evaluation set). This allows us to assess how well the model generalizes to new, unseen data. The evaluation set will be 20% of the training data. This is common practice in machine learning.

# Split the data into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2,
                                                    random_state=42, stratify=y_resampled)

Modelling

Now for the fun part. We can now train different models and choose the best performing model. It’s important to test different models because different models have different strengths and weaknesses, so testing them on the same data allows you to assess which model provides the most accurate predictions or best generalizes to unseen data.

We will be testing 6 different models:

  • Logistic Regression
  • K-Neighbors Classifier
  • Random Forest Classifier
  • Support Vector Classifier
  • Gradient Boosting Classifier
  • XGBoost Classifier

The steps are the same for each. We declare an instance of the model and then fit the model with our training data. Here is an example, with the Gradient Boosting Classifier:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=42)

# Train the model
gb.fit(X_train, y_train)
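The other five models are set up the same way (a minimal sketch, assuming default hyperparameters and the variable names LR, knn, rfm, svm and xgb used in the evaluation code below):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Instantiate and fit each model on the training data
LR = LogisticRegression(random_state=42).fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)
rfm = RandomForestClassifier(random_state=42).fit(X_train, y_train)
svm = SVC(random_state=42).fit(X_train, y_train)
xgb = XGBClassifier(random_state=42).fit(X_train, y_train)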

We can now assess the performance of the models.

Evaluation

K-Fold Cross-Validation

K-fold cross-validation estimates the performance on our models across multiple subsets of the data (k-folds), providing a comprehensive evaluation of their generalization ability. The model is trained and evaluated k times, with each fold serving as the validation set once. This process helps estimate how well a model will perform on new, unseen data and provides insights into its stability and consistency.

# Create a dataframe with the K-fold cross-validation results

from sklearn.model_selection import KFold, cross_val_score

models = [
    ('Logistic Regression', LR),
    ('K nearest neighbors', knn),
    ('Random Forest', rfm),
    ('SVC', svm),
    ('Gradient Boosting', gb),
    ('XGBoost', xgb)
]

# Number of k-folds
k = 5

results = []

for name, model in models:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)  # Create a KFold object
    scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')

    # Append results to the list
    results.append((name, scores.mean(), scores.std()))

results_df = pd.DataFrame(results, columns=['Model', 'Mean Accuracy', 'Std Deviation'])

results_df.sort_values(by='Mean Accuracy', ascending=False)
K-fold cross-validation scores
  • Average Accuracy is the mean across all k folds during the cross-validation process. Higher mean accuracy values indicate better predictive performance.
  • Standard Deviation measures the variability or spread of accuracy values across the k folds. A lower standard deviation suggests that the model’s performance is consistent across different subsets of the data (folds), while a higher standard deviation indicates that the model’s performance varies more widely. Smaller standard deviations are generally desirable because they indicate a more stable model.

The Random Forest model has the highest mean accuracy (about 0.846, or 85%) among the evaluated models. This means that, on average, the model correctly predicted the target variable for about 85% of the data points in each fold. It performs well on average across different folds, and it has a relatively low standard deviation (0.007953), indicating consistent performance.

Classification Report

A classification report is a summary of the performance of a classification model, providing various evaluation metrics for each class in the dataset. It is commonly used in assessing classification models, hence the name.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model_names = ['Logistic Regression', 'k-NN', 'Random Forest', 'SVM', 'Gradient Boosting', 'XGBoost']
models = [LR, knn, rfm, svm, gb, xgb]  # our trained models

model_names_list = []
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

# Loop through each model to calculate metrics and store information
for name, model in zip(model_names, models):
    # Make predictions on the test data
    y_pred = model.predict(X_test)

    # Calculate accuracy, precision, recall, and F1-score
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store the model name and metrics
    model_names_list.append(name)
    accuracy_scores.append(accuracy)
    precision_scores.append(precision)
    recall_scores.append(recall)
    f1_scores.append(f1)

# Create a DataFrame with the calculated metrics
metrics_df = pd.DataFrame({
    'Model': model_names_list,
    'Accuracy': accuracy_scores,
    'Precision': precision_scores,
    'Recall': recall_scores,
    'F1-Score': f1_scores
})

# Display the DataFrame
metrics_df.sort_values(by='Accuracy', ascending=False)
Classification report for the six models
  • Accuracy: Accuracy measures the proportion of correctly predicted instances out of the total instances in the dataset. It is an important metric for classification tasks. Our XGBoost model has the highest accuracy (0.852), which means that around 85% of its predictions were correct. Since our data is balanced, we will use the accuracy score as our main evaluation metric.

With balanced data, there is less risk of a classification model being biased towards one class due to class imbalance. As a result, accuracy effectively captures the model’s ability to correctly classify instances from both classes equally. When the classes are balanced, a high accuracy score indicates that the model is performing well across both classes and is making accurate predictions overall.

  • Precision: Precision quantifies how many of the positive predictions made by the model were actually correct. It’s the ratio of true positives (correctly predicted positives) to the total number of instances predicted as positive. A higher precision indicates fewer false positives. In the XGBoost model, the precision is approximately 0.8487, meaning that about 85% of the positive predictions made by the model were accurate.
  • Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive instances that were correctly predicted by the model. It’s the ratio of true positives to the total number of actual positives. A higher recall indicates fewer false negatives. In the XGBoost model, the recall is approximately 0.8567, indicating that about 86% of the actual positive instances were correctly identified by the model.
  • F1-Score: The F1-Score is the harmonic mean of precision and recall. It combines both precision and recall into a single metric. The F1-Score gives us a balanced measure of a model’s performance, considering both false positives and false negatives. In the XGBoost model, the F1-Score is approximately 0.8527, which takes into account both precision and recall.
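As a side note, scikit-learn also provides a built-in classification_report that prints a per-class version of these metrics in one call (a quick sketch, shown here for the random forest model rfm; the target_names assume the No/Yes labels were encoded as 0/1):

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for one model
print(classification_report(y_test, rfm.predict(X_test), target_names=['No Churn', 'Churn']))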

Our top 3 models (Random Forest, XGBoost and Gradient Boosting) are all tree based models, specifically ensemble learning techniques that combine multiple individual trees to improve overall performance and robustness. They reduce overfitting by averaging or boosting the individual trees’ predictions. They offer a combination of powerful features that make them robust, accurate, and versatile for classification tasks across a wide range of domains and data characteristics.

Hyperparameter tuning

Machine learning models are parameterized such that there has to be a search for the combination of parameters that will result in the optimal performance of the model. The parameters that define the model architecture are referred to as hyperparameters while the process of exploring a range of values is called hyperparameter tuning. It is important to note the distinction between model parameters and hyperparameters. Unlike hyperparameters, model parameters are learnt during the training phase while setting hyperparameters is exclusive of the training process. Ideally, when hyperparameter tuning is completed, the result is the best parameters for the model. Grid search and random search are two common strategies for tuning hyperparameters.

Currently, our models have been trained with the default parameters that they come with. These parameters control aspects of the model’s learning process and can significantly impact its performance. Before we choose a final model, we have to tune the hyperparameters of our best-performing models to see if we can improve their accuracy, and then choose the best.

In our example, I will fine-tune our top 3 models using the RandomizedSearchCV method from the sklearn.model_selection module to find the best hyperparameters and achieve the maximum performance of each, then compare them again to select the best one.

Randomized Search CV

RandomizedSearchCV is a technique used for hyperparameter tuning in machine learning models. It is a variant of the more traditional GridSearchCV method, which exhaustively searches through a specified subset of hyperparameter combinations. RandomizedSearchCV, on the other hand, performs a randomized search over a specified hyperparameter space.

The main reason to use a randomized search over a grid search is because RandomizedSearchCV is computationally more efficient (takes less time and resources) compared to GridSearchCV, especially when the hyperparameter space is large. By randomly sampling a subset of configurations, it can explore a wider range of hyperparameter values with fewer total evaluations.

After tuning the top 3 models, the best model was the random forest classifier. I will only be showing the tuning process for this model.

First it’s a good idea to look at the current parameters of the model.

# Check current model parameters

current_params = rfm.get_params()
current_params
Default model parameters (Random Forest Classifier)

These are the current parameters for our model. Every model has different parameters. At first glance it looks like complex scientific gibberish, and you would be right. It does take some experience and study to understand the meaning and effects of the different hyperparameters.

For our random search, we will pick a grid of hyperparameters, and random combinations of them will be sampled.

# Define the parameter distributions for hyperparameter tuning

from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators': [20, 50, 100, 200, 300],
    'max_depth': [None, 10, 15, 20, 25],
    'min_samples_split': [2, 3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'class_weight': ['balanced', None],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['gini', 'entropy']
}

# Initialize RandomizedSearchCV with the RandomForestClassifier model and parameter distributions
random_search_rf = RandomizedSearchCV(estimator=rfm, param_distributions=param_grid,
                                      scoring='accuracy', n_iter=150, random_state=42,
                                      cv=5, n_jobs=-1, verbose=1)

# Fit the search on the training data
random_search_rf.fit(X_train, y_train)

# Best parameters found by the search
best_params = random_search_rf.best_params_

best_params

This code tunes the hyperparameters of our model defined in param_grid using RandomizedSearchCV to find the combination that maximizes accuracy on our dataset. The hyperparameters for the random forest classifier are defined below:

  • n_estimators: Number of trees in the forest.
  • max_depth: Maximum depth of the trees.
  • min_samples_split: Minimum number of samples required to split an internal node.
  • min_samples_leaf: Minimum number of samples required to be at a leaf node.
  • class_weight: Weights associated with classes in the form of a dictionary or "balanced" to automatically adjust weights inversely proportional to class frequencies.
  • max_features: Number of features to consider when looking for the best split.
  • criterion: Function to measure the quality of a split ('gini' or 'entropy').

We also get an output of the best parameters.

RandomizedSearchCV best parameters (Random Forest Classifier)

We can now fit a model with the best parameters on our training data, make predictions on our test data, and compare the classification report of the tuned model with that of the original model.
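A minimal sketch of that step, assuming the variable names tuned_rf_model and random_search_rf_pred used in the confusion matrix and model-saving code later on:

from sklearn.metrics import classification_report

# The best estimator found by the randomized search (refit on the full training set by default)
tuned_rf_model = random_search_rf.best_estimator_

# Predictions from the tuned model on the held-out test set
random_search_rf_pred = tuned_rf_model.predict(X_test)

print(classification_report(y_test, random_search_rf_pred))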

Classification report before and after hyperparameter tuning

Our model has improved on every metric, and accuracy has risen slightly to 86%. This is fair enough. Let’s look at a confusion matrix to examine its performance a little closer.

Confusion Matrix

It is an N x N matrix that gives a summary of the correct and incorrect predicted classification results for the N target classes. The values on the diagonal of the matrix represent the number of correctly predicted classes, while every other cell in the matrix indicates misclassifications. This means that the more predicted values fall on the diagonal, the better the model. True positive, false positive, true negative and false negative are terms used when interpreting a confusion matrix.

# Construct the confusion matrix for the tuned model

from sklearn.metrics import confusion_matrix

confusion_matrix_rf = confusion_matrix(y_test, random_search_rf_pred)

plt.figure(figsize=(6, 6))
sns.heatmap(confusion_matrix_rf, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Churn', 'No Churn'],
            yticklabels=['Churn', 'No Churn'])
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
Tuned model confusion matrix
  • The top-left value (630) represents the number of instances that were correctly classified as positive (often referred to as True Positives or TP).
  • The top-right value (110) represents the number of instances predicted as positive while the actual class is negative (False Positives or FP). This is also called a Type I error.
  • The bottom-left value (97) represents the number of instances where the predicted value is negative and the actual value is positive (False Negatives or FN). This is also known as a Type II error.
  • The bottom-right value (643) represents the number of instances that were correctly classified as belonging to the negative class (True Negatives or TN). A quick sanity check using these four counts is shown below.
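As a quick arithmetic check, the headline metrics can be recomputed directly from these four counts (treating the top-left cell as the positive class, as in the bullets above):

# Counts read off the confusion matrix above
tp, fp, fn, tn = 630, 110, 97, 643

accuracy = (tp + tn) / (tp + fp + fn + tn)            # ~0.86
precision = tp / (tp + fp)                            # ~0.85
recall = tp / (tp + fn)                               # ~0.87
f1 = 2 * precision * recall / (precision + recall)    # ~0.86

print(accuracy, precision, recall, f1)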

We can now save our tuned model and use it to make predictions on unseen data.

import os
import joblib

destination = "Toolkit"

# Create the directory if it doesn't exist
if not os.path.exists(destination):
    os.makedirs(destination)

# Create a dictionary to store the objects and their filenames
models = {
    "model": tuned_rf_model,
}

# Loop through the models and save each one using joblib.dump()
for name, model in models.items():
    file_path = os.path.join(destination, f"{name}.joblib")
    joblib.dump(model, file_path)
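To reuse the saved model later, it can be loaded back with joblib (a short usage sketch, assuming the same Toolkit folder and file name used above):

# Load the saved model and use it on new, preprocessed data (illustrated here with X_test)
loaded_model = joblib.load(os.path.join("Toolkit", "model.joblib"))
predictions = loaded_model.predict(X_test)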

And that’s it, we have successfully trained, tuned and saved a classification model 🙌

Appreciation

I highly recommend Azubi Africa for their comprehensive and effective programs. Read more articles about Azubi Africa here and take a few minutes to visit this link to learn more about Azubi Africa’s life-changing programs.

Thank you for reading!

You can access all the code for this project in my github repository. Show your appreciation if you found this insightful. Leave me a comment if you have any points or suggestions.

Connect with me on LinkedIn.
