Building a ML Classification Model with Python — Telco Customer Churn Prediction
Imagine if a business could predict whether customers are likely to leave or unsubscribe from its services, and do what it can to keep them. It’s like a superpower, or being a fortune teller. You know they say it’s cheaper and more efficient to keep an existing customer than to obtain a new one. In the long run, acquiring a new customer can cost five times more than retaining an existing customer.
Customer churn is the loss of customers who were previously subscribed or making regular purchases and who discontinue their association with your company within a specified timeframe. The churn rate says a lot about a business and its ability to retain customers, as well as being a predictor of future success.
In this project, we use Supervised Machine Learning (classification) to explore the significance of churn analytics. Supervised machine learning is a type of machine learning where the algorithm learns from labeled data. Classification is a type of supervised machine learning used to predict the category or class label of a new data instance based on its input features. In classification, the target variable is categorical, meaning it takes on a finite number of discrete values or classes. Examples of classification tasks include spam detection (classifying emails as spam or not spam), sentiment analysis (classifying text as positive, negative, or neutral), and image recognition (classifying images into different categories).
The project followed the Cross Industry Standard Process for Data Mining (CRISP-DM) framework, which serves as the base for a data science process.
Table of Contents
· Business Understanding
∘ Primary Objective
∘ Hypothesis
∘ Analytical Questions
· Data Understanding
∘ Issues with data
∘ Hypothesis Testing
· Data Cleaning
∘ Clean First Training Dataset
∘ Clean Second Training Dataset
∘ Merge Final Training Dataset
· Exploratory Data Analysis
∘ Target Class Distribution
∘ Analytical Questions
· Feature Engineering
∘ Feature Importance
∘ Dataset Split
∘ Label Encoder
∘ One-Hot Encoding
∘ Feature Scaling
∘ Train set Balancing (SMOTE Algorithm)
∘ Train-Test Split
· Modelling
· Evaluation
∘ K-Fold Cross-Validation
∘ Classification Report
∘ Hyperparameter tuning
∘ Randomized Search CV
∘ Confusion Matrix
· References
· Appreciation
Business Understanding
The churn analytics predictive model is a data-driven solution designed to address the persistent challenge of customer churn in subscription-based industries. It can be used as a strategic tool for telecommunication companies to proactively identify potential risk factors for churn, optimize retention efforts, and cultivate lasting customer relationships.
Up to 65% of a company’s business comes from existing customers. It’s no secret that repeat clients and customers drive a huge chunk of business revenue. According to this Harvard Business School report, on average, a 5% increase in customer retention rates results in a 25% to 95% increase in profits. Customers have a lot of options and evolving preferences; therefore, accurately predicting churn has become important to retain and satisfy them.
This classification model aims to identify customers at risk of churn, enabling businesses to take proactive measures and implement targeted retention strategies.
Primary Objective
The objective is to create a classification model that evaluates customer data and provides insights into customer behavior, preferences, and patterns to predict the likelihood of customer churn. Model performance is assessed through metrics like accuracy, precision, recall and F1-score.
The business objective is ultimately to reduce customer churn rates and retain valuable customers. Armed with the model’s insights, businesses can create personalized offers, marketing campaigns, and proactive customer support initiatives, thus improving customer satisfaction and fostering loyalty.
Hypothesis
- Null Hypothesis (H0): There is no significant relationship between customer tenure and churn rate in the telecom company.
- Alternative Hypothesis (Ha): There is a significant relationship between customer tenure and churn rate in the telecom company.
Analytical Questions
- How does customer tenure relate to churn rates? Are long-tenured customers more likely to stay with the company, and do new customers exhibit higher churn behavior?
- Is there a correlation between the total charges and churn rates? Do customers with higher total charges exhibit different churn behavior compared to those with lower total charges?
- What is the impact of contract type on churn rates? Do customers on long-term contracts have significantly lower churn rates compared to those on short-term contracts?
- Are there significant differences in churn behavior between customers who have device protection and those who don’t?
- What is the relationship between the availability of tech support and churn rates? Are customers with access to tech support more likely to remain with the company?
- Do streaming services play a role in customer churn? Are customers with streaming services, such as StreamingTV and StreamingMovies, more likely to stay with the company?
- How does the choice of payment method impact churn rates? Are customers with specific payment methods more prone to churn than others?
Data Understanding
The first step is to collect and analyze our data for the project, explore it, and verify its quality and relevance. We install the necessary libraries/packages for the project and source our data from multiple sources, including a remote Microsoft SQL server. You can view the Jupyter notebook for the project to get more detailed insights here.
Our data consists of 3 datasets, two for the train and one for the test:
data = LP2_Telco_Churn_first_3000
data_xls = Telco-churn-second-2000
data_2 = LP2_telco-churn-last-2000 (our test data, without the target variable)
# View columns and rows of datasets
(data.shape,data_xls.shape,data2.shape)
After using the pandas.read_csv function to load our data, here is a description of our datasets:
Issues with data
- Train datasets need to be merged; however, they have inconsistent data types that need to be synchronized
- We need to remove the customerID column from the training data as it is not necessary for our analysis
- Missing values need to be imputed or deleted
- The TotalCharges column needs to be changed from object dtype to numeric dtype
Hypothesis Testing
We want to see if there is a relationship/correlation between tenure and churn. Since churn is a binary outcome (Yes or No), we want to use point-biserial correlation and assess its significance using a correlation test.
from scipy import stats

# 'Churn' should be a binary variable (0 for no churn, 1 for churn)
df_hypo = Data_All.copy()
df_hypo['Churn'] = df_hypo['Churn'].replace({'No': 0, 'Yes': 1})
# Calculate point-biserial correlation coefficient
correlation_coefficient, p_value = stats.pointbiserialr(df_hypo['Churn'], df_hypo['tenure'])
# Print the results
print(f"Point-Biserial Correlation Coefficient: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")
# Set the significance level (5%)
alpha = 0.05
# Determine if the correlation is statistically significant
if p_value < alpha:
    print("There is a significant correlation between customer tenure and churn rate.")
else:
    print("There is no significant correlation between customer tenure and churn rate.")
The correlation coefficient of -0.3526 suggests a moderate negative correlation. This means that as customer tenure increases, the churn rate tends to decrease. Customers who have been with the company for a longer period are less likely to churn. This actually makes sense if you think about it. Longer tenure means customer loyalty and retention.
The p-value is less than the significance level, which gives us strong statistical evidence to reject the null hypothesis (no correlation) in favor of the alternative hypothesis (significant correlation).
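To show what the point-biserial coefficient actually measures, here is a minimal sketch in plain Python. It is not the project's code: the toy churn and tenure values are made up for illustration, and the helper names are hypothetical. It computes the coefficient from the two group means and checks that it matches an ordinary Pearson correlation on the same data.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def point_biserial(binary, values):
    """r_pb = (M1 - M0) / s * sqrt(p * q), with s the population std of values."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    p = len(group1) / n          # share of the positive class
    q = 1 - p                    # share of the negative class
    mean_all = sum(values) / n
    s = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    return (m1 - m0) / s * math.sqrt(p * q)

# Made-up data: churned customers (1) have short tenures
toy_churn = [1, 1, 1, 0, 0, 0, 0, 0]
toy_tenure = [2, 5, 9, 12, 24, 36, 48, 60]

r_pb = point_biserial(toy_churn, toy_tenure)
r = pearson_r(toy_churn, toy_tenure)
print(f"point-biserial: {r_pb:.4f}, Pearson: {r:.4f}")
```

The two numbers agree, and the sign is negative because the churned group has the lower mean tenure, which is the same pattern we saw in the real data.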
Data Cleaning
This step includes selecting, cleaning and formatting our data. It is often said to take up as much as 80% of the effort in a machine learning project. Keep in mind that data cleaning is an iterative process, as it involves multiple rounds of processing throughout a machine learning project.
Clean First Training Dataset
data.info()
- We remove the customerID column. It serves no purpose in explaining whether or not the customer will churn; it is only a unique identifier.
del data["customerID"]
- We convert the bool dtypes to object dtypes for easier analysis.
# convert bool columns to object Dtypes
bool_to_obj = {'Partner': object,
               'SeniorCitizen': object,
               'Dependents': object,
               'PhoneService': object,
               'PaperlessBilling': object}
data = data.astype(bool_to_obj)
- We then format our object column values and change them (from True/False to Yes/No) for uniformity and easier legibility.
# Replace True and False values with Yes and No in Object columns
replace_values = {True: 'Yes', False: 'No'}
# Use replace() method
data['Partner'] = data['Partner'].replace(replace_values)
data['SeniorCitizen'] = data['SeniorCitizen'].replace(replace_values)
data['Dependents'] = data['Dependents'].replace(replace_values)
data['PhoneService'] = data['PhoneService'].replace(replace_values)
data['MultipleLines'] = data['MultipleLines'].replace(replace_values)
data['OnlineSecurity'] = data['OnlineSecurity'].replace(replace_values)
data['OnlineBackup'] = data['OnlineBackup'].replace(replace_values)
data['DeviceProtection'] = data['DeviceProtection'].replace(replace_values)
data['TechSupport'] = data['TechSupport'].replace(replace_values)
data['StreamingTV'] = data['StreamingTV'].replace(replace_values)
data['StreamingMovies'] = data['StreamingMovies'].replace(replace_values)
data['PaperlessBilling'] = data['PaperlessBilling'].replace(replace_values)
data['Churn'] = data['Churn'].replace(replace_values)
- We have one missing value in our target variable Churn, which we drop. The missing values in TotalCharges have a tenure of 0, even though there are values in the MonthlyCharges column. A customer with a tenure of 0 and 0 total charges should not have any monthly charges. We can simply drop those rows (5) to avoid confusion.
# Drop rows where tenure = 0
data.drop(labels=data[data['tenure'] == 0].index, axis=0, inplace=True)
- For the rest of our missing categorical values, we replace them with the mode of their respective columns.
# replace the remaining missing values with the mode of their respective columns
columns_to_replace = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup',
                      'DeviceProtection', 'TechSupport', 'StreamingTV',
                      'StreamingMovies']
for column in columns_to_replace:
    # assign the result back instead of using inplace=True on a column selection,
    # which newer pandas versions warn about and may not apply
    data[column] = data[column].fillna(data[column].mode()[0])
Our data frame looks a little better now.
# Reset index after dropping rows
data.reset_index(drop=True, inplace=True)
data.info()
Clean Second Training Dataset
data2.info()
- The customerID column gets dropped, and the TotalCharges column gets converted to numeric data type.
# change Dtype of TotalCharges column from object to numeric
data2['TotalCharges'] = pd.to_numeric(data2['TotalCharges'], errors='coerce')
After applying the transformation, the TotalCharges column has some missing values (3), which we drop just like we did in our first training dataset, since they have a tenure of 0.
Merge Final Training Dataset
We merge our 2 cleaned datasets together.
Data_All = pd.concat([data, data2], ignore_index=True)
Data_All.describe(include=['object']).T
Exploratory Data Analysis
Let’s explore and visualize the data to gain some insights and answer our analytical questions.
Target Class Distribution
- The target variable (Churn) in our data is imbalanced. This means that one class has many more instances than the other. We will have to balance the target classes so that our model does not give biased predictions in favor of the majority class (not churned).
- The percentage of customers that did not churn (majority class) can be used as a baseline to evaluate the quality of our model. The model should match, or preferably outperform, this baseline to be considered for future predictions.
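To make the baseline idea concrete, here is a minimal sketch in plain Python. The label counts are made up for illustration and are not the project's actual class distribution, and the function name is hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical labels: 7 customers stayed, 3 churned (illustrative only)
toy_labels = ["No"] * 7 + ["Yes"] * 3
baseline_class, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_class, baseline_accuracy)
```

A model that never predicts churn would already reach this accuracy on the toy data, so any useful classifier has to beat it.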
Analytical Questions
1. How does customer tenure relate to churn rates?
- From our hypothesis test, we have already gathered that there is a negative correlation between tenure and churn. The probability density plot shows us that customers with a lower tenure are more likely to churn. As the length of tenure increases, customers who have stayed for longer are less likely to churn.
2. Is there a correlation between the total charges and churn rates?
- The weak downward slope on the lmplot, which helps us plot a logistic regression line on a scatterplot, suggests that there is a weak negative correlation between total charges and churn rates. The more a customer has spent with the company, the slightly less likely they are to churn.
3. What is the impact of contract type on churn rates?
- Customers with short-term (month-to-month) contracts have higher churn rates than customers with long-term (one year or longer) contracts.
4. Are there significant differences in churn behavior between customers who have device protection and those who don’t?
- Customers without device protection are slightly more likely to churn, although the difference is not clear or large enough to be conclusive.
5. What is the relationship between the availability of tech support and churn rates?
- Customers with no tech support were more likely to churn.
6. Do streaming services play a role in customer churn?
- No significant relationship between streaming services and churn rates.
7. How does the choice of payment method impact churn rates?
- Customers who pay by electronic check were more likely to churn, compared to other payment methods.
Feature Engineering
Feature engineering is an important part of machine learning, where we remove and/or add features in our dataset, and also transform our data into acceptable forms that can be passed through a ML model. A lot of ML models need numerical values as inputs, which means that categorical values have to be converted.
Feature Importance
We have already explored some relationships between our variables and target variable with our EDA. Now we want to identify which categorical features are more informative in relation to the target variable, making them potentially valuable for predicting or understanding the Churn behavior.
In machine learning, feature importance helps us in selecting the most relevant features for training our model. By focusing on the most important features, unnecessary noise and redundancy can be eliminated, leading to a simpler, more efficient and more accurate model. One way to do that is to get Mutual Information Scores.
Mutual Information is a measure of the dependence between two variables, which quantifies how much knowing the value of one variable reduces uncertainty about the other variable. Pearson’s correlation coefficient only captures linear relationships, whereas mutual information detects both linear and non-linear relationships.
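As a rough illustration of what this score computes, here is a minimal sketch in plain Python that works out mutual information (in nats, using the natural logarithm) from the joint and marginal frequencies of two label lists. The toy columns are made up, and the helper is hypothetical rather than a library function.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), joint in count_xy.items():
        p_xy = joint / n            # joint probability of the pair
        p_x = count_x[x] / n        # marginal probability of x
        p_y = count_y[y] / n        # marginal probability of y
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# Made-up columns: contract perfectly separates churn, gender tells us nothing
toy_contract = ["Monthly", "Monthly", "Yearly", "Yearly"]
toy_churn = ["Yes", "Yes", "No", "No"]
toy_gender = ["F", "M", "F", "M"]

mi_informative = mutual_information(toy_contract, toy_churn)
mi_uninformative = mutual_information(toy_gender, toy_churn)
print(mi_informative, mi_uninformative)
```

In this toy case the feature that fully determines churn scores ln(2), while the feature that is independent of churn scores 0.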
So in essence, a higher score means that the feature is good at predicting the target variable, and a lower score suggests that the variable is not that helpful in predicting the dependent variable. Let’s get the mutual information scores of our categorical features, in relation to our target variable (churn).
from sklearn.metrics import mutual_info_score

# Let's find out feature dependency on the target variable using mutual information score
x_cat = Data_All.select_dtypes(include=object).drop('Churn', axis=1)
y_cat = Data_All['Churn']
mi_scores = []
# loop to calculate the Mutual Information Score for each categorical feature
# with respect to the 'Churn' target variable
for column in x_cat.columns:
    mi_score = mutual_info_score(x_cat[column], y_cat)
    mi_scores.append((column, mi_score))
# sort features by their dependency on the target variable,
# with the most important ones at the top.
mi_scores.sort(key=lambda x: x[1], reverse=True)
for feature, score in mi_scores:
    print(f"{feature} - {score}")
This gives us a list of our categorical features and their scores in descending order. Instead of reading a list of numbers, it’s always good to visualize.
- Contract has the highest mutual information score. We saw in our earlier visualizations too that the contract type has a strong relationship with churn probability.
- InternetService and PaymentMethod also have high mutual information scores. The type of internet service and preferred payment method a customer has may be strongly related to churn.
- The gender, PhoneService and MultipleLines variables have very low mutual information scores. This suggests that they have little to no predictive power or relationship with the target variable. I decided to delete them from the data frame as they may add unnecessary noise which may affect the model performance.
Dataset Split
We separate the dependent variable (y) from the independent variables (X), since we are using X to predict y.
X = df_encode.drop(columns=['Churn'])
y = df_encode['Churn']
Label Encoder
The target variable is then transformed to numerical values with label encoder. Label Encoding is a technique used in machine learning to transform categorical data into numerical format, specifically integers. Label encoding assigns integer labels sequentially without considering any order among categories. It is normally used to transform target values (y), not the input (X).
# Encode the target variable (Churn) to have 0 or 1 instead of No or Yes
labelEncoder = LabelEncoder()
y = labelEncoder.fit_transform(y)
The result is a single column with integer-encoded values.
One-Hot Encoding
When we have 3 or more options for a discrete variable, like we do for some of our categorical columns, we can use the one-hot encoder to transform the categorical data to numerical data. This will create new columns for each option in a variable.
If we use label encoding for columns with 3 or more options to assign each option with a random number (e.g. 0,1,2), some machine learning models can treat the order of the numbers as if they are significant and this can cause problems. So we use the one-hot encoder.
Some of my categorical variables only have 2 options, but I still used the one-hot encoder across them for uniformity.
# One hot encoding for categorical columns
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['SeniorCitizen', 'Partner', 'Dependents',
                       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                       'PaperlessBilling', 'PaymentMethod']
# Create column transformer instance
# (sparse_output replaces the older sparse argument in scikit-learn 1.2+;
# on older versions use sparse=False)
transformer = make_column_transformer(
    (OneHotEncoder(sparse_output=False), categorical_columns)
)
transformed_data = transformer.fit_transform(X[categorical_columns])
# Transforming back to a dataframe
transformed_df = pd.DataFrame(transformed_data, columns=transformer.get_feature_names_out())
# One-hot encoding removed the index. Let's put it back
transformed_df.index = X.index
# Joining tables
encoded_df = pd.concat([X, transformed_df], axis=1)
# Dropping original categorical columns
encoded_df.drop(categorical_columns, axis=1, inplace=True)
This code transforms the categorical columns in the list into numerical values, changes the result back into a data frame, merges this data frame into our original data frame, and deletes the original categories.
Remember this creates a new column for each option. Each column has a value of 1 where that option was true, and 0 where that option was false.
The issue with one-hot encoding is how it increases the dimensionality of the dataset by adding new columns. This can be a problem if you have many unique categories. Increasing the number of features can make your model more complex, which may lead to overfitting, longer training times or poor performance.
One-hot encoding can also introduce multicollinearity among the encoded features, where some features are highly correlated with each other. This can affect the stability and interpretability of the model, particularly in linear models. Always keep this in mind. In our case, we don’t have too many categories, so this should be fine.
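To see what the encoder produces for a single column, here is a minimal sketch in plain Python. The helper function is hypothetical and the toy values are made up; it is only meant to show the shape of the output, not to replace the scikit-learn encoder used above.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows

# Made-up contract values for four customers
toy_contract = ["Month-to-month", "Two year", "One year", "Month-to-month"]
columns, rows = one_hot_encode(toy_contract, prefix="Contract")

print(columns)
for row in rows:
    print(row)
```

Notice that every row sums to 1. This is where the multicollinearity mentioned above comes from: once you know all but one of the new columns, the last one is fully determined.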
Feature Scaling
We also have to do some transformations on our numerical columns. Feature scaling ensures that all numerical features are on a similar scale, also called normalization. This is important because many machine learning algorithms, such as gradient descent-based algorithms or distance-based algorithms like the k-nearest neighbors (KNN) are sensitive to the scale of features. Features with larger scales may dominate those with smaller scales during model training, leading to biased results. It can also lead to improved model performance by reducing the impact of varying feature scales on the model’s predictions.
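We will use the StandardScaler, which scales the features such that they have a mean of 0 and a standard deviation of 1. This is also called standardization. Keep in mind that the mean and standard deviation are themselves affected by extreme values, so standardization puts features on a comparable scale but does not remove the influence of outliers.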
# standardization for numeric values
cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
scaler = StandardScaler()
encoded_scaled = scaler.fit_transform(encoded_df[cols])
# Create new DataFrames with the scaled values
X_scaled_df = pd.DataFrame(encoded_scaled, columns=cols, index=X.index)
# Drop the original unscaled columns
encoded_df.drop(cols, axis=1, inplace=True)
# Concatenate the scaled columns with the original DataFrames
encoded_df = pd.concat([encoded_df, X_scaled_df], axis=1)
This code standardizes our numerical columns, puts them in a new data frame, drops the original unscaled columns, and then concatenates the scaled columns with the original data frame.
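The arithmetic behind standardization is simple enough to sketch in plain Python. The values below are made up, and the helper is hypothetical; it uses the population standard deviation, which is my understanding of what StandardScaler does by default.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up monthly charges for four customers
toy_monthly_charges = [20.0, 50.0, 70.0, 100.0]
scaled = standardize(toy_monthly_charges)

print([round(v, 3) for v in scaled])
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
```

Each value becomes its distance from the mean, measured in standard deviations, so columns with very different units end up on a comparable scale.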
Train set Balancing (SMOTE Algorithm)
SMOTE (Synthetic Minority Over-sampling Technique) is a method used to address class imbalance in a binary classification problem.
Earlier we realized that our target variable has a class imbalance. One class (the minority class) has significantly fewer instances than the other class (the majority class). This imbalance can negatively impact the performance of machine learning models, as they might become biased toward the majority class.
SMOTE will aim to balance the class distribution by generating synthetic samples until the minority class has the same number of instances as the majority class. By creating synthetic samples, SMOTE helps the model better capture the patterns in the minority class and prevents it from favoring the majority class due to the imbalance.
# apply SMOTE to the training data (oversampling)
smote = SMOTE(random_state=42, k_neighbors=5, sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(encoded_df, y)
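To give a feel for how the synthetic samples described above are generated, here is a simplified sketch in plain Python. It is not the imbalanced-learn implementation: the function name, the toy points and the majority class size are all assumptions for illustration. Each new point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (nb - b)
                               for b, nb in zip(base, neighbour)))
    return synthetic

# Made-up minority class with two features, and a hypothetical majority size
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
majority_size = 8

new_points = smote_like_oversample(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), "minority samples after oversampling")
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.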
Let’s draw a bar graph showing the original distribution and the balanced distribution.
Train-Test Split
We can now split our data into training and evaluation sets. By splitting the dataset into separate training and testing subsets, we can train the model on one subset (training set) and evaluate its performance on the other subset (evaluation set). This allows us to assess how well the model generalizes to new, unseen data. The evaluation set will be 20% of the training data. This is common practice in machine learning.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)
Modelling
Now for the fun part. We can now train different models and choose the best performing model. It’s important to test different models because different models have different strengths and weaknesses, so testing them on the same data allows you to assess which model provides the most accurate predictions or best generalizes to unseen data.
We will be testing 6 different models:
- Logistic Regression
- K-Neighbors Classifier
- Random Forest Classifier
- Support Vector Classifier
- Gradient Boosting Classifier
- XGBoost Classifier
The steps are the same for each. We declare an instance of the model and then fit the model with our training data. Here is an example, with the Gradient Boosting Classifier:
gb = GradientBoostingClassifier(random_state=42)
# Train the model
gb.fit(X_train, y_train)
We can now assess the performance of the models.
Evaluation
K-Fold Cross-Validation
K-fold cross-validation estimates the performance of our models across multiple subsets of the data (k folds), providing a comprehensive evaluation of their generalization ability. The model is trained and evaluated k times, with each fold serving as the validation set once. This process helps estimate how well a model will perform on new, unseen data and provides insights into its stability and consistency.
from sklearn.model_selection import KFold, cross_val_score

# Create a dataframe with the K-fold Cross-Validation results
models = [
    ('Logistic Regression', LR),
    ('K nearest neighbors', knn),
    ('Random Forest', rfm),
    ('SVC', svm),
    ('Gradient Boosting', gb),
    ('XGBoost', xgb)
]
# number of k-folds
k = 5
results = []
for name, model in models:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)  # Create a KFold object
    scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
    # Append results to the list
    results.append((name, scores.mean(), scores.std()))
results_df = pd.DataFrame(results, columns=['Model', 'Mean Accuracy', 'Std Deviation'])
results_df.sort_values(by='Mean Accuracy', ascending=False)
- Average Accuracy is the mean across all k folds during the cross-validation process. Higher mean accuracy values indicate better predictive performance.
- Standard Deviation measures the variability or spread of accuracy values across the k folds. A lower standard deviation suggests that the model’s performance is consistent across different subsets of the data (folds), while a higher standard deviation indicates that the model’s performance varies more widely. Smaller standard deviations are generally desirable because they indicate a more stable model.
The Random Forest model has the highest mean accuracy (0.846180, about 85%) among the evaluated models. This means that, on average, the model correctly predicted the target variable for about 85% of the data points in each fold. It performs well on average across different folds, and it has a relatively low standard deviation (0.007953), indicating consistent performance.
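To make the fold mechanics from the start of this section concrete, here is a minimal sketch in plain Python that only produces the index splits. The helper is hypothetical, and it assigns samples to folds by striding through a shuffled list, which is a simplification and not necessarily how scikit-learn's KFold slices the data.

```python
import random

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # every k-th shuffled index goes into the same fold
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))
```

Each sample ends up in the validation set exactly once, and the model never sees its validation rows while training on that fold.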
Classification Report
A classification report is a summary of the performance of a classification model, providing various evaluation metrics for each class in the dataset. It is commonly used in assessing classification models, hence the name.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model_names = ['Logistic Regression', 'k-NN', 'Random Forest', 'SVM', 'Gradient Boosting', 'XGBoost']
models = [LR, knn, rfm, svm, gb, xgb]  # our trained models
model_names_list = []
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
# Loop through each model to calculate metrics and store information
for name, model in zip(model_names, models):
    # Make predictions on the test data
    y_pred = model.predict(X_test)
    # Calculate accuracy, precision, recall, and F1-score
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # Store model name and metrics
    model_names_list.append(name)
    accuracy_scores.append(accuracy)
    precision_scores.append(precision)
    recall_scores.append(recall)
    f1_scores.append(f1)
# Create a DataFrame with the calculated metrics
metrics_df = pd.DataFrame({
    'Model': model_names_list,
    'Accuracy': accuracy_scores,
    'Precision': precision_scores,
    'Recall': recall_scores,
    'F1-Score': f1_scores
})
# Display the DataFrame
metrics_df.sort_values(by='Accuracy', ascending=False)
- Accuracy: Accuracy measures the proportion of correctly predicted instances out of the total instances in the dataset. It is an important metric for classification tasks. Our XGBoost model has the highest accuracy (0.852). This means that around 85% of the predictions made by the XGBoost model were correct. Since our data is balanced, we want to use the accuracy score as our main evaluation metric.
With balanced data, there is less risk of a classification model being biased towards one class due to class imbalance. As a result, accuracy effectively captures the model’s ability to correctly classify instances from both classes equally. When the classes are balanced, a high accuracy score indicates that the model is performing well across both classes and is making accurate predictions overall.
- Precision: Precision quantifies how many of the positive predictions made by the model were actually correct. It’s the ratio of true positives (correctly predicted positives) to the total number of instances predicted as positive. A higher precision indicates fewer false positives. In the XGBoost model, the precision is approximately 0.8487, meaning that about 85% of the positive predictions made by the model were accurate.
- Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive instances that were correctly predicted by the model. It’s the ratio of true positives to the total number of actual positives. A higher recall indicates fewer false negatives. In the XGBoost model, the recall is approximately 0.8567, indicating that about 86% of the actual positive instances were correctly identified by the model.
- F1-Score: The F1-Score is the harmonic mean of precision and recall. It combines both precision and recall into a single metric. The F1-Score gives us a balanced measure of a model’s performance, considering both false positives and false negatives. In the XGBoost model, the F1-Score is approximately 0.8527, which takes into account both precision and recall.
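The four metrics above can be worked out by hand from the counts of true and false positives and negatives. Here is a minimal sketch in plain Python with made-up labels and predictions (not the project's results); the helper is hypothetical and, for brevity, does not guard against division by zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up example: 4 customers churned, the model caught 3 and raised 1 false alarm
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

toy_metrics = classification_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)
```

Here 8 of 10 predictions are correct (accuracy 0.8), 3 of the 4 predicted churners really churned (precision 0.75), and 3 of the 4 actual churners were caught (recall 0.75), so the F1-score is also 0.75.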
Our top 3 models (Random Forest, XGBoost and Gradient Boosting) are all tree based models, specifically ensemble learning techniques that combine multiple individual trees to improve overall performance and robustness. They reduce overfitting by averaging or boosting the individual trees’ predictions. They offer a combination of powerful features that make them robust, accurate, and versatile for classification tasks across a wide range of domains and data characteristics.
Hyperparameter tuning
Machine learning models have settings that must be chosen before training, and finding the combination that gives the best performance requires a search. These settings, which define the model architecture and control how it learns, are referred to as hyperparameters, while the process of exploring a range of values for them is called hyperparameter tuning. It is important to note the distinction between model parameters and hyperparameters: model parameters are learnt from the data during the training phase, whereas hyperparameters are set before training begins. Ideally, when hyperparameter tuning is completed, the result is the best hyperparameters for the model. Grid search and random search are two common strategies for tuning hyperparameters.
Currently, our models have been trained with the default parameters that they come with. These parameters control aspects of the model’s learning process and can significantly impact its performance. Before we choose a final model, we have to tune the hyperparameters of our best performing models to see if we can improve their accuracy, and then choose the best.
In our example, I will fine-tune our top 3 models using the RandomizedSearchCV class found in the sklearn.model_selection module to find the best hyperparameters and achieve the maximum performance of each of the top 3 models, then compare them again to select the best one.
Randomized Search CV
RandomizedSearchCV is a technique used for hyperparameter tuning in machine learning models. It is an alternative to the more traditional GridSearchCV method, which exhaustively tries every combination in a specified grid of hyperparameter values. RandomizedSearchCV, on the other hand, evaluates only a fixed number of randomly sampled combinations from the specified hyperparameter space.
The main reason to use a randomized search over a grid search is because RandomizedSearchCV is computationally more efficient (takes less time and resources) compared to GridSearchCV, especially when the hyperparameter space is large. By randomly sampling a subset of configurations, it can explore a wider range of hyperparameter values with fewer total evaluations.
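The difference in cost is easy to see with a small sketch in plain Python. The grid below is a reduced, illustrative one and the sample size of 20 is an arbitrary assumption; the point is only to compare how many configurations each strategy would have to evaluate.

```python
import itertools
import random

# Illustrative grid (a reduced set of random forest settings)
toy_param_grid = {
    "n_estimators": [20, 50, 100, 200, 300],
    "max_depth": [None, 10, 15, 20, 25],
    "min_samples_split": [2, 3, 4, 5, 6],
    "criterion": ["gini", "entropy"],
}

# Grid search: every combination of every value
names = list(toy_param_grid)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*toy_param_grid.values())]

# Random search: a fixed number of combinations, drawn without replacement
rng = random.Random(42)
sampled_combinations = rng.sample(all_combinations, k=20)

print("grid search would evaluate:", len(all_combinations))
print("random search evaluates:   ", len(sampled_combinations))
print("one sampled configuration: ", sampled_combinations[0])
```

With 5-fold cross-validation, each configuration means five model fits, so cutting the number of configurations is where the time saving comes from.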
After tuning the top 3 models, the best model was the random forest classifier. I will only be showing the tuning process for this model.
First it’s a good idea to look at the current parameters of the model.
# Check current model parameters
current_params = rfm.get_params()
current_params
These are the current parameters for our model. Every model has different parameters. At first glance the output can look like complex scientific gibberish, and it does take some experience and study to understand the meaning and effects of the different hyperparameters.
For our random search, we will pick a grid of hyperparameters, and random combinations of them will be sampled.
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter distributions for hyperparameter tuning
param_grid = {
    'n_estimators': [20, 50, 100, 200, 300],
    'max_depth': [None, 10, 15, 20, 25],
    'min_samples_split': [2, 3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'class_weight': ['balanced', None],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', 'log2'],
    'criterion': ['gini', 'entropy']
}
# Initialize RandomizedSearchCV with the RandomForestClassifier model and parameter distributions
random_search_rf = RandomizedSearchCV(estimator=rfm, param_distributions=param_grid,
                                      scoring='accuracy', n_iter=150, random_state=42,
                                      cv=5, n_jobs=-1, verbose=1)
# run the search on the train data (each sampled candidate is scored with 5-fold CV)
random_search_rf.fit(X_train, y_train)
# best parameters found by the search
best_params = random_search_rf.best_params_
best_params
This code uses RandomizedSearchCV to search the hyperparameter values in param_grid for the combination that maximizes cross-validated accuracy on our training data. The hyperparameters we tune for the random forest classifier are defined below:
- n_estimators: Number of trees in the forest.
- max_depth: Maximum depth of the trees.
- min_samples_split: Minimum number of samples required to split an internal node.
- min_samples_leaf: Minimum number of samples required to be at a leaf node.
- class_weight: Weights associated with classes in the form of a dictionary or "balanced" to automatically adjust weights inversely proportional to class frequencies.
- max_features: Number of features to consider when looking for the best split.
- criterion: Function to measure the quality of a split ('gini' or 'entropy').
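Two of these options are easier to grasp with numbers. The sketch below is plain Python with made-up labels, not the project data. It computes the "balanced" class weights as n_samples / (n_classes * class_count), and the Gini impurity and entropy (in bits) of a node from its class proportions. The helper names are my own and are not part of scikit-learn.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def gini(proportions):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class split."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits: 0 for a pure node, 1 for an even two-class split."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


labels = ["No"] * 75 + ["Yes"] * 25  # made-up imbalanced target
weights = balanced_class_weights(labels)

print(weights)                   # the rarer class gets the larger weight
print(gini([0.75, 0.25]))        # impurity of a node with this class mix
print(entropy([0.75, 0.25]))
```

Both criteria reward splits that produce purer child nodes, so in practice they often lead to similar trees, which is why it is reasonable to let the search decide between them.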
We also get an output of the best parameters.
We can now fit the tuned model with the best parameters on our train data, predict on our test data, and compare its classification report with that of our original model.
Our model has improved by every metric. Our accuracy has improved slightly to 86%. This is fair enough. Let’s look at a confusion matrix to see its performance a little closer.
Confusion Matrix
A confusion matrix is an N x N matrix that summarizes the correct and incorrect predictions for the N target classes. The values on the diagonal of the matrix represent the number of correctly predicted instances, while every other cell counts misclassified instances. This means that the more predictions fall on the diagonal, the better the model. True positive, false positive, true negative and false negative are the terms used when interpreting a confusion matrix.
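Before plotting the real matrix, a tiny hand-built example may help. This is a sketch in plain Python with made-up labels. The helper confusion_counts is hypothetical, and it follows the convention used by scikit-learn's confusion_matrix, where rows are the actual classes and columns are the predicted classes.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Made-up labels: "Yes" (churn) is treated as the positive class
y_true = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No"]

matrix = confusion_counts(y_true, y_pred, labels=["Yes", "No"])
(tp, fn), (fp, tn) = matrix

print(matrix)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

The order passed in labels decides which class lands in the first row and column, so it is worth fixing it explicitly before reading off true and false positives.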
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# construct the confusion matrix for tuned model
# (rows are the actual classes, columns are the predicted classes)
confusion_matrix_rf = confusion_matrix(y_test, random_search_rf_pred)
plt.figure(figsize=(6, 6))
# NOTE: the tick labels must follow the sorted order of the class labels in y_test;
# the order below assumes the churn class sorts first
sns.heatmap(confusion_matrix_rf, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Churn', 'No Churn'],
            yticklabels=['Churn', 'No Churn'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Reading the matrix with the actual classes on the rows and the predicted classes on the columns, and taking churn as the positive class as labeled above:
- The top-left value (630) represents the number of positive instances that were correctly classified as positive (often referred to as True Positives or TP).
- The top-right value (110) represents the number of instances where the actual class is positive but the predicted class is negative (False Negatives or FN). This is also known as a Type II error.
- The bottom-left value (97) represents the number of instances where the actual class is negative but the predicted class is positive (False Positives or FP). This is also called a Type I error.
- The bottom-right value (643) represents the number of instances that were correctly classified as belonging to the negative class (True Negatives or TN).
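As a sanity check, the four counts can be turned back into the headline metrics with a few lines of plain Python. This sketch assumes the rows hold the actual classes and the columns the predicted classes, as in the output of scikit-learn's confusion_matrix, and it refers to the classes simply by their position in the matrix.

```python
# Counts read off the confusion matrix above
matrix = [[630, 110],
          [97, 643]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Recall per class: correct predictions divided by the row (actual) total
recall = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]

# Precision per class: correct predictions divided by the column (predicted) total
precision = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(len(matrix))]

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {[round(r, 3) for r in recall]}")
print(f"precision: {[round(p, 3) for p in precision]}")
```

The accuracy works out to about 86%, which agrees with the figure from the classification report.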
We can now save our tuned model and use it to make predictions on unseen data.
import os
import joblib

destination = "Toolkit"
# create the directory if it doesn't exist
os.makedirs(destination, exist_ok=True)
# Create a dictionary to store the objects and their filenames
models = {
    "model": tuned_rf_model,
}
# Loop through the models and save each one using joblib.dump()
for name, model in models.items():
    file_path = os.path.join(destination, f"{name}.joblib")
    joblib.dump(model, file_path)
And that’s it, we have successfully trained, tuned and saved a classification model 🙌
Thank you for reading!
You can access all the code for this project in my GitHub repository. Show your appreciation if you found this insightful, and leave me a comment if you have any points or suggestions.
Connect with me on LinkedIn.