Comprehensive Guide to Machine Learning (Part 3 of 3)

Tapas Das
Published in Analytics Vidhya
8 min read · Sep 3, 2020


Pic Courtesy: https://www.advectas.com/en/blog/what-is-machine-learning/

Welcome to the 3rd and final part of the “Comprehensive Guide to Machine Learning” series. Over the course of this series, we looked at several crucial concepts which play a significant role in developing a good machine learning model.

Concepts like data cleansing and EDA help a great deal in acquiring a deeper understanding of the data. Similarly, concepts like feature engineering and feature selection help ensure that only useful and relevant data is fed to the machine learning model. You can get a quick recap of these concepts by visiting the links below:

In this final post, we will look at actual model development and how to perform model validation to ensure that the model is behaving as expected and predicting correct results.

7) Baseline model building

The first step towards model building is to decide which machine learning framework to use, based on the type of task at hand. Below are a few machine learning frameworks that work well in the majority of scenarios:

  • Neural Networks (Tensorflow/ Keras/ PyTorch)
  • XGBoost
  • LightGBM
  • CatBoost
  • Logistic Regression
  • Random Forest
  • Support Vector Machines (SVM)

The “Pet Adoption” dataset that I have been referring to throughout this series has mostly categorical features. So I decided to build the final machine learning model using the CatBoost framework, since it works pretty well with categorical features.

For experimentation's sake, I also tried out models based on Neural Networks, XGBoost and LightGBM, but CatBoost gave me the best results of all these machine learning frameworks. You can find the other codebases at the link below.

The image below shows the final CatBoost model I used for making predictions.
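Since the screenshot itself isn't reproduced here, below is a minimal sketch of how such a model could be wired up, pulling together the parameters discussed in the rest of this section. The iteration count, random seed and the commented-out fit call are illustrative assumptions, not the exact original code.

from catboost import CatBoostClassifier

# A sketch of the final model configuration; iterations and random_seed
# are assumed values, not necessarily the ones in the original model.
model = CatBoostClassifier(
    objective='MultiClass',           # multi-class classification task
    eval_metric='TotalF1',            # averaged F1 across all classes
    class_weights=[0.165, 0.185, 1],  # compensate for class imbalance
    learning_rate=0.025,              # small step size for stable convergence
    reg_lambda=0.009,                 # L2 regularisation on leaf values
    iterations=5000,                  # assumed; tune for your dataset
    random_seed=42,
    verbose=500,
)

# model.fit(X_train, y_train,
#           eval_set=(X_valid, y_valid),
#           cat_features=categorical_columns,  # your categorical feature names
#           use_best_model=True)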

Let’s go over a few of the crucial model configurations to better understand the nuts and bolts of the CatBoost framework.

objective='MultiClass'

In the “Pet Adoption” dataset, we are classifying between 4 different breed categories and 3 different pet categories. Since there are more than two target classes, we need to set the model objective to “MultiClass”.

eval_metric='TotalF1'

We need to set an evaluation metric for the CatBoost model, so that we can judge whether the model is behaving as expected and converging towards the global minimum. For classification tasks, the commonly used evaluation metrics are listed below:

  • Accuracy
  • F1 Score
  • Area Under the Curve (AUC)
  • Precision Score
  • Recall Score

Generally, for highly imbalanced datasets like “Pet Adoption”, it's preferable to use the F1 Score or AUC as the evaluation criterion, since they give a far more holistic picture of model performance than plain accuracy does. You can get a better understanding of these metrics by visiting the link below.

class_weights=[0.165, 0.185, 1]

In the case of imbalanced datasets, machine learning models tend to get biased towards the majority class. To ensure that the model gives proper weightage to all the prediction classes, we can set the class weights.

The below example shows how to calculate the class weights for an imbalanced dataset.

Total samples: 100

Class A: 60 samples → 60/100 = 0.6
Class B: 30 samples → 30/100 = 0.3
Class C: 10 samples → 10/100 = 0.1

Class Weight = (Lowest Class %) / (Current Class %)

Class A weight: 0.1/0.6 = 0.1667
Class B weight: 0.1/0.3 = 0.3333
Class C weight: 0.1/0.1 = 1
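The same calculation takes only a few lines of Python; the snippet below simply mirrors the toy example above.

# Class counts from the toy example above. The weight formula
# (lowest class %) / (current class %) reduces to
# (smallest class count) / (current class count).
counts = {'A': 60, 'B': 30, 'C': 10}

smallest = min(counts.values())
weights = {cls: round(smallest / n, 4) for cls, n in counts.items()}
print(weights)  # {'A': 0.1667, 'B': 0.3333, 'C': 1.0}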

learning_rate=0.025

This hyper-parameter sets the speed at which the model converges towards the global minimum.

Learning rate is among the most critical hyper-parameters in any machine learning model. Setting it too low will cause the model to train very slowly, while setting it too high can cause the model to overshoot the global minimum and result in poor predictions.

I'd highly recommend going through the link below to get a deeper understanding of learning rate and its significance in machine learning models.

reg_lambda=0.009

This parameter applies L2 regularisation, which helps prevent model overfitting, a topic we already discussed in the 2nd post of this series.

You can visit the link below for a better understanding of the rest of the CatBoost model parameters.

Also, I'd highly recommend going through the link below for a better understanding of ensemble machine learning models (CatBoost, XGBoost, LightGBM).

8) Hyper-parameters Tuning

Let’s first understand the difference between Model Parameters and Model Hyper-parameters.

Model Parameters: These are the parameters that the model determines on its own while training on the dataset provided. These are the fitted parameters.

Model Hyper-parameters: These are adjustable parameters that must be tuned, prior to model training, in order to obtain a model with optimal performance.

Hyper-parameters are important since they directly control the behaviour of the training algorithm and have a significant impact on the performance of the resulting model.

In the case of a CatBoost model, below are the critical hyper-parameters to fine-tune to achieve an optimally performing model.

  • Learning Rate: controls training speed
  • Regularisation Lambda: controls model overfitting
  • Subsample: fraction of the training data sampled for building each tree
  • Max Depth: controls depth of decision trees
  • Min Data in Leaf: controls model overfitting
  • Max Leaves: controls model complexity

I have used the “Optuna” Python library to automate the task of hyper-parameter tuning. You can visit the link below to get a better understanding of Optuna and how it works behind the scenes.

The image below shows the objective function created for Optuna, so that it can search for good hyper-parameters.
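That screenshot isn't reproduced here, but a minimal sketch of such an objective function might look like the following. The search ranges are illustrative assumptions, and X_train/y_train/X_valid/y_valid are assumed to come from the earlier train/validation split.

import optuna
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

def objective(trial):
    # Search ranges below are illustrative; adjust them to your problem.
    params = {
        'objective': 'MultiClass',
        'eval_metric': 'TotalF1',
        'grow_policy': 'Depthwise',     # min_data_in_leaf needs Depthwise/Lossguide
        'bootstrap_type': 'Bernoulli',  # subsample needs a compatible bootstrap type
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 100),
    }
    model = CatBoostClassifier(iterations=1000, verbose=0, **params)
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid),
              early_stopping_rounds=50)
    preds = model.predict(X_valid).ravel()
    return f1_score(y_valid, preds, average='weighted')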

Once the objective function is set, we can use the below commands to let Optuna run free on the hyper-parameter search space.
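Something along these lines works; the trial count is an arbitrary choice, and the objective function is the one sketched above.

# Maximise the validation F1 score returned by the objective function.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)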

Once Optuna has finished the given number of trials, we can extract the best hyper-parameters by executing the below commands.
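For example, continuing with the study object from the previous snippet:

print(study.best_value)   # best validation F1 score found
print(study.best_params)  # dictionary of the winning hyper-parameters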

9) Model validation

Once we are set with our model and done with hyper-parameter tuning, the next step is to validate how the model performs on the “Validation” and “Test” datasets.

The Sklearn Python library provides several cross-validation utilities; the two listed below cover most situations.

  • K-Fold: For regression and balanced classification tasks
  • Stratified K-Fold: For imbalanced classification tasks

Since we are dealing with an imbalanced dataset, I chose to go with “Stratified K-Fold” cross-validation with 5 splits. The images below show the model validation performed using the same.

The training results would look like the images below.
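Those screenshots aren't reproduced here, but a minimal sketch of the 5-fold validation loop might look like this. It assumes X and y are the pandas feature matrix and target from the earlier steps, and reuses the tuned parameters from above.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    X_tr, X_va = X.iloc[train_idx], X.iloc[valid_idx]
    y_tr, y_va = y.iloc[train_idx], y.iloc[valid_idx]

    model = CatBoostClassifier(objective='MultiClass', eval_metric='TotalF1',
                               class_weights=[0.165, 0.185, 1],
                               learning_rate=0.025, reg_lambda=0.009, verbose=0)
    model.fit(X_tr, y_tr, eval_set=(X_va, y_va), early_stopping_rounds=50)

    score = f1_score(y_va, model.predict(X_va).ravel(), average='weighted')
    fold_scores.append(score)
    print(f'Fold {fold + 1}: F1 = {score:.4f}')

print(f'Mean validation F1: {np.mean(fold_scores):.4f}')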

As we can see from the training results, the training F1-score is around 99% and the validation F1-score is around 93%. The gap between the two is modest, so the model is neither underfitting nor overfitting badly.

Next, let’s check the F1-score on the “Test” dataset.

As we can see, the test F1-score is around 90%, which is close to the validation score, so the model generalises well to unseen data. We can further confirm this by checking the confusion matrix shown below.
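Producing the matrix itself is a one-liner with sklearn; the sketch below assumes y_test, X_test and the fitted model from the steps above.

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; large off-diagonal
# counts reveal which classes the model confuses with each other.
print(confusion_matrix(y_test, model.predict(X_test).ravel()))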

I'd highly recommend going through the links below to acquire a better understanding of model validation techniques and the confusion matrix.

10) Making Predictions

Let’s quickly revisit what all we did up to this point.

  • We cleaned the data and performed EDA on it to gain better understanding
  • We performed feature engineering to generate new insightful features for the model
  • We performed feature selection to discard irrelevant features
  • We performed train/validation/test split to prepare the datasets for model validation
  • We built the baseline model and performed hyper-parameters tuning to get the best set of hyper-parameters
  • We performed cross-validation to ensure that the model is behaving as expected

Phew! That's a hell of a lot of steps to perform to build the final machine learning model. This brings us to the final showdown moment: making predictions on the test dataset.

I usually follow the same cross-validation approach here: train the model on multiple K-fold splits, predict with each fold's model, and then average all the predictions made across the data splits.

The image below shows the codebase for the same.
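That screenshot isn't reproduced here, but a sketch of the fold-averaging approach could look like this, again assuming X, y, X_test and the tuned parameters from the earlier sections.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_probas = []

for train_idx, valid_idx in skf.split(X, y):
    model = CatBoostClassifier(objective='MultiClass', eval_metric='TotalF1',
                               class_weights=[0.165, 0.185, 1],
                               learning_rate=0.025, reg_lambda=0.009, verbose=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx],
              eval_set=(X.iloc[valid_idx], y.iloc[valid_idx]),
              early_stopping_rounds=50)
    # Collect class probabilities on the test set from this fold's model.
    test_probas.append(model.predict_proba(X_test))

# Average the per-fold probabilities, then take the most likely class.
mean_proba = np.mean(test_probas, axis=0)
final_preds = model.classes_[mean_proba.argmax(axis=1)]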

Concluding Remarks

This concludes the 3rd and final part of the comprehensive machine learning guide. I do hope you gained insight into the nitty-gritty of machine learning model development, and I'm sure you'll use the learnings from this series to build models of your own and make predictions on real-world problems.

As always, you can find the codebase for this post at the link below.

Do leave me your comments, feedback and challenges (if you're facing any), and I'll touch base with you individually so we can collaborate.

Also please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
