Comprehensive Guide to Machine Learning (Part 3 of 3)

Tapas Das
Published in Analytics Vidhya
8 min read · Sep 3, 2020


Pic Courtesy: https://www.advectas.com/en/blog/what-is-machine-learning/

Welcome to the 3rd and final part of the “Comprehensive Guide to Machine Learning” series. Over the course of this series, we looked at several crucial concepts which play a significant role in developing a good machine learning model.

Concepts like data cleansing and EDA help a great deal in acquiring a deeper understanding of the data. Similarly, concepts like feature engineering and feature selection help ensure that only useful and relevant data is fed to the machine learning model. You can get a quick recap of these concepts by visiting the links below:

In this final post, we will look at actual model development and how to perform model validation to ensure that the model is behaving as expected and predicting correct results.

7) Baseline model building

The first step towards model building is to decide which machine learning framework to use, based on the type of task at hand. Below are a few machine learning frameworks that work well in the majority of scenarios:

  • Neural Networks (Tensorflow/ Keras/ PyTorch)
  • XGBoost
  • LightGBM
  • CatBoost
  • Logistic Regression
  • Random Forest
  • Support Vector Machines (SVM)

The “Pet Adoption” dataset that I have been referring to throughout this series has mostly categorical features. So I decided to build the final machine learning model using the CatBoost framework, since it works pretty well with categorical features.

For experimentation's sake, I also tried out models based on Neural Networks, XGBoost and LightGBM, but CatBoost gave me the best results of all these machine learning frameworks. You can find the other codebases at the link below.

The image below shows the final CatBoost model I used for making predictions.
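Since the screenshot itself isn't reproduced here, below is a minimal sketch of how such a model could be wired up, pulling together the parameters discussed in the rest of this section. The iteration count, random seed and the commented-out fit call are illustrative assumptions, not the exact original code.

from catboost import CatBoostClassifier

# A sketch of the final model configuration; iterations and random_seed
# are assumed values, not necessarily the ones in the original model.
model = CatBoostClassifier(
    objective='MultiClass',           # multi-class classification task
    eval_metric='TotalF1',            # averaged F1 across all classes
    class_weights=[0.165, 0.185, 1],  # compensate for class imbalance
    learning_rate=0.025,              # small step size for stable convergence
    reg_lambda=0.009,                 # L2 regularisation on leaf values
    iterations=5000,                  # assumed; tune for your dataset
    random_seed=42,
    verbose=500,
)

# model.fit(X_train, y_train,
#           eval_set=(X_valid, y_valid),
#           cat_features=categorical_columns,  # your categorical feature names
#           use_best_model=True)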

Let’s go over a few of the crucial model configurations to better understand the nuts and bolts of the CatBoost framework.

objective='MultiClass'

In the “Pet Adoption” dataset, we are classifying between 4 different breed categories and 3 different pet categories. Since there are more than two target classes, we need to set the model objective to “MultiClass”.

eval_metric='TotalF1'

We need to set an evaluation metric for the CatBoost model, so that we can judge whether the model is behaving as expected and converging towards the global minimum. For classification tasks, the commonly used evaluation metrics are listed below:

  • Accuracy
  • F1 Score
  • Area Under the Curve (AUC)
  • Precision Score
  • Recall Score

Generally, for highly imbalanced datasets like “Pet Adoption”, it's preferable to use the F1 Score or AUC as the evaluation criterion, since they give a far more holistic picture of model performance than plain accuracy does. You can get a better understanding of these metrics by visiting the link below.

class_weights=[0.165, 0.185, 1]

In the case of imbalanced datasets, machine learning models tend to get biased towards the majority class. To ensure that the model gives proper weightage to all the prediction classes, we can set the class weights.

The below example shows how to calculate the class weights for an imbalanced dataset.

Total samples: 100

Class A: 60 samples → 60/100 = 0.6
Class B: 30 samples → 30/100 = 0.3
Class C: 10 samples → 10/100 = 0.1

Class Weight = (Lowest Class %) / (Current Class %)

Class A weight: 0.1/0.6 = 0.1667
Class B weight: 0.1/0.3 = 0.3333
Class C weight: 0.1/0.1 = 1
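The same calculation takes only a few lines of Python; the snippet below simply mirrors the toy example above.

# Class counts from the toy example above. The weight formula
# (lowest class %) / (current class %) reduces to
# (smallest class count) / (current class count).
counts = {'A': 60, 'B': 30, 'C': 10}

smallest = min(counts.values())
weights = {cls: round(smallest / n, 4) for cls, n in counts.items()}
print(weights)  # {'A': 0.1667, 'B': 0.3333, 'C': 1.0}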

learning_rate=0.025

This hyper-parameter sets the speed at which the model converges towards the global minimum.

Learning rate is among the most critical hyper-parameters in any machine learning model. Setting it too low will cause the model to train very slowly, while setting it too high can cause the model to overshoot the global minimum and result in poor predictions.

I'd highly recommend going through the link below to get a deeper understanding of learning rate and its significance in machine learning models.

reg_lambda=0.009

This parameter applies L2 regularisation, which helps prevent model overfitting, a topic we already discussed in the 2nd post of this series.

You can visit the link below for a better understanding of the rest of the CatBoost model parameters.

Also, I'd highly recommend going through the link below for a better understanding of ensemble machine learning models (CatBoost, XGBoost, LightGBM).

8) Hyper-parameters Tuning

Let’s first understand the difference between Model Parameters and Model Hyper-parameters.

Model Parameters: These are the parameters that the model determines on its own while training on the dataset provided. These are the fitted parameters.

Model Hyper-parameters: These are adjustable parameters that must be tuned, prior to model training, in order to obtain a model with optimal performance.

Hyper-parameters are important since they directly control the behaviour of the training algorithm and have a significant impact on the performance of the resulting model.

In the case of a CatBoost model, below are the critical hyper-parameters to fine-tune to achieve an optimally performing model.

  • Learning Rate: controls training speed
  • Regularisation Lambda: controls model overfitting
  • Subsample: fraction of the training data sampled for building each tree
  • Max Depth: controls depth of decision trees
  • Min Data in Leaf: controls model overfitting
  • Max Leaves: controls model complexity

I have used the “Optuna” Python library to automate the task of hyper-parameter tuning. You can visit the link below to get a better understanding of Optuna and how it works behind the scenes.

The image below shows the objective function created for Optuna, so that it can search for good hyper-parameters.
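That screenshot isn't reproduced here, but a minimal sketch of such an objective function might look like the following. The search ranges are illustrative assumptions, and X_train/y_train/X_valid/y_valid are assumed to come from the earlier train/validation split.

import optuna
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

def objective(trial):
    # Search ranges below are illustrative; adjust them to your problem.
    params = {
        'objective': 'MultiClass',
        'eval_metric': 'TotalF1',
        'grow_policy': 'Depthwise',     # min_data_in_leaf needs Depthwise/Lossguide
        'bootstrap_type': 'Bernoulli',  # subsample needs a compatible bootstrap type
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 100),
    }
    model = CatBoostClassifier(iterations=1000, verbose=0, **params)
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid),
              early_stopping_rounds=50)
    preds = model.predict(X_valid).ravel()
    return f1_score(y_valid, preds, average='weighted')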

Once the objective function is set, we can use the below commands to let Optuna run free on the hyper-parameter search space.
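Something along these lines works; the trial count is an arbitrary choice, and the objective function is the one sketched above.

# Maximise the validation F1 score returned by the objective function.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)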

Once Optuna has finished the given number of trials, we can extract the best hyper-parameters by executing the below commands.
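For example, continuing with the study object from the previous snippet:

print(study.best_value)   # best validation F1 score found
print(study.best_params)  # dictionary of the winning hyper-parameters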

9) Model validation

Once we are set with our model and done with hyper-parameter tuning, the next step is to validate how the model performs on the “Validation” and “Test” datasets.

The Sklearn Python library provides several cross-validation utilities; the two listed below cover most situations.

  • K-Fold: For regression and balanced classification tasks
  • Stratified K-Fold: For imbalanced classification tasks

Since we are dealing with an imbalanced dataset, I chose to go with “Stratified K-Fold” cross-validation with 5 splits. The images below show the model validation performed using the same.

The training results would look like the images below.
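Those screenshots aren't reproduced here, but a minimal sketch of the 5-fold validation loop might look like this. It assumes X and y are the pandas feature matrix and target from the earlier steps, and reuses the tuned parameters from above.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    X_tr, X_va = X.iloc[train_idx], X.iloc[valid_idx]
    y_tr, y_va = y.iloc[train_idx], y.iloc[valid_idx]

    model = CatBoostClassifier(objective='MultiClass', eval_metric='TotalF1',
                               class_weights=[0.165, 0.185, 1],
                               learning_rate=0.025, reg_lambda=0.009, verbose=0)
    model.fit(X_tr, y_tr, eval_set=(X_va, y_va), early_stopping_rounds=50)

    score = f1_score(y_va, model.predict(X_va).ravel(), average='weighted')
    fold_scores.append(score)
    print(f'Fold {fold + 1}: F1 = {score:.4f}')

print(f'Mean validation F1: {np.mean(fold_scores):.4f}')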

As we can see from the training results, the training F1-score is around 99% and the validation F1-score is around 93%. The gap between the two is modest, so the model is neither underfitting nor overfitting badly.

Next, let’s check the F1-score on the “Test” dataset.

As we can see, the test F1-score is around 90%, which is close to the validation score, so the model generalises well to unseen data. We can further confirm this by checking the confusion matrix shown below.
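Producing the matrix itself is a one-liner with sklearn; the sketch below assumes y_test, X_test and the fitted model from the steps above.

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; large off-diagonal
# counts reveal which classes the model confuses with each other.
print(confusion_matrix(y_test, model.predict(X_test).ravel()))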

I'd highly recommend going through the links below to acquire a better understanding of model validation techniques and the confusion matrix.

10) Making Predictions

Let’s quickly revisit what all we did up to this point.

  • We cleaned the data and performed EDA on it to gain better understanding
  • We performed feature engineering to generate new insightful features for the model
  • We performed feature selection to discard irrelevant features
  • We performed train/validation/test split to prepare the datasets for model validation
  • We built the baseline model and performed hyper-parameters tuning to get the best set of hyper-parameters
  • We performed cross-validation to ensure that the model is behaving as expected

Phew! That's a hell of a lot of steps to perform to build the final machine learning model. This brings us to the final showdown moment: making predictions on the test dataset.

I usually follow the same cross-validation approach here: train the model on multiple K-fold splits, predict with each fold's model, and then average all the predictions made across the data splits.

The image below shows the codebase for the same.
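That screenshot isn't reproduced here, but a sketch of the fold-averaging approach could look like this, again assuming X, y, X_test and the tuned parameters from the earlier sections.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_probas = []

for train_idx, valid_idx in skf.split(X, y):
    model = CatBoostClassifier(objective='MultiClass', eval_metric='TotalF1',
                               class_weights=[0.165, 0.185, 1],
                               learning_rate=0.025, reg_lambda=0.009, verbose=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx],
              eval_set=(X.iloc[valid_idx], y.iloc[valid_idx]),
              early_stopping_rounds=50)
    # Collect class probabilities on the test set from this fold's model.
    test_probas.append(model.predict_proba(X_test))

# Average the per-fold probabilities, then take the most likely class.
mean_proba = np.mean(test_probas, axis=0)
final_preds = model.classes_[mean_proba.argmax(axis=1)]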

Concluding Remarks

This concludes the 3rd and final part of the comprehensive machine learning guide. I do hope you gained insight into the nitty-gritty of machine learning model development, and I'm sure you'll use the learnings from this series to build models of your own and make predictions on real-world problems.

As always, you can find the codebase for this post at the link below.

Do leave me your comments, feedback and challenges (if you're facing any), and I'll touch base with you individually so we can collaborate.

Also please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
