Avoiding Machine Learning Pitfalls: From a practitioner’s perspective — Part 2

Abinaya Mahendiran · Published in WiCDS · Jul 5, 2022

The aim of this blog is to understand how one can reliably build machine learning models. Many data science aspirants choose the field because they think that building models to solve real-world problems is cool! And sure, it is, but choosing the right model for the dataset at hand is of utmost importance. Never fall prey to the latest advanced SOTA model architecture just because it is popular and used by everyone.

Stage 2: How to reliably build models
Given the availability of SOTA models in ML/DL through libraries like HuggingFace and TensorFlow Hub, and tools like AutoML, anyone with decent coding skills can train/retrain models with ease. But understanding how to choose the right model, how the training process should be conducted, and how to select the right hyperparameters requires far more effort than just using the tools/libraries. The following points can help in choosing the right model for the problem at hand:

i. Prevent test data leakage into the training process. A basic mistake that many beginners in ML make is letting the test data leak into the training process. There are a number of ways this could happen:

a) Performing data preparation on the entire dataset without first splitting it into train, validation, and test subsets. Check out scikit-learn to split your dataset.
b) Performing feature engineering and feature selection on the whole dataset.
c) Using the same test set to evaluate the generality of multiple models.

To prevent data leakage, always perform data preparation, feature engineering, and feature selection using only the train data, and save the fitted transformations so that they can be used to transform the test set. Also, use the validation set to fine-tune the hyperparameters of the model so that it generalizes better on unseen data.
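As a minimal sketch of this workflow with scikit-learn (using a toy dataset in place of your own data), the scaler below is fitted on the training split alone and then reused, unchanged, on the test split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for your own dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set before doing any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse the same fitted scaler to transform the test split
X_test_scaled = scaler.transform(X_test)
```

The same pattern applies to any preprocessing step: encoders, imputers, and feature selectors should all be fitted on the training split only.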

ii. Experiment with different models. Have you heard of the “No Free Lunch” theorem? There is no single model that will perform best on all problems. Once you understand what you are trying to solve, start with the simplest baseline model. Keep experimenting with different models till you find the best one for your problem. Fail fast. Use a priori knowledge and pick the models to try accordingly. Talking to domain experts/SMEs and researching what has already been done in your domain will help you identify the right model for your task.
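One way to set up this fail-fast loop (sketched here with scikit-learn on a toy dataset; the candidate models are just placeholders) is to benchmark a trivial baseline against a couple of simple alternatives before reaching for anything heavier:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Start from a trivial baseline and only move up if a model clearly beats it
candidates = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```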

iii. Do not use inappropriate models. Make sure you pick the model based on the characteristics of the dataset. For example, if your data contains mostly categorical features, a tree-based method is a natural fit. If you have little data to start with, a classical ML model is usually more suitable than a DL model. If your problem can be solved using statistical methods, or even rules for that matter, then do not pick a complex, fancy deep learning model. You should never choose a model based on its popularity. Swatting a fly can be done with just your hands, you know! The justification for choosing a model should be based entirely on the data, never on the recency of the model.

iv. To make the model better, always optimize the model hyperparameters. In applied research, your aim is to find the best model that can solve the business problem. Every model, be it ML or DL, comes with a default set of parameters or architectures, and most of the time the default configuration will not give the best result. No one size fits all. You may need to tune the hyperparameters against the validation dataset, and doing this manually is a tedious task. To make your life easier, you can resort to an optimization strategy such as grid search, random search, or Bayesian optimization. But it is still a challenge to scale these methods to a large number of hyperparameters or to complex models. AutoML tools can significantly reduce the effort of finding the right hyperparameters for a model, and in fact the right model for your problem.
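For example, here is a hedged sketch of random search with scikit-learn’s RandomizedSearchCV (the search space and toy dataset are purely illustrative, not a recommendation for your problem):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Search space for a few hyperparameters (these values are purely illustrative)
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,        # number of random configurations to try
    cv=5,             # each configuration scored with 5-fold cross-validation
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Each sampled configuration is scored with cross-validation, so the number of model fits grows with both n_iter and the number of folds; budget accordingly.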

v. Feature selection and hyperparameter optimization should be done carefully as part of training. Remember the three splits of the dataset that we discussed earlier (point i): always perform feature engineering and feature selection on the train split and hyperparameter tuning on the validation split.

In general, k-fold cross-validation is used for hyperparameter tuning and model selection. In k-fold cross-validation, the train data is split into k non-overlapping folds; each fold in turn serves as the held-out set while the remaining k-1 folds form the training set, so k models are fit, and the mean of their performance on the k held-out folds is reported. If the same procedure is also used to pick hyperparameters and the winning model, it may yield a highly optimistic estimate of performance, because you are inadvertently overfitting to the held-out folds and the model may not generalize well.
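Spelled out with scikit-learn on a toy dataset, the plain k-fold procedure looks roughly like this (a sketch of the mechanics only, before any hyperparameter tuning enters the picture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, held_out_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])      # fit on the k-1 training folds
    preds = model.predict(X[held_out_idx])     # evaluate on the held-out fold
    fold_scores.append(accuracy_score(y[held_out_idx], preds))

print(np.mean(fold_scores))  # mean performance across the k held-out folds
```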

To overcome this bias, you can use nested cross-validation (double cross-validation), which tries to optimize hyperparameters and select a model without overfitting to the training data. Here, the hyperparameter optimization is nested inside the k-fold cross-validation procedure used for model selection; since it involves two loops of cross-validation, it is also known as double cross-validation. Refer to this link to understand the difference between k-fold cross-validation and nested cross-validation in detail.
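A minimal sketch of the nested setup in scikit-learn (the estimator and grid below are placeholders): the hyperparameter search runs inside each outer fold, and the outer loop scores the whole tune-then-fit procedure on data it never touched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Inner loop: hyperparameter search (the model and grid are placeholders)
inner_search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=3,
)

# Outer loop: scores the whole "search, then fit" procedure on unseen folds
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(outer_scores.mean())
```

Note that the outer score estimates the performance of the procedure rather than of one fixed hyperparameter setting; the final model is typically obtained by rerunning the search on all of the training data.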

* Note: This is Part 2 of the 5-part series on “Avoiding Machine Learning Pitfalls: From a practitioner’s perspective”. Before you read this blog, please have a look at Part 1 to understand the different stages of the machine learning process.

Thank you for reading, and I appreciate your feedback. Stay tuned for the next part!
