Avoiding Machine Learning Pitfalls : From a practitioner’s perspective — Part 1

Abinaya Mahendiran
Published in WiCDS · 4 min read · Jun 28, 2022
Credit: Feodora Chiosea / iStock

This post looks at how the paper “How to avoid machine learning pitfalls: a guide for academic researchers” applies to applied research in industry. Although the paper is written for academics and researchers who are relatively new to Machine Learning (ML), it works equally well as a guide for ML practitioners in general, helping them avoid common mistakes when using machine learning techniques.

Stages of the Machine Learning process:
The machine learning process broadly fits into the following five stages:
— Stage 1: What to do before you start building your models
— Stage 2: How to reliably build models
— Stage 3: How to robustly evaluate models
— Stage 4: How to compare models fairly
— Stage 5: How to report your results

Stage 1: Before you start to build models
It is never a good idea to rush into building models without understanding the problem at hand and the goal that needs to be accomplished. Here are some of the factors that need to be addressed before you start working on a solution:

i. Before you start building models, the first step is to understand the business use case. In many real-world scenarios, the problem can be solved without ML, so ML might not be what the problem needs. Do not give in to the hype and blindly reach for an ML solution. The ultimate aim is to solve the business problem in the best way possible, and that doesn’t necessarily involve ML. Don’t hesitate to build a non-ML solution.
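
As a rough illustration of trying a non-ML solution first, here is a minimal sketch that scores a hand-written rule on a handful of labelled examples. The fraud use case, the field names, the thresholds, and the toy records are all hypothetical and invented for this example; they are not from the paper.

```python
def rule_based_flag(transaction):
    """Hypothetical hand-written rule: flag large transactions from new accounts."""
    return transaction["amount"] > 1000 and transaction["account_age_days"] < 30


def accuracy(predictions, targets):
    """Fraction of predictions that match the known labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy labelled records, invented for illustration only.
transactions = [
    {"amount": 1500, "account_age_days": 5, "is_fraud": True},
    {"amount": 200, "account_age_days": 400, "is_fraud": False},
    {"amount": 3000, "account_age_days": 10, "is_fraud": True},
    {"amount": 1200, "account_age_days": 900, "is_fraud": False},
    {"amount": 50, "account_age_days": 2, "is_fraud": False},
]

targets = [t["is_fraud"] for t in transactions]
predictions = [rule_based_flag(t) for t in transactions]
rule_accuracy = accuracy(predictions, targets)
print(f"Rule-based accuracy: {rule_accuracy:.2f}")
```

If a rule like this already meets the business requirement, an ML model may add cost without adding value. If it does not, its score still gives you a baseline that any ML model has to beat.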

ii. Once the use case is clear, take your time to find the right data, and if the data is available, spend enough time to understand it. Check whether the data source is reliable and the quality of the data is good. It’s good to understand the limitations of the data before starting to think about solutions, because if you feed garbage to the model, you’ll get garbage! You should ideally spend more time collecting quality data for the problem you’re trying to solve. Data-centric AI is the need of the hour.
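
A first pass at judging data quality can be as simple as counting missing values and duplicate rows. The sketch below assumes the data is a list of dictionaries; the function name `quality_report`, the field names, and the toy records are hypothetical, and a real project would likely add checks specific to its domain.

```python
from collections import Counter


def quality_report(rows, required_fields):
    """Summarize missing values and exact duplicate rows in a list of dicts."""
    missing = Counter()
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            # Treat None and empty strings as missing; 0 is a valid value.
            if value is None or value == "":
                missing[field] += 1

    # Rows are compared on the required fields only.
    seen = Counter(tuple(row.get(f) for f in required_fields) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "n_rows": len(rows),
        "missing_per_field": {f: missing[f] for f in required_fields},
        "duplicate_rows": duplicates,
    }


# Toy records, invented for illustration only.
rows = [
    {"age": 34, "city": "Chennai", "label": 1},
    {"age": None, "city": "Pune", "label": 0},
    {"age": 34, "city": "Chennai", "label": 1},
    {"age": 51, "city": "", "label": 0},
]

report = quality_report(rows, ["age", "city", "label"])
print(report)
```

A report like this will not tell you whether the source is reliable, but it quickly surfaces gaps that are worth raising with whoever owns the data.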

iii. Never look at all the data; explore only the portion set aside for training. While exploring, we make assumptions, gain insights, and identify patterns that guide the modelling process. If those come from the whole dataset (including the test data), untestable assumptions leak into the model, and the test set can no longer tell you whether the model generalizes. So set the test data aside first and gain insights from the train data alone. Avoid data leakage at all costs.
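
One way to put this into practice is to split off the test set before any exploration happens. The following is a minimal sketch using only the standard library; the helper `holdout_split`, the 20% test fraction, the seed, and the toy records are assumptions made for illustration.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split off a held-out test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records, invented for illustration only.
records = [{"id": i, "value": i * 1.5} for i in range(100)]

train, test = holdout_split(records)

# Explore and compute statistics on the training portion only.
train_mean = sum(r["value"] for r in train) / len(train)
print(f"train size: {len(train)}, test size: {len(test)}")
print(f"mean of 'value' on train only: {train_mean:.2f}")
```

Any statistic you later use for preprocessing (means, vocabularies, scaling ranges) should likewise be computed on `train` and only applied to `test`.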

iv. Make sure there is enough data to start building solutions, since the model complexity you can afford and how well the model generalizes both depend on the amount of data available. If data is scarce or imbalanced, find ways to make the most of it (cross-validation), to increase it (data augmentation, weak supervision), or to reduce the need for labelled data (zero-shot learning, etc.). How much data is enough? That depends entirely on the problem.
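
When data is scarce and imbalanced, cross-validation lets every example serve for both training and validation, and stratifying the folds keeps the rare class present in each one. Below is a minimal standard-library sketch; the helper `stratified_folds`, the 90/10 class split, and the choice of five folds are hypothetical choices for illustration, not a recommendation from the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each example index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a share of every class.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced labels, invented for illustration: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
print("overall class counts:", Counter(labels))

folds = stratified_folds(labels, k=5)
for number, fold in enumerate(folds):
    counts = Counter(labels[i] for i in fold)
    print(f"fold {number}: size={len(fold)}, class counts={dict(counts)}")
```

Each fold is used once for validation while the remaining folds are used for training, so the rare class is always represented on both sides of the split.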

v. Talking to subject matter experts (SMEs) to understand the usefulness of the problem at hand is crucial. SMEs can provide information on the feature set and on methods that have worked well in the past for the problem you’re trying to solve. Asking the right questions is of utmost importance, so gather all the information you can from the SMEs.

vi. It is always important to do a literature survey to understand past work, its limitations and gaps (what has worked well and what has not), and the potential scope for improvement. This lets you build on top of existing work instead of starting from scratch. Reinventing the wheel is rarely a good idea when you’re solving a business problem, unless your goal is to learn something from scratch to expand your knowledge.

vii. Before you even think about a solution, it is good practice to think about model deployment. Answering “Why do you want to build an ML model for the given problem?” will help streamline your model building process. Choices such as the complexity of the model, the inference requirements, and how the user will interact with it depend on the resources available. Planning ahead helps avoid last-minute delays in deployment. MLOps can be leveraged here.

** Note: This is Part 1 of a 5-part series on “Avoiding Machine Learning Pitfalls : From a practitioner’s perspective”. Thank you for reading; I appreciate your feedback. Stay tuned for the next part!
