Understand the domain and data before starting a machine learning project

Overfitted Cat
Jan 2, 2022


In the last blog post, we saw a common failure of a machine learning project. Both sides (John and the stakeholders) failed on every single level. The project was doomed. But that’s ok. There is no shame in it at all. It is all part of learning and life. It was an experience that will make John a better person and a better expert. This blog post presents some points that John can improve on the next time he starts a new project.

Off-topic, fun fact: as I was writing this blog post, a song was playing in the background.

Don’t be the problem, be the solution
Don’t be the problem, be the solution
Don’t be the problem, be the solution
Problem, problem, problem, problem

Faith without works is
Talk without works is
Faith without works is
Dead, dead, dead, dead

- TalkTalk, A Perfect Circle

Let’s see how John can be the solution next time, on a new project.

Understand WHY we need Machine Learning

Sad as it may be, we sometimes overengineer a solution by throwing everything together. Sometimes we start working on something without asking a simple WHY. The first step is understanding the domain. What is the purpose of the software? Who is going to use it? How can machine learning help? Do we have any data? Are there any privacy and security concerns? How tolerable are mistakes? What are the main challenges that machine learning needs to solve?

Sometimes, machine learning isn’t cut out for the project at hand. Maybe a rule-based system would work fine as a baseline and could even be developed much faster. Be aware of your ignorance. Don’t assume, don’t jump to conclusions; ask and listen. I bet most people like to be listened to and understood before any action is taken. If you show a little empathy, they will be nicer to you as well.
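
To make the baseline idea concrete, here is a minimal sketch that compares a keyword rule against a majority-class guess. Everything in it is an assumption for illustration: the ticket data, the `URGENT_KEYWORDS` list, and the function names are hypothetical and are not taken from John’s project.

```python
from collections import Counter

# Hypothetical toy data: (ticket text, is_urgent label). Invented for illustration.
TICKETS = [
    ("Server is down, production outage", True),
    ("How do I change my avatar?", False),
    ("Payment failed and customers cannot check out", True),
    ("Feature request: dark mode", False),
    ("Cannot log in, urgent", True),
    ("Typo on the pricing page", False),
    ("I cannot find the dark mode setting", False),
]

# Hypothetical hand-written rule: a ticket is urgent if it contains any keyword.
URGENT_KEYWORDS = ("down", "outage", "failed", "urgent", "cannot")


def rule_based_is_urgent(text):
    """Return True when the lowercased text contains any urgent keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in URGENT_KEYWORDS)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


labels = [label for _, label in TICKETS]

# Baseline 0: always predict the most common label.
majority_label = Counter(labels).most_common(1)[0][0]
majority_acc = accuracy([majority_label] * len(labels), labels)

# Baseline 1: the keyword rule.
rule_acc = accuracy([rule_based_is_urgent(text) for text, _ in TICKETS], labels)

print(f"majority-class accuracy: {majority_acc:.2f}")
print(f"rule-based accuracy:     {rule_acc:.2f}")
```

The numbers from a toy set like this mean little on their own, since the rules are scored on the same examples they were written against. The point is that any later machine learning model has something cheap to beat.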

Working on machine learning is fun, but sometimes it is ok to step back and see the bigger picture. Maybe the company you are working for is not yet ready for machine learning. Maybe they first need a strategy and a road map for data-based projects. If all you have is a hammer, everything looks like a nail. Don’t be the hammer, but also, don’t be a nail :)

Understand data

Understanding the data is tightly coupled with understanding the domain. Many would argue that data is more crucial than the algorithm. Data that is ready to use opens the door to all kinds of possibilities to bring value. The first of these is analytics. Humans see what they want to see. We are all biased creatures auto-piloted by beliefs and emotions. With analytics, we can work toward understanding our customers better. We can figure out how to make things better for everyone. Data analytics is an opener to machine learning projects. Without it, we will wander in the dark.

Here are some questions you might want to answer first. How is the data collected? How many labeled data points are there? Is the data representative of the problem that you want to solve?
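
A first pass at these questions can be as simple as counting. The sketch below assumes a hypothetical list of records with `source` and `label` fields, where `label` is `None` for examples nobody has annotated yet. The field names and the data are invented for illustration.

```python
from collections import Counter

# Hypothetical records; label is None when the example has not been annotated.
records = [
    {"source": "web", "label": "spam"},
    {"source": "web", "label": "ham"},
    {"source": "web", "label": "ham"},
    {"source": "mobile", "label": None},
    {"source": "web", "label": "ham"},
    {"source": "mobile", "label": None},
]


def summarize(records):
    """Count labeled examples, label balance, and where the data comes from."""
    labeled = [r for r in records if r["label"] is not None]
    return {
        "total": len(records),
        "labeled": len(labeled),
        "label_counts": Counter(r["label"] for r in labeled),
        "source_counts": Counter(r["source"] for r in records),
        "labeled_source_counts": Counter(r["source"] for r in labeled),
    }


summary = summarize(records)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample, a third of the records come from mobile, yet none of them are labeled. A gap like that is a hint that the labeled set may not represent the problem you actually want to solve.
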
What if there is no data at all? Should we give up? Should we wait a month to collect data? Of course not. There is much to be done and prepared before a serious ML project. Often, there won’t be any free labeled data set or pre-trained model you can use for your project. At times like this, good communication with stakeholders is your best friend. You can work together and develop a strategy for data collection, data labeling, and algorithm improvements. Maybe a basic algorithm would serve well as a starting point. Set up milestones and think about how each one will contribute to the project’s goals. You got this.

Endnote

In this blog post, we explored how understanding the domain and data can help John. In the next one, we will explore more topics on improving the process of any machine learning project.
Sometimes you may find it hard to talk to people, as if they don’t want to understand you at all. Despite this, try to put yourself in their shoes and understand them. The change should start with you. Try to listen and understand, and they will listen to you. Don’t be the problem, be the solution.

You can find all blogs in the series:

  1. Yet another unhealthy machine learning project
  2. Understand the domain and data before starting a machine learning project
  3. The evaluation metrics and error analysis in ML projects
