Data-centric approach

Published in Nerd For Tech

3 min read · Jan 7, 2022

In machine learning problems, we usually focus on improving the performance of our algorithm on a fixed dataset. While doing that, we rarely ask whether our training data is really good enough for us to achieve those results.

Even most of today's ML research takes the algorithm-centric approach. But if we are not working with proper data, our efforts to improve the model will be futile.

We usually do data preprocessing before working on the problem, but is that really enough?

The data-centric approach focuses on the data rather than on the algorithm. It keeps the algorithm the same while we try to improve the data. That tends to produce more accurate results over time.
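To make the idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy one-dimensional data, the nearest-centroid classifier standing in for "our algorithm", and label correction as the way of improving the data. The algorithm is never touched; only one wrong training label is fixed.

```python
def fit_centroids(points, labels):
    """The fixed 'algorithm': store the mean of each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def accuracy(centroids, points, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(points, labels))
    return hits / len(points)


# Hypothetical toy data: class 0 sits around 2, class 1 sits around 8.
train_points = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
noisy_labels = [0, 0, 0, 0, 1, 1]  # the point 7.0 is mislabeled as class 0
clean_labels = [0, 0, 0, 1, 1, 1]  # same points after fixing that one label

val_points = [1.5, 2.5, 5.5, 8.5]
val_labels = [0, 0, 1, 1]

# Same algorithm both times; only the training labels differ.
noisy_acc = accuracy(fit_centroids(train_points, noisy_labels), val_points, val_labels)
clean_acc = accuracy(fit_centroids(train_points, clean_labels), val_points, val_labels)

print("accuracy with noisy labels:", noisy_acc)
print("accuracy with clean labels:", clean_acc)
```

On this toy data the mislabeled point pulls the class 0 centroid towards class 1, so a borderline validation point gets misclassified. Fixing the label is enough to recover it, with no change to the model.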

How can we shift to a data-centric approach?

The data-centric approach is widely used in industry today. To work with it, we have to ask ourselves just these questions:

What kind of data are we working with?
What should be our baseline while building the ML model?
Is the ML algorithm really gonna work on our problem?
How can we do a sanity check on our model?

In the rest of this blog, we are gonna discuss the questions above.

What kind of data might we work with?

Mainly, there are two types of data: structured and unstructured. Structured data is tabular or record-like data (e.g. Excel sheets, CSV files, JSON files), and on this kind of data machines are really good at producing results.

On the contrary, when it comes to unstructured data such as images and audio, humans are better than machines at predicting what the data is about.

What should be our baseline while building ML models?

As discussed above, if we know our data is structured, then some other model can be our baseline. But if we are working with unstructured data, then human-level performance should be our baseline.
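Here is a small sketch of what comparing against a baseline could look like. The names and numbers are all hypothetical: for structured data it assumes a very simple "always predict the most common label" model as the other model, and for unstructured data it assumes a human-level accuracy that we measured ourselves.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == majority_label for label in test_labels)
    return hits / len(test_labels)


def beats_baseline(model_accuracy, baseline_accuracy):
    """True if the model is strictly better than the chosen baseline."""
    return model_accuracy > baseline_accuracy


# Structured data: the baseline is another (very simple) model.
train_labels = ["ok"] * 8 + ["defect"] * 2  # hypothetical labels
test_labels = ["ok"] * 4 + ["defect"] * 1   # hypothetical labels
baseline_accuracy = majority_baseline_accuracy(train_labels, test_labels)

# Unstructured data: the baseline is human-level performance.
human_level_accuracy = 0.95  # hypothetical, e.g. from a small labeling study
model_accuracy = 0.90        # hypothetical accuracy of our trained model

print("beats the simple model:", beats_baseline(model_accuracy, baseline_accuracy))
print("beats human level:", beats_baseline(model_accuracy, human_level_accuracy))
```

If the model cannot beat even the simple baseline, or is still far from human level, that is a hint to look at the data before tuning the algorithm further.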

How can we find out if the ML algorithm is gonna work on our problem?

To find out if the ML algorithm is gonna work on our problem, we should do a sanity test first. Let's see how we can do one:

For structured data, we can try to overfit a small training set before training on the large one, just to see if the model works fine on a small dataset before moving to a large one (this saves time).
For speech recognition, you can try to overfit a single audio clip. If the algorithm does not do fine on a single audio clip, then our effort on a large amount of data will be futile (here we have to improve the quality of our data).
For image recognition, just see if it can at least work on one image. If it does, then we can try to make it work on multiple images.
For classification, we can train the algorithm on a small subset of 10 or 100 images. If it does not work on those, it will not be helpful on other images or on large data.
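The "overfit a tiny subset first" check can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the four toy samples, the small logistic regression trained with gradient descent, and the 100% threshold are all made up here and are not a fixed recipe.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, epochs=2000, lr=0.5):
    """Full-batch gradient descent for a tiny logistic regression."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def training_accuracy(weights, bias, samples, labels):
    hits = 0
    for x, y in zip(samples, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += (1 if score > 0 else 0) == y
    return hits / len(samples)


# Hypothetical tiny subset: four samples with two features each.
tiny_samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
tiny_labels = [0, 1, 1, 1]

weights, bias = train_logistic(tiny_samples, tiny_labels)
tiny_accuracy = training_accuracy(weights, bias, tiny_samples, tiny_labels)

print("accuracy on the tiny subset:", tiny_accuracy)
if tiny_accuracy < 1.0:
    print("Sanity check failed: look at the data and the pipeline before scaling up.")
```

The point is not the model itself. If even a tiny subset cannot be fitted, training on the full dataset is unlikely to help, and the data or the training code deserves a closer look first.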

Thanks for reading my blog :) Follow for more,
and have a good day 😃

Written by Som