High-quality data makes MLOps greater than the sum of its parts

Sanne Bouwman
Sogeti Data | Netherlands
6 min read · Jun 11, 2021
“The whole is greater than the sum of its parts” ~ Aristotle

Do you recognize the urge of organizations to apply AI to their business processes? Chances are your organization has already deployed some Machine Learning (ML) models in production. If so, there is little need to discuss the impact and benefits these models deliver. It is more interesting to look at how AI solutions are developed nowadays and what we can learn from that. So, let’s jump into it.

Myth or Fact: go big or go home

For a long time, let’s call it the Big Data Age, there was a general belief that only giant datasets would bring any value to the Machine and/or Deep Learning domain. Interestingly, this idea was mainly planted by some of the largest tech companies, which made a deliberate, strategic decision to tell the world that developing AI only works with big data at your disposal.

While this might be the case in some situations, it is definitely not always true. If, for example, your dataset is heavily diluted with ‘boring’ cases, adding more of them is not effective: your model will not learn much from these examples, so they bring no real benefit.

In addition, it is not only about size when we look at the heightened downstream impact of poor data practices in high-stakes domains. Poor data can have outsized effects on vulnerable communities and contexts, as we saw with the reduced accuracy of IBM’s cancer treatment AI and with Google Flu Trends missing the flu peak by 140%.

Fortunately, within the field of data science/engineering we realize more and more that we should no longer under-value and de-glamorize the data preparation aspect of AI.

From big data to good data: data-centric

Very recently, in “A Chat with Andrew on MLOps”, Andrew Ng also pointed out that, instead of always emphasizing model tuning, we need to shift our mindset from big data to quality data, and from model-centric to data-centric.

It is widely known that model building consumes approximately 20% of the time spent on AI development, whereas the remaining 80% is spent on pre- and post-processing steps. Paradoxically, 99% of AI research is done on the modelling part. The way ML has developed over the last few decades has been driven by academia, which focused on improving algorithms on benchmarks/public datasets. In industry, the workflow is often the opposite of academia: instead of the model being the contribution, you search for popular models on Git repositories that fit your collected, relevant data. Then you start curating your dataset and tweaking the model until the two play well together.

The well-known 80/20 rule in AI development

Why does a data-centric approach matter so much?

Andrew and his team conducted two experiments to highlight the difference between model-centric and data-centric approaches on different (computer vision) Machine Learning projects. They clearly show that a model-centric approach is limited and barely improves the metrics compared to the baseline model. The data-centric approach, on the other hand, shows that better performance can be achieved with only a small increase in quality data.

Experiment 1 & 2

Another experiment highlights the difference between using clean data vs. noisy data. Here, the noisy data is a dataset used for automotive component classification in which the bounding boxes are labeled slightly inconsistently. Say you start off with 500 examples of the noisy dataset and you want to reach 0.6 accuracy. You could either nearly triple the number of (noisy) training examples, or you could clean up the dataset to reach a similar level of accuracy. The latter is often the better option, since more data is not always available and/or can be very costly to obtain.

Moreover, big data problems that display a long tail of rare events in the input data, as we see with self-driving cars, can also be considered small data problems.

This is exactly what Andrej Karpathy, director of AI at Tesla and a world-leading expert when it comes to machine learning and training neural nets, said in a podcast with Pieter Abbeel a few weeks back. In the development of the Tesla Autopilot system, the focus on data is the primary modeling approach, whereas model tuning is secondary. From his point of view, an ML model can be considered a software artifact “compiled” from data. He already described this idea a while back as Software 2.0.

How to improve data quality?

It goes without saying that the quality of an ML model, in terms of accuracy, fairness, and robustness, is often a reflection of the quality of the underlying data. But what exactly does this mean? When do we have bad quality, and how can we improve our datasets?

Ingest validation with unit tests for data
First of all, there are several things that can already go wrong at the ingestion phase. Schemas can have undergone changes, data can get delayed, data can contain unknown values, etc. To make sure that your data behaves as expected while ingesting into your (data) pipelines, it is common practice to test your data like you would unit test your code. There are some great open source libraries you could consider for your data validation, namely: Great Expectations, TensorFlow Data Validation (TFDV) (part of the TensorFlow Extended (TFX) suite) and Amazon’s Deequ.
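To make this concrete, here is a minimal sketch of such a “unit test for data” using the classic pandas-based Great Expectations API; the file name, columns, and thresholds are made up for this illustration, and newer releases of the library expose a different entry point:

```python
import great_expectations as ge
import pandas as pd

# Load a freshly ingested batch (hypothetical file and columns).
df = pd.read_csv("incoming_orders.csv")
batch = ge.from_pandas(df)

# Declare expectations, analogous to unit tests for code.
batch.expect_column_to_exist("order_id")
batch.expect_column_values_to_not_be_null("customer_id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
batch.expect_column_values_to_be_in_set("status", ["NEW", "SHIPPED", "CANCELLED"])

# Validate the batch and stop the pipeline if anything is off.
results = batch.validate()
if not results["success"]:
    raise ValueError(f"Data validation failed: {results}")
```

Running a check like this on every ingested batch turns silent schema drift or unexpected values into an explicit, early pipeline failure instead of a degraded model weeks later.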

Synthetic data in case of too little or imbalanced data
Secondly, we do not always have the right amount of data. This can either mean we have too little data in general, or that some classes are under-represented (imbalanced data). Moreover, collected data does not always contain every possible scenario. Generating synthetic data is a great way to mitigate these problems and to account for the missing rare scenarios. There are several advanced augmentation techniques that can help you improve your AI solution by artificially enlarging your dataset.
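As one widely used example of this idea, the sketch below oversamples the minority class of an imbalanced tabular dataset with SMOTE from the imbalanced-learn package; the dataset is synthetic and the class ratio is arbitrary, chosen only to illustrate the technique:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A toy, heavily imbalanced binary classification problem (~95% vs ~5%).
X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.95, 0.05], random_state=42
)
print("Before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority samples and their nearest neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After: ", Counter(y_resampled))  # classes are now balanced
```

For images, the same principle applies through augmentation (flips, crops, colour jitter) or fully simulated scenes, which is how rare scenarios are often covered in practice.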

Identifying label errors using confident learning
Then, there is always a certain percentage of your data that is inaccurate or mislabeled. It may come as a surprise, but even the most-cited datasets used to test ML systems contain such errors (check it out here).
To find possible label errors in your dataset you might want to try the cleanlab package. This package is a framework to identify label errors, characterize label noise, and learn with noisy labels, an approach known as confident learning.
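A minimal sketch of how this could look with the cleanlab 2.x API on a toy scikit-learn dataset; the model and dataset here are just placeholders, and the out-of-sample predicted probabilities come from cross-validation, which is what confident learning expects as input:

```python
from cleanlab.filter import find_label_issues
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy dataset standing in for your own (possibly mislabeled) data.
X, labels = load_digits(return_X_y=True)

# Out-of-sample predicted probabilities for every example.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1_000), X, labels,
    cv=5, method="predict_proba",
)

# Indices of the examples most likely to be mislabeled,
# ranked so the most suspicious ones come first.
issue_indices = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"Found {len(issue_indices)} potential label errors; inspect the top ones first.")
```

The flagged examples can then be reviewed and relabeled by hand, which is usually far cheaper than collecting new data.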

Examples of label errors that currently exist in the Amazon Reviews, MNIST, and QuickDraw datasets, identified using confident learning.

Conclusion — DataOps, a must-have for MLOps

As MLOps aims to understand, measure, and improve the quality of ML models, it is not surprising to see that data quality needs to play a prominent and central role in MLOps. Hence, MLOps goes hand in hand with a good DataOps practice to address all the above-mentioned quality points.
