Published in Geek Culture

Data-Centric AI: Why It Is the Future of AI

Data quality is finally getting the limelight it deserves; here’s why.

Photo by Fabio on Unsplash

Andrew Ng’s name is going down in history for an idea that is reshaping AI development. We know him as the co-founder and former head of Google Brain, the former chief scientist at Baidu, and one of the most prolific and respected figures in the AI community.

I mean, haven’t we all started learning machine learning and artificial intelligence, thanks to his courses?

Yet the movement he started in the AI community has shaken the foundations of how we thought AI should be developed.

Rather than staying focused on model performance, why not shift our attention towards improving the quality of the data we feed to AI algorithms?

This was not just a simple question posed to the community, for he had evidence to suggest this was precisely the need of the hour to propel AI development towards its next phase.

Shifting the Focus Towards the Data

The basis of Andrew’s argument is that data serves as food for AI. Just as a chef needs fine ingredients to prepare a masterpiece, an AI system needs an equally fine and refined dataset.

The focus now needs to shift from models towards data, i.e., rather than tuning the model or the algorithm, more work needs to go into the sorting, organisation, and refinement of data. The better the data an AI system is given, the better its performance.

Model iterations need to be limited in comparison to data work. According to Andrew, about 99% of papers published on AI research focus on the model, with a mere 1% talking about the data.

Limitations of the Model-Centric Approach

With a model-centric approach, there is only so much you can do to improve the overall performance of your system. After all, there will come a point when you have iterated on the model so many times that further changes, no matter how sophisticated, stop helping. The limitation then comes from the data, which is not good enough for the model to perform beyond a certain level.

How Is the Data-Centric Approach Different?

It all comes down to improving the data itself, particularly its consistency. This is achieved in the following ways:

  • Developers need to sort out the labelling: Different sections of data may be labelled differently, so that very similar sections end up with different names and entirely different sections end up with similar names.
  • Generating unique data that the model is unfamiliar with: This means familiarising the model with data different from the data used in training, while avoiding overstuffing the dataset. Similarly, noisy data needs to be removed to reduce variance and help the model generalise better to unseen data.
  • Selecting data sources carefully: This means eliminating unrelated data sources, being wary of how data is accumulated and stored, and applying consistent logic to combine the sources into a single structure, so that the machine learning models can be trained more effectively.
  • Feature Engineering: We can introduce new features that do not exist in the raw source data, which can significantly improve how the model processes the input data along with the labels/targets.
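To make the first point a little more concrete, here is a minimal sketch of a label audit in plain Python. It is only an illustration, not a method from Andrew’s talks: the `CANONICAL` mapping, the function names, and the tiny sample list are all hypothetical. It maps label variants to one canonical name, then flags inputs that still carry more than one label.

```python
from collections import defaultdict

# Hypothetical mapping from label variants to one canonical name.
CANONICAL = {
    "cat": "cat",
    "cats": "cat",
    "kitty": "cat",
    "dog": "dog",
    "doggo": "dog",
}


def normalise_label(raw):
    """Trim, lower-case, and map a raw label to its canonical form."""
    key = raw.strip().lower()
    # Unknown labels are kept as-is (lower-cased) so they are not silently lost.
    return CANONICAL.get(key, key)


def find_conflicts(samples):
    """Return inputs that appear with more than one canonical label."""
    labels_by_input = defaultdict(set)
    for text, raw_label in samples:
        labels_by_input[text.strip().lower()].add(normalise_label(raw_label))
    return {
        text: sorted(labels)
        for text, labels in labels_by_input.items()
        if len(labels) > 1
    }


# Made-up samples: (input, label as written by an annotator).
samples = [
    ("a small tabby on the sofa", "Cat"),
    ("a small tabby on the sofa", "kitty"),  # same meaning, different name
    ("a puppy chasing a ball", "dog"),
    ("a puppy chasing a ball", "cat"),       # genuine disagreement
]

conflicts = find_conflicts(samples)
print(conflicts)
```

In this toy example the “Cat”/“kitty” pair collapses into one label after normalisation, so only the puppy sample is reported as a real conflict that a person needs to resolve.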

The data-centric development of AI isn’t a one-time process that is completed once the system has been launched. Instead, the model is run on the data it is given, and its results are analysed.

We can then analyse the errors and make further improvements, not to the model, but to the data! This will naturally improve the system. A one-time “pre-processing” step is not enough to enhance the data.
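One simple way to picture this loop is to group the model’s mistakes by a slice of the data and see where they cluster. The sketch below is an assumption-laden illustration: the slice names (“daylight”, “night”), the records, and the function names are invented, and a real project would read predictions from its own evaluation run.

```python
from collections import Counter


def error_rate_by_slice(records):
    """Compute the error rate per slice.

    records: iterable of (slice_name, true_label, predicted_label).
    """
    totals, errors = Counter(), Counter()
    for slice_name, truth, pred in records:
        totals[slice_name] += 1
        if truth != pred:
            errors[slice_name] += 1
    return {name: errors[name] / totals[name] for name in totals}


def worst_slice(records):
    """Return the slice with the highest error rate."""
    rates = error_rate_by_slice(records)
    return max(rates, key=rates.get)


# Made-up evaluation results: (slice, true label, predicted label).
records = [
    ("daylight", "car", "car"),
    ("daylight", "car", "car"),
    ("daylight", "truck", "truck"),
    ("daylight", "truck", "car"),
    ("night", "car", "truck"),
    ("night", "truck", "car"),
    ("night", "car", "car"),
    ("night", "truck", "car"),
]

rates = error_rate_by_slice(records)
print(rates)                 # error rate for each slice
print(worst_slice(records))  # the slice whose data deserves attention first
```

Here the errors cluster in the hypothetical “night” slice, which suggests the next iteration should go into collecting or relabelling night-time data rather than into changing the model.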

Final Thoughts

It all keeps coming back to improving and refining the data, which is not to be confused with collecting more data. More data can mean more noise, which means less precision.

Better, consistent, and more precise labelling of data is the key. The better the quality of the data, the less need there is for an excessive number of samples. So even if only a limited amount of data is available, more meaningful data will help the system learn faster and better.
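As a rough sketch of what “consistent labelling” can look like in practice, the snippet below keeps only the samples on which several annotators largely agree and sets the rest aside for review. The annotation data, the 0.75 threshold, and the function name are hypothetical choices for illustration, not a recommended standard.

```python
from collections import Counter


def consensus(votes, min_agreement=0.75):
    """Return (label, agreement) for a list of annotator votes.

    The label is None when the most common vote does not reach
    the min_agreement share of all votes.
    """
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement


# Made-up annotations: sample id -> labels given by four annotators.
annotations = {
    "img_001": ["scratch", "scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch", "dent"],
    "img_003": ["dent", "dent", "dent", "scratch"],
}

clean, needs_review = {}, []
for sample_id, votes in annotations.items():
    label, agreement = consensus(votes)
    if label is None:
        needs_review.append(sample_id)  # ambiguous: send back to annotators
    else:
        clean[sample_id] = label

print(clean)
print(needs_review)
```

The resulting dataset is smaller, but every label in it is one the annotators agreed on, which is the trade the data-centric approach is happy to make.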

Like you, I, too, am in the process of learning more about and adopting the data-centric approach. As a first step, I’ve helped collate an open-source GitHub repository with all sorts of resources I’ve come across on data-centric AI. I’ve also joined the Data-Centric AI Community, and I’m hoping to learn with like-minded peers and experts in the industry.

As always, I’ll be happily sharing my journey over here, and I’m hopeful it helps you accelerate your learning journey too.

For more helpful insights on breaking into data science, honest experiences, and learnings, consider joining my private list of email friends.




Arunn Thevapalan

Senior Data Scientist & Top 1000 Writer. I write about my journey breaking into data science and building profitable side-hustles —
