Data-Centric AI: Why It Is the Future of AI
The aspect of data quality is finally getting the limelight it deserves; here’s why.
Andrew Ng is going down in history for an idea that has reshaped AI development. We know him as the former head of Google Brain, former chief scientist at Baidu, and one of the most prolific and respected figures in the AI community.
I mean, haven’t we all started learning machine learning and artificial intelligence, thanks to his courses?
Yet the movement he started has shaken a long-held assumption about how AI should be developed.
Rather than staying focused on model performance, why not shift our attention towards improving the quality of the data we feed to AI algorithms?
This was not just a simple question posed to the community, for he had evidence to suggest this was precisely the need of the hour to propel AI development towards its next phase.
Shifting the Focus Towards The Data
The basis of Andrew’s argument is that data is to AI what ingredients are to a chef: just as a chef needs fine ingredients to prepare a masterpiece course, an AI system needs an equally refined dataset.
The focus now needs to shift from models towards data: rather than tweaking the architecture of the algorithm, more work needs to go into the sorting, organisation, and refinement of the data. The better the quality of the data an AI is given, the better its performance.
Model iterations need to be limited in comparison with data work. According to Andrew, about 99% of published AI research papers focus on the model, with a mere 1% talking about the data.
Limitations of the Model-Centric Approach
With a model-centric approach, there is only so much you can do to improve the overall performance of your program. At some point, no matter how sophisticated the changes, further iterations will saturate the model. The limitation comes from the data, which will not suffice for the model to perform beyond a certain level.
How Is the Data-Centric Approach Different?
It all comes down to improving how the data is handled, particularly improving its consistency. This is achieved in the following ways:
- Fixing inconsistent labelling: different portions of the data may be labelled inconsistently, with very similar samples carrying different labels and entirely different samples carrying the same one. Developers need to sort this out.
- Generating unique data the model is unfamiliar with: exposing the model to data that differs from its training set while avoiding overstuffing it. Similarly, noisy data needs to be removed to eliminate high variance and help the model generalise better to unprocessed data.
- Selecting data sources carefully: eliminating unrelated data sources, being wary of how data is accumulated and stored, and connecting the correct logic to create a single structure for better, more effective training of machine learning models.
- Feature engineering: introducing new features that do not exist in the raw source data can significantly improve how the input data, along with its labels/targets, is processed.
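The labelling point above can be sketched in a few lines of code. This is a minimal, hypothetical example (the dataset and column names are made up for illustration) that flags identical inputs carrying conflicting labels, which is exactly the kind of inconsistency a data-centric review pass should surface:

```python
import pandas as pd

# Hypothetical labelled dataset where two annotators disagree
# on how to label the same input.
df = pd.DataFrame({
    "text": ["deep crack", "deep crack", "shallow scratch", "paint chip"],
    "label": ["defect", "scratch", "scratch", "chip"],
})

# Flag identical inputs that carry more than one distinct label.
conflicts = (
    df.groupby("text")["label"]
      .nunique()
      .loc[lambda s: s > 1]
)
print(conflicts.index.tolist())
```

In a real project the grouping key would be a near-duplicate match rather than exact string equality, but the idea is the same: surface the disagreements, then have annotators resolve them with a single, consistent convention.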
Data-centric development of AI isn’t a one-time process completed once the system has launched. Instead, the model processes the data it is given, and we analyse the results.
We can then analyse the errors and make further improvements, not in the model, but in the data! This naturally improves the system. Data cannot be enhanced through one-time “pre-processing”.
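As a rough illustration of this loop, the sketch below keeps the model fixed and iterates only on the data. It deliberately flips some training labels to simulate labelling noise, then relabels a batch of them each round as a stand-in for a human error-analysis pass. The dataset, model choice, and noise levels are all assumptions made for the sake of the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real task; all numbers are illustrative.
X, y_clean = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_clean, random_state=0)

# Simulate labelling noise: flip 60 of the training labels.
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(y_tr), size=60, replace=False)
y_noisy = y_tr.copy()
y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]

model = LogisticRegression(max_iter=1000)
accuracies = []
for rnd in range(3):
    model.fit(X_tr, y_noisy)          # same model every round
    accuracies.append(model.score(X_te, y_te))
    # "Error analysis": relabel a batch of the known-bad examples,
    # standing in for a human review of suspicious samples.
    fix = noisy_idx[rnd * 20:(rnd + 1) * 20]
    y_noisy[fix] = y_tr[fix]

print([round(a, 3) for a in accuracies])
```

The point of the sketch is the shape of the loop, not the numbers: each round, the model stays the same and only the labels improve, which is the data-centric iteration described above.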
Final Thoughts
It all keeps coming back to the improvement and refinement of data, which is not to be confused with more data. More data means more noise, which means less precision.
Better, more consistent, and more precise labelling of data is the key. The higher the quality of the data, the less the need for excessive data samples. So even if only a limited amount of data is available, if it is more meaningful, it will help the system learn faster and better.
Like you, I, too, am in the process of learning and adopting the data-centric approach. As a first step, I’ve helped collate an open-source GitHub repository with all sorts of resources I’ve come across on data-centric AI. And I’ve joined the Data-Centric AI Community, hoping to learn with like-minded peers and experts in the industry.
As always, I’ll be happily sharing my journey over here, and I’m hopeful it helps you accelerate your learning journey too.
For more helpful insights on breaking into data science, honest experiences, and learnings, consider joining my private list of email friends.