Is Data-centric AI the new way to fuel AI performance?

Shantanu Chandra
Published in AI FUSION LABS
Nov 14, 2022 · 8 min read

In September 2021, Snorkel.ai hosted “The Future of Data-Centric AI — 2021” summit, bringing together data-centric AI experts from academia and industry to explore the transition from a model-centric practice to a data-centric approach for building and deploying AI applications.

In this article, we will investigate the question “What is data-centric AI?” and detail the three principles that drive this field at its core. We then see how data-centric AI is shifting the limelight away from the traditional model-centric approach when it comes to fueling AI performance in the wild and driving impact, thereby setting the “Improve data, improve accuracy” movement in motion. This movement argues that data-centric AI drives exceptional AI performance by:

1. Spending time and resources on a fruitful data curation process rather than on inefficient model-tuning activities over sub-optimal data.

2. Reducing noise and biases in the data, which results in vastly cleaner datasets for AI systems to build upon for reliable real-world decision-making.

3. Carefully curating datasets via collaboration between subject matter experts (SMEs) and data scientists. The resulting data is infused with strong prior world knowledge that drives rich learning for AI systems, as they spend less time re-inferring the known and more time learning hidden patterns.

What is Data-Centric AI?

The term data-centric AI sounds tautological, given that everything we do in AI has data at its core. However, there is a significant difference between this new paradigm and the traditional way of AI/ML development. Historically, the focus in advancing the field of AI/ML has been on developing better and more complex models that work on underlying static data, which is typically treated as independent from the model-development process (for instance, a set of .csv files or images prepared and downloaded once at the beginning of the process). As a result, the vast majority of time and effort is spent iteratively making changes to the model to optimize performance on this fixed, static data.

The shift from this traditional “model-centric AI” to “data-centric AI” development advocates a fundamental shift in the focus of the ML community, rather than a technological or methodological shift (Hajij et al., 2021; Whang et al., 2022). It essentially means treating data as a more central figure in the AI development process. The need for this realignment of focus stems from the fact that, in the space of AI applications and deployment, data is becoming the key differentiator (and bottleneck) for performance, as models keep getting increasingly standardized (transformer architectures now solve tasks across all modalities), push-button (e.g., pre-trained models are readily available), and data-hungry (billions of parameters require exponentially bigger datasets to train on). That said, this does not imply an either-or scenario, since successful AI development requires iterating on both models and data. Data-centric AI is rather a shift in the fundamental focus of how to develop and deploy impactful AI applications going forward.

Principles of Data-Centric AI

The founding principles of data-centric AI are premised on recent developments in the field and on what it will take for AI researchers and developers to move the field forward:

1. AI development today centers around data

2. Data-centric AI needs to be programmatic

3. Data-centric AI needs to include subject matter experts in the loop

To understand this new synergy between data and AI systems, we will delve deeper into these pillars in the subsequent sections, highlighting how recent developments are driving these changes and how data-centric AI is accelerating real-world AI performance via the “Improve data, improve accuracy” movement.

1. AI development today centers around data

In a typical model-centric ML development process, we spend most of our time iterating over processes such as feature engineering (selecting/curating data attributes for the model to learn from), model architecture design (designing the parameters and data flow of the model), and training algorithm design (choosing the right training paradigm and loss functions). These processes of AI development are still a focus of the majority of AI research (and rightly so); however, a few recent trends have steered the focus to a more data-centric approach, namely:

1. Today’s deep learning models are getting increasingly powerful and push-button: they take raw data and learn from it autonomously, without pre-processed, curated features. This eliminates the time and effort spent on custom feature engineering to aid model learning. However, it also makes them far more data-hungry, requiring huge volumes of data to learn from such raw features.

2. These model architectures are getting increasingly convergent, i.e., an ever-wider variety of tasks and data modalities is being handled by an increasingly small and stable set of model architectures (e.g., transformers). This is slowly but surely eliminating the effort required to develop novel architectures for each task and data modality.

3. Models today are far more accessible as a result of incredible community and open-source company efforts. At the same time, however, these models are also far less practically modifiable, owing to deep, complex black-box architectures (you never know exactly how the third layer of multi-head attention is affecting your output).

What “Improve data, improve accuracy” says:

ML development today is forced to focus its attention on operations related to the training data (collection, labelling, augmentation, slicing, management, etc.). This approach dictates spending more time curating the right dataset than designing the model, ensuring AI systems are built on quality data that clearly conveys what the AI must learn. It also reduces the unnecessary time and resources spent on futile model fine-tuning exercises over inconsistent data.
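As a minimal sketch of what this data-first iteration loop can look like (the slice names and records below are hypothetical, not from any real dataset), one common practice is to measure model accuracy per data slice, so that curation effort goes to the slices where the data, rather than the model, is the weak point:

```python
# Sketch: per-slice accuracy to locate where the *data* needs work.
from collections import defaultdict

def accuracy_by_slice(examples):
    """examples: list of dicts with 'slice', 'label', and 'prediction' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["slice"]] += 1
        if ex["label"] == ex["prediction"]:
            correct[ex["slice"]] += 1
    # Accuracy per slice; low-scoring slices flag where to curate more data.
    return {s: correct[s] / total[s] for s in total}

# Illustrative model outputs on two hypothetical slices of a dataset.
examples = [
    {"slice": "short_docs", "label": 1, "prediction": 1},
    {"slice": "short_docs", "label": 0, "prediction": 0},
    {"slice": "long_docs", "label": 1, "prediction": 0},
    {"slice": "long_docs", "label": 0, "prediction": 0},
]
print(accuracy_by_slice(examples))
```

Here the model struggles on the "long_docs" slice, suggesting that collecting or re-labelling data for that slice is a higher-leverage move than further model tweaks.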

2. Data-centric AI needs to be programmatic

The development of data-hungry, large, state-of-the-art AI models is usually stalled by the need to label, curate, and manage training data. This often requires employing armies of human annotators, sometimes for entire person-quarters or person-years, to label and manage the data so that it is ready for ML development.

Data is increasingly becoming the new bottleneck in AI development. We can see clear signs of this: minor modifications to the model may take less than five person-days yet yield <1% improvement in performance, whereas manually improving the quality of the training data, by carefully curating it and adding a few more data points, can take 8–9 person-months but lead to a >10x improvement relative to the choice of model architecture. This underscores the theme that the highest-leverage point in many applications is the quantity and quality of the data. But quality data is extremely difficult to obtain for the majority of real-world use cases where AI can really make an impact. Thus, manual labelling and curation is often a non-starter for these use cases, even in large organizations, since the datasets:

1. Require significant subject matter expertise: You need financial analysts, doctors, and network technicians, specially trained for the task, to do the data labelling.

2. Are highly private: Real-world datasets are usually bound by strong legal and security restrictions and hence cannot simply be shipped to a third party for labelling.

3. Have rapidly changing objectives: Both the data distribution and the downstream model objective change multiple times, which may require the data to be re-labelled constantly.

Solving these critical challenges while dealing with massive piles of manually labelled training data is a practical nightmare for organizations today. This is where programmatic labelling comes in. The idea is to raise the level of abstraction with which users interact with this new central point of development, i.e., the data. Instead of labelling, curating, slicing, and augmenting the data by hand, we do it programmatically (e.g., asking an SME to specify keywords, phrases, or pattern matchers instead of manually going through each document to label it by hand).
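As a minimal, hypothetical sketch of programmatic labelling (the keywords, label names, and simple majority-vote aggregation below are illustrative assumptions; weak-supervision tools such as Snorkel use more sophisticated label models), an SME’s heuristics can be written as small labelling functions instead of labelling every document by hand:

```python
# Sketch of programmatic labelling: each labelling function (LF) encodes
# an SME heuristic and either votes for a label or abstains. Documents get
# the majority label across the LFs that fired.
ABSTAIN, SPAM, NOT_SPAM = -1, 1, 0

def lf_contains_offer(doc: str) -> int:
    # Hypothetical SME heuristic: promotional language suggests spam.
    return SPAM if "limited offer" in doc.lower() else ABSTAIN

def lf_contains_invoice(doc: str) -> int:
    # Hypothetical SME heuristic: invoice references suggest legitimate mail.
    return NOT_SPAM if "invoice" in doc.lower() else ABSTAIN

def majority_label(doc: str, lfs) -> int:
    votes = [vote for vote in (lf(doc) for lf in lfs) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired; leave the document unlabelled
    return max(set(votes), key=votes.count)

lfs = [lf_contains_offer, lf_contains_invoice]
print(majority_label("Limited offer: act now!", lfs))  # 1 (SPAM)
```

The SME writes a handful of such functions once, and they label the entire corpus consistently; re-labelling after an objective change means editing functions, not redoing months of manual annotation.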

What “Improve data, improve accuracy” says:

Programmatic labelling significantly accelerates the crucial data curation phase while also ensuring consistency across the labels. It reduces the noise, and in some cases even the biases, in training labels, streamlining the data into correct and clear signals for learning. This has proven vastly beneficial for AI products delivering reliable solutions in the real world.

3. Data-centric AI needs to include subject matter experts in the loop

Model-centric development follows a “throw it over the wall” model: there is no interaction between the SMEs who usually label the data and the data scientists who work on the data “thrown over the wall” by the labellers. This is impractical, and even dangerous, for many real-world applications. Data-centric AI therefore aims to create a synchronous workflow between the two by allowing them to collaborate on common, central ground. The key idea is to take the SME’s knowledge and inject it directly into the model, rather than playing a game of 20,000 questions in which the model tries to re-infer features or heuristics the SME already knows.

What “Improve data, improve accuracy” says:

Collaboration between teams reduces development time in two ways. First, the teams can work in parallel and directly influence the data used by the AI system dynamically, rather than going back and forth in the development process. Second, the models now train on cleaner, curated data infused with strong prior knowledge from multiple teams, which reduces the time and resources required for the models to converge to the desired performance.

In brief, AI application development requires managing ever-changing data landscapes, iterative improvements (to both data and model), and seamless collaboration between domain experts and data scientists. Data-centric AI is built on this philosophy and is emerging as the primary driver of faster, more reliable AI performance for agile decision-making in the real world. The “Improve data, improve accuracy” movement is only getting started and is here to stay!

About the author: Shantanu is an AI Research Scientist at the AI Center of Excellence lab at ZS. He did his Bachelor’s in Computer Science Engineering and Master’s in Artificial Intelligence from the University of Amsterdam with his thesis at the intersection of geometric deep learning and NLP. His research areas include Graph Neural Networks (GNNs), NLP, multi-modal AI, deep generative models, and meta-learning.

References

Hajij, M., Zamzmi, G., et al. (2021). Data-Centric AI Requires Rethinking Data Notion.

Whang, S. E., Roh, Y., et al. (2022). Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective.
