Data-centric AI / Big data vs. Good data

Atlantbh
Atlantbh Engineering
5 min read · Jan 10, 2023

We can agree that AI is not a one-size-fits-all solution. However, for many companies, AI can provide significant benefits and help drive growth and success. Some potential benefits include the following:

  • Improved efficiency: AI can automate tasks and processes, saving time and resources for other activities.
  • Increased accuracy: AI can make more accurate predictions and decisions than humans, reducing the risk of errors.
  • Enhanced customer experience: AI can personalize customer interactions, providing a more tailored and seamless experience.
  • Cost savings: AI can help businesses reduce costs by automating tasks, improving efficiency, and reducing the need for human labor.
  • Competitive advantage: Companies that use AI can gain a competitive advantage over those that don’t by being able to analyze and act on data faster and more effectively.

The last decade has brought significant growth in the field of Data Science. However, much of AI’s potential value is still locked in many sectors, such as health care, manufacturing, and government technology. According to a study published by Accenture, 80% of all Proofs of Concept (PoCs) do not make it into production. What exactly is going wrong, and why is AI not as successful in the real world as it is in academic studies?

Model-centric vs. Data-centric approach

A strictly model-centric approach to AI (the one traditionally presented in academia) treats data as a static input. If a model’s performance is not as expected, AI/ML engineers will try to tune the model’s hyperparameters or even change the model, while the data stays untouched.

Subject Matter Experts do not play a vital role in the whole process of developing AI-based systems in a model-centric approach. Instead, AI/ML engineers usually make business-related decisions during that process.

As per Andrew Ng, an AI system is a combination of code and data. While a significant amount of time and resources have been dedicated to developing code and algorithms, it is now necessary to prioritize improving data quality and relevance to achieve desired outcomes.

By shifting the focus to data, we are basically moving from model-centric to data-centric AI.

In a data-centric approach, the role of Subject Matter Experts is significant, and AI systems are much closer to every business.

Adopting a data-centric approach tends to be more fruitful in the real world and on practical projects. It means the model stays fixed while the effort goes into feeding it good, high-quality data. A decent algorithm with good data may even outperform a great algorithm with not-so-good data.

That said, data should not be the only priority in the development process. It is also important to consider the structure and functionality of the model and code used. The choice of model can significantly impact accuracy, as demonstrated in the analysis of the Titanic disaster, where the prediction accuracy varied from 78% to 94%, depending on the algorithm.

Big Data vs. Good Data: What’s More Important for AI?

Good Data and Big Data are two terms often used in the field of Data Science and AI. Good data refers to data that is accurate, relevant, and well-organized. Big data, on the other hand, refers to datasets that are too large and complex to be processed and analyzed using traditional data processing tools and techniques.

In order for AI algorithms to make accurate predictions and decisions, they need to be trained on large amounts of data. However, simply having a lot of data is not enough. The data must also be high-quality and well-organized for the AI model to learn effectively.

In many cases, companies only have access to small datasets, which can lead to poor results if the focus is solely on the model. Andrew Ng emphasizes the benefits of a data-centric approach to machine learning and suggests that there should be a shift towards this approach within the community. He uses the example of a steel defect detection problem, in which the data-centric approach improved the model’s accuracy by 16% compared to a model-centric approach [Source: neptune.ai].

Data Quality Issues

Many famous datasets have data quality issues. For instance, the COCO and ImageNet datasets are widely used in computer vision for tasks such as object detection, image classification, and image segmentation. While these datasets are generally considered high quality, some issues have been discovered in them that may limit their usefulness for specific tasks. Some of the problems detected in the COCO dataset include mislabelled objects (first image) and labeling inconsistencies (second image) [Source: neuralception.com].
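One simple way to surface labeling inconsistencies like these is to look for items that received more than one distinct label. The snippet below is a minimal sketch, not how the COCO issues were actually found: the record layout `(item_id, annotator, label)`, the sample values, and the function name `find_label_conflicts` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical annotation records: (item_id, annotator, label).
annotations = [
    ("img_001", "ann_a", "cat"),
    ("img_001", "ann_b", "cat"),
    ("img_002", "ann_a", "dog"),
    ("img_002", "ann_b", "wolf"),
    ("img_003", "ann_a", "truck"),
    ("img_003", "ann_b", "car"),
    ("img_003", "ann_c", "truck"),
]


def find_label_conflicts(records):
    """Return {item_id: sorted labels} for items that have more than one distinct label."""
    labels_by_item = defaultdict(set)
    for item_id, _annotator, label in records:
        labels_by_item[item_id].add(label)
    return {
        item_id: sorted(labels)
        for item_id, labels in labels_by_item.items()
        if len(labels) > 1
    }


conflicts = find_label_conflicts(annotations)
for item_id, labels in sorted(conflicts.items()):
    # Each printed item is a candidate for manual review with the labelers.
    print(item_id, labels)
```

Items that show up in `conflicts` are not necessarily wrong, but they are a good starting list for a discussion with labelers and Subject Matter Experts.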

Ensuring Data Quality in Machine Learning Development

While the specific methods for improving data quality may vary depending on a specific problem, some general steps can be applied in many situations:

  • Verify data sources.
  • Ensure that data sources are reliable and accurate.
  • Make sure your model gets enough data to be able to generalize.
  • Validate and clean data.
  • Check labeling inconsistencies.
  • Perform validation checks and look-ups on labeled data if applicable.
  • Check the label distribution and manually review entries with low-frequency labels.
  • Try to build a tool that flags suspicious labels so you can check them manually.
  • Standardize data to a common format or schema.
  • Involve Subject Matter Experts.
  • Document agreements after discussing inconsistencies with labelers.
  • Analyze data and perform feature engineering.
  • Train a model on the labeled data, apply it to the training data, and check where its predictions differ from the labels.
  • Perform error analysis.
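As an illustration of the label distribution step in the list above, the following sketch counts labels and returns the entries whose label is rare enough to deserve a manual look. The row layout `(entry_id, label)`, the sample labels, the function name `rare_label_entries`, and the 5% threshold are assumptions chosen for the example; a sensible threshold depends on the dataset.

```python
from collections import Counter


def rare_label_entries(rows, min_share=0.05):
    """Return (label counts, rows whose label makes up less than min_share of all rows)."""
    counts = Counter(label for _entry_id, label in rows)
    total = sum(counts.values())
    rare_labels = {label for label, n in counts.items() if n / total < min_share}
    flagged = [row for row in rows if row[1] in rare_labels]
    return counts, flagged


# Hypothetical labeled rows: (entry_id, label). "scrach" is a deliberate typo.
rows = (
    [(f"id_{i}", "ok") for i in range(30)]
    + [(f"id_{i}", "scratch") for i in range(30, 48)]
    + [("id_48", "scrach"), ("id_49", "dent")]
)

counts, flagged = rare_label_entries(rows)
print(counts.most_common())
for entry_id, label in flagged:
    # Rare labels are often typos or genuinely rare classes; both are worth checking.
    print("review:", entry_id, label)
```

A rare label can be a typo (as with "scrach" here) or a legitimate but underrepresented class, so the flagged rows are a prompt for review rather than an automatic fix.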

The development of machine learning-based systems is a highly iterative process. Keeping track of your notes, findings, and conventions in a documentation system can help maintain data quality. By following the steps above and staying vigilant about potential issues, you can help ensure that your data is of high quality and ready for use in your machine learning projects.
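The idea of training a model on the labeled data, applying it back to the training data, and checking the differences can be sketched with a very small classifier. The example below uses a nearest-centroid model written in plain Python as a stand-in for whatever model a project actually uses; the sample points, the labels, and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """Compute the mean point (centroid) of each label."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


def suspicious_labels(points, labels):
    """Return (index, given label, predicted label) where the two disagree."""
    centroids = fit_centroids(points, labels)
    diffs = []
    for index, (point, label) in enumerate(zip(points, labels)):
        predicted = predict(centroids, point)
        if predicted != label:
            diffs.append((index, label, predicted))
    return diffs


# Hypothetical 2D features; the point at index 4 sits in cluster "A" but is labeled "B".
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
          (10, 10), (11, 10), (10, 11), (11, 11)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

for index, given, predicted in suspicious_labels(points, labels):
    print(f"entry {index}: labeled {given}, model says {predicted}")
```

A disagreement does not prove the label is wrong, since the model may be the one making the mistake, so each flagged entry still needs a manual check or input from a Subject Matter Expert.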

Wrapping Up

It is important to consider both the quality and quantity of data as well as the structure and functionality of the model and code used. By prioritizing data quality and involving Subject Matter Experts, companies can improve the accuracy and effectiveness of their AI systems, drive growth and success, and gain a competitive advantage.

Blog by Mehmed Kadrić, Senior Data Analyst at Atlantbh

Originally published at https://www.atlantbh.com on January 10, 2023.
