Becoming an AI Factory — Part 2: Data

İhsancan Özpoyraz
Published in KoçDigital · Dec 29, 2022

Every factory starts production with the supply of raw materials. The raw material for artificial intelligence (AI) factories is data — the building blocks of intelligent systems. A biscuit factory supplies sugar, flour, and eggs to make cookies; an automotive factory uses steel, rubber, and other materials to build cars; and an AI factory sources data such as text, images, audio, and video to create intelligent systems and applications. The more data that is available to an AI factory, the better it can build the applications that consumers need and want. For this reason, it is important for AI factories to have access to a wide range of data sources, ensuring sufficient quantities of high-quality data for all of their projects. Unfortunately, there are a number of obstacles to achieving this goal.

Many organizations struggle to collect the data that they need to power their AI applications. They also lack the expertise required to clean and process the data effectively. As a result, many companies waste valuable time and resources trying to source and process the data they need. These issues create significant challenges for organizations that want to build successful AI products. However, they can be overcome with the right approach and a bit of creative thinking. Becoming an AI factory requires access to the right data to drive innovation and creativity across all parts of the organization. In this blog, we’ll explore tips on how companies can overcome the challenge of sourcing and processing high-quality data to power their AI products and solutions.

Raw-material volatility and quality issues are major challenges for all industrial companies. Best-in-class manufacturers employ a range of measures to mitigate quality failures across the supply chain. BCG compiled best practices from a variety of industries to identify the most important factors for driving reliable material supply. Supply-side actions that ensure continuity, such as supporting raw-material suppliers and maintaining end-to-end material visibility, are as strategic as choosing the right supplier in the first place. Digitalization is another enabler of integrated processes that can help manage uncertainties early in the material supply lifecycle. AI factories should follow similar strategies to ensure a continuous supply of high-quality data for their AI applications. Robust data-sourcing practices help ensure that companies have access to the quantity of high-quality data they need to power their AI applications.

Fortunately, the emerging data and AI ecosystem provides a wealth of resources designed to help companies establish effective data-sourcing programs. Countless tools and services available in the market today can accelerate the process of building data-sourcing capabilities in your organization. We can group these tools and services into five main categories:

  • Data Observability and Quality
  • Data Catalog and Discovery
  • Data Marketplaces
  • Data Generation
  • Data Governance

Each category represents a different approach to building robust data-sourcing capabilities in your AI factory. The rest of this blog explores each of these categories to help you navigate this complex landscape and discover the right tools and services to fit your needs. These tools and services can be considered the machinery of your AI factory; each one plays a critical role in ensuring the long-term success of your AI initiatives.

Data Observability and Quality:

When it comes to managing the quality of the data flowing into your AI factory, the first place to start is data observability. By observing and measuring the quality and attributes of the data flowing through your factory, you can identify potential issues early and eliminate them before they cause problems downstream. A number of tools can help you achieve this goal: Datakin, Monte Carlo, Manta, Collibra, Datafold, Databand, Acceldata, Metaplane, Talend, Soda, Bigeye, Great Expectations (Superconductive), Anomalo, Precisely, Lightup, and Validio.

Each of these tools takes a different approach to monitoring and measuring the quality of the data flowing through your AI factory. A few outstanding examples to consider: Monte Carlo “uses machine learning to infer and learn what your data looks like, proactively identify data downtime, assess its impact, and notify those who need to know”; Databand (an IBM company) “pinpoints unknown data incidents and reduces mean time to detection (MTTD) from days to minutes”; Datafold “turns SQL queries into smart alerts”; Soda lets you “check data as code and use a common language to check and manage data quality across all data sources, from ingestion to consumption”; and Acceldata provides “real-time insights for data engineers, data scientists, data administrators, platform engineers, data officers, and platform leads.”
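
To make the idea concrete, here is a minimal, hand-rolled sketch in Python with pandas of the kinds of freshness, completeness, and validity checks these platforms automate at scale. The column names, input file, and thresholds are hypothetical placeholders, not anyone's actual schema.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable quality issues found in a batch."""
    issues = []

    # Freshness: flag the batch if the newest record is suspiciously old
    # (assumes event_time holds timezone-aware UTC timestamps).
    latest = df["event_time"].max()
    if (pd.Timestamp.now(tz="UTC") - latest) > pd.Timedelta(days=1):
        issues.append("freshness: no records newer than 1 day")

    # Completeness: a key business column should never be null.
    null_ratio = df["customer_id"].isna().mean()
    if null_ratio > 0:
        issues.append(f"completeness: {null_ratio:.1%} of customer_id is null")

    # Validity: values outside the expected range hint at schema drift
    # or upstream bugs.
    if not df["amount"].between(0, 1_000_000).all():
        issues.append("validity: amount outside expected range [0, 1,000,000]")

    return issues

df = pd.read_parquet("orders.parquet")  # hypothetical input batch
for issue in run_quality_checks(df):
    print("DATA QUALITY ALERT:", issue)
```

A dedicated observability platform runs checks like these continuously across every pipeline, learns the baselines automatically, and routes alerts to the right owners instead of printing to a console.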

Data Catalog and Discovery:

Tools in this category help you catalog and manage the data flowing through your AI factory and identify the best sources from which to acquire data for your models. Popular tools for this task include Metaphor, Atlan, data.world, Stemma, Select Star, Secoda, Castor, Magda, Amundsen, and CKAN.

Atlan positions itself as the Google for your data and “allows you to search across your data universe using natural language, business context or using SQL syntax”; data.world’s “cloud-native SaaS platform leverages the power of the knowledge graph to make data discovery easy”; and Select Star’s automated data discovery platform enables you to “easily find, tag, and add documentation to your data so everyone can find the right dataset for their use case.”
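
Because CKAN is open source and exposes a well-documented HTTP API, it is an easy way to see what programmatic data discovery looks like. Below is a minimal sketch that searches a CKAN catalog for datasets matching a keyword; the catalog URL and search term are placeholders you would swap for your own instance.

```python
import requests

CKAN_URL = "https://demo.ckan.org"  # placeholder: point at your own catalog

# CKAN's package_search action returns datasets matching a free-text query.
resp = requests.get(
    f"{CKAN_URL}/api/3/action/package_search",
    params={"q": "energy", "rows": 5},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['count']} matching datasets")
for package in result["results"]:
    print("-", package["title"], f"({package['name']})")
```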

Data Marketplaces:

Data marketplaces let you tap external sources of data to augment what is already in your systems. They bring multiple sources together in one place so that you can purchase different types of data from different providers. Some popular examples include AWS Data Exchange, Snowflake Marketplace, Narrative, Dawex, Explorium, and KoçDigital’s DaaS.

AWS Data Exchange lets customers “easily find, subscribe to, and use third-party data in the cloud,” with more than 3,500 third-party data sets available. Snowflake Marketplace “is home to a variety of data, data services, and applications” (360 providers offering more than 1,700 live, ready-to-query data sets) and lets users avoid the risk and hassle of copying and moving stale data; instead, you can securely access live, governed, shared data and receive near real-time automatic updates. And KoçDigital’s DaaS (Data-as-a-Service) allows customers to consume external data easily through robust APIs.
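
If your AI factory runs on AWS, entitled marketplace data is a few API calls away. Here is a minimal sketch using boto3’s AWS Data Exchange client to list the data sets your account is subscribed to; it assumes AWS credentials are already configured, and the region is a placeholder.

```python
import boto3

# List the third-party data sets this account is entitled to on
# AWS Data Exchange (Origin="ENTITLED" excludes data sets you own).
client = boto3.client("dataexchange", region_name="us-east-1")

paginator = client.get_paginator("list_data_sets")
for page in paginator.paginate(Origin="ENTITLED"):
    for data_set in page["DataSets"]:
        print(data_set["Name"], "-", data_set["Id"])
```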

Data Generation:

Data generation tools provide synthetically generated data for testing and training purposes. This data can be used to augment the datasets in existing systems and to ensure that the models you train can handle real-world scenarios. Popular tools for this task include Scale Synthetic, Unity, and AI.Reverie (acquired by Meta). Scale Synthetic “helps teams overcome the inability to collect enough edge cases, data collection bias, and data privacy issues with augmented or fully synthetic data.”
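
For simple tabular cases you can get surprisingly far without a vendor. The sketch below uses the open-source Faker library to generate synthetic customer records for testing; the field names and distributions are hypothetical, and commercial tools add realism, scale, and privacy guarantees on top of this basic idea.

```python
import random
from faker import Faker

fake = Faker()
Faker.seed(42)   # make the synthetic data reproducible
random.seed(42)

def synthetic_customer() -> dict:
    """Generate one fake-but-plausible customer record."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "signup_date": fake.date_between(start_date="-2y").isoformat(),
        # Skewed spend distribution, as customer value usually is.
        "lifetime_value": round(random.lognormvariate(5, 1), 2),
    }

records = [synthetic_customer() for _ in range(1_000)]
print(records[0])
```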

Data Governance:

Data governance is the practice of controlling access to and use of an organization’s data to meet specific organizational goals. This can be done through policies governing what data is available, how it is collected and used, and how privacy is handled. Many organizations use data governance to ensure data consistency and integrity throughout their systems while ensuring compliance with regulatory mandates and other legal requirements. There are a number of tools available to help you manage your data governance and ensure your data is secure and protected. These include Informatica, Ataccama, IBM Cloud Pak for Data, Alation, Immuta, Atlan, Collibra, and ALTR.

Informatica provides “access to trusted insights with integrated governance of data,” letting you “view your data holistically by linking technical metadata with business context,” while Alation’s “active data governance puts people first, so folks have access to the data they need, with guidance in-workflow on how to use it.”
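
At its core, governance is policy-as-code: rules about who may see what, enforced automatically. The sketch below is a deliberately simplified, hand-rolled illustration of role-based column access with PII masking; the roles, columns, and masking rule are hypothetical, and platforms like the ones above manage such policies across thousands of tables.

```python
# Which columns each role may read; anything else is hidden or masked.
POLICIES = {
    "analyst": {"order_id", "amount", "region"},
    "data_scientist": {"order_id", "amount", "region", "customer_id"},
}
PII_COLUMNS = {"customer_id", "email"}

def apply_policy(row: dict, role: str) -> dict:
    """Return only the fields a role may see, masking hidden PII."""
    allowed = POLICIES.get(role, set())
    visible = {}
    for column, value in row.items():
        if column in allowed:
            visible[column] = value
        elif column in PII_COLUMNS:
            visible[column] = "***MASKED***"  # keep the shape, hide the value
    return visible

row = {"order_id": 1, "amount": 42.0, "region": "EU",
       "customer_id": "C-123", "email": "x@example.com"}
print(apply_policy(row, "analyst"))
# {'order_id': 1, 'amount': 42.0, 'region': 'EU',
#  'customer_id': '***MASKED***', 'email': '***MASKED***'}
```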

This is the second part of a blog series. The link for the previous part is below:

Becoming an AI Factory — Part 1: Why?
