Navigating the data management landscape in the age of Gen AI

Exploring challenges and opportunities in AI-driven data enablement

Willem Koenders
ZS Associates
9 min read · Jan 19, 2024


Image created by Bing Image Creator, edited by the author, and inspired by this meme.

As organizations embark on their Gen-AI powered transformative journeys, understanding the relationship between data management and AI becomes crucial. This article delves into the challenges and opportunities presented by Gen AI, exploring how robust data management practices are not just a necessity but a catalyst for the successful deployment of AI technologies.

Key data management challenges for Gen AI

Data management plays a crucial role in enabling AI. It involves the collection, storage, processing, maintenance, and democratization of data to ensure it is primed for AI applications. As we step into the era of Generative AI (“Gen AI”), this role takes on heightened importance. Gen AI systems are advanced and complex, requiring large, diverse, and high-quality datasets to function optimally.

One of the foremost challenges is maintaining data quality. The old adage “garbage in, garbage out” holds true in the context of Gen AI. Just like any other AI use case or business process, the quality of the data fed into the system directly impacts the quality of the output.

Another significant challenge is managing the sheer volume of data needed, especially for those who wish to train their own Gen AI models. While off-the-shelf models may require less data, custom training demands vast amounts of data and substantial processing power. This has a direct impact on the infrastructure and energy required. For instance, generating an image can consume as much energy as fully charging a mobile phone. Some estimate that Google’s AI-focused operations can consume as much energy as the entire country of Ireland.

Privacy and security concerns are paramount, as many Gen AI applications rely on sensitive data about individuals or companies. Consider the use case of personalizing communications, which cannot be effectively executed without personal details about the intended recipient. In Gen AI, the link between input data and outcomes is less explicit than in other predictive models, particularly those with clearly defined dependent variables. This lack of transparency can make it challenging to understand how and why specific outputs are generated, complicating efforts to ensure privacy and security. It can also cause ethical problems when the training data contains biases.

Most Gen AI applications have a specific demand for data integration, as they tend to require synthesizing information from a variety of sources. For instance, a Gen AI system designed for market analysis might need to integrate data from social media, financial reports, news articles and consumer behavior studies. The ability to seamlessly combine these disparate data sets is crucial to understand context and produce relevant results. This integration not only demands the right technological solutions but also raises complexities around data compatibility, consistency, and processing efficiency. As such, data integration becomes a pivotal aspect of the data management process, directly impacting the functionality and effectiveness of Gen AI applications.
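The integration step above often boils down to mapping each source's fields onto one shared shape before anything downstream can use them. As a minimal sketch (the sources, field names, and target schema here are invented for illustration, not from any specific platform):

```python
# Sketch: normalizing disparate sources into one shared schema for a
# market-analysis use case. Source names and fields are illustrative.
def normalize(source: str, record: dict) -> dict:
    """Map source-specific fields onto a common (source, date, text) shape."""
    if source == "social":
        return {"source": source, "date": record["posted_at"], "text": record["body"]}
    if source == "news":
        return {"source": source, "date": record["published"],
                "text": record["headline"] + ". " + record["summary"]}
    raise ValueError(f"unknown source: {source}")

# Two records from different feeds end up in the same shape.
unified = [
    normalize("social", {"posted_at": "2024-01-10", "body": "Loving the new model lineup"}),
    normalize("news", {"published": "2024-01-11", "headline": "EV sales jump",
                       "summary": "Demand rose 20% in Q4"}),
]
print(unified)
```

Even a toy version like this surfaces the real issues named above: every new source needs its own mapping, and inconsistencies in dates, encodings, and semantics have to be resolved somewhere.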

Let’s unpack these challenges in a bit more detail.

Data quality

Garbage in, garbage out. Image generated by ChatGPT.

Just like in any other AI, analytics, or business application, the quality of the input data determines the quality of the output. Bad data leads to unreliable results. There is no way around sitting down, thinking through, and explicitly documenting the requirements you have for the input data. These can be expressed in common, well-known dimensions such as completeness, validity, timeliness, accuracy, and consistency, even for unstructured data. The key question is: what needs to be true for the data to be considered reliable, fit-for-purpose input?
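Documented requirements like these can be turned directly into executable checks. A minimal sketch, where the attribute names and thresholds are example assumptions rather than a universal standard:

```python
from datetime import datetime, timedelta, timezone

# Illustrative data quality rules for one hypothetical input feed.
# Dimension names follow the text; fields and thresholds are assumptions.
RULES = {
    "completeness": lambda rec: all(rec.get(f) not in (None, "")
                                    for f in ("id", "text", "captured_at")),
    "validity": lambda rec: isinstance(rec.get("text"), str) and len(rec["text"]) >= 20,
    "timeliness": lambda rec: datetime.now(timezone.utc) - rec["captured_at"] < timedelta(days=90),
}

def assess(record: dict) -> dict:
    """Return a pass/fail verdict per dimension, so a failure traces to a rule."""
    return {dim: bool(check(record)) for dim, check in RULES.items()}

record = {
    "id": "doc-001",
    "text": "Quarterly revenue grew 8% on strong demand in the EU region.",
    "captured_at": datetime.now(timezone.utc) - timedelta(days=3),
}
print(assess(record))
```

The point is less the specific rules than the shape: each documented expectation becomes a named, testable check, which is what makes "fit-for-purpose" an answerable question rather than a slogan.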

Fortunately, the foundational capabilities necessary for effective data quality management in Gen AI are similar to those in other domains. It starts with setting a clear strategy and expectations, which is where policies, standards, and a data quality framework come into play. An operating model with well-defined roles and responsibilities can then be created. For instance, if Gen AI uses data from a specific source, who is responsible for ensuring the data’s quality from that source? (Hint: This should not fall on the Gen AI engineer.)

When it comes to implementing data quality controls, they should sit as close to the source as possible. Controls can be integrated into the data capture processes or used to measure data at rest and in motion. This approach ensures that data meets the set expectations and provides alerts when data quality dips. I would try to avoid creating large, centralized data quality teams, as they often prove ineffective. Instead, focus on engaging the producers of critical data and addressing data quality issues upstream, at the source.

Now, there is a distinct aspect of data quality in the context of Gen AI compared to other AI or analytics applications. In typical predictive models, such as those forecasting customer churn or mortgage defaults, it’s relatively straightforward to retrospectively assess the accuracy of predictions. However, with Gen AI, this assessment is more challenging. Gen AI models can provide highly convincing responses even when they lack a solid basis for the correct answer. This phenomenon, known as “hallucination,” occurs when the model generates plausible but incorrect or nonsensical responses. To counter this, it’s crucial to implement a process that evaluates the outputs of the Gen AI model, even if only on a sample basis. When deviations from expected good answers are observed, it’s important to investigate whether this could be due to poor or inaccurate input data. Implementing such a process requires dedication and a well-defined approach to continually ensure the integrity and quality of the data feeding into Gen AI systems.
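The sample-based evaluation process described above can be sketched in a few lines. This is a deliberately simplified illustration (the function, sample size, and exact-match comparison are all assumptions; real generative outputs would need human review or semantic similarity scoring rather than string equality):

```python
import random

# Hypothetical sample-based output review; names and defaults are
# illustrative, not any specific product's API.
def review_sample(outputs: dict, references: dict, sample_size: int = 50, seed: int = 7) -> dict:
    """Score a random sample of model outputs against reference answers and
    return the share that deviates, for triage back to the input data."""
    rng = random.Random(seed)
    ids = rng.sample(list(outputs), min(sample_size, len(outputs)))
    # Exact match after normalization stands in for a real grading step.
    deviations = [i for i in ids
                  if outputs[i].strip().lower() != references[i].strip().lower()]
    return {"reviewed": len(ids),
            "deviation_rate": len(deviations) / len(ids),
            "flagged": deviations}

outputs = {"q1": "Paris", "q2": "1969", "q3": "Nitrogen"}
references = {"q1": "Paris", "q2": "1968", "q3": "nitrogen"}
print(review_sample(outputs, references, sample_size=3))
```

The flagged items are exactly the cases where the investigation described above should start: was the deviation a hallucination, or was the input data itself poor?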

Data acquisition and related privacy concerns

Image created by ChatGPT.

When it comes to training or operating Gen AI models, there’s often a need for personal and potentially sensitive data from individuals or companies. This data can be crucial for the AI to learn and generate accurate, relevant outputs. However, individuals and organizations might be hesitant to share their data due to privacy concerns and the fear of misuse. The reluctance is understandable, as such data can reveal a lot about a person or an organization’s private details.

To address these privacy challenges, there are at least three effective approaches: establishing proactive privacy policies and controls, relying on third-party data, and using synthetic data.

Being proactive about privacy is key. If sensitive data is needed, it’s essential to be transparent and clear about why it’s being collected and how it will benefit the data provider. A straightforward and easy-to-understand privacy policy, rather than a lengthy, legalese document, builds trust. And then you need to ensure that foundational capabilities and processes are in place to uphold these policies, of course. A single privacy incident can significantly damage a reputation that was built up over years.

In some cases, depending on the Gen AI application, using third-party data can be a viable alternative to using clients’ data. For example, a Gen AI model developed for market analysis might use publicly available consumer behavior data instead of directly gathering data from specific customers. This approach reduces the burden of convincing customers to share their data and lessens the obligation to protect it, as less of it is in your hands.

Another innovative solution is the use of synthetic data. Synthetic data is artificially generated data that mimics the characteristics of real data without containing any actual personal information. It can be a powerful tool, especially in scenarios where privacy concerns are paramount. For instance, in a project I was involved in, we developed a Gen AI solution to create executive summaries highlighting key insights and trends from survey data. Instead of using actual client data, which would have been risky and potentially biased, we used Gen AI to generate thousands of realistic survey responses, complete with the kind of grammar mistakes and inconsistencies found in real responses. This synthetic data then served as the training material for our Gen AI application, effectively avoiding the pitfalls of using sensitive real data.
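To make the idea concrete, here is a toy sketch of generating synthetic survey responses with realistic noise. The templates and typo-injection rules are invented for illustration; in the project described above, a Gen AI model produced the responses rather than hand-written templates:

```python
import random

# Invented templates and vocabulary; a real pipeline would use a Gen AI
# model to produce far more varied responses.
TEMPLATES = [
    "I {feel} the product because {reason}.",
    "Overall I {feel} it, although {reason}.",
]
FEELINGS = ["like", "love", "dislike", "am neutral about"]
REASONS = ["the onboarding was confusing", "support replied quickly", "pricing is unclear"]

def add_typos(text: str, rng: random.Random, rate: float = 0.05) -> str:
    """Randomly drop characters to mimic real, imperfect free-text answers."""
    return "".join(ch for ch in text if rng.random() > rate or ch == " ")

def synthetic_responses(n: int, seed: int = 42) -> list[str]:
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        t = rng.choice(TEMPLATES).format(feel=rng.choice(FEELINGS),
                                         reason=rng.choice(REASONS))
        out.append(add_typos(t, rng))
    return out

for response in synthetic_responses(3):
    print(response)
```

Because the responses are generated, no record traces back to a real respondent, which is precisely what makes the approach attractive when privacy concerns rule out real data.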

Data foundations

In the journey towards successfully deploying Gen AI, foundational data management capabilities play an enabling role. Throughout this article, we’ve touched on various aspects that all tie back to these essential capabilities. There’s a long-established practice of using data capability maturity frameworks and measurements to assess an organization’s data management strengths and identify gaps. These frameworks are a good starting point to determine the specific capabilities required to effectively activate Gen AI use cases.

While it’s possible to develop a data capability maturity framework independently, I recommend exploring what has already been established in the field. Having spent considerable time researching and building such frameworks over the past decade, I’ve discovered that there are specific, tangible elements that are almost like a checklist. These are the essentials that must be in place for any chance of successfully and sustainably activating Gen AI. Conversely, if these components are in place, success is nearly inevitable.

While I can’t disclose the complete framework here, I can share that I’ve worked on a specific framework for Gen AI, where there are two major categories of capabilities to focus on. The first category relates to true enterprise capabilities — those that can be established once and utilized across multiple Gen AI use cases. Examples include having a clear AI strategy with defined objectives and goals, establishing roles and responsibilities for Gen AI-related processes and transformation, and ensuring access to foundational models as well as basic data platform, storage, and processing capabilities.

The second set of capabilities is use case specific. It’s good to remember that not all Gen AI use cases are the same. Some may require specialized modeling expertise, varying volumes and diversity of data, or annotated and historical data. Data requirements also depend on the specific use case, with some applications, like an off-the-shelf basic copilot, needing no custom data at all. All such use case-specific capabilities generate a second list of items to check off to ensure successful activation.

For those interested in delving deeper into the specifics of this framework, stay tuned for my future writings on the topic, or feel free to reach out to me directly.

How Gen AI can help data management

The focus of this POV so far has been on the role of data management in enabling Gen AI, but let’s take a look at the reverse: how Gen AI (and large language models more generally) could possibly enhance data management.

AI-powered integration tools can streamline the processing and analysis of data from various sources. Tools like Informatica’s Intelligent Data Management Platform, IBM Watson Knowledge Catalog, and WCKD RZR’s DataNow have capabilities to scan and discover metadata, and to interpret and understand the data. This helps create a common understanding of data in a semi-automated way, significantly reducing the time and effort required to manage data effectively. Additionally, machine learning algorithms can automate data cleaning and preparation, though these have been in use for some time.

A newer development is the application of Gen AI itself in data management. For instance, Gen AI can interpret unstructured information such as meeting recordings (or really, the corresponding transcripts) and historic emails. From these, it can infer which systems are part of the IT landscape and identify corresponding issues. In a more common use case, Gen AI is used to generate consistent, business-friendly definitions for critical data attributes.
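For the definition-generation use case, the work is largely in structuring the request to the model. A minimal sketch of the prompt-construction step (the attribute name, sample values, and prompt wording are all illustrative assumptions, and the actual model call is deliberately left out):

```python
# Sketch: building a prompt for an LLM to draft a business-friendly
# definition of a critical data attribute. All names are hypothetical.
def definition_prompt(attribute: str, sample_values: list[str], system: str) -> list[dict]:
    """Return a chat-style message list ready to send to an LLM."""
    user = (
        f"Draft a one-sentence, business-friendly definition of the data "
        f"attribute '{attribute}'. Sample values: {', '.join(sample_values)}. "
        "Avoid technical jargon; state the business meaning and typical use."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = definition_prompt(
    "cust_churn_flag",
    ["Y", "N"],
    system="You are a data steward writing glossary entries.",
)
print(messages[1]["content"])
```

Grounding the prompt in sample values and a steward persona is what nudges the model toward consistent, glossary-ready definitions rather than generic ones; the drafts should still go through human review before landing in a data catalog.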

While these advancements are very real and exciting, I advise maintaining a healthy level of skepticism. It’s relatively easy to create a demo environment where these tools perform exceptionally well. In such controlled settings, data is tailored to showcase the strengths of the tools. But the real challenge emerges in real-life organizational environments. Here, data is spread across different systems, regions, and access protocols, often lacking a common set of interoperability standards. This complexity is what makes the work arduous and explains why there are dozens (if not hundreds) of companies all proclaiming to have the solution to integrate data from various sources into platforms or virtualized views for subsequent consumption.

Closure

Navigating the data management landscape in the age of Gen AI presents both challenges and opportunities. The future of (Gen) AI is intrinsically linked to how well we manage and utilize data, making it imperative to adopt a strategic, thoughtful approach to data management that prioritizes privacy, efficiency and innovation.

Any thoughts? Feel free to drop them in the comments!

Read more insights from ZS.


Willem Koenders
ZS Associates

Global leader in data strategy with ~12 years of experience advising leading organizations on how to leverage data to build and sustain a competitive advantage