QuantumBlack, AI by McKinsey

Solving data quality for gen AI applications

6 min read · Jan 30, 2025


The popularity of generative AI (gen AI) has created an imperative for organizations to take control of their unstructured data. This article examines how QuantumBlack Labs has used its award-winning product, AI4DQ (“AI for Data Quality”), to assess and resolve data quality issues and ensure successful gen AI use cases. The product includes a toolkit that helps businesses gain deep insights into, and a holistic understanding of, their unstructured data. AI4DQ Unstructured offers correction strategies that improve unstructured data quality and translate directly into more accurate and consistent gen AI results.

AI4DQ Unstructured in action within an app built with Vizro.

QuantumBlack Labs is the R&D and software development hub within QuantumBlack, AI by McKinsey. QuantumBlack Labs has more than 250 technologists dedicated to driving AI innovation and supporting and accelerating the work of its more than 1,400 data scientists across over 100 locations. We use our colleagues’ collective experience to develop suites of tools and assets that ensure AI/ML models reach production and achieve sustained impact.

The importance of unstructured data for gen AI

As we described in our blog post about AI-assisted data remediation, data quality is a common problem in many organizations, particularly as they scale. Gen AI applications are often underpinned by large volumes of unstructured data in various formats and from various sources, such as PDFs, images, and videos. This type of data is often prone to data quality issues that can be difficult to identify. These issues can cause significant performance problems at scale if they’re not resolved before training the gen AI application.

Defining and assessing unstructured data quality

While data quality issues in structured data can be clearly defined, and therefore resolved, there is less of an established framework for unstructured data. The types of data quality issues we see most frequently in unstructured data include:

  • Diverse document formats, such as .pdf, .ppt, and .xls, which contain elements that are hard to parse, such as images or complex tables.
  • A lack of metadata tags across a diverse corpus of documents, making search and retrieval difficult.
  • Siloed data storage across multiple business units and departments.
  • Conflicting information in multiple versions or in outdated documents.
  • Irrelevant documents or documents containing boilerplate or repetitive content.
  • Multiple languages in the document corpus that aren’t suitable for the LLM.
  • Sensitive information, such as names and addresses, ingested without proper filtering or access control.
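
To make the detection side of this concrete, the snippet below sketches one simple way a few of these issue types could be surfaced programmatically. It is illustrative only, not AI4DQ code: the folder layout, the metadata mapping, and the choice of “hard to parse” extensions are assumptions, and it covers just three of the issues listed above (hard-to-parse formats, missing metadata tags, and exact byte-level duplicates).

```python
# Illustrative corpus scan (stdlib only, not AI4DQ). Flags hard-to-parse
# formats, documents with no metadata tags, and exact byte-level duplicates.
import hashlib
from collections import Counter, defaultdict
from pathlib import Path

HARD_TO_PARSE = {".pdf", ".ppt", ".pptx", ".xls", ".xlsx"}  # assumed list

def scan_corpus(root: str, metadata: dict[str, dict]) -> dict:
    """Walk a document folder and collect simple data quality signals.

    `metadata` maps a file path (relative to `root`) to its existing tags;
    an empty dict means the document carries no tags at all.
    """
    format_counts = Counter()
    missing_tags = []
    by_hash = defaultdict(list)

    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        rel = str(path.relative_to(root))
        format_counts[path.suffix.lower()] += 1
        if not metadata.get(rel):
            missing_tags.append(rel)        # no tags: hard to search and retrieve
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_hash[digest].append(rel)         # identical bytes: exact duplicates

    return {
        "hard_to_parse": {ext: n for ext, n in format_counts.items() if ext in HARD_TO_PARSE},
        "missing_metadata": missing_tags,
        "exact_duplicates": [group for group in by_hash.values() if len(group) > 1],
    }
```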

If practitioners fail to adequately address data quality issues upstream, there can be significant downstream challenges that cannot easily be resolved after the fact, such as:

  • Gen AI hallucination, creating inaccurate and inconsistent results.
  • Information loss from raw input data.
  • Wasted memory and compute due to irrelevant outputs.
  • Information leakage, such as PII, which creates compliance risk.

AI4DQ Unstructured, by QuantumBlack Labs

AI4DQ Unstructured generates actionable insights to quantify and improve data quality readiness for gen AI applications such as RAG pipelines. The toolkit offers a collection of detection and correction strategies for unstructured data quality issues, such as complex tables and images, use of foreign languages that require translation, outdated and irrelevant content, and duplicate documents.

A scoring mechanism combines individual data quality scores into a view of the overall unstructured data quality of the corpus.
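
AI4DQ’s scoring formula is not public, so the sketch below only illustrates the general idea of rolling per-issue scores up into a single corpus-level figure; the issue names, the 0-to-1 scale, and the weights are assumptions made for the example.

```python
# Illustrative roll-up of per-issue scores (0-1, higher is better) into one
# corpus-level score. Issue names and weights are assumed, not AI4DQ's.
from statistics import fmean

def corpus_quality_score(per_document_scores: list[dict[str, float]],
                         weights: dict[str, float]) -> float:
    """Weighted average per document, then a plain mean across documents."""
    def document_score(scores: dict[str, float]) -> float:
        total = sum(weights[issue] for issue in scores)
        return sum(weights[issue] * value for issue, value in scores.items()) / total

    return fmean(document_score(scores) for scores in per_document_scores)

# Two documents scored on three assumed issue types.
docs = [
    {"parseability": 0.9, "metadata_coverage": 0.4, "duplication": 1.0},
    {"parseability": 0.6, "metadata_coverage": 0.8, "duplication": 0.5},
]
print(corpus_quality_score(docs, {"parseability": 0.5, "metadata_coverage": 0.3, "duplication": 0.2}))
```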

AI4DQ Unstructured data quality scores for human-in-the-loop review within an app built with Vizro.

AI4DQ Unstructured also offers end-to-end (E2E) workflows, using three dimensions to solve the complex and unique challenges associated with unstructured data quality. Taking a human-in-the-loop approach, AI4DQ Unstructured scans the input corpus across these three dimensions and flags the documents that need attention.

The three dimensions used for data assessment

Document clustering and labeling workflow

This workflow helps businesses understand the main types and themes of their documents. It also helps them decide if there are sufficient documents to build a gen AI workflow.

AI4DQ Unstructured combines NLP techniques with the power of gen AI to:

  • Train custom embeddings on top of the existing corpus.
  • Cluster the documents using the embeddings to classify them based on semantic meaning.
  • Label each document cluster with a “document type” as metadata to inform the main categories for the input documents.
  • Develop fine-grained and customized tags on each document, by cluster, to enrich the document metadata, which can feed into search and retrieval.

Document clustering and labeling workflow

Clustering and labeling can also be performed at a “chunk” level (i.e., whole documents divided into more granular sections) to reveal more detail. Chunk-level metadata can be cached alongside each document for more accurate and targeted search and retrieval.
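
As a rough, non-authoritative sketch of the steps above, the snippet below uses scikit-learn: TF-IDF vectors stand in for custom-trained embeddings, k-means stands in for the clustering step, and top centroid terms stand in for the LLM-generated “document type” labels. Applied to chunks rather than whole documents, the same function would produce chunk-level tags.

```python
# Simplified stand-in for the clustering-and-labeling workflow
# (not the AI4DQ implementation). Assumes scikit-learn and numpy.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_label(texts: list[str], n_clusters: int = 5, top_terms: int = 5) -> list[dict]:
    vectorizer = TfidfVectorizer(stop_words="english")  # corpus-fitted vectors stand in for custom embeddings
    vectors = vectorizer.fit_transform(texts)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(vectors)            # group documents by semantic similarity

    terms = vectorizer.get_feature_names_out()
    labels = {}
    for cluster in range(n_clusters):
        # Rank terms by weight in the cluster centroid; in practice an LLM
        # prompt over sample documents would produce a cleaner label.
        top = np.argsort(kmeans.cluster_centers_[cluster])[::-1][:top_terms]
        labels[cluster] = ", ".join(terms[i] for i in top)

    # Attach the cluster label to each document as a metadata tag.
    return [{"text": text, "cluster": int(cid), "document_type": labels[int(cid)]}
            for text, cid in zip(texts, cluster_ids)]
```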

Document de-duplication workflow

Document duplication is a common problem: the input corpus often contains multiple versions of the same document with minor differences. This workflow finds the latest version of a document and removes duplicates, avoiding downstream consumption of conflicting or outdated information.

AI4DQ Unstructured automatically identifies and presents duplicated or versioned documents for human review by:

  • Creating and extracting metadata to describe each document.
  • Comparing documents against each other to establish pair-wise duplicates.
  • Resolving document “entities” using the pair-wise duplicates as edges to generate duplicated document sets.
  • Offering the human in-the-loop a view of potential duplicated/versioned documents and recommending appropriate correction strategies.

Document de-duplication workflow
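
The sketch below captures the shape of those steps with standard-library stand-ins: pairwise text similarity produces the “duplicate” edges, and connected components over those edges become the candidate duplicate or versioned sets offered for review. The similarity measure and the 0.9 threshold are assumptions, not AI4DQ’s.

```python
# Self-contained de-duplication sketch (stdlib only, not AI4DQ): similar pairs
# become edges, connected components become candidate duplicate sets.
from difflib import SequenceMatcher
from itertools import combinations

def duplicate_sets(docs: dict[str, str], threshold: float = 0.9) -> list[set[str]]:
    # Union-find over document names, linked by high-similarity pairs (edges).
    parent = {name: name for name in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if SequenceMatcher(None, docs[a], docs[b]).ratio() >= threshold:
            parent[find(a)] = find(b)                  # merge the two components

    groups = {}
    for name in docs:
        groups.setdefault(find(name), set()).add(name)
    # Only multi-document components are candidates for human review.
    return [members for members in groups.values() if len(members) > 1]

# Example: two versions of the same report should land in one candidate set.
corpus = {
    "report_v1.txt": "Quarterly results improved across all regions in 2024.",
    "report_v2.txt": "Quarterly results improved across most regions in 2024.",
    "policy.txt": "Employees must complete security training annually.",
}
print(duplicate_sets(corpus))  # one set containing report_v1.txt and report_v2.txt
```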

The impact of AI4DQ Unstructured

The impact of solving data quality problems in an unstructured corpus can be significant.

  • Enhanced document retrieval and search efficiency, which improves the gen AI application’s results. One project using AI4DQ saw a 20 percent increase in RAG pipeline accuracy thanks to the addition of metadata tags on document themes.
  • Cost savings from less time spent analyzing irrelevant or outdated information, as well as from reduced overheads.
  • Risk reduction by avoiding unnecessary compliance risks from information leakage or inappropriate data access.

The impact of AI4DQ Unstructured

A public health deep dive

Working with an international health organization, we deployed AI4DQ to conduct a rapid data quality assessment and accelerate research and report writing. The organization wanted a gen AI application to reduce the time overhead of creating these reports. We ingested and processed 2.5 GB of data and identified more than ten high-priority data quality issues that were blocking the effectiveness of this gen AI use case.

Using AI4DQ Unstructured to prioritize the most relevant solutions, we identified issues across the 1,500+ files and suggested remediation strategies to ensure that only high-quality documents were used to train the LLM. Improvements included:

  • Identification and removal of 100+ irrelevant or duplicated documents, saving 10–15 percent in data storage costs.
  • Preservation of information in 5 percent of critical policy documents that would otherwise have been permanently lost.

When does an organization need AI4DQ Unstructured?

AI4DQ Unstructured is useful for organizations starting to implement gen AI solutions, particularly those that:

  • Have a large collection of documents of various formats stored in silos across the organization.
  • Lack a comprehensive understanding of the types of documents available and their contents.
  • Lack a robust strategy for the management of this unstructured data.
  • Are unsure how to use the unstructured data.

Accurate data management is crucial for successful scaling. Data quality is a key blocker for organizations implementing gen AI applications. The unstructured nature of the data means that data quality is often overlooked and can be difficult to define and resolve. AI4DQ Unstructured sets out a unique framework to ensure that input data is recent, relevant, fit-for-purpose, and ingested in an easy-to-understand format for gen AI.

QuantumBlack Horizon is a family of enterprise AI products, including Kedro, Brix, and Alloy, that provides the foundations for organization-level AI adoption by addressing pain points like scaling. It’s a first-of-its-kind product suite that helps McKinsey clients discover, assemble, tailor, and orchestrate AI projects.

To learn more about what QuantumBlack Horizon and AI4DQ can do for you, please email alan_conroy@mckinsey.com.

Thanks to all who contributed to this article: Alan Conroy, Nishant Kumar, Zihao Xu, Paul Southall, James Mulligan, Jo Stichbury, Joanna Sych, Sarah Mulligan & Matt Fitzpatrick.
