Unlocking consumable data for generative AI
AI4DX by QuantumBlack, AI by McKinsey
Enterprises today are racing to adopt generative AI (gen AI) and the intelligent agents that promise transformative efficiency and insight. However, both depend critically on trusted, usable data that is structured, labeled, and organized so that gen AI can easily consume it.
The reality is that ~80–90 percent of enterprise data is trapped in unstructured formats: contracts, invoices, reports, clinical notes, and even handwritten scans. The result? Slow decision-making, compliance risks, and limited value from gen AI. Historically, there has been no efficient way to extract consumable data with both speed and accuracy.
This article explains how AI4DX (AI for data extraction, by QuantumBlack, AI by McKinsey) can unlock the hidden value in unstructured data to enable faster insights, stronger compliance, and AI systems that can truly deliver.
Introducing AI4DX
The traditional tools used to unlock value from enterprise data have so far either failed to produce accurate results or required resources that could be better used elsewhere. A tool for AI-driven data extraction can deliver faster and more efficient processes that are accurate, transparent, and scalable, ready to fuel downstream analytics, compliance processes, and the next generation of gen AI and autonomous agents.
AI4DX (AI for data extraction) solves exactly this challenge. It is part of McKinsey’s broader AI4Data ecosystem, but its focus is unique: turning unstructured documents into structured, audit-ready data. It lives in the same suite of offerings as our AI4DQ solution, which can detect and correct data quality issues in unstructured data assets.
By combining gen AI-enabled tooling, vision models, and schema-based workflows, AI4DX delivers both speed and accuracy. Moreover, AI4DX ensures transparency through citations, confidence scores, and human-in-the-loop reviews.
Key capabilities of AI4DX
AI4DX addresses the most common pain points in data extraction through a collection of productivity tools that accelerate both engineering and business-facing workflows.
These tools combine large language models (LLMs), vision AI, and structured schema validation to transform raw, messy documents into reliable, structured data:
- Document ingestion: AI4DX can parse and interpret diverse document types, including PDF, Excel, PowerPoint, and Word files as well as handwritten scans, using advanced Vision LLM technology. Unlike traditional parsers, AI4DX can handle complex layouts (e.g., multi-block Excel sheets, tables embedded in presentations) with high precision. Its Excel block detection system combines LLM-based and heuristic-based methods to parse and extract data. This flexibility ensures that organizations can onboard data from heterogeneous sources without extensive pre-processing.
- Data schema: At the heart of AI4DX lies schema-driven extraction. Users define schemas via Pydantic models or YAML to enable robust type-checking, validation, and support for nested or domain-specific structures. A gen AI-powered auto-suggestion engine generates schemas directly from screenshots or data dictionaries, accelerating the design of data models. Out-of-the-box validation ensures extracted data conforms to expectations before entering downstream systems.
- Automated prompt engineering: One of the biggest hurdles in LLM-based workflows is prompt engineering. AI4DX addresses this with an automated rule optimizer that tunes system prompts to maximize accuracy and generalizability.
- UI for human review: AI4DX includes a purpose-built interface for efficient human-in-the-loop workflows. Reviewers can navigate directly to source citations with one click, receive flags for low-confidence extractions, and make inline edits or approvals. Visual bounding boxes and explanations make it intuitive to validate results, and the workflow ensures that corrections feed back into the optimization cycle.
- Extraction results explanation: With AI4DX, every extracted value is accompanied by confidence scores, page-level citations, detailed reasoning, and even alternative value suggestions for ambiguities. The module also offers a feature for visual grounding, with bounding boxes pinpointing the exact region on a page where each piece of data was found, boosting transparency and supporting human audits.
Together, these features transform what was once a manual, error-prone process into a scalable, transparent, and reliable pipeline for structured data extraction.
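To make the ideas of schema-driven validation and confidence-based review more concrete, here is a minimal Python sketch. It is not AI4DX code and does not reflect its API: the `Invoice` schema, the `ExtractedField` record, the `validate_invoice` and `needs_review` helpers, and the 0.85 threshold are all hypothetical names and values chosen for illustration. It also uses standard-library dataclasses as a stand-in for the Pydantic models mentioned above, so that the example runs without extra dependencies.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ExtractedField:
    """One extracted value plus the evidence a reviewer would need (hypothetical)."""
    name: str
    value: str          # raw text as it came out of the document
    confidence: float   # 0.0 to 1.0, as reported by an assumed extractor
    page: int           # page-level citation


@dataclass
class Invoice:
    """Hypothetical target schema for a scanned invoice."""
    invoice_number: str
    issue_date: date
    total_amount: float


# Each schema field is paired with the function that coerces raw text to its type.
CONVERTERS = {
    "invoice_number": str,
    "issue_date": date.fromisoformat,
    "total_amount": float,
}

REVIEW_THRESHOLD = 0.85  # assumed cut-off; a real deployment would tune this


def validate_invoice(fields: list[ExtractedField]) -> tuple[Optional[Invoice], list[str]]:
    """Coerce raw values into the schema; return (invoice or None, error messages)."""
    raw = {f.name: f.value for f in fields}
    parsed, errors = {}, []
    for name, convert in CONVERTERS.items():
        if name not in raw:
            errors.append(f"{name}: missing")
            continue
        try:
            parsed[name] = convert(raw[name])
        except ValueError as exc:
            errors.append(f"{name}: {exc}")
    return (Invoice(**parsed) if not errors else None), errors


def needs_review(fields: list[ExtractedField],
                 threshold: float = REVIEW_THRESHOLD) -> list[ExtractedField]:
    """Return the fields whose confidence falls below the review threshold."""
    return [f for f in fields if f.confidence < threshold]


# Usage: three extracted values, one of them low-confidence.
fields = [
    ExtractedField("invoice_number", "INV-0042", 0.98, page=1),
    ExtractedField("issue_date", "2024-03-15", 0.95, page=1),
    ExtractedField("total_amount", "1250.50", 0.61, page=2),
]
invoice, errors = validate_invoice(fields)
flagged = needs_review(fields)
for f in flagged:
    print(f"Review '{f.name}' (confidence {f.confidence:.2f}, page {f.page})")
```

In this sketch, validation and review routing are deliberately separate: a value can satisfy the schema and still be sent to a human reviewer because the extractor was unsure of it, which mirrors the split between out-of-the-box validation and low-confidence flags described above.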
Tangible impact of AI4DX across industries
The value of AI4DX is already being demonstrated in production environments across various McKinsey clients:
- In a top legal services firm, we implemented entity extraction across diverse and complex legal documents. Our solution ingested ~400 documents of diverse types (e.g., legal contracts, handwritten notes) and extracted ~5K entities with ~80–95 percent accuracy, reducing manual document review time by ~95 percent and saving four weeks of engineering time.
- For a global energy provider, we developed an end-to-end data ingestion platform for 80+ types of scanned invoices and deal documents. AI4DX centralized thousands of scanned documents by pairing vision-based PDF parsing with 150+ automated extraction pipelines that feed the enterprise data warehouse. The solution achieved ~100 percent extraction accuracy on 80 percent of documents and ~90 percent faster pipeline setup, which gave the team the capacity to work on problem-solving in other areas.
- For a hardware manufacturer, we accelerated sales quote preparation using AI4DX, and reduced model development time by 50 percent. The extraction accuracy also increased from 75 percent to 90 percent, and 10 users were onboarded in under two weeks.
These examples underscore how AI4DX is not only improving accuracy and efficiency for our clients, but is also creating the data foundation required for gen AI adoption.
When does an organization need AI4DX?
AI4DX is useful for organizations developing gen AI solutions that need to:
- Develop faster, more accurate data extraction, where AI4DX can cut cycle times from weeks to hours.
- Free up teams’ time for higher-value work, with minimal human input required to extract data from unstructured documents.
- Comply with regulatory requirements, since AI4DX can create audit-ready outputs with complete traceability.
AI4DX within the AI4Data ecosystem
AI4DX is one pillar of McKinsey’s AI4Data ecosystem, which provides end-to-end value creation across data through the following capabilities:
- AI for Data Quality (AI4DQ): Covers both structured (AI4DQ) and unstructured (AI4DQ Unstructured) data. Enables detection, remediation (deduplication, formatting, missing value imputation), validation, and continuous monitoring of data quality.
- AI for Data Products (AI4DP): Accelerates the creation of data products by auto-generating schemas and ETL/ELT scripts. This shortens the time required to develop AI-ready datasets (e.g., structured schemas, feature tables, or vector embeddings), driving a faster path to insight.
- AI for Data Discovery (AI4DD): Provides ontology mapping, profiling, and data quality assessment to help enterprises discover, organize, and contextualize their data assets.
- AI for Data Lineage (AI4DL): Performs smart mapping and tracing of data across systems and platforms. Lineage is critical for transparency, audits, and compliance with regulatory requirements such as BCBS 239 in the banking sector, which requires banks to demonstrate the traceability and accuracy of risk data aggregation and reporting.
- AI for Data Extraction (AI4DX): Extracts and structures key fields from unstructured documents into structured formats, with the ability to add labels at the point of extraction for downstream use cases, as described in this article.
- AI for Data Intelligence (AI4DI): Provides agentic orchestration of data workflows through a no-code AI-driven interface, to enable seamless coordination across modules and automation of data processes.
Together, the modules ensure enterprises can build AI-ready data pipelines that maximize the impact of generative AI and intelligent agents. For example, AI4DQ ensures the integrity of inputs at source; AI4DX extracts and structures unstructured data and applies labels; AI4DL traces lineage to ensure compliance and auditability; and AI4DP generates the appropriate data product, whether a schema for analytics or a vector dataset for AI applications.
Building the foundation for gen AI success
As enterprises deploy gen AI, one truth is clear: data quality and structure determine impact. Without accurate, validated, and audit-ready data, the promise of generative AI and agents is left unfulfilled.
AI4DX can create pipelines for structured data to address current data quality challenges and lay the foundation for future advancements in AI. Its robust features, scalability, and adaptability make it an indispensable tool for businesses aiming to harness the full potential of their data.
Over the years, QuantumBlack, AI by McKinsey has helped organizations reinvent themselves to achieve accelerated, sustainable, and inclusive growth with AI. In QuantumBlack Labs, the R&D innovation hub within QuantumBlack, we use our colleagues’ collective experience to develop suites of tools and assets to facilitate their engagements.
To learn more about what AI4DX can do for you, please email ai4data_requests@mckinsey.com.
Thanks to all who contributed to this article: Alan Conroy, Nishant Kumar, Zihao Xu, Andreea Bajenaru, Lukas Olson, Jo Stichbury, Joanna Sych.
