From data to decisions: Leveraging Generative AI and data products

Willem Koenders
ZS Associates
May 15, 2024

This year, my colleague and fellow data-nerd Abhinav Batra (LinkedIn, Medium) and I were invited to speak at the Pharma SOS Conference in New Orleans about not one but two of our favorite topics: data products and Generative AI. We enjoyed the presentation and the discussion that followed to such an extent that we decided to convert them into a written point of view.

Abhinav (left) and me at the Pharma SOS Conference in New Orleans in February earlier this year.

In the remainder of this article, we will:

  • Define what a data product is
  • Propose a set of data product design principles
  • Present an overview of data product types
  • Provide examples of common data products in the pharmaceutical industry
  • Explain how data products can help activate Gen AI use cases
  • Present a modern, domain-driven data lake reference architecture
  • Outline how the data value chain may be revolutionized through Gen AI

Data products

Let’s start by defining what a data product is. It is a curated collection of data components, organized and presented in a way that is easy to understand and use, creating a better experience for data consumers and enhancing their trust. It offers superior, consistent, and reliable data understanding and access, which allows consumers to get answers to their questions (or a chain of questions) to support business decisions and outcomes.

Figure 1 — Key characteristics of Data Products.

A data product can also be described using a set of key characteristics. These are presented in Figure 1. During the conference, data products were mentioned in several presentations and panel discussions. The analogy comparing data products to cooking ingredients was especially popular, so we’ll keep using it here to illustrate the key characteristics: data products are the carrots and tomatoes used by a selection of chefs.

  • Inherent value. A data product is valuable in and of itself. If you have high-quality carrots and tomatoes, they have value in themselves, even if we don’t know exactly what we’d do with them. Stick them in front of a chef, and ideas for a dish will emerge instantaneously.
  • Business impact. We must have some idea about how the carrots and tomatoes are going to be used. Perhaps to garnish a broader dish, to be served as raw vegetable snacks, or to be added to a soup. We might not know the exact dishes, but we do have a reasonable idea about their most common uses and can estimate their impact through these applications.
  • Discoverable. They are easy to find and accessible for the intended users. For chefs experimenting with dishes, there is a register that shows what foodstuffs are available and where to find them, including carrots and tomatoes. You don’t want to have to drive an hour to get some — they should be located reasonably close to where they may be needed.
  • Understandable. They are clear, well-labeled, and unambiguous. The chef does not have to wonder what kind of carrots they are or where the tomatoes came from. If needed, one can take a look at the packaging to discover where they were grown and what the nutritional value is.
  • Addressable. If you are a chef running a professional kitchen, you want to know in what fridge you can find the carrots and tomatoes. This should not change overnight. A kitchen performing at a high pace needs reliable inputs — those carrots and tomatoes should be in the same fridge, every day, where they are expected to be.
  • Trusted and curated. Chefs lack the time to sort out imperfect carrots and tomatoes, such as those that are under- or oversized, or have mold, bugs, or discolorations. They expect that the rotten parts have been removed and that they can trust the quality of the ingredients given to them, so that they can focus on making the best possible dish.
  • Secure. Not just everyone should have access to the fridge. If that were the case, there’d be a chance that the food could be used up or tampered with. At the same time, access should be provided to those who should have access — a fridge without a door is of no use.
  • Product orientation. The carrots and tomatoes are managed as a product with customers and a lifecycle. Some chefs might develop a liking for bigger carrots or tomatoes with a particular texture. They might need more or fewer of them. Whatever the demand, it is important that the supply and preparation take into account the estimated and desired use.

Design principles

Figure 2 — Design principles behind the creation and maintenance of Data Products.

Having established what data products are and what they are supposed to enable, a set of design principles behind successful implementations emerges. They are illustrated in Figure 2 and further explained below:

  1. Autonomy and cohesion: Each data product functions as an autonomous, atomic unit that includes all necessary components such as code for data ingestion, transformation, sample data, unit tests, data quality tests, and infrastructure-as-code for provisioning. It also enforces access policies, ensuring that it remains a self-contained entity outputting a single denormalized dataset.
  2. Common development framework: The central IT department supports domain teams by developing a specification language based on the Open Application Model (OAM) for declarative data product definitions (a minimal sketch of such a definition follows this list). This allows teams to autonomously create and manage their data products using a shared platform that handles the CI/CD pipeline and capability registry.
  3. Consistent metadata management: To enhance the searchability and interoperability of data products, a uniform cataloging process is established across domains. This includes standard metadata like unique names, descriptions, ownership, data sharing agreements, data classifications, and distribution rights.
  4. Automated governance and access control: Data product teams can specify access policies programmatically using role-based or attribute-based control methods. The platform integrates corporate identity management with data storage solutions, automating the execution of access controls and ensuring secure data distribution.
  5. Data sharing protocols: Data products support various sharing methods, prioritizing native mechanisms of the storage platform for similar producer-consumer environments (e.g., Redshift, Snowflake). When different storage platforms are used, data copying is considered a last resort, with strict adherence to governance and access controls to maintain security.
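
To make the first four principles concrete, here is a minimal, hypothetical sketch of what a declarative data product definition might look like, written in Python for illustration. The `DataProduct` and `AccessPolicy` classes, field names, and example values are our own assumptions, not an actual OAM schema or platform implementation:

```python
from dataclasses import dataclass, field

@dataclass
class AccessPolicy:
    """Programmatic access control (principle 4): role- or attribute-based."""
    allowed_roles: list[str] = field(default_factory=list)
    required_attributes: dict[str, str] = field(default_factory=dict)

@dataclass
class DataProduct:
    """A self-contained, declarative data product definition (principles 1-3)."""
    name: str                 # unique, catalog-friendly name
    description: str          # human-readable purpose
    owner: str                # accountable domain team
    domain: str               # business domain the product belongs to
    classification: str       # e.g., "internal" or "confidential"
    ingestion_code: str       # path to ingestion/transformation code
    infrastructure: str       # path to the infrastructure-as-code template
    quality_tests: list[str]  # data quality test suite entry points
    output_dataset: str       # the single denormalized output dataset
    access_policy: AccessPolicy = field(default_factory=AccessPolicy)

# A hypothetical sales data product registered by a domain team.
sales_product = DataProduct(
    name="sales_weekly_by_region",
    description="Weekly sales volumes by product and region.",
    owner="commercial-data-team",
    domain="sales",
    classification="internal",
    ingestion_code="pipelines/sales/ingest.py",
    infrastructure="iac/sales_product.tf",
    quality_tests=["tests/test_nulls.py", "tests/test_formats.py"],
    output_dataset="analytics.sales_weekly_by_region",
    access_policy=AccessPolicy(allowed_roles=["sales_analyst"],
                               required_attributes={"region": "US"}),
)
print(sales_product.name)
```

In a real platform, the shared CI/CD pipeline would parse such a definition, provision the declared infrastructure, run the quality tests, and register the product in the capability registry.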

Types, levels and examples of data products

Having explored the key characteristics of and recommended design principles behind data products, let’s acknowledge that not all data products are created equal. They exist in different shapes and types, which is part of what makes the concept confusing: it is not always obvious what is a data product and what is not.

Data products may live in different “stages.” Sometimes, these are referred to as a medallion architecture, where a data product can be promoted from bronze, to silver, to gold.

Such classifications are perfectly in line with the one we’ll maintain for the purposes of this point of view, as presented in Figure 3. We’ve defined four successive levels:

  • Level 1 — Raw/Staged data: This initial level involves raw data from various sources, which is standardized and subjected to basic quality controls such as format standardization and null checks. It also includes the addition of audit columns like load ID and date, maintaining a comprehensive history by each load date to track data lineage.
  • Level 2 — Conformed data: At this level, raw data is processed and transformed into a normalized dimensional data model. This stage consolidates historical data and ensures data integrity and consistency through rigorous normalization and conformance processes, facilitating easier access and analysis.
  • Level 3 — Analytics-ready data: Data at this stage is cross-functional, integrated with master identifiers, and organized into denormalized, flat datasets. This level focuses on ensuring data consistency across different subject areas, integrating common business rules, and precalculating key performance indicators (KPIs) to support analytics.
  • Level 4 — Fit for purpose data: The most refined level, designed to meet specific needs of consuming applications, often customized for particular business functions like marketing analytics, patient analytics, and return on investment (ROI) calculations in industries such as pharmaceuticals. This data is tailored to drive specific business actions and decisions.

Figure 3 — Successive levels of data products, increasingly tailored to meet specific business needs.

The first two levels are classified as source-oriented data products, as the data continues to be structured mostly in line with how it was sourced. The last two levels are consumer-oriented as, indeed, the data products have been more substantially transformed for specific uses of the data. Let’s take a look at the pharmaceutical industry to see what source- and consumer-oriented data products might look like.
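
As a rough illustration of how data might be promoted through the first three of these levels, consider the minimal pandas sketch below. The column names, rules, and KPI are invented for illustration and do not represent a real pipeline:

```python
import pandas as pd

# Level 1 (Raw/Staged): standardize formats, basic quality checks, audit columns.
raw = pd.DataFrame({
    "hcp_id": ["H001", "H002", None],
    "sale_date": ["2024-01-03", "2024/01/04", "2024-01-05"],
    "units": [10, 5, 7],
})
staged = raw.dropna(subset=["hcp_id"]).copy()                  # basic null check
staged["sale_date"] = pd.to_datetime(staged["sale_date"].str.replace("/", "-"))
staged["load_id"] = "L0001"                                    # audit columns
staged["load_date"] = pd.Timestamp.today().normalize()

# Level 2 (Conformed): normalize into a simple dimensional model.
dim_hcp = staged[["hcp_id"]].drop_duplicates()
fact_sales = staged[["hcp_id", "sale_date", "units"]]

# Level 3 (Analytics-ready): denormalized, flat dataset with a precalculated KPI.
analytics_ready = (
    fact_sales.merge(dim_hcp, on="hcp_id")
    .assign(week=lambda df: df["sale_date"].dt.isocalendar().week)
    .groupby(["hcp_id", "week"], as_index=False)["units"].sum()
    .rename(columns={"units": "weekly_units"})
)
print(analytics_ready)
```

A Level 4 product would then tailor this output further, for example reshaping the KPIs for a specific marketing analytics or ROI application.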

Source-oriented data products

Source-oriented data products are pivotal for gathering and managing diverse sets of data relevant to business operations and patient care. For instance, master data products are crucial, encompassing databases like customer masters which detail information on healthcare professionals (HCPs), patients, consumers, and their affiliations. Master data may also include product masters that catalog all pertinent details about pharmaceutical products being developed or sold, and employee masters that maintain records on employees, their training, performance evaluations, and customer relationships.

Another example of a group of source-oriented data products comprises sales data. These compile sales figures across various frequencies, business lines, and regions, enhancing the understanding of market reach and performance. They may also track personal activity metrics such as the number of calls made, samples distributed, and involvement in speaker programs, which are essential for assessing the effectiveness of sales strategies.

Data products focused on claims and electronic medical records (EMR) are essential for a comprehensive view of healthcare interactions. These include data products for hospital claims, pharmacy claims, and payer claims from sources like Optum and Truven. Each dataset offers insights into billing and reimbursement patterns that are critical for financial planning and compliance. Specifically, EMR data products, such as those from Flatiron or Humedica, integrate clinical data like prescriptions (Rx) and diagnoses (Dx) from various healthcare providers, offering a rich source of real-world evidence that can support clinical studies and patient care strategies.

Consumer-oriented data products

Consumer-oriented data products are designed to support specific business functions and decision-making processes that directly interact with and influence customer relations and market strategies. For example, the HCP 360 data product provides a comprehensive view of healthcare professionals (HCPs), integrating data across multiple touchpoints to support use cases like field reporting, account profiling, segmentation, and omnichannel orchestration. This product helps pharmaceutical companies tailor their engagement strategies, optimize promotional responses, and enhance overall HCP relationship management.

Another essential data product may be Value Access & Pricing, which offers insights into the complex dynamics of drug pricing and market access. This product supports a range of analytical applications including contract analytics, copay analytics, and distribution channel analysis. It also aids in more strategic areas such as government affairs, health economics, outcomes research, and access strategy formulation. The data helps companies navigate the regulatory and competitive landscape, predict healthcare pathways, and develop protocols and policies that optimize product pricing and access.

Field Performance is a data product geared towards optimizing sales force activities and effectiveness. It provides metrics and analytics necessary for managing incentive compensation, setting sales goals, crediting sales activities, and reporting field performance. It supports the optimization of sample distribution, enhancing the effectiveness of the sales force. This data product is crucial for pharmaceutical companies looking to maximize the efficiency and impact of their sales teams, ensuring that resources are aligned with market opportunities and company objectives.

These are just examples — for a full list and more details, also for other sectors besides pharma, reach out to either of us.

The linkage to Generative AI

One of the driving forces behind the growing interest in data products is the emergence of generative AI, a type of artificial intelligence that learns from vast amounts of data to create content or generate new data that resembles the original input. This technology can produce anything from text and images to code and music, simulating human-like creativity.

However, the successful deployment of generative AI hinges critically on solid data foundations. Without access to high-quality data from reliable sources, these AI models can become inefficient and potentially biased, leading to outcomes that create harm rather than value. Ensuring the integrity and quality of the data is paramount; without it, you cannot effectively activate the intended use case. Moreover, the deployment of generative AI requires strict ethical and regulatory vigilance and strategic expertise to ensure accuracy, legal compliance, and alignment with business objectives. This is needed to mitigate risks of bias and operational errors, reinforcing the importance of quality data and thoughtful oversight in generative AI projects.

Figure 4 — Generative AI use cases require a minimum maturity across a select set of data management capabilities. Source: https://medium.com/zs-associates/navigating-the-data-management-landscape-in-the-age-of-gen-ai-82a5337a8c00.

We can break this down into a few dimensions. In order to train and deploy models, Gen AI applications need access to enough data that is sufficiently diverse. They may require vast volumes of data if the expected output is complex and unstable. Instability here refers to the fact that identical models may produce different results when given an identical prompt, simply because of how Gen AI models work. In some cases, that variability is fatal, in which case sufficient data is required to train the models. The data also needs to be sufficiently diverse. Gen AI, even more so than most other modeling techniques, reflects the diversity of the data it has been and is being given. If your model is trained on social interactions where 95% were with people aged 25 or younger, it might not perform as effectively later on when exposed to people older than 80.

For a similar reason, data quality is massively important. It is arguably the biggest problem, because garbage in remains garbage out. With Gen AI, even when given bad data, the responses tend to still sound elegant and complete. In some cases, they turn out to be made up. The quality of the answers will reflect the quality of the data the model has been given. This also holds when the data it is fed is unstructured. In that case, it is not as easy to implement data quality checks as it is for structured data, but it is nonetheless critical to verify that the right unstructured data is provided.
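
To make this concrete, below is a minimal sketch of the kind of pre-flight profiling this implies, checking both completeness and the diversity skew described above. Thresholds and column names are illustrative assumptions:

```python
import pandas as pd

def profile_for_genai(df: pd.DataFrame, age_col: str = "age") -> list[str]:
    """Flag basic completeness and diversity issues before using data for Gen AI.

    All thresholds and column names are illustrative assumptions.
    """
    issues = []

    # Quality: completeness per column (garbage in remains garbage out).
    for col, null_rate in df.isna().mean().items():
        if null_rate > 0.05:
            issues.append(f"{col}: {null_rate:.0%} missing values")

    # Diversity: flag heavy skew toward one age bracket, echoing the
    # "95% aged 25 or younger" example above.
    if age_col in df.columns:
        share_young = (df[age_col] <= 25).mean()
        if share_young >= 0.9:
            issues.append(f"{age_col}: {share_young:.0%} of records are 25 or younger")

    return issues

interactions = pd.DataFrame({
    "age": [19, 22, 24, 23, 21, 20, 22, 24, 23, 81],
    "text": ["hi"] * 9 + [None],
})
print(profile_for_genai(interactions))
# ['text: 10% missing values', 'age: 90% of records are 25 or younger']
```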

Beyond these more general foundations around data, depending on the use case in question, there may be more specific requirements. The model might require annotated data, for example for training purposes. It might need a sufficient amount of historical data, or separate data that can be used for validation and testing. And if your use case requires real-time data, as many use cases with live call center agents do, then data needs to be made available in real time, quickly and reliably. This is not just about integrating data sources, but also about making sure that only the right data is shared, and only with the people or applications that should have access to it.

Many of the points mentioned in the previous paragraphs are age-old data governance challenges, and they have not gone away. They are well-understood problems about how data can be managed appropriately and brought to the right use cases. Here we come back to data assets and data products, a concept that is gaining ever more traction; many companies have been able to activate various use cases based on a selected set of data products. The key thing to understand is that it is not about governing all data everywhere to the same standards, but about focusing on the data that is most strategic and most important. Once you know what data is the most critical, you can prioritize managing exactly that data as an asset or product. This will maximize the ROI on your investments in foundational data capabilities.

How to quickly measure your organization’s data readiness for Gen AI

Our research has revealed distinct patterns and best practices among companies that have successfully built foundational maturity and achieved initial business impacts with their use of generative AI, compared to those that continue to struggle and lag behind. The following 13 capability areas were identified as critical for business success:

  1. Strategy and vision: Establishes the foundational framework for Generative AI initiatives, including the creation of a strategic plan, setting AI goals, and allocating investments and budgets.
  2. Organizational structure and operating model: Defines roles, responsibilities, and the centralization of operations. This includes setting up decision-making frameworks, implementing change management programs, and managing stakeholders.
  3. Center of excellence (CoE): Focuses on building a specialized team to lead and support Generative AI efforts, including training on best practices, and deploying tools and accelerators to streamline processes.
  4. Use cases and applications: Identifies potential Generative AI applications, links them with necessary data sources, assesses feasibility, and establishes business ownership for each use case.
  5. Data: Ensures the availability of diverse and high-quality data, maintaining historical and annotated datasets, and providing real-time data access for ongoing validation and testing.
  6. ROI and value generation: Develops methods to measure the benefits of Generative AI projects, defines relevant KPIs and metrics, and crafts detailed business cases to underscore the value.
  7. Model building and training: Involves selecting appropriate foundational models, training these models with robust datasets, and continuously evaluating and monitoring their performance.
  8. Deployment and operation: Reengineers processes to integrate Generative AI solutions, monitors performance and utilization, and automates workflows to enhance operational efficiency.
  9. Talent and skills: Focuses on attracting and retaining skilled professionals, providing training and opportunities for reskilling or upskilling, and fostering interdisciplinary teams.
  10. Governance, ethics and compliance: Addresses ethical considerations, ensures AI transparency, complies with regulatory standards, and sets policies for responsible AI usage.
  11. Technology infrastructure: Equips organizations with the necessary Generative AI tools, robust data platforms, adequate computational resources, and supports system integration and exploration.
  12. Data security: Implements stringent security measures such as encryption, strict access controls, safeguards against data leakage, and conducts regular security audits.
  13. Innovation, ecosystem and partnerships: Encourages ongoing research, fosters external collaborations, and forms technology alliances to stay at the forefront of Generative AI development and application.

In response to the rising interest in generative AI, at ZS we have built an accelerator to swiftly evaluate and identify maturity levels and gaps in the 13 above-referenced foundational data capabilities. For more information on this, reach out to Shri Salem or Willem Koenders. For a more detailed discussion of the foundational data capabilities required for generative AI use cases, read further here.

Gen AI supporting data management

We have established that robust data management and governance are critical to enabling generative AI within organizations to effectively activate relevant use cases, especially with data products as key enablers. However, it’s also interesting and important to explore the reverse interaction — how generative AI can be integrated into and enhance the data management landscape.

Within a modern domain-driven data lake architecture, as depicted in Figure 5, the data mesh lies at the heart of the system. This mesh connects various data products, typically organized within specific domains. Elements such as an augmented data catalog and knowledge graphs play crucial roles in managing metadata and democratizing access to these data products, which are showcased in a data marketplace and made available for diverse applications including AI/ML, business intelligence, or integration into downstream business processes.

Figure 5 — An example of a domain-driven reference architecture for a data lake. © ZS Associates.

Now, such a data lake architecture can help to establish and operate a data value chain, of which Figure 6 presents a simplified view. This data value chain involves several key stages that transform raw data into valuable insights for decision-making. It starts with Data Acquisition, where data is collected from various sources such as sales transactions, sensors, or user feedback. This is followed by Data Transformation, where the gathered data is cleaned to remove errors, transformed to standardize formats, and organized for easy analysis. After the data is processed, it moves to the Consumption and Analytics stage, where it is analyzed to extract useful information, such as identifying trends or making predictions, which informs business decisions. Throughout these stages, Operations and Maintenance ensure that the data processes run smoothly, systems are updated, and issues are addressed promptly. This ongoing support enhances the efficiency and effectiveness of the data systems, ensuring the reliability and utility of data across the value chain. Generative AI has the potential to transform each of these four components of the data value chain, which is what we’ll explore next.

Figure 6 — An overview of the data value chain and how generative AI can be utilized to enhance it. © ZS Associates.

Data acquisition

Generative AI can significantly enhance the data acquisition process by analyzing and tagging existing data sources in relation to specific use cases. By evaluating data models and metadata details, it can auto-generate ontologies based on domain contexts provided as external inputs. This AI-driven approach acts as a prompt-driven catalog, storing intricate details such as data source specifics and KPI definitions, facilitating a deeper understanding and organization of data assets.

Additionally, generative AI can cross-reference available sources with those in the marketplace to identify gaps and create a prioritized list of suggestions. This not only streamlines the data acquisition strategy but also ensures that the data ecosystem is robust and aligned with organizational needs, making the process of integrating new data sources more efficient and targeted.
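
As an illustration of this "prompt-driven catalog" idea, here is a hypothetical sketch using the OpenAI Python client (any LLM client would work); the prompt wording, model choice, and metadata fields are our assumptions, not a production design:

```python
from openai import OpenAI  # assumes the openai package; any LLM client would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def tag_data_source(table: str, columns: list[str], domain_context: str) -> str:
    """Ask an LLM to propose catalog metadata for a data source.

    Prompt wording, model choice, and output format are illustrative; a
    production version would validate the response against the catalog schema.
    """
    prompt = (
        f"You are cataloging data sources for the {domain_context} domain.\n"
        f"Table: {table}\nColumns: {', '.join(columns)}\n"
        "Propose: (1) a one-sentence description, (2) candidate ontology terms, "
        "and (3) KPIs this source could feed."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(tag_data_source("rx_claims", ["claim_id", "ndc_code", "fill_date"],
                      "pharmacy claims"))
```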

Data transformation

In the transformation stage, generative AI can revolutionize the way code is developed and maintained. By creating a cookbook of prompts, it enables the generation of a code library that can ingest industry-standard datasets, apply specific processes (such as those unique to the pharmaceutical industry), and produce a base orchestration code that is compatible across various cloud platforms. This capability also includes seamless migration of code from one programming language to another, such as from SAS to Python or Spark, by simply feeding the existing code library into the system.

Generative AI further enhances developer support by evaluating scripts, summarizing them, acting as a debugger during development, and automatically adding code comments. These features significantly reduce manual effort, minimize errors, and improve the efficiency of data transformation processes.
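
A minimal sketch of what such a "cookbook of prompts" could look like for code migration follows; the prompt templates and model choice are illustrative assumptions:

```python
from openai import OpenAI  # illustrative; any LLM client could be substituted

client = OpenAI()

# A hypothetical cookbook of reusable prompt templates.
PROMPT_COOKBOOK = {
    "sas_to_pyspark": (
        "Translate the following SAS program to idiomatic PySpark. "
        "Preserve the business logic exactly and comment each step.\n\n{code}"
    ),
    "summarize_script": "Summarize what this script does in three bullets:\n\n{code}",
}

def run_recipe(recipe: str, code: str) -> str:
    """Apply a cookbook prompt to a piece of code and return the LLM's answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": PROMPT_COOKBOOK[recipe].format(code=code)}],
    )
    return response.choices[0].message.content

print(run_recipe("sas_to_pyspark", "proc sort data=sales; by region; run;"))
```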

Consumption & analytics

Generative AI can transform the consumption and analytics stage by automating the configuration of business settings based on existing data points. This includes tasks such as product mastering, geo-tagging, and customer segmentation, which are typically resource-intensive.

By profiling external sources and mastered cross-references, generative AI can also suggest potential matches or merges with high accuracy, thus enhancing the quality of data integration.
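
As a simple stand-in for such match suggestions, the sketch below uses plain string similarity from Python's standard library; a real implementation would likely combine multiple attributes with a trained matching model, and the threshold shown is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def suggest_matches(external: list[str], mastered: list[str],
                    threshold: float = 0.75) -> list[tuple[str, str, float]]:
    """Suggest likely matches between external records and mastered records.

    String similarity is a simple stand-in for the model-driven matching
    described above; the threshold is an illustrative assumption.
    """
    suggestions = []
    for ext in external:
        for master in mastered:
            score = SequenceMatcher(None, ext.lower(), master.lower()).ratio()
            if score >= threshold:
                suggestions.append((ext, master, round(score, 2)))
    return suggestions

print(suggest_matches(["Dr. Jane A. Smith", "Jon Doe MD"],
                      ["Jane A Smith", "John Doe"]))
# [('Dr. Jane A. Smith', 'Jane A Smith', 0.83), ('Jon Doe MD', 'John Doe', 0.78)]
```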

Additionally, it contextualizes self-serve capabilities, enabling users to input natural language queries and receive automated insights. This augmented analytics approach reduces the burden of data interpretation and supports anomaly detection, making data more actionable and decision-making more informed.

Operations & maintenance

Generative AI can greatly improve operations and maintenance by automating routine activities and reducing the cost associated with “keep the lights on” (KTLO) operations. For example, it can provide detailed root cause analyses (RCA) of operational failures and share these insights with relevant stakeholders, enhancing transparency and accountability. Or, by analyzing historical data loads and comparing them with current run-timings, generative AI can predict potential SLA breaches and alert the necessary teams before issues become critical.
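
As an illustration of the SLA idea, here is a deliberately simple heuristic sketch in Python; the projection logic and thresholds are assumptions for illustration, not a production alerting rule:

```python
import statistics

def sla_breach_likely(history_minutes: list[float], elapsed_minutes: float,
                      sla_minutes: float) -> bool:
    """Flag a likely SLA breach by comparing the current run against history.

    Illustrative heuristic: project the finish time as the larger of the
    elapsed time and the historical median, plus one historical standard
    deviation as a buffer.
    """
    typical = statistics.median(history_minutes)
    buffer = statistics.stdev(history_minutes)
    projected = max(elapsed_minutes, typical) + buffer
    return projected > sla_minutes

history = [42, 45, 47, 44, 51, 43]  # past load run-times in minutes
if sla_breach_likely(history, elapsed_minutes=53, sla_minutes=55):
    print("Alert: tonight's load is trending toward an SLA breach")
```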

Additionally, generative AI can be used to govern access control and apply data restrictions based on user roles and personas, ensuring that data security and compliance are maintained across the board.

Closing thoughts

As we have explored in this article, the integration of generative AI within data management strategies is a transformative shift that offers advancements in how data is acquired, transformed, and utilized. As companies continue to navigate this terrain, the symbiotic relationship between data governance and AI technologies will become crucial for achieving long-term success.

Have a comment or question? Feel free to drop it in the comments, or reach out to either of us (Abhinav Batra, Willem Koenders).
