Part 2: The Importance of Comprehensive Data Corpuses for AI and GenAI Model Development

Generative AI in life sciences

Collin Burdick
Slalom Daily Dose
3 min read · Oct 11, 2023



Read part 1 of this series here.

The last half-decade has seen substantial progress in the implementation of artificial intelligence (AI) and generative artificial intelligence (GenAI) models within life sciences research and development (R&D). The importance of these strides cannot be overstated, yet the path ahead holds intricate challenges.

Historically, the principal source of data has been univariate analysis from hypothesis-driven experiment designs and their results. Repurposing this legacy data for AI is a reactive response to the trend, and the technique is insufficient for the complex demands of contemporary AI and machine learning (ML) models. Meeting those demands requires a paradigm shift toward AI-conducive, multivariate, and time-aware datasets.

Top life science AI leaders are moving to a proactive posture today: they are transitioning from a hypothesis-driven to a data-driven experimentation methodology that enables complete data capture of multivariate inputs and outputs along with time-series data. These datasets in turn support self-supervised learning, in which AI/GenAI models autonomously learn from and label the data. Self-supervision can reduce the biases associated with human labeling and improve the reliability of model outputs. Over time, data-driven experimentation and self-supervised learning are expected to enable a holistic understanding of biological systems, leading to breakthroughs such as drug repurposing, toxicity prediction, and biomarker discovery; even novel targets, pathways, and formulations are thought to be within reach.
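As a deliberately simplified illustration of the idea, the sketch below turns a multivariate time series from an imagined bioreactor run into self-supervised training pairs, where the data itself supplies the prediction targets and no human labels are required. The variable names, shapes, and window size are assumptions for illustration, not a description of any specific platform or client pipeline.

```python
import numpy as np

# Hypothetical bioreactor run: each row is a timestamped observation of
# several process variables captured together (multivariate, time-aware).
rng = np.random.default_rng(42)
timesteps, n_vars = 200, 4  # e.g., pH, temperature, dissolved O2, cell density
run = np.cumsum(rng.normal(scale=0.1, size=(timesteps, n_vars)), axis=0)

def make_self_supervised_pairs(series: np.ndarray, window: int = 16):
    """Turn a raw multivariate series into (context, next-step) training pairs.
    No human labels are needed: the data itself supplies the targets."""
    contexts, targets = [], []
    for t in range(len(series) - window):
        contexts.append(series[t : t + window])  # what the model sees
        targets.append(series[t + window])       # what it must predict
    return np.stack(contexts), np.stack(targets)

X, y = make_self_supervised_pairs(run)
print(X.shape, y.shape)  # (184, 16, 4) (184, 4)
```

Pairs like these can feed forecasting or representation-learning models; the point is that every captured variable and timestamp becomes usable training signal rather than discarded context.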

The CEO of a major life sciences research organization explained:

“We spent five years developing AI models on historical data only to realize we have an upper bound constrained by effective data. We’re scrapping those models and, now, spend over 80% of our time and budget on creating purposeful data sets and are already seeing new models surpass our previous upper bounds given this training data.”

Compensating for common risks

Batch effects are a ubiquitous phenomenon in life sciences research: technical variation introduced by differences in instruments, reagents, operators, or run dates rather than by the biology under study. This variation can muddle true biological variance and inflate experimental variance in data analyses. Failure to correctly identify and address batch effects can result in false interpretations of biological signals, hampering the development of AI and GenAI models. Through meticulous data management, including consistent processing protocols and robust statistical methods, researchers can separate batch effects from biological variation and refine their data to reflect the true biological context more accurately.
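To make the idea concrete, here is a minimal sketch, using made-up feature names and a hypothetical batch column, of the simplest form of batch correction: removing per-batch mean shifts so measurements are comparable across runs. Production pipelines typically rely on more sophisticated methods such as ComBat, but the intent is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical assay table: rows are samples, columns are measured features,
# plus a 'batch' column recording which processing run each sample came from.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(6, 3)),
    columns=["gene_a", "gene_b", "gene_c"],
)
data["batch"] = ["run1", "run1", "run1", "run2", "run2", "run2"]
# Simulate a technical shift affecting everything measured in run2.
data.loc[data["batch"] == "run2", ["gene_a", "gene_b", "gene_c"]] += 2.0

def center_by_batch(df: pd.DataFrame, batch_col: str = "batch") -> pd.DataFrame:
    """Remove per-batch mean shifts so features are comparable across runs."""
    features = df.drop(columns=[batch_col])
    corrected = features - features.groupby(df[batch_col]).transform("mean")
    corrected[batch_col] = df[batch_col]
    return corrected

corrected = center_by_batch(data)
print(corrected.groupby("batch").mean().round(3))  # per-batch means are now ~0
```

The specific method matters less than the practice of recording batch membership alongside the measurements, so that technical variation can be modeled and removed rather than mistaken for biology.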

Clinical data are not immune to complications. A significant amount of variation and potential bias can be introduced during post-processing, where patient biology is translated into physician-understandable language and then transformed into International Classification of Diseases (ICD) codes. This potential for distortion can undermine the training and performance of models, impeding the extraction of meaningful insights. By contrast, imaging data, which typically undergo minimal pre-processing and can be self-labeled, serve as a robust, bias-reduced alternative for AI model training.

The way forward

The development of a comprehensive biological language could revolutionize the way data are interpreted and communicated. Grounded in the context of genes, proteins, and disease states, this lexicon could allow for a more accurate and nuanced understanding of biological data. Because biological entities like genes and proteins have an inherently sequential nature, akin to phrases in human language, large language models (LLMs) can learn this biological language effectively. This application could facilitate the generation of novel hypotheses and therapeutic insights, ultimately improving patient outcomes.
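The sketch below illustrates the analogy in its simplest form: a toy protein fragment (an arbitrary example sequence) is tokenized one amino acid at a time and turned into a masked-language-modeling example, the same self-supervised objective used to pretrain LLMs on text. Real biological language models use learned vocabularies and transformer architectures, so treat this as a conceptual sketch rather than a recipe.

```python
import random

# A toy protein fragment; each amino acid letter is treated as one token,
# just as words or subwords are tokens in a sentence.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(vocab)  # extra id reserved for the [MASK] token

def make_masked_example(seq: str, mask_rate: float = 0.15, seed: int = 0):
    """Build one masked-language-modeling example: the model sees `inputs`
    and must predict the original token id at every masked position."""
    rng = random.Random(seed)
    token_ids = [vocab[aa] for aa in seq]
    inputs, targets = [], []
    for tid in token_ids:
        if rng.random() < mask_rate:
            inputs.append(MASK_ID)  # hide the residue from the model
            targets.append(tid)     # ...but keep it as the training label
        else:
            inputs.append(tid)
            targets.append(-100)    # conventional "ignore" label
    return inputs, targets

inputs, targets = make_masked_example(sequence)
print(inputs[:10], targets[:10])
```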

Enhancing datasets is not merely a volume-based endeavor. It’s crucial to consider the quality, relevance, and real-world applicability of data, as these aspects can influence the training and performance of AI and GenAI models. By focusing on high-quality, real-world data, we can ensure that the models we develop are as accurate, reliable, and impactful as possible. A strategic approach to data collection and utilization will serve as the foundation for the future of AI in life sciences research and development.

What are our next steps to unlock the value of GenAI as it applies to life sciences? In the third part of this series, we address the challenges we’re solving for today to create better tomorrows.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.

Interested in joining our next Bay Area industry roundtable? Find more information here.


Collin Burdick
Global Managing Director @ Slalom, Leading Life Sciences and Go-to-Market