Lessons learned from working with genomic data from clinical trials

Nelly Hajizadeh
9 min read · Dec 6, 2021


Sina Rüeger & Nelly Hajizadeh

To better understand drug development and its many aspects, such as disease progression or treatment response, access to a large, comprehensive dataset is fundamental. Collecting such extensive data is beyond the scope of many institutions and requires years of collective effort backed by significant funds. At Novartis, we have invested in pooling, harmonizing, and making available our historical clinical trial and research data. data42 is Novartis’ transformational program to leverage patient, clinical and research data to accelerate drug discovery and development. Read more about data42 in articles from the program’s scientific [1] and technical [2] leads.

With the vision and direction of data42 in mind, we, two data scientists within data42, would like to share an insider’s view of the realities of digging into this pooled data. Think of it as a letter from the trenches.

Having access to the unique data source offered by data42 has forced us to reconsider our ingrained approaches and presented us with a set of unique analytical challenges. To start off, we would like to give you some background on clinical trials, genomic data and research, before diving into how data42 has changed life for us data scientists at Novartis.

Some important background on clinical trials

In a randomized clinical trial (RCT), the aim is to determine whether a drug is efficacious in a certain set of at-risk individuals while also being safe. While the specific data a clinical trial collects vary, they usually include lab values, demographics, drug response, safety events (e.g. the incidence of a heart attack), and medical history. Some trials also record the patient’s current and previous medications, as well as any substance use (e.g. smoking or alcohol). Clinical trials are highly systematized processes: patients are randomized in a blinded manner and followed up longitudinally to monitor safety and lab measurements. To limit the number of patients exposed to an experimental treatment until the drug is proven safe and effective, the patient set is modest (ranging from 10 to 5000, depending on the phase of the trial).

This stands in contrast to alternative public data sources such as Biobank data (real-world data/evidence, RWE), where the patient set is significantly larger and more diverse. Biobanks achieve this by collecting information from the general population (no strict inclusion/exclusion criteria), but at the expense of complete longitudinal data: the data is often self-reported through questionnaires, and specific drug responses and smaller safety events are not available. Clinical trial data has high internal validity, whereas that of Biobank data is generally lower, although this varies from Biobank to Biobank.

In summary, clinical trials collect data primarily to establish primary endpoints, such as safety and efficacy. Biobanks, on the other hand, are intended for generating hypotheses about links between real-world phenomena and outcomes. Whenever a data source also collects genomic data, it opens up the possibility of designing exploratory studies that link genomic data to the traits reported in that data source.

Brief introduction to genomic data and research

Given genetic and phenotypic data for a set of individuals, consider the following question: how can we determine which genomic regions are involved in a disease or a specific trait (e.g. high blood pressure)? This type of analysis is referred to as genome-wide association analysis (GWAS) [3]. A GWAS systematically tests all the available locations in the genome for an association with a phenotype, for instance height, incidence of type 2 diabetes, or hypertension. A phenotype can be any observable characteristic, including traits such as addiction or more indirect concepts such as drug responses. A GWAS can be a very powerful analysis: it can uncover the genetic architecture of a trait and, since our genes don’t change during our lifetime, it can help establish a patient’s baseline risk for a disease. However, findings from a GWAS can be deceptive, because the signals we are looking for are extremely small and our power to detect them is further compromised by the fact that we test for correlations at several million locations in the genome (the multiple-testing problem [4]). So, to run a conclusive GWAS, we require genetic data for a large patient set. This is why most publications implicating new regions of the genome in a disease or trait feature datasets harvested from large Biobanks. So, if genomic research is contingent on having a large patient set, what value can we derive from genomic data from clinical trials?
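To make the multiple-testing burden concrete, here is a minimal sketch of a Bonferroni-style correction. The number of effectively independent tests (one million) is an assumption chosen for illustration, not a figure from data42; with that assumption the correction reproduces the conventional genome-wide significance threshold.

```python
# Illustrative sketch only: N_TESTS is an assumed round number, not a data42 figure.
N_TESTS = 1_000_000   # assumed number of effectively independent variants tested
ALPHA = 0.05          # desired family-wise error rate


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-variant p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(p_threshold: float, n_tests: int) -> float:
    """Expected number of null variants that pass the threshold purely by chance."""
    return p_threshold * n_tests


threshold = bonferroni_threshold(ALPHA, N_TESTS)
print(f"per-variant threshold: {threshold:.0e}")
print(f"chance hits at p < 0.05: {expected_false_positives(0.05, N_TESTS):,.0f}")
print(f"chance hits at corrected threshold: {expected_false_positives(threshold, N_TESTS):.2f}")
```

Without any correction, tens of thousands of variants would look "associated" by chance alone, which is why such a stringent threshold is needed and why small effects require large patient sets to clear it.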

Genomic data in clinical trials

If a clinical trial team decides to collect genomic data, they are typically looking for specific genetic biomarkers for a certain phenotype relevant to the disease. An example of a genetic biomarker would be a mutation in a gene related to a specific lipid that confers higher susceptibility to coronary artery disease. This narrow focus stands in contrast to a more exploratory type of analysis such as a GWAS, which broadly assesses the genetic architecture of a trait without prior knowledge. The focus is born of necessity, however, as it is not feasible to design and run a clinical trial on a scale that could support a fully powered GWAS. A single trial will therefore almost always yield an underpowered study.
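As a rough illustration of what "underpowered" means here, the sketch below approximates the power to detect a single variant at different sample sizes. It assumes a standardized quantitative trait, an additive model, Hardy-Weinberg equilibrium and a normal approximation to the test statistic; the allele frequency and effect size are made-up values, not estimates from any trial.

```python
from statistics import NormalDist

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def gwas_power(n: int, maf: float, beta: float, alpha: float = GENOME_WIDE_ALPHA) -> float:
    """Approximate two-sided power to detect one variant.

    Assumes a standardized quantitative trait and an additive effect `beta`
    (in standard deviations per allele), so that the test statistic is roughly
    Normal(sqrt(n * variance_explained), 1).
    """
    std_normal = NormalDist()
    variance_explained = 2.0 * maf * (1.0 - maf) * beta ** 2
    shift = (n * variance_explained) ** 0.5
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical variant: 30% allele frequency, 0.05 SD effect per allele.
for n in (500, 5_000, 50_000, 500_000):
    print(f"n={n:>7,}  power={gwas_power(n, maf=0.3, beta=0.05):.3f}")
```

Under these assumptions, a trial-sized cohort has essentially no chance of reaching genome-wide significance for a small effect, while a Biobank-sized cohort detects it almost every time.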

Building a more nuanced understanding of drug response thanks to cross-trial and real world evidence data in one place

With data42, however, the pooled and harmonized data give us the possibility of running a GWAS on clinical trial data, and of doing so on relatively unique phenotypes, such as drug responses.

This brings us to the starting point for data scientists at data42. We have a set of clinical trials, complete with phenotypes ranging from vital signs and comorbidities to drug responses and questionnaires, together with the patients’ genomic data. We start digging into the data, trying to reproduce some model excavations done on Biobank data. However, we realize that we need to take another approach: even when combining data from multiple clinical trials, we are still relatively underpowered compared to a Biobank and, in addition, the clinical data is of a different nature. This must be explicitly acknowledged in how we approach our analysis. We put our shovels to the side and re-examine our strategy.

What have we learned?

1. Think about the patient population and how this fundamentally dictates the signals

The inclusion/exclusion criteria of RCTs mean that we have collected data on a specific profile of individuals from the general population. These individuals are relatively similar, as they are all at-risk patients or suffering from a disease. Within a study they differ relatively little from each other (low intra-study difference), but in the context of pooled data the studies are quite different from one another (large inter-study difference). In contrast, Biobanks recruit from the general population at a much larger scale, so the differences between individuals in their genomic and phenotype data should be spread far more uniformly. If we were to draw a picture of what the population looks like in Biobanks and RCTs, it would look something like Figure 2. This means that when doing genomic analysis on RCT data, we need to think extra carefully about which factors to control for, for example by explicitly accounting for the trial itself (as a covariate) or by sub-sampling the patients based on features such as age or inflammatory markers. Failing to do so might result in spurious correlations and false positives, but also in false negatives (absence of signals).

Figure 2: a. Biobank patients are sampled from a larger population and can be thought of as largely independent. Biobanks, of course, are not without artefacts: it could be argued that people who join a volunteer project (UK Biobank; UKBB), who can afford medical care (USA), or who come from a genetically distinct population (FinnGen) are not a full or uniform representation of the population.

b. Patients from clinical trial data are more homogeneous within their indication but differ significantly from each other between indications. When seen in the context of a pooled data source, they give the data a distinctly clustered quality.
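The effect of this clustering, and of accounting for the trial as a covariate, can be sketched with a small simulation. Everything below is hypothetical: two made-up trials differ in allele frequency and in mean phenotype, and the variant has no true effect. Centering genotype and phenotype within each trial is used here as a simple stand-in for adding a per-trial intercept to the regression; it is not a description of the pipeline used at data42.

```python
import random
from collections import defaultdict


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def center_within_groups(values, groups):
    """Subtract each group's mean, which is what a per-trial intercept absorbs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    return [value - totals[group] / counts[group] for value, group in zip(values, groups)]


def simulate_pooled_trials(seed=42, n_per_trial=2000):
    """Two hypothetical trials; the variant has NO true effect on the phenotype."""
    rng = random.Random(seed)
    # (allele frequency, mean phenotype) per trial: both differ between trials.
    settings = {"trial_A": (0.10, 0.0), "trial_B": (0.40, 1.0)}
    genotype, phenotype, trial = [], [], []
    for name, (freq, mean) in settings.items():
        for _ in range(n_per_trial):
            dosage = (rng.random() < freq) + (rng.random() < freq)  # 0, 1 or 2 copies
            genotype.append(dosage)
            phenotype.append(rng.gauss(mean, 1.0))
            trial.append(name)
    return genotype, phenotype, trial


genotype, phenotype, trial = simulate_pooled_trials()
naive = ols_slope(genotype, phenotype)
adjusted = ols_slope(center_within_groups(genotype, trial),
                     center_within_groups(phenotype, trial))
print(f"naive pooled slope:   {naive:+.3f}")
print(f"trial-adjusted slope: {adjusted:+.3f}")
```

In this toy setting the naive pooled analysis reports a clear association that exists only because the trials differ, while the trial-adjusted estimate stays close to zero.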

Related to the issue of population characteristics is the concern about a potential lack of diversity in clinical trials and, more broadly, in healthcare and medical research. In genomics, it is well known that most results produced to date are most relevant for individuals of European descent. This is because Biobanks, like clinical trials, feature a majority of these individuals in comparison to other populations, and while the results may translate to some degree to other ancestries, the relevance of the findings decreases with increasing genetic distance. The need to ensure diversity is also a known topic in the pharmaceutical industry. At Novartis, we are running several initiatives to address this gap. Nonetheless, the bias matters when it comes to the practicalities of the analysis, where we either explicitly account for ancestry (as a covariate) or, often, simply disregard the minority populations altogether in order to observe reliable signals.

2. Redefining standard intentions

Running a GWAS on data from an RCT will most likely be underpowered, especially when compared to the ones published from Biobanks. However, we argue that such an analysis is still useful to scope out important leads and, in a more general sense, could be used as a control or confirmation for signals that we expect to see in Biobank data. We call our analysis a “humble GWAS”, where the aim is not to push the state of the art in signal detection but to be of practical use. The humble GWAS guides the direction of research by giving hints on which genomic regions may be implicated; even signals below the threshold of significance may be seen as hints. Following up leads below the threshold of significance requires extra vigilance, often by connecting to other data sources such as expression data or lab measurements: in short, any complementary method that could provide some information on the plausibility of the hit.
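One simple way to organize such leads is to sort hits into tiers by p-value. The sketch below is only an illustration: the 5e-8 cut-off is the conventional genome-wide threshold, the 1e-5 "suggestive" cut-off is a commonly used convention that we assume here, and the variant names and p-values are made up.

```python
GENOME_WIDE = 5e-8  # conventional genome-wide significance threshold
SUGGESTIVE = 1e-5   # assumed "suggestive" cut-off, a convention rather than a rule


def triage(p_value: float) -> str:
    """Sort an association result into a follow-up tier based on its p-value."""
    if p_value < GENOME_WIDE:
        return "genome-wide significant"
    if p_value < SUGGESTIVE:
        return "hint: check against complementary data"
    return "no evidence"


# Made-up results for three hypothetical variants.
hits = {"variant_1": 3e-9, "variant_2": 4e-7, "variant_3": 0.02}
for name, p_value in sorted(hits.items(), key=lambda item: item[1]):
    print(f"{name}: p={p_value:.0e} -> {triage(p_value)}")
```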

3. Borrow power from Biobanks

Parallel to the efforts of collecting Novartis clinical data into data42, we are also ingesting the publicly available Biobank data that we have access to. This means that our clinical data is situated, so to speak, directly adjacent to these datasets, making it feasible to design investigations that directly combine results from these data modalities (e.g. through a meta-analysis [3]). Traditionally, clinical and Biobank data were segregated and treated independently by research teams, but we are observing the field transitioning to a more integrative approach that acknowledges how different data sources can be assessed holistically. In the case of genomic research, one approach would be to directly follow up our leads from clinical data in the more statistically powerful Biobanks.
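As one example of combining results across data sources, the sketch below implements a standard inverse-variance-weighted fixed-effect meta-analysis for a single variant. It assumes the effect estimates refer to the same phenotype and effect allele and that the studies are independent; the numbers are made up, and this is not necessarily the method used at data42.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_meta(estimates):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    `estimates` is a list of (beta, standard_error) pairs, one per study,
    all aligned to the same effect allele.
    Returns (pooled_beta, pooled_se, two_sided_p_value).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return pooled_beta, pooled_se, p_value


# Made-up summary statistics: a noisy trial estimate and a more precise Biobank estimate.
studies = [(0.12, 0.05), (0.08, 0.02)]
pooled_beta, pooled_se, p_value = fixed_effect_meta(studies)
print(f"pooled beta={pooled_beta:.4f}  se={pooled_se:.4f}  p={p_value:.1e}")
```

The pooled estimate is pulled towards the more precise study, and its standard error is smaller than that of either input, which is the sense in which a trial-based lead can "borrow power" from a Biobank.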

Conclusions

Clinical trials are very directed processes, conducted in a highly regulated environment with the aim of establishing the efficacy and safety of a new drug. While data collection during a clinical trial remains largely focused on drug efficacy and safety signals, the collection of genomic and other omics data is becoming more common. This is because, in reality, every individual responds differently to treatment, and the ultimate goal is to identify the right treatment for the right patient. So, the advantage of collecting genomic (and other omics) data during a clinical trial is that we can start deconstructing the genetic component of the drug response, thereby taking steps to individualize treatment for the patient. Clinical trials can thus be designed to include the right set of patients, increasing the likelihood that the trial is meaningful and effective.

Investing efforts into other data modalities in clinical trials remains a forward-looking activity, its true potential only really coming into play when data is available across many trials. We observe this potential at data42 where we endeavor to leverage genomic data across all available trials to build a more nuanced understanding of drug response, disease states and clinical trial outcomes.

Stay tuned for more news from data42!

About data42

data42 is Novartis’ transformational program to leverage our resource of patient, clinical and research data — one of the largest and most diverse datasets in the pharma industry — with the ultimate moonshot goal of changing the way we develop medicines.

Contact us at data42.communications@novartis.com.

Read more about data42

Leadership lessons from a grand endeavor

Building the Map of Life, our single source of healthcare R&D data

data42 — Life(science), the universe, and everything agile

How to find and attract hidden gems in the R&D Tech talent universe

A transformative start-up in a corporate culture

References

[1] https://sam-khalil.medium.com/building-the-map-of-life-427bda4ad327

[2] https://pascal-bouquet.medium.com/data42-life-science-the-universe-and-everything-agile-fa2d6bad7562

[3] https://www.nature.com/articles/s43586-021-00056-9

[4] https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.22032
