Are you susceptible to adverse drug reactions?

Prediction and insights using LightGBM and SHAP — Part one

Chiao-Feng Lin
DNAnexus Science Frontiers
5 min read · Feb 16, 2021


Adverse Drug Reactions

Adverse Drug Reactions (ADRs) are, by definition, harmful and can be lethal. Pharmacogenomics (PGx) is the study of how genes affect a person’s response to drugs; it is a literal embodiment of personalized/precision medicine. Advances in PGx have led to the establishment of gene-drug-specific dosing guidelines, and implementing such guidelines in clinical settings can reduce the incidence of ADRs. Unfortunately, dosing guidelines are available for only 112 drugs, a small fraction of the more than 20,000 prescription drug products approved for marketing. Furthermore, while some ADRs are predictable from a drug’s known pharmacology, others (idiosyncratic ADRs) are not.

Being able to predict whether some individuals are more susceptible to idiosyncratic ADRs than others would be useful in both research and clinical contexts. UK Biobank (UKB), with its wealth of real-world clinical data, questionnaires, and genomic data for nearly half a million volunteers, is a treasure trove of information for addressing this question. Machine learning is a natural approach for predicting such outcomes from data, and boosting algorithms (which convert weak learners into strong ones via an iterative process) are powerful tools for classification and other machine learning tasks. We chose LightGBM, a decision-tree-based gradient boosting framework, as our modeling tool. As with many human traits, genetics plays a role in drug response but is not the only factor, so we included both genomic and non-genomic (phenotypic) data as features in our modeling.

Among machine learning algorithms, decision trees are more transparent than many others, such as deep learning models. LightGBM’s “feature importance” functionality provides some degree of interpretability. However, Lundberg et al. demonstrated that common methods for calculating feature importance in decision-tree algorithms (including those implemented in LightGBM) assign importance inconsistently. They developed SHAP (SHapley Additive exPlanations) and related tools to address interpretability for machine learning more broadly. In this first part of a two-part blog post, we use real-world clinical data and machine learning to build a prediction model. In part 2, we will apply SHAP TreeExplainer to our LightGBM results to gain insights about ADRs from the UKB participants.

Participants and Phenotypic features

UKB provides extensive health data from about 500,000 volunteer participants. Its nearly 7,000 data fields include electronic health records, images, and questionnaires. The structured fields span various data types, such as continuous values (e.g., height, blood pressure) and single- or multiple-choice categorical values, which are stored in encoded form. Many fields contain data collected at multiple time points.

Querying and extracting such non-uniformly structured, high-volume data is daunting. On the DNAnexus Apollo platform, we ingested the UKB data into an Apache Spark database and worked with it through a user-friendly Cohort Browser (Figure 1). We used the Cohort Browser to define our case and control participants and applied quality control (QC) criteria to build the cohorts. We classified positive cases as individuals with ICD10 codes related to poisoning by drugs (T36-T50, etc.) in either the main or the secondary diagnoses field. The Cohort Browser provides an easy, intuitive interface that allowed us to quickly find our fields of interest, review the distributions of the data, and save the cohorts. It was a quick way to experiment with cohorts based on different criteria before finalizing our selections.
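As a rough illustration of this step outside the Cohort Browser, the sketch below flags participants whose diagnosis fields contain ADR-related codes. The column names, file name, and pipe-separated ICD10 encoding are assumptions made for the example, not the actual UKB field layout.

```python
import pandas as pd

# Flag participants whose hospital diagnosis fields contain ADR-related ICD10
# codes. Column names and the pipe-separated encoding are illustrative only.
ADR_PREFIXES = tuple(f"T{i}" for i in range(36, 51))  # T36-T50: poisoning by drugs

def has_adr_code(cell):
    """True if any ICD10 code in a pipe-separated cell starts with an ADR prefix."""
    if pd.isna(cell):
        return False
    return any(code.strip().startswith(ADR_PREFIXES) for code in str(cell).split("|"))

pheno = pd.read_csv("participants.csv", index_col="eid")  # hypothetical extract
pheno["adr_main"] = pheno["diag_icd10_main"].apply(has_adr_code)
pheno["adr_secondary"] = pheno["diag_icd10_secondary"].apply(has_adr_code)
```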

After creating our case and control cohorts, we brought the cohort objects into the DNAnexus JupyterLab environment. Using the saved cohort objects and the Apollo dxdata package, we extracted 450 UKB fields related to medical history, biological sample assay results, anthropometry, and more as phenotypic features (184 features after QC).
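The exact QC criteria are not detailed here, but a minimal sketch of the kind of field-level filtering involved might look like the following, assuming the extracted fields sit in a pandas DataFrame named pheno_fields; the 20% missingness cutoff is purely illustrative.

```python
# Field-level QC sketch: keep fields with at most 20% missing values.
# pheno_fields is assumed to be a pandas DataFrame of the 450 extracted fields.
missing_frac = pheno_fields.isna().mean()
pheno_features = pheno_fields.loc[:, missing_frac <= 0.20]
print(f"{pheno_fields.shape[1]} fields extracted, {pheno_features.shape[1]} kept after QC")
```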

Figure 1: Selecting participants using complex criteria within a cohort browser on the DNAnexus Apollo platform

Genomic features

UKB provides SNP array data for nearly all individuals in its database, and whole-exome (WES) and whole-genome sequencing (WGS) data for subsets of individuals. To maximize the number of samples we could analyze, we chose the imputed SNP array data. We focused on protein-coding variants with moderate or higher predicted impact within genes involved in drug-related pathways curated by The Pharmacogenomics Knowledge Base (PharmGKB), an NIH-funded resource that provides information about how human genetic variation affects response to medications. PharmGKB collects, curates, and disseminates knowledge about clinically actionable gene-drug associations and genotype-phenotype relationships.

Having chosen an analysis strategy, we used popular tools — PLINK, snpEff, and bcftools — for QC, mutation effect annotation, and filtering. We also applied linkage-disequilibrium pruning to remove highly correlated variants. This yielded 513 genomic features.
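For readers who want a concrete starting point, the sketch below strings these tools together from Python. The file names, the snpEff database name, and the thresholds are illustrative assumptions; only the tools and flags themselves are standard.

```python
import subprocess

def run(cmd):
    """Run a shell command and stop if it fails."""
    subprocess.run(cmd, shell=True, check=True)

# 1) Restrict the (indexed) imputed VCF to PharmGKB gene regions, then annotate
#    variant effects with snpEff. The database name is only an example.
run("bcftools view -R pharmgkb_genes.bed imputed.vcf.gz -Oz -o pgx.vcf.gz")
run("java -jar snpEff.jar GRCh38.99 pgx.vcf.gz > pgx.ann.vcf")

# 2) Keep variants whose snpEff annotation is MODERATE or HIGH impact.
run("bcftools view -i 'INFO/ANN ~ \"MODERATE\" || INFO/ANN ~ \"HIGH\"' "
    "pgx.ann.vcf -Oz -o pgx.impact.vcf.gz")

# 3) Basic QC, LD pruning, and export of an additive (0/1/2) genotype matrix
#    with PLINK; thresholds are illustrative.
run("plink --vcf pgx.impact.vcf.gz --maf 0.01 --geno 0.05 --make-bed --out pgx_qc")
run("plink --bfile pgx_qc --indep-pairwise 50 5 0.2 --out pruned")
run("plink --bfile pgx_qc --extract pruned.prune.in --recode A --out genomic_features")
```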

LightGBM modeling

Installing and running LightGBM is straightforward. The documentation on its ReadTheDocs site and the examples on GitHub are useful resources. Once we had extracted the phenotypic and genomic features, we performed two additional preprocessing steps before starting LightGBM modeling in the JupyterLab environment: 1) we removed features that are invariant across all samples, because such features cannot help classify individuals; 2) we explicitly specified categorical features so that LightGBM could handle them correctly.
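A minimal sketch of these two steps, assuming the merged phenotypic and genomic features are in a pandas DataFrame; the file name and the rule for spotting categorical columns are placeholders.

```python
import pandas as pd

# Merged phenotypic + genomic feature table; file name is a placeholder.
features = pd.read_csv("model_features.csv", index_col="eid")

# 1) Drop features with a single unique value: they can never split a tree node.
features = features.loc[:, features.nunique(dropna=False) > 1]

# 2) Mark categorical columns explicitly so LightGBM treats them as unordered
#    categories rather than numbers. The naming rule below is a placeholder.
cat_cols = [c for c in features.columns if c.endswith("_cat")]
for c in cat_cols:
    features[c] = features[c].astype("category")
```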

We experimented with different combinations of methods (binary and multi-class classification) and predictive features (genomic only, phenotypic only, and geno+pheno). UKB provides two summary fields for diagnoses made during hospital inpatient admissions, main and secondary, both encoded with ICD10 codes. We designated: 1) class1: individuals with ADR-related ICD10 codes in the main diagnosis field (N=1,478); 2) class2: individuals with ADR-related ICD10 codes in the secondary diagnosis field (N=26,024); 3) class3: individuals with ADR-related ICD10 codes in both fields (N=1,828); 4) class0: individuals with no ADR-related ICD10 codes in either field. For binary classification, class1, class2, and class3 were pooled together as the case class, with class0 as the control. We divided all samples into a training set and a testing set and evaluated the GBM models with the area under the ROC curve (AUC).
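The sketch below shows one way to encode these classes and train a multi-class LightGBM model, reusing the pheno flags, features table, and cat_cols from the earlier sketches. It assumes class1 and class2 exclude the class3 individuals, and the split ratio and parameters are placeholders rather than our tuned values.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

# Assign the four classes from the adr_main / adr_secondary flags, aligning the
# flags to the feature table built above.
flags = pheno.loc[features.index, ["adr_main", "adr_secondary"]]
y = np.select(
    [flags["adr_main"] & ~flags["adr_secondary"],   # class1: main diagnosis only
     ~flags["adr_main"] & flags["adr_secondary"],   # class2: secondary only
     flags["adr_main"] & flags["adr_secondary"]],   # class3: both fields
    [1, 2, 3],
    default=0,                                      # class0: no ADR codes
)
y_binary = (y > 0).astype(int)  # classes 1-3 pooled as cases for the binary runs

X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, stratify=y, random_state=42)

params = {"objective": "multiclass", "num_class": 4, "metric": "multi_logloss"}
booster = lgb.train(
    params,
    lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols),
    num_boost_round=500,
)
proba = booster.predict(X_test)  # predicted class probabilities, shape (n, 4)
```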

Phenotypic vs. genomic: Among the experiments we ran, the phenotypic-only multi-class model achieved the best performance (AUC 0.73), while genomic-only models performed poorly (AUC ~0.5). Pheno+geno performed slightly worse than phenotypic-only, suggesting that adding more data may introduce noise that confuses LightGBM. However, we are also mindful that, because we have not applied this model to other datasets, we may be at risk of overfitting.

Binary vs. multi-class: Binary classification performed worse (AUC 0.69; not shown) than multi-class classification (AUC 0.72; Figure 2), probably reflecting the varying severity of ADRs between individuals who have ADR ICD10 codes in the main diagnosis field and those who have them in the secondary diagnosis field.
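One-vs-all ROC curves like those in Figure 2 can be produced from the held-out predictions; the sketch below assumes the y_test labels and proba probabilities from the training sketch above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

# One-vs-all ROC curves, in the spirit of Figure 2, using the held-out labels
# and predicted probabilities from the sketch above.
y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])

plt.figure()
for k in range(4):
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    plt.plot(fpr, tpr, label=f"class{k} vs. rest (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```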

Figure 2: AUC plots for multi-class One-vs-All (binarized) classification

Now that we have built a model for predicting an individual’s susceptibility to ADRs, what can the model and the UKB data tell us about ADRs? We will explain in part 2 of this post.

Acknowledgements

This research has been conducted using the UK Biobank Resource under application number 46926.
