UKB Research Analysis Platform Case Study

Perform genome-wide association analysis to identify exome variants in Alzheimer's disease in UK population

Yih-Chii Hwang, Ph.D.
DNAnexus Science Frontiers
Sep 1, 2021


Prepared by: Yih-Chii Hwang, PhD, Ames Ma and Jason Chin, PhD

Identify exome variants in Alzheimer’s Disease in UK population

The UK Biobank (UKB) has been collecting biosamples of half a million United Kingdom volunteers aged 40–69 since 2006. The initiative has collected extensive phenotypic data including but not limited to disease events, drug prescriptions, and deaths of the participants. It also contains various types of genetic data, including genome-wide genotype array, whole exome sequencing (WES), and whole-genome sequencing (WGS), for each participant.

Recently, UKB launched a cloud-based Research Analysis Platform (RAP), allowing researchers around the world to access and analyze the ultra-rich biobank data more readily. The platform provides storage for the large amount of biomedical data collected by the biobank, along with scalable computing resources for analysis. DNAnexus Research Lab, one of a few early-access groups, is interested in exploring the platform’s ability to run different types of analyses and to gain new insights from this unique data set. Because discovering genotype-phenotype associations is one of the primary use cases for UKB researchers, we wanted to see whether we could efficiently rediscover earlier GWAS results on RAP.

Alzheimer’s disease (AD) is a progressive brain disease that causes a decline in memory and other mental functions. In the United States, 1 in 9 people aged 65 and older have Alzheimer’s dementia. By 2050, the number of individuals with Alzheimer’s is projected to reach almost 13 million [alz.org]. Many environmental and genetic factors are related to the development of Alzheimer’s, and no single factor is currently known to be the sole cause, which makes it difficult to establish comprehensive treatment or prevention. Many researchers have contributed to the field of AD research and have discovered multiple genetic associations with AD [Lambert et al., 2013, Desikan et al., 2015, Jansen et al., 2019].

In this post, we present a case study of performing a GWAS on AD by analyzing UK Biobank phenotypic data and WES genetic data. The end-to-end analysis, from accessing data through sample QC and variant QC to association testing and reporting summary statistics, is executed entirely on RAP, keeping the data and analyses conveniently stored on the secure cloud platform. We are working on releasing a complete step-by-step guide to this analysis on RAP’s tutorial site (in progress).

Methods

Labeling case:control phenotype and sample QC

To identify cases and controls for AD, we searched for “Alzheimer’s disease” in the phenotypic database of our UKB application using the cohort browser. The search returned many data-fields under multiple categories, such as “Hospital inpatient/Summary Diagnosis”, “Death register”, “Cancer register”, and “Family history”. One possible method is to label participants as cases when their main or secondary ICD-10 diagnoses (data-field 41202 and data-field 41204) include an AD code (G30: Alzheimer’s disease, or F00: Dementia in Alzheimer’s disease) (Figure 1). However, the most common form of AD is late-onset (occurring at age 65 or older), and the UKB cohort was aged 40–69 at recruitment. If we use the ICD-10 code as the only source for determining case/control status, we may therefore mislabel at-risk participants as controls.
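The ICD-10-only labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code we ran on RAP: the input layout (a dictionary from participant ID to a combined list of codes from data-fields 41202 and 41204), the participant IDs, and the function names are all assumptions made for the example.

```python
# Hypothetical input: participant ID -> ICD-10 codes pooled from
# data-field 41202 (main) and data-field 41204 (secondary).
# The dictionary layout and the IDs are assumptions for illustration.
AD_PREFIXES = ("G30", "F00")  # G30: Alzheimer's disease, F00: dementia in AD


def has_ad_code(icd10_codes):
    """Return True if any code falls under G30 or F00 (e.g. G301, F00.9)."""
    return any(code.upper().startswith(AD_PREFIXES) for code in icd10_codes)


def label_by_icd10(diagnoses):
    """Map each participant ID to 1 (case) or 0 (control)."""
    return {pid: int(has_ad_code(codes)) for pid, codes in diagnoses.items()}


example = {
    "p1": ["G301", "I10"],  # AD code among other diagnoses -> case
    "p2": ["E119"],         # no AD code -> control
    "p3": ["F009"],         # dementia in AD -> case
    "p4": [],               # no hospital diagnoses -> control
}
labels = label_by_icd10(example)
print(labels)
```

Note that a participant such as "p4", who has no recorded diagnosis, is labeled a control under this rule, which is exactly the mislabeling risk for a cohort that is still young relative to the typical age of onset.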

Figure 1. G30 as the ICD-10 code for Alzheimer’s disease (data-field 41202) presented in RAP Cohort Browser.

Certain heritable genetic risk factors are associated with late-onset AD, so one way to estimate an individual’s risk of developing AD is to also consider the disease status of the participant’s parents and derive a proxy phenotype. We therefore derived a binary trait, AD-by-proxy, based on each participant’s (1) ICD-10 codes, (2) both parents’ disease status, and (3) both parents’ ages [Liu et al., Jansen et al.]. After applying quality control to the 200K WES genotype and phenotype data (tutorial under construction), we found 24,227 cases and 121,674 controls (1 out of 6.02 participants is a case). If we instead identified cases only by the presence of G30 or F00 among their ICD-10 codes, we would have 672 cases and 145,229 controls (1 out of 217.11) (Figure 2). The AD-by-proxy phenotype provides a more reasonable case:control ratio for the disease in the population and also improves the statistical power of association testing. Using the dxdata library, we demonstrated how to interact with the UKB data-fields and construct a phenotype based on scientific insight beyond the entries recorded in UKB.
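The "1 out of N" figures quoted above are the total number of participants divided by the number of cases. A short Python check, using only the counts given in the text (the function name is ours):

```python
def one_in_n(cases, controls):
    """Return N such that 1 out of N participants is a case."""
    return (cases + controls) / cases


proxy = one_in_n(24_227, 121_674)  # AD-by-proxy phenotype
icd10 = one_in_n(672, 145_229)     # G30/F00 ICD-10 codes only

print(f"AD-by-proxy: 1 out of {proxy:.2f}")  # 1 out of 6.02
print(f"ICD-10 only: 1 out of {icd10:.2f}")  # 1 out of 217.11
```

Both definitions cover the same 145,901 participants; only the number labeled as cases changes.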

Figure 2. AD case-control ratio in (a) AD by ICD-10 code, (b) AD-by-proxy, and (c) in US population.

Variant QC and Sample QC

We used PLINK2 to remove array variants with a minor allele count (--mac) below 100, a minor allele frequency (--maf) below 0.01, a Hardy-Weinberg equilibrium test p-value (--hwe) below 1e-15, or a per-marker missingness (--geno) above 0.1, and to remove samples with a per-individual missingness (--mind) above 0.1. For WES genotypes, we applied the same QC thresholds as for the array genotypes, except that we lowered the minor allele count (--mac) to 10 and the minor allele frequency (--maf) to 0.0001.
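As a rough sketch, the two sets of thresholds can be turned into PLINK2 command lines as follows. The thresholds come from the text; the file prefixes, the use of --bfile as the input format, and the choice to write pass lists with --write-snplist and --write-samples are assumptions for illustration and may differ from what was actually run on RAP.

```python
import shlex

# Thresholds quoted in the text; keys are PLINK2 flag names without dashes.
ARRAY_QC = {"mac": "100", "maf": "0.01", "hwe": "1e-15", "mind": "0.1", "geno": "0.1"}
# WES uses the same thresholds with lower --mac and --maf.
WES_QC = {**ARRAY_QC, "mac": "10", "maf": "0.0001"}


def plink2_qc_command(bfile_prefix, out_prefix, thresholds):
    """Build (but do not run) a PLINK2 command that writes QC-pass lists."""
    cmd = ["plink2", "--bfile", bfile_prefix]
    for flag, value in thresholds.items():
        cmd += ["--" + flag, value]
    cmd += ["--write-snplist", "--write-samples", "--no-id-header", "--out", out_prefix]
    return cmd


# Hypothetical file prefixes.
array_cmd = plink2_qc_command("ukb_array_merged", "array_qc_pass", ARRAY_QC)
wes_cmd = plink2_qc_command("ukb_wes_200k", "wes_qc_pass", WES_QC)
print(shlex.join(array_cmd))
print(shlex.join(wes_cmd))
```

The commands are only printed here; passing the list to subprocess.run would execute them where plink2 is installed.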

To minimize the impact of low-quality observations and confounders, we kept only individuals whose self-reported sex matches their genetic sex, who self-report white British ancestry, who show no putative sex chromosome aneuploidy, who have at most 10 putative third-degree relatives, and who are not marked as outliers for heterozygosity or missing rate. We also filtered out individuals with missing information on either parent’s age, and those who answered “Do not know” or “Prefer not to answer” for either parent’s illness status. Because our GWAS software of choice handles relatedness, we did not exclude related individuals from the study.
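The sample filter above amounts to a conjunction of per-participant conditions. The sketch below expresses it in plain Python; the record keys are hypothetical names chosen for readability, not UKB data-field identifiers, and parental illness is simplified to a single answer string.

```python
# Answers that make parental illness status unusable for the proxy phenotype.
EXCLUDED_ANSWERS = {"Do not know", "Prefer not to answer"}


def passes_sample_qc(p):
    """Return True if a participant record satisfies every filter in the text.

    `p` is a dict with hypothetical keys; real UKB data-fields would need
    to be mapped onto these names first.
    """
    return (
        p["reported_sex"] == p["genetic_sex"]
        and p["white_british"]
        and not p["sex_aneuploidy"]
        and p["n_third_degree_relatives"] <= 10
        and not p["het_missing_outlier"]
        and p["father_age"] is not None
        and p["mother_age"] is not None
        and p["father_illness"] not in EXCLUDED_ANSWERS
        and p["mother_illness"] not in EXCLUDED_ANSWERS
    )


good = {
    "reported_sex": "F", "genetic_sex": "F", "white_british": True,
    "sex_aneuploidy": False, "n_third_degree_relatives": 2,
    "het_missing_outlier": False, "father_age": 78, "mother_age": 81,
    "father_illness": "Heart disease", "mother_illness": "Alzheimer's disease/dementia",
}
# Same record, but the participant declined to report the father's illness.
declined = {**good, "father_illness": "Prefer not to answer"}

kept = [p for p in (good, declined) if passes_sample_qc(p)]
print(len(kept))
```

Relatedness is deliberately not part of this filter, since it is handled downstream by the association software.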

GWAS by Regenie

We applied REGENIE (v2.2.4), a machine learning-based GWAS approach that accounts for population structure and relatedness by conducting the GWAS in two steps. Step 1 fits a whole genome regression (WGR) model for the trait values. This step requires SNPs from across the genome, so we used the UKB genotype array data to fit the model. Step 2 tests each sequenced variant for association, conditional on the model from Step 1. The two steps allow us to first capture the genetic background in a model (as local polygenic scores) and then apply it as a covariate in association testing [Mbatchou et al., 2021].

Using an approximate Firth likelihood ratio test (--approx --firth), the GWAS for AD-by-proxy finished in 3 hours of wall-clock time. This efficiency is made possible by both the design of the algorithm and the scalability of the cloud environment.
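For orientation, the two REGENIE steps might be invoked roughly as below. This is a hedged sketch, not the exact commands we ran: the file names, block sizes, and the choice of --bed input are assumptions, while --bt (binary trait), --firth, --approx, and --pred are REGENIE options, and Step 1 writes its prediction list to a file named after its --out prefix with the suffix _pred.list.

```python
import shlex


def regenie_step1(array_prefix, pheno, covar, out):
    """Step 1: fit the whole genome regression model on array genotypes."""
    return [
        "regenie", "--step", "1",
        "--bed", array_prefix,      # hypothetical QC-passed array data
        "--phenoFile", pheno,
        "--covarFile", covar,
        "--bt",                     # AD-by-proxy is a binary trait
        "--bsize", "1000",          # block size: an assumed value
        "--out", out,
    ]


def regenie_step2(wes_prefix, pheno, covar, step1_out, out):
    """Step 2: test each WES variant, conditional on the Step 1 predictions."""
    return [
        "regenie", "--step", "2",
        "--bed", wes_prefix,        # hypothetical QC-passed WES data
        "--phenoFile", pheno,
        "--covarFile", covar,
        "--bt",
        "--firth", "--approx",      # approximate Firth likelihood ratio test
        "--pred", f"{step1_out}_pred.list",
        "--bsize", "200",           # block size: an assumed value
        "--out", out,
    ]


# Hypothetical file names.
step1 = regenie_step1("array_qc", "ad_by_proxy.pheno", "covariates.txt", "ad_step1")
step2 = regenie_step2("wes_qc", "ad_by_proxy.pheno", "covariates.txt", "ad_step1", "ad_step2")
print(shlex.join(step1))
print(shlex.join(step2))
```

The only coupling between the two commands is the prediction list passed through --pred, which is what lets Step 2 run on a different set of variants than Step 1.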

Using the results of the GWAS, we rediscovered variants associated with AD, including the well-known variants in APOE [Farrer et al., 1997, Sherva & Farrer, 2011, Ma et al., 2019], and extended to more coding variants (Figure 3).

Figure 3. Manhattan plot for Alzheimer’s disease, with 200K WES data from UKB.

Discussion

In this case study, we reproduced a GWAS analysis and rediscovered known associations. This result is one of the first validations suggesting that an end-to-end GWAS protocol run on RAP with UKB-dispensed data is scientifically valid. Researchers can directly use the featured toolkit on RAP (we used dxdata, PLINK2, and regenie) to conduct research and interact with RAP data.

Many potential applications can be developed from this protocol. We look forward to streamlining this analysis as a WDL-based workflow and taking advantage of the scalability of cloud computing to perform GWAS on dozens or thousands of other phenotypes. We also look forward to making more scientific discoveries by working with this ultra-rich UKB data on RAP and applying machine learning and artificial intelligence analyses.

Acknowledgement

This research has been conducted using the UK Biobank Resource under Application Number ‘46926’. We thank all the participants in the UK Biobank study. We thank Regeneron Genetics Center’s Tony Marcketta (@AMarcketta) for his feedback and collaborative work in enabling regenie on RAP. We thank Chiao-Feng Lin (@chiaofenglin) and George Asimenos for advising on best practices for performing GWAS and using RAP.
