Trending topics in Bioinformatics & AI: the Magic of HLAs on Organs, COVID-19, and Love

Dan Quang
DNAnexus Science Frontiers
7 min read · Jul 14, 2020

The human leukocyte antigens (HLAs) are a set of cell-surface proteins that bind short peptides and present them to the immune system. This presentation is how the immune system distinguishes your own cells from cells that do not belong in your body, and it is a key step in initiating the adaptive immune response. HLAs are encoded by genes located on a five million base pair stretch within chromosome 6p21 called the major histocompatibility complex (MHC) region, which is both the most gene-dense and the most genetically diverse region in the human genome. Over 200 genes lie in the MHC region, and many of the HLA genes in the region have a multitude of documented alleles. At the time of writing, over 20,000 HLA alleles have been identified.

HLAs have multiple roles, including:

Determining compatibility for organ donations

HLAs are the major cause of organ transplant rejections. If the donor’s and recipient’s HLAs are incompatible, the recipient’s immune system will attack the new organ as foreign tissue. Although many HLAs exist, doctors typically only check concordance for the most ubiquitously expressed HLAs, such as HLA-A, -B, and -C, for tissue matching. Because HLA genes are inherited together from each parent, immediate family members, and siblings in particular, are the most likely to be well matched and are therefore often the preferred donors.
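
As a toy illustration of what "checking concordance" means, the sketch below counts, for each locus, how many donor alleles are absent in the recipient. The function name, the data layout, and the allele calls are all assumptions made up for this example; real tissue matching relies on clinical typing and on criteria that go beyond this simple count.

```python
from collections import Counter

def count_mismatches(donor, recipient, loci=("A", "B", "C")):
    """Count, per locus, the donor alleles that the recipient does not carry."""
    mismatches = {}
    for locus in loci:
        donor_alleles = Counter(donor[locus])
        recipient_alleles = Counter(recipient[locus])
        # Counter subtraction keeps only donor alleles left unmatched.
        mismatches[locus] = sum((donor_alleles - recipient_alleles).values())
    return mismatches

# Hypothetical two-field HLA types for one donor and one recipient.
donor = {
    "A": ("A*01:01", "A*02:01"),
    "B": ("B*08:01", "B*07:02"),
    "C": ("C*07:01", "C*07:02"),
}
recipient = {
    "A": ("A*01:01", "A*03:01"),
    "B": ("B*08:01", "B*07:02"),
    "C": ("C*07:01", "C*07:01"),
}

result = count_mismatches(donor, recipient)
print(result)
```

Using a multiset (`Counter`) rather than a plain set means a homozygous recipient is only credited once for an allele the donor also carries once, which is why the HLA-C locus above still registers one mismatch.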

Defending against disease

More than 100 diseases have been associated with different HLA alleles. For example, one study suggests that certain HLA alleles are associated with severe outcomes in COVID-19[1]. Furthermore, HIV-positive individuals who are homozygous at certain HLA genes typically progress to AIDS much more rapidly than heterozygotes. In some homozygous individuals the rate of progression is double that of heterozygotes. The rate of progression tracks the degree of heterozygosity fairly closely: the more HLA loci at which an individual is heterozygous, the slower the progression to AIDS tends to be[2].

Influencing our mate preferences

HLA may be related to how people perceive the odor of others and may be involved in mate selection: at least one study found a lower-than-expected rate of HLA similarity between spouses in an isolated community[3]. Combined with the observation that HLA heterozygotes are generally more resistant to some diseases, this has led to the hypothesis that we are biologically inclined to seek out partners with whom we can have the healthiest children.

How HLA Typing is currently done

Because there are many HLA alleles, and different combinations of them can lead to different outcomes for disease and tissue compatibility, knowing which HLA alleles you carry is clearly important. The process of determining your HLA alleles is called HLA typing. Unfortunately, HLAs are notoriously difficult to genotype, mainly because of the sheer number of alleles that exist for each HLA gene. Here are some of the current methods for HLA typing:

Serotyping is the crudest method for HLA typing. It involves taking a blood sample and introducing the blood cells to antibodies to screen for specific serotypes. Serotypes represent very broad groupings instead of any single allele, and this typing method is highly dependent on the quality of the screening antibodies.

Sequence-based typing (SBT) has been the gold standard for HLA typing for many years. It involves PCR amplification of specific coding regions of HLA genes and sequencing of the amplicons. The process can be quite labor intensive and time consuming. Moreover, SBT can have trouble disentangling ambiguous allele combinations for one of several reasons. One such reason is that SBT only amplifies and sequences select exons. Some alleles can only be distinguished by polymorphisms located in unsequenced exons, or even introns as is the case with null alleles that do not get expressed. In the following example, only the first allele can be uniquely identified via sequencing the first exon. The other alleles can only be identified by sequencing outside of this exon.
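
The ambiguity problem can be made concrete with a small sketch. The alleles, region names, and sequences below are entirely made up for illustration; the point is only that alleles which agree on every sequenced region end up in the same indistinguishable group, and that sequencing more regions splits those groups apart.

```python
from collections import defaultdict

# Hypothetical alleles: each maps a region name to a made-up sequence.
alleles = {
    "allele_1": {"exon1": "ACGT", "exon2": "GGCC", "intron1": "TTAA"},
    "allele_2": {"exon1": "ACGA", "exon2": "GGCC", "intron1": "TTAA"},
    "allele_3": {"exon1": "ACGA", "exon2": "GGCT", "intron1": "TTAA"},
    "allele_4": {"exon1": "ACGA", "exon2": "GGCT", "intron1": "TTAG"},
}

def ambiguity_groups(alleles, sequenced_regions):
    """Group alleles that look identical across the sequenced regions."""
    groups = defaultdict(list)
    for name, regions in alleles.items():
        key = tuple(regions[region] for region in sequenced_regions)
        groups[key].append(name)
    return sorted(groups.values())

only_exon1 = ambiguity_groups(alleles, ["exon1"])
both_exons = ambiguity_groups(alleles, ["exon1", "exon2"])
everything = ambiguity_groups(alleles, ["exon1", "exon2", "intron1"])

print(only_exon1)   # which alleles are resolved by exon 1 alone
print(both_exons)   # adding exon 2 splits one more allele off
print(everything)   # the intron is needed to separate the last pair
```

In this toy setup, sequencing only the first exon uniquely identifies the first allele, while the last two alleles stay ambiguous until the intron is sequenced as well.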

Newer methods for HLA typing employ high-throughput sequencing (HTS, AKA next generation sequencing) technologies that can resolve these ambiguities. Nevertheless, improving methods for HLA typing remains an ongoing field of research.

A machine learning model for HLA Typing

To demonstrate how machine learning can be used for HLA typing, I applied Microsoft’s LightGBM model to predict the number of copies of the HLA-A*01:01 allele (one of the most common alleles of HLA-A) from genotype data, using variants in the MHC region as features. Because human genomes are diploid, you can have 0, 1, or 2 copies of this allele. At the moment, my method is designed for only the HLA-A*01:01 allele, but in principle it can be extended to other HLA genes and alleles. I used the 1000 Genomes dataset, which has over 2,000 samples, to train the model. To evaluate model performance, I held out a portion of this dataset as a testing set. Ideally, evaluation should be performed on an external dataset; however, access to such datasets outside of academia is restricted. Previously published academic methods for predicting HLA types directly from genotype data include HIBAG[4], HLA*IMP[5], and SNP2HLA[6]; these methods do include an evaluation on external datasets, and I suggest checking them out if you are interested in learning more. Here is a partial code example showcasing parts of the Jupyter notebook I wrote:

import lightgbm as lgb
import allel
import pandas as pd

# read in HLA data table using pandas library
hla_df = pd.read_csv('20181129_HLA_types_full_1000_Genomes_Project_panel.txt',
                     delimiter='\t', index_col='Sample ID')
hla_df

# read in VCF using scikit-allel library, restricted to the MHC region
callset = allel.read_vcf('ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz',
                         region='6:29677984-33485635')
gt = allel.GenotypeArray(callset['calldata/GT'])
gt

# train a lightgbm model. train_data and test_data are feature,
# label pair datasets derived from the HLA data table and VCF files
parameters = {'objective': 'cross_entropy',
              'metric': 'cross_entropy',
              'is_unbalance': True,
              'boosting': 'gbdt',
              'num_leaves': 31,
              'feature_fraction': 0.5,
              'bagging_fraction': 0.5,
              'bagging_freq': 20,
              'learning_rate': 0.05,
              'verbose': 0}
model = lgb.train(parameters, train_data, valid_sets=[test_data],
                  num_boost_round=128)
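
The excerpt above does not show how `train_data` and `test_data` are built. The following is a minimal sketch of one plausible way to derive labels and features, not the notebook's actual code. It assumes the HLA table stores the two HLA-A calls per sample as two-field strings such as `01:01` (the exact column names and value format are assumptions), and that genotypes are diploid calls with `-1` marking a missing allele. Plain Python lists stand in for the real arrays so the example runs on its own.

```python
def count_allele_copies(call_1, call_2, target="01:01"):
    """Return 0, 1, or 2: how many of the two HLA-A calls match the target."""
    target_fields = target.split(":")
    # Compare only the first two fields, so '01:01:01' still counts as '01:01'.
    return sum(call.split(":")[:2] == target_fields for call in (call_1, call_2))

def to_alt_counts(genotypes):
    """Convert diploid calls such as (0, 1) into counts of non-reference alleles."""
    # Missing alleles (-1) are not counted as alternate alleles.
    return [sum(1 for allele in call if allele > 0) for call in genotypes]

# Hypothetical HLA-A calls for three samples.
hla_a_calls = {
    "sample_1": ("01:01", "02:01"),
    "sample_2": ("01:01:01", "01:01"),
    "sample_3": ("03:01", "24:02"),
}
labels = [count_allele_copies(*calls) for calls in hla_a_calls.values()]

# Hypothetical genotype calls: one row per sample, one tuple per variant.
genotype_rows = [
    [(0, 1), (1, 1), (0, 0)],
    [(0, 0), (-1, -1), (1, 0)],
    [(1, 1), (0, 1), (0, 0)],
]
features = [to_alt_counts(row) for row in genotype_rows]

print(labels)
print(features)
```

On real data, scikit-allel's `GenotypeArray.to_n_alt()` performs the same alternate-allele count over the whole array, and a feature matrix with matching labels can be wrapped as `lgb.Dataset(features, label=labels)` before being passed to `lgb.train`.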

One of the reasons I picked LightGBM for this specific problem is that it has a feature importance function. Feature importance is a relative measure of how much each feature contributes to the model, which provides some interpretability. In this case, each feature is a variant located in the MHC region, so each feature’s importance value can be plotted against its chromosome position to generate a figure akin to a Manhattan plot:

As expected, the genetic variants that are most important for predicting HLA-A*01:01 are located next to the HLA-A gene. However, many informative variants are located far away from HLA-A, which makes the relationship between variants and HLA type harder to interpret. Nevertheless, if the primary goal is to accurately predict HLA type, this model is more than adequate.

The feature importance function is also useful for feature selection. The original training dataset contained 72k variants. Using the feature importances, we can filter out variants while minimizing the impact on model performance. With this in mind, I reduced the dataset down to the 19 most informative variants. In the future, I expect this approach to be useful for designing optimal microarrays. Although HTS is largely favored over microarrays for discovery, microarrays are still widely used for genotyping common variants, as they are substantially less expensive than HTS and much better suited to processing the thousands of samples required for typical genome-wide association studies. My method may also complement existing HLA typing methods as a validation step.
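
The selection step can be sketched as follows. This is an illustrative outline rather than the notebook's implementation: the helper name, the positions, and the importance values are hypothetical. With a trained LightGBM model, the raw values would typically come from `model.feature_importance()`.

```python
def top_k_variants(positions, importances, k):
    """Return the k most important variants as (position, share of total importance)."""
    total = sum(importances)
    if total == 0:
        raise ValueError("all importances are zero; nothing to rank")
    # Sort by importance, highest first; ties keep their original order.
    ranked = sorted(zip(positions, importances),
                    key=lambda pair: pair[1], reverse=True)
    return [(position, importance / total) for position, importance in ranked[:k]]

# Hypothetical variant positions and raw importance values.
positions = [100, 200, 300, 400]
importances = [2.0, 6.0, 0.0, 2.0]

selected = top_k_variants(positions, importances, k=2)
print(selected)
```

Dividing by the total turns the raw values into shares that sum to one across all variants, which makes them easier to compare between models. Retraining on only the selected variants and re-checking performance on the held-out set is the natural way to confirm that little accuracy was lost.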

If you would like to play around with the model, you can check out the notebook here.

References

  1. Nguyen, A., David, J. K., Maden, S. K., Wood, M. A., Weeder, B. R., Nellore, A., & Thompson, R. F. (2020). Human leukocyte antigen susceptibility map for SARS-CoV-2. Journal of Virology.
  2. Carrington, M., Nelson, G. W., Martin, M. P., Kissner, T., Vlahov, D., Goedert, J. J., … & O’Brien, S. J. (1999). HLA and HIV-1: heterozygote advantage and B*35-Cw*04 disadvantage. Science, 283(5408), 1748–1752.
  3. Brennan, P. A., & Kendrick, K. M. (2006). Mammalian social odours: attraction and individual recognition. Philosophical Transactions of the Royal Society B: Biological Sciences, 361(1476), 2061–2078.
  4. Zheng, X., Shen, J., Cox, C., Wakefield, J. C., Ehm, M. G., Nelson, M. R., & Weir, B. S. (2014). HIBAG — HLA genotype imputation with attribute bagging. The pharmacogenomics journal, 14(2), 192–200.
  5. Dilthey, A. T., Moutsianas, L., Leslie, S., & McVean, G. (2011). HLA* IMP — an integrated framework for imputing classical HLA alleles from SNP genotypes. Bioinformatics, 27(7), 968–972.
  6. Jia, X., Han, B., Onengut-Gumuscu, S., Chen, W. M., Concannon, P. J., Rich, S. S., … & de Bakker, P. I. (2013). Imputing amino acid polymorphisms in human leukocyte antigens. PLoS ONE, 8(6).

Dan Quang
DNAnexus Science Frontiers

Recovering academic and freelance scientist. Machine Learning/Deep Learning Scientist at DNAnexus.