How well do we know our “reference genome”?

Published in

Musing’s of a Data Scientist in Medicine

Aug 14, 2019

In this August post, I cover the human “reference genome”: its inception, its development, the state of the art and, more importantly, what we are currently missing. This mostly stems from an article that raises some valid concerns about the gaps in, and the limited diversity of, our current reference genome. I do not completely agree with everything in it, but some of the papers it references, and other articles, do raise concerns. Another article from my recent reading that fits this space is here. To begin with, I will try to give us a refresher.

Why “reference genome”? What is the Human Genome Project?

Before jumping into the reference genome, I would like to provide some background and a memory jog on the Human Genome Project (HGP). It was a landmark effort, completed in 2003, that helped us make a blueprint of human DNA. This means we were able to dive deep into the genetic make-up of a human being and read the long sequence of four letters [A, C, T, G] that build up DNA (nucleotide resolution). This opened a lot of avenues for the entire world to get a first-hand understanding of developmental and disease biology at single-base level. For a detailed understanding, one can look into the goals and the timeline of the project.

By the time it finished, it gave us:

Figure 1: Shows the various feats by the end of the finished genomic sequence. The screenshot is taken from https://www.genome.gov/human-genome-project/results

This was a momentous task that led to the achievements below.

Figure 2: A table of the various achievements upon completion of the HGP by 2003. This figure is adapted and modified from https://www.genome.gov/human-genome-project/results

The HGP has been a success. We now use its knowledge and data across various domains of biology and healthcare: population genetics, genomics, target discovery leading to potentially druggable targets, and biomarker development. More on the wider scope of the HGP can be found here.

A few nations have joined hands to form the 100K Genomics club. All of them aim at precision medicine through population genomics projects. More detailed information can be found here. Below is the list of countries currently in that club:

United Kingdom — 100,000 Genomes Project
Japan — Initiative on Rare and Undiagnosed Diseases
China — 100,000 Genomes Project
Australia — Australian Genomics Health Futures Mission
Saudi Arabia — Saudi Human Genome Program
United States — All of Us Research Program
Estonia — Personalized Medicine Programme
France — France Génomique (Médicine France Génomique 2025 or French Plan for Genomic Medicine 2025)
Dubai, United Arab Emirates — Dubai Genomics
Turkey — Turkish Genome Project

Having said all this, one key step in any genomics-based research work is alignment to the “reference genome”. This reference genome is a collaborative effort led by the Genome Reference Consortium.

Figure 3: This image, taken from Wikipedia (https://en.wikipedia.org/wiki/Reference_genome), shows the recent human genome assemblies along with their release dates.

Having mentioned the “reference genome”, we should also try to understand how it is used and at which stage. The figure below shows how a sequencing pipeline works, starting from a maternal amniotic fluid sample.

Figure 4: This is a typical workflow of sequencing that is sourced and adapted from https://doi.org/10.1016/B978-0-12-813764-2.00012-X.

Figure 4 shows the typical workflow of Whole Genome and Whole Exome Sequencing. After the SBS (sequencing by synthesis) stage, we receive the raw sequenced reads as FASTQ files. These go through initial QC and alignment to the reference genome, followed by any downstream analysis.
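To make the FASTQ and initial QC step a little more concrete, here is a minimal Python sketch. It assumes the simple four-lines-per-record FASTQ layout and Phred+33 quality encoding; the function names, the two toy reads and the mean-quality threshold of 30 are all made up for illustration. Real pipelines rely on dedicated QC tools rather than something like this.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the parser
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean base quality, assuming Phred+33 encoding (an assumption of this sketch)."""
    return mean(ord(char) - offset for char in qual)


# Two hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTNNNN\n+\nIIII####\n"
)
records = list(parse_fastq(example))

# Keep only reads whose mean quality reaches an arbitrary, illustrative threshold.
passing = [read_id for read_id, seq, qual in records if mean_phred(qual) >= 30]
print(passing)
```

Only the reads that pass a filter like this would move on to alignment against the reference genome.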

Figure 5: This illustration is a mapping process that is adapted and sourced from https://shiltemann.github.io/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html. The figure shows short sequences set of reads as inputs that are then aligned to the reference genome. The highlighted region is showing the **Mapping** where **Read1** is mapped with 2 mismatches at starting position 100, the **Read2** at position 114 with clippings on either side and the **Read3** is aligned with 2-base insertion and 1-base deletion at position

Figure 5 shows a set of short reads as inputs that are then aligned to the “reference genome”. The concept of alignment is not new: it goes from local and global pairwise sequence alignment to the more complex multiple sequence alignment. Various algorithms and tools have been developed over the years to make alignments more precise. Below are some comparative metrics for popular aligners on DNA-Seq and RNA-Seq data.

Figure 6: Comparative benchmark metrics for a few popular aligners. The image has been adapted from https://www.ecseq.com/support/benchmark.html. The individual images were downloaded and grouped together in PowerPoint to present them as one comprehensive image. Please note that this benchmark does not take into account some other efficient aligners that have proven very instrumental lately. For other methods, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6042521/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4914128/ (both of which are based on transcriptomics data).
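For readers who have not seen local and global alignment side by side, here is a small Python sketch of the classic dynamic-programming scoring behind both (Needleman-Wunsch for global, Smith-Waterman for local). The scoring values and the toy sequences are assumptions picked for illustration, and the function only returns the score, not the alignment itself. Production short-read aligners use indexed, heuristic methods and are far more elaborate than this.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score with a linear gap penalty.

    local=False: global alignment score (Needleman-Wunsch style).
    local=True:  local alignment score (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


reference = "CCGATTACACC"   # toy "reference" fragment
read = "GATTACA"            # toy read contained in the reference

print(alignment_score(read, reference, local=True))   # local score of the read
print(alignment_score("ACGT", "AGT"))                 # global score with one gap
```

Local scoring is the closer analogue of read mapping, since a short read only needs to match one region of a much longer reference.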

However, with massive genome sequencing projects under way globally, several issues have been reported lately. One of them concerns the current human reference genome. Below are some excerpts from experts on its lack of diversity: what it misses, and how we are often unable to accurately pinpoint certain regions of our DNA because that diversity is not represented. To begin with, I highlight some of the concerns raised.

“Computational biologist Steven Salzberg of Johns Hopkins University and colleagues sequenced the genomes of 910 African Americans and measured how many pieces are present in all of them but are missing from the reference genome. Their count: 296,485,284 base pairs — nearly 10 percent of the human genome — they reported last November. One missing fragment is 100,000 base pairs long, and millions are at least 1,000 long.”

The current reference genome “is good for many, many things, but it’s not as good or as complete as it could be.”

“The problem was, this 238-base-pair region isn’t in the reference,” said Dr. Heather Mefford, the UW pediatrician and geneticist who led the sequencing analysis: The abnormality was a nucleotide stutter, with CGG repeated hundreds of times in a segment of DNA that activates XYLT1.

Most of the projects are disproportionately slanted towards populations of European descent.

“If we keep focusing on the populations that are easy to study — those that have been studied before — we make existing disparities in health data worse,” warns Dr Lucia Hindorff, a programme director at the US National Human Genome Research Institute (NHGRI).

“We’re seeing evidence that adding 50,000 people that are European as opposed to 50,000 people who are not European adds a different value to the data. If you add 50,000 more non-Europeans, you end up discovering more variations,” she says.

Now let us check some published facts. What are the pieces of evidence available?

We all love our SNV/CNV/SV/InDel discoveries, using our favorite algorithms (or benchmarked ones at times) to find what plagues our linear DNA structure in a WGS/WES study. However, there is evidence, and there are reports, that because of these diversity issues we often miss such calls or lose precision in them.

Note: some of these variants can also be called from DNA methylation arrays and RNA-Seq. We can discuss that some other time.

Some metrics that focus on the issue:

“Table 1” is a screenshot from https://www.nature.com/articles/s41588-018-0273-y.

This screenshot shows the bases that did not align to GRCh38 in 910 individuals of African descent.

Now let us look at our GWAS studies. How representative are they of ancestral diversity, counted by studies and by individuals?
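As a quick sanity check on the “nearly 10 percent” figure quoted above from the same study, the arithmetic can be sketched in a few lines of Python. The missing base-pair count is taken from the quote; the genome size of roughly 3.1 billion base pairs is my assumption for the approximate size of the GRCh38 assembly, not a number from the quote.

```python
missing_bp = 296_485_284            # base pairs reported missing (from the quote)
assumed_genome_bp = 3_100_000_000   # ASSUMPTION: approximate size of GRCh38

percent_missing = 100 * missing_bp / assumed_genome_bp
print(f"{percent_missing:.1f}% of the assumed genome size")  # prints 9.6% ...
```

With that assumed denominator, the result lands a little under 10 percent, which is consistent with the quoted statement.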

Figure 6: This is a screenshot adapted from https://www.sciencedirect.com/science/article/pii/S0092867419302314?via%3Dihub. It depicts the percentage distribution of ancestry categories included in GWAS (https://www.ebi.ac.uk/gwas/home). The left picture is based on “studies” while the right is based on total “individuals”.

Figure 6 clearly shows the lack of diversity in the distribution of ancestry categories.

Does such discrepancy affect our understanding of disease?

Figure 7: This is a screenshot adapted from https://www.sciencedirect.com/science/article/pii/S0092867419302314?via%3Dihub. The figure portrays the factors that impact the ability to “Replicate Genotype-Phenotype Associations across Populations”.

Figure 7 clearly shows that it does impact “transferability”, specifically when we use tagSNPs derived from a single European population, and such findings may fall into the replication crisis. The effect is most apparent when one tries to generalize Genotype-Phenotype associations across varied populations while studying disease traits.

Another interesting work, using deep single-molecule real-time (SMRT) sequencing of two haploid genomes, found discrepancies with the 1000 Genomes Project when comparing SVs and InDels (for details, refer to https://genome.cshlp.org/content/27/5/677).

Figure 8: Screenshot of the abstract https://genome.cshlp.org/content/27/5/677

All of this points to a serious lack of diversity in the current “reference genome”. It also suggests that our current SNV/SV findings can be limited and that we might be missing important regions of the genome.

Having said that, does this mean we are doing everything wrong? My take would be no. Science is an evolutionary process that gets better with time. We learn to unlearn and relearn. As we learn about these issues today, we can work on them to get better resolution in discoveries from the human genome: either improve the reference genome, or build a way for the current reference to evolve by incorporating information from the large-scale genomics studies already performed on diverse populations. This helps the quest to disprove once-proven theories. Some recent works propose tackling the issue with a “graph genome”, and several papers on graph genomes have been published recently.

For a cool educational video on the graph genome concept, refer to the Seven Bridges video link below.
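To give a rough feel for the graph genome idea, here is a toy Python sketch. It represents a tiny stretch of sequence as a graph in which a single-base variant forms a “bubble” with two alternative paths. The node sequences, the names and the example read are all invented for illustration. Real graph-genome tools index the graph and align against it directly; they do not enumerate every path as this sketch does.

```python
# Each node carries a piece of sequence; edges say which node may follow which.
# Nodes 2 and 3 are alternative alleles (A or G) at the same position.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def haplotypes(node, nodes, edges):
    """Spell out every sequence obtained by walking from `node` to the end of the graph."""
    tails = [
        tail
        for successor in edges[node]
        for tail in haplotypes(successor, nodes, edges)
    ] or [""]  # a node without successors ends the walk
    return [nodes[node] + tail for tail in tails]


paths = haplotypes(1, nodes, edges)
print(paths)  # ['ACGTATTCA', 'ACGTGTTCA']

# Pretend the first path is the linear reference. A read carrying the G allele
# has no exact match there, but it does match the alternative path in the graph.
linear_reference = paths[0]
read = "GTGTT"
print(read in linear_reference)                  # False
print([path for path in paths if read in path])  # ['ACGTGTTCA']
```

The point of the sketch is only that a graph can hold known variation from diverse populations, so a read carrying a non-reference allele still has somewhere to align.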

https://vimeo.com/7bridges/graph-genome

Some publications in the space:

There are already some aligners available that can work with graph genomes. I will cover them in a future post.

With this, I come to the end of my post. The idea for it stemmed from the WES work I did earlier, when my knowledge was more limited. With years of learning, I have gained a better understanding of WES/WGS/GWAS in the context of the “reference genome” and its diversity issues. I hope I was able to convey the issues with the current “reference genome”, and that this post serves as an informative resource offering deeper insight into the field. I also hope I was able to clarify what such diversity issues might lead to and, in theory, their impact on our understanding of disease and developmental biology at the population scale. In the coming years, as science progresses, I expect the graph genome, or some other innovative form of the “reference genome”, to tackle the diversity issue so that we better understand human biology through “Genotype-Phenotype” associations.

I thank and acknowledge all the authors whose work I have cited or referenced, along with their publication images, in this story. They gave me enough food for thought to start with and to pen down my understanding. I have put my readings and experiences together so that I and others can use them as a resource for future research. I would be happy to add more information for clarity if needed. Please reach out in the comments, or via any social media platform where you find this article, and I will make edits as requested.

Edit 1: I just realized that one of the graph genome paper references, “Genome graphs and the evolution of genome inference”, had a broken link, so I updated it.

Edit 2: Figures 6 and 7 are from the linked paper. I have updated the link again, as the previous one was reported to be broken. Please reach out if it's still an issue.

Edit 3: Steven Salzberg commented that the dates of some of the genome drafts in Figure 2 were incorrect. I thank him for providing this information. I have now modified and changed it, linking with the sources.


Written by Vivek Das