Finding Human — by sequencing our Ape relatives

Liz T (PacBio)
Jul 23, 2018 · 9 min read

I first met Zev when he was a postdoc in the Eichler lab at the University of Washington in 2016. At that time, the Eichler lab had just published a newer version of the gorilla genome using PacBio data. It was a drastic improvement: a more than 180-fold gain in contiguity, with 94% of the incomplete genes recovered.

This generated some media buzz, including a Washington Post article that gave a good background on why we care about gorillas, or any other non-human primates, beyond biological curiosity: Because only through looking at our closest relatives can we begin to understand the genetic events that led to us having larger brains and other distinct neurological traits.

After talking to Zev, I realized something much bigger was in the works. The gorilla was just the beginning — they were sequencing the rest of the great apes for both the genome and the transcriptome! By combining high-quality genome assembly with full-length transcriptome sequencing, they were able to achieve a much more accurate and complete picture of the genomic and transcriptional landscape.

The Great Ape paper gracing the cover of Science, June 2018.

The paper, titled “High-resolution comparative analysis of great ape genomes”, was published in Science on June 8th, 2018. A week later, at the SMRT Leiden conference, I (in typical last-minute fashion) asked Zev to explain the figures from the paper so that I could select one to include in my talk. Only then did I realize how thick the story is.

The main article is four pages of text and… 164 pages of supplement! Though I was immersed in the technologies and tools he was using, it still took a long time to digest just one figure. Though there have been several summary articles written, including a UW press release, a Nature commentary, and a Science summary, I thought: If I’m going to read the paper and ask Zev a bunch of questions, I might as well write a blog about it!

So, I decided to “interview” Zev and write this post. It should be said that this post is no substitute for reading the paper itself. Rather, I hope this post will complement the article by providing my own perspective and some of Zev’s answers to my (often silly) questions.

It is also important to note that while I only talked to Zev, like all scientific studies this project is the result of teamwork. Zev wants me to use this opportunity to emphasize his gratitude towards all his co-authors, who include not only lab mates and academic collaborators but also industry employees from PacBio and BioNano.

NOTE: the Eichler lab has a webpage that hosts all the tables and data available for public download: https://eichlerlab.gs.washington.edu/great_ape_assembly/.

Better Genomes, Better Annotations, Better Answers

Zev and his co-authors sequenced one chimpanzee (Clint), one orangutan (Susie), the haploid human cell line CHM13, and the diploid human Yoruban (NA19240). The gorilla genome (also named Susie) is the previously published Gordon et al. assembly. Genome assembly was done using PacBio sequencing, followed by polishing using Illumina, then finally scaffolding using BioNano, Hi-C, and BAC sequencing. Zev wrote SMARTIE-SV to call structural variants. RNA samples were derived from iPSC cell lines and sequenced using PacBio (Iso-Seq analysis) and RNA-Seq.

Figure S43 from Kronenberg et al. showing the genome assembly and annotation workflow.

Central to getting better annotations was the CAT (Comparative Annotation Toolkit) software by Fiddes et al. (paper, GitHub). It combines a multi-species alignment (using Progressive Cactus), existing reference annotations, RNA-seq, and PacBio Iso-Seq data. CAT then runs transMap to project annotations from well-annotated species to less-annotated species; it configures AUGUSTUS to do ab initio gene prediction and also to use full-length transcript information from Iso-Seq data. I’m excited to see CAT because existing annotation tools (such as MAKER and AUGUSTUS) either have been limited to short-read transcript information or rely heavily on reference-based alignments. CAT addresses all of these shortcomings.

The new SMRT assemblies and gene annotations, especially for the primates, were night and day compared to the existing versions. The chimpanzee genome saw a 32-fold increase in contiguity with 52% of the 27,727 gaps closed; more than 2000 inversions caused by scaffolding errors in PanTro5 were corrected and more than 20 Mbp of sequence was removed. The orangutan genome saw a 533-fold increase in contiguity with 96.8% of the gaps closed. Additional sequences were added to the primate genomes and redundant sequences removed. The new CAT-based annotations added at least 300 new genes to each primate. New exons in known genes were discovered. More transcripts from the great ape Iso-Seq data mapped to the SMRT assemblies than to the old assemblies.
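To make "fold increase in contiguity" concrete, here is a minimal sketch in Python. It assumes contiguity is summarized by contig N50, a common statistic; the post does not say which measure the paper used. The contig lengths are made-up toy values, not numbers from the paper.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half of the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths (arbitrary units); both assemblies total 200.
old_contigs = [50, 40, 30, 20, 10, 10, 10, 10, 10, 10]
new_contigs = [120, 60, 20]

# Fold increase in contiguity, under the N50 assumption above.
fold_increase = n50(new_contigs) / n50(old_contigs)
print(n50(old_contigs), n50(new_contigs), fold_increase)
```

The two toy assemblies have the same total size, so the difference in N50 comes purely from how fragmented the sequence is, which is the sense in which closing gaps raises contiguity.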

It is upon this great new foundation of genome and annotations that the great ape paper dives into “anthologies of interest”, each tale a study of character in evolution.

Re-Estimating the Human-Ape Divergence

Figure 2C from Kronenberg et al. showing the phylogenetic tree between human and the great apes.

Zev pointed out that the 2016 gorilla assembly revealed that long-read assemblies are fundamentally different from previous short-read assemblies, and not just in terms of contiguity. With the new SMRT assemblies, the majority of each genome (short of the most complex and largest segmental duplications) can now be aligned. The new assemblies estimate a slightly higher divergence than previously reported. A random sampling of 10,000 coding regions across the species estimates that 35.6% of the human genome is subject to incomplete lineage sorting. Consistent with the hominid slowdown hypothesis, the chimpanzee-human branch lengths were shorter than those of the rest of the great apes.
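One way to picture an incomplete lineage sorting estimate is as the share of sampled loci whose gene tree disagrees with the species tree. The Python sketch below is a simplified three-species illustration under that assumption; it is not the method used in the paper, and the locus calls are invented toy data. Note that discordant topologies can also arise from alignment or tree-inference error, so a raw discordance fraction is only a rough proxy.

```python
from collections import Counter

# In the accepted species tree, human and chimpanzee are each other's
# closest relatives, with gorilla as the outgroup of the three.
SPECIES_TREE_PAIR = frozenset({"human", "chimpanzee"})


def discordant_fraction(sister_pairs):
    """Fraction of loci whose closest-related pair is not human+chimpanzee."""
    counts = Counter(frozenset(pair) for pair in sister_pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no loci given")
    return (total - counts[SPECIES_TREE_PAIR]) / total


# Toy data: the sister pair inferred at each of ten hypothetical loci.
loci = (
    [("human", "chimpanzee")] * 7
    + [("human", "gorilla")] * 2
    + [("chimpanzee", "gorilla")] * 1
)
print(discordant_fraction(loci))
```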

Structural Variations Hidden in Time

The high-quality assemblies allowed the identification of both shared and species-unique structural variations from orangutan to human. Using CHM13 and Yoruban to control for reference effects, they identified 17,789 fixed human-specific structural variations (fhSVs). Some of the fhSVs were projected to cause codon loss, while others would disrupt regulatory regions. An example was upstream of the androgen receptor (AR) gene, where, compared with gorilla, the human genome had a 61 kb deletion followed by a 24 kb inversion.

Figure 3A from Kronenberg et al. showing species- and lineage-specific structural variations.

When they compared the newly defined fhSVs against a previous definition of human-specific deletions called hCONDELs, they found discrepancies. The hCONDELs were defined as sequences deleted in human but conserved between chimpanzee, macaque, and mouse; the definition did not consider gorilla or orangutan. The new SMRT assemblies revealed that some hCONDELs are not human-specific after all. Some were artifacts of the hCONDEL definition not accounting for inversion events, and others turned out to be missing in gorilla or orangutan as well.
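The logic of calling a deletion "fixed and human-specific" can be sketched as a simple presence/absence filter: the sequence must be missing from both human genomes and present in every other great ape. The Python below is an illustrative sketch only. The function name, the data layout, and the three candidate events are hypothetical, and it ignores the inversion handling the paper needed.

```python
HUMANS = ("CHM13", "NA19240")
OTHER_APES = ("chimpanzee", "gorilla", "orangutan")


def is_fixed_human_specific(presence):
    """presence maps genome name -> True if the sequence is present there."""
    absent_in_all_humans = all(not presence[name] for name in HUMANS)
    present_in_all_apes = all(presence[name] for name in OTHER_APES)
    return absent_in_all_humans and present_in_all_apes


# Hypothetical candidate deletions (toy data, not from the paper).
candidates = {
    # Missing in both humans, present in all other apes: passes.
    "del_A": {"CHM13": False, "NA19240": False,
              "chimpanzee": True, "gorilla": True, "orangutan": True},
    # Also missing in gorilla, so not human-specific.
    "del_B": {"CHM13": False, "NA19240": False,
              "chimpanzee": True, "gorilla": False, "orangutan": True},
    # Present in one human genome, so not fixed in humans.
    "del_C": {"CHM13": False, "NA19240": True,
              "chimpanzee": True, "gorilla": True, "orangutan": True},
}

fixed_human_specific = [
    name for name, presence in candidates.items()
    if is_fixed_human_specific(presence)
]
print(fixed_human_specific)
```

A definition that only checks chimpanzee would accept `del_B`; adding gorilla and orangutan is what rules it out.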

Finding the Source of an Endogenous Retrovirus

Figure 2E from Kronenberg et al. showing the “source” PtERV1 that is found in chimpanzee and gorilla.

The story of PtERV1 is interesting because it is an endogenous retrovirus that is found only in chimpanzee and gorilla but absent in orangutan and human. Comparing the identified PtERV1s in chimp and gorilla, they determined that only one is orthologous. This “source PtERV1” is a 379 bp element inserted within another LTR, and it was only discovered now because the SMRT assemblies resolved most repetitive regions. The gene tree supports the incomplete lineage sorting hypothesis that this PtERV1 element was integrated 4.7 million years ago and has drifted to extinction in the human lineage.

Human-Specific Deletions Lead to Coding Gene Changes

Figure 4A and 4B from Kronenberg et al. showing a 66 kb human-specific deletion in CARD8 leading to exon loss and a 62.5 kb deletion in FADS2 leading to relative isoform expression changes.

The CARD8 human-specific deletion had not been described before this study. A 66 kb deletion in the coding region led to 13 exons being lost. The FADS1/FADS2 example is fascinating and shows the power of combining genome assembly, comparative genomics, Iso-Seq analysis, and RNA-seq. There’s a 62 kb deletion in the first intron of the human FADS2 gene. As a result, the first intron got a lot shorter. The “longer” isoforms (L1 and L2) are much more highly expressed in human (14% and 3%) than they are in chimpanzee (1.25% and 0.08%). Is it possible this is because shorter introns facilitate high expression? FADS1/FADS2 are associated with fatty acid biosynthesis, so this could be linked to dietary changes during evolution. Zev thinks the human-specific alleles at FADS1/FADS2 were critical to our evolution, but the how and why are still a mystery.

When I saw the FADS2 figure, I could not help but wonder: How many other genes are out there where a reduction in intron size leads to isoform expression changes? It should be possible to systematically interrogate the publicly available great ape dataset to get an answer.
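As a rough idea of what such a screen might look like, the Python sketch below flags isoforms where the human intron is much shorter and the isoform's share of expression is much higher than in chimpanzee. Everything about the setup is an assumption for illustration: the record layout, the thresholds, and the `GENE_X` row are hypothetical, and the FADS2 intron shrinkage is simply taken to equal the approximate deletion size. Only the FADS2 isoform percentages come from the text above.

```python
from dataclasses import dataclass


@dataclass
class IsoformRecord:
    gene: str
    isoform: str
    intron_shrinkage_bp: int  # chimp intron length minus human intron length
    human_percent: float      # isoform's share of the gene's expression in human
    chimp_percent: float      # same share in chimpanzee


def flag_candidates(records, min_shrinkage_bp=10_000, min_fold=2.0):
    """Return (gene, isoform, fold) where the intron shrank and usage rose."""
    hits = []
    for rec in records:
        if rec.chimp_percent <= 0:
            continue  # fold change is undefined without chimp expression
        fold = rec.human_percent / rec.chimp_percent
        if rec.intron_shrinkage_bp >= min_shrinkage_bp and fold >= min_fold:
            hits.append((rec.gene, rec.isoform, round(fold, 1)))
    return hits


records = [
    IsoformRecord("FADS2", "L1", 62_000, 14.0, 1.25),
    IsoformRecord("FADS2", "L2", 62_000, 3.0, 0.08),
    IsoformRecord("GENE_X", "iso1", 500, 10.0, 9.0),  # hypothetical control
]

for gene, isoform, fold in flag_candidates(records):
    print(gene, isoform, fold)
```

With these inputs the two FADS2 isoforms pass (roughly 11-fold and 37-fold higher relative usage in human) and the control does not. A real analysis would also need to account for expression noise and tissue differences.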

What Makes Us Human?

Photo by Liane Metzler on Unsplash

Reading the paper, I got the feeling there’s an underlying theme to the vignettes chosen in this paper, and that is: The human-chimp divergence (or “human-chimp intelligence gap”) is likely the result of many large and small genomic events, all of which contributed to neocortical expansion. On the long end, Fiddes et al. recently found that segmental duplication (SD) led to three functional NOTCH2NL genes that lead to the larger brain size in humans. SRGAP2C and ARHGAP11B are two other examples of SDs that contribute to larger brains. (Zev notes that many of the segmental duplications in great apes aren’t resolved in this project.) On the shorter end, the great ape study found WEE1 and CDC25C to contain human-specific sequence changes due to deletions that were only 107 bp and 1920 bp long. The single-cell work revealed that genes down-regulated in the radial glia are enriched for human-specific structural variations.

But what makes us intelligently human may also make us susceptible. The same hot spots that gave us new genes that increase cell division, delay brain maturity, and regulate DNA binding are also likely responsible for schizophrenia, autism, and other intellectual disabilities [see Dennis et al. (2017); Dennis & Eichler (2016)]. This study revealed that while human and the rest of the NHPs have about the same number of short tandem repeats (STRs) in the coding region, there are 4920 loci with human-specific STR expansions, and those loci include ones associated with genomic instability and disease.

Beyond brain sizes, there are also other traits that make us human. Humans are the only primates that are vocal learners. Our closest vocal learning relatives are bats, dolphins, and then elephants. Then there’s a distant group of birds that somehow also learned to sing. It is most likely that vocal learning was independently acquired, but it’s still staggering just how that could arise through evolution over such a vast amount of time! There are also diet genes known to contribute to human evolution, including the salivary amylase gene (AMY1) (which was not looked at in this study) and the FADS1/FADS2 example. It would be interesting, once high-quality genomes and genome annotations are available, to trace the evolutionary history of each of these vocal or diet genes. Luckily, the answers may just be on the horizon: with the Bat1k project to sequence all extant bat species and the Vertebrate Genome Project to sequence all vertebrate species, it may not be too long before comparative genomics delivers new insights.

The Relentless Progress in Sequencing and Bioinformatics

When I asked Zev if the combination of sequencing technologies and the computational tools used for this paper could be used as a guideline for future projects, I should not have been surprised when Zev outright said “No”.

No, Zev explained, because PacBio has improved its sequencing and its tools so much since the project. The project was done on the PacBio RS II platform and assembled using the FALCON assembly software, which was not diploid aware. Now sequencing is done on the Sequel platform with higher throughput and longer reads, and there’s FALCON-Unzip and even FALCON-Phase. Consensus calling has also improved, likely eliminating the additional short read error correction Zev had to do for the great apes. “A lot of the pain parts of the technology has probably gone away by now,” he said. “The future is diploid assemblies where you can do linkage analysis.”

Zev is right. How many times have I myself read a PacBio paper and wanted to shout to anyone who reads it “Don’t use that tool/method! It’s outdated! We have a better thing now.” when the study was done a mere year or two ago? Just the other day, I referred to the HDF5 format that PacBio transitioned out of as “granny’s old socks”.

Things are moving so fast. And it is only the beginning.

Photo by Mike Arney on Unsplash
