This is a guest post by Michelle Vierra (@the_mvierra)
New data type spawns curiosity at PAG
After another wonderful year at the Plant and Animal Genome Conference (PAG) in January, my colleagues and I were struck by the wide variety of genomes that have been sequenced with PacBio’s new HiFi reads. HiFi data, which is produced by highly accurate long-read sequencing, strikes a balance between read length — with reads up to 25 kb — and accuracy — with reads that are at least 99% accurate. This balance seems to be a winning recipe for assembling complex genomes: the long read lengths easily span shorter repeats, while the high accuracy helps distinguish large, complex repeats. It’s the best of both worlds.
HiFi data was the belle of the ball at PAG, where we saw extremely high-quality assemblies for everything from humans to cannabis, fish, and tetraploid plants. The informatics community jumped right on board too, with three different assembly tools (HiCanu, Hifiasm, and the Nighthawk phasing tool) debuting during the week, each specifically focused on optimizing assembly with HiFi data!
The number one question I was asked at PAG was whether we recommend using HiFi or traditional long reads for extremely large genomes, like the ~15 Gb hexaploid wheat genome. At that point, we didn’t have enough data to give a formal recommendation. However, I did get a boost of confidence in the ability of HiFi data to resolve even the largest plant genomes after Kevin Fengler, Comparative Genomics Lead at Corteva Agriscience, presented his assembly of the 11 Gb oat genome, done in only 12 hours, leading to a contig N50 of over 20 Mb!
With the oat genome success in mind, we contemplated what some of the most famously crazy genomes would look like with HiFi data. Would that balance of read length and accuracy tame even the wildest of genome challenges the world had to offer?
Go big or go home — tackling a giant genome
Then an idea struck. Being a California-based company sitting next door to Stanford University piqued our interest in one of the locally famous species of tree — the towering California redwood (also known as the coastal redwood). The California redwood genome is estimated to be around 27 Gb and hexaploid — a beast of a genome by any measure!
After calculating the tissue material required for DNA extraction and the number of SMRT Cells we’d need to sequence, developing a list of software tools that would be up to the challenge, and cultivating a fearless team of PacBio scientists, we decided to go for it!
Luckily, California redwood trees were planted on the beautifully landscaped, public Stanford campus. Emily Hatas, our senior director of business development and fellow plant enthusiast, Greg Young, our Bay-Area-based senior field application scientist, and I packed up some ice, scissors, and a kitchen scale, and headed over to the trees one sunny Monday afternoon. After a quick rinse and flash freezing process, we accomplished step 1 — sample acquisition. We then enlisted the help of our applications development group to isolate DNA with the Circulomics Plant Nuclei kit and generate a HiFi library worthy of the cause, completing step 2 — sample preparation.
After a quick single SMRT Cell test to ensure library quality, we went into full production mode, sequencing 606 Gb of HiFi data over a period of 7 days. This data represented roughly 22-fold coverage of our anticipated 27 Gb genome. We have observed in many HiFi genome assembly projects thus far that the traditional method of generating high coverage to polish out errors isn’t needed, and excellent assemblies have been generated from only 20-fold coverage of HiFi reads. Having hit our coverage target, we felt comfortable cracking on with the genome assembly.
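For anyone who wants to check the coverage arithmetic, here is a minimal Python sketch. It assumes fold coverage is simply total sequenced bases divided by the estimated genome size. The function name `fold_coverage` is made up for illustration and is not part of any PacBio tool.

```python
def fold_coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return fold coverage: total sequenced bases / estimated genome size.

    Both arguments are in gigabases (Gb). Hypothetical helper for illustration.
    """
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Figures from the post: 606 Gb of HiFi data, ~27 Gb estimated genome size.
coverage = fold_coverage(606, 27)
print(f"{coverage:.1f}-fold coverage")
```

With the numbers above this works out to a little over 22-fold, just past the 20-fold mark mentioned in the text.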
Greg Concepcion, our staff engineer of bioinformatics and resident large genome wrangler, then took the reins for a first attempt at this giant genome assembly. Greg chose Hifiasm since it’s been reported to be one of the fastest assemblers and also focuses on resolving haplotypes, both features important to resolving a 27 Gb hexaploid genome.
After just 6 days on 64 cores with 512 GB of RAM, the assembly finished with no issues along the way, a real testament to the clever coding by Haoyu Cheng and Heng Li in the Hifiasm assembler. The results were amazing: an assembly almost twice the size of the expected genome, with a contig N50 of 1.92 megabases! The larger than expected assembly size, which appears to represent two similar haplotypes rather than the six expected for a hexaploid, seems to agree with the suggestion that the most recent polyploidization event for the California redwood was an autopolyploidy event, as described by Scott et al. Overall, we are very pleased to see the improvements that this genome assembly represents over other recent conifer genomes.
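If contig N50 is a new term for you, the sketch below shows one common way to compute it: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly size. The function name `contig_n50` and the toy contig lengths are made up for illustration. They are not taken from the redwood assembly.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the contig N50 for a collection of contig lengths.

    N50 is the length of the contig at which the cumulative length,
    counted from the longest contig down, first reaches at least half
    of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length
    return ordered[-1]  # not reached for non-empty positive input


# Toy example with hypothetical contig lengths (arbitrary units).
toy_contigs = [2, 3, 4, 5, 6]
print(contig_n50(toy_contigs))
```

In the toy example the total length is 20, and the two longest contigs (6 and 5) together cover more than half of it, so the N50 is 5.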
No genome too large for HiFi reads
So, what do we get out of our work on the California redwood genome? First, I am now confident in my recommendation to use HiFi data to generate high-quality genome assemblies from any organism. This redwood genome hits the mark on all 3 C’s of genome assembly quality — contiguity, completeness, and correctness. Secondly, it shifts the thinking around large, complex genome assemblies, previously thought to take a ton of time and compute resources for assembly — not to mention the sequencing time. This massive genome was put together in just 17 days — 4 days of sample prep, 7 days of sequencing, and 6 days for assembly. I remember not that long ago it took about the same amount of time to assemble a human genome! High-quality genome assemblies for any organism are now truly accessible to anyone wanting to do one.
Lastly and most importantly, we get to share a great new resource with the community so you can explore the data for yourself. We’ve opted to make the assembly and data totally public for anyone who wants to try new or additional tools, browse the genome for interesting biology, or just prove to themselves that the assembly can be done. We are hosting the data and assembly here, so have at it and happy genome exploring!
Although we make it sound easy, we could not have done this project without the amazing team of PacBio colleagues who dedicated their time and expertise to make this project happen. A big thank you to: Lee Chern, Primo Baybayan, Lei Zhu, Harsharan Dhillon, Greg Buck, Patrick McNamara, Richard Hall, Jonas Korlach, and all our bosses for letting us do it.