The End of A Chapter: A Memoir on Bioinformatics Development Work for Single Molecule Sequencing (Part 1/N)
Illumina acquiring PacBio on Nov. 1, 2018
I left PacBio a bit more than a year ago so that I could work on different things. Still, the announcement last Thursday (Nov. 1, 2018) that Illumina is acquiring PacBio had a strong emotional impact on me. I experienced so many interesting stories during my nine-year tenure at PacBio, and I want to write some of them down before I forget. I have decided to share a few of these stories here. I hope you will find some of them interesting if you work in DNA sequencing or genomics.
From SOLiD to SMRT
About 10 years ago, I was working at Applied Biosystems doing what people might call “data science” today. We analyzed data for scientific instrument development, more specifically for a DNA sequencing technology (SOLiD: Sequencing by Oligonucleotide Ligation and Detection) that Applied Biosystems was developing at that time. I had quite some fun trying out interesting ideas, e.g. using non-parametric methods rather than model-based methods for base calling, and designing synthetic DNA sequences with interesting, non-trivial graph properties for testing. (Does anyone who did SOLiD sequencing remember “SynBead”? I designed the first version of the sequences based on a couple of interesting properties of a de Bruijn graph.)
While I was having fun analyzing “big genomic data”, I got a call from PacBio’s HR asking whether I was interested in a job there. I went through the interview process. I was not sure whether the technology would work. Eventually, I decided to join PacBio to experience a “Silicon Valley” start-up first hand and to satisfy my curiosity about whether single molecule DNA sequencing was possible.
In fact, I had applied for a job at PacBio a year earlier, and I never heard anything back for the whole year I worked at Applied Biosystems. I was very surprised they had kept my resume. Once I started as a software engineer/bioinformatist at PacBio, I understood why: there had not been much need for a bioinformatist, as not much DNA sequence had been generated before I joined the company.
Our earliest work there was more like a “data science” job than a bioinformatist’s one, even though the fancy term “data science” had not been invented, or at least was not popular, yet. In many experiments, it was not even clear whether the DNA sequences we got were real or just random noise. Our work included building a statistical analysis framework to discriminate real signal from “random” noise and developing metrics to improve experimental processes and guide protocol decisions. We also built an internal web-based platform for analysis automation from scratch. As an example of such protocol decisions, the current sample format “SMRTbell” used by PacBio, invented by Kevin Travers, is essential for circular consensus sequencing (CCS, recently rebranded as HiFi reads for long DNA sequencing). In the early days, we were not sure whether it was the best choice compared with the alternatives. There were also many possible design choices for the hairpin adapters used in “SMRTbell”. We, as bioinformatists or data scientists, needed to help Kevin build pipelines and perform analyses so we could pick the best solution given the available evidence.
Sequencing an E. coli Genome in Many Weekends
Internally, there were about 10 prototype machines that the scientists called “Astros”. The machines were all named after characters in the comic book “Astro”. Each machine could run “chips” that each had 3,000 ZMWs (zero-mode waveguides, the nanostructures used for sequencing DNA). At that time, we barely got any sequence longer than 1 kb, and the number of “useful” sequences we could get was minimal (~100 to 500 useful sequences per chip). In fact, the first “big” project I was involved in was to demonstrate that the single molecule sequencing method was capable of sequencing an E. coli genome, so we could announce the achievement at the AGBT conference.
My good friend Paul Peluso led the multi-weekend effort (yes, weekends, as the prototype machines were occupied during weekdays for many other purposes) to achieve the goal. I remember there were a lot of skeptical voices wondering whether it was achievable. The “useful sequencing capacity” of the systems was so low that we would need a couple of quantum jumps in the technology to get it done in time. We eventually made those quantum jumps and did it. The CTO was able to show some of the data at the AGBT conference a couple of months later. There were no algorithms or software for processing such “high noise” (~20 to 30% errors) data. My involvement was in “making the data work”. I pulled a couple of tricks using ideas from Bayesian statistics and simple k-mer context to call SNPs reliably from such noisy data. While the methods I developed are totally obsolete today, they helped demonstrate that the high error rates of single molecule sequencing could be overcome, so we could actually build applications on top of it.
Cholera Project: A Project With Real Impact Even With Very Noisy Sequencing Data
The next “big” project I was involved in, and one with more external visibility, was the “Cholera” project. During the 2010 Haitian cholera outbreak, Eric Schadt, the well respected Chief Scientific Officer at that time (now CEO of Sema4), and the company decided to help microbiologists identify the source of the outbreak strain. This was one of the most exhausting and most exciting projects.
There were a couple of challenges. First, there were not really any machines running at “production” level. In fact, there were just a couple of “breadboard” machines serving as prototypes of the first SMRT sequencing product, the “PacBio RS”. While those machines were capable of sequencing DNA, there was barely any robust hardware and software integration. Second, there were no mature software and algorithms for using high-noise data. The bioinformatics software team was only gradually figuring out how to handle single molecule sequencing data, for example by building a new sequence aligner that could handle the noisy reads efficiently. Initially, we did not have much of a clue how to achieve the goal from a bioinformatics point of view. While we had a couple of bioinformatists who were good at software engineering and/or algorithm development, none of us specialized in microbiology or pathogen detection.
Nevertheless, spirits were high. We gathered the engineering team, the chemistry team, the biologists, and the bioinformatics team in one room to form a sequencing and analysis plan, and we delivered. If I remember correctly, it took only about three weeks from the time we got the samples to the time we got a manuscript accepted by the New England Journal of Medicine. To me, publishing scientific work at this speed was unheard of. Certainly, it was crucial to publish our results quickly so scientists could better help patients in Haiti.
I was lucky to be the first author of the paper. I guess that, because the noisy data was hard to use, the “creative” thinking from my team (my ex-boss Jon Sorenson and my team members Dale Webster and Jim Bullard) on how to analyze the data and reach rigorous conclusions helped a bioinformatist get the first author spot for this kind of work.
Using “error correction” for noisy reads with insertion and deletion errors seems “obvious” now, but it was far from obvious in those days. Today, even PacBio’s competitors (e.g. Oxford Nanopore) need and use it. For example, some Oxford Nanopore customers are likely using methods and code I helped develop at PacBio if they use Canu to assemble a genome. Without error-corrected sequences, there was no clear path to doing “standard” bioinformatics on PacBio data to identify the origin of the outbreak strain, and the whole team had to learn microbiology and pathogen detection in a very brief period of time. We ended up solving the puzzle using a combination of structural variation detection and SNP calling. One of the breakthroughs during the analysis was figuring out that there were actually typos in a table of SNPs for previous outbreak strains in one of the papers we used. Yes, even with all that noise, PacBio’s data were more reliable than some publications :).
The paper showed that it would be possible to help microbiologists do “almost” real-time whole genome pathogen detection once the system performance and the product matured. The data in the “cholera paper” was not “awesome” in terms of accuracy and throughput. It perhaps drew some criticism and did some damage commercially: PacBio’s competitors started campaigns criticizing the accuracy of single molecule sequencing. Nevertheless, it was definitely an important milestone for the genomics field, as one can see from the many similar publications using other sequencing systems after our work.
Later, we did it again for the 2011 German E. coli O104:H4 outbreak. Initially, we thought it could be a distraction that took development resources away to work on such a scientific project. Before long, it became obvious that PacBio’s single molecule long reads were very important for revealing biological insights that could help the doctors treating infected patients. We again published the results in NEJM. This time, we were in better shape, as we had a real product, the PacBio RS, and we actually knew how to assemble the genomes thanks to the work of my colleagues (Dale, Ali Bashir, and Aaron Klammer). We resolved a couple of nasty antibiotic resistance genes and we learned more about bacterial genomes.
(I became fascinated with the very interesting repeat patterns in bacterial genomes and always wondered what was going on. One day I learned that some of those fascinating repeat patterns were related to the CRISPR-Cas9 system.)
Although PacBio was a very demanding workplace at that time, it had a very exciting and energetic culture. It was perfect for me to gain new knowledge in software development, algorithms and science. I got opportunities to think in new ways about how to make the data more useful for various scientific problems, even while most genomics researchers still thought PacBio’s data were too noisy to be useful. The pace was fast. Every hour of every day, ideas were created, and many were tossed out as soon as they proved not useful. Fast idea creation and implementation were the key to some of my early success at PacBio. Building real scientific instruments and participating in scientific research in Silicon Valley was more satisfying than doing SEO or A/B testing to increase ad revenue, although the financial reward for building “scientific hardware” may not be as good as working for the big “dotcom” companies.
The company was also trying to expand the product line even though the PacBio RS machines were not yet robust in our customers’ hands. We had not really persuaded the field that long reads were necessary, and it would be fair to say that 99% of bioinformatists at that time did not know how to handle single molecule long read data. In the meantime, Illumina’s sequencer throughput was getting higher and higher. Starting with the 1000 Genomes Project, Illumina enabled many large-scale projects while PacBio was still trying to find a niche to fill. With a new product planned, the company was actually expanding fast. Then one day a decision was made to shift priorities. Instead of rolling out another product, we needed to make the RS robust and reliable, we needed to show the world why long reads were useful, and, probably worst of all, the company needed to cut costs during the overall economic downturn.
One day, the company laid off 28% of its employees and the stock price crashed. It was a very emotional day. Many great colleagues were let go. People who stayed were not quite sure about the future. The overall economy was not bright. Most bioinformatists and software and algorithm developers survived the layoff. I thought we survived because the company still needed to demonstrate its value. I felt lucky to be analyzing single molecule DNA sequencing data before anyone else had seen it; we could make so many discoveries that no one had thought about. The flip side was that we still had to overcome a hurdle: we were generating data that most bioinformatics practitioners did not know how to work with, and biologists had yet to understand the value of those unforeseen discoveries.
It was a very sad day. However, we were able to focus on the right path again.
A couple of days later, I left my office around 7:30pm and looked around the company’s parking lot. I was so moved that the parking lot was still quite full. (In contrast, I still drive past the company sometimes as it is on my commute route, and there are rarely cars in the parking lot after 5:30pm these days. I guess it is a good thing that the working culture has become healthier.) My colleagues did not give up. We believed we had one of the best technologies for DNA sequencing and it just needed some more time to mature. People kept working very hard and kept innovating in all fields: biochemistry, sample preparation, manufacturing improvement, better field support and problem diagnostics, etc. The list goes on. But there were also aftershocks. We had a really bright bioinformatics algorithm development team, but some people did not feel secure and left for more secure or more financially rewarding positions in Silicon Valley (not hard if you have a CS degree). Some decided to start their own companies too. We had to start over.
Assembling Genomes with Very “Noisy” Reads Alone
One niche application that PacBio focused on was genome assembly. Aaron Klammer and Ali Bashir were the main developers working on the genome assembly problem.
In the early days, there was a term, “one-contig”, that executives and the marketing department used to push for development, while our bioinformatists were either scared of it or joked about it. The thinking was that if we could assemble a bacterial genome into “one contig”, the company would be successful. However, it was not clear that the conditions necessary to make that happen were fully understood.
We had proposed hybrid assembly approaches, using Illumina’s more accurate reads to correct PacBio sequences. It worked, but it was less than ideal for both technical and commercial reasons then. (Now, since PacBio will be absorbed by Illumina, maybe it will come back in the future.) Ali and Aaron were working very hard on a couple of ideas and collaborations for genome assembly. After the layoff, Ali, who was in my group, decided to go back to academia, and Aaron needed to take up software development responsibilities. After this aftershock, I guess assembly development became my responsibility. I knew the general concepts of genome assembly, and I had used others’ code to make an assembly and do some simple analysis, but I had not written a single line of code for the genome assembly problem and still got confused by the terminology. I felt embarrassed that I had to ask Aaron, “What is the difference between ‘contigs’ and ‘scaffolds’?”
In fact, my role was manager of the internal data science group, and my “hobby” was to understand and build an error correction “theory” for effectively correcting sequences with insertion and deletion errors. Besides the formal development path of a consensus algorithm (named “EviCon”; I guess no one remembers it anymore), I was playing with a couple of ideas and some implementations for “fun”. In my own opinion, the first error-correcting code base, “EviCon”, had a number of design flaws. The EviCon developer left the company soon after the layoff, so it was not a bad idea to start from scratch anyway. Patrick Marks, a seasoned and talented algorithm developer, David Alexander, a new hire then, and I had a number of different ideas. We dropped EviCon and started over. I developed a couple of related algorithms for our initial full-length HLA sequencing and haplotyping work, and David and Patrick developed the Quiver/CCS consensus algorithm, which was the first consensus algorithm that eventually did the job right.
After the experience of treating the HLA sequencing problem as a machine learning problem (clustering noisy data, cleaning up and validation), we had some confidence that we could construct high-fidelity consensus from high-error-rate reads. The systems’ performance was also improving; we were getting longer reads with fewer errors. It was about time to try a couple of new ideas.
The very first version of the code I wrote for error correction was very slow. It could not be used for genome assembly purposes. It took quite a number of iterations and new ideas to get satisfactory computational efficiency for genome assembly. If I recall correctly, Paul Peluso generated a new E. coli dataset to test a new chemistry, and it was used for various development purposes. I peeked at the dataset, looked at the raw accuracy estimates and the length distribution, and felt I should test out the idea that we later called the “hierarchical genome assembly process (HGAP)”.
Most “formal” consensus algorithm development at PacBio focused on using a reference sequence as the original template. For de novo assembly, there is simply no reference. When I was working on the HLA-related projects, since the diversity of HLA alleles is very high, I built some code for consensus / error correction without a reference to avoid “reference bias”. Namely, if you use a reference, your results will look more like the reference and you can lose subtle but important information where your data actually differ from the reference.
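To make the idea of reference-free consensus a little more concrete, here is a toy sketch in Python. It is not the code I wrote at PacBio. It assumes the reads have already been aligned to each other and padded with "-" for gaps (the hard part in practice), and the function name `majority_consensus` is made up for illustration. Each column is decided by a simple majority vote among the reads themselves, so no reference sequence can pull the answer toward itself.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over reads that are already aligned.

    Toy illustration only: every read must have the same padded length,
    and gaps are marked with `gap`. No reference sequence is involved.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this column, gap included.
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)  # keep the base; a winning gap is dropped
    return "".join(consensus)


# Five noisy copies of the same region: one insertion, one deletion,
# one substitution. The vote removes all three errors.
reads = [
    "ACGT-ACGT",
    "ACGTTACGT",  # extra T inserted
    "AC-T-ACGT",  # G deleted
    "ACGT-ACCT",  # G read as C
    "ACGT-ACGT",
]
print(majority_consensus(reads))  # ACGTACGT
```

Real consensus algorithms such as Quiver use much richer error models than a plain vote, but the sketch shows why enough noisy reads can correct each other.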
Given that I already had some code for that, all I needed was to find reads that should be bundled together into a group so I could use them to correct each other. At that time, “blasr”, Mark Chaisson’s long read aligner, was already mature and feature complete. “Blasr” was not designed for aligning raw, high-error-rate sequences to each other, but it was quite flexible. I was able to use it for raw-to-raw alignments by tweaking the parameters. Then I wrote a couple of disposable scripts that took the alignment results and the reads and sent them to my code for error correction. At that time, I would not have called myself an “assembler developer”, but I knew how to get Celera Assembler running. Once all the ingredients came together: Boom!! “One-contig”!! (Well, in fact, I don’t remember whether we got one contig or not. Surely we got a very good assembly.)
I still remember the moment when PacBio CEO Mike Hunkapiller visited our office for some other information and I took the opportunity to show him the results and explain what I did. I told him, “This is big.” I thought Mike had a facial expression somewhere between “perhaps” and “what does this mean?” 😀 There was no “formal” project, no product manager and no other “senior executive officer, science one or not”, who saw this coming.
Later, I started to build a more maintainable code base by myself. There was still no real company project for “HGAP” for quite a while. We started to publicize the approach to some of our close partners at that time, for example JGI and the Broad Institute. JGI loved it, and we did a couple of follow-up development projects together for a while. The Broad Institute basically ignored it. I tested the general approach on a Plasmodium dataset for the Sanger Institute and on a larger fungal genome (~60 Mb) from JGI. Even with the 80% AT content of the Plasmodium genome, the results looked really good compared to any previous attempts.
Certainly, this achievement helped me advance my career inside PacBio. More importantly, it helped me build relationships with other scientists and algorithm developers outside the company, and many of us became good friends. Priceless.
Now, once it was demonstrated, the fun began. In collaboration with Jonas Korlach, Evan Eichler from the University of Washington, and Alex Copeland from the Joint Genome Institute, we published a paper demonstrating the assembly of high-quality bacterial genomes with only so-called “noisy” long reads. One fun challenge was that it was relatively hard to explain the “correct” bioinformatics ideas and approaches, and how resources should be spent on bioinformatics development, to some C-suite officers in the company. Even with such difficulties, I survived at PacBio.
Our Own Single Molecule Sequencing Human Genome Project
We worked to extend the HGAP approach to larger and larger genomes. Paul Peluso and David Rank were growing our own Arabidopsis strains from seeds in an office next to the CEO’s office so we could have our own DNA source for larger genomes. We went into panic mode when we found our assembly results were very different from what was expected. Liz Tseng and I did some analysis and were able to conclude that the seed provider must have mislabeled the seeds. I started to worry about all the research using seeds from that provider. Mislabeling happens, but how would people know about it? We would suggest that seed providers get a PacBio machine and check their strains regularly.
I think we then did Drosophila and a couple of other model organisms with small genomes. Of course, ultimately, we needed to do a human genome. We worked with Evan Eichler to decide which human sample would be the most useful for the scientific community and would also demonstrate the capability and potential of the PacBio RS II SMRT technology. We ended up choosing CHM1. Even today, the human genome reference is a mosaic. Namely, the sequences of different regions, even on the same chromosome, actually come from different individuals. A couple of individual genomes, e.g. Craig Venter’s genome, had been sequenced de novo by then. However, humans are diploid organisms: we have two different copies of each chromosome. CHM stands for “complete hydatidiform mole”. My understanding is that DNA from a hydatidiform mole is essentially haploid, with one copy per chromosome instead of two. Since it is simpler, it helps with rigorous evaluation work.
As usual, Paul worked very hard to collect the sequencing data. I was scratching my head over how we could handle the computational needs. The only tool I had for overlapping the reads at that time was “blasr”, and “blasr” was not fast at overlapping reads. By my estimate, the CPU hours needed were in six digits. Our IT had just gotten us 3 nodes of 16 or 24 CPU cores each. Even with 72 CPUs to spare, if the compute need was 100,000 CPU hours, I would need about 58 days (100,000 / 72 / 24) to finish the overlapping. There were fewer than 90 days available from late November to the next year’s AGBT. I started to look into cloud computing solutions, but the only tool I had was “StarCluster”. While “StarCluster” was great for duplicating “very traditional” HPC in the cloud, I doubted it would work well here, and most of the cloud computing tools we can use today to run customized large computations were still in their early stages. Anyway, I had to do some experiments on AWS with my own credit card. Once, I hinted to Mike Hunkapiller that my credit limit might not be high enough to carry out the necessary computation. Mike actually told me: “You can use my American Express Black Card, it has no limit.” Wow.
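The back-of-the-envelope estimate above is just division, and a few lines of Python make it easy to replay with other numbers. This is a sketch that assumes the overlapping work parallelizes perfectly across cores, which is optimistic for a real cluster; the function name `wall_clock_days` is made up for illustration.

```python
def wall_clock_days(cpu_hours: float, cores: int) -> float:
    """Wall-clock days needed, assuming perfect parallel scaling over `cores`."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return cpu_hours / cores / 24.0


# 100,000 CPU hours on 72 cores (3 nodes x 24 cores).
print(f"{wall_clock_days(100_000, 72):.1f} days")  # 57.9 days

# The 450,000 core hours used in the Google run described later in this
# post would have taken far longer on the same 72 cores.
print(f"{wall_clock_days(450_000, 72):.1f} days")  # 260.4 days
```

Either way, the overlapping step alone would have eaten most or all of the time left before AGBT, which is why I had to look outside our own cluster.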
I ended up collaborating with Google, thanks to an ex-PacBio employee who had gone to Google as a product manager and made the connection. Working with a Google engineer and utilizing some special computing sauce at Google, we carried out 450,000 CPU core hours of computing with a customized version of blasr in one day. (Google was not interested in genome assembly after this work. It might be a good thing. With the algorithm development that followed, we can now comfortably assemble a human genome with 200–20,000 CPU hours depending on settings and data type.) It took me three days to transfer a couple hundred gigabytes of data back to PacBio for error correction on our (relatively poor) three-node cluster. Then, after another week, we had the first human genome assembled by SMRT sequencing, with an assembly contig N50=4.Mb, which was unprecedented.
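For readers who are not familiar with the N50 metric mentioned above: sort the contigs from longest to shortest and walk down the list until the running total reaches half of the total assembly size; the length of the contig where that happens is the N50. A minimal Python sketch of this standard definition, with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy assembly (units are arbitrary): total = 23, half = 11.5.
# 8 alone is not enough, 8 + 5 = 13 crosses the halfway point, so N50 = 5.
print(n50([2, 3, 4, 5, 8, 1]))  # 5
```

A higher contig N50 means that more of the genome sits in long, uninterrupted pieces, which is why the number matters so much when comparing assemblies.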
Next: To Be Continued
Well, I am not a fast English writer, and this partial memoir is already in the TL;DR category. Not that I can write fast in my native language either, but the extra layer of neural processing in my brain makes me tire faster 😀. I also suspect I have some form of dyslexia. Please forgive my non-native English expressions. Grammar and spelling corrections are welcome. There are more stories I would like to share another time.
A couple of ex-PacBio employees from the bioinformatics software / algorithm team had thought about a reunion before the acquisition announcement. Along this line, the company reached a lot of milestones, and many people, not only bioinformatists, who helped bring them to fruition have left the company. I think it may not be a bad idea to ask this shamelessly: if PacBio has some sort of celebration for the achievements of SMRT technology, it would be great if Mike Hunkapiller could invite us back for a reunion and tell some stories about the many difficult milestones we helped reach. Not sure if Mike will read this, or read all the way to here, though. 🤣
Also, this writing is about my personal experience, so it is certainly biased toward the bioinformatics and data science stories that I know about. I just want to remind everyone what I said in this tweet. There must be many other interesting war stories that the enzymology group, the surface group, the chip design group, the robotics group, the primary analysis group, the manufacturing group, the commercial group, etc., can tell about how they solved very difficult problems for single molecule sequencing and the business. I know some, but I am not qualified to talk about them. Maybe someday someone can tell us, from all aspects, how we made high-quality genomes possible.
Also, I can’t say my memory is 100% correct. If you worked at PacBio or still work there and think my description is inaccurate or improper (for legal reasons? I don’t want to get into trouble), or you don’t want your name to be used, please let me know.