Simple Convolutional Neural Network for Genomic Variant Calling with TensorFlow
AI/Machine Learning in Biotech Startups
No doubt that the fast development of recent progress using deep neural network has change the way that we can solve various problem from image recognition to genomics. Many startups (Viome/CZ Biohub/Deep Genomics) has emphasize using “AI/machine learning” for their research and product. Other relative “big” biotech startups in Silicon Valley, e.g., Verily, Calico and Grail are either funded by Alphabet or have deep connection to Google.
Certainly, these companies will be focusing on applying machining learning, deep neural network and challenging statistical analysis on the large amount of data that they can collect (with the deep pocket of investment from various sources.). Google Research also has published a paper presenting “DeepVariant” for doing single nucleotide polymorphism (SNP) calling from DNA sequence data using Google’s internal computing power and provide it as a cloud service.
Working for a company providing research instrument (Single Molecule Real Time DNA Sequencer) for genomics and having day-to-day job on analyzing genomics data and algorithm development, the “DeepVariant” work certainly drew our attention. The work is impressive that making a general designed neutral network (Inception v2) for something it is not initially designed to do.
While I have been in the field of bioinformatics for more than 10 years, I do not work much in the subfield of “variant calling”. Algorithms and tools for variant calling focus on identifying small differences between genomes from sequencing data. In the last couple of years, I mostly work on re-constructing genome “de novo”, namely, building genome sequences without a priori knowledge of a genome. However, recently, I start to get interested in modern day machine learning tools and theory, especially in the form of artificial neutral network (ANN). It might be a good exercise to work out some small problems as an exercise, so I consider the variant calling as an application of ANN a little bit more recently.
What Is Variant Calling?
What is “variant calling” in Genomics? A human genome has about 6 billion letters. If you think the genome is like a book, it is a book about 6 billion letters of “A”, “C”, “G” and “T”. Now, everyone has their unique genome. However, scientists find most part of the human genomes are identical to each other. With the Human Genome Project, the scientific community has built a set of sequences as the human genome “reference” as the basic book for genomic analysis work. Instead of re-writing the book every time from scratch for everyone’s DNA sequencing technologies provide information, e.g, short piece of the text in the book, that can be used for identifying the differences of each personal genome “book” to the reference book.
With the human genome reference, scientists and commercial service providers can just sequence short DNA fragments and “map” those fragment reads to the reference identifying those small differences. As one don’t need to reconstruct the genome sequence from nothing like de novo approach, this is an easier and cheaper approach to get some useful genetic information. While this is cheaper, it is not without some caveats. If there are some more personal specific DNA sequences that are not in the references, it can cause errors in the mapping process for assigning the reads to the reference. There are also many identical or near identical sequences in the reference. If a read does not contain some “unique” signature, then we don’t know where to place the reads. There are also sequencing errors. Not all read we get from a sequencer can be 100% accuracy.
Basically, to call variants correctly, we need to identify (1) the correct location a read is from in a genome, (2) correct alignments between the mapped reads to the reference and (3) correct statistics used to identify the variants from the alignments. Usually, (1) and (2) are taken care of by “genome aligner” or “genome mapper”. For example, “bwa” is a popular genome aligner developed by Heng Li. Once we get the DNA fragment reads mapped, bioinformatics practitioners can visualize them by generating some “pile-up” views (see below). In most case the differences from the sequence data to the reference are obvious. With some simple counting algorithm, we will have a reasonable good variant caller. In the meantime, the genome and the bioinformatics processes can introduce artifacts, combining with sequencing errors, making reliable variant calls relative tricky.
If you read the “DeepVariant” preprint, you can see that the authors developed the work for a competition on “PrecisionFDA Truth Challenge”. I think one goal of the Challenge is to build a standard evaluation variants set so one can use it to test variant calling pipelines and algorithms. The organizers of the Challenge generated a “high-quality variant” datasets for those variants that can be called with multiple sequencing technologies in those regions that the read-reference mapping process is not an issue. These datasets are used for evaluating the quality for variant calling algorithms.
Using CNN (Inception v2) for Variant Calling
The DeepVariant preprint shows how it uses Google’s DNN infrastructure to make a variant caller that wins the PrecisionFDA Challenge. I don’t think I can get into more details of the DeepVariant method here. In the meantime, the concept is quite simple. As we see in the pileup view shown above, most cases will be simple but there will be some difficult case. In some of complicated alignment cases, most time, a human being looking at the alignment view carefully can call variants better than some simpler algorithms. If such images may contain the proper information got those fuzzier variant calling cases, a neural network that works well for image classification will likely be able to do a better job for calling those variants then hand-coded algorithm that may not catch some edge cases well.
The authors convert the sequence alignment data of potential variant candidates into the “pile-up” images. The different color and alpha channels are used to encode the read sequences, per base quality and the strand of a read. They use almost all high confident variant calls identified by the Challenge organizers in a human genome to train the neural network and classify the called variants into three categories: homozygous variants, heterozygous variants and none variants. The evaluation results are summarized in a table inside the preprint:
The DeepVariant shows good performance on variant calling for many different sequencing data types. Nevertheless, as the authors have pointed out in the preprint, it would be sub-optimal when one uses an image classifier for genomic variant calling. There could be many valuable information lost in the process converting the sequence data to images. With single molecule sequencing, the error rate can be higher. One can already visually see the pileup images from single molecule sequencing are quite different from those of short read sequencing. Due to the different nature of single molecule technologies, such pile-up images may not capture proper information for single molecule sequencing.
I have been playing with Tensorflow building neural network models for various sequencing related analysis last couple of months. Although I don’t typically work much on variant calling problem, I think it is useful to test it out as a small exercise to see if we can call variants from single molecule DNA sequencing data by encoding the data for a neural network to learn.
A Simpler CNN for Variant Calling
Different DNA sequencing methods have different error modes. In single molecule DNA sequencing, one gets weak raw signal as there is no chemistry step to amplify signal. Also, the single molecule DNA sequencing is typically done by observing single molecule events in real time. We get extra bases or miss some bases in the DNA fragment reads. We call such errors “indel (insertion-deletion)” errors. Most single molecule technologies have higher indel error rates than other shot read technologies now. However, single molecule technologies allow us to read really long pieces of DNA fragment. In comparison, typically, single molecule technology can read DNA 10 to 100 times longer than non-single molecule DNA sequencing technologies. More indel errors make many things more challenging. On the other hand, longer DNA fragment reads helps a lot to resolve ambiguities as the longer DNA fragments likely contain information that are unique in a genome. We will be able to identify the where a read comes from in a genome correctly.
The impact of the “indel” errors is that it makes the alignment (base to base corresponds between the DNA read fragments to the reference) “fuzzier”. Thus, information about a variant may not be concentrate at the variant site but being scattered apart. Like DeepVariant, if we can collect the information beyond the variant sites in a concise way, a neural network will be able to use that extra information for better classifications. We also do not have the resource like Google that can train a big neural network model. Can we make a small neural network variant caller that can be trained using finite amount of resource?
In fact, it is quite straight forward to build a small convolutional neural network for variant calling with Google’s Tensorflow and train a small neural network with a limited amount resource. I build a quick prototype that I call “VariantNET”. We don’t generate alignment pile-up images and use a big image classifier CNN for variant calling. We summary the alignment information into three 15 x 4 matrices. We first do some simple statistics to build a list of variant candidates. For each candidate, we add 7 base flanking bases on both side. Including the variant candidate site, we consider 15 bases in the reference at once.
For each position, we use one-hot-like code to generate the matrix. Namely, for each position, we track 4 counting number for each base of “A”/”C”/”G” and “T”. One matrix track the baseline encoding the reference sequence and the number of supported reads through a position. Two other 15 by 4 matrix track the differences between the sequencing data and the references. One matrix is design to track the difference from all bases from the sequence data and the other one focus on missing bases on the reference. We think such summary may capture the useful information more concisely than the alignment pileup image. For example, if there is an extra base in the reads, it is usually just a simple marker in the pile-up image and we might loss the information about what kind of DNA nucleotide the extra base is.
For each candidate variant site, a “tensor” of shape 15 by 4 by 3 is sent to an almost embarrassing simple convolutional neural network (2 convolution + maxpool layers followed by 3 full connect + drop-out layers) for variant calling. We design the network output to call a candidate as a “heterozygous variant”, “homozygous variant”, “non-variant”, or “complex-variant”. Beside classifying the variant type of the candidate sites, the VariantNET is also trained to predict to the possible variant bases. Our preliminary experimental results using such simpler CNN for variant calling show quite good performance with minimum tuning. In the example, we only use a small training set. The training typically runs less with 10 minutes when using pre-identified variant call from Chromosome 21. We test the prediction on Chromosome 22, we typically get both the recall rate and PPV above 98%. As we have not done any optimization on the alignments and using any other information other than the bases generated by PacBio’s sequencer, it won’t be surprise some further optimization can bring it to higher variant calling accuracy.
I have yet to perform larger scale variant-calling experiments, so a direct head-to-head comparison is not available now. A full evaluation that leads to an academic paper write-up could take a while. I don’t have plan to do that yet. I think this small exercise already show it is possible to use a small neutral network rather than a big one for improving variant calling accuracy for more noisy sequencing data, assuming the errors are not systematic.
The VariantNET could probably handle majority isolated SNP calls. When there are variant nearby or the variant is a small indel, then we current put the variant call into the “complex variant” bucket. This may not be ideal. For those complex variant, a classifier like VariantNET can serve the role for increasing the confidence of a candidate has indeed variants nearby. We can generate de novo consensus for final variant calling to avoid the alignment artifacts caused by the bigger differences between the reads and the references.
I am sure there are others who might be working on using neural network for variant calling too. Using ANN for variant calling seems obvious thing to do anyway. Fortunately, as we show here, we might not need a big machine to perform the work. We use a very simple neural network and very minimum information from the sequencing reads but we get reasonably good results in our first try. It will be exciting to see how incorporating other useful information beside the bases and using better preprocessing algorithms can further improve the variant-call qualities. I am not sure how much I will work along this line at this moment. Nevertheless, if any of the reads are interested in some discussion, please let me know. It is fun stuff combining today’s two hot topics: machine learning and genomics!!
Thanks Alexandr Kalinin for pointing out their great and comprehensive review paper for the current state of art on deep learning for biology and medicine: http://www.biorxiv.org/content/early/2017/05/28/142760, there are a lot of useful information and discussion. I highly recommend it.