Tumor-normal sequencing: is this variant real? — deep learning and Fast.AI library to the rescue
“All classifications in this world lack sharp boundaries, and all transitions are gradual.”
Solutions to many biological problems, especially in cancer, have been greatly advanced by DNA sequencing. Cancer is a disease of the genome. The past decade has seen remarkable advances in the characterization of point mutations and structural alterations in a wide range of cancers, all thanks to next-generation sequencing technologies.
Several factors complicate cancer variant calling, and the variants called by cancer workflows often include a fraction of false positives.
Here is a list of just a few reasons why this may happen:
1. Low tumor DNA content
2. Tumor heterogeneity — different tumor cells have distinct morphological and phenotypic profiles
3. Aneuploidy — the presence of an abnormal number of chromosomes in a cell
4. Structural variation — large inversions and insertions, and translocations.
5. Matched normal contaminated with cancer DNA or circulating tumor DNA in blood normals
6. Sequencing errors
7. Alignment artifacts
Automated workflows can identify and filter many false positive variant calls. However, after an automated cancer workflow completes, there is often a need for manual review of the detected mutations to produce a high-quality list of cancer variants that can be used in treatment decisions. Typically, manual inspection involves examining aligned sequencing reads in the Integrative Genomics Viewer (IGV). Manual review incorporates information that is difficult to take into account in automated workflows, for example poor alignment, variant-supporting reads that also appear in the matched normal, errors at the ends of reads, preferential amplification, and other factors.
I am participating in the Fast.AI v3 in-person course this year, taught by Jeremy Howard and Rachel Thomas. And what better way to experiment and learn than to try solving a problem that one cares about and has some relevant domain knowledge in?
Here, I present a first look at the degree to which manual review of cancer variants can be automated for tumor-normal calling. The scenario addressed here is one where tumor and normal samples can be compared: an automated cancer (somatic) variant calling workflow has been run, and we would like to alleviate the burden of manual inspection by pre-classifying the detected variants as true or false.
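Screenshots for this kind of review do not have to be taken by hand. The sketch below writes an IGV batch script that loads a tumor and a normal BAM and saves one snapshot per variant. It is only an illustration: the file names, the genome build (hg19), the coordinates and the 50 bp flank are assumptions for the example, not details of the workflow used in this post.

```python
from typing import Iterable, Tuple

Variant = Tuple[str, int]  # (chromosome, 1-based position)


def igv_batch_script(tumor_bam: str, normal_bam: str,
                     variants: Iterable[Variant], out_dir: str,
                     genome: str = "hg19", flank: int = 50) -> str:
    """Build the text of an IGV batch script with one snapshot per variant.

    All argument values are supplied by the caller; the defaults for
    `genome` and `flank` are hypothetical choices for this sketch.
    """
    lines = [
        "new",                          # start a fresh IGV session
        f"genome {genome}",             # select the reference genome
        f"load {tumor_bam}",            # tumor alignments
        f"load {normal_bam}",           # matched normal alignments
        f"snapshotDirectory {out_dir}", # where the PNG files are written
    ]
    for chrom, pos in variants:
        start = max(1, pos - flank)
        end = pos + flank
        lines.append(f"goto {chrom}:{start}-{end}")
        lines.append(f"snapshot {chrom}_{pos}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"


# Hypothetical inputs, for illustration only
script = igv_batch_script("tumor.bam", "normal.bam",
                          [("chr17", 7577120)], "snapshots")
print(script)
```

The resulting text can be saved to a file and run from IGV's batch-script menu entry or passed to IGV on the command line.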
Here is an IGV snapshot of a true positive variant and a false positive variant:
I used data for 1,413 non-silent variants (variants affecting proteins) from a handful of tumor-normal pairs; 70% was used for training and 30% for testing, with stratified sampling. IGV snapshots, similar to the ones shown in the figure above, were generated. I labeled the data as true or false after looking at the snapshots (1,010 true and 403 false, so the initial false positive rate is 403/1,413, or about 28.5%).
Here are the 17 lines of code that train a deep learning network using transfer learning and achieve 93% accuracy. The classifier reduces the false positive rate to 4%, at the expense of calling 3% of true variants false.
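A stratified 70/30 split keeps the true/false ratio the same in the training and test sets. The following is a minimal sketch in plain Python using the class counts above; the function name `stratified_split`, the fixed seed and the label strings are assumptions for the example, and the split in this post may have been produced differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


# Class counts taken from the text: 1,010 true and 403 false variants
labels = ["true"] * 1010 + ["false"] * 403
train_idx, test_idx = stratified_split(labels)

n_false_test = sum(1 for i in test_idx if labels[i] == "false")
print(len(train_idx), len(test_idx), n_false_test)
```

Because each class is sampled on its own, the share of false variants in the test set stays close to the share in the full data set.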
from fastai import *
from fastai.vision import *
# Load the IGV snapshots from class-labeled folders; flip, rotation, zoom, lighting and warp augmentations are disabled
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(do_flip=False, max_rotate=None, max_zoom=1, max_lighting=None, max_warp=None, p_affine=0, p_lighting=0), size=512, bs=32)
learn = create_cnn(data, models.resnet34, metrics=accuracy)
# Learning rate finder
# First stage of fine-tuning: all layers of the ResNet34 except the last custom fully connected layers are frozen; the maximum learning rate was chosen with the learning rate finder (see Leslie Smith’s paper)
# Second stage: discriminative fine-tuning
interp = ClassificationInterpretation.from_learner(learn)
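The confusion matrix from the interpretation object gives the counts needed for the positive predictive value. As a small worked example, the sketch below applies PPV = TP/(TP+FP) to the test-set counts quoted in the closing thoughts (230 true positives, 13 false positives) and, for comparison, to the full labeled set before classification (1,010 true, 403 false). The helper name is made up for this example, and the two figures come from different subsets, so the comparison is indicative only.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of positive calls that are real: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no positive calls, PPV is undefined")
    return tp / (tp + fp)


# Calls kept as "true" by the classifier on the test set (counts from the text)
ppv_classifier = positive_predictive_value(tp=230, fp=13)

# All calls from the automated workflow, before any re-classification
ppv_workflow = positive_predictive_value(tp=1010, fp=403)

print(f"PPV after classification: {ppv_classifier:.1%}")   # 94.7%
print(f"PPV of the raw workflow:  {ppv_workflow:.1%}")     # 71.5%
```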
Some closing thoughts:
- I do not claim that this is the state of the art for this problem. As far as I can tell, there are no standard benchmarks for this very interesting problem that compare manual and automatic re-classification of tumor-normal variants.
- Furthermore, the performance here depends greatly on the initial set of automatic callers chosen, since different somatic callers are good or bad at detecting particular sequencing and biological artifacts. To make this classifier more general, one needs to use the union of as many somatic callers as possible.
- In this small example (< 1,500 snapshots) I show that we can match the expert at discerning whether a variant is real (the variants originate from an automatic tumor-normal workflow), with a positive predictive value PPV = TP/(TP+FP) × 100% = 230/(230+13) × 100% ≈ 94.7%.
- I keep coming back to the same conclusion over and over again, but it is worth repeating: the Fast.AI library allows state-of-the-art transfer learning and fine-tuning relevant to the biomedical domain. Thanks Jeremy, Rachel and Fast.AI!