Step 1: Genomes to Images

Leo
Convolutional Genomes
Dec 7, 2017

I was recently introduced to the topic of variant calling. At that point I thought that by now we were able to just read a person’s genome and go from there. However, the state of the art still only produces noisy data stitched together from little chunks of DNA, and the outcome very much depends on the choice of sequencing machines and algorithms. In practice there is no single way of doing this, leaving lots of room for experimentation, even more so if you want to apply neural nets.

Colors & Shapes

If you read the preface, you know LSTMs aren’t cutting it and we are going to take a computer vision approach. First we need to decide how to generate the images. We could just render the letters as Roman characters, but training on those would add an extra OCR challenge to the task. Instead, we can encode each letter as a colored pixel.

Let’s look at chromosome 20, position 92,973 (GRCh38). On a black background this results in some really nice ’90s artwork.

There are four letters (A, T, C, G) and three kinds of mutations (mismatch, insertion, deletion). If we just forget about the content of the insertions and deletions, we’re left with 6 colors: 4 mismatches, any insertion, and any deletion. To spread these six colors evenly, we can use the six permutations of [0, 0.5, 1] as RGB values, which gives a nicely balanced palette (a small sketch follows the list):

  • mismatched A: purple
  • mismatched T: pink
  • mismatched C: light green
  • mismatched G: orange
  • insertion: ugly color
  • deletion: light blue
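
Here’s a small sketch of how such a palette could be generated; the assignment of labels to specific colors below is my own guess at a reasonable mapping, not necessarily the one behind the list above:

```python
from itertools import permutations

# The six permutations of [0, 0.5, 1], read as RGB triples, spread six
# colors evenly over the channel values: one per mismatched base plus
# one each for insertions and deletions.
channels = (0.0, 0.5, 1.0)
palette = list(permutations(channels))  # 3! = 6 RGB triples

# Hypothetical label-to-color assignment, for illustration only.
labels = ["mismatch A", "mismatch T", "mismatch C", "mismatch G",
          "insertion", "deletion"]
colors = dict(zip(labels, palette))

for label, rgb in colors.items():
    print(f"{label:>10}: RGB {rgb}")
```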

Then there’s some additional information in the alignment data (the CIGAR string) that marks regions that were probably poorly sequenced, so-called soft clips. These would just add more noise to the image and are better left out for now.
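
For reference, soft clips can be read straight off a read’s CIGAR operations. A minimal sketch assuming the reads come from pysam (the helper name is mine):

```python
# CIGAR operation code 4 is a soft clip (the "S" in a CIGAR string).
BAM_CSOFT_CLIP = 4

def soft_clip_lengths(read):
    """Return (left, right) soft-clipped lengths of a pysam aligned read."""
    cigar = read.cigartuples or []
    left = cigar[0][1] if cigar and cigar[0][0] == BAM_CSOFT_CLIP else 0
    right = (cigar[-1][1]
             if len(cigar) > 1 and cigar[-1][0] == BAM_CSOFT_CLIP
             else 0)
    return left, right
```

Reads where these lengths are non-zero are the ones whose clipped ends we leave out of the image.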

To show what this looks like, we first need some sequence data. NIST has recently begun putting together resources to help us with standardized datasets [7]. A popular choice is NA12878 from Utah, daughter of NA12891 and NA12892.
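
Pulling the reads around our earlier position is then straightforward with pysam, assuming a local, indexed BAM of NA12878 aligned against GRCh38 (the file name and the window are placeholders):

```python
import pysam

# "NA12878.bam" is a placeholder path; contig naming ("chr20" vs "20")
# depends on the reference the reads were aligned to.
with pysam.AlignmentFile("NA12878.bam", "rb") as bam:
    reads = [
        read for read in bam.fetch("chr20", 92_873, 93_073)
        if not read.is_unmapped and not read.is_secondary
    ]

print(f"{len(reads)} reads overlap chr20:92,873-93,073")
```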

There’s a lot happening in our generated image above. There are the vertical lines, which are identical values shared across different reads. Sprinkled around are mismatches, often singular, but sometimes many within a single horizontal read. The horizontal black lines are the soft clips. The blue pixels in the center are deletions. The big pink block in the middle is caused by a long run of T’s, which usually throws off the sequencer, causing all of the subsequent soft clips.

Variant calling generally requires a reference genome that we can compare our sequence to. We could display the reference genome at the top like many genome browsers do, but that would require our model to learn to relate the top row of pixels to the rest of the image, which is quite a complicated task. Instead, we can just reduce the brightness of the pixels that are identical to the reference genome, highlighting all the interesting stuff.
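
A sketch of that dimming step with numpy, assuming the pileup has already been rasterized into a per-pixel base array alongside the RGB image (the function name and the dimming factor are illustrative choices):

```python
import numpy as np

def dim_reference_matches(image, read_bases, ref_bases, factor=0.25):
    """Darken pixels whose read base agrees with the reference.

    image:      (rows, cols, 3) float RGB values in [0, 1]
    read_bases: (rows, cols) array of base characters per pixel
    ref_bases:  (cols,) array with the reference base for each column
    """
    matches = read_bases == ref_bases[np.newaxis, :]  # broadcast per column
    out = image.copy()
    out[matches] *= factor  # mismatches keep their full brightness
    return out
```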

The next question is how to train the model to tell us whether there’s a variant in the image, similar to how CNNs are used to detect cats and hotdogs. That alone would still leave us wondering what the exact position of the variant is. Instead, we can generate an image for a specific position and train on the center column of that image. The model could probably learn to locate this center column on its own, but that might take extra training time, so we will just highlight the center.
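
One simple way to do that, sketched with numpy (the boost factor, and brightening rather than, say, adding a marker channel, are just illustrative choices):

```python
import numpy as np

def highlight_center(image, boost=1.5):
    """Brighten the column the label refers to.

    image: (rows, width, 3) array centered on the candidate position.
    """
    center = image.shape[1] // 2
    out = image.copy()
    out[:, center, :] = np.clip(out[:, center, :] * boost, 0.0, 1.0)
    return out
```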

To get a feel for what this looks like, here’s a short video at the start of chromosome 20:

In case you were wondering: a video of the full genome would take about a year to watch, assuming eight hours of sleep a day and no bathroom breaks.
