The Walk of Life
Authored by Benjamin Lee of Lab41, an IQT Lab
Executive summary: Lab41 has developed a novel method for representing DNA sequences in a compact way. In addition to enabling better visualization of DNA sequences, our approach to featurization could contribute to more powerful models in medical genetics, drug discovery, species identification, and industrial genetic synthesis. The method, which we’re calling Squiggle, represents sequences as a series of vectors in two dimensions and is ideal for learning algorithms that use a distance metric to assess sequence similarity, a common approach to a wide variety of problems.
In the course of our genomics research at Lab41, we often look at raw DNA sequences. Unfortunately, DNA sequences don’t come to us looking like this:
Rather, they come to us looking more like this:
In other words, basically gibberish. To the naked eye, the sequences appear to be identical.
There has to be a better way to visualize these complex sequences than just looking at the raw text.
Enter Squiggle, Lab41’s new DNA visualization algorithm. It turns the same sequences above into clean, unique two-dimensional graphs:
Let’s break down what it is that we’re seeing to get a sense for why this is such a powerful way to look at DNA sequences.
Using Squiggle, we represent each letter as its own distinct shape:
We then connect those shapes tip-to-tail to give each sequence of As, Ts, Gs, and Cs its own distinctive shape, which in mathematical parlance is called a two-dimensional walk.
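To make the tip-to-tail idea concrete, here is a minimal Python sketch of a two-dimensional walk. The per-letter shapes in SHAPES are hypothetical stand-ins chosen only so that each letter looks different and spans one unit of x; they are an assumption for illustration, not the shapes Squiggle itself draws, and the walk function is not part of the Squiggle library.

```python
# Hypothetical per-letter shapes: each letter spans one unit of x and is drawn
# as two half-unit segments. Each pair gives the change in y over the first and
# second half. These values are illustrative assumptions, not Squiggle's own.
SHAPES = {
    "G": (1.0, 0.0),   # rise, then flat (net +1)
    "C": (0.0, 1.0),   # flat, then rise (net +1)
    "A": (-1.0, 0.0),  # fall, then flat (net -1)
    "T": (0.0, -1.0),  # flat, then fall (net -1)
}


def walk(sequence):
    """Return (xs, ys) for the tip-to-tail walk of a DNA sequence."""
    xs, ys = [0.0], [0.0]
    for i, letter in enumerate(sequence.upper()):
        # Letters outside A, T, G, C are drawn as a flat segment.
        first, second = SHAPES.get(letter, (0.0, 0.0))
        # Each shape starts where the previous one ended (tip-to-tail).
        xs.extend([i + 0.5, i + 1.0])
        ys.extend([ys[-1] + first, ys[-1] + first + second])
    return xs, ys


xs, ys = walk("GATC")
print(list(zip(xs, ys)))
```

Because every letter advances x by exactly one unit, the point at x = n marks the end of the nth letter, whatever shapes are chosen.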
Take, for example, the x-coordinate. In this scheme, the x-coordinate corresponds directly to the xᵗʰ letter of the DNA sequence. So, when we see the graphs of two DNA sequences start to diverge around position x=260 in Figure 1, we can tell that the sequences start to differ more around letter 260.
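Since x lines up with letter position, spotting where two graphs part ways can be sketched in a few lines of Python. This assumes each walk has already been reduced to one y-value per letter, so that a list index is a letter position. The first_divergence helper and its threshold parameter are hypothetical names for illustration, not part of Squiggle.

```python
def first_divergence(ys_a, ys_b, threshold=0.0):
    """Return the first position where two walks differ by more than threshold.

    ys_a and ys_b hold one y-value per letter, so the returned index is a
    letter position. Returns None if the walks never differ by that much
    over their shared length.
    """
    for x, (a, b) in enumerate(zip(ys_a, ys_b)):
        if abs(a - b) > threshold:
            return x
    return None


# Two made-up walks that agree for two letters and then drift apart.
print(first_divergence([0, 1, 2, 3], [0, 1, 1, 0]))
```

Raising the threshold ignores small wiggles and reports only where the graphs separate by a visible amount.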
But wait, there’s more! The ratio of Gs and Cs to As and Ts is an important feature of a DNA sequence. Because Gs and Cs have a net positive effect on the y-coordinate of the sequence, and As and Ts have a net negative effect, you can tell whether a sequence has more Gs and Cs or more As and Ts from whether the graph ends above or below the x-axis (that is, whether its final y-value is positive or negative). Furthermore, variations in the ratio within a sequence can be seen as peaks and valleys.
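Here is a small Python sketch of that bookkeeping. It assumes, purely for illustration, that every G or C moves the graph up by exactly one unit and every A or T moves it down by one; the actual step sizes in Squiggle may differ. The running_balance name is hypothetical.

```python
from itertools import accumulate


def running_balance(sequence):
    """Return the height of the graph after each letter.

    Assumed step sizes (illustrative only): +1 for G or C, -1 for A or T,
    and 0 for any other character.
    """
    steps = [
        1 if base in "GC" else -1 if base in "AT" else 0
        for base in sequence.upper()
    ]
    return list(accumulate(steps))


heights = running_balance("GGCCGATATATT")
print(heights)      # rises through the GC-rich start, falls through the AT-rich end
print(heights[-1])  # sign shows which pair of letters dominates overall
```

Under these assumptions the final value equals the number of Gs and Cs minus the number of As and Ts, and a local peak marks the end of a GC-rich stretch.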
Last but not least, consider how the sequences’ graphs relate to one another. Note that the blue and red lines (human and chimpanzee, respectively) are very close to each other, with the rhesus in green and the rat in orange a bit further away.
It turns out that this relationship exactly matches the evolutionary relationship between the species.
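One way to put a number on "close" is to measure the average vertical gap between two graphs. The Python sketch below is a simple stand-in for a distance metric, assuming each walk is given as one y-value per letter. It is not the method used to produce the figure, and walk_distance is a hypothetical helper, not a Squiggle function.

```python
def walk_distance(ys_a, ys_b):
    """Mean absolute vertical gap between two walks over their shared length."""
    n = min(len(ys_a), len(ys_b))
    if n == 0:
        raise ValueError("both walks need at least one point")
    # zip stops at the shorter walk, so only the shared positions are compared.
    return sum(abs(a - b) for a, b in zip(ys_a, ys_b)) / n


# Made-up walks: the second hugs the first, the third wanders off.
reference = [0, 1, 2, 3, 4]
near = [0, 1, 2, 2, 3]
far = [0, -1, -2, -3, -4]
print(walk_distance(reference, near), walk_distance(reference, far))
```

A smaller value means the two graphs sit closer together. Whether such a number tracks evolutionary distance for a given set of sequences would need to be checked against real data.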
So, to sum things up, the Squiggle algorithm allows you to quickly visualize DNA sequences’ relationships to each other, providing a snapshot of their similarities (and differences), and may prove useful for inferring their evolutionary relationships, all at a glance. There’s just one thing missing: an implementation.
One of the recurring problems in the two-dimensional DNA sequence visualization literature is a lack of open-source implementations. Lab41 is committed to creating open-source software, so we made a Python library implementing the algorithm (as well as some of the other visualization algorithms, just to be safe) and made a snazzy command line interface allowing for quick visual inspection of files containing DNA sequences.
In the future, we’d like to make a web server version (let us know if you’re interested in collaborating!) to make it even more accessible. In the meantime, we’re eager to see how people use Squiggle, from research to art.
Lee, BD. (2018). Squiggle: a user-friendly two-dimensional DNA sequence visualization tool. Bioinformatics. DOI: 10.1093/bioinformatics/bty807