Fixing typos the evolutionary way
How DNA sequences taught us to transcribe parking signs
When I was in school, I didn’t take any computational biology classes. “I don’t want to work for a biotech company,” I said. “What will this ever do for me?” I still don’t work in biotech, but imagine my surprise when some comp. bio popped up in the most unlikely of places: parking sign transcription.
You see, to code the curb, we first had to figure out what all the parking signs said. And to do a good job of this, we used tools that were created to understand DNA sequences.
What does DNA have to do with parking signs? Read on to find out!
What does the parking sign say?
As we survey a curb, we take pictures of parking signs, like the one below. The next step is to turn it into text that we can interpret. While one day we want to do this using computer vision, at the moment we rely on Amazon Mechanical Turk.
Our workers use a web interface to enter transcriptions of parking signs like the one above. But transcribing parking signs is a hard job: they can be complicated, and the images can be blurry or otherwise hard to read. So we expect — and see — typos and other missing information.
Getting the input data
We always send every parking sign to at least two workers. This way, we can always compare transcriptions with each other as a basic check on correctness.
The first step when we get a transcription is to clean it up. We upper-case the whole thing, normalize punctuation, remove line breaks, and make some other obvious fixes. If, after cleanup, the two workers agree exactly about what the sign says, we’re done! If not, we send it to a third worker, and if we can’t find an agreement with three workers, we send it to two more.
But with three or five workers, there are cases when every single transcriber returns a slightly different answer. For the sign above, here’s what we got:
In this situation, the transcriptions are close enough that we should be able to match them to something that’s pretty accurate. But how?
A very common metric in text processing is called edit distance. This is the smallest number of changes you have to make to one piece of text to turn it into another. For instance, turning CAT into GATE requires two edits: replacing a “C” with a “G”, and adding an “E” at the end. Finding the edit distance also tells you what parts are the same between the two pieces of text (in this case, the “AT”).
But what happens when you have more than two pieces of text? We can generalize edit distance by asking, what are the fewest edits we can make that make all of the texts identical? For example, say we started with CAFE, GAFFE, and CAFFEINE. We’d have to edit five characters total (removing INE from CAFFEINE, changing the G to a C in GAFFE, and adding an F to CAFE). This also lets you build an alignment matching these inputs to each other:
If we can make an alignment like this for our parking sign transcriptions, we could find the best match. Here’s how: every transcription gets one “vote” about what the sign says at each position, and we take the majority vote. (If there isn’t a majority, we give up and say that we can’t find the correct answer). If the majority of signs don’t have any character at a given position (like for
INE above), we just delete those characters.
There’s just one problem with this algorithm: while aligning two sequences is easy, finding the best sequence alignment for more than two sequences is very, very hard! In fact, it’s NP-hard, which means that if we knew how to do it efficiently, we’d be able to solve other very important computer-science problems efficiently too (like the Traveling Salesman Problem and integer factorization).
Bio to the rescue
It turns out we’re not the only ones with this problem. Biologists really want to find mutations in sequences of DNA. And it turns out they do this by aligning lots and lots of DNA sequences and looking for which bits are different. Since DNA sequences are just strings of characters, this is exactly the same problem we have! As it turns out, when I was looking at this problem I had the good fortune of running into Roger Craig, who taught me about the methods that bioinformaticians use to align DNA and protein sequences.
So how do they do it? Remember that aligning just two sequences is easy. So if you can align two sequences with each other, then align the result of that with a third sequence, and so forth, you could get an overall alignment without too much trouble.
The problem is, it matters which order you do the alignment in. You get much better results if you align similar sequences with each other first before going to more different ones. Going in the right order will get you a tree of comparisons that looks something like this:
This is called a “guide tree” since it guides your path of sequence alignments. Biologists like having this tree not only for creating the alignment, but also to help figure out the evolutionary relationships between the different DNA sequences!
There are different methods for constructing guide trees, most of which start with first finding all of the pairwise edit distances between sequences and starting with the ones that are closer together. We use one called neighbor joining.
The alignment produced by following the guide tree is probably not optimal, but in practice it turns out to be pretty good. In biology, there are methods that use heuristics to improve on this initial result, but we have found the result is good enough by stopping right here.
Alignment in action
Let’s go back to our example transcriptions above:
Using our guide-tree method, we’d start by aligning sequences 1 and 2, which are the most similar with each other:
Then we’d do sequences 3 and 5:
Then we’d align sequence 4 with sequences 1 and 2. Note that the relative alignment of sequences 1 and 2 is fixed, so we can align these sequences quickly.
Finally, we’d align sequences 1, 2, and 4 with sequences 3 and 5. Similarly to above, the relative alignment of each of these sets of sequences is fixed.
And so we’d come up with a result of:
Pretty good! We did miss “except as noted below”, though, since it’s only in two out of five input transcriptions.
When we started collecting curb data, we never thought it would lead us to evolutionary biology. We hope that this story shows how even a simple problem like fixing typos in parking signs can become something very deep! It’s always exciting when work from seemingly unrelated fields comes in handy, and in computer science, this happens all the time.
What are problems that you’ve found solutions for from other industries? Let us know on Twitter @coordcity @jacobbaskin.