Matching Genetic Sequences Through the BLAST and Karp-Rabin Algorithm

Karthik Mittal
Published in Analytics Vidhya · 15 min read · Feb 22, 2021
Credit: New Scientist

Note: I’ll be referencing big O notation in this article quite a bit; it’s the standard notation for describing the time complexity of an algorithm. If you’re unsure what big O notation is, I highly suggest checking out the link here.

67 billion miles. That’s the length of all the uncoiled pieces of DNA inside of the human body. It’s crazy how something so simple, just four nitrogenous bases, can create such a complex and sophisticated organism.

Packed into those 67 billion miles are approximately 3 billion DNA base pairs per copy of the human genome, making up the genetic code that ultimately shapes our thoughts, our actions, our emotions, and more.

Given the complexity of the human body, it makes sense that scientists and mathematicians struggle to decipher this genetic code; there’s simply too much information to sift through.

There are more base pairs than fine grains of sand on a typical beach! Credit: NPR

One particular problem these researchers face is mapping specific strings of nucleotides onto the entire human genome, simply because of the underlying time complexity. Using conventional methods, a computer could take an impractically long time to sift through the genome for a particular section, since naive matching runs in O(n·m) time (effectively quadratic when the pattern is long). Newer solutions aim to sift through the genome in linear time, or O(n).

Currently, there are two families of solutions that sharply decrease the time complexity of searching the genome: linear-time exact string matching (matching the string exactly to a section of the genome) using the Karp-Rabin algorithm and hashing, and linear-time inexact string matching (allowing for a few nucleotide differences) using the BLAST algorithm.

In this article, we’ll be going over these different algorithms and discussing specifically how these algorithms aim to reduce the time complexity of conventional database search.

Linear-Time Exact String Matching

This section addresses the problem of matching exact nucleotide sequences against a genome of billions of nucleotides. It differs from inexact string matching, where we allow a few alterations to account for evolutionary change (more on this later).

Basic Solution

What we aim to do is find a pattern P of length m in a text T of length n. The simplest solution is to slide the pattern along the text and compare at every offset. As shown in the image below, we scroll the pattern through each possible length-m window of T until we find the desired match.

Starting at the first position, we compare the pattern against each window of length m until we reach the (n-m+1)th position. However, this approach is not linear: it requires up to m character comparisons at each of the n-m+1 offsets, for O(n·m) time overall.

Example of algorithmic interpretation of basic Karp-Rabin solution. Credit: Manolis Kellis
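As a concrete reference point, the basic sliding-window approach can be sketched in a few lines of Python (the function name is mine, not from the article):

```python
def naive_match(pattern: str, text: str) -> list:
    """Slide the pattern across the text one position at a time,
    comparing character by character at each of the n-m+1 offsets."""
    m, n = len(pattern), len(text)
    hits = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits

# naive_match("ATC", "GATCATC") finds offsets 1 and 4
```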

Modulo Operations

The computation of y, the numeric value of the current window, can lead to undesirable time complexities as the pattern length m increases, since recomputing it from scratch at every offset costs O(m) per step. A mathematical trick lets us compute each new y in constant time, making the whole scan linear rather than quadratic.

If you look carefully, you can see that the next number is simply the previous number with its left-most digit removed, shifted over by one, plus the incoming digit.

We can calculate the current y from the previous y using three operations (walking through the example of 23590 becoming 35902):

  • Removing the left-most digit using the modulus operator (e.g. 23590 mod 10000 = 3590 → can be characterized as y mod 10^(m-1), where m is the window length)
  • Shifting all the digits to the left (e.g. 3590 becomes 35900 → can be characterized as y × 10)
  • Adding the new digit on the right (e.g. 35900 + 2 = 35902 → can be characterized as y + d, where d is the next digit in the sequence).
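The three steps above can be sketched as a single constant-time update in Python (function name is illustrative):

```python
def next_window(y: int, new_digit: int, m: int) -> int:
    """Update the numeric value of a sliding window of m digits:
    drop the leading digit, shift left, append the new digit."""
    y = y % 10 ** (m - 1)   # remove the left-most digit
    y = y * 10              # shift the remaining digits left
    return y + new_digit    # append the incoming digit

# next_window(23590, 2, 5) -> 35902
```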

Hashing — The Karp-Rabin Algorithm

The second issue arises when we notice that the sequences being compared will often be much longer than five digits. When comparing strings with lengths in the millions, direct numeric comparison becomes inefficient.

To make the comparison efficient, we can use hashing, a method of converting our strings of length m into much shorter fingerprints (e.g. reducing strings a million characters long to numbers of fewer than ten digits). To perform this hashing in the Karp-Rabin algorithm, we do all of our computations modulo p, where p is a chosen constant small enough that each comparison can be done in constant time.

Note: In order for hashing to work, the text needs to be converted into a numerical format. This can easily be done with nucleotides by assigning adenine zero, cytosine one, guanine two, and thymine three.

Example of hashing. For these problems, we’ll be using modulo operators. Credit: Wikipedia

For example, let’s set p = 13 and hash the number 14152. Since 14152 mod 13 equals 8 (the modulus operator simply computes the remainder of a division), we have effectively converted a string of length 5 into a single digit.
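Combining the nucleotide-to-number mapping from the note above with this modular hashing gives a tiny sketch like the following (function names are mine):

```python
# map each nucleotide to a base-4 digit, per the note above
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq: str) -> int:
    """Treat A, C, G, T as the digits 0-3 of a base-4 number."""
    value = 0
    for nt in seq:
        value = value * 4 + BASE[nt]
    return value

def h(seq: str, p: int = 13) -> int:
    """Hash a sequence by taking its numeric encoding modulo p."""
    return encode(seq) % p

# encode("ACGT") -> 27, so h("ACGT", 13) -> 1
```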

This form of hashing is extremely powerful because the sequences we compare will often be thousands, even millions, of nucleotides long. With hashing, we reduce each comparison to numbers of just one or two digits.

The modulus p effectively constrains any sequence to a value in the range [0, p−1], regardless of the particular alignment. Each comparison then takes constant time, making the overall scan linear rather than quadratic.

Example of modulo operators to find significant matches. 26 mod 11 is 4. Credit: Javatpoint

As shown in the picture above, if the hash of the pattern equals the hash of the particular string that the algorithm is checking, then they are considered to be equal to one another. This greatly reduces the comparison time as we are still performing the comparison, just with numbers of much smaller length.

There are two main characteristics that are instrumental for hashing different sequences to make their length smaller:

  • Reproducibility: If two numbers are the same, then their hashes will also be the same.
  • Uniform output distribution: Regardless of the input distribution, the output distribution should be close to uniform, meaning each possible hash value is equally likely.

The determinism of arithmetic operations covers the reproducibility requirement; and because the modulo operator cycles evenly through its range of values, it also gives an approximately uniform output distribution.

Example of spurious hits (not being equal to the query string). Credit: Javatpoint

Spurious hits, windows whose hash matches the pattern’s hash even though the underlying strings differ, are bound to occur with any hash function. For example, 14 mod 13 and 27 mod 13 both produce the hash 1 even though 14 and 27 are different values; left unchecked, this leads to false matches.

In the picture above, you can see this happen: our pattern of twenty-six and the checked window of fifteen produce the same hashed value of four even though they are different values. These false positives would undermine the correctness of our algorithm if reported as matches.

To circumvent this issue, whenever the hashes match, simply compare the original unhashed window against the unhashed pattern character by character. Although this verification step is slower, it only runs on the small set of candidate windows rather than the entire text.

This altered Karp-Rabin algorithm runs in expected linear time, since the probability of spurious hits stays small throughout the sequence provided the value p is chosen carefully (a sufficiently large prime works well).

Finalized Karp-Rabin algorithm with minimized time complexity.
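Putting the rolling hash, the modulus p, and the spurious-hit verification together, a minimal Python sketch of the full algorithm (function and variable names are illustrative) might look like:

```python
def rabin_karp(pattern: str, text: str, p: int = 101) -> list:
    """Rolling-hash search over a nucleotide text: compare window
    hashes mod p, then verify candidates character by character
    to rule out spurious hits."""
    base = {"A": 0, "C": 1, "G": 2, "T": 3}
    m, n = len(pattern), len(text)
    if m > n:
        return []
    high = pow(4, m - 1, p)  # weight of the left-most digit, mod p
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * 4 + base[pattern[i]]) % p
        w_hash = (w_hash * 4 + base[text[i]]) % p
    hits = []
    for i in range(n - m + 1):
        # only do the slow character check when the cheap hash check passes
        if w_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # roll the window: drop text[i], append text[i + m]
            w_hash = ((w_hash - base[text[i]] * high) * 4
                      + base[text[i + m]]) % p
    return hits
```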

Educated String Matching

Along with this finalized Karp-Rabin algorithm, educated string matching can be used to reduce the time complexity to O(n) time.

As stated before, the original solution consisted of comparing a pattern of length m character by character against every window of the text and cycling through these windows inefficiently. This produced an algorithm with a time complexity of O(m · n), where m is the length of the pattern and n is the length of the text.

Naive string matching algorithm for database searching.

A slight improvement is to stop each comparison as soon as a mismatch is found; however, in the worst case (e.g. a pattern that nearly matches at every position), this still leads to a slow O(m · n) time complexity.

Using a similar idea to the modulo trick above, we can reuse the information gained from the last comparison in the current one. This lets us make much larger shifts along the target sequence, reducing the number of comparisons and yielding a linear O(n) runtime.

Educated string matching algorithm.

As shown in the image above, correctly classifying the mismatch against the target string allowed a skip of three positions, since we already knew the next three alignments would fail. Because the second, third, and fourth letters are ‘b’, ‘c’, and ‘d’ respectively, none of them can match ‘a’, so each of those alignments is guaranteed to be a mismatch.

Educated string matching can be used hand-in-hand with the Karp-Rabin algorithm to produce effective database query searches and cause fast computations resulting in linear time complexity.
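The article doesn’t name a specific educated-matching algorithm, but the idea of reusing information from earlier comparisons is exactly what the Knuth-Morris-Pratt (KMP) algorithm formalizes; a minimal sketch:

```python
def kmp_search(pattern: str, text: str) -> list:
    """On a mismatch, consult a precomputed prefix table instead of
    restarting, so each text character is examined only O(1) times
    on average and the scan is O(n) overall."""
    # prefix[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it
    prefix = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = prefix[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        prefix[i] = k
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = prefix[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = prefix[k - 1]
    return hits
```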

Linear-Time Inexact String Matching

In inexact string matching, we will primarily focus on the problem of finding similar genetic sequences that only differ by a few nucleotides. This is particularly useful for finding evolutionary sequences in the genome.

Three types of mutations for evolutionary processes. Credit: University of Leicester

When going through evolutionary processes, it makes sense that a few nucleotides in the human genome will change whether it be through insertion (adding a nucleotide), deletion (removing a nucleotide), or substitution (changing a nucleotide letter to one of the other three).

The BLAST algorithm accounts for these processes as it finds relatively inexact strings throughout the human genome.

Some Considerations

The BLAST algorithm exploits the property that most of the database sequences it checks will be entirely unrelated to the query sequence.

Therefore, if you are looking for matches of length 1,000 and reject anything less than 95% identical, then at most 50 positions may mismatch, which forces some stretch of at least roughly 19 consecutive nucleotides to match exactly; you can simply skip any sequence that contains no such exact stretch, since it cannot be a hit.
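As a sanity check on those numbers, a quick pigeonhole calculation (the helper name is mine): at most 50 mismatches split the 950 matching positions into at most 51 runs, so the longest run has length at least ⌈950/51⌉ = 19.

```python
import math

def guaranteed_exact_run(length: int, min_identity: float) -> int:
    """At most (1 - min_identity) * length mismatches split the
    matching positions into (mismatches + 1) runs, so by the
    Pigeonhole Principle some run is at least this long."""
    mismatches = math.floor((1 - min_identity) * length)
    matches = length - mismatches
    return math.ceil(matches / (mismatches + 1))

# guaranteed_exact_run(1000, 0.95) -> 19
```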

Visual representation of Pigeonhole Principle. Credit: Medium

This is an extremely important property, as it allows us to skip sequences that have no meaningful relation to the query. The argument is an application of the well-known Pigeonhole Principle, which states that if n+1 pigeons are placed into n holes, at least one hole must contain more than one pigeon. The Pigeonhole Principle has many stunning applications in mathematics, so if you’re interested in reading more, check this link.

A depiction of which regions have been conserved through evolution in different species. Credit: Wikipedia

The BLAST algorithm takes into account that there will be stretches of similarity (conserved regions), mainly in regions of functional DNA, so it doesn’t need to check the entire string, just these clusters. Conserved regions are regions that remain largely unchanged (accumulating few or no mutations) during the process of evolution.

Along with this, the BLAST algorithm preprocesses the data (using hashing and other techniques discussed later in this article) so that a pattern of length n can be found in O(n) time.

BLAST Algorithm Process

The BLAST algorithm splits the query/pattern of length n into overlapping words of length W called W-mers. For example, if the query (e.g. ATCGCGATA) has length n=9 and W=3, then there are n−W+1 = 7 overlapping 3-mers: ATC, TCG, CGC, GCG, CGA, GAT, and ATA.

After splitting the query into these W-mers, the algorithm finds a neighborhood of similar words for each one. For instance, for the 3-mer ACG, the neighborhood would include words like ATG, ACA, and TCG, since they barely differ from the original W-mer. Only words whose similarity score against the original W-mer stays above a pre-defined threshold T are kept in the neighborhood.
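A toy version of the W-mer split and neighborhood generation, using Hamming distance as a stand-in for the score threshold T (real BLAST scores neighbors with a substitution matrix; function names are mine):

```python
from itertools import product

def w_mers(query: str, w: int) -> list:
    """All overlapping substrings of length w (there are len-w+1)."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word: str, max_mismatches: int = 1) -> set:
    """All same-length words within the given Hamming distance,
    a crude stand-in for 'score above threshold T'."""
    hood = set()
    for cand in map("".join, product("ACGT", repeat=len(word))):
        if sum(a != b for a, b in zip(word, cand)) <= max_mismatches:
            hood.add(cand)
    return hood

# w_mers("ATCGCGATA", 3) yields 7 words; neighborhood("ACG") has
# the word itself plus 3 substitutions at each of 3 positions = 10
```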

An example of putting the original sequences in the hash table. Credit: Rice University

These substrings are subsequently stored inside a hash table, a W-mer database holding both the original words and all their neighborhood words, ensuring the neighbors stay roughly similar to the original pattern. Each of these strings gets its own hash (using the hashing method shown above), and the resulting initial matches against the database are often called seeds. These hashes use a similar process to what we saw in the Karp-Rabin algorithm.

To find alignments, we simply extend the seeds until the alignment score (a score measuring how similar the growing alignment is to the database string) drops below a certain threshold X. We’ll discuss how that alignment score is calculated shortly. Note that X and T are different: X governs the similarity of the extended alignment as a whole, while T governs the similarity between an original W-mer and its neighboring W-mers.

W, T, and X are hyperparameters, meaning they can be tuned to fit the model better. For example, increasing W can lead to fewer spurious hits, but also to a larger space complexity due to a bigger hash table, and to fewer genuine hits than before. Make sure to weigh these tradeoffs before increasing or decreasing them.

Workflow of the BLAST algorithm.
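A minimal sketch of the extension step with the X drop-off rule (rightward extension only; parameter names and scoring values here are illustrative, not BLAST’s defaults):

```python
def extend_seed(query: str, target: str, q0: int, t0: int, w: int,
                x_dropoff: int = 3, match: int = 1,
                mismatch: int = -1) -> int:
    """Extend a seed of length w at query[q0]/target[t0] to the
    right, stopping once the running score falls more than
    x_dropoff below the best score seen so far (the X-drop rule)."""
    score = best = w * match        # the seed itself scores w matches
    qi, ti = q0 + w, t0 + w
    while qi < len(query) and ti < len(target):
        score += match if query[qi] == target[ti] else mismatch
        if best - score > x_dropoff:
            break                   # score dropped too far below best
        best = max(best, score)
        qi += 1
        ti += 1
    return best
```

A real implementation would also extend leftward and record the alignment endpoints, not just the best score.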

Extensions of the BLAST algorithm

Although the BLAST algorithm is one of the most renowned and widely cited algorithms in the field of computational biology, there are some modifications that can be made to increase the power of the model and decrease the number of spurious hits.

Filtering is often used by scientists who want to reduce the size of their query. If the query has low-complexity regions, i.e. regions dominated by the same few nucleotides (e.g. repeats of GC, A, etc.), and the database has long stretches of the same sets of nucleotides, there will be many unnecessary hits that flood the algorithm. By simply filtering out these low-complexity regions, we can suppress these hits and only generate meaningful ones.

On the other hand, to produce more genuine hits (not spurious ones), we can use two-hit BLAST, where we look for two small W-mer hits rather than one long W-mer hit. Splitting the W-mer in two lets us find smaller regions of similarity, giving higher sensitivity, while requiring both hits before extending keeps spurious extensions in check.

Lastly, combs can be particularly useful for rooting out the informative nucleotide positions in the original database string. Combs work by masking out the nucleotides we don’t care about and focusing only on the ones we do. For example, we might neglect the third nucleotide in each codon triplet (since it often doesn’t affect which amino acid is encoded). We can represent this as a bit sequence of 110110…, where 1 keeps the nucleotide and 0 hides it.

These combs can also be chosen randomly, with each position’s probability of being hidden drawn from a probability matrix. This method is often called Random Projection, and it greatly increases the probability of finding a match in the database.
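Applying a comb is just masking positions; a tiny sketch, assuming the comb pattern repeats across the whole sequence (function name is mine):

```python
def apply_comb(seq: str, comb: str) -> str:
    """Keep the nucleotides under a '1' in the comb and drop those
    under a '0'; the comb repeats to cover the whole sequence."""
    return "".join(nt for i, nt in enumerate(seq)
                   if comb[i % len(comb)] == "1")

# apply_comb("ATCGCGATA", "110") drops every third nucleotide
```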

Sequence Alignment Probability

The BLAST algorithm uses a scoring matrix in order to determine whether a collection of seeds is roughly similar to the target pattern or the pattern presented on the database. This allows us to evaluate certain mismatches in the alignment based on a factual matrix.

The scoring matrix is built by weighing two competing hypotheses about whether the sequences are homologous, meaning that they share a common ancestry:

  • The alignment is due to chance, and the sequences are unrelated.
  • The alignment is due to a common ancestry, and they are related.

To build the particular scoring matrix, we’ll be looking at the ratios between the probabilities of these two hypotheses. There are two main scoring matrices that are used: nucleotide scoring matrices and amino acid scoring matrices. Although amino acid scoring matrices are often used more for the BLAST algorithm, understanding nucleotide scoring matrices will give us the necessary general intuition behind computing the score.

Understanding Nucleotide Scoring Matrices

Example of the scoring matrix for nucleotides.

As shown above, this nucleotide scoring matrix is defined by the interactions between the different nucleotides and how probable each mismatch is. For example, +1 is given to nucleotides that remain the same throughout evolution, meaning they don’t change.

It is also apparent that conversions between As and Gs are penalized less than conversions between As and Ts; similar patterns hold among the Ts, Cs, and Gs. Why is this the case? It comes down to the evolutionary likelihood that one nucleotide base will mutate into another.

It makes sense that A would mutate into G, since both are in the same family of purines, more readily than into a T or C, which belong to the family of pyrimidines. It is easier to mutate within a family than to cross between families, so A↔G is given a less negative score than the others.

Note that a more negative number denotes a lower likelihood that the mutation occurs. Since a mutation from A to G is much more common than a mutation from A to T, A↔G receives the milder (less negative) penalty.

Amino Acid Scoring Matrices

General table for converting three-nucleotide pairings to amino acids. Credit: Nature

Amino acid interaction/mutation probabilities are much harder to quantify than nucleotide interactions; therefore, we need a different approach than the simple intuition behind the nucleotide substitution table. Remember that the BLAST algorithm mainly uses amino acid scoring matrices, so having an accurate matrix matters.

By using mathematical operations and understanding probabilistic aspects using the two hypotheses defined above, we can determine this substitution matrix by computing the score using known probabilities.

If you are interested in the mathematical reasoning behind this, I highly suggest you check out this link, but I will leave out the full derivation for the purposes of this article. The final substitution matrix score for a pair of amino acids a and b can be given by:
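In its standard log-odds form (as used by matrices like BLOSUM), the score is:

```latex
S(a, b) = \log \frac{\Pr(a, b \mid \text{common ancestry})}{\Pr(a)\,\Pr(b)}
```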

The numerator is the likelihood that evolutionary processes substituted a for b (the common-ancestry hypothesis), while the denominator is the likelihood of observing amino acids a and b independently by chance.

Using the substitution score formula given above, we can make substitution matrices for amino acids, allowing for a much more refined amino acid scoring matrix, similar to the one shown for nucleotides.

An example of an amino acid substitution matrix (BLOSUM62 matrix).

Applications

Computer applications in understanding DNA are immense! Credit: AIChE

Aligning genetic sequences can have considerable applications in the field of healthcare. Not only can this be used to understand evolutionary processes in humans and other species, but it can also be used to match specific known pieces of DNA to pieces found inside of the human genome.

Understanding this alignment can pave the way for a more enhanced knowledge of the human genome and allow for more personalized strategies as well as potential cures for certain mutations.

Sequence alignment, and its overarching field of computational biology, is a fast-evolving discipline, and I’m excited to see how the BLAST and Karp-Rabin algorithms can be tailored to decrease time complexity even further. Let me know your thoughts in the comments below!

TL;DR

  • Sequence alignment can occur through exact or inexact string matching with the Karp-Rabin algorithm and BLAST algorithm respectively.
  • The Karp-Rabin algorithm aims to use skillful modulo and arithmetic operations as well as hashing functions to decrease the time complexity of matching exact nucleotide sequences to the human genome.
  • The BLAST algorithm splits the query into overlapping words called W-mers, builds similar words according to that W-mer, and uses similar hashing functions to reduce the time complexity of matching more inexact sequences and patterns to understand evolutionary processes.
  • There are multiple extensions to the BLAST algorithm like filtering, two-hit BLAST, and combing that fix niche problems in the algorithmic implementation of the BLAST algorithm.
  • The BLAST algorithm uses a scoring matrix to match the alignment of two amino acids/nucleotides to reflect whether the two sequences are homologous (whether they share a common ancestry).

Additional Resources

Hi! I am a 16 year old currently interested in the fields of machine learning and biotechnology. If you are interested in seeing more of my content and what I publish, consider subscribing to my newsletter! Check out my January newsletter here! Also, check out my LinkedIn and Github pages. If you’re interested about personal mindsets or just stuff in general, sign up for a chat using my Calendly.
