Rabin-Karp Algorithm Using Polynomial Hashing and Modular Arithmetic
Introduction
Created by Richard M. Karp and Michael O. Rabin, the Rabin-Karp algorithm or Karp-Rabin algorithm is a string-searching algorithm that utilises hashing to find matches between a given search pattern and a text.
A naive string-searching algorithm would compare the given pattern against all positions in the text. This would result in a less than ideal runtime complexity of O(nm) where n = length of text and m = length of pattern.
Rabin-Karp improves upon this by utilising the fact that comparing the hashes of two strings takes constant time, which is far more efficient than comparing those strings character by character. This gives it an average and best case runtime complexity of O(n+m).
Rabin-Karp Algorithm
- Compute the hash of the string pattern.
- Compute the substring hash of the string text starting from index 0 to m-1.
- Compare the substring hash of text with the hash of pattern.
- If they are a match, then compare the individual characters to ensure the two strings are an exact match.
- If they are not a match, then slide the substring window by incrementing the index, compute the hash of the next m characters, and repeat the comparison from step 3 until all n characters have been traversed.
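The steps above can be sketched in JavaScript as follows. This is a minimal illustration rather than a reference implementation: the constants `BASE` and `MOD`, the function name `rabinKarp`, and the choice to return -1 for an empty or overlong pattern are all assumptions made for the example. It already uses the rolling update and the modulus that the later sections explain.

```javascript
// Minimal Rabin-Karp sketch. BASE and MOD are illustrative assumptions.
const BASE = 256;
const MOD = 1000000007;

// Returns the index of the first match of pattern in text, or -1 if there is none.
function rabinKarp(text, pattern) {
  const n = text.length;
  const m = pattern.length;
  if (m === 0 || m > n) return -1;

  // BASE^(m-1) mod MOD, the weight of the window's leading character.
  let high = 1;
  for (let i = 1; i < m; i++) high = (high * BASE) % MOD;

  // Hash the pattern and the first window of the text.
  let patternHash = 0;
  let windowHash = 0;
  for (let i = 0; i < m; i++) {
    patternHash = (patternHash * BASE + pattern.charCodeAt(i)) % MOD;
    windowHash = (windowHash * BASE + text.charCodeAt(i)) % MOD;
  }

  for (let start = 0; start + m <= n; start++) {
    // Compare hashes first, then confirm with a character comparison.
    if (windowHash === patternHash && text.startsWith(pattern, start)) {
      return start;
    }
    // Slide the window: drop the leading character, append the next one.
    if (start + m < n) {
      const leading = (text.charCodeAt(start) * high) % MOD;
      windowHash = ((windowHash + MOD - leading) * BASE + text.charCodeAt(start + m)) % MOD;
    }
  }
  return -1;
}
```

For example, `rabinKarp('2135', '135')` returns 1, the index at which the pattern starts.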
Hash Function
First of all, the algorithm is only as good as its hash function. If a hash function which results in many false positives is used, then character comparisons will be done far too often to deem this method any more performant than a naive approach.
Secondly, you might have noticed that a new hash is computed each time the substring window traverses through the text. This is highly inefficient as it results in the same performance (if not worse) as a naive approach.
Both these problems can be solved using polynomial hashing with additions and multiplications. Although this is not a Rabin-fingerprint, it works equally well.
Polynomial Rolling Hash
We can compute the polynomial hash with multiplications and additions as shown below.
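A reconstruction of the formula, consistent with the code later in this article: the symbols c_1 … c_m for the character values of a string of length m are notation assumed here, and b is the base.

```latex
\[
H = c_1 b^{m-1} + c_2 b^{m-2} + \dots + c_{m-1} b^{1} + c_m b^{0}
  = \sum_{i=1}^{m} c_i \, b^{m-i}
\]
```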
Example
For the sake of brevity, let’s use integers directly instead of character conversions in this example.
Given the pattern ‘135’ and a text ‘2135’ with a base b = 10.
First we compute the hash of the pattern ‘135’.
Next, we compute the hash of the first m = 3 characters of the text which is ‘213’.
This is clearly not a match. So, let’s slide the window by dropping the first character of the previous window and adding the next character to it. The window now represents ‘135’.
Now our hashes are a match and the algorithm essentially comes to an end.
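Written out with the polynomial hash, and treating each digit as its own value, the three computations in this example would look like this (a worked sketch using b = 10):

```latex
\begin{align*}
H(\text{`135'}) &= 1 \cdot 10^{2} + 3 \cdot 10^{1} + 5 \cdot 10^{0} = 135 \\
H(\text{`213'}) &= 2 \cdot 10^{2} + 1 \cdot 10^{1} + 3 \cdot 10^{0} = 213 \neq 135 \\
H(\text{`135'}) &= 1 \cdot 10^{2} + 3 \cdot 10^{1} + 5 \cdot 10^{0} = 135
\end{align*}
```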
Rolling Hash
Notice that we had to compute the entire hash of ‘135’ from scratch after sliding the window on from ‘213’. This is undesirable, as we recomputed the contribution of digits we had already accounted for in the previous window.
The rolling hash function can effectively eliminate these additional computations by acknowledging that the hash of a new window skips the first character of a previous window and adds the computation of a new character.
In theory, we can get rid of the hash value of the skipped character, multiply the resulting value by the base (to restore the correct order of the exponents of the previous untouched characters), and finally add the value of the new character.
Therefore, we can compute the hash of the new window by using the equation shown below.
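A reconstruction of that update, following the three steps just described. The names H_old, H_new, c_old (the character leaving the window) and c_new (the character entering it) are notation assumed here; m is the window length and b the base.

```latex
\[
H_{\text{new}} = \left( H_{\text{old}} - c_{\text{old}} \cdot b^{m-1} \right) \cdot b + c_{\text{new}}
\]
```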
Using the previous example of moving from ‘213’ to ‘135’, we can plug in the values to get the new hash.
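With b = 10 and m = 3, the leading digit 2 is dropped and the digit 5 is added, so the calculation would run as follows (a worked sketch using the digits as values):

```latex
\[
H(\text{`135'}) = \left( 213 - 2 \cdot 10^{2} \right) \cdot 10 + 5 = 13 \cdot 10 + 5 = 135
\]
```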
By using the rolling hash function, we can calculate the hash of each new window in constant time, and therefore the hashes of all windows in linear time. This is the main component of the Rabin-Karp algorithm that provides its average and best case time complexity of O(n+m).
Modular Arithmetic
All math in the Rabin-Karp algorithm needs to be done in modulo Q to avoid manipulating large H values and integer overflows. This is done at the expense of increased hash collisions, also known as spurious hits.
The value for Q would usually be a large prime number — as large as it can be without compromising arithmetic performance. The smaller the value of Q, the higher the chances of spurious hits.
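To make the trade-off concrete, here is a small worked example that reuses b = 10 and the digit values from the earlier example. The modulus Q = 7 and the second string ‘212’ are illustrative assumptions chosen to force a collision.

```latex
\begin{align*}
H(\text{`135'}) \bmod 7 &= 135 \bmod 7 = 2 \\
H(\text{`212'}) \bmod 7 &= 212 \bmod 7 = 2
\end{align*}
```

The two different strings share a hash, so a window containing ‘212’ would be a spurious hit for the pattern ‘135’ and would only be rejected by the character comparison. With a larger modulus such as Q = 1009, the two hashes stay distinct (135 and 212).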
There is a potential problem with the above approach. To understand what that is, let’s have a look at a simple JavaScript code snippet of it.
function hash(pattern, Q) {
  const b = 13;
  const m = pattern.length;
  let hash = 0;
  for (let i = 0; i < m; i++) {
    const charCode = pattern.charCodeAt(i);
    hash = (hash + charCode * (b ** (m - i - 1))) % Q;
  }
  return hash % Q;
}
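Notice that each iteration of the loop performs an exponentiation, a multiplication and an addition. Not only is that inefficient, but it also fails to prevent overflow, because charCode * b ** (m - i - 1) is calculated in full before the modulo operator is even applied. In JavaScript, numbers beyond 2^53 − 1 (Number.MAX_SAFE_INTEGER) do not wrap around but silently lose precision, so the hash becomes unreliable for longer patterns. We can overcome this problem by using Horner’s method.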
Horner’s Method
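Horner’s method simplifies the process of evaluating a polynomial by rewriting it in nested form, so that it is evaluated as a chain of first-degree steps, each consisting of one multiplication and one addition.
Using this method, we can eliminate the exponentiation from our previous implementation. This leaves us with only one multiplication and one addition at each step in the loop. Because the modulo is applied after every step, intermediate values never exceed roughly Q × b, which in turn allows us to prevent integer overflows as long as that bound fits in the number type.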
function hash(pattern, Q) {
  const b = 13;
  const m = pattern.length;
  let hash = 0;
  for (let i = 0; i < m; i++) {
    const charCode = pattern.charCodeAt(i);
    hash = (hash * b + charCode) % Q;
  }
  return hash;
}
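The nested form that this loop evaluates can be written out for a three-character string. The symbols c_1, c_2, c_3 for the character values are notation assumed here, and the second line reuses the digits of ‘135’ with b = 10 as a check.

```latex
\begin{align*}
c_1 b^{2} + c_2 b + c_3 &= \left( c_1 \cdot b + c_2 \right) \cdot b + c_3 \\
1 \cdot 10^{2} + 3 \cdot 10 + 5 &= \left( 1 \cdot 10 + 3 \right) \cdot 10 + 5 = 135
\end{align*}
```

Taking the result modulo Q after each multiply-and-add step gives the same final value as reducing once at the end, which is why the loop can keep its intermediate values small.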
We can use the same approach to compute the rolling hash of the next window from the hash of the previous one:
function roll(previousHash, previousPattern, newPattern, base, Q) {
  let hashCode = previousHash;
  // Compute the multiplier base^(m - 1) mod Q for the leading character.
  // It depends only on the window length, so a caller could compute it once and reuse it.
  let multiplier = 1;
  for (let i = 1; i < previousPattern.length; i++) {
    multiplier *= base;
    multiplier %= Q;
  }
  // Ensure a non-negative hash by adding the modulus before subtracting
  hashCode += Q;
  hashCode -= (multiplier * previousPattern.charCodeAt(0)) % Q;
  hashCode *= base;
  hashCode += newPattern.charCodeAt(newPattern.length - 1);
  hashCode %= Q;
  return hashCode;
}
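Since the multiplier depends only on the window length, the base and the modulus, one possible variation is to compute it once and reuse it for every slide, which makes each update constant time. The sketch below assumes a hypothetical helper named `createRoller` that takes character codes instead of substrings; neither the name nor the signature comes from the implementation above.

```javascript
// Hypothetical helper: returns a roll function for a fixed window length m.
function createRoller(m, base, Q) {
  // base^(m-1) mod Q, computed once instead of on every slide.
  let multiplier = 1;
  for (let i = 1; i < m; i++) {
    multiplier = (multiplier * base) % Q;
  }
  // outCode: value of the character leaving the window.
  // inCode: value of the character entering the window.
  return function roll(previousHash, outCode, inCode) {
    const removed = (multiplier * outCode) % Q;
    // Add Q before subtracting so the intermediate value never goes negative.
    return ((previousHash + Q - removed) * base + inCode) % Q;
  };
}
```

Using the digits of the earlier example as values, `createRoller(3, 10, 1000000007)(213, 2, 5)` returns 135, the hash of the window ‘135’.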
Complexity
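Given a pattern of length m and a text of length n, the average and best case time complexity is O(n + m) and the space complexity is O(m). The worst case time complexity is O(nm). This occurs when almost every window has to be verified character by character, either because a poorly performing hash function produces many spurious hits or because the pattern genuinely matches at almost every position (for example, searching for ‘aa’ in a long run of ‘a’ characters). A good hash function, such as polynomial hashing with a large prime modulus, makes the first cause unlikely.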
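To see where the O(nm) worst case comes from, the sketch below counts the character comparisons spent on verification. The function name `countVerificationWork` and the constants `BASE` and `MOD` are assumptions made for this illustration, and the function assumes a non-empty pattern that is no longer than the text.

```javascript
// Counts the character comparisons spent verifying windows whose hash matches.
// Assumes 1 <= pattern.length <= text.length. BASE and MOD are illustrative.
function countVerificationWork(text, pattern) {
  const BASE = 256;
  const MOD = 1000000007;
  const n = text.length;
  const m = pattern.length;

  // BASE^(m-1) mod MOD, the weight of the window's leading character.
  let high = 1;
  for (let i = 1; i < m; i++) high = (high * BASE) % MOD;

  let patternHash = 0;
  let windowHash = 0;
  for (let i = 0; i < m; i++) {
    patternHash = (patternHash * BASE + pattern.charCodeAt(i)) % MOD;
    windowHash = (windowHash * BASE + text.charCodeAt(i)) % MOD;
  }

  let comparisons = 0;
  for (let start = 0; start + m <= n; start++) {
    if (windowHash === patternHash) {
      // Verify character by character, stopping at the first mismatch.
      for (let j = 0; j < m; j++) {
        comparisons++;
        if (text[start + j] !== pattern[j]) break;
      }
    }
    if (start + m < n) {
      const leading = (text.charCodeAt(start) * high) % MOD;
      windowHash = ((windowHash + MOD - leading) * BASE + text.charCodeAt(start + m)) % MOD;
    }
  }
  return comparisons;
}
```

Searching for ‘aaa’ in ten ‘a’ characters verifies all 8 windows in full, for 8 × 3 = 24 comparisons, whereas searching for ‘xyz’ in ‘abcdefghij’ never triggers a verification at all.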
Conclusion
While there are better performing algorithms for single-pattern string searching, the Rabin-Karp algorithm can be quite efficient at finding multiple patterns, because the hash of each text window can be checked against the hashes of all the patterns at once. It provides a strong base for extended use cases in this problem domain.