That one string matching algorithm

Elliott Jin
Python Pandemonium
Published in
4 min readJan 30, 2017

I’ve come across the Knuth-Morris-Pratt (or KMP) string matching algorithm several times. Every time, I somehow manage to forget how it works within minutes of seeing it (or even implementing it).

New plan:

  1. Write about it
  2. Actually remember how it works

Of course, here’s what will probably happen instead:

  1. Write about it
  2. Forget how it works
  3. Forget that I wrote about it
  4. goto 1

So I apologize in advance if I end up making you read an infinite number of posts about Knuth-Morris-Pratt.

Given two strings needle and haystack, how do we find all indexes at which needle appears in haystack?

We can start with a naive approach where we check every index:

but this approach is slow; how can we do better?

Sometimes, we don’t have to check every index. For example, if needle == "ABD" and haystack == "ABCABD":

checking index 0 tells us that needle[0:2] == haystack[0:2]. Since needle[0] != needle[1], we know that needle[0] != haystack[1], and we can skip the check for index 1.

Before we look for a general way to figure out which indexes we can skip, let’s restructure our naive approach to accommodate skipping indexes:

Let’s also replace is_match with something that preserves the information it gets from comparing needle with haystack:

And let’s pass needle, plus the result of match_len, as parameters to amount_to_advance:

OK, it’s time to look for a general way to figure out which indexes we can skip. In other words, it’s time to implement a better version of amount_to_advance.

If we’re given that needle[0:k] == haystack[i:i+k], where k is positive:

then when is it possible for needle to appear in haystack at index i + j?

If we look at the overlap between the two shaded regions of haystack, we see that we’d have to have needle[0:k-j] == needle[j:k]:

In other words, if needle[0:k-j] != needle[j:k], then we know that needle definitely doesn’t appear in haystack at index i + j, and we can skip the check for that index.

So instead of always returning 1, amount_to_advance can look for the smallest positive j such that needle[0:k-j] == needle[j:k]:

This is the same as looking for the largest nonnegative l such that l < k and needle[0:l] == needle[k-l:k], and then returning k — l:

l is the length of the longest prefix of needle[0:k] that’s also a suffix of needle[0:k], so we can express amount_to_advance like this:

A naive implementation of max_prefix_suffix might look like this:

But how we can implement max_prefix_suffix more efficiently?

(a) We can find the largest l such that needle[0:l] == needle[k-l:k] by finding the largest l-1 such that needle[0:l-1] == needle[k-l:k-1] and needle[l-1] == needle[k-1]:

(b) For a given value of k, we can iterate through the values of l for which needle[0:l] == needle[k-l:k] in descending order by making repeated calls to max_prefix_suffix:

Using (a) and (b), we can efficiently precompute all values of max_prefix_suffix in a single pass:

Now let’s use the precomputed version of max_prefix_suffix everywhere:

We’re almost done, but there’s one last optimization we can make. If we look at how we’re calculating amount_to_advance:

we can see that after going to the next iteration, match_len doesn’t have to start comparing from the beginning of needle; instead, it can start comparing after the overlapping region.

Since the size of the overlapping region is just the value of max_prefix_suffix[k] from the previous iteration, this optimization is straightforward.

Here’s the final implementation:

--

--