Today’s blog is about a famous algorithm called the Knuth-Morris-Pratt algorithm after these guys up top.
Legend has it that James Morris conceived of the algorithm and then “a few weeks later” Donald Knuth was creating the same thing in the field of automata theory. Morris and Vaughan Pratt created a report on it in 1970 and then all three of them got together to publish the algorithm in 1977. That is the very very brief history of it, but what is it?
The Knuth-Morris-Pratt, henceforth called KMP, is a string matching algorithm that takes advantage of the fact that there are often patterns in strings. For example, consider the strings “aefaefaefaedaefaedaefaefa” and “aefaedaefaefa”. If I am trying to see if the latter string is present in the former string, normally the typical string matching algorithm will run in an average time complexity of O(nm) where ’n’ and ‘m’ are the lengths of the strings respectively. I would match the two strings character for character from the beginning until I got to the ‘d’ in the smaller string. At that point I would have to start over at the smaller string and check that against the 2nd character of the larger string.
This brute-force approach goes through values that I have already computed. As a human person I can look at these two strings and see that the pattern of ‘aef’ is present throughout both strings. Instead of going back to the beginning I could start at the beginning of the substring and just move back to the previous iteration of ‘aef’. This is the spirit behind the KMP algorithm.
Creating the pattern
Before I go any further I want to stress that this is a difficult algorithm. I’m writing this blog for anyone who wants to learn about it (specifically in Java) but mostly so I can try to understand it better myself. With that being said, it isn’t the easiest thing to describe using static images and words. So, here goes nothing!
The key behind the KMP algorithm is that before we match the two strings together we will iterate through the string to match to see if any patterns exist within it. Inside of our function we are going to write a helper function that returns an array that will serve as a pattern.
First we initialize an array of integers that is the same length as the string to match (which we pass in as an argument). We fill the array with -1 at each index (don’t worry about this yet it’ll make more sense later). Then we create variables — ‘j’ will start at the first letter in the string and ‘i’ will start at the second letter. After that we create a while loop that iterates as long as variable i is less than our string.
Each time through the loop we start by checking to see if the characters at our variables match. If so we put the index number of ‘j’ inside of our pattern array at index ‘i’, and increment both of our variables. If the substring was “aab” and our variables were at 0 and 1 respectively, this would mean we would change our pattern array from being [-1, -1, -1] to [-1, 0, -1]. Essentially we are recognizing that the character at index 1 can also be found at index 0. The reason for this will become clearer shortly.
Alternatively, if the characters don’t match BUT ‘j’ is greater than 0 we set ‘j’ equal to the value in our pattern at index j minus 1, plus 1. This might be a bit confusing. If we have the substring ‘aefaed’ and we’ve progressed through our function so that ‘i’ is at the ‘d’ (substring.charAt(5)) and ‘j’ is at the ‘f’ (substring.charAt(2)), we see that the characters do not match but also that ‘j’ is greater than 0. At this point our pattern array would like this: [-1, -1, -1, 0, 1, -1]. We are going back to the last character where there was a match (the ‘e’ or ‘j’ minus 1 which is 1), grabbing the value from our pattern array at that index (at pattern in this case would be -1) and then adding one to that. Since we didn’t find a match at index 1 we let the value -1 remain in our pattern array. This is why we filled the array with -1; if we attempt to find a previous instance or pattern of a few letters and fail we bring ‘j’ back to the beginning of our substring.
Phew, ok. I’m going to explain that a little more. Imagine we had the substring “aefaedaefaefa”:
If variable ‘i’ is at 8 (‘f’) and variable ‘j’ is at 5 (‘d’), once again the characters do not match and ‘j’ is greater than 0. So we set ‘j’ equal to the value in the pattern at j minus 1, plus 1. ‘j’ minus 1 is 4 (where the two ‘e’s matched), the value in our pattern array at the index is 1 as we can see above, and then plus 1. This would set ‘j’ equal to 2. Moving forward when we compare the characters in the substring at 2 and 8 we see that they match and we the index of ‘j’ in our pattern array at index ‘i’. I’d suggest taking a second with that.
Lastly if the characters do not match and ‘j’ is not greater than 0 we simply increment ‘i’. Outside of the while loop we return our pattern array.
Looking for a match
Now that we’ve created our pattern, we’re ready to try to find a match between our string and our substring. Like before we will abstract this into a helper function.
This helper function takes in our string, substring and pattern array that we created. We initialize our variables. We set them both equal to 0 because one will start at the beginning of our string and the other at the beginning of our substring. Then we create a while loop which contains very similar logic to our buildPattern function and is responsible for determining if the substring is present anywhere in the string.
In this case I used the variable ‘i’ in the string and ‘j’ in the substring. The loop will iterate while the current value of ‘i’ plus the length of the substring minus ‘j’ (in other words our current position in the string plus the substring’s length minus our current position in the substring) is less than or equal to the length of our string. There are other conditions to use in this part, however this reduces unnecessary computations because if the remaining part of the substring to match is longer than the remainder of characters in the string we already know that there is no match.
We check if the characters in our respective strings at the variables match. If they do we then ask if ‘j’ is equal to the length of the substring minus 1, or if ‘j’ is at the last character in the substring. If this is true then we know that we’ve iterated through the whole substring successfully and that there is indeed a match. If this condition is not met then we just increment ‘i’ and ‘j’ and keep going.
Like our buildPattern function, if the characters don’t match we then check to see if ‘j’ is greater than 0 and adjust the value of ‘j’ just like we did in buildPattern. Similarly if ‘j’ is still at 0 we only increment ‘i’ and continue. If we get out of the while loop without hitting the end of the substring we know that we were unsuccessful in finding a match and return false.
I think this algorithm is very cool and interesting. I hope this blog helps you understand it and inspires you to try to implement it the next time you need to match strings!