Understanding The Knuth Morris Pratt Algorithm…In Bits

Fanan Dala
Analytics Vidhya
Published in
3 min readOct 13, 2019
Photo by Gerd Altmann (https://pixabay.com/users/geralt-9301/)

String matching is an important task in computer science with a wide variety of applications ranging from searching databases to genetics. There are many algorithms to accomplish this task and in this article, I will be taking you through the Knuth Morris Pratt algorithm (KMP).

The Naive Way

Before we go into the KMP algorithm, I would like to show how most people will attempt to solve a string matching problem. I do this so we can see how KMP optimizes searches.

The easiest, yet least efficient method of searching for matches is to loop through each character in a haystack and compare that character and the characters after it to the characters in a needle. The code snippet below shows how we do this.

Although we can optimize the snippet above by advancing the outer loop by one character immediately a character mismatch is found in the inner loop, this will, however, give us a worst-case time complexity of O(nm) when we have matching characters until the last character.

haystack = AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAneedle = AAAAAAAAB

The KMP Way

The KMP algorithm traverses the haystack and the needle from left to right until a match is found. Below is a code snippet with a special case of both strings having equal length. This is to get you accustomed to how KMP traverses strings to find a match.

Shifting the needle

Now that we see how KMP determines if a character is matched, we will then see how KMP shifts the needle through the haystack in O(n) time to search for matches. Take a look at the code snippet below.

note: the algorithm provided here only works for the test case given in the snippet after this and will skip possible matches in other test cases. This is done only to show how to move the needle through the haystack.

Whenever a mismatch is found and it is at an index greater than 0, we immediately move the needle to that character in the haystack and begin our comparisons. If the mismatch occurs at index 0 of the needle, we move the needle to the next character after the mismatch. This has a best and worst time complexity of O(n).

# HOW KMP MOVES THE NEEDLE THROUGH THE HAYSTACKCABCDEKLSDSABCDEABCD
ABCDEABCD
CABCDEKLSDSABCDEABCD
ABCDEABCD
CABCDEKLABCDEABCD
ABCDEABCD
CABCDEKLABCDEABCD
ABCDEABCD
CABCDEKLABCDEABCD
ABCDEABCD

Prefixes and Suffixes

Even though the algorithm above gives a reasonably good time complexity of O(n) it would skip over some possible matches, still make needless comparisons and hence does not give the best efficiency. To further improve the efficiency of the string matching algorithm above, KMP makes use of proper prefixes and suffixes contained in the needle to avoid unnecessary comparisons.

Given a string ABCDEFABCR, the proper prefixes we have in this string are A, AB, ABC, ABCD, ABCDE, ABCDEF, ABCDEFA, ABCDEFAB, ABCDEFABC and the suffixes we have are R, CR, BCR, ABCR, FABCR, EFABCR, DEFABCR, CDEFABCR, BCDEFABCR. Now, we can see that A, AB, ABC are proper prefixes that are also found in suffixes contained in the string ABCDEFABCR.

In KMP we build an array that tells us where these proper prefixes that are also suffixes of the needle occur and we use that to optimize our searches. The code snippet below builds an array that tells us where the “Longest proper prefix that is also a suffix(LPS)” occurs.

Using the LPS array

When we find a mismatch in the characters being compared and the index of that character is not the first index (0), we look for the LPS value of the character before it to know how many characters can be skipped in the needle. We then begin our comparison of the mismatched character with the character following the last skipped character. If the mismatch occurs at the first character we move the needle to the next character in the haystack.

# HOW KMP MOVES UTILIZES THE LPS
The LPS for ABCDEABCD = [0,0,0,0,0,1,2,3,4]
CABCDEKLSDSABCDEABCD
ABCDEABCD
CABCDEABRDABCDEABCD
ABCDEABCD
# A and B are italicized to show that they are skipped.
# Rather than begin comparing from index 0, we start our comparison from index 2
#
The possible match AB is however not skipped even though we never compare the two characters.
CABCDEABRDABCDEABCD
ABCDEABCD
CABCDEABRDABCDEABCD
ABCDEABCD
CABCDEABRDABCDEABCD
ABCDEABCD
The complete Knuth Morris Pratt Algorithm

--

--