Knuth-Morris-Pratt Algorithm

A057_Gagandeep Singh
Pattern Searching Algorithm
6 min readApr 13, 2021

The KMP algorithm is a solution to the string search problem wherein we are required to find if a given pattern string occurs in another main string. It is one of the advanced string matching algorithms that was conceived by Donald Knuth, James H. Morris and Vaughan Pratt, hence the name “KMP algorithm”.

The algorithm keeps a track of the comparison of characters between main text and pattern, thereby ensuring that comparisons that have already been done are not repeated, i.e. backtracking of the main string never occurs. All this results in a linear matching time and a more optimized solution.

Problem Description:

We are provided with the main string S[0..n-1] and a pattern string p[0..m-1]and our task is to find whether the pattern 'p' occurs in the main string 'S', and if it occurs, then we have to print all the occurrences of 'p' in string 'S'.

Examples:

Input: Main String: “CODESDOPE”, Pattern: “CODE”
Output: Pattern found at location: 0

Input: Main String: “ABAAABAAAAABBAAAABA”, Pattern: “AAAB”
Output: Pattern found at location: 2
Pattern found at location: 8
Pattern found at location: 14

Example

Approach:

The approach for the KMP algorithm is to build two functions:

  • Partial match function or prefix function:
    This function uses the pattern string to give the count of characters that need to be skipped while matching with the main string. It takes the pattern string as an input and returns a matching table in the form of an array that contains the lengths of longest proper prefix that is also a suffix(lps values). With the help of the lps value for a character, we determine how many characters to skip while matching. Let us take a look at some examples for lps values of string characters:
    s1 = “ABCD” lps[] = [0,0,0,0]
    s2 = “AABB” lps[] = [0,1,0,0]
    s3 = “AAAB” lps[] = [0,1,2,0]
    s4 = “AABBAA” lps[] = [0,1,0,0,1,2]
    Prefixes are substring containing the initial part of a string and suffixes are substring containing the ending part of a string. For example: “PQR” has prefixes “”,”P”,”PQ”,”PQR”, and suffixes “”,”R”,”QR”,”PQR”. A proper prefix or suffix refers to all substrings except for the complete string.
  • KMP matcher function:
    This function takes the main string, pattern string and the matching table as parameters and then compares both the strings while taking the matching table values into consideration, thereby skipping all characters that have already been compared once. In this way it finds all the occurrences of the pattern string in the main string and returns the position of the match.

Using the above two functions we ensure that backtracking of the main string never occurs.

Code in C++:

#include <iostream>
using namespace std;
void computePrefix(string p, int m, int lps[])
{
// length of the previous longest prefix suffix
int l = 0;

lps[0] = 0; // lps[0] is always 0

//calculating lps[i] for i = 1 to M-1
int i = 1;
while (i < m) {
if (p[i] == p[l]) {
l++;
lps[i] = l;
i++;
}
else
{
if (l != 0) {
l = lps[l - 1];

}
else
{
lps[i] = 0;
i++;
}
}
}
}
void kmpPatternSearch(string p, string S)
{
int m = p.length();
int n = S.length();

// creating lps array
int lps[m];

// finding prefix table
computePrefix(p, m, lps);

int i = 0;
int j = 0;
while (i < n) {
if (p[j] == S[i]) {
j++;
i++;
}

if (j == m) {
cout<<"Pattern found at location: "<<i - j<<"\n";
j = lps[j - 1];
}

// mismatch case
else if (i < n && p[j] != S[i]) {
// Skip matching lps[0..lps[j-1]] characters
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
}
int main() {
string text = "ABAAABAAAAABBAAAABA";
string pat = "AAAB";
kmpPatternSearch(pat, text);
return 0;
}

Output:

OUTPUT

Working:

Let us see the working of the algorithm through an example:

computePrefix():
p = “AAAB”

l = 0, i = 1
lps[0] = 0

1. l = 0, i = 1:
p[i] and p[l] match so l++ and store it in lps[i] then i++
=> lps[1] = 1

2. l = 1, i = 2:
p[i] and p[l] match so l++ and store it in lps[i] then i++
=> lps[1] = 2

​ 3. l = 2, i = 3:
p[i] != p[l] and l>0
=> l = lps[l-1] = lps[1] = 1

​ 4. l = 1, i = 3:
p[i] != p[l] and l>0
=> l = lps[l-1] = lps[0] = 0

​ 5. l = 0, i = 3:
p[i] != p[j] and l = 0 so lps[i] = 0 and i++
=> lps[3] = 0

​Final prefix table:

kmpPatternSearch():

​ S = “ABAAABAAAAABBAAAABA”
p = “AAAB”
1. i = 0, j = 0:
S[i] and p[j] match so i++ and j++

​ 2. i = 1, j = 1: (Mismatch)
S[i] != p[j] and j != 0, so j = lps[j-1]
=> j = lps[0] = 0


3. i = 1, j = 0:
S[i] != p[j] and j = 0, so i++

4. i = 2, j = 0:
S[i] and p[j] match so i++ and j++

5. i = 3, j = 1:
S[i] and p[j] match so i++ and j++

6. i = 4, j = 2:
S[i] and p[j] match so i++ and j++

​ 7. i = 5, j = 3:
S[i] and p[j] match so i++ and j++

8. i = 6, j = 4: j = m, so print location of pattern i.e. (i-j) and j = lps[j-1]
=> Pattern found at location: (6–4) = 2, j = lps[3] = 0

​ Similarly, the loop iterates through the entire string S.

Complexity Analysis:

The first method computePrefix() contains a while loop that runs 'm' times where 'm' is the size of the pattern string. So the time complexity is O(m)O(m).

The second method kmpPatternSearch() just compares pattern through the entire main string of length 'n' using a while loop and so its time complexity is O(n)O(n).

Thus we can see that the total time complexity of KMP algorithm is O(m+n)O(m+n) which is a linear time complexity and where m is size of pattern string and n is size of main string.

--

--