String Searching Algorithm (Knuth–Morris–Pratt) | in Python

The Knuth–Morris–Pratt string-searching algorithm (or KMP algorithm) searches for occurrences of a “word” `W` within a main "text string" `S` by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

7 min readMar 2, 2023

The basic idea of the KMP algorithm is to use a partial match table, which is computed from the pattern string and provides information about the longest proper prefix of the pattern that is also a suffix of each prefix of the pattern.

This information allows the algorithm to skip over parts of the text that cannot match the pattern, based on what has already been matched. By doing so, the KMP algorithm can avoid unnecessary character comparisons and can run much faster than the brute-force approach.

The process:

I. Compute the partial match table for the pattern string “ABABAC”. This table tells us how much of the pattern matches a proper suffix of the pattern for each prefix of the pattern.

Generate the partial match table

The partial match table is computed using a modified version of the prefix function from the Z algorithm.

The partial match table (also called the failure function or the prefix function) for a pattern string is typically created using the following algorithm:

Initialize two pointers, i and j, to 0.
Create an empty list table with the same length as the pattern:

pattern: A B A B C A B A A 
table:   0 0 0 0 0 0 0 0 0

3. Set table[0] = 0, since the first character of the pattern has no proper prefix that is also a proper suffix:

pattern: A B A B C A B A A 
table:   0 0 0 0 0 0 0 0 0 
         ^

4. Loop over the remaining characters of the pattern

(i.e., for i from 1 to len(pattern) - 1) [i = 1 ; j = 0]:

a. If the character at position i is equal to the character at position j in the pattern, set table[i] = j + 1 and increment both i and j.

pattern: A B A B C A B A A 
table:   0 0 1 2 0 1 2 3 1  
               ^       ^

b. If the character at position i is not equal to the character at position j in the pattern, set table[i] to the maximum value of 0 and the value of table[j-1], where j is the previous value of j.

i. If j is equal to 0, set table[i] = 0 and increment i.

pattern: A B A B C A B A A   
table:   0 0 1 2 0 1 2 3 1
                 ^    ^

ii. Otherwise, set j to the value of table[j-1] and repeat step 4.

pattern: A B A B C A B A A   
table:   0 0 1 2 0 1 2 3 1  
                     ^ ^

The resulting table is:

pattern: A B A B C A B A A
table:   0 0 1 2 0 1 2 3 1

This table tells us, for each position in the pattern, what is the length of the longest proper prefix of the pattern that is also a proper suffix of the substring ending at that position. For example, table[3] = 2 because the substring ABAB has two proper prefixes that are also proper suffixes, namely AB and B, and their length is two.

Another simpler example:

II. Initialize two indices i and j to 0, which represent the current positions in the text and pattern, respectively.

III. Compare the character at position i in the text to the corresponding character at position j in the pattern. If they match, increment both i and j. If they do not match, we use the partial match table to determine the next position of j.

IV. In our example, we start by comparing the first characters of the text and pattern, which both have the value “A”. They match, so we increment both i and j. We continue in this way until we reach the fifth character of the pattern, which has the value "C". At this point, we have matched the prefix "ABABA" of the pattern to the substring "ABABA" of the text. However, the next character of the pattern is "C", which does not match the corresponding character in the text.

V. Use the partial match table to determine the next position of j. Specifically, we look up the value of the partial match table for the previous value of j (which is 3) to determine the next value of j. This value tells us how much of the pattern we have already matched that also matches the current suffix of the pattern. In this case, the value of the partial match table for j=3 is 2, which means that the prefix "ABA" of the pattern matches the suffix "BA" of the current substring "ABAB" of the text. Therefore, we set j to 2 and continue comparing characters.

VI. Repeat steps 3–4 until we reach the end of the text or find a match for the pattern. In this case, the next character of the pattern is “A”, which matches the corresponding character in the text. We continue comparing characters until we have matched the entire pattern in the text, starting at position 2.

Overall, the KMP algorithm is able to efficiently search for patterns in text by taking advantage of information about the pattern to avoid unnecessary comparisons. The partial match table allows the algorithm to skip over parts of the text that cannot match the pattern, based on what has already been matched.

The time complexity of the KMP algorithm is O(n + m), where n and m are the lengths of the text and pattern, respectively.

This is because the algorithm compares each character of the text to at most one character of the pattern, and it computes the partial match table in O(m) time. This can be much faster than the brute-force approach, which has a time complexity of O(nm) in the worst case.

Overall, the KMP algorithm is a powerful and efficient string matching algorithm that is widely used in practice.

It is a good choice for problems where the text and/or pattern strings are large, and where speed is a critical factor.

The Code:

def partial_match_table(pattern):
    """
     Constructs the partial match table for the KMP algorithm.

    Args:
        pattern (str): The pattern to construct the partial match table for.

    Returns:
        list[int]: The partial match table.
    """
    table = [0] * len(pattern)
    i, j = 1, 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            j += 1
            table[i] = j
            i += 1
        elif j > 0:
            j = table[j-1]
        else:
            i += 1
    return table


def kmp_search(pattern: str, text: str) -> int:
    # Step 1: Construct the partial match table
    table = partial_match_table(pattern)

    # Step 2: Perform the search
    i, j = 0, 0
    while i < len(text):
        if pattern[j] == text[i]:
            i += 1
            j += 1
            if j == len(pattern):
                return i - j
        elif j > 0:
            j = table[j-1]
        else:
            i += 1

    return -1


def check_kmp(pattern, text):
    """
    Searches for the first occurrence of the pattern in the given text using the Knuth-Morris-Pratt algorithm and
    prints out different statements based on the result of the search.

    Args:
        pattern (str): The pattern to search for.
        text (str): The text to search in.
    """
    result = kmp_search(pattern, text)
    if result == -1:
        print(f"The pattern '{pattern}' was not found in the text '{text}'.")
    else:
        print(f"The pattern '{pattern}' was found in the text '{text}' starting at index {result}.")


if __name__ == "__main__":
    # Test 1)
    pattern = "abc1abc12"
    text1 = "alskfjaldsabc1abc1abc12k23adsfabcabc"
    text2 = "alskfjaldsk23adsfabcabc"
    check_kmp(pattern, text1)
    check_kmp(pattern, text2)

    # Test 2)
    pattern = "ABABX"
    text = "ABABZABABYABABX"
    check_kmp(pattern, text)

    # Test 3)
    pattern = "AAAB"
    text = "ABAAAAAB"
    check_kmp(pattern, text)

    # Test 4)
    pattern = "abcdabcy"
    text = "abcxabcdabxabcdabcdabcy"
    check_kmp(pattern, text)

Alternative implementation:

It follows the same two steps as the previous implementation:

constructing the failure array, and searching the text for the pattern while using the failure array to efficiently handle mismatches.

The only difference is in the implementation of the get_failure_array function, which is still creating the partial match table using a similar algorithm.

from __future__ import annotations


def kmp(pattern: str, text: str) -> bool:
    """
    The Knuth-Morris-Pratt Algorithm for finding a pattern within a piece of text
    with complexity O(n + m)
    1) Preprocess pattern to identify any suffixes that are identical to prefixes
        This tells us where to continue from if we get a mismatch between a character
        in our pattern and the text.
    2) Step through the text one character at a time and compare it to a character in
        the pattern updating our location within the pattern if necessary
    """

    # 1) Construct the failure array
    failure = get_failure_array(pattern)

    # 2) Step through text searching for pattern
    i, j = 0, 0  # index into text, pattern
    while i < len(text):
        if pattern[j] == text[i]:
            if j == (len(pattern) - 1):
                return True
            j += 1

        # if this is a prefix in our pattern
        # just go back far enough to continue
        elif j > 0:
            j = failure[j - 1]
            continue
        i += 1
    return False


def get_failure_array(pattern: str) -> list[int]:
    """
    Calculates the new index we should go to if we fail a comparison
    :param pattern:
    :return:
    """
    failure = [0]
    i = 0
    j = 1
    while j < len(pattern):
        if pattern[i] == pattern[j]:
            i += 1
        elif i > 0:
            i = failure[i - 1]
            continue
        j += 1
        failure.append(i)
    return failure


if __name__ == "__main__":
    # Test 1)
    pattern = "abc1abc12"
    text1 = "alskfjaldsabc1abc1abc12k23adsfabcabc"
    text2 = "alskfjaldsk23adsfabcabc"
    assert kmp(pattern, text1) and not kmp(pattern, text2)

    # Test 2)
    pattern = "ABABX"
    text = "ABABZABABYABABX"
    assert kmp(pattern, text)

    # Test 3)
    pattern = "AAAB"
    text = "ABAAAAAB"
    assert kmp(pattern, text)

    # Test 4)
    pattern = "abcdabcy"
    text = "abcxabcdabxabcdabcdabcy"
    assert kmp(pattern, text)

    # Test 5)
    pattern = "aabaabaaa"
    assert get_failure_array(pattern) == [0, 1, 0, 1, 2, 3, 4, 5, 2]

Visualized:

Bogdan Tudorache | Founder @ berrynews.org

If you like the article and would like to support me, make sure to:

👏 Clap for the story (50 Claps) to help this article be featured

🔔 Follow me Bogdan Tudorache

📰 Read more coding content on Coding Challenge

🔔 Connect w/ me: LinkedIn | Reddit