String algorithms: What is the prefix function and how to compute it

Published in

Geek Culture

5 min readAug 2, 2021

The prefix function is used by many string algorithms, including the Knuth-Morris-Pratt algorithm for string matching. This article derives, implements and analyses an algorithm which computes the prefix function of a given string in linear time.

Notations, terminology

I will use the following notations and terminology throughout this article:

Unless otherwise specified, capital letters refer to strings and their lowercase version to their length (e.g. S is a string of length s).
An index of S is an integer between 1 and s inclusive.
TU denotes the string made by concatenating T and U, where T and U may be strings or characters.
For all indices i, j of T, T[i] denotes the i-th character of T and T[i .. j] denotes the string T[i] … T[j].

Borders and the prefix function

We consider a string S. We can then define borders as follows:

Definition:
A border of S is a string B which is a proper prefix and a proper suffix of S.

For example, the empty string ε is a border of any non-empty string, but ε has no borders (because its only substring is itself, which is not a proper prefix or suffix). The longest border of the string baobaba is ba.

Definition:
The prefix function of S is the function 𝜋 that maps indices i of S to the length of the longest border of S[1 .. i].

For example if we consider the string baobaba then 𝜋(1) = 𝜋(2) = 𝜋(3) = 0, 𝜋(4) = 1, 𝜋(5) = 2, 𝜋(6) = 1 and 𝜋(7) = 2.

Our goal is to write an algorithm which takes a string and gives its prefix function.

Paving our way

The first interesting property of borders tells us that every border of S is the longest border of a substring of S. More precisely:

Property 1:
The n-th longest border of S is the longest border of the (n-1)-th longest border of S.

The proof is something along those lines:

Proof outline:
We can start by proving that if A and B are borders of S, then a < b implies that A is a border of B. In particular, the n-th longest border of S is a border of the (n-1)-th. We now need to prove that it is the longest. Suppose that it is not: the (n-1)-th longest border of S has a border B which is strictly longer than the n-th longest border of S. Since B is a border of a border of S, it must be a border of S as well. It is strictly shorter than the (n-1)-th longest and strictly longer than the n-th longest: this is impossible.

We will use this property to derive a recurrence relation to find all values of 𝜋.

The base case is simple: the longest border of a string containing one single character is ε. Therefore, 𝜋(1) = 0.

Suppose that S has length s ≥ 2. B is a non-empty border of S if and only if the following conditions are all satisfied:

B[b] = S[s];
B[1 .. b - 1] = S[s - b .. s - 1];
B[b] = S[b];
B[1 .. b - 1] = S[1 .. b - 1];
b ≤ s - 1.

We have just rephrased the definition in a way that separates the last symbol of the border from the rest. Statements 1. and 2. are equivalent to saying that B is a suffix of S, statements 3. and 4. are equivalent to saying that B is a prefix of S and statement 5. implies that B is a proper prefix/suffix of S. This enables to find a relation between borders of S and borders of S[1 .. s - 1]:

B is a non-empty border of S if and only if: B[1 .. b - 1] is a border of S[1 .. s - 1] and B[b] = S[b] = S[s].

It means that any border of S is constructed by appending S[s] to a border of S[1 .. s - 1], provided that S[s] = S[b]. Therefore:

Property 2:
S has a non-empty border of length b if and only if: S[1 .. s - 1] has a border of length b - 1 and S[b] = S[s].

Writing a solution

We can thus establish the following informal algorithm:

To find 𝜋(i), iterate over all border lengths b of S[1 .. i - 1] in decreasing order. The first time S[b + 1] = S[i], we set 𝜋(i) = b + 1. If it never happens, 𝜋(i) = 0.

So we start with the longest border of S[1 .. i-1], of length 𝜋(i-1). If S[𝜋(i - 1) + 1] = S[i], we can set 𝜋(i) = 𝜋(i-1) + 1. Otherwise, we need to find the length of the second longest border of S[1 .. i-1]. Property 1 tells us that it is given by 𝜋(𝜋(i - 1)): the second longest border of S is the longest border of the longest border of S. So we need to check if S[1 .. 𝜋(𝜋(i - 1)) + 1] = S[i], in which case we set 𝜋(i) = 𝜋(𝜋(i - 1)) + 1. We repeat until we find the longest border — if we never find it, we set 𝜋(i) = 0.

This gives the following algorithm in pseudocode (array and string indices start at 1):

We start by initialising an array which corresponds to the prefix function (line 4) and set its first value to 0 (it is the base case).

The while loop (line 10) iterates over all border lengths of S in a decreasing order. We make sure that borderLength remains positive (𝜋(0) would not make any sense). If the (borderLength + 1)-th character of the string is equal to the i-th one then we’ve found our longest border (it has length borderLength + 1).

If the condition line 13 is satisfied then we have found the length of the longest border of S[1 .. i] and set the value of borderLength accordingly. Otherwise, borderLength is zero, which corresponds to the maximum border length of S[1 .. i].

We can therefore define 𝜋(i) and move on.

Analysis

As promised, the algorithm runs in Θ(s) time with respect to the length of the string. This is not obvious, in particular because of the while loop.

The only variable which determines the number of iterations of the while loop is borderLength. It is initialised at 0 and can increase (line 14) or decrease (line 11).

Since borderLength is always nonnegative, it cannot decrease more than it increases. It increases by no more than s - 1 (once per iteration of the for loop), which proves that the while loop runs no more than s - 1 times throughout the execution of the programme.

Since the for loop runs exactly s - 1 times, the total complexity of compute-𝜋 is Θ(s).