Dynamic programming: how to solve the Longest Common Subsequence problem

Zakarie A
CodeX
Published in
7 min readOct 13, 2021

This article describes the longest common subsequence problem and derives and analyses an algorithm that solved it.

The longest common subsequence problem has applications in many fields, including version controlling. For example, given two texts A and B (represented by sequences of words) such that B is an updated version of A, finding the extracts from B which were left unchanged and those which were modified boils down to finding the longest common subsequence of A and B.

The image below illustrates the previous example. Highlighted words are part of the longest common subsequence.

More formally, we define a common subsequence of the sequences S and S’ of sizes N and M respectively, as a strictly increasing sequence X with values in [1, …, N ]×[1, …, M] such that for all values (i, j) of X, S[i] = S’[j] (indices start at 1). Increasing uses the relation defined by (a, b) ≤ (c, d) exactly when ac and bd.

A naïve approach would consist in examining all possible subsequences of the first sequence, and checking whether they can be found in the second sequence as well. There are as many subsequences of some sequence S as subsets of the indexing of S, i.e. 2, where n is the length of S. It would therefore take us Ω(2) time to solve this problem — but fortunately enough, we can do better using dynamic programming.

I will use the acronym LCS for longest common subsequence throughout this article.

Optimal substructure of the LCS Problem

This section shows how we can derive a recurrence relation to express a solution to an instance of the LCS problem in terms of solutions to smaller instances.

We consider two sequences: S, of size s, and T, of size t. We let Opt(a, b) denote the length of the LCS of S[1 .. a] and T[1 .. b].

If s = 0 or t = 0 then S and T have exactly one LCS, which is the empty sequence. Therefore, the LCS length is Opt(0, a) = Opt(a, 0) = 0 for all a.

Suppose now that both s and t are positive. We’ll consider two cases: either S[s] = T[t], or S[s] ≠ T[t].

If S[s] = T[t] then every LCS of S and T is constructed by appending (s, t) to any LCS of S[1 .. s-1] and T[1 .. t-1]. Therefore, Opt(s, t) = Opt(s-1, t-1) + 1.

Proof:
We’ll prove the equality by proving that the left-hand side is less than or equal to the right-hand side, and conversely, that the right-hand side is less than or equal to the left-hand side.

Proof that Opt(s-1, t-1) + 1 ≤ Opt(s, t) | Let X (of length x) be an LCS of S[1 .. s-1] and T[1 .. t-1]. Then the sequence X’ of length x + 1 defined by X’[i] = X[i] when i ≤ x and X’[x + 1] = (s, t) is a common subsequence of S and T, which proves that x + 1 ≤ Opt(s, t). Substitution x, we get Opt(s-1, t-1) + 1 ≤ Opt(s, t)

Proof that Opt(s, t) ≤ Opt(s-1, t-1) + 1 | Let Y (of length y) be an LCS of S and T.

(1) If Y[y] = (s, t) then Y’ = Y[1 .. y-1] is a common subsequence of S[1 .. s-1] and T[1 .. t-1]. Y’ has length y - 1 = Opt(s, t) - 1. This proves that Opt(s, t) - 1 ≤ Opt(s-1, t-1).

(2) If Y[y] = (u, v) with u ≠ s and v ≠ t then we can define Y* of length y + 1 by appending (s, t) to Y, thus contradicting the optimality of Y.

(3) If Y[y] = (u, t) with u ≠ s (respectively, Y[y] = (s, u) with u ≠ t) then S[u] = S[s] (resp. T[u] = T[t]), by definition of a common subsequence. We can therefore replace u with s (resp. with t) and conclude by (1).

Therefore, Opt(s, t) = Opt(s-1, t-1) + 1.

this completes the proof.

If S[s] ≠ T[t] then the LCS of S and T is the same as the LCS of S[1 .. s - 1] and T, or the same as the LCS of S and T[1 .. t - 1]. Therefore, Opt(s, t) = max {Opt(s-1, t), Opt(s, t-1)}.

Proof outline:

(1) If Y is a common subsequence of S[1 .. s-1] and T then Y is also a common subsequence of S and T, so Opt(s-1, t) ≤ Opt(s, t).

(2: Proof that LCSs of S and T are LCSs of S[1 .. s - 1] and T or S and T[1 .. t - 1]) Let X (of length x) be an LCS of S and T. The proof is trivial if x = 0, so let’s assume X is non-empty. Let u, v := X[x]. The hypothesis implies that (u, v) ≠ (s, t). Suppose u ≠ s (without loss of generality, a similar argument if v ≠ t yields the same result). Then X is a common subsequence of S[1 .. s - 1] and T. Therefore, x ≤ Opt(s-1, t).

It follows from (1) that Opt(s, t) = max{Opt(s-1, t), Opt(s, t-1)}, which proves that X is an LCS of [1 .. s-1] and T. This completes the proof.

We can summarise the last paragraphs with the following recurrence:

Writing a solution

Without using dynamic programming, calculating the length of the LCS of two sequences using the formula we derived earlier would require exponentially many steps in the worst case. This is because we would need to make two recursive calls, one to compute Opt(i-1, j) and another to compute Opt(i, j-1), until one of both indices reaches zero.

Since there are actually no more than s*t distinct values to compute (where s and t are the lengths of the sequences), dynamic programming allows us to solve this problem in polynomial time.

Here is the pseudo-code to calculate the LCS length of two sequences:

Every recursive call finds the longest common subsequence of S[1 .. i] and T[1 .. j]. The loop terminates when i = s and j = t, that is, when we’ve computed Opt(S, T). Note that indexing starts at 1.

Since the main loop runs in linear time, the algorithm runs in Θ(st) time and uses Θ(st) space.

For example, when the two input sequences are S = (1, 6, 3, 5, 10, 6, 8, 9) and T = (6, 10, 5, 8, 9), the algorithm builds the following matrix, row by row and then column by column:

There is one LCS. It has length 4 and corresponds to values (6, 5, 8, 9).

Constructing an LCS

Finding the maximum common subsequence length may not be very useful if we cannot explicitly calculate an LCS. As in the previous problems we studied, this can be done without altering the asymptotic runtime and space complexity. We will maintain an additional two-dimensional array A which keeps track of how we constructed the LCS. For all indices i, j, A[i, j] indicates whether the LCS of S[: i] and T[: j]:

  1. is constructed by adding (i, j) to the subsequence of S[1 .. i-1] and T[1 .. j-1] or
  2. is the same as the subsequence of S[1 .. i-1] and T[1 .. j] or
  3. is the same as the subsequence of S[1 .. i] and T[1 .. j-1].

To keep things visual, we set A[i, j] to ABOVE-LEFT in the first case, ABOVE in the second case and LEFT in the third case.

This gives the following implementation (modified lines are highlighted):

The only case where the element S[i] (or equivalently, T[t]) is part of the LCS is the first one, i.e. when A[i, j] = ABOVE-LEFT.

This enables us to construct an LCS from array A as follows:

--

--