Optimization

Duality theorems and their proofs

Part 2: Strong duality theorem

Khanh Nguyen
MTI Technology

--

  • To see the code I wrote for this project, you can check out its GitHub repo
  • For other parts of the project: part 1, part 2

Introduction

In part 1 of the project, I introduced the concept of duality for linear programming. Specifically, for a given linear program:

  • The primal form tries to maximize the objective function cᵀx, subject to the constraints Ax ≤ b and x ≥ 0
  • The dual form tries to minimize an upper bound of this objective function, namely bᵀy subject to Aᵀy ≥ c and y ≥ 0, which means it is also a linear program.

Weak duality theorem

Given those two forms of a linear program, the weak duality theorem states that:

The primal objective value is always less than or equal to the dual objective value, for any pair of feasible solutions of the two forms.

Corollary of weak duality theorem

This theorem gives rise to an interesting corollary that can help us detect when we’ve reached an optimal solution:

If we happen to get the same objective value for the primal and the dual, then we have arrived at the optimal solution for both problems.

However, this implies that we have to solve both the primal and dual problems at the same time and check whether their objective values are equal. This is quite tedious indeed.

Strong duality theorem

Therefore, in this part of the project, I will introduce and prove the strong duality theorem. This theorem allows us to solve only one form of the linear program (primal or dual), since:

If one form of the linear program is optimized, the other form is also optimized. Furthermore, the optimal values of both forms are equal.
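Before diving into the proof, here is a quick numerical sanity check of this claim. It is a minimal sketch using scipy's linprog, with a small hypothetical A, b, and c that I made up just for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Primal: maximize c^T x  subject to  A x <= b, x >= 0
# Dual:   minimize b^T y  subject to  A^T y >= c, y >= 0
A = np.array([[1.0, 3.0],
              [2.0, 1.0]])
b = np.array([6.0, 4.0])
c = np.array([3.0, 5.0])

# linprog minimizes, so the primal's objective is negated.
primal = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)
# The dual constraint A^T y >= c is rewritten as -A^T y <= -c.
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 2)

# Strong duality: the two optimal values coincide (11.6 for this example).
print(-primal.fun, dual.fun)
```

Both calls should report the same optimal value, exactly as the theorem promises.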

Components of strong duality theorem

There are quite a few components to this theorem, which I have fleshed out in the diagram below:

To prove that if one form is optimized, the other form is also optimized, our proof needs to go in both directions:

  • An optimal dual implies an optimal primal
  • An optimal primal implies an optimal dual

The latter proof is pretty much the reverse of the former. Therefore, in this blog post, I will only explain the first proof: an optimal dual implies an optimal primal.

Proving that an optimal dual implies an optimal primal

To prove this statement, I will first prove an intermediate statement:

If the dual is optimized at some point, there exists a point in the primal such that the objective values of the primal and dual are the same.

Intermediate statement when proving optimal dual → optimal primal (top left of strong duality theorem diagram)

There are 2 reasons for proving this intermediate statement:

  • We prove that the objective values of the primal and dual are equal when either form is optimized (in this case the dual). This is the second component of the strong duality theorem.
  • When the objective values of primal and dual are equal, then the primal is also optimized (see below). This is nothing but the corollary of the weak duality theorem, which was proven in part 1. Hence, this completes the proof that an optimal dual implies an optimal primal.

Proof of intermediate statement

To prove this intermediate statement, we use the powerful technique of proof by contradiction:

  1. First, we negate the very thing we are trying to prove. In this case, we claim that when the dual is optimized, there does not exist a point in the primal such that the primal objective value equals the dual objective value.
  2. Then, we will show that this claim leads to some terrible, illogical contradiction.
  3. This means the negated claim is false, or the original claim is true: there indeed exists a point in the primal such that the primal objective value equals the dual objective value.

Proof outline

It turns out that proving this negated statement leads to a contradiction is quite complicated. It builds upon another famous result called the Farkas lemma.

Therefore, in the parts that follow, I will break down this proof into 3 parts, each with a different proof technique:

1. Proof of Farkas lemma — proof by picture: instead of a rigorous proof, we will prove a simplified version of Farkas lemma using a picture. The geometry of the picture will intuitively show us that the Farkas lemma is true.

2. Proof of Farkas variant — proof by reduction: we will prove a variant of the Farkas lemma by reducing this variant to the original Farkas lemma. Since the original lemma itself is proven true in step 1, the Farkas variant will also be true.

3. Proof of intermediate statement — proof by contradiction: with the help of the Farkas variant, we will show that the negated version of the intermediate statement will lead to some contradiction. This implies that the intermediate statement itself is true.

Farkas lemma

We start with the proof of the Farkas lemma, which was published by the Hungarian mathematician Gyula Farkas in 1902. For a given matrix A and vector b, the lemma involves the following two statements:

  • Statement 1: there exists a vector x ≥ 0 such that Ax = b
  • Statement 2: there exists a vector y such that Aᵀy ≥ 0 and bᵀy < 0

The variables x, y, A, b here are not related to the variables in the linear program, even though they share the same names.

The lemma asserts that exactly one of these statements can be true at a given time: if statement 1 is true then statement 2 is false, and if statement 1 is false then statement 2 is true.
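To make the two statements concrete, below is a small sketch that decides which statement holds for a given A and b. It uses scipy's linprog as a feasibility checker; the box bounds on y are an assumption I added to keep the certificate search finite:

```python
import numpy as np
from scipy.optimize import linprog

def farkas(A, b):
    """Return (1, x) if some x >= 0 solves A x = b, else (2, y) with
    A^T y >= 0 and b^T y < 0 (the certificate promised by the lemma)."""
    m, n = A.shape
    # Statement 1: feasibility of A x = b, x >= 0 (zero objective).
    res = linprog(np.zeros(n), A_eq=A, b_eq=b, bounds=[(0, None)] * n)
    if res.success:
        return 1, res.x
    # Statement 2: minimize b^T y subject to A^T y >= 0, i.e. -A^T y <= 0.
    res = linprog(b, A_ub=-A.T, b_ub=np.zeros(n), bounds=[(-1, 1)] * m)
    return 2, res.x

A = np.eye(2)
print(farkas(A, np.array([1.0, 1.0])))    # statement 1: b is in the cone
print(farkas(A, np.array([-1.0, 1.0])))   # statement 2: a certificate y exists
```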

Proof of Farkas lemma — proof by picture

Rigorous proofs for the Farkas lemma are quite complex, and most involve either the hyperplane separation theorem or the Fourier–Motzkin elimination. However, a simplified version of the lemma, when expressed as a 2-D picture, can give us an intuitive “proof” of it.

Geometric interpretation of statement 1

For our simplified version, let’s assume that:

  • Our matrix A is only a 2×2 matrix. Hence, the columns of the matrix are the 2-D vectors a₁ and a₂.
  • Our vector x is only a length-2 vector, whose components are x₁ and x₂.
  • Consequently, statement 1 of Farkas lemma means that the vector b is some weighted combination of a₁ and a₂, with both weights (x₁ and x₂) non-negative.

Geometrically, this means b stays inside the cone bounded by the vectors a₁ and a₂ (when statement 1 of the lemma is true):
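As a numeric companion (with two hypothetical vectors a₁ and a₂ that I picked for illustration): when the 2×2 matrix A = [a₁ a₂] is invertible, b lies inside the cone exactly when the solved weights are non-negative:

```python
import numpy as np

a1, a2 = np.array([2.0, 1.0]), np.array([1.0, 2.0])
A = np.column_stack([a1, a2])   # columns are a1 and a2

for b in (np.array([3.0, 3.0]), np.array([-1.0, 2.0])):
    x = np.linalg.solve(A, b)   # weights such that x1*a1 + x2*a2 = b
    verdict = "inside the cone" if np.all(x >= 0) else "outside the cone"
    print(b, "->", x, "->", verdict)
```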

Geometric interpretation of statement 2

For statement 2, let’s assume that the vector y is also a length-2 vector.

Then, the above derivation means that when statement 2 is true, the dot products of a₁ and a₂ with y are non-negative, while the dot product of b with y is negative.

Geometrically, this means that a₁ and a₂ stay on the same half-plane as y (blue in the graph below), while b stays on the opposite half-plane (orange):
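Again in code (the same hypothetical a₁ and a₂ as above, plus a b outside their cone): a suitable y gives non-negative dot products with a₁ and a₂ but a negative one with b:

```python
import numpy as np

a1, a2 = np.array([2.0, 1.0]), np.array([1.0, 2.0])
b = np.array([-1.0, -1.0])    # lies outside the cone of a1 and a2
y = np.array([1.0, 1.0])      # a direction separating b from a1 and a2

# Signs match statement 2: 3.0 >= 0, 3.0 >= 0, and -2.0 < 0.
print(a1 @ y, a2 @ y, b @ y)
```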

Geometric interpretation of Farkas lemma

We can visualize the two statements together in the animation below:

  • On the left, statement 1 is true: b stays inside the cone bounded by the vectors a₁ and a₂. However, this arrangement means statement 2 is always false: no matter how much we rotate the vector y, we can never find one such that a₁ and a₂ stay on the same half-plane as y, while b stays on the opposite half-plane.
  • On the right, statement 2 is true: as we rotate vector y, we can find one such that a₁ and a₂ stay on the same half-plane as y, while b stays on the opposite half-plane. However, this means that b has to stay outside of the cone bounded by the vectors a₁ and a₂, i.e. statement 1 is false.

In short, this simple picture can help convince us that the Farkas lemma is true: if one of the two statements above is true, then the other is false. Conversely, if one of the two statements above is false, then the other must be true.

Proof of Farkas variant — proof by reduction

Now that we have “proven” the Farkas lemma, let’s move on to a slightly modified version of this lemma, which I call the Farkas variant for short. For comparison, here are the two versions side by side:

  • Farkas lemma: either there exists an x ≥ 0 such that Ax = b (statement 1), or there exists a y such that Aᵀy ≥ 0 and bᵀy < 0 (statement 2)
  • Farkas variant: either there exists an x ≥ 0 such that Ax ≤ b (statement 1), or there exists a y ≥ 0 such that Aᵀy ≥ 0 and bᵀy < 0 (statement 2)

We see that the Farkas variant is very similar to the original Farkas lemma, except for two small differences:

  • The condition in statement 1 is Ax ≤ b rather than Ax = b
  • There is an extra condition of y ≥ 0 in statement 2

Given the similarity between the two forms, it's no surprise that the proof of the Farkas variant is done by reducing the variant to the Farkas lemma itself. To start with, the goal of the proof is to show that statement 1 of the Farkas variant being true is equivalent to its statement 2 being false:

The above equivalence also covers the other direction: when statement 1 is false, statement 2 is true. This is because if statement 2 were also false in that case, then statement 1 would have to be true by the above equivalence. Since statement 1 cannot be both true and false at the same time, statement 2 must indeed be true. Astute readers will recognize that this is yet another proof by contradiction.

Step 1: Modify statement 1 of Farkas variant to that of Farkas lemma

To prove the equivalence between the above two statements, we start from the left-hand side: statement 1 of the Farkas variant is true. Next, we will transform it to a modified version of statement 1 of the original Farkas lemma:

1. Given the condition Ax ≤ b, we add a non-negative vector s (a vector whose elements are all greater than or equal to 0) to Ax such that the sum is exactly equal to b. The vector s is commonly called a slack variable, as it picks up the “slack”, a.k.a. the difference between Ax and b.

2. We multiply s by an identity matrix in front of it. This does not change the equation Ax + s = b by one bit, since multiplying a vector by the identity matrix still returns the vector itself.

3. However, the introduction of the identity matrix allows us to stack the matrices A and I together horizontally, and the vectors x and s together vertically. Since both x and s are non-negative vectors, their concatenated vector is also non-negative.

This stacking might look weird at first, but if we look more closely, we see that each element of b is still the sum of:

  • The dot product between a row of A and x
  • The dot product between a row of I and s

More generally, this is called block matrix multiplication: think of A, I, x, and s as blocks of larger matrices. As a result, the multiplication of these matrices can be written as multiplications of their smaller blocks (see the numeric sketch after step 4 below).

4. For simplicity, we name the concatenated matrix A’ and the concatenated vector x’. This allows us to transform statement 1 of the Farkas variant into statement 1 of the Farkas lemma, albeit with a slight modification: we now have A’ instead of A, and x’ instead of x.
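Here is the whole of step 1 as a numeric sketch (with a hypothetical A, x, and b): the stacked system A’x’ = b reproduces Ax + s = b exactly:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
x = np.array([1.0, 1.0])
b = np.array([5.0, 6.0])

s = b - A @ x                 # the slack: non-negative because A @ x <= b
A_prime = np.hstack([A, np.eye(2)])    # stack A and I horizontally
x_prime = np.concatenate([x, s])       # stack x and s vertically

print(A_prime @ x_prime)      # [5. 6.], i.e. exactly b
print(np.all(x_prime >= 0))   # True: x' is a valid witness for statement 1
```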

Step 2: Apply Farkas lemma

Since statement 1 of the Farkas lemma is true, its statement 2 must be false: there does not exist a y such that A’ᵀy ≥ 0 and bᵀy < 0. Notice that since A’ appears in statement 1 instead of A, it must also appear in statement 2.

Step 3: Modify statement 2 of Farkas lemma to that of Farkas variant

Knowing statement 2 of the Farkas lemma is false, the only thing left to do is to modify it to become statement 2 of the Farkas variant. This turns out to be straightforward, since:

  • We already have one condition for statement 2 of the variant: bᵀy < 0
  • For A’ᵀy ≥ 0, we can write out the components of A’ᵀ: the matrix Aᵀ stacked on top of the identity matrix I*. This is because A’ is composed of the matrix A and I stacked together horizontally. Its transpose, then, is the transpose of these two matrices stacked together vertically.

* Technically, the bottom matrix should be Iᵀ, but Iᵀ is the same as I.

With A’ᵀ written this way, it is clear that A’ᵀy ≥ 0 is the same as requiring both Aᵀy ≥ 0 and y ≥ 0. These two conditions, along with the existing condition bᵀy < 0, mean that statement 2 of the Farkas variant is false: there does not exist a y such that Aᵀy ≥ 0, y ≥ 0 and bᵀy < 0.
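A quick numpy check of this transpose decomposition (same hypothetical A as before):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
A_prime = np.hstack([A, np.eye(2)])

# Transposing the horizontal stack [A | I] yields A^T stacked on top of I,
# so A'^T y >= 0 encodes both A^T y >= 0 and y >= 0 at once.
top, bottom = A_prime.T[:2], A_prime.T[2:]
print(np.allclose(top, A.T), np.allclose(bottom, np.eye(2)))   # True True
```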

Overall proof

The combination of the above three steps can prove that when statement 1 of the Farkas variant is true (top left in image below), that’s equivalent to its statement 2 being false (top right).

This is because:

  • Statement 1 of the Farkas variant is true implies statement 1 of the Farkas lemma is also true (step 1)
  • Statement 1 of the Farkas lemma is true implies statement 2 of the Farkas lemma is false (step 2)
  • Statement 2 of the Farkas lemma is false implies statement 2 of the Farkas variant is also false (step 3)

Finally, the reason why this proof is called a proof by reduction is because we have reduced the Farkas variant to the original Farkas lemma (via steps 1 and 3). Then, since the Farkas lemma has been proven true (step 2), the Farkas variant must also be true.
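The reduction can even be written as one line of code: statement 1 of the variant for (A, b) is exactly statement 1 of the lemma for ([A | I], b), so the farkas() sketch from earlier also decides the variant:

```python
import numpy as np

# Reuses the farkas() function from the sketch above.
def farkas_variant(A, b):
    m, _ = A.shape
    return farkas(np.hstack([A, np.eye(m)]), b)

A = np.eye(2)
print(farkas_variant(A, np.array([1.0, 1.0])))     # statement 1 holds
print(farkas_variant(A, np.array([-1.0, -1.0])))   # statement 2: y >= 0 certificate
```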

Proof of intermediate statement — proof by contradiction

With the Farkas variant proven, let’s move on to the last and most important step of the proof of the strong duality theorem. Here, we must prove its intermediate statement (copied below as a reminder): if the dual is optimized at some point y*, there exists a point x* in the primal such that cᵀx* = bᵀy*.

Modification of intermediate statement

For the proof of the intermediate statement to work, we must modify the statement somewhat.

1. First, for convenience, we will let d* be equal to the dual optimal objective value bᵀy*, where y* is the corresponding optimal solution for the dual.

2. Then, the end result of the statement can be rewritten as: there exists an x* in the primal form such that cᵀx* = d*, since d* = bᵀy*.

3. We can expand on this statement even more by noticing that since x* belongs in the primal form, it must satisfy the primal constraints. In other words, there exists an x* such that cᵀx* = d*, Ax* ≤ b and x* ≥ 0.

Summary of first 3 steps

4. Here, I will do something very strange: I’ll modify one condition for x* in step 3 such that cᵀx* = d* becomes cᵀx* ≥ d*.

Although this seems to defy the laws of mathematics, the reason I am allowed to do this is the weak duality theorem:

  • We know that cᵀx* > d* is impossible, since that would imply cᵀx* > bᵀy*. However, the weak duality theorem dictates that cᵀx ≤ bᵀy for any feasible x and y.
  • Therefore, writing cᵀx* ≥ d* is essentially the same as writing cᵀx* = d*, since we know that the scenario of cᵀx* > d* can never happen.

5. However, by writing cᵀx* = d* as the inequality cᵀx* ≥ d*, I can reverse the sign of this inequality by negating both sides: -cᵀx* ≤ -d*.

6. This allows me to combine it with the existing inequality Ax* ≤ b to:

  • Combine A and -cᵀ into a larger A’ matrix, by appending -cᵀ as an extra row below A
  • Combine b and -d* into a larger b’ vector, by appending -d* below b

7. Finally, all of the above modifications mean the intermediate statement can be restated as: if the dual is optimized with objective value d*, there exists an x* such that A’x* ≤ b’ and x* ≥ 0.
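In code, steps 6 and 7 look like this (reusing the hypothetical LP from the sanity check at the top of this post, whose dual optimal value is d* = 11.6 and whose primal optimum is x* = (1.2, 1.6)):

```python
import numpy as np

A = np.array([[1.0, 3.0], [2.0, 1.0]])
b = np.array([6.0, 4.0])
c = np.array([3.0, 5.0])
d_star = 11.6                 # dual optimal value of this example LP

A_prime = np.vstack([A, -c])              # append -c^T as an extra row of A
b_prime = np.concatenate([b, [-d_star]])  # append -d* below b

# A' x <= b' encodes both A x <= b and c^T x >= d* at the same time.
x_star = np.array([1.2, 1.6])             # primal optimum of this example LP
print(np.all(A_prime @ x_star <= b_prime + 1e-9))   # True
```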

Negating the intermediate statement

With the intermediate statement rewritten, we can finally start the proof by contradiction. This is done by negating the very thing we are trying to prove: there does not exist an x* such that A’x* ≤ b’ and x* ≥ 0. Our goal, then, is to show that this leads to an illogical contradiction, which will affirm that such an x* indeed exists.

Notice that this negated statement is nothing but saying that statement 1 of the Farkas variant is false (with A’ and b’ playing the roles of A and b). Then, statement 2 of the Farkas variant must be true, as we have proven above.

Therefore, our next step is to spot the contradiction that underlies statement 2 of the Farkas variant in this case.

Spotting the contradiction

To figure out the contradiction underneath this statement, let’s first:

  • Break the vector y into a smaller vector yₐ and a scalar y꜀, matching the blocks of A’ and b’.
  • This transforms the inequalities in statement 2, which previously involved y, into inequalities involving yₐ and the scalar y꜀: A’ᵀy ≥ 0 becomes Aᵀyₐ ≥ y꜀c, and b’ᵀy < 0 becomes bᵀyₐ < y꜀d*.
  • Dividing both sides of these inequalities by y꜀ and rearranging terms leads to inequalities involving yₐ/y꜀. Note that y꜀ is non-negative, since it’s a part of the non-negative vector y. Assuming for now that y꜀ is strictly positive (the case y꜀ = 0 is handled separately below), all the inequality signs remain the same when we divide by y꜀.
Transform inequalities of y into inequalities involving yₐ/y꜀

For convenience, let y’ be the vector that stands for yₐ/y꜀. This means statement 2 of the Farkas variant can be rewritten as: there exists a y’ such that Aᵀy’ ≥ c, y’ ≥ 0, and bᵀy’ < d*.

Notice that the first two requirements for y’ simply mean that y’ satisfies the constraints of the dual form, while the last requirement means that bᵀy’ < bᵀy*, since d* is the dual objective value at y*.

However, this implies that there exists a y’ in the dual form whose objective value bᵀy’ is even lower than that of y*. Therefore, y* is no longer the optimal solution of the dual form, even though we assumed it to be in the first place! This is precisely the contradiction that arises when we negate the intermediate statement of the strong duality theorem.

As a result, the intermediate statement must be true:

if y* is indeed the optimal solution of the dual form, then there exists an x* in the primal form such that cᵀx* = bᵀy*.

When y꜀ = 0

One small caveat: in the proof by contradiction above, we divided the inequalities by y꜀ to transform them into more useful forms. Therefore, we need to come up with an alternative proof to deal with the case where y꜀ = 0.

Let’s go back to the step where we divided by y꜀: even though we can’t divide by y꜀ anymore, it being zero simplifies the inequalities of statement 2 of the Farkas variant into inequalities involving only yₐ, namely Aᵀyₐ ≥ 0, yₐ ≥ 0, and bᵀyₐ < 0.

Next, instead of constructing y’ = yₐ/y꜀ as in the case when y꜀ > 0, we let y’ = y* + yₐ. This allows us to transform the inequalities involving yₐ into inequalities involving y’, by combining the conditions on yₐ above with the constraints of the dual form that y* satisfies: Aᵀy’ = Aᵀy* + Aᵀyₐ ≥ c, y’ = y* + yₐ ≥ 0, and bᵀy’ = bᵀy* + bᵀyₐ < bᵀy*.

These inequalities involving y’ are once again used to demonstrate the contradiction when we negate the intermediate statement: there exists a y’ in the dual form whose objective value bᵀy’ is even lower than that of y*. In other words, y* is no longer the optimal solution of the dual form even though we initially assumed it was. Therefore, the intermediate statement itself must be true.

Summary

Proof summary

To summarize the proof of the intermediate statement:

1. First, we start by negating the very statement we’re trying to prove. In other words, if y* is the optimal solution for the dual form, then there does not exist an x* in the primal form such that cᵀx* = bᵀy*.

2. Then, we modify the negated statement such that it is equivalent to statement 1 of the Farkas variant being false. The Farkas variant is proven true by reduction from the Farkas lemma, which itself is “proven” by picture. As a result, statement 2 of the Farkas variant must be true.

3. However, this statement 2 implies a contradiction: y* is no longer the optimal solution of the dual form even though we assumed it was. This contradiction holds regardless of whether y꜀ is positive or zero.

The above proof by contradiction implies that the intermediate statement is true: if y* is the optimal solution for the dual form, then there exists an x* in the primal form such that cᵀx* = bᵀy*.

Combine intermediate statement with corollary of weak duality

This is then combined with the corollary of the weak duality theorem to imply that x* itself is also the optimal solution of the primal form. In other words, an optimal dual implies an optimal primal. The proof in the other direction — an optimal primal implies an optimal dual — follows pretty much the same strategy as the one I have shown above.

This means we have finally proven the strong duality theorem: if one form of the linear program is optimized, the other form is also optimized. Furthermore, the optimal values of the two forms, cᵀx* and bᵀy*, are equal.

Strong duality theorem

Comparison to weak duality theorem

When we compare the strong duality theorem with the weak duality theorem (proven in part 1), we can see where it gets its name:

  • The corollary of the weak duality theorem only claims that when the objective values of the primal and dual forms are equal, both forms are optimized.
  • However, the strong duality theorem goes one step further: it claims that as long as one form is optimized, then the other form is also optimized, and their objective functions are equal at the respective optimal solutions.
  • In other words, all 3 statements in the theorem are equivalent to one another: as long as one statement is true, the other two statements must also be true.

Application of strong duality theorem

The fact that an optimized primal form implies an optimized dual form, and vice versa, means that we can pick and choose which form of a linear program to optimize.

This is because for some problems, solving one form might be more convenient than solving the other. For example, the simplex algorithm for solving linear programs has its own primal and dual versions, each of which is suited to different linear programs.

For an example that is more data-science friendly, support vector machines (SVMs) also have their own primal and dual objective functions.

  • At first glance, the primal form of SVM seems much more intuitive: it seeks to maximize the margin between the separation boundary and the nearest data point, while minimizing margin violations, i.e. data points that enter the wrong side of the margin.
  • In contrast, the dual form of SVM is not only less intuitive, but its objective function also involves the dot product between every pair of samples in our training data. Hence, it does not scale well when we have many data points in our training set.
Primal and dual objective functions of SVM (reference)
  • However, the saving grace of the dual form is when the number of features starts to grow, such as when we transform the original features (such as x₁ and x₂) into a polynomial expansion of those features (such as x₁², x₂², or x₁x₂). In that case, the dot product of every pair of samples in the transformed feature space can be expressed as a function of their simple dot product in the original feature space of x₁ and x₂. This is also known as the “kernel trick”.
Kernel trick of polynomial expansion of degree d (reference)
  • This trick helps tremendously with training SVMs on these transformed features, even when there are infinitely many of them (such as SVM with the radial basis function kernel)! However, since SVM is not technically a linear program, the result that ensures the equivalence between its primal and dual forms is no longer the strong duality theorem, but rather Slater’s condition.
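To see this trade-off in action, here is a small scikit-learn sketch (the toy dataset from make_moons and the parameters are just for illustration). SVC solves the SVM dual under the hood, which is what makes swapping in kernels possible:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy two-class data that is not linearly separable.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit infinite features

# The RBF kernel should fit the moons much better than the linear one.
print(linear.score(X, y), rbf.score(X, y))
```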

References

  • The two main resources that I found the most useful for understanding the proof of the strong duality theorem are the video lectures on this topic by professor Ankur Moitra of MIT, and the 2006 book “Understanding and Using Linear Programming” by Matoušek and Gärtner. My derivation of the proof is an adaptation of these two resources, with some slight modifications to make the proof even simpler.
  • For more details on block matrix multiplication, please watch this linear algebra lecture by the famous Gilbert Strang from MIT.
  • For more details on the dual form of SVM and Slater’s condition, you can watch this lecture from the machine learning course offered by Bloomberg. Warning: this lecture is much more mathematically intensive than I’m comfortable with, so if there’s any error in my interpretation of SVM and its duality, please feel free to leave me a message at this blog post 🤗
