Optimization

Duality theorems and their proofs

Part 1: Weak duality theorem

Khanh Nguyen
MTI Technology

--

  • To see the code I wrote for this project, you can check out its GitHub repo
  • For other parts of the project: part 1, part 2

Introduction

Find x₁ and x₂ to minimize f(x₁, x₂).

Optimization shows up everywhere in machine learning, from the ubiquitous gradient descent, to quadratic programming in SVM, to expectation-maximization algorithm in Gaussian mixture models.

However, one aspect of optimization that always puzzled me is duality: what on earth is a primal form and dual form of an optimization problem, and what good do they really serve?

Therefore, in this project, I will:

  • Go over the primal and dual forms for the most basic of optimization problems: linear programming.
  • Show that by solving one form of the optimization problem, we will have also solved the other one. This requires us to prove two fundamental duality theorems in linear programming: weak duality theorem and strong duality theorem. The former theorem will be proven in this part, while the latter will be proven in the next part of the project.
  • Explain why we should care about duality by showing its application to some data science problems.

Linear programming

Definition

All (constrained) optimization problems have three components:

1. Objective function: the function whose value we are trying to optimize, which can mean minimize or maximize depending on the problem. The value of the objective function will be called objective value from here on.

2. Decision variables: the variables in the objective function whose value will be fine-tuned to give the objective function its optimal value.

3. Constraints: additional equations or inequalities that the decision variables must conform to.

With these components, we can define linear programming as follows:

Linear programming is an optimization problem where the objective function and constraints are all linear functions of the decision variables.

This principle can be seen in the following formulation of a linear program:

maximize cᵀx
subject to Ax ≤ b, x ≥ 0

where

x: vector containing the decision variables

c: vector containing coefficients for each decision variable in the objective function. For simplicity, we will call these coefficients the objective coefficients from here on.

A: matrix in which each row contains the coefficients of one constraint

b: vector containing the limiting values of each constraint

Note that the vector inequalities in the above formula imply element-wise inequalities. For example, x ≥ 0 means every element of x must be greater than or equal to zero.
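As a quick numpy sketch of this element-wise convention:

```python
import numpy as np

x = np.array([0.5, 1.2])
# The LP constraint "x >= 0" means every element is non-negative;
# numpy compares element-wise, and all() reduces to a single truth value.
satisfies = bool(np.all(x >= 0))
print(satisfies)  # -> True
```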

Geometric interpretation of a linear program

Although the above formula of a linear program seems quite abstract, let’s see what it looks like using a simple example.

  • Suppose we only have 2 decision variables, x₁ and x₂. Therefore, our vector x is simply [x₁, x₂]
  • Furthermore, suppose that the matrix A and the vectors b & c take the following values:

As a result, our linear program is nothing but linear functions and inequalities of the decision variables x₁ and x₂:

These linear functions and inequalities can be visualized in the 2-D plot below. There are several important components within this plot, so let’s go through them one by one:

1. Objective value

The intensity of the blue color in the plot background shows how high the objective value is at every [x₁, x₂] point in the plot. For example:

  • At x₁ = -0.5 and x₂ = -0.5 in the lower left corner, the objective value of x₁+x₂ is a lowly -1. Thus, it is represented by a very light blue color.
  • In contrast, the upper right corner has x₁ = 2.5 and x₂ = 2.5. This translates to an objective value of 5, which explains why the background is deeply blue in the upper right corner.

2. Level curves

More interestingly, we see that the color gradient of the objective values is parallel to the coefficient vector c of [1, 1].

  • In fact, if you draw any line perpendicular to c, all points on this line will have the same objective value. We call these the level curves of the objective function. The plot above shows two such level curves, one whose objective value is 0, and the other one whose objective value is 1.
  • From these two lines, we see that the more we slide the level curve along the direction of c, the higher the objective value will be. This will come in handy later when solving this linear program.

3. Feasible region

The 4 constraints of our linear program have created a gray polygon near the origin.

  • In other words, all of the [x₁, x₂] points in this polygon satisfy all 4 linear constraints.
  • This explains why this region is often called the feasible region, since only the points within this region satisfy all constraints and can feasibly be the solution of the optimization problem.
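Checking whether a point belongs to the feasible region is easy to express in code. Since the post's actual A and b appear only in a figure, the values below are a hypothetical choice, consistent with the optimal solution reported later (x* = [2/7, 3/7]):

```python
import numpy as np

# Hypothetical data (the post's A and b are only shown in a figure),
# chosen to be consistent with the stated optimum x* = [2/7, 3/7].
A = np.array([[1.0, 4.0],
              [2.0, 1.0]])
b = np.array([2.0, 1.0])

def is_feasible(x):
    """A point is feasible iff it satisfies Ax <= b and x >= 0 element-wise."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(A @ x <= b + 1e-9) and np.all(x >= -1e-9))

print(is_feasible([2/7, 3/7]))  # the optimal vertex lies in the region
print(is_feasible([2.5, 2.5]))  # the upper-right corner violates both constraints
```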

Solving the linear program: graphical method

One question remains:

How can we find a point within the feasible region where the objective function is the highest?

There are several ways to solve a linear program, notably the simplex algorithm. However, for 2-D linear programs such as this one, we can simply use a graphical method to spot the optimal solution i.e. the point [x₁, x₂] with the highest objective value:

1. We slide the level curve along the direction of c until it exits the feasible region.

2. The last point in the feasible region that the level curve touches is the optimal solution. In rare cases, the level curve touches an entire line before exiting. This means all points on the line are optimal solutions to the linear program.

For our 2-D example:

  • As we slide the level curve along the direction of c, the level curve enters the feasible region at the origin, and exits at the point [2/7, 3/7]. Therefore, the point x₁ = 2/7, x₂ = 3/7 is the optimal solution. This corresponds to an optimal value of 2/7 + 3/7 = 5/7.
  • This makes sense, as all other points in the feasible region came into contact with the level curve before that point. As a result, their objective values are all less than that of the optimal solution (5/7).
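The graphical result can be cross-checked with a solver. Below is a sketch using scipy.optimize.linprog, which minimizes by convention, so we negate the objective coefficients to maximize. The A, b, c values are hypothetical (the post's actual numbers appear only in figures), chosen to reproduce the stated optimum:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data (the post's A, b, c are only shown in figures),
# chosen to reproduce the stated optimum x* = [2/7, 3/7].
c = np.array([1.0, 1.0])
A = np.array([[1.0, 4.0],
              [2.0, 1.0]])
b = np.array([2.0, 1.0])

# linprog minimizes, so maximize c@x by minimizing -c@x
res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)], method="highs")

x_opt = res.x         # approximately [2/7, 3/7]
max_value = -res.fun  # approximately 5/7
print(x_opt, max_value)
```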

Upper bound of objective function

Although this linear program is easy to solve with the graphical method, some more complex linear programs might take a long time to converge. In those cases, we need some way to know when we have converged to an optimal solution, so that we can stop the optimization without worry. One way to do so is to construct a theoretical upper bound of the objective function.

Intuitively speaking, if our objective function hits this upper bound during optimization, then we know that we have arrived at the optimal solution.

This is because the objective function cannot go higher than this upper bound, hence the upper bound itself is the highest possible value that the objective function can attain i.e. its maximum.

Upper bound of objective coefficients

We can construct this upper bound of the objective function by first constructing an upper bound of the objective coefficients c. Here’s one way to do so:

We find a weighted combination of the constraint equations such that the coefficients of the combined equation are all greater than or equal to the objective coefficients.

Or, in vector form:

We find a weighted combination of rows of A that is element-wise greater or equal to c.

This is equivalent to choosing a vector y such that Aᵀy ≥ c. Another important requirement: all the elements of y must be non-negative (y ≥ 0), as will be explained later.

Here’s an example of such a y that creates an upper bound of the objective coefficients c, which translates to an upper bound of the objective function:
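With the hypothetical A = [[1, 4], [2, 1]] used in this post's code sketches (the post's actual matrix is only shown in a figure), the weights y = [3/28, 4/7] reproduce the combined coefficients 1.25 and 1 mentioned below. A quick check that this y satisfies both requirements:

```python
import numpy as np

# Hypothetical A (the post's matrix is only shown in a figure); y is chosen so
# that the weighted combination A.T @ y matches the bound 1.25*x1 + x2 in the text.
A = np.array([[1.0, 4.0],
              [2.0, 1.0]])
c = np.array([1.0, 1.0])
y = np.array([3/28, 4/7])

combined = A.T @ y            # coefficients of the weighted combination of rows
print(combined)               # -> [1.25 1.  ]
print(np.all(combined >= c))  # each combined coefficient dominates c
print(np.all(y >= 0))         # the weights are non-negative
```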

Upper bound of objective function (with x)

Notice that in the above step, an upper bound on the objective coefficients translates to an upper bound on the objective function only if x₁ and x₂ are non-negative. To see why, try x₁ = -3 and x₂ = -4: the supposed upper bound of the objective function, 1.25 x₁ + x₂ = -7.75, becomes less than the objective function itself, x₁ + x₂ = -7.

Formally, this is the same as multiplying both sides of the c ≤ Aᵀy inequality by xᵀ. Since every element of x is non-negative, the direction of the inequality (less than or equal to) is preserved:

cᵀx ≤ (Aᵀy)ᵀx = yᵀAx

As a result, the upper bound of the objective function is yᵀAx.

Upper bound of objective function (without x)

However, this upper bound still involves x, which is what we are trying to optimize in the first place. One way to remove x from this expression is by exploiting the original constraint Ax ≤ b. Here, since all elements of y were chosen earlier to be non-negative, the inequality sign is maintained:

yᵀAx ≤ yᵀb = bᵀy

In short, combining all the above steps gives us the upper bound of the objective function cᵀx, which is just bᵀy.
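The full chain cᵀx ≤ yᵀAx ≤ bᵀy can be spot-checked numerically over feasible points, again using the hypothetical A, b, c, and y from this post's code sketches (the post's actual values appear only in figures):

```python
import numpy as np

# Hypothetical example data (the post's A, b, c appear only in figures)
A = np.array([[1.0, 4.0], [2.0, 1.0]])
b = np.array([2.0, 1.0])
c = np.array([1.0, 1.0])
y = np.array([3/28, 4/7])  # satisfies A.T @ y >= c and y >= 0

rng = np.random.default_rng(0)
checked = 0
for _ in range(1000):
    x = rng.uniform(0, 1, size=2)          # x >= 0 by construction
    if np.all(A @ x <= b):                 # keep only feasible points
        assert c @ x <= y @ A @ x + 1e-9   # holds because A.T @ y >= c and x >= 0
        assert y @ A @ x <= b @ y + 1e-9   # holds because A @ x <= b and y >= 0
        checked += 1
print(f"chain c@x <= y@A@x <= b@y held for all {checked} sampled feasible points")
```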

Minimizing the upper bound

For this theoretical upper bound to be any good in signaling that the optimization has converged, it must be tight against the objective function itself. Otherwise, one could think of an unrealistically high upper bound that the objective function can never reach, which means convergence is impossible to detect.

Therefore, the goal is to minimize the upper bound as much as possible, preferably toward the maximum value of the objective function itself:

This is done by minimizing bᵀy, subject to the requirements on y chosen earlier:

minimize bᵀy
subject to Aᵀy ≥ c, y ≥ 0

From the above formulation, we see that minimizing the upper bound is itself a linear program — its objective function is a linear function of the decision variables y and its constraints are linear inequalities involving y.

This is called the dual form of a linear program, while the original linear program is called the primal form. Lastly, the phenomenon in which an optimization problem has two analogous forms is called duality.
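Since the dual is itself a linear program, it can be handed to the same solver. A sketch with scipy.optimize.linprog, using the hypothetical A, b, c from this post's code sketches (the post's actual values appear only in figures), which reproduces the dual optimum y* = [1/7, 3/7] quoted later:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data (the post's A, b, c are only shown in figures),
# chosen to be consistent with the stated dual optimum y* = [1/7, 3/7].
A = np.array([[1.0, 4.0], [2.0, 1.0]])
b = np.array([2.0, 1.0])
c = np.array([1.0, 1.0])

# Dual: minimize b@y subject to A.T @ y >= c and y >= 0.
# linprog wants "<=" constraints, so flip the signs: -A.T @ y <= -c.
res = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None), (0, None)], method="highs")

print(res.x)    # dual optimum, approximately [1/7, 3/7]
print(res.fun)  # tightest upper bound, approximately 5/7
```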

Weak duality theorem

Now that we know what the primal and dual form of a linear program are, let’s move on to the first theorem that describes their relationship — the weak duality theorem. It states:

The primal objective function is always less than or equal to the dual objective function.

More formally, for any x in the primal feasible region and any y in the dual feasible region:

cᵀx ≤ bᵀy

In other words, the dual objective function bᵀy is always the upper bound of the primal objective function cᵀx.

Wait, is that it?!

Some of you might scratch your head and think “But isn’t cᵀx ≤ bᵀy what we just showed earlier?”. To be honest, the first time I learned about the weak duality theorem, I also found it quite anti-climactic.

However, this comes as no surprise, since the weak duality theorem is a direct consequence of how we chose y (requiring Aᵀy ≥ c and y ≥ 0) in order to create the upper bound bᵀy of the primal objective function cᵀx. In other words, the theorem is only true because we made it so.

Proof of weak duality theorem

In fact, the “proof” for weak duality theorem is exactly the same as our earlier construction of the upper bound for the primal objective function, which is summarized below:
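In symbols: for any primal-feasible x (Ax ≤ b, x ≥ 0) and dual-feasible y (Aᵀy ≥ c, y ≥ 0),

cᵀx ≤ (Aᵀy)ᵀx = yᵀAx ≤ yᵀb = bᵀy

The first inequality uses Aᵀy ≥ c together with x ≥ 0; the second uses Ax ≤ b together with y ≥ 0.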

Corollary of weak duality theorem

Despite the tautology-like nature of the weak duality theorem, it gives us a rather interesting corollary:

If we happen to get the same objective value for the primal and the dual, then we have arrived at the optimal solutions for both problems.

More formally:

cᵀx* = bᵀy* ⟹ x* and y* are the optimal solutions of the primal and dual, respectively

where x* and y* are points in the respective feasible regions of the primal and dual, such that the objective values of the two forms are equal at those points.

Proof of corollary

The proof of this corollary is actually quite intuitive. To prove that cᵀx* = bᵀy* implies x* is the optimal solution for the primal:

R: the real number line that includes the primal and dual objective values
  1. Suppose that you could find another x such that cᵀx is greater than cᵀx*. This means x* must not be the optimal solution of the primal form, since there is another x with an even higher objective value.
  2. This means that cᵀx is greater than bᵀy*, given cᵀx* = bᵀy*.
  3. However, this violates the weak duality theorem, which dictates that cᵀx ≤ bᵀy for any x and y in the respective primal and dual feasible regions.
  4. In other words, a more optimal x than x* is impossible to find. This means x* is indeed the optimal solution of the primal.

Intuitively, this makes sense, since weak duality theorem says that the dual is the upper bound of the primal. Therefore, when the primal is equal to the dual, we are certain that the primal has reached its maximum value possible.

The proof in the other direction — cᵀx* = bᵀy* implies y* is the optimal solution for the dual — is exactly the reverse of the above proof:

Notice that in the above two proofs:

1. We start out by negating the very claim that we are trying to prove: we claim that x* is not the optimal solution of the primal form, or y* is not the optimal solution of the dual form.

2. We then show that this negation leads to a contradiction: it violates the weak duality theorem.

3. This implies that the negated claim is false, which means the original claim must be true.

This proof technique is called proof by contradiction, and it is extremely powerful. In fact, it will be the core of the proof for strong duality theorem in the next part.

Application of corollary

This corollary of the weak duality theorem gives us one method to check if our optimization algorithm has converged.

Let’s return to our 2-D example to see how we can perform this check. Below are our primal form (left) and dual form (right) of the same linear program:

Background gradient: objective value. White vector: objective coefficients. Gray area: feasible region.

Now, let’s imagine that we don’t know how to solve linear programs at all, neither by the earlier graphical method nor by more sophisticated methods such as simplex.

  • However, one thing we can do — very naively — is to randomly pick two points in the respective primal and dual feasible regions and check to see if the primal is equal to the dual at those points.
  • If they are, those points are guaranteed to be the optimal solutions for their respective problems, as dictated by the corollary of the weak duality theorem. Then, we can stop optimizing.

For example, in our 2-D case:

  • We pick the point [x₁, x₂] = [2/7, 3/7] in the primal feasible region. The primal objective value at that point is x₁ + x₂ = 2/7 + 3/7 = 5/7.
  • Then, we pick the point [y₁, y₂] = [1/7, 3/7] in the dual feasible region. The dual objective value at that point is 2y₁ + y₂ = 2×1/7 + 3/7 = 5/7.
  • Since the value of the primal is equal to the dual at those 2 points, they are the optimal solutions to their respective problems. Namely, the optimal solution for the primal form is x₁ = 2/7 and x₂ = 3/7, while the optimal solution for the dual form is y₁ = 1/7 and y₂ = 3/7.
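The check in the bullets above amounts to evaluating two dot products; a minimal sketch using the numbers from the example:

```python
# Primal candidate and its objective value (the objective is x1 + x2)
x = (2/7, 3/7)
primal_value = x[0] + x[1]

# Dual candidate and its objective value (the dual objective is 2*y1 + y2)
y = (1/7, 3/7)
dual_value = 2 * y[0] + y[1]

# Equal objective values certify optimality of both candidates,
# by the corollary of the weak duality theorem
print(primal_value, dual_value, abs(primal_value - dual_value) < 1e-9)
```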

Double check with graphical method

To confirm that those two points are indeed the optimal solutions for their respective problems, we can optimize the primal and dual separately using the graphical method:

  • This method was demonstrated earlier to maximize the primal form (left), in which the point where the level curve exits the feasible region is the optimal solution.
  • In contrast, for the dual form (right), we are minimizing its objective function. Therefore, the optimal solution is the point where the level curve enters the feasible region as it slides along the direction of b. This is because any point the level curve encounters after that will have a higher objective value than the entering point.

A preview of strong duality theorem

The above example shows that we can optimize a linear program using the corollary of the weak duality theorem — by looking for points in the primal and dual form whose respective values of the objective function are equal.

However, this begs a bigger question:

Why?!

If solving one optimization problem is hard enough, why do we need to solve two problems to arrive at the optimal solution?

This is where the strong duality theorem comes in: it states that as long as you arrive at the optimal solution for one form of the linear program (primal or dual), you have also optimized the value of the other form. This will be the subject of the next part of the project.

References

There are many online resources on duality of linear programs. However, what I found the most understandable, and what I based most of the derivations in this blog post on, is this series of lectures on linear programming by Professor Tim Roughgarden for his algorithms course.

In particular, his lecture on the weak duality theorem is exceedingly clear, and emphasizes the fact that the theorem is true simply by construction: it is only true because we made it so.

Acknowledgement

This was a seminar at work done in collaboration with Phuc Su. Thank you Phuc for your contribution to the project, especially on the weak duality theorem! I’d also like to thank Dr. Hashimoto for assigning us to research on this interesting topic, as well as the AI team at MTI for listening to our presentation on the project 🙏
