Mathematics for Machine Learning

A Gentle Introduction to Continuous Optimization for Machine Learning

Yuki Shizuya · Intuition · May 13, 2024


This blog introduces the basics of continuous optimization: gradient descent for unconstrained optimization and Lagrange multipliers for constrained optimization, with visualizations.


As you delve into machine learning, you’ll encounter numerous optimization techniques for finding suitable parameters for your chosen model. In the machine learning context, we need to understand continuous optimization because we usually assume that our models are continuous functions. However, continuous optimization techniques can be complicated, and it is sometimes confusing which algorithm to use in a given situation. In this blog, I will introduce basic continuous optimization techniques, gradient descent and Lagrange multipliers, with visualizations, using this book [1] as a reference.

Table of Contents

  1. Overview of continuous optimization
  2. Unconstrained optimization technique — Gradient descent
  3. Constrained optimization technique — Lagrange multipliers

1. Overview of continuous optimization

First, why do we need optimization techniques in machine learning? One reason is that we generally cannot find the parameters analytically, yet we still want parameters that maximize or minimize an objective function, such as mean squared error, likelihood, or cross-entropy loss. Another reason is that analytic solutions, even when they exist, are computationally expensive, especially when you have a lot of data. Many continuous optimization techniques have been developed to overcome these issues.

Continuous optimization has two main categories: unconstrained and constrained optimization. The figure below provides a mind map of continuous optimization as referenced in [1].

Continuous optimization overview referenced from [1]

As you can see, gradient descent is the central concept in unconstrained optimization, while the Lagrange multiplier is the main concept in constrained optimization. These two ideas form the basis of many other applied continuous optimization methods.

Before we proceed with these topics, we need to understand local and global optima. Please look at the graph below.

The visualization of global and local optimum by the author

As its name suggests, the global optimum is the largest or smallest value of the given function. In contrast, a local optimum is not the overall largest or smallest value, but it is larger or smaller than all surrounding points. In general, continuous optimization techniques can only guarantee finding a local optimum, except when the function is convex or concave. If the function is convex or concave, a local optimum is also the global optimum because there is only one extremum.
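To make this concrete, here is a minimal sketch using a hypothetical one-dimensional function (my own example, not from [1]) that has both a local and a global minimum:

import numpy as np

# hypothetical example: f(x) = x^4 - 2x^2 + 0.3x has two minima
f = lambda x: x**4 - 2 * x**2 + 0.3 * x

# critical points are the roots of f'(x) = 4x^3 - 4x + 0.3
critical_points = np.roots([4.0, 0.0, -4.0, 0.3])
for x in sorted(critical_points.real):
    print(f'x = {x:.3f}, f(x) = {f(x):.3f}')

# the two outer critical points are minima: f is smaller at the left one
# (the global minimum), while the right one is only a local minimum;
# the middle critical point is a local maximum

Which minimum an optimization method finds can depend on where the search starts.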

How do we find these optima? We use the gradient, a vector that has a magnitude and points in the direction of steepest ascent. Intuitively, imagine we are on a hill and want to find the lowest point. The gradient works like a compass that always points in the steepest uphill direction, so we can reach the lowest point by repeatedly walking in the opposite direction of the gradient.

The visualization of gradient intuition by the author
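To check this compass intuition numerically, here is a minimal sketch (with a hypothetical bowl-shaped function, again my own example) showing that a central finite-difference approximation matches the analytic gradient, and that a small step against the gradient decreases the function value:

import numpy as np

def f(x):
    # hypothetical bowl-shaped example function
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    # analytic gradient of f
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, -2.0])
eps = 1e-6

# central finite-difference approximation of the gradient
numerical_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)
])

print(np.allclose(numerical_grad, grad_f(x)))  # True
print(f(x - 0.01 * grad_f(x)) < f(x))          # True: stepping downhill decreases f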

Now that we have all the tools to understand gradient descent and Lagrange multipliers, let’s dive into them.

2. Unconstrained optimization technique — Gradient descent

Unconstrained optimization is often needed in deep learning because the assumed functions are complex. Gradient descent (or one of its variants) is the most widely used algorithm there, so knowing its basic concept is essential to understanding deep learning. In this section, we will learn the basics of gradient descent.

Gradient descent is a convenient algorithm when the assumed model has no analytical solution for its optimal parameters. Intuitively, gradient descent gradually moves the parameters in the direction that improves the objective. Mathematically, we can describe gradient descent as follows:

$$x_{i+1} = x_i - \gamma \, \nabla f(x_i)$$

𝛾 is called the step size or learning rate. We need to tune the learning rate so that the iteration converges. From the formula, you can see that gradient descent gradually updates the parameters from their current values according to the gradient of the objective function with respect to the parameters. Isn’t it simple? Before moving to a two-dimensional example, let’s see how the learning rate affects convergence in one dimension.
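For instance, take the hypothetical warm-up function f(x) = x² (my own example, not from [1]); its gradient is f'(x) = 2x, so the update becomes x ← (1 − 2𝛾)x, and the choice of 𝛾 decides whether the iteration converges:

def gradient_descent_1d(x0, lr, n_iter=20):
    # minimize f(x) = x^2, whose gradient is f'(x) = 2x
    x = x0
    for _ in range(n_iter):
        x -= lr * 2 * x  # x <- x - lr * f'(x)
    return x

for lr in [0.1, 0.5, 1.1]:
    print(lr, gradient_descent_1d(3.0, lr))

# lr = 0.1 slowly converges toward the minimum at 0, lr = 0.5 reaches it in
# one step, and lr = 1.1 makes |x| grow every step, so the iteration diverges

With 𝛾 between 0 and 1, the iterates shrink toward the minimum at x = 0; beyond that, each step overshoots more than the last. Now, let’s see a concrete two-dimensional example. We assume the quadratic function below.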

$$f(x) = \frac{1}{2} x^\top \begin{pmatrix} 2 & 1 \\ 1 & 20 \end{pmatrix} x - \begin{pmatrix} 5 \\ 3 \end{pmatrix}^{\!\top} x, \qquad \nabla f(x) = x^\top \begin{pmatrix} 2 & 1 \\ 1 & 20 \end{pmatrix} - \begin{pmatrix} 5 \\ 3 \end{pmatrix}^{\!\top}$$

We need to set the initial parameters, which are usually chosen at random; here we choose (-3.0, -1.0). You can implement gradient descent in Python as follows:

import numpy as np
import matplotlib.pyplot as plt

from celluloid import Camera


def generate_contour_data(step: float = 0.1):
    # generate contour data for f(x) = x1^2 + x1*x2 + 10*x2^2 - 5*x1 - 3*x2
    x1 = np.arange(-5.0, 5.0, step)
    x2 = np.arange(-3.0, 3.0, step)

    X1, X2 = np.meshgrid(x1, x2)

    Z = X1**2 + X1 * X2 + 10 * X2**2 - 5 * X1 - 3 * X2

    return X1, X2, Z


# f(x) = 1/2 x^T B x - b^T x, so its gradient is x^T B - b^T
B = np.array([[2, 1], [1, 20]])
b = np.array([5.0, 3.0]).reshape(2, 1)
x = np.array([-3.0, -1.0]).reshape(2, 1)  # initial parameters

lr = 0.05  # learning rate (step size)
n_iter = 30

parameter_list = [x.reshape(2).copy()]

for i in range(n_iter):
    # gradient descent update: x <- x - lr * grad f(x)
    gradient = (x.T @ B) - b.T
    x -= lr * gradient.T

    parameter_list.append(x.reshape(2).copy())

contour_data = generate_contour_data()

# create an animation of the optimization trajectory
fig, ax = plt.subplots()
camera = Camera(fig)

for i, current_x in enumerate(parameter_list):
    ax.contour(*contour_data)
    ax.scatter(x=current_x[0], y=current_x[1], color='red')
    ax.annotate(f'iteration {i}', (0.05, 0.9), xycoords='axes fraction')
    ax.annotate(f'x0={current_x[0]:.2f}, x1={current_x[1]:.2f}', (0.05, 0.8), xycoords='axes fraction')

    camera.snap()

animation = camera.animate()
animation.save('output.mp4')

The contour plot shows the quadratic function’s surface. The x-axis refers to the parameter x0, and the y-axis to the parameter x1. The red dot is the result of gradient descent at each iteration. As you can see, the parameters converge to the minimizer of the quadratic function.
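Because this quadratic has the matrix form f(x) = ½xᵀBx − bᵀx, its minimizer solves Bx = b in closed form, so we can sanity-check the converged parameters (continuing from the variables in the code above):

# analytic minimizer of the quadratic: the solution of B x = b
x_star = np.linalg.solve(B, b)
print(x_star.reshape(2))     # approximately [2.49, 0.03]
print(parameter_list[-1])    # the final iterate should be close to x_star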

For simple functions, this basic gradient descent is enough to estimate the parameters. However, if the function is complicated, gradient descent often falls into a local minimum, which means the estimated parameters are not globally optimal. To overcome this problem, many gradient descent variants have been developed, such as stochastic gradient descent, gradient descent with momentum, and Adam. You can read more about these algorithms in these great posts [2][3].
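As a taste of these variants, here is a minimal sketch of the classic momentum update applied to the same quadratic as above (the update rule is the standard heavy-ball form; the hyperparameter values are illustrative):

import numpy as np

B = np.array([[2, 1], [1, 20]])
b = np.array([5.0, 3.0]).reshape(2, 1)

x = np.array([-3.0, -1.0]).reshape(2, 1)
velocity = np.zeros_like(x)
lr, momentum = 0.05, 0.9

for i in range(100):
    gradient = (x.T @ B) - b.T
    # momentum accumulates a decaying average of past gradients,
    # which damps oscillation along the steep x2 direction
    velocity = momentum * velocity - lr * gradient.T
    x += velocity

print(x.reshape(2))  # approaches the same minimizer as plain gradient descent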

3. Constrained optimization technique — Lagrange multipliers

As the name suggests, constrained optimization techniques are used when constraints exist. In machine learning, we encounter applications of Lagrange multipliers in Support Vector Machines, L1/L2 regularization, Gaussian Mixture Models, PCA, and more. Let’s dive in!

When we add constraints, expressed by real-valued functions gᵢ, we consider the constrained optimization problem as follows:

$$\min_{x} \ f(x) \quad \text{subject to} \quad g_i(x) \le 0 \ \ \text{for all } i = 1, \ldots, m$$

How do we solve it? One approach uses an indicator function 1(z), an infinite step function, to convert the constrained optimization into an unconstrained one:

$$J(x) = f(x) + \sum_{i=1}^{m} \mathbf{1}(g_i(x)), \qquad \mathbf{1}(z) = \begin{cases} 0 & z \le 0 \\ \infty & \text{otherwise} \end{cases}$$

This formulation gives an infinite penalty whenever a constraint is violated, so in principle it yields the correct answer. However, optimizing an infinite step function is difficult. Lagrange multipliers were developed to overcome this difficulty. The definition is as follows:

$$\mathcal{L}(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)$$

$$\mathcal{L}(x, \lambda) = f(x) + \lambda^\top g(x)$$

$$\nabla_{x, \lambda} \mathcal{L}(x, \lambda) = 0$$

𝛌 is called a Lagrange multiplier, and the second line shows the vector form of the first equation. The last line shows that we can find a local extremum when the parameters (x, 𝛌) satisfy the stationarity condition ∇𝓛 = 0.

How did mathematicians come up with this idea? Let’s look at a concrete example. We have a function f(x,y) to maximize under the constraint g(x,y) below.

$$\max_{x, y} \ f(x, y) = xy \quad \text{subject to} \quad g(x, y) = \frac{x^2}{8} + \frac{y^2}{2} - 1 = 0$$

The visualizations of these functions are below.

The visualization of the example problem (1)

The contour lines show f(x,y), and the red ellipse shows the constraint function g(x,y). Looking at the graph, f(x,y) attains its constrained maximum where one of its contour lines just touches the ellipse, which happens at f(x,y)=2. Under what circumstances do the two curves touch each other? Please take a look at the graph below.

The expanded version of the above contour plot

As you can see, the gradient of the constraint function g(x,y) is perpendicular to the constraint curve, and the gradient of f(x,y) is perpendicular to the contour lines of f(x,y). At a point where the two curves touch (not cross), they share the same tangent line, so the two gradients must be parallel or anti-parallel. We can describe this relationship mathematically as follows:

$$\nabla f(x, y) = -\lambda \nabla g(x, y) \quad \Longleftrightarrow \quad \nabla f(x, y) + \lambda \nabla g(x, y) = 0$$

I adjusted the sign of 𝛌 so that the second form matches the Lagrange multiplier definition: it is exactly the stationarity condition of the Lagrangian 𝓛 = f + 𝛌g. This is how mathematicians came up with the idea. Next, let’s construct the Lagrangian for this example.

$$\mathcal{L}(x, y, \lambda) = xy + \lambda \left( \frac{x^2}{8} + \frac{y^2}{2} - 1 \right)$$

We can derive three equations from the above equation.

$$\frac{\partial \mathcal{L}}{\partial x} = y + \frac{\lambda x}{4} = 0, \qquad \frac{\partial \mathcal{L}}{\partial y} = x + \lambda y = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = \frac{x^2}{8} + \frac{y^2}{2} - 1 = 0$$

We can compute the parameters by hand. From the second equation, x = −λy. Substituting this into the first equation gives y(1 − λ²/4) = 0. Since y = 0 would violate the constraint, we need λ = ±2. For λ = −2, we have x = 2y, and the constraint gives y² = 1, so (x, y) = (2, 1) or (−2, −1). For λ = 2, we have x = −2y, which gives (2, −1) or (−2, 1).

Now, we get the four extremal points of f(x,y) subject to g(x,y): (2, 1), (-2, -1), (2, -1), and (-2, 1). Let’s revisit the contour plot to confirm the result is correct.

Review the contour plot of the example problem (1)

As you can see, f(x,y) attains its extreme values at these points: the maximum f(x,y) = 2 at (2, 1) and (-2, -1), and the minimum f(x,y) = -2 at (2, -1) and (-2, 1). In this case, we could carry out the optimization analytically; how can we do the same thing in Python? We can use the sympy library!

from sympy import symbols, Eq, solve

x, y, lamb = symbols('x y lamb', real=True)

# solve the three gradient equations derived above
solve([
    Eq(y + lamb * x / 4, 0),
    Eq(x + lamb * y, 0),
    Eq(x**2 / 8 + y**2 / 2, 1),
], [x, y, lamb])

# answer
# [(-2, -1, -2), (2, 1, -2), (-2, 1, 2), (2, -1, 2)]

First, we need to define the variables; in this case x, y, and lambda, so we define three symbols. Then we pass the three gradient equations to the solve function. We quickly get the answer and can confirm that it matches our hand calculation.
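As one more sanity check, we can verify the geometric picture numerically: at the maximizer (2, 1) with λ = −2, the gradient of f(x,y) = xy is exactly −λ times the gradient of g(x,y) (a small sketch of my own):

import numpy as np

grad_f = lambda x, y: np.array([y, x])        # gradient of f(x, y) = xy
grad_g = lambda x, y: np.array([x / 4, y])    # gradient of g(x, y) = x^2/8 + y^2/2 - 1

# at (2, 1) with lambda = -2: grad f = -lambda * grad g = 2 * grad g
print(grad_f(2, 1))      # [1 2]
print(2 * grad_g(2, 1))  # [1. 2.]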

Let’s solve another example to get used to the concept. Assume we want to maximize the function below subject to two constraints.

$$\max \ f(x, y, z) = 4y - 2z \quad \text{subject to} \quad g_1(x, y, z) = 2x - y - z - 2 = 0, \quad g_2(x, y, z) = x^2 + y^2 - 1 = 0$$

Based on the above equations, we can construct the Lagrangian below.

$$\mathcal{L}(x, y, z, \lambda_1, \lambda_2) = 4y - 2z + \lambda_1 (2x - y - z - 2) + \lambda_2 (x^2 + y^2 - 1)$$

Setting the gradient of the Lagrangian to zero and solving the resulting five equations, we get the stationary points below:

$$(x, y, z) = \left(-\frac{2}{\sqrt{13}}, \frac{3}{\sqrt{13}}, -2 - \frac{7}{\sqrt{13}}\right), \quad \lambda_1 = -2, \quad \lambda_2 = -\sqrt{13}$$

$$(x, y, z) = \left(\frac{2}{\sqrt{13}}, -\frac{3}{\sqrt{13}}, -2 + \frac{7}{\sqrt{13}}\right), \quad \lambda_1 = -2, \quad \lambda_2 = \sqrt{13}$$

How about sympy’s result? You can confirm that we get the same answer as follows.

x, y, z, lamb1, lamb2 = symbols('x y z lamb1 lamb2', real=True)

# solve the gradient equations of the Lagrangian
solve([
    Eq(2 * lamb1 + 2 * lamb2 * x, 0),
    Eq(4 - lamb1 + 2 * lamb2 * y, 0),
    Eq(-2 - lamb1, 0),
    Eq(2 * x - y - z - 2, 0),
    Eq(x**2 + y**2 - 1, 0),
], [x, y, z, lamb1, lamb2])

# answer
# [(-2*sqrt(13)/13, 3*sqrt(13)/13, -2 - 7*sqrt(13)/13, -2, -sqrt(13)),
#  (2*sqrt(13)/13, -3*sqrt(13)/13, -2 + 7*sqrt(13)/13, -2, sqrt(13))]
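To see which stationary point is the maximum, we can substitute both into f(x, y, z) = 4y − 2z as reconstructed above (continuing from the sympy session):

from sympy import sqrt

f = lambda x, y, z: 4 * y - 2 * z

# the first stationary point attains the maximum 4 + 2*sqrt(13)
print(f(-2 * sqrt(13) / 13, 3 * sqrt(13) / 13, -2 - 7 * sqrt(13) / 13))
# the second attains the minimum 4 - 2*sqrt(13)
print(f(2 * sqrt(13) / 13, -3 * sqrt(13) / 13, -2 + 7 * sqrt(13) / 13))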

So far, we have reviewed Lagrange multipliers with equality constraints. Handling inequality constraints requires additional concepts, such as the KKT conditions, so I will skip them here. You can find information about Lagrange multipliers with inequality constraints in [5][6].

In this blog, I showed you the basics of continuous optimization, covering gradient descent for unconstrained optimization and Lagrange multipliers for constrained optimization. The appropriate optimization method depends on whether constraints are present. Thank you for reading!

References

[1] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, Mathematics for Machine Learning, Cambridge University Press

[2] Oppermann, A., Optimization in Deep Learning: AdaGrad, RMSProp, ADAM

[3] Jiang, L., A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam), Towards Data Science

[4] 4452 Mathematical Modeling, Lecture 4: Lagrange Multipliers, University of Minnesota Twin Cities lecture notes

[5] Tam, A., Lagrange Multiplier Approach with Inequality Constraints, Machine Learning Mastery

[6] Pandey, R., Lagrange multipliers with visualizations and code, Towards Data Science
