Constrained and Unconstrained Optimization, Theory and Implementations along with SVM

Avinash Kori
Intel Student Ambassadors
Feb 16, 2019 · 14 min read

Disclaimer: This is a very lengthy blog post that involves mathematical proofs and Python implementations for various optimization algorithms.

Optimization is one of the most interesting topics in the field of machine learning. Many of the problems we encounter in daily life are solved using numerical optimization methods. In this blog, let us look at some basic numerical optimization algorithms for finding the local optima of a given function (these work best with convex functions). We start with simple convex functions, where the local and global minima coincide, and move towards highly non-linear functions with multiple local and global minima. Optimization revolves around basic concepts from linear algebra and differential calculus. Recent advances in deep learning have created huge interest in numerical and stochastic optimization algorithms, in order to provide theoretical backing for the impressive qualitative results shown by deep networks. In such learning algorithms there is no explicitly known function to optimize; we only have access to the 0th and 1st order oracles. Oracles are black boxes that return the function value (0th order), the gradient (1st order), or the Hessian (2nd order) of the function at any given point. This blog provides a basic theoretical and numerical understanding of unconstrained and constrained optimization, along with Python implementations.

Necessary and sufficient conditions for a point to be a local minimum:

Let f(.) be a continuous and twice differentiable function. For a point x^* to be a local minimum it should satisfy the following conditions:

  • First order necessary condition: ∇f(x^*) = 0
  • Second order necessary condition: ∇²f(x^*) ⪰ 0, i.e. the Hessian is positive semi-definite
  • Second order sufficiency condition: ∇f(x^*) = 0 and ∇²f(x^*) ≻ 0, i.e. the Hessian is positive definite
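
As a quick numerical sanity check (not from the original post; the helper names here are mine), the sketch below verifies these conditions for f(x) = x² at x^* = 0 using finite-difference approximations of the gradient and curvature.

import numpy as np

def f(x):
    return x**2

def numerical_grad(f, x, h=1e-5):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def numerical_hess(f, x, h=1e-5):
    # central-difference approximation of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

x_star = 0.0
print(abs(numerical_grad(f, x_star)) < 1e-6)   # first order necessary: gradient ~ 0
print(numerical_hess(f, x_star) > 0)           # second order sufficiency: curvature > 0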

Gradient Descent:

Gradient descent is the backbone of all the advancements in learning algorithms (machine learning, deep learning, and deep reinforcement learning). In this section, we shall see various modifications of gradient descent for faster and better convergence. Let us consider the case of linear regression, where we estimate the coefficients of the equation:

y ≈ Xw

assuming the target to be a linear combination of all features (the feature matrix X has one row per sample, with a column of ones absorbed into it for the bias). The optimal set of coefficients is determined by minimizing the loss function:

L(w) = ‖y − Xw‖²

which turns out to be the maximum likelihood estimate for the task of linear regression. Minimizing the OLS (Ordinary Least Squares) loss in closed form involves inverting the feature (Gram) matrix:

w^* = (XᵀX)⁻¹ Xᵀ y

For practical problems the dimensionality of the data is huge, which easily blows up the computational cost. For example, consider the problem of image feature analysis: images are typically of size 1024x1024, which means the number of features is of the order of 10⁶. Because of the large number of features, such optimization problems are solved iteratively, and this leads us to the well-known gradient descent and Newton-Raphson methods.
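
To make the contrast concrete, here is a minimal sketch, on synthetic data with parameter names of my choosing, comparing the closed-form OLS solution against plain iterative gradient descent on the same loss.

import numpy as np

np.random.seed(0)
n, d = 1000, 20                      # small d, so the closed form is still tractable
X = np.random.randn(n, d)
w_true = np.random.randn(d)
y = X.dot(w_true) + 0.1 * np.random.randn(n)

# closed-form OLS: w = (X^T X)^{-1} X^T y  (an O(d^3) solve, impractical for huge d)
w_closed = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# iterative alternative: plain gradient descent on the squared loss
w = np.zeros(d)
eta = 0.1
for _ in range(1000):
    grad = 2 * X.T.dot(X.dot(w) - y) / n
    w -= eta * grad

print(np.allclose(w, w_closed, atol=1e-2))   # both routes reach the same coefficients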

Gradient Descent Algorithm:

The gradient descent algorithm updates the iterate (x) in the direction of the negative gradient (hence, the steepest descent direction) with a previously specified learning rate (eta), which is used to prevent overshooting the local minimum at any given iteration:

x_{t+1} = x_t − η ∇f(x_t)

The figure below shows the convergence of the gradient descent algorithm for the function f(x) = x² with eta = 0.25.

Finding an optimal eta is the task at hand; it requires prior understanding of the function and of the domain of operation.

# python implementation of vanilla gradient descent update rule
def gradient_descent_update(x, eta):
    """
    get_gradient is the 1st order oracle
    """
    return x - eta * get_gradient(x)
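
As a usage sketch for the rule above (the 0th/1st order oracles here are supplied by me for f(x) = x², matching the figure's setup with eta = 0.25):

import numpy as np

def get_value(x):              # 0th order oracle for f(x) = x^2
    return x**2

def get_gradient(x):           # 1st order oracle for f(x) = x^2
    return 2 * x

x = 3.0
for i in range(10):
    x = gradient_descent_update(x, eta=0.25)
print(x)                       # approaches the minimizer x* = 0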

Gradient Descent with Armijo Goldstein condition:
To reduce the work of setting eta manually, the Armijo-Goldstein (AG) condition is used to find an eta for the next iteration. The formal derivation of the AG condition requires knowledge of linear approximations, Lipschitz conditions, and basic calculus, which I won't be covering in this blog.

Let’s define two functions f1 and f2 as linear approximations of f along the descent direction, with two different coefficients alpha and beta:

f1(η) = f(x) − α η ‖∇f(x)‖²   and   f2(η) = f(x) − β η ‖∇f(x)‖²

In each iteration of the AG condition, a specific eta which satisfies the relation

f1(η) ≤ f(x − η ∇f(x)) ≤ f2(η)

is found, and the current iterate is updated accordingly.

The figure below shows the range of the next iterate for the convergence of the function f(x) = x² with alpha = 0.25 and beta = 0.5; the green region marks the range of the next possible iterate.

# python implementation of gradient descent with AG condition update rule
def gradient_descent_update_AG(x, alpha=0.5, beta=0.25):
    eta = 0.5
    max_eta = np.inf
    min_eta = 0.
    value = get_value(x)
    grad = get_gradient(x)
    while True:
        x_cand = x - eta * grad
        f = get_value(x_cand)
        f1 = value - eta * alpha * np.sum(np.abs(grad)**2)
        f2 = value - eta * beta * np.sum(np.abs(grad)**2)
        if f <= f2 and f >= f1:
            # eta satisfies the AG condition: accept the candidate iterate
            return x_cand
        if f <= f1:
            # step too conservative: grow eta
            if eta == max_eta:
                eta = np.min([2. * eta, eta + max_eta / 2.])
            else:
                eta = np.min([2. * eta, (eta + max_eta) / 2.])
        if f >= f2:
            # step too aggressive: shrink eta
            max_eta = eta
            eta = (eta + min_eta) / 2.

Gradient Descent with Full Relaxation condition:
In the case of the Full Relaxation (FR) condition, a new function g(eta) = f(x − eta ∇f(x)) is minimized with respect to eta, and the minimizing eta is then used to compute the next iterate.

This method involves solving a one-dimensional optimization problem for every iterate; at the minimizing eta the directional derivative vanishes:

g′(η) = −∇f(x)ᵀ ∇f(x − η ∇f(x)) = 0

# The python implementation for FR is shown below:
def gradient_descent_update_FR(x):
    eta = 0.5
    thresh = 1e-6
    max_eta = np.inf
    min_eta = 0.
    while True:
        x_cand = x - eta * get_gradient(x)
        # g_diff is g'(eta) = -grad(x) . grad(x - eta*grad(x))
        g_diff = -1. * get_gradient(x) * get_gradient(x_cand)
        if np.sum(np.abs(g_diff)) < thresh and eta > 0:
            return x_cand
        if np.sum(g_diff) < 0:
            # g(eta) is still decreasing: the 1-D minimizer lies further out, so grow eta
            if eta == max_eta:
                eta = np.min([2. * eta, eta + max_eta / 2.])
            else:
                eta = np.min([2. * eta, (eta + max_eta) / 2.])
        else:
            # overshot the 1-D minimizer: shrink eta
            max_eta = eta
            eta = (min_eta + eta) / 2.0

Stochastic Gradient Descent
As we know, in a real setup the dimensionality of the data can be quite huge, which makes gradient computation over all the samples expensive. In such cases a random batch of points is selected and the expected (mini-batch) gradient is computed. The whole setup then converges only in expectation.

Mathematically, this means randomly selecting a point (p) and estimating the gradient from it:

x_{t+1} = x_t − η ∇f_p(x_t) = x_t − η (∇f(x_t) + w_t),   where   w_t = ∇f_p(x_t) − ∇f(x_t)

In the above iterate, w_t can be treated as noise. The iterate leads towards a local optimum only when E(w_t) = 0, which holds here because p is sampled uniformly at random, so E_p[∇f_p(x_t)] = ∇f(x_t).

Similarly, it can be shown that the variance of w_t is finite. Together, these two facts guarantee the convergence of SGD in expectation.
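
As a quick empirical check of the E(w_t) = 0 claim, the following sketch (synthetic data and helper names are mine) averages many mini-batch gradients of a squared loss and compares the result with the full-batch gradient.

import numpy as np

np.random.seed(1)
n, d = 500, 5
X = np.random.randn(n, d)
y = X.dot(np.random.randn(d))
w = np.random.randn(d)

def full_gradient(w):
    return 2 * X.T.dot(X.dot(w) - y) / n

def minibatch_gradient(w, batch_size=10):
    idx = np.random.randint(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T.dot(Xb.dot(w) - yb) / batch_size

avg = np.mean([minibatch_gradient(w) for _ in range(20000)], axis=0)
# the noise w_t averages out, so the relative error is small
print(np.linalg.norm(avg - full_gradient(w)) / np.linalg.norm(full_gradient(w)))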

# SGD implementation in python
def SGD(self, X, Y, batch_size, thresh=1):
    loss = 100
    step = 0
    if self.plot: losses = []
    while loss >= thresh:
        # mini_batch formation
        index = np.random.randint(0, len(X), size=batch_size)
        trainX, trainY = np.array(X)[index], np.array(Y)[index]

        self.forward(trainX)
        loss = self.loss(trainY)
        self.gradients(trainY)
        # update
        self.w0 -= np.squeeze(self.alpha * self.grad_w0)
        self.weights -= np.squeeze(self.alpha * self.grad_w)

        if self.plot: losses.append(loss)
        if step % 1000 == 999:
            print("Batch number: {}".format(step) + " current loss: {}".format(loss))
        step += 1
    if self.plot: self.visualization(X, Y, losses)

AdaGrad
AdaGrad is an optimizer that automatically adapts the learning rate for each feature involved in the optimization problem, by keeping track of the history of all past gradients. This method also converges only in expectation.

def AdaGrad(self, X, Y, batch_size, thresh=0.5, epsilon=1e-6):
    loss = 100
    step = 0
    if self.plot: losses = []
    G = np.zeros((X.shape[1], X.shape[1]))
    G0 = 0
    while loss >= thresh:
        # mini_batch formation
        index = np.random.randint(0, len(X), size=batch_size)
        trainX, trainY = np.array(X)[index], np.array(Y)[index]
        self.forward(trainX)
        loss = self.loss(trainY)
        self.gradients(trainY)
        # accumulate squared gradient history
        G += self.grad_w.T * self.grad_w
        G0 += self.grad_w0**2
        # per-feature adaptive step sizes
        den = np.sqrt(np.diag(G) + epsilon)
        delta_w = self.alpha * self.grad_w / den
        delta_w0 = self.alpha * self.grad_w0 / np.sqrt(G0 + epsilon)
        # update parameters
        self.w0 -= np.squeeze(delta_w0)
        self.weights -= np.squeeze(delta_w)
        if self.plot: losses.append(loss)
        if step % 500 == 0:
            print("Batch number: {}".format(step) + " current loss: {}".format(loss))
        step += 1
    if self.plot: self.visualization(X, Y, losses)

Let us now move towards Hessian-based methods: the Newton and quasi-Newton methods.

Hessian-based methods are second-order optimization methods: geometrically, they use both gradient and curvature information to update the weights, and for this reason they converge much faster than purely gradient-based methods. The update rule for Newton's method is defined as:

x_{t+1} = x_t − η [∇²f(x_t)]⁻¹ ∇f(x_t)
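
To make the update concrete, here is a small sketch (the oracles are defined by me, not taken from the post's repository) showing that a single Newton step lands exactly on the minimizer of a convex quadratic.

import numpy as np

A = np.array([[3., 1.], [1., 2.]])     # positive definite, so f is convex
b = np.array([1., -1.])

def get_gradient(x):                   # 1st order oracle for f(x) = 0.5 x^T A x - b^T x
    return A.dot(x) - b

def get_hessian(x):                    # 2nd order oracle
    return A

x = np.array([5., 5.])
x = x - np.linalg.solve(get_hessian(x), get_gradient(x))   # one Newton update
print(np.allclose(get_gradient(x), 0))                      # True: a single step suffices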

The convergence of this algorithm is much faster than that of gradient-based methods. Mathematically, the convergence rate of gradient descent is proportional to O(1/t), while for Newton's method it is proportional to O(1/t²). But for high-dimensional data, evaluating the 2nd order oracle at every iterate is computationally expensive, so the 2nd order oracle is simulated using the 1st order oracle. This gives us the quasi-Newton class of algorithms; the most frequently used members of this class are the BFGS and L-BFGS algorithms. In this section we only discuss BFGS, which maintains a low-rank update of an approximation H to the inverse Hessian of the function, and iterates x_{t+1} = x_t − η H_t ∇f(x_t). The overall idea of the method is to initialize the Hessian approximation (randomly, or simply with the identity) and keep updating it with each iteration. Mathematically this is shown as:

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ,   with   s_k = x_{k+1} − x_k,  y_k = ∇f(x_{k+1}) − ∇f(x_k),  ρ_k = 1 / (y_kᵀ s_k)

# python implementation for BFGS
def BFGS_update(H, s, y):
    smooth = 1e-3
    s = np.expand_dims(s, axis=-1)
    y = np.expand_dims(y, axis=-1)
    rho = 1. / (s.T.dot(y) + smooth)
    p = np.eye(H.shape[0]) - rho * s.dot(y.T)
    return p.dot(H.dot(p.T)) + rho * s.dot(s.T)

def quasi_Newton(x0, H0, num_iter=100, eta=1):
    xo, H = x0, H0
    for _ in range(num_iter):
        xn = xo - eta * H.dot(get_gradient(xo))
        y = get_gradient(xn) - get_gradient(xo)
        s = xn - xo
        H = BFGS_update(H, s, y)
        xo = xn
    return xo
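
A possible usage sketch for the routines above, reusing the quadratic oracle (A, b, and get_gradient) from the Newton example and starting from an identity inverse-Hessian approximation with a deliberately small eta:

# example usage of the quasi-Newton routine above (a sketch, not from the post)
x_opt = quasi_Newton(np.array([5., 5.]), np.eye(2), num_iter=100, eta=0.1)
print(np.linalg.norm(get_gradient(x_opt)))   # gradient norm should be close to zero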

Constrained Optimization

Now it is time to discuss a few key concepts of constrained optimization, including problem formulation and solution strategies. This section also covers the theory and Python implementation of the algorithm known as the SVM (Support Vector Machine). In real-life problems, coming up with an ideal optimization function is quite hard and sometimes infeasible, so we relax the optimization problem by imposing additional constraints; optimizing this constrained setup gives an approximation that is as close as possible to the actual solution while remaining feasible. To solve constrained optimization problems, methods such as the Lagrangian formulation, penalty methods, projected gradient descent, and interior-point methods are used. In this section we'll go through the Lagrangian formulation and projected gradient descent. This section also covers the open-source optimization toolbox CVXOPT and an SVM implementation that uses it.

The general form of a constrained optimization problem is:

minimize f(x)   subject to   g(x) ≤ 0,  h(x) = 0

where f(x) is the objective function, and g(x) and h(x) are the inequality and equality constraints respectively. If f(x) is convex and the constraints form a convex set (i.e. g(x) is convex and h(x) is affine), the optimization is guaranteed to converge to a global minimum. For any other class of problems, it converges only to a local minimum.

Projected Gradient Descent

The preliminary (and most obvious) step in solving a constrained optimization problem is to project each iterate onto the constraint set, which turns out to be one of the most powerful approaches to constrained optimization. It involves two steps: (1) find the next candidate iterate by moving in the minimization (descent) direction, and (2) project that candidate onto the constraint set.

# projected gradient descent implementation
def projection_oracle_l2(w, l2_norm):
    # rescale w onto the l2 ball of radius l2_norm
    return (l2_norm / np.linalg.norm(w)) * w

def projection_oracle_l1(w, l1_norm):
    # first remember signs and store them. Modify w
    signs = np.sign(w)
    w = w * signs

    # project this modified w onto the simplex in the first orthant
    d = len(w)
    if np.sum(w) <= l1_norm:
        return w * signs
    for i in range(d):
        w_next = w + 0
        w_next[w > 1e-7] = w[w > 1e-7] - np.min(w[w > 1e-7])
        if np.sum(w_next) <= l1_norm:
            w = ((l1_norm - np.sum(w_next)) * w + (np.sum(w) - l1_norm) * w_next) / (np.sum(w) - np.sum(w_next))
            return w * signs
        else:
            w = w_next

def main():
    eta = 1.0 / smoothness_coeff
    for l1_norm in np.arange(0, 10, 0.5):
        w = np.random.randn(data_dim)
        for i in range(1000):
            w = w - eta * get_gradient(w)
            w = projection_oracle_l1(w, l1_norm)

Understanding Lagrangian formulation

In most optimization problems, finding the projection of an iterate onto the constraint set is itself a difficult problem (especially for complex constraint sets). It amounts to solving an optimization problem at every iterate, which in most cases is non-convex. In practice, people therefore try to get rid of the constraints by solving the dual problem rather than the primal problem.

Before diving into the Lagrangian primal and dual formulations, let's understand the KKT conditions and their significance.

  • For a point x^* to be a local/global minimum with equality constraints, there must exist a multiplier μ such that ∇f(x^*) + μ ∇h(x^*) = 0 (the sign of μ is unrestricted).
  • Similarly, in the case of inequality constraints, there must exist a multiplier λ ≥ 0 such that ∇f(x^*) + λ ∇g(x^*) = 0.

These two conditions can be observed easily by taking the unit circle as the constraint set. In the first case, consider only the boundary of the circle, where the sign of μ doesn't matter; this is the result of having an equality constraint. In the second case, consider the interior of the unit circle as well, where the negative sign associated with λ signifies the feasible region of solutions.
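
As a concrete toy example of the equality-constrained condition (the problem and helper names are my own), the sketch below minimizes f(x) = x1 + x2 on the unit circle by brute force and then checks that ∇f(x^*) is a scalar multiple of ∇h(x^*), i.e. the condition holds with an unrestricted-sign multiplier.

import numpy as np

f = lambda x: x[0] + x[1]
grad_f = lambda x: np.array([1., 1.])
grad_h = lambda x: 2. * x                     # h(x) = x1^2 + x2^2 - 1

# brute-force search over the boundary of the unit circle
thetas = np.linspace(0, 2 * np.pi, 100000)
points = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
x_star = points[np.argmin(points.sum(axis=1))]

mu = grad_f(x_star)[0] / grad_h(x_star)[0]    # candidate multiplier
print(x_star)                                  # approx (-0.707, -0.707)
print(np.allclose(grad_f(x_star), mu * grad_h(x_star), atol=1e-3))   # True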

KKT (Karush–Kuhn–Tucker) conditions are considered first-order necessary conditions, which a point should satisfy to be considered a stationary point (local minimum, local maximum, or saddle point). For x^* to be a local minimum, there must exist multipliers λ ≥ 0 and μ such that:

  • Stationarity: ∇f(x^*) + Σ_i λ_i ∇g_i(x^*) + Σ_j μ_j ∇h_j(x^*) = 0
  • Primal feasibility: g_i(x^*) ≤ 0 and h_j(x^*) = 0
  • Dual feasibility: λ_i ≥ 0
  • Complementary slackness: λ_i g_i(x^*) = 0

LICQ condition: the gradients of all active constraints,

{∇g_i(x^*) : g_i(x^*) = 0} together with {∇h_j(x^*)},

should be linearly independent of one another.

Lagrangian of a function

For any function f(x) with inequality constraints g_i(x) ≤ 0 and equality constraints h_i(x) = 0, the Lagrangian L(x, λ, μ) is defined as:

L(x, λ, μ) = f(x) + Σ_i λ_i g_i(x) + Σ_i μ_i h_i(x)

Optimization function:

min_x f(x)   subject to   g_i(x) ≤ 0,  h_i(x) = 0

The above optimization is equivalent to:

p^* = min_x max_{λ ≥ 0, μ} L(x, λ, μ)

since the inner maximization blows up to +∞ whenever a constraint is violated, forcing the outer minimization to remain feasible. The above formulation is known as the primal problem (p^*). By the claim from game theory that the second player always has a better chance in the optimization, it can easily be seen that:

d^* = max_{λ ≥ 0, μ} min_x L(x, λ, μ) ≤ p^*

This formulation is known as the dual problem (d^*).

Both primal and dual formulations lead to the same optimal value (strong duality) when the objective function is convex and the constraint set is convex (together with a mild constraint qualification such as Slater's condition); in general only weak duality, d^* ≤ p^*, is guaranteed.

SVM (Support Vector Machines)

SVMs belong to the supervised learning class of algorithms, used for classification and regression problems. SVMs scale well and can solve both linear and non-linear problems (by making use of the kernel trick). In many problems we can't draw a single line that perfectly separates two classes of data, so we allow a margin (and some slack) in the construction of the decision boundary, which is naturally formulated with an SVM. The idea of the SVM is to construct a separating hyperplane between two different sets of data points; once the separating hyperplane is obtained, it becomes relatively easy to classify a test data point into one of the classes. SVMs work well even for higher dimensional data. Compared with other ML models, SVMs are memory efficient, accurate, and fast. Let's look at the mathematics of the SVM.

SVM Primal Problem

min_{w, b, ξ} ½ ‖w‖² + C Σ_i ξ_i   subject to   y_i (wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

SVM Dual Problem using Lagrangian

max_λ Σ_i λ_i − ½ Σ_i Σ_j λ_i λ_j y_i y_j x_iᵀx_j   subject to   0 ≤ λ_i ≤ C,  Σ_i λ_i y_i = 0

Derivation of Dual

Setting the derivatives of the Lagrangian with respect to the primal variables to zero gives w = Σ_i λ_i y_i x_i (from ∂L/∂w = 0) and Σ_i λ_i y_i = 0 (from ∂L/∂b = 0); substituting these back into the Lagrangian eliminates w and b and yields the dual problem above.

CVXOPT

In this section, we'll discuss the implementation of the above SVM dual problem in Python using the CVXOPT library. You can find more about the library at http://cvxopt.org/userguide/intro.html.

General CVXOPT format for such problems:

minimize ½ xᵀPx + qᵀx   subject to   Gx ≤ h,  Ax = b

so solving the SVM dual amounts to mapping it onto the matrices P, q, G, h, A, and b.
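
Before the SVM-specific code, here is a tiny standalone example (a sketch of my own, not from the post) of mapping a simple QP onto this (P, q, G, h, A, b) format and calling solvers.qp.

# minimize 0.5*(x1^2 + x2^2) subject to x1 + x2 = 1 and x1, x2 >= 0
import numpy as np
from cvxopt import matrix, solvers

P = matrix(np.eye(2))
q = matrix(np.zeros(2))
G = matrix(-np.eye(2))          # -x <= 0 encodes x >= 0
h = matrix(np.zeros(2))
A = matrix(np.ones((1, 2)))
b = matrix(np.ones(1))

sol = solvers.qp(P, q, G, h, A, b)
print(np.array(sol['x']).flatten())   # approx [0.5, 0.5]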

# SVM using CVXOPT
import numpy as np
from cvxopt import matrix, solvers

def solve_SVM_dual_CVXOPT(x_train, y_train, x_test):
    """
    Solves the SVM training optimisation problem (the dual) using cvxopt.
    Arguments:
        x_train: A numpy array with shape (n, d), denoting n training samples in R^d.
        y_train: A numpy array with shape (n,). Each element takes the value +1 or -1.
        x_test:  A numpy array with shape (m, d), denoting m test samples in R^d.
    Limits:
        n < 200, d < 100, m < 1000
    Returns:
        y_pred_test: A numpy array with shape (m,). This is the result of running
        the learned model on the test instances x_test. Each element is +1 or -1.
    """
    n, d = x_train.shape
    c = 10  # let max \lambda value be 10
    y_train = np.expand_dims(y_train, axis=1) * 1.
    P = (y_train * x_train).dot((y_train * x_train).T)
    q = -1. * np.ones((n, 1))
    G = np.vstack((np.eye(n) * -1, np.eye(n)))
    h = np.hstack((np.zeros(n), np.ones(n) * c))
    A = y_train.reshape(1, -1)
    b = np.array([0.0])
    P = matrix(P); q = matrix(q)
    G = matrix(G); h = matrix(h)
    A = matrix(A); b = matrix(b)
    sol = solvers.qp(P, q, G, h, A, b)
    lambdas = np.array(sol['x'])
    w = ((y_train * lambdas).T.dot(x_train)).reshape(-1, 1)
    b = np.mean(y_train - np.dot(x_train, w))
    prediction = lambda x, w, b: np.sign(np.sum(w.T.dot(x)) + b)
    y_pred_test = np.array([prediction(x_, w, b) for x_ in x_test])
    return y_pred_test

if __name__ == "__main__":
    # Example format of input to the functions
    n = 100
    m = 100
    d = 10
    x_train = np.random.randn(n, d)
    x_test = np.random.randn(m, d)
    w = np.random.randn(d)
    y_train = np.sign(np.dot(x_train, w))
    y_test = np.sign(np.dot(x_test, w))
    y_pred_test = solve_SVM_dual_CVXOPT(x_train, y_train, x_test)
    check1 = np.sum(y_pred_test == y_test)
    print("Score: {}/{}".format(check1, len(y_test)))

For a deeper understanding, refer to:
Numerical Optimization (by Jorge Nocedal and Stephen J. Wright)
Nonlinear Programming (by Yurii Nesterov)

The entire code used in this post can be found at https://github.com/koriavinash1/Numerical-Optimization.
