Peek Into ML Optimization Techniques
The most used optimization techniques explained, plus code snippets and additional resources.
Let’s start by making a couple of clarifications. In Machine Learning, the goal is to arrive at a set of Parameters that models a given problem with reasonable accuracy. This is done via a training process in which those parameters are tested and corrected in a continuous forward and backward fashion. Next to the parameters, one finds a close relative: the Hyperparameters. Unlike the parameters, these are set before the training or learning process begins. Their objective is to make the process as efficient as possible, in terms of both speed and accuracy. Adjusting the Hyperparameters is what we call Optimization. This brings us to the very purpose of this post, where we will discuss some of the common techniques along with their mechanics, pros, and cons.
Feature Scaling
It is a method used to normalize the range of the independent variables, or features, of the data. The idea is to make the contribution of every feature relatively equal so that no single one disproportionately affects the loss function. In other words, all features’ values are brought to the same scale so that none of them weighs excessively on the optimization algorithm.
When to use it: when the algorithm is based upon minimizing a distance, such as k-nearest neighbors, Principal Component Analysis (PCA), or gradient descent. In other kinds of algorithms, such as tree-based models, Linear Discriminant Analysis (LDA), or Naive Bayes, it may not have any effect.
Next, a Python implementation of feature scaling.
def normalize(X, m, s):
"""
normalizes (standardizes) a matrix
:param X: numpy.ndarray of shape (d, nx) to normalize
d is the number of data points
nx is the number of features
:param m: numpy.ndarray of shape (nx,)
that contains the mean of all features of X
:param s: numpy.ndarray of shape (nx,)
that contains the standard deviation of all features of X
:return: The normalized X matrix
"""
Z = (X - m) / s
return Z
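To see it in action, here is a quick usage sketch (the toy data is invented for illustration):

```python
import numpy as np

# toy design matrix: 4 data points, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

m = X.mean(axis=0)  # per-feature mean, shape (2,)
s = X.std(axis=0)   # per-feature standard deviation, shape (2,)

Z = (X - m) / s     # the same operation normalize(X, m, s) performs

# every feature now has zero mean and unit variance
print(Z.mean(axis=0))
print(Z.std(axis=0))
```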
Batch normalization
Feature scaling made sense, didn’t it? The bad news is that its effect is lost once we go through the first activation layer in a deep neural network. The activation function works its magic and we are back to values in a wide-scale range and with different distributions. Wouldn’t it be nice to preserve the normalization we had on the features throughout the whole neural network? That is, in fact, what Batch normalization does.
Conceptually, the normalization happens after the activation, but in practice it has been shown that normalizing the input to the activation (the Z made of the weights and the previous activation) has the same effect with less computational effort. We do not necessarily want the values to have zero mean and unit variance; a different distribution could be beneficial. This has to do with the eventual region-wise, close-to-linear behavior of the activation function. That is the reason we calculate Z-tilde based upon two trainable parameters, gamma and beta. What they do is another scaling-offsetting operation that fixes the mean and variance, taking the activation into a region more beneficial for the overall training.
To get a good understanding of the mechanics of this method, great sources are the videos Normalizing Activations in a Network (C2W3L04) and Fitting Batch Norm Into Neural Networks (C2W3L05) from Course 2, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization, of the Deep Learning Specialization by deeplearning.ai, taught by Andrew Ng.
The key takeaway here is normalization between layers. Feature scaling applies only to the input; in batch normalization, normalization and a subsequent adjustment are also performed between layers.
Following is a Python implementation of the method. Note that in frameworks such as TensorFlow this reduces to a call to tf.nn.batch_normalization.
import numpy as np

def batch_norm(Z, gamma, beta, epsilon):
"""
normalizes an unactivated output of a neural network using
batch normalization
:param Z: numpy.ndarray of shape (m, n) that should be normalized
m is the number of data points
n is the number of features in Z
:param gamma: numpy.ndarray of shape (1, n)
containing the scales used for batch normalization
:param beta: numpy.ndarray of shape (1, n) containing
the offsets used for batch normalization
:param epsilon: small number used to avoid division by zero
:return: normalized Z matrix
"""
m = np.mean(Z, axis=0)
s = np.var(Z, axis=0)
# normalization step
Z_norm = (Z - m) / ((s + epsilon)**(1/2))
# introduction of trainable parameters gamma for scale and beta for offset
# allows to take advantage of a non strictly normalized distribution
# non zero mean (offset) and non one std (scale)
Z_tilde = gamma * Z_norm + beta
return Z_tilde
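A quick check of what gamma and beta accomplish (the numbers are made up for illustration): after the transformation, each column’s mean and standard deviation match the chosen offset and scale.

```python
import numpy as np

np.random.seed(0)
Z = np.random.randn(100, 3) * 5 + 10   # wide-range unactivated outputs
gamma = np.array([[1.0, 2.0, 0.5]])    # desired scales (trainable in practice)
beta = np.array([[0.0, 1.0, -1.0]])    # desired offsets (trainable in practice)

# the same computation batch_norm(Z, gamma, beta, 1e-8) performs
m = np.mean(Z, axis=0)
s = np.var(Z, axis=0)
Z_norm = (Z - m) / np.sqrt(s + 1e-8)
Z_tilde = gamma * Z_norm + beta

# the columns of Z_tilde now have mean ~beta and std ~gamma
print(np.round(Z_tilde.mean(axis=0), 3))
print(np.round(Z_tilde.std(axis=0), 3))
```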
Mini-batch gradient descent
Until now we have been doing what is called batch gradient descent: we take into account every single sample at our disposal to perform one step, or iteration, of the training, that is, to move a little step (the size of our learning rate) in the direction opposite to the gradient of our loss function. This approach, while relatively safe, is computationally very costly. What mini-batch gradient descent does is move with less data: instead of using, say, 10,000 samples to calculate one step, we use some smaller number of samples. This number is what we call the batch size.
So the calculation of the update operation will be less costly, since we are considering fewer samples. Mini-batch gradient descent lies between batch gradient descent, which takes all the available sample data at every iteration step, and stochastic gradient descent, which takes only one sample at a time. The fact that batch sizes are usually powers of two has to do with memory allocation for the calculation. When splitting the available samples into mini-batches, the last one may end up with a smaller batch size; it is up to you whether to use it or not. Also note that a new term comes into play, the epoch, which is the equivalent of a batch gradient descent iteration in the sense that all the available data was used, that is, we iterated through all the mini-batches.
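The splitting itself can be sketched as follows (the function name and structure are my own, not taken from any particular framework):

```python
import numpy as np

def create_mini_batches(X, Y, batch_size):
    """Shuffles (X, Y) jointly and splits them into mini-batches;
    the last batch may be smaller if the number of samples is not
    a multiple of batch_size."""
    m = X.shape[0]
    perm = np.random.permutation(m)   # shuffle so batches stay representative
    X_shuf, Y_shuf = X[perm], Y[perm]
    batches = []
    for start in range(0, m, batch_size):
        batches.append((X_shuf[start:start + batch_size],
                        Y_shuf[start:start + batch_size]))
    return batches

# 10 samples with a batch size of 4 yield batches of sizes 4, 4 and 2
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)
batches = create_mini_batches(X, Y, 4)
print([Xb.shape[0] for Xb, Yb in batches])  # [4, 4, 2]
```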
Gradient descent with momentum
While training the network, the convergence behavior may turn out to be relatively erratic. The search for the loss-minimizing parameters at each iteration, although generally right in the “long run”, may not necessarily point in the direction of the optimizing minimum. What this algorithm does is use an exponentially weighted average of the gradient. While fancy-sounding, it means considering not only the current gradient but also a contribution from the previously calculated gradients. The overall effect is a smoothed-out convergence that carries momentum, or impulse, from the previous iterations.
Find next a Python implementation of it.
def update_variables_momentum(alpha, beta1, var, grad, v):
"""
updates a variable using the gradient descent with
momentum optimization algorithm
:param alpha: learning rate
:param beta1: momentum weight
:param var: numpy.ndarray containing the variable to be updated
:param grad: numpy.ndarray containing the gradient of var
:param v: previous first moment of var
:return: updated variable and the new moment, respectively
"""
# Exponentially Weighted Averages
v = beta1 * v + (1 - beta1) * grad
# variable update
var = var - alpha * v
return var, v
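As a small illustration of the exponentially weighted average at the core of this update, here is a sketch with an invented noisy gradient sequence; the average settles near the underlying value while damping the step-to-step noise:

```python
import random

random.seed(42)
beta1 = 0.9
v = 0.0
for _ in range(100):
    grad = 1.0 + random.uniform(-0.5, 0.5)  # noisy gradients around 1.0
    v = beta1 * v + (1 - beta1) * grad      # exponentially weighted average

# v ends up close to 1.0 even though individual
# gradients fluctuate by up to +/- 0.5
print(round(v, 2))
```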
RMSProp optimization
This algorithm, which stands for Root Mean Square Propagation, has the particularity that it was never published in an academic paper. Instead, it was first proposed by Geoff Hinton in lecture 6 of the online course “Neural Networks for Machine Learning”.
It uses the gradient’s second moment, i.e., the uncentered variance, to smooth out the variability in the parameter updates, based on an exponentially weighted average, just as gradient descent with momentum does with the mean.
def update_variables_RMSProp(alpha, beta2, epsilon, var, grad, s):
"""
updates a variable using the RMSProp optimization algorithm
:param alpha: learning rate
:param beta2: RMSProp weight
:param epsilon: small number to avoid division by zero
:param var: numpy.ndarray containing the variable to be updated
:param grad: numpy.ndarray containing the gradient of var
:param s: previous second moment of var
:return: updated variable and the new moment, respectively
"""
# RMSProp
s = beta2 * s + (1 - beta2) * (grad ** 2)
# variable update
var = var - alpha * grad / (s ** (1/2) + epsilon)
return var, s
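To illustrate the effect, here is a toy run on an elongated quadratic, f(x, y) = 100x² + y², invented for this purpose. Dividing each step by the running root mean square of its gradient lets the steep x direction and the shallow y direction advance at comparable speeds:

```python
alpha, beta2, epsilon = 0.01, 0.9, 1e-8
x, y = 1.0, 1.0        # start away from the minimum at (0, 0)
sx, sy = 0.0, 0.0      # second moment accumulators
for _ in range(500):
    gx, gy = 200 * x, 2 * y                  # partial derivatives
    sx = beta2 * sx + (1 - beta2) * gx ** 2
    sy = beta2 * sy + (1 - beta2) * gy ** 2
    x -= alpha * gx / (sx ** 0.5 + epsilon)
    y -= alpha * gy / (sy ** 0.5 + epsilon)

# both coordinates end up close to the minimum
# despite the very different curvatures
print(round(x, 3), round(y, 3))
```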
Adam optimization
Adam stands for Adaptive Moment Estimation. At its core, it is the junction of Gradient descent with momentum and RMSProp optimization, adding a bias correction for both moment estimates. Following is a Python implementation of Adam.
def update_variables_Adam(alpha, beta1, beta2, epsilon, var, grad, v, s, t):
"""
updates a variable in place using the Adam optimization algorithm
:param alpha: learning rate
:param beta1: weight used for the first moment
:param beta2: weight used for the second moment
:param epsilon: small number to avoid division by zero
:param var: numpy.ndarray containing the variable to be updated
:param grad: numpy.ndarray containing the gradient of var
:param v: previous first moment of var
:param s: previous second moment of var
:param t: time step used for bias correction
:return: updated variable, the new first moment,
and the new second moment, respectively
"""
# Exponentially Weighted Averages (momentum)
v = beta1 * v + (1 - beta1) * grad
# RMSProp
s = beta2 * s + (1 - beta2) * (grad ** 2)
# bias correction
v_corrected = v / (1 - (beta1 ** t))
s_corrected = s / (1 - (beta2 ** t))
# variable update with ADAM (adaptive moment estimation)
var = var - alpha * v_corrected / (s_corrected ** (1/2) + epsilon)
return var, v, s
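To see the combined update in action, here is a toy run minimizing f(x) = x² with the update rule written inline; the hyperparameter values are commonly cited defaults, used here only for illustration:

```python
alpha, beta1, beta2, epsilon = 0.1, 0.9, 0.999, 1e-8
x, v, s = 5.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * x                              # gradient of f(x) = x**2
    v = beta1 * v + (1 - beta1) * grad        # first moment (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2   # second moment (RMSProp)
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    x = x - alpha * v_hat / (s_hat ** 0.5 + epsilon)

# x has moved from 5.0 into the close vicinity of the minimum at 0
print(round(x, 3))
```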
Learning rate decay
Intuitively, improvement while training is easier at the beginning: general characteristics, or features, present in most of the samples are easy for the network to pick up and learn. The speed at which the algorithm learns is dictated by the learning rate, the factor by which the parameters are updated. So at the beginning it is OK to use a higher speed. What learning rate decay proposes is to gradually reduce that speed as we train on more difficult features, that is, as we approach the loss-minimizing parameters. It is also a matter of convergence: once we are in the close vicinity of our optimum, we do not want to overshoot and go past it again and again.
def learning_rate_decay(alpha, decay_rate, global_step, decay_step):
"""
updates the learning rate using inverse time decay in numpy
:param alpha: original learning rate
:param decay_rate: weight used to determine
the rate at which alpha will decay
:param global_step: number of passes of gradient descent
that have elapsed
:param decay_step: number of passes of gradient descent that
should occur before alpha is decayed further
:return: updated value for alpha
"""
return alpha / (1 + decay_rate * (global_step // decay_step))
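A quick look at the schedule this produces (the function is repeated so the snippet stands alone; the numbers are arbitrary): with decay_rate 1 and decay_step 10, alpha is held constant within each 10-step window and then drops stepwise to 1/2, 1/3, 1/4, … of its original value.

```python
def learning_rate_decay(alpha, decay_rate, global_step, decay_step):
    return alpha / (1 + decay_rate * (global_step // decay_step))

# alpha stays flat inside each window and decays between windows
rates = [learning_rate_decay(0.1, 1.0, step, 10) for step in range(0, 40, 10)]
print(rates)  # [0.1, 0.05, 0.0333..., 0.025]
```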
Closing
We presented the most common optimization techniques for artificial neural networks. In retrospect, once their basic mechanisms are understood, they are intuitive and make good sense: we add operations to our computation that in turn represent effort savings. If you are interested in the implementation details, you may want to take a look at my GitHub repository here.
Remember, there are many ways to go down the hill!
Further reading
An overview of gradient descent optimization algorithms
A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size
Understanding RMSprop — faster neural network learning
Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning
The Author
My name is Santiago Vélez G. I built my first website at sixteen. I hosted it on Yahoo! GeoCities and it had profile pages of my school classmates. Then I took a long detour. It included an exchange year in Michigan, a bachelor’s in Chemical Engineering, a master’s in Process Engineering, and an MBA. I also spent some years abroad in Germany and Mexico. While at it, I had the chance to learn German, Italian, and French, to work in international plant engineering, and eventually to get certified as a Project Management Professional. Then I returned to Colombia and tried my luck as a solopreneur with a livestock auction mobile app. I moved on to work in mobility engineering, always on the project management and business development side. Two years ago, I decided to go all-in and back to software, and now Machine Learning. I did several online courses and joined a twenty-month intensive Software Engineering program with a focus on ML. Here I am today, about to begin the exploit phase of this new life. I am happy I boarded this super-fast, challenging, and exciting technology wagon and look forward to keeping on learning in the time to come.