Peek Into ML Optimization Techniques

The most used optimization techniques explained, plus code snippets and additional resources.

Santiago Velez Garcia
The Startup
9 min read · May 24, 2020


Let’s start by making a couple of clarifications. In Machine Learning, the goal is to arrive at a set of Parameters that model a given problem with reasonable accuracy. This is done via a training process where the mentioned Parameters are tested and corrected in a continuous forward and backward fashion. Next to the Parameters, one finds a close relative: the Hyperparameters. Unlike the Parameters, these are set before the training or learning process begins. Their objective is to make the process as efficient as possible, in terms of speed and accuracy. Adjusting the Hyperparameters is what we call Optimization. This brings us to the very purpose of this post, where we will discuss some of the common techniques along with their mechanics, pros, and cons.

https://www.publicdomainpictures.net/en/view-image.php?image=5016&picture=skiing-downhill

Feature Scaling

It is a method used to normalize the range of independent variables or features of the data. The idea is to make the value range of every feature’s contribution relatively equal so that no single feature disproportionately affects the loss function. In other words, all features’ values are brought to the same scale so that they do not weigh excessively on the optimization algorithm.

When to use it: when the algorithm is based upon minimizing a distance, such as k-nearest neighbors, Principal Component Analysis (PCA), or gradient descent. In other kinds of algorithms, such as tree-based models, Linear Discriminant Analysis (LDA), or Naive Bayes, it may not have any effect.

Next, a Python implementation of feature scaling.
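Below is a minimal sketch of one common approach, standardization to zero mean and unit variance with NumPy. The function name standardize and the toy array are illustrative assumptions, not necessarily the exact snippet from the repository.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + 1e-8)  # small epsilon guards against zero variance

# Two features on very different scales
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 4000.0]])
print(standardize(X))
```

After scaling, both columns contribute values of comparable magnitude, so neither dominates a distance-based or gradient-based algorithm.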

https://www.jeremyjordan.me/batch-normalization/

Batch normalization

Feature scaling made sense, didn’t it? The bad news is that this is lost once we go through the first activation layer in a deep neural network. The activation function works its magic and we are again left with values in a wide range and with different distributions. Wouldn’t it be nice to preserve the normalization we had on the features throughout the whole neural network? That is in fact what Batch normalization does.

Conceptually the normalization happens after the activation, but in practice it has been shown that normalizing the input to the activation (the Z made of the weights and the previous activation) has the same effect with less computational effort. We do not necessarily want the values to have zero mean and unit variance; a different distribution could be beneficial (this has to do with the activation function eventually behaving almost linearly in some regions). That is the reason we calculate Z-tilde based upon the trainable parameters gamma and beta. What they do is another scaling and offsetting operation that fixes the mean and variance, to take the activation into a region more beneficial for the overall training.

To get a good understanding of the mechanics of this method, great sources are the videos Normalizing Activations in a Network (C2W3L04) and Fitting Batch Norm Into Neural Networks (C2W3L05) from Course 2, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization, of the Deep Learning Specialization by deeplearning.ai, taught by Andrew Ng.

The key takeaway here is normalization between layers. Feature scaling does it only to the input; in batch normalization, normalization and a later adjustment are also performed between layers.

Following is a Python implementation of the method. Note that in frameworks such as TensorFlow this reduces to a call to tf.nn.batch_normalization.
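Here is a minimal NumPy sketch of the normalization step described above; the function name batch_norm and the assumed shapes are my own illustration.

```python
import numpy as np

def batch_norm(Z, gamma, beta, epsilon=1e-8):
    """Normalize the pre-activation Z over the mini-batch, then scale and
    shift it with the trainable parameters gamma and beta.

    Z:     pre-activations of shape (m, n) -- m samples, n units
    gamma: scale parameter of shape (1, n)
    beta:  offset parameter of shape (1, n)
    """
    mean = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_norm = (Z - mean) / np.sqrt(var + epsilon)
    return gamma * Z_norm + beta
```

Gamma and beta are learned along with the weights, so the network itself decides which mean and variance each layer’s inputs should have.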

Mini-batch gradient descent

Until now we have been doing what is called batch gradient descent. We take into account every single sample at our disposal to perform one step or iteration of the training, that is, to move a little step (the size of our learning rate) in the direction opposite to the gradient of our loss function. This approach, while relatively safe, is computationally very costly. What mini-batch gradient descent does is move with less data: instead of using, say, 10,000 samples to calculate one step, we use some smaller number of samples. This amount is what we call the batch size.

So the calculation of the update operation will be less costly since we are considering fewer samples. Mini-batch gradient descent lies between batch gradient descent, which takes all the available sample data at every iteration step, and stochastic gradient descent, which takes only one sample at a time. The fact that batch sizes are usually powers of two has to do with memory allocation for the calculation. While splitting the available samples into mini-batches, the last one may end up smaller than the batch size; it is up to you whether to use it or not. Also note that a new term comes into play: the epoch, which is equivalent to one batch gradient descent iteration in the sense that all available data was used, that is, we iterated through all mini-batches.
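As a sketch of the splitting step, here is one way to build the mini-batches with NumPy; create_mini_batches and the commented training loop are illustrative helpers, not the original implementation.

```python
import numpy as np

def create_mini_batches(X, Y, batch_size=32):
    """Shuffle the data and split it into mini-batches.

    X: inputs of shape (m, nx); Y: labels of shape (m, ny).
    The last batch may be smaller than batch_size.
    """
    m = X.shape[0]
    permutation = np.random.permutation(m)
    X_shuffled, Y_shuffled = X[permutation], Y[permutation]
    return [(X_shuffled[i:i + batch_size], Y_shuffled[i:i + batch_size])
            for i in range(0, m, batch_size)]

# One epoch is one pass over all mini-batches:
# for X_batch, Y_batch in create_mini_batches(X_train, Y_train, 64):
#     grads = compute_gradients(X_batch, Y_batch)        # hypothetical helper
#     params = update_parameters(params, grads, alpha)   # hypothetical helper
```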

Gradient descent with momentum

While performing the network training, the convergence behavior may turn out to be relatively erratic. The search for the loss-function-minimizing parameters at each iteration, although generally right in the long run, may not necessarily point in the direction of the optimizing minimum. What the present algorithm does is use an exponentially weighted average of the gradient. While fancy-sounding, it means considering not only the current gradient but also a contribution from the previously calculated gradients. The overall effect is a smoothed-out convergence that carries momentum, or impulse, from the previous iterations.

Find next a Python implementation of it.
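The sketch below shows a single momentum update for one parameter array; the function name, argument names, and default values are assumptions for illustration.

```python
import numpy as np

def update_with_momentum(param, grad, v, alpha=0.01, beta=0.9):
    """Update param using an exponentially weighted average of gradients.

    v     = beta * v + (1 - beta) * grad   (the "velocity")
    param = param - alpha * v
    """
    v = beta * v + (1 - beta) * grad
    param = param - alpha * v
    return param, v

# Usage: keep v between iterations, initialized to zeros
# v = np.zeros_like(W)
# W, v = update_with_momentum(W, dW, v)
```

The closer beta is to 1, the more past gradients are remembered and the smoother the trajectory becomes.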

momentum — data from exponentially weighed averages. https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d

RMSProp optimization

This algorithm, which stands for Root Mean Square Propagation, has the particularity that it was never published in an academic paper. Instead, it was first proposed by Geoff Hinton in lecture 6 of the online course “Neural Networks for Machine Learning”.

It uses the second moment of the gradient, an exponentially weighted average of the squared gradients, to smooth out the variability of the parameter updates, just as gradient descent with momentum does with the mean.
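As with momentum, here is a minimal sketch of a single RMSProp update for one parameter array; the function name and default values are assumptions for illustration.

```python
import numpy as np

def update_with_rmsprop(param, grad, s, alpha=0.001, beta2=0.9, epsilon=1e-8):
    """Update param using a running average of the squared gradient.

    s     = beta2 * s + (1 - beta2) * grad**2
    param = param - alpha * grad / (sqrt(s) + epsilon)
    """
    s = beta2 * s + (1 - beta2) * grad ** 2
    param = param - alpha * grad / (np.sqrt(s) + epsilon)
    return param, s
```

Dividing by the root of the running squared gradient shrinks the step along directions that oscillate strongly and enlarges it along directions with consistently small gradients.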

Adam optimization

Adam stands for Adaptive Moment Estimation. At its core, it combines gradient descent with momentum and RMSProp optimization. Following is a Python implementation of Adam.
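The sketch below performs one Adam update for a single parameter array, combining the momentum estimate (v) and the RMSProp estimate (s) with bias correction; the function name and default values are assumptions for illustration.

```python
import numpy as np

def update_with_adam(param, grad, v, s, t, alpha=0.001,
                     beta1=0.9, beta2=0.999, epsilon=1e-8):
    """t is the 1-based iteration counter used for bias correction."""
    v = beta1 * v + (1 - beta1) * grad           # first moment (mean)
    s = beta2 * s + (1 - beta2) * grad ** 2      # second moment (squared grads)
    v_hat = v / (1 - beta1 ** t)                 # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    param = param - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return param, v, s
```

The bias correction matters mostly in the first iterations, when v and s are still close to their zero initialization.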

Learning rate decay

Intuitively, improvement is easier at the beginning of training. General characteristics or features present in most of the samples are easy for the network to pick up and learn. The speed at which the algorithm learns is dictated by the learning rate, the factor by which the parameters are updated. So at the beginning it is fine to use a higher speed. What learning rate decay proposes is to gradually reduce that speed as we train on more difficult features, that is, as we approach the loss-function-minimizing parameters. It is also a matter of convergence: once we are in the close vicinity of our optimum, we do not want to overshoot and go past it again and again.
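As an illustration, here is a minimal sketch of one common schedule, inverse time decay; the function name and the example values are assumptions, and other schedules (step decay, exponential decay) follow the same idea.

```python
def inverse_time_decay(alpha0, decay_rate, epoch):
    """Return the learning rate for the given epoch.

    alpha0:     initial learning rate
    decay_rate: controls how quickly the rate shrinks
    epoch:      current epoch number (0, 1, 2, ...)
    """
    return alpha0 / (1 + decay_rate * epoch)

# With alpha0 = 0.1 and decay_rate = 1:
# epoch 0 -> 0.100, epoch 1 -> 0.050, epoch 2 -> 0.033, epoch 3 -> 0.025
for epoch in range(4):
    print(round(inverse_time_decay(0.1, 1, epoch), 3))
```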

Closing

We presented the most common optimization techniques for artificial neural networks. In retrospect, once their basic mechanisms are understood, they are intuitive and make good sense. We add operations to our computation that in turn represent effort savings. If you are interested in the implementation details, you may want to take a look at my GitHub repository here.

Remember, there are many ways to go down the hill!

https://imgur.com/a/Hqolp#NKsFHJb

Further reading

An overview of gradient descent optimization algorithms

A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size

Understanding RMSprop — faster neural network learning

Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning

The Author

My name is Santiago Vélez G. I built my first website at sixteen. I hosted it on Yahoo! GeoCities, and it had profile pages of my school classmates. Then I took a long detour. It included an exchange year in Michigan, a bachelor’s in Chemical Engineering, a master’s in Process Engineering, and an MBA. I also spent some years abroad in Germany and Mexico. While at it, I had the chance to learn German, Italian, and French, to work in international plant engineering, and eventually to get certified as a Project Management Professional. Then I returned to Colombia and tried my luck as a solopreneur with a livestock auction mobile app. I moved on to work in mobility engineering, always on the project management and business development side. Two years ago, I decided to go all in and come back to software, and now Machine Learning. I did several online courses and joined a twenty-month intensive Software Engineering program with a focus on ML. Here I am today, about to begin the exploit phase of this new life. I am happy I boarded this super-fast, challenging, and exciting technology wagon and look forward to continuing to learn in the time to come.

Let’s get in touch. LinkedIn, GitHub, Twitter.
