The Hitchhiker’s Guide to Optimization in Machine Learning
A Detailed Guide on Optimization and Stochastic Gradient Descent
The aim of this article is to establish a proper understanding of what exactly “optimizing” a Machine Learning algorithm means. Further, we’ll have a look at the gradient-based class of optimization algorithms (Gradient Descent, Stochastic Gradient Descent, etc.).
NOTE: For the sake of simplicity and better understanding, we’ll restrict the scope of our discussion to supervised machine learning algorithms only.
Machine Learning is the culmination of Applied Mathematics and Computer Science, where we train data-driven applications and use them to run inferences on the available data. Generally speaking, for an ML task, the type of inference (i.e., the prediction that the model makes) varies on the basis of the problem statement and the type of data one is dealing with. Despite these dissimilarities, however, these algorithms tend to share some similarities as well, especially in the essence of how they operate.
Let’s try to understand the previous paragraph. Consider supervised ML algorithms as a superset. Now, we can go ahead and further divide this superset into smaller sub-groups based on the characteristics these algorithms share:
- Regression vs classification algorithms
- Parametric vs non-parametric algorithms
- Probabilistic vs non-probabilistic algorithms, etc.
Setting these differences aside, if we observe the generalized representation of a supervised machine learning algorithm, it’s evident that these algorithms tend to work more or less in the same manner.
- Firstly, we have some labeled data, which can be broken down into the feature set X and the corresponding label set Y.
- Then we have the model function, denoted by F, which is a mathematical function that maps the input feature set X_i to the output ŷ_i.
To put it in layman’s terms, every supervised ML algorithm involves passing a feature set X_i as input to the model function F, which processes it to generate an output ŷ_i.
However, this is just the inference (or testing) phase of a model, where theoretically, we are supposed to use the model to generate predictions on data it has never seen before.
But what about “training” the model? Let’s have a look at it next.
Model Optimization
The training phase of a supervised ML algorithm can be broken down into two steps:
- Forward Propagation: The forward propagation step is similar to the inference phase of a model, where we have a parameterized model function F that performs transformations on the input set X_i to generate the output ŷ_i.
- Backward Propagation: Before we understand how backpropagation works, it is very important to understand why we need it. When we are working with parameterized models, the parameter matrices (model weights and biases) are often initialized randomly. Because of this random initialization and the uncertainty associated with it, there's a very high probability that the initial model performance will be very underwhelming. Therefore, we need a way to alter the model parameters such that performance improves. This is exactly what backpropagation aims to achieve. In the backpropagation step, we first evaluate the inferred model output ŷ_i against the actual label Y_i by calculating the total information loss (sometimes also referred to as model cost) via a loss function. Then, based on the loss generated, we optimize the model so that the total cost it generates on the next training cycle is reduced; a sketch of one such cycle follows right after this list.
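To make these two steps concrete, here is a minimal sketch of one training cycle in Python. The linear model, the squared-error loss, and all variable names are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled data: feature set X and label set Y (here Y follows a known rule).
X = rng.normal(size=(100, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 1.0

# Randomly initialized parameters, hence the underwhelming initial performance.
w = rng.normal(size=3)
b = 0.0

# Forward propagation: the model function F maps X to predictions y_hat.
y_hat = X @ w + b

# Loss function: the total information loss (here, mean squared error).
loss = np.mean((y_hat - Y) ** 2)

# Backward propagation: gradients of the loss w.r.t. each parameter,
# which the optimization step will use to update w and b.
grad_w = 2.0 * X.T @ (y_hat - Y) / len(Y)
grad_b = 2.0 * np.mean(y_hat - Y)
```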
So how exactly do we define model optimization?
Model optimization can be defined as the process of updating the model parameters (i.e., the model weights and biases), based on a criterion (the loss function), such that the information loss when the model runs inference on the training data is reduced. Simply put, during an optimization step, we iteratively update the model parameters such that the cost function is minimized.
So model training can be summarized as an iterative cycle where we start with some randomly initialized model parameters, and then gradually optimize these parameters over several backpropagation steps with the aim of achieving high accuracy and a low model cost.
So now that we know what model optimization is, let us have a look at some of the most widely used optimization algorithms in Machine Learning.
Gradient Descent Algorithm
Gradient descent is one of the easiest to implement (and arguably one of the worst) optimization algorithms in machine learning. It is a first-order (i.e., gradient-based) optimization algorithm in which we iteratively update the parameters of a model to minimize a differentiable cost function.
Before we understand how gradient descent works, let us first have a look at the generalized formula of GD:

θ ← θ − γ · ∇J(θ)

Here θ denotes the model parameters, γ the learning rate, and ∇J(θ) the gradient of the cost function J w.r.t. the parameters.
The basic idea here is to update the model parameters in the direction of the negative gradient of the loss function w.r.t. the parameter. To give you a better intuition, you can think of gradient descent as follows:
Let us imagine that our cost function is a mountainous region, and the aim of our optimization function, gradient descent, is to find the valley in this mountainous region (i.e., the minimum of the cost function). Continuing with the analogy, let’s assume that a blindfolded person is dropped somewhere on the mountain and asked to get to the base of the mountain.
Now, owing to the blindfold, the person naturally won’t know the direction to the base of the mountain. Therefore, the best information available to them will be the direction of the steepest slope at their current position. Hence, following the gradient descent approach, our blindfolded person will take every step in the direction of the steepest descent.
Basically, the gradient descent algorithm follows a greedy search strategy: there might be a better solution in some other direction, but the parameters are updated only in the direction of the negative gradient (the steepest slope).
This search strategy, however, has a major flaw! Since it’s a greedy algorithm, there is a high chance that it will get stuck in a local minimum instead of the global minimum, and hence we might never converge to the parameter values that yield the least model cost.
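Here is a minimal sketch of the full (batch) gradient descent loop in Python. The toy cost function J(θ) = (θ − 3)² and its gradient are assumptions made for illustration:

```python
def gradient_descent(grad_fn, theta, gamma=0.1, epochs=100):
    """Batch GD: repeatedly step against the gradient of the cost function."""
    for _ in range(epochs):
        theta = theta - gamma * grad_fn(theta)  # step along the steepest descent
    return theta

# Toy example: J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3),
# so GD should walk theta from 0.0 toward the minimum at 3.0.
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_star)  # ~3.0
```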
Effect of Learning Rate (γ)
When it comes to parameter updates in the gradient descent algorithm, the value of the hyperparameter γ, which denotes the step size (or learning rate), plays a key role.
- If γ is too low, the chance of reaching an optimal solution increases. However, the convergence rate, i.e., how fast the minimum is reached, drops drastically. This simply means that the number of epochs the model has to be trained for increases when the learning rate is too low.
- If γ is too large, the probability of attaining the optimum is reduced, as the model parameters tend to oscillate around the minimum. The convergence rate does increase as a result of the larger step size, but another noticeable risk of a high learning rate is divergence, where instead of shrinking, the loss explodes.
A practical way to deal with this trade-off between a high learning rate and a low learning rate is to use a variable learning rate: a larger learning rate for the initial epochs, then a reduced learning rate for the later epochs as we proceed further down the model training process. This scheme, commonly known as learning rate scheduling (or decay), gives us the advantage of both a high learning rate (faster convergence) and a low learning rate (higher probability of reaching the optimum).
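Such a schedule can be as simple as dividing the initial rate by a factor that grows with the epoch count. The inverse-time decay below is one common choice among many, not a prescription:

```python
def decayed_lr(initial_lr, epoch, decay_rate=0.05):
    """Inverse-time decay: large steps early on, small careful steps later."""
    return initial_lr / (1.0 + decay_rate * epoch)

for epoch in (0, 10, 50, 100):
    print(f"epoch {epoch:>3}: learning rate = {decayed_lr(0.5, epoch):.4f}")
```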
While gradient descent is easy to implement, its tendency to frequently get stuck in local optima is too significant a drawback to overlook.
What good is an optimization algorithm if it can’t actually minimize the model cost?
Hence, researchers came up with a modified, more optimized version of gradient descent: Stochastic Gradient Descent.
Stochastic Gradient Descent
While stochastic gradient descent belongs to the same class of first-order iterative optimization algorithms as gradient descent, it is a significant improvement over gradient descent, both computationally and performance-wise.
In SGD, instead of using the gradients of the entire training set for a weight update, we stochastically (randomly) choose a single data instance from the training set and perform the weight update based on the gradient for that instance alone.
The generalized formula can be given as follows:

θ ← θ − γ · ∇J(θ; X_i, Y_i)

Here ∇J(θ; X_i, Y_i) is the gradient of the loss evaluated on a single training instance (X_i, Y_i).
One look at the SGD formula makes it clear that it is very similar to the gradient descent algorithm. The only difference is that instead of taking the gradient of the model loss over the entire batch of data instances, SGD uses the gradient of the loss on a single instance chosen uniformly at random from the dataset.
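Here is a minimal sketch of this per-instance update in Python. The toy objective, fitting a scalar θ to the mean of the data, is an assumption made for illustration:

```python
import numpy as np

def sgd(grad_fn, theta, data, gamma=0.01, epochs=10, seed=0):
    """SGD: update the parameters on one randomly chosen instance at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):  # visit instances in random order
            theta = theta - gamma * grad_fn(theta, data[i])
    return theta

# Toy example: J(theta) = mean_i (theta - x_i)^2 has per-instance
# gradient 2 * (theta - x_i); its minimizer is the mean of the data.
data = np.random.default_rng(1).normal(loc=5.0, size=1000)
print(sgd(lambda t, x: 2.0 * (t - x), theta=0.0, data=data))  # ~5.0
```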
Yet, in practice, SGD works significantly better than gradient descent, the reason being the noisy parameter updates. Since we update the parameters after forward propagation on every single instance, rather than after accumulating the gradients of the entire training set, the parameters tend to bounce around. This, to some extent, reduces the probability of getting stuck in a local minimum, an effect often compared to annealing.
Also, it is observed that each SGD update is computationally less expensive than a full GD update, thus significantly reducing the training time.
In theory, we assume that all the data instances in the dataset are independent of each other, i.e., that every instance in the training set is unique. In practice, however, there is a lot of redundancy (similar data instances) in the dataset. This redundancy means the gradient computed on a single instance is often a good approximation of the full-batch gradient, which keeps the noise in the updates manageable and ensures that the updated parameter values are close to, if not better than, the ones obtained from gradient descent.
Note: In practice, instead of updating the model parameters on single data instances, we perform weight updates on mini-batches of data. This can be thought of as a hybrid of GD and SGD. The advantage of mini-batch SGD is that it is well suited to parallelization on GPUs (i.e., the model function can process multiple data instances at the same time), thus allowing for even faster training times and less noise.
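A rough sketch of the mini-batch variant, assuming a grad_fn that averages the per-instance gradients over a batch (all names and the batch size are illustrative):

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, Y, gamma=0.01, batch_size=32,
                  epochs=10, seed=0):
    """Mini-batch SGD: one parameter update per small random batch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the dataset every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # grad_fn averages per-instance gradients over the batch, which
            # a GPU can evaluate for all instances in the batch in parallel.
            theta = theta - gamma * grad_fn(theta, X[batch], Y[batch])
    return theta
```

Averaging over batch_size instances makes each update less noisy than single-instance SGD while remaining far cheaper than a full-batch GD step.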
Momentum-Based SGD
In physics terminology, momentum is the property of a moving object that keeps it moving in the direction it is already moving in, allowing it to maintain its direction and state of motion.
When we talked about Stochastic Gradient Descent in the last section, we observed that the model parameters oscillate a lot due to the noise. This excessive noise, while effective against local minima in practice, can prevent the model loss function from attaining the absolute minimum, and hence the cost function might never converge.
Therefore, in order to reduce these oscillations, optimization researchers took inspiration from momentum in Newtonian physics and introduced a momentum-based variation of the Stochastic Gradient Descent algorithm.
To understand how momentum works, let us have a look at the generalized formula for SGD with momentum:

v ← β · v + ∇J(θ; X_i, Y_i)
θ ← θ − γ · v

Here v is the momentum (velocity) term that accumulates past gradients, and β is the momentum coefficient.
Here’s how SGD+momentum works:
- In non-momentum SGD, the parameters are updated via a greedy search in the direction of the negative gradient.
- However, in SGD+momentum, the momentum accumulated from the previous steps influences the descent down the slope. As a result, instead of oscillating around the direction of the negative gradient, the parameter update is pulled along a consistent direction. This dampens the noise and therefore increases the probability of the model cost attaining the absolute minimum.
If you pay close attention to the formula, the momentum term v is just an accumulation of the gradients from each parameter update step in SGD. One important thing to note, however, is that the hyperparameter β (ranging between 0 and 1) dampens the accumulated gradients after each step. Thus, the gradients from more recent steps affect the parameter update more than those from much earlier steps.
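Here is a minimal sketch of the update in Python, following the two-line formula above. Note that libraries differ in exactly where they fold in γ; this is one common convention, not the only one:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, data, gamma=0.01, beta=0.9, epochs=10, seed=0):
    """SGD with momentum: a velocity term accumulates dampened past gradients."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            # Older gradients decay by a factor of beta at every step, so
            # recent gradients dominate the direction of the update.
            v = beta * v + grad_fn(theta, data[i])
            theta = theta - gamma * v
    return theta

# Same toy objective as before: fit a scalar theta to the mean of the data.
data = np.random.default_rng(2).normal(loc=5.0, size=1000)
print(sgd_momentum(lambda t, x: 2.0 * (t - x), theta=0.0, data=data))  # ~5.0
```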
In practice, momentum increases the probability of convergence, while at the same time not being very computationally expensive. This is the reason why most Deep Learning and Machine Learning packages use the momentum-based implementation of SGD rather than the regular one.
With this, we come to the end of this article. The key takeaways were:
- We saw what model optimization is.
- We had an overview of some of the most widely used optimization algorithms like gradient descent, stochastic gradient descent, and momentum-based SGD.
If you found the article helpful and would love to keep seeing more articles like this, make sure you hit the follow button.
Happy learning!