Optimizers: The Secret Sauce of Deep Learning

Neha Purohit
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
13 min read · Sep 20, 2023

Contents:

History Of Optimizers

Standard Optimizers

-Batch GD

-SGD

-Mini Batch

Adaptive Optimization Algorithm

-Momentum

-Adagrad

-RMSPROP

-ADAM

History Of Optimizers

The main objective of any deep learning algorithm is to predict outputs as close as possible to the actual outputs, which amounts to reducing a cost function based on the prediction error. Optimization algorithms play an important role in the forward and backward propagation of a neural network during the training process: they speed up training and find optimal parameters for the network. An optimization algorithm iteratively evaluates candidate parameter values until it reaches a point that is optimal or at least satisfactory. Gradient Descent, also known as Batch Gradient Descent, is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent.

Standard Optimizers

The animation below, made by Alec Radford, is an excellent illustration comparing several optimizers. Unfortunately, the Adam optimizer is not included in this illustration.

Animation by Alec Radford

Although gradient descent was initially discovered by Augustin-Louis Cauchy in the mid-19th century, it has become a cornerstone of modern machine learning and deep learning optimization techniques. This iterative optimization algorithm plays a pivotal role in training machine learning and deep learning models, aiding in the quest to find the local minimum of a function. By adjusting parameters in the direction of steepest descent, it typically converges to a local minimum, a point where the function is lower than in its nearby surroundings. For non-convex functions, which may have multiple local minima, gradient descent may become trapped in one of these local minima and miss the global minimum.

There are many variants of the gradient descent optimization algorithm, introduced over time in roughly chronological order. Each has its own advantages and disadvantages depending on the dataset, the computational budget, the implementation, and the machine learning model.

Training a neural network involves configuring its parameters to enable it to perform a specific task effectively. This requires defining a loss function to assess the network’s performance, with the goal of minimizing this function. This process, known as training, entails adjusting the network’s parameters to achieve optimal performance while minimizing the loss. Optimization algorithms are pivotal in this training process, and there are various state-of-the-art optimizers, each with its unique strengths and weaknesses.

Gradient descent is one of the most common training methods. It fine-tunes network parameters iteratively to minimize the loss function, allowing the network to learn the optimal weights for specific inputs. It offers several advantages, including ease of implementation, versatility across tasks, high-quality results for large networks, computational efficiency for large datasets, and a stable trade-off between speed and stability.

In short, gradient descent is an iterative optimization algorithm employed in machine learning to find the best outcome, ideally the global minimum of the loss curve. It operates by iteratively adjusting parameters to minimize a function. The term “gradient” denotes the rate at which the slope inclines or declines, while “descent” signifies the process of moving downward. The algorithm's iterative nature involves many repetitions to reach an optimal result, making it especially effective at improving under-fit models.

The formula for updating the parameters of a machine learning model using gradient descent is as follows:

θ(t+1) = θ(t) - α * ∇J(θ(t))

Where:

θ(t) represents the model parameters at iteration t.

α (alpha) is the learning rate, which determines the step size in the parameter space. It’s a hyperparameter that needs to be set in advance.

∇J(θ(t)) is the gradient of the cost or loss function J with respect to the parameters θ at iteration t. This gradient indicates the direction and magnitude of the steepest increase in the cost function.

The learning rate is a crucial hyperparameter in gradient descent and other optimization algorithms used in machine learning. It determines the step size at which the algorithm updates the model’s parameters during each iteration. In essence, the learning rate controls the pace at which the algorithm converges to the optimal solution. Cost functions are closely related to learning rate in the context of training machine learning models, especially when using optimization algorithms like gradient descent. The learning rate directly impacts the behavior of the optimization process, which, in turn, affects how the cost function evolves during training.
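
As a minimal sketch of this update rule, the Python snippet below applies θ(t+1) = θ(t) - α * ∇J(θ(t)) to a made-up one-parameter loss J(θ) = (θ - 3)²; the loss, the learning rate of 0.1, and the iteration count are illustrative assumptions, not values from the article.

def grad_J(theta):
    # gradient of the illustrative one-parameter loss J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter value
alpha = 0.1   # learning rate, chosen in advance

for t in range(50):
    theta = theta - alpha * grad_J(theta)   # theta(t+1) = theta(t) - alpha * grad J(theta(t))

print(theta)  # approaches 3.0, the minimizer of J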

Batch Gradient Descent

The vanilla version of Gradient Descent is Batch Gradient Descent. In this technique, we take the entire dataset and compute the gradient on it; i.e., for each epoch, the entire dataset is used in the calculation of the loss (e.g., the MSE) and its gradient.

The disadvantage is that, because the entire dataset is used for every update, the computational cost grows quickly, so this approach is very slow for a neural network trained on a large dataset.
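
To make "the entire dataset per epoch" concrete, here is a hedged sketch of batch gradient descent on a toy linear-regression problem with an MSE loss; the synthetic data, learning rate, and epoch count are assumptions chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # full training set: 1,000 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
alpha = 0.1

for epoch in range(100):
    preds = X @ w                                    # uses the ENTIRE dataset in every epoch
    grad = 2.0 * X.T @ (preds - y) / len(y)          # gradient of the MSE with respect to w
    w = w - alpha * grad                             # one parameter update per epoch

print(w)  # close to [2.0, -1.0, 0.5]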

Stochastic Gradient Descent

However, batch gradient descent has drawbacks, such as requiring many iterations to find optimal parameters and degrading in performance on large or complex datasets. For such cases, Stochastic Gradient Descent (SGD), which uses partial data samples for gradient calculations, may be a more suitable choice due to its efficiency with extensive training datasets. But before we move on to SGD, let's understand epochs, batch size, and iterations.

Epochs: An epoch is one complete forward and backward pass through the entire training dataset. During an epoch, the algorithm processes each data point in the training dataset exactly once, computes the gradients for the parameters, and updates the model's weights. In practice, training a machine learning model often requires multiple epochs to improve its performance. After each epoch, the model has seen and learned from the entire dataset once.

Batch Size: Batch size refers to the small sample size used in a single forward and backward pass of the neural network during each iteration. Instead of updating the model’s weights after processing every individual data point (as in pure stochastic gradient descent), SGD groups data into batches of a predefined size. The batch size determines how many data points are processed in parallel during each iteration.

Iterations per Epoch: The number of iterations required to complete one epoch is determined by dividing the total number of training examples by the batch size. For example, if you have 1,000 training examples and a batch size of 100, it would take 10 iterations to complete one epoch.
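
A tiny sketch of that arithmetic, and of how one epoch would be split into mini-batches; the 1,000 examples and batch size of 100 are the numbers from the example above.

num_examples = 1000
batch_size = 100
iterations_per_epoch = num_examples // batch_size
print(iterations_per_epoch)  # 10 iterations to complete one epoch

for i in range(iterations_per_epoch):
    start, end = i * batch_size, (i + 1) * batch_size
    batch = slice(start, end)  # e.g. X[batch], y[batch] would feed one forward/backward pass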

In the context of training a neural network, it might seem counterintuitive at first that passing the entire dataset through the network once is insufficient. However, this is because we typically work with limited datasets, and the optimization process relies on iterative techniques like Gradient Descent. Therefore, a single pass through the dataset, often referred to as one epoch, is not enough to fully optimize the model and improve its performance. Multiple passes through the dataset are necessary to fine-tune the model’s weights and achieve better results.

An alternative approach is to utilize a portion of the data during gradient computation, a technique known as stochastic gradient descent (SGD). SGD involves a two-phase learning process. In the initial phase, the network initializes with randomly set parameters, and a loss function is computed based on the training data. In the subsequent phase, the update procedure iterates multiple times until the desired convergence criterion is met. Similar to standard gradient descent, this learning process revolves around minimizing the loss function by iteratively adjusting the network’s weights and biases. Consequently, this method offers both stability and scalability and is well-suited for large datasets.

The formula for updating the model parameters (θ) using stochastic gradient descent is as follows:

θ(t+1) = θ(t) - α * ∇J(θ(t), x_i, y_i)

Where:

θ(t) represents the model parameters at iteration t.

α (alpha) is the learning rate, a hyperparameter that determines the step size in the parameter space.

∇J(θ(t), x_i, y_i) is the gradient of the cost or loss function J with respect to the parameters θ at iteration t, computed using a single data point (x_i, y_i) from the training dataset. This is what makes it “stochastic” because it uses a random (or shuffled) data point for each iteration.
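
A minimal sketch of this per-example update on a toy linear-regression problem; the synthetic data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # toy training set
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
alpha = 0.01

for epoch in range(5):
    for i in rng.permutation(len(y)):               # shuffled order is what makes it "stochastic"
        x_i, y_i = X[i], y[i]
        grad_i = 2.0 * (x_i @ w - y_i) * x_i        # gradient from a single example (x_i, y_i)
        w = w - alpha * grad_i                      # theta(t+1) = theta(t) - alpha * grad J(theta(t), x_i, y_i)

print(w)  # close to [2.0, -1.0, 0.5]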

In certain scenarios, stochastic gradient descent (SGD) can experience slow convergence, especially when gradients are consistently small or noisy. This limitation arises because SGD relies solely on gradients at each iteration for updates. To address these issues in neural network training, momentum is introduced, which accelerates gradient descent by considering previous gradients in the update rule at each iteration.

Mini Batch

SGD is a widely used optimization method, but it has certain limitations that led to the development of mini-batch SGD. The challenges associated with pure SGD include its tendency to yield noisy and erratic updates, because it updates the model weights after processing each individual data point. This high variance in updates can hinder convergence and lead to instability. Additionally, SGD does not efficiently utilize modern hardware such as GPUs, which are capable of performing optimized vectorized operations. Mini-batch SGD was introduced to mitigate these issues by processing small, random subsets (mini-batches) of the training data during each iteration. This approach reduces the variance of the updates, makes better use of the hardware, and strikes a balance between noise and stability in the optimization process. Consequently, mini-batch SGD has become the preferred optimization algorithm for training deep learning models. However, mini-batch SGD still suffers from noise and oscillations when updating the model's parameters with mini-batches of data. To dampen these, we can use an Exponentially Weighted Moving Average (EWMA) to smooth parameter updates during training:

moving_avg = β * moving_avg + (1 - β) * gradient

β * moving_avg represents the weighted contribution of the previous moving average,

(1 - β) * gradient represents the contribution of the current mini-batch gradient.

It involves initializing moving average variables for each model parameter and choosing a decay rate (usually between 0 and 1) to determine the influence of the current mini-batch gradient versus the historical moving average. During each iteration, the moving average for each parameter is updated using a weighted combination of the previous moving average and the current mini-batch gradient. Model parameters are then updated using these smoothed moving average values instead of raw gradients. This process reduces noise and oscillations in updates, leading to more stable and efficient training of machine learning models.
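
Here is a hedged sketch of the EWMA smoothing just described, applied to a stream of stand-in mini-batch gradients; the decay rate β = 0.9 and the random "gradients" are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
beta = 0.9                  # decay rate: how strongly the history is weighted
moving_avg = np.zeros(3)    # one moving-average entry per model parameter
w = np.zeros(3)
alpha = 0.1

for step in range(100):
    gradient = rng.normal(size=3)                            # stand-in for a noisy mini-batch gradient
    moving_avg = beta * moving_avg + (1 - beta) * gradient   # the EWMA update from the formula above
    w = w - alpha * moving_avg                               # update with the smoothed value, not the raw gradient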

Adaptive Optimization Algorithm

Momentum

Gradient Descent with Momentum is another optimization technique used to smooth and speed up the training of machine learning models, often leading to faster and more stable convergence. It enhances standard Gradient Descent by introducing a momentum term that accumulates the influence of past gradients, allowing for faster convergence and smoother optimization. During each iteration, the algorithm calculates gradients, updates a momentum-based velocity vector, and adjusts the model parameters.

Momentum helps the algorithm escape local minima, navigate regions with complex curvature, and converge more swiftly. Typically, a momentum term (beta) is chosen, and a learning rate determines the step size for parameter updates. This technique is particularly effective in training deep neural networks and other machine learning models.
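
A sketch of gradient descent with momentum on a toy quadratic loss, written in the same EWMA style as the formula above; the loss, learning rate, and momentum term β = 0.9 are illustrative assumptions.

import numpy as np

def grad_loss(w):
    # gradient of an illustrative quadratic loss with its minimum at [1, 2]
    return 2.0 * (w - np.array([1.0, 2.0]))

w = np.zeros(2)
velocity = np.zeros(2)
alpha, beta = 0.1, 0.9      # learning rate and momentum term

for t in range(200):
    g = grad_loss(w)
    velocity = beta * velocity + (1 - beta) * g   # accumulate the influence of past gradients
    w = w - alpha * velocity                      # move along the smoothed direction

print(w)  # approaches [1.0, 2.0]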

Image from a Visual Studio Magazine post

In all of the above optimization methods, the learning rate (eta) has been a fixed value between 0 and 1, which keeps the convergence toward the minimum steady.

Adagrad

What if we need to change eta dynamically? This is where AdaGrad comes in. AdaGrad (Adaptive Gradient Algorithm) is an optimization algorithm used in machine learning to dynamically adjust the learning rate for each parameter during training. It achieves this by accumulating the squared values of each parameter's past gradients. This accumulation allows AdaGrad to give larger updates to parameters associated with infrequent, large gradients and smaller updates to parameters with frequent, small gradients. AdaGrad is particularly useful for sparse data because it can keep the learning rate high for parameters that have not been updated frequently, aiding convergence in such cases. However, one drawback is that it can overly reduce the learning rate for frequently updated parameters, potentially causing slow convergence in some situations; this drawback is known as the 'diminishing learning rate'.
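
A minimal AdaGrad sketch on a toy quadratic loss: the running sum of squared gradients divides each parameter's step, so frequently updated parameters receive smaller and smaller updates. The loss, alpha, and epsilon are illustrative assumptions.

import numpy as np

def grad_loss(w):
    # gradient of an illustrative quadratic loss with its minimum at [1, 2]
    return 2.0 * (w - np.array([1.0, 2.0]))

w = np.zeros(2)
grad_sq_sum = np.zeros(2)   # accumulated squared gradients, one entry per parameter
alpha, eps = 0.5, 1e-8

for t in range(200):
    g = grad_loss(w)
    grad_sq_sum += g ** 2                              # the history of squared gradients only ever grows
    w = w - alpha * g / (np.sqrt(grad_sq_sum) + eps)   # per-parameter effective learning rate shrinks over time

print(w)  # moves toward [1.0, 2.0], but with ever smaller steps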

Diminishing learning rate, also known as learning rate decay or learning rate annealing, is a technique used in optimization algorithms, particularly during the training of machine learning models, to gradually reduce the learning rate over time. The learning rate is a hyperparameter that determines the step size at which model parameters are updated during optimization.

The concept behind diminishing learning rate is to start with a relatively large learning rate at the beginning of training when the model’s parameters are far from the optimal values. As training progresses and the model gets closer to convergence, the learning rate is reduced. The motivation for using diminishing learning rates includes:

Faster Convergence: A larger initial learning rate can help the model converge faster in the early stages of training when the parameter values are far from optimal.

Stable Convergence: Reducing the learning rate as training progresses allows the optimization process to stabilize and fine-tune the model’s parameters more precisely, avoiding overshooting the optimal values.

Escape Local Minima: Diminishing learning rates can help the optimization process escape local minima by allowing the algorithm to explore different regions of the loss landscape with smaller steps as it gets closer to a potential minimum.

This can be achieved by implementing strategies like Step Decay, Exponential Decay, 1/t Decay, Performance Based Scheduling.
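
The schedules named above can be written as simple functions of the epoch number; the sketch below gives hedged versions of step decay, exponential decay, and 1/t decay (performance-based scheduling would instead watch a validation metric), with the initial rate and decay constants chosen purely for illustration.

import math

initial_lr = 0.1   # illustrative starting learning rate

def step_decay(epoch, drop=0.5, epochs_per_drop=10):
    # step decay: halve the learning rate every 10 epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, k=0.05):
    # exponential decay: smooth shrinkage at rate k
    return initial_lr * math.exp(-k * epoch)

def one_over_t_decay(epoch, k=0.1):
    # 1/t decay: shrink roughly in proportion to 1/epoch
    return initial_lr / (1 + k * epoch)

for epoch in (0, 10, 50):
    print(epoch, step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))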

RMSPROP

RMSProp (Root Mean Square Propagation) addresses the diminishing learning rate problem described above: instead of accumulating all past squared gradients as AdaGrad does, it keeps an exponentially decaying average of the squared gradients, so the effective learning rate does not shrink toward zero. This makes RMSProp effective at handling the diminishing learning rate problem and a popular choice for optimizing machine learning models, particularly neural networks, where choosing an appropriate learning rate can be challenging.
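
A hedged RMSProp sketch on the same toy quadratic loss used earlier: the squared-gradient history is an exponentially decaying average rather than AdaGrad's unbounded sum. The hyperparameters are illustrative assumptions.

import numpy as np

def grad_loss(w):
    # gradient of an illustrative quadratic loss with its minimum at [1, 2]
    return 2.0 * (w - np.array([1.0, 2.0]))

w = np.zeros(2)
sq_avg = np.zeros(2)              # exponentially decaying average of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8

for t in range(500):
    g = grad_loss(w)
    sq_avg = beta * sq_avg + (1 - beta) * g ** 2   # decaying average instead of an unbounded sum
    w = w - alpha * g / (np.sqrt(sq_avg) + eps)    # root-mean-square scaling of each parameter's step

print(w)  # approaches [1.0, 2.0]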

ADAM

Adam (Adaptive Moment Estimation) was introduced to address the remaining limitations of these methods by combining the benefits of adaptive learning rates (as in RMSprop) with momentum.

Adam, short for Adaptive Moment Estimation, was introduced by Diederik Kingma and Jimmy Ba and was initially designed for training deep neural networks in fields like machine translation and speech recognition. Adam maintains an exponentially moving average of both the first moments (the mean) and the second moments (the uncentered variance) of the gradients. This allows Adam to adaptively adjust the learning rate for each parameter while incorporating information about the gradient's momentum. This combination makes Adam more robust and generally results in faster and more stable convergence compared to RMSprop and some other optimization algorithms.

The Adam method is a highly efficient and widely used optimization technique that combines gradient descent with momentum and adaptive per-parameter learning rates. It excels at estimating adaptive learning rates for all parameters used in gradient-based training. The update equations are a critical aspect of the method, defining how it adjusts the parameters; a sketch of them appears below.
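
To make that notation concrete, here is a minimal sketch of the standard Adam update (first- and second-moment EWMAs with bias correction) on the same toy quadratic loss; the hyperparameters mirror commonly used defaults and, like the loss itself, are assumptions for illustration rather than values from the article.

import numpy as np

def grad_loss(w):
    # gradient of an illustrative quadratic loss with its minimum at [1, 2]
    return 2.0 * (w - np.array([1.0, 2.0]))

w = np.zeros(2)
m = np.zeros(2)                   # first moment: EWMA of the gradients (the mean)
v = np.zeros(2)                   # second moment: EWMA of the squared gradients (uncentered variance)
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad_loss(w)
    m = beta1 * m + (1 - beta1) * g          # update the first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # update the second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early iterations
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # adaptive, momentum-aware parameter update

print(w)  # close to [1.0, 2.0]

In practice, alpha is usually the hyperparameter tuned most aggressively, while beta1, beta2, and eps are often left near these default-style values.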
