Optimization Techniques in Deep Learning

Rina Mondal
4 min read · Dec 31, 2023


Optimization techniques use algorithms designed to iteratively adjust a model’s parameters during training to minimize the loss function (the discrepancy between the model’s predictions and the actual values), thereby enhancing the model’s performance.

In simpler terms, it’s like finding the lowest point of a valley on a foggy morning.

Here, model parameters are updated based on the gradients of the loss function. In this blog, we will discuss some commonly used optimization techniques in deep learning:

  1. SGD
  2. SGD with Momentum
  3. Nesterov Momentum
  4. Adagrad
  5. RMSProp
  6. Adam

1. Stochastic Gradient Descent (SGD): SGD randomly selects a mini-batch of examples at each iteration to compute the gradient and uses a learning rate to control the size of the parameter updates. The learning rate determines the step size taken in the direction of the negative gradient.
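
A minimal sketch of one SGD step, following the NumPy-style convention of the snippets later in this post (x is the parameter vector, dx the mini-batch gradient of the loss, and learning_rate a small constant such as 1e-3):

x -= learning_rate * dx    # step against the mini-batch gradient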

Considerations:

The term “stochastic” in SGD refers to this randomness. The noise introduced by random mini-batch selection can help the optimizer explore different parts of the loss landscape and converge to a better solution, so SGD is not as susceptible to local minima as full-batch gradient descent. However, saddle points can slow SGD down: the gradient is nearly zero along the flat directions, so the algorithm takes very small steps, and only the remaining non-flat directions can eventually push it away from the saddle point. To overcome this problem, SGD with momentum was introduced.

2. SGD with Momentum: The addition of momentum helps to navigate through areas with flat or small gradients more smoothly and accelerates the learning process. The momentum term allows the optimization algorithm to keep moving in the same direction if the gradients are consistent, and it dampens oscillations, making convergence faster and more stable.
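
A minimal sketch in the same pseudocode style, assuming rho (e.g. 0.9) is the momentum coefficient and velocity starts at zero:

velocity = rho * velocity - learning_rate * dx   # running "velocity" that accumulates past gradients
x += velocity                                    # step along the accumulated direction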

3. Nesterov Accelerated Gradient (NAG): Often referred to as Nesterov Momentum, it is a smarter way of rolling downhill. Instead of just checking the slope where you are, it looks a bit ahead by incorporating the momentum; it’s like peeking over the hill before deciding how to roll down. First it takes the momentum step, evaluates the gradient at that look-ahead position, and then combines the two to compute the update from the original point.

The key distinction is that Nesterov Momentum calculates the gradient not at the current position but at an adjusted position that incorporates the momentum term.
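
A sketch of the look-ahead update in the same style; evaluate_gradient is a hypothetical helper standing in for the gradient of the loss at a given point:

x_ahead = x + rho * velocity               # peek ahead along the current velocity
dx_ahead = evaluate_gradient(x_ahead)      # hypothetical helper: gradient of the loss at x_ahead
velocity = rho * velocity - learning_rate * dx_ahead
x += velocity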

4. Adagrad: Adaptive Gradient Algorithm is designed to adjust the learning rates for each parameter individually during the training of a machine learning model. It does this by keeping track of the historical gradients for each parameter and scaling the learning rates based on the accumulated gradient magnitudes. In simpler terms, Adagrad adapts the step sizes for updating model parameters based on their past gradients.

grad_squared += dx * dx                                     # accumulate squared gradients over all of training
x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)    # scale each parameter's step by its gradient history

Problems: Adagrad decreases the learning rates for parameters that have large historical gradients. As training progresses, the learning rates become progressively smaller, potentially leading to very slow or halted learning for some parameters.

5. RMSProp: RMSProp addresses this concern. Instead of letting the squared gradients accumulate over the whole of training, it lets the squared-gradient estimate decay, which prevents the learning rates from shrinking too aggressively over time. This is achieved by keeping an exponential moving average of squared gradients with a decay factor.

grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky moving average of squared gradients
x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

6. Adam (Adaptive Moment Estimation): Adam combines the benefits of adaptive learning rates and momentum to efficiently update model parameters during training. It maintains a first moment, a weighted moving average of the gradients, and a second moment, a moving average of the squared gradients. The update step then uses both: the velocity term is divided by the square root of the squared-gradient term, incorporating the nice properties of both approaches.

first_moment = beta1 * first_moment + (1 - beta1) * dx            # momentum (first moment)
second_moment = beta2 * second_moment + (1 - beta2) * dx * dx     # squared gradients (second moment)
x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)
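
The full Adam update also bias-corrects both moments, which matters early in training when the moving averages are still biased toward zero. A sketch, assuming t is the current step count starting at 1:

first_unbias = first_moment / (1 - beta1 ** t)     # undo the bias toward zero from initialization
second_unbias = second_moment / (1 - beta2 ** t)
x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)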

Optimization algorithms, crucial for training machine learning models, vary in effectiveness. While Stochastic Gradient Descent (SGD) is foundational, adaptive algorithms like Adam often outperform it in practice. The right choice depends on the specific problem and dataset, so experimentation and staying informed about new developments are key to optimal model training.

Explore Data Science Roadmap.

Visit my YouTube Channel where I explain Data Science topics for free.

If you found this guide helpful, why not show some love? Give it a Clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇


Rina Mondal

I have 8 years of experience and have always enjoyed writing articles. If you appreciate my hard work, please follow me so that I can continue pursuing my passion.