Data Scientist Interview Guide: Understanding Deep Learning Optimizers

What are optimizers? When to use which one? How to implement them as part of neural networks?

Karun Thankachan
CodeX
16 min read · May 8, 2023



A common set of questions in data science and machine learning interviews focuses on model architecture and design choices. In this post we cover one such design choice: the optimizer. What optimizers are, when to use which one, the advantages and disadvantages of each, and how to implement them in TensorFlow.

Optimizers

In deep learning, we train a neural network by adjusting the parameters (weights and biases) based on the error between the predicted output and the true output. The optimizer is the algorithm that helps us adjust those parameters to minimize the error and improve the accuracy of the model.

But before we dive into optimizers, let’s make sure we’re on the same page with some prerequisite concepts. You’ll need to have a basic understanding of:

  • Neural networks: These are the models that we train using the optimizer. They’re inspired by the way the human brain works, and consist of interconnected layers of nodes that process and transform data.
  • Cost functions: These are the functions that we want to minimize using the optimizer. They measure the error between the predicted output and the true output of the neural network.
  • Gradients: These are the derivatives of the cost function with respect to the parameters of the neural network. They tell us how the cost function changes as we adjust the parameters.
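
To make the gradient concept concrete, here is a minimal sketch of computing one with tf.GradientTape, which is how TensorFlow obtains the gradients that an optimizer consumes (the toy function and value are made up for illustration):

import tensorflow as tf

# Gradient of a toy "cost function" cost = x^2 at x = 3
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    cost = x ** 2
grad = tape.gradient(cost, x)
print(grad.numpy())  # 6.0, since d(x^2)/dx = 2x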

Now, back to optimizers. The goal of an optimizer is to find the optimal set of parameters that minimize the cost function. There are many different optimizers out there, each with their own strengths and weaknesses. Some are better suited for large datasets, while others are better suited for complex neural network architectures.

Here’s an analogy to help you understand the role of the optimizer: think of the neural network as a car, the cost function as the fuel gauge, and the optimizer as the driver. The driver’s job is to adjust the steering wheel and pedals to keep the car on the road and the fuel consumption low. Similarly, the optimizer’s job is to adjust the parameters of the neural network to keep the error low.

Now, let's look at some of the most popular optimizers, their advantages and disadvantages and how to use them.

Stochastic Gradient Descent (SGD)

“Stochastic” because it uses a random subset of the training data to update the parameters at each step, rather than the entire dataset. The math behind SGD is pretty simple. At each step, we calculate the gradient of the cost function with respect to the parameters, using the current batch of training data. We then update the parameters in the opposite direction of the gradient, multiplied by a learning rate (which determines how big of a step we take). The formula for updating the parameters using SGD is:

new_parameter = old_parameter - learning_rate * gradient
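
To make the update rule concrete, here is a minimal from-scratch sketch of SGD fitting a one-parameter linear model; the toy data and learning rate are made up for illustration:

import numpy as np

# Toy mini-batch for a linear model y = w * x (true w is 2.0)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = 0.0
learning_rate = 0.1

for step in range(20):
    # Gradient of mean squared error with respect to w
    gradient = np.mean(2 * (w * x - y) * x)
    w = w - learning_rate * gradient  # the SGD update rule

print(w)  # approaches 2.0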

Here are some advantages and disadvantages of using SGD:

Advantages
1. Simple and easy to implement
2. Works well with large datasets, since it only uses a small batch of data at a time
3. Can find good solutions quickly, especially in high-dimensional parameter spaces

Disadvantages
1. Can get stuck in local minima (where the cost function is low, but not the absolute lowest)
2. Requires careful tuning of the learning rate
3. May require more iterations to converge compared to other optimization algorithms

SGD is a good optimizer to use in situations where we have a large dataset and a relatively simple neural network architecture. It’s also a good choice if we’re short on time and want to find a good solution quickly.

To use SGD as part of a neural network in TensorFlow, we simply specify it as the optimizer when compiling the model. Here’s some example code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

model = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax')
])
model.compile(optimizer=SGD(learning_rate=0.01),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

And here’s a link to the TensorFlow documentation on using SGD: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD

Adagrad Optimizer

Adagrad is an optimization algorithm that adapts the learning rate of each parameter in a neural network based on the historical gradients of that parameter. Let me break it down for you in simpler terms.

When we train a neural network, we use an optimizer to update the parameters (weights and biases) based on the gradients of the cost function with respect to those parameters. Adagrad takes this a step further by scaling the learning rate of each parameter based on how frequently and how much that parameter has been updated in the past.

Here’s the math behind Adagrad:

1. Initialize the learning rate for each parameter as a small value, e.g., 0.01.
2. For each iteration of training, calculate the gradient of the cost function with respect to each parameter.
3. Update each parameter using the following formula:

parameter = parameter - (learning_rate / sqrt(sum_of_squared_gradients)) * gradient

where sum_of_squared_gradients is the sum of the squares of all previous gradients for that parameter.

4. Repeat steps 2 and 3 for the desired number of iterations.
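
A minimal from-scratch sketch of these steps for a single parameter may help; the gradient values here are made up for illustration, and a small epsilon is added to the denominator (as most implementations do) to avoid division by zero:

import numpy as np

learning_rate = 0.01
epsilon = 1e-8
parameter = 0.0
sum_of_squared_gradients = 0.0

for gradient in [0.9, 0.5, -0.3, 0.1]:  # pretend per-step gradients
    sum_of_squared_gradients += gradient ** 2
    parameter -= (learning_rate / (np.sqrt(sum_of_squared_gradients) + epsilon)) * gradient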

For more on the reasoning behind Adagrad check out this post — https://machinelearningmastery.com/gradient-descent-with-adagrad-from-scratch/

Here are some advantages and disadvantages of using Adagrad:

Advantages
1. Allows the learning rate to be adaptive for each parameter, which can lead to faster convergence on sparse and/or noisy datasets.
2. Eliminates the need to manually tune the learning rate, which can be a tedious and error-prone process.

Disadvantages
1. It can accumulate the sum of the squared gradients indefinitely, which can cause the learning rate to become very small and eventually stop the learning process.
2. It requires more memory to store the historical gradients for each parameter, which can be an issue for larger neural networks with many parameters.

Adagrad is well-suited for sparse datasets, where some parameters may be updated infrequently, and noisy datasets, where the gradients may vary widely. It’s also a good choice when you don’t have prior knowledge of the appropriate learning rate for each parameter.

In TensorFlow, Adagrad can be used in the following manner:

import tensorflow as tf

# Define the neural network model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10)
])

# Compile the model with Adagrad optimizer
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, batch_size=32)

For further documentation on using Adagrad in TensorFlow, you can refer to the official documentation: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad

RMSProp Optimizer

RMSProp stands for Root Mean Square Propagation. It’s an optimization algorithm used in deep learning to adjust the learning rate of each parameter based on its recent gradients, similar to AdaGrad. Here’s how it works:

1. RMSProp keeps track of a moving average of the squared gradients of each parameter. This is called the “exponential moving average” and is denoted by the variable S.

S(i) = beta * S(i-1) + (1 - beta) * gradient(i)^2

2. The memory of the moving average is controlled by a hyperparameter called the "decay rate" (beta in the above equation), which determines how quickly the moving average forgets past gradients.

3. The learning rate for each parameter is then divided by the square root of this moving average ‘S’, which has the effect of scaling down the learning rate for parameters that have large gradients and scaling up the learning rate for parameters that have small gradients.
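
Here is a minimal from-scratch sketch of a single parameter being updated by these three steps; the gradients and hyperparameter values are made up for illustration:

import numpy as np

learning_rate, beta, epsilon = 0.001, 0.9, 1e-8
parameter, S = 0.0, 0.0

for gradient in [0.9, 0.5, -0.3, 0.1]:  # pretend per-step gradients
    S = beta * S + (1 - beta) * gradient ** 2  # moving average of squared gradients
    parameter -= learning_rate * gradient / (np.sqrt(S) + epsilon)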

In summary, RMSProp adjusts the learning rate of each parameter based on the recent gradients, using a moving average to scale the learning rate. Here are some advantages and disadvantages of using RMSProp:

Advantages
1. Helps to prevent oscillations in the parameter updates
2. Efficiently adapts the learning rate to different parameters
3. Can converge faster than other optimization algorithms on some datasets

Disadvantages
1. Can get stuck in local minima
2. Hyperparameter tuning can be tricky

RMSProp is a good choice for training neural networks with non-stationary or noisy gradients, which can occur in recurrent neural networks or when using dropout regularization. To use RMSProp as part of a neural network in TensorFlow, you can do the following

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001), loss='mse')

Here’s a link to the TensorFlow documentation on RMSProp for further information: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop

Adadelta Optimizer

Adadelta is an optimization algorithm used in deep learning that adapts the learning rate during training. It was developed as an extension of the Adagrad and RMSProp optimizers, which have some limitations in their learning rate scheduling.

The math behind Adadelta can get a bit complicated, but I’ll try to explain it as simply as possible. The key idea behind Adadelta is to use a moving average of the squared gradients and a moving average of the squared parameter updates to adapt the learning rate. Specifically, Adadelta keeps track of two moving averages:

- A moving average of the squared gradients, ‘S’, similar to RMSProp.
- A moving average of the squared updates, denoted by Delta.

Essentially, at each step t:

S_t = gamma * S_{t-1} + (1 - gamma) * gradient_t^2
update_t = (sqrt(Delta_{t-1} + epsilon) / sqrt(S_t + epsilon)) * gradient_t
Delta_t = gamma * Delta_{t-1} + (1 - gamma) * update_t^2
parameter_t = parameter_{t-1} - update_t

where gamma is the decay rate and epsilon is a small constant for numerical stability.
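
A minimal from-scratch sketch of these updates for a single parameter, with made-up gradients for illustration:

import numpy as np

gamma, epsilon = 0.95, 1e-6
parameter, S, Delta = 0.0, 0.0, 0.0  # running averages start at zero

for gradient in [0.9, 0.5, -0.3, 0.1]:  # pretend per-step gradients
    S = gamma * S + (1 - gamma) * gradient ** 2
    update = (np.sqrt(Delta + epsilon) / np.sqrt(S + epsilon)) * gradient
    parameter -= update
    Delta = gamma * Delta + (1 - gamma) * update ** 2

Note that the epsilon terms keep the very first steps from being zero or undefined, since both running averages start at zero.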

Advantages
1. Requires comparatively less tuning of hyperparameters, as it adapts the learning rate automatically during training.
2. It can handle noisy data and sparse gradients better than some other optimizers.

Disadvantages
1. It can be slower to converge than other optimizers, especially on smaller datasets.
2. It can sometimes get stuck in local minima.
3. It requires the storage of additional variables which can be memory-intensive for very large models.

Adadelta is best used in situations where you have a large dataset with many features, the data is noisy or has sparse gradients, or you want an optimizer that requires minimal hyperparameter tuning.

In TensorFlow, Adadelta can be used in the following manner:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model.compile(optimizer='Adadelta', loss='categorical_crossentropy')

For further documentation on how to use Adadelta in TensorFlow, you can check out the official TensorFlow documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta

Adam Optimizer

Adam stands for “Adaptive Moment Estimation”, which is a type of gradient-based optimizer used in deep learning. It’s a combination of two other optimization algorithms: AdaGrad and RMSProp.

The idea behind Adam is to adaptively adjust the learning rate of each parameter based on its historical gradients. It does this by keeping track of two things:
(1) the exponentially decaying average of past gradients: if this value is large, the gradient signs have not been changing, indicating that we are moving consistently in one direction, so the step size can be increased.
(2) the exponentially decaying average of past squared gradients: if this value is large, the gradient values themselves are large, so we do not need a large step size to make the parameter update significant. This value therefore helps reduce the step size.

These two quantities are then used to update the parameters. Here’s the math behind it:

1. Compute the gradient of the cost function with respect to each parameter.
2. Calculate the exponentially decaying average of past gradients, m.
— m = beta1 * m + (1 - beta1) * gradient
3. Calculate the exponentially decaying average of past squared gradients, v.
— v = beta2 * v + (1 - beta2) * gradient²
4. Compute the bias-corrected estimates of m and v.
— m_hat = m / (1 - beta1^t)
— v_hat = v / (1 - beta2^t)
(where t is the current iteration, and beta1 and beta2 are hyperparameters)
5. Update the parameters based on the bias-corrected estimates.
— parameter = parameter - learning_rate * m_hat / (sqrt(v_hat) + epsilon)
(where learning_rate is the step size, and epsilon is a small constant to prevent division by zero)
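
A minimal from-scratch sketch of these five steps for a single parameter, with made-up gradients for illustration:

import numpy as np

learning_rate, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8
parameter, m, v = 0.0, 0.0, 0.0

for t, gradient in enumerate([0.9, 0.5, -0.3, 0.1], start=1):
    m = beta1 * m + (1 - beta1) * gradient       # decaying average of gradients
    v = beta2 * v + (1 - beta2) * gradient ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    parameter -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)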

Here are some advantages and disadvantages of using Adam:

Advantages
1. Adaptive learning rates: Adam adjusts the learning rate of each parameter individually, which can lead to faster convergence.
2. Robustness to sparse gradients: Adam performs well on problems with sparse gradients, such as text data or image recognition.
3. Memory efficiency: Adam only needs to store the first and second moments of the gradients, which requires less memory than other optimizers.

Disadvantages
1. Hyperparameter tuning: Adam has several hyperparameters that need to be tuned, such as beta1, beta2, and epsilon.
2. Overfitting: Adam can sometimes overfit to the training data, especially on small datasets.

Adam is a good optimizer to use for most deep learning tasks, especially when you’re not sure which optimizer to use. However, it may not be the best choice for all problems, and it’s always worth experimenting with different optimizers to find the best one for your specific task. Here’s an example of how to use the Adam optimizer in TensorFlow:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define your neural network architecture here
model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

For more information on how to use the Adam optimizer in TensorFlow, check out the official documentation: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam

Adamax Optimizer

Adamax is an optimization algorithm that’s similar to Adam, but with a few key differences. Like Adam, it’s a combination of gradient descent with momentum and RMSProp. However, instead of using the squared gradients to compute the scaling factor for the learning rate, it uses the maximum absolute value of the gradients. This makes it more robust to noisy or sparse gradients. Here’s the math behind Adamax:

1. Initialize the parameters: θ, m, and u to zero vectors of the same dimension.
2. For each iteration t:
— Compute the gradient: g_t
— Update the first moment estimate: m_t = β1 * m_{t-1} + (1 - β1) * g_t
— Update the exponentially weighted infinity norm: u_t = max(β2 * u_{t-1}, |g_t|)
— Compute the adaptive learning rate: α_t = η / (1 - β1^t)
— Update the parameters: θ_{t+1} = θ_t - α_t * m_t / u_t
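
A minimal from-scratch sketch of these steps for a single parameter, with made-up gradients for illustration:

import numpy as np

eta, beta1, beta2 = 0.002, 0.9, 0.999
parameter, m, u = 0.0, 0.0, 0.0

for t, gradient in enumerate([0.9, 0.5, -0.3, 0.1], start=1):
    m = beta1 * m + (1 - beta1) * gradient
    u = max(beta2 * u, abs(gradient))   # infinity norm replaces squared gradients
    step_size = eta / (1 - beta1 ** t)  # bias-corrected learning rate
    parameter -= step_size * m / u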

Some of the advantages and disadvantages are:

Advantages
1. It’s more robust to sparse or noisy gradients than other optimizers, such as SGD or Adagrad.
2. It’s computationally efficient and has been shown to work well on large-scale datasets and deep neural networks.
3. Its default hyperparameters often work well in practice, reducing the need for manual tuning compared to some other optimizers.

Disadvantages
1. It can sometimes converge to suboptimal solutions, especially on problems with high-dimensional parameter spaces or saddle points.
2. It can be sensitive to the choice of the hyperparameters β1 and β2, although the default values often work well in practice.

Adamax is best used in situations where you’re working with large datasets or deep neural networks, where the gradients may be sparse or noisy; or if you want an optimizer that’s efficient and doesn’t require manual tuning of hyperparameters. To use Adamax in a neural network with TensorFlow, see the following.

from tensorflow import keras

model = keras.Sequential([…])
model.compile(optimizer='adamax',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

For more information on how to use Adamax with TensorFlow, you can refer to the official documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adamax

Nadam Optimizer

Nadam stands for Nesterov-accelerated Adaptive Moment Estimation. It’s actually a combination of two other optimization methods, Nesterov accelerated gradient (NAG) and adaptive moment estimation (Adam). Here’s how it works:

1. Like Adam, Nadam keeps track of a running estimate of the first and second moments of the gradients to adaptively adjust the learning rate.

2. Nadam then uses a momentum term to accelerate convergence in the right direction, but with a twist. Nadam uses the Nesterov trick to be more aggressive in updating the weights. It is shown below

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t²
m_t_hat = m_t / (1 - beta1^t)
v_t_hat = v_t / (1 - beta2^t)
w_t = w_{t-1} - eta_t * (beta1 * m_t_hat + (1 - beta1) * g_t / (1 - beta1^t)) / (sqrt(v_t_hat) + epsilon)

where:

- m_t and v_t are the first and second moment estimates, respectively, at time step t.
- beta1 and beta2 are the exponential decay rates for the moment estimates.
- g_t is the gradient at time step t.
- m_t_hat and v_t_hat are the bias-corrected moment estimates.
- w_t is the weight at time step t.
- eta_t is the learning rate at time step t.
- epsilon is a small value added for numerical stability.
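
A minimal from-scratch sketch of these equations for a single weight, with made-up gradients for illustration:

import numpy as np

eta, beta1, beta2, epsilon = 0.002, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0

for t, g in enumerate([0.9, 0.5, -0.3, 0.1], start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-style blend of the momentum estimate and the current gradient
    w -= eta * (beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)) / (np.sqrt(v_hat) + epsilon)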

Some of the advantages and disadvantages are:

Advantages
1. It combines the benefits of both NAG and Adam optimizers, making it a powerful and efficient optimizer.
2. It has been shown to converge faster and perform better than both NAG and Adam on certain types of datasets.

Disadvantages
1. It can sometimes lead to overfitting, especially on smaller datasets.
2. It may require more computational resources than simpler optimization methods.

Nadam is a good optimizer to use in situations where you have a large dataset with a complex neural network architecture, or you need an optimizer that can handle noisy or sparse gradients. To use Nadam optimizer in a neural network in TensorFlow, see the following.

from tensorflow import keras

model = keras.Sequential([…])
model.compile(optimizer='nadam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

For more information on using Nadam optimizer in TensorFlow, check out the official documentation: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Nadam

Ftrl Optimizer

Ftrl, or Follow-the-Regularized-Leader, is an optimizer used in machine learning for updating the weights of a model during training. Let’s break it down in simple terms. Ftrl combines elements of both online learning and batch learning. Online learning updates the model after each training example, while batch learning updates the model after processing an entire batch of training examples. Ftrl tries to balance the benefits of both by adapting the learning rate for each weight based on the frequency of its updates.

The math behind Ftrl can get quite complex, but here’s a simplified explanation.

1. Initialize the weights of the model to some initial values.
2. At each iteration, calculate the gradient of the loss function with respect to the weights.
3. Update the accumulator, which stores the squared gradients for each feature.
4. Calculate the effective learning rate for each feature, which depends on the accumulated gradients and the regularization parameters.
5. Update the weights using the calculated effective learning rate and gradient (similar to Adagrad, Adadelta, etc.).
6. Perform L1 and L2 regularization on the updated weights to encourage sparsity and push weights to zero (the key differentiator compared to other optimizers).
7. Repeat steps 2–6 until convergence criteria are met (e.g., the change in the loss function is below some threshold).
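
For intuition, here is a rough from-scratch sketch of the FTRL-Proximal update for a single weight, based on the formulation in McMahan et al.’s paper; the gradients and regularization constants are made up for illustration:

import numpy as np

alpha, beta, lambda1, lambda2 = 0.1, 1.0, 0.5, 0.1
w, z, n = 0.0, 0.0, 0.0  # weight, "z" accumulator, squared-gradient accumulator

for g in [0.9, 0.5, -0.3, 0.1]:  # pretend per-step gradients
    sigma = (np.sqrt(n + g ** 2) - np.sqrt(n)) / alpha
    z += g - sigma * w
    n += g ** 2
    if abs(z) <= lambda1:
        w = 0.0  # the L1 term snaps small weights to exactly zero (sparsity)
    else:
        w = -(z - np.sign(z) * lambda1) / ((beta + np.sqrt(n)) / alpha + lambda2)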

Some of the advantages and disadvantages of Ftrl are

Advantages
1. It adapts the learning rate for each weight individually, which can lead to faster convergence and better generalization.
2. Its built-in regularization term helps prevent overfitting and produces sparse models.

Disadvantages
It can be more complex to implement than other optimizers, and may require more tuning to get the best results.

Ftrl is best used in situations where the data is very sparse, such as in natural language processing or recommendation systems, and where there are a large number of features. To use Ftrl as part of a neural network in TensorFlow, see the following.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Ftrl

model = Sequential()
model.add(Dense(64, input_dim=100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

optimizer = Ftrl(learning_rate=0.001)
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

For further documentation on using Ftrl in TensorFlow, check out the official documentation: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl

L-BFGS Optimizer

L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) is an optimization algorithm used in deep learning to train neural networks. Let me break it down for you in simple terms.

First, let’s talk about the math behind it. L-BFGS is a type of quasi-Newton method, which means it approximates the inverse Hessian matrix (a matrix that describes the curvature of the cost function) using the gradients of the cost function. The L-BFGS algorithm works as follows:

1. Initialize the weights of the model to some initial values.
2. At each iteration, calculate the gradient of the loss function with respect to the weights.
3. Estimate the search direction for the next update step using the previous gradients and Hessian approximation.
4. Perform a search along the estimated search direction to determine the step size that minimizes the loss function. This is done using a backtracking line search or the Wolfe conditions, which are out of scope for this blog.
5. Update the weights using the calculated step size and search direction.
6. Repeat steps 2–5 until convergence criteria are met (e.g., the change in the loss function is below some threshold).

Now, let’s talk about the advantages and disadvantages of L-BFGS:

Advantages
1. Can converge to a good solution with relatively few iterations compared to other optimization algorithms
2. Good for optimizing smooth and convex cost functions
3. Memory-efficient, because it only needs to store a limited number of previous gradients

Disadvantages
1. Can be slow for very large datasets or complex neural network architectures
2. Can get stuck in local optima (i.e., suboptimal solutions) if the cost function is not convex
3. Even the limited-memory variant stores several full-dimensional vectors, which can still be costly for very high-dimensional problems

In terms of the best situation to use L-BFGS, it’s generally good for problems with a moderate number of parameters and a smooth, convex cost function. However, for larger datasets or more complex neural network architectures, stochastic gradient descent (SGD) or its variants may be more appropriate.

TensorFlow’s Keras API does not ship a built-in L-BFGS optimizer; it is instead available through TensorFlow Probability as tfp.optimizer.lbfgs_minimize, which minimizes a function given its value and gradients. Here is a minimal sketch on a toy convex objective (the objective and settings are made up for illustration):

import tensorflow as tf
import tensorflow_probability as tfp

# Toy convex objective: f(x) = sum((x - 2)^2), minimized at x = [2, ..., 2]
def objective(x):
    return tf.reduce_sum((x - 2.0) ** 2)

@tf.function
def value_and_gradient(x):
    # L-BFGS needs both the loss value and its gradient at each point
    return tfp.math.value_and_gradient(objective, x)

result = tfp.optimizer.lbfgs_minimize(
    value_and_gradient,
    initial_position=tf.zeros(5),
    max_iterations=100)

print(result.position.numpy())  # close to [2., 2., 2., 2., 2.]

Training a full Keras model this way requires flattening the model’s weights into a single vector, which is beyond the scope of this post. For more information, see the TensorFlow Probability documentation: https://www.tensorflow.org/probability/api_docs/python/tfp/optimizer/lbfgs_minimize

Proximal Gradient Descent Optimizer

Proximal Gradient Descent is an optimization algorithm that combines the advantages of Gradient Descent and Proximal Operator. In simple terms, it means that this optimizer adjusts the weights and biases of a neural network in a way that minimizes the cost function while also taking into account constraints that we may have on the weights and biases. Here’s how the algorithm works:

1. Calculate the gradient of the cost function with respect to the weights and biases.
2. Take an ordinary gradient descent step using the learning rate.
3. Apply a proximity operator, a function that maps the updated weights and biases back onto the allowed range or set (or shrinks them according to the regularizer).
4. Repeat until a convergence criterion is met.
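
A minimal from-scratch sketch with an L1 penalty, whose proximity operator is the well-known soft-thresholding function; the gradients and constants are made up for illustration:

import numpy as np

learning_rate, l1_strength = 0.1, 0.05
w = 0.0

def soft_threshold(x, threshold):
    # Proximity operator for the L1 penalty: shrinks x toward zero
    return np.sign(x) * max(abs(x) - threshold, 0.0)

for gradient in [0.9, 0.5, -0.3, 0.1]:  # pretend gradients of the cost
    w = w - learning_rate * gradient                    # ordinary gradient step
    w = soft_threshold(w, learning_rate * l1_strength)  # enforce the constraint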

Now, let’s talk about the advantages and disadvantages of Proximal Gradient Descent:

Advantages
1. It’s a flexible algorithm that can handle different types of constraints on the weights and biases.
2. It can converge faster than Gradient Descent in situations where constraints come into play.

Disadvantages
1. It may require more tuning of hyperparameters than other optimization algorithms.
2. It may not work as well for very large datasets or very deep neural networks.

Proximal Gradient Descent can be useful in situations where we want to add some constraints on the weights and biases of our neural network, such as ensuring that they remain non-negative or within a certain range.

In TensorFlow, Proximal Gradient Descent is exposed through the TF1-style tf.compat.v1.train.ProximalGradientDescentOptimizer rather than tf.keras.optimizers, so it is not passed to model.compile in the usual way. Here is a minimal sketch (the variable and toy loss are made up for illustration):

import tensorflow as tf

# Proximal Gradient Descent lives under tf.compat.v1.train in TensorFlow 2
optimizer = tf.compat.v1.train.ProximalGradientDescentOptimizer(
    learning_rate=0.01,
    l1_regularization_strength=0.01,
    l2_regularization_strength=0.01)

w = tf.Variable([1.0, -2.0, 3.0])

def loss():
    # Toy quadratic loss; the optimizer applies the proximal L1/L2 step itself
    return tf.reduce_sum((w - 1.0) ** 2)

for _ in range(100):
    optimizer.minimize(loss, var_list=[w])

This creates a Proximal Gradient Descent optimizer with a learning rate of 0.01 and regularization strengths of 0.01 for both L1 and L2 regularization; the proximal step is what pushes weights toward zero under the L1 term.

For more information on how to use the Proximal Gradient Descent optimizer in TensorFlow, you can refer to the official TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalGradientDescentOptimizer

Conclusion

In this blog we covered 10 of the most popular and practically used optimizers. We have seen the intuition behind each one’s math, the advantages and disadvantages of each, when to use which one, and how they can be implemented using TensorFlow.

Credits

This post was written with help from ChatGPT. Some of the prompts used are:

You are an experienced data scientist with years of practical experience building and tuning machine learning models. Suggest the 10 most useful deep learning optimizers that someone new to data science should know about, to help build effective models. Don't recommend optimizers just because they are used in textbooks or docs. Recommend functions that are actually used in code.

Explain SGD optimizer in simple terms. Include all the math behind it. Explain the advantage and disadvantages and list them as bullet point. Explain the best situation in which this optimizer should be used. Show how it can be used as part of neural network in TensorFlow. Link to further documentation in TensorFlow as well


Karun Thankachan
CodeX

Simplifying data science concepts and domains. Get free 1-on-1 coaching @ https://topmate.io/karun