
Optimizers in Neural Networks

Amit Singh Rathore · Published in Nerd For Tech · 2 min read · Feb 14, 2021


Optimizers are algorithms used to modify a network's attributes, such as its weights and the learning rate, in order to minimize the loss.

Gradient Based

Batch Gradient Descent — Regression & classification

It computes the gradient of the loss function w.r.t. the parameters for the entire training dataset.

for i in range(epochs):
    # gradient of the loss computed over the entire training set
    param_gradient = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * param_gradient

Slow
Computation heavy
Memory intensive

Stochastic Gradient Descent

In SGD the parameter update happens for each training example and label.

for i in range(epochs):
    np.random.shuffle(data)
    for sample in data:
        params_gradient = evaluate_gradient(loss_function, sample, params)
        params = params - learning_rate * params_gradient

Fast
High variance
Memory efficient

Mini-batch SGD

Mini-batch gradient descent combines the best of both batch gradient descent and SGD. Here, parameter updates happen for every mini-batch (32, 64, … usually a power of 2) of n training samples.

for i in range(epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=64):
        params_gradient = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_gradient

Fast
Lower variance
Memory efficient

In all three optimizers above, the learning rate is constant. The learning rate itself strongly affects performance: if it is too small, convergence is slow; if it is too large, the loss oscillates or diverges.

So there are optimizers that adapt the learning rate on the fly. Let's discuss those.

Momentum based

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
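A minimal sketch of the momentum update, using the same placeholder helpers (evaluate_gradient, get_batches) as the snippets above; gamma is the momentum coefficient (commonly 0.9) and velocity is an assumed running accumulator initialized to zero:

gamma, learning_rate = 0.9, 0.01        # momentum coefficient and step size (illustrative values)
velocity = np.zeros_like(params)        # running accumulation of past update directions

for i in range(epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=64):
        params_gradient = evaluate_gradient(loss_function, batch, params)
        # carry over a fraction of the previous update, then add the current gradient step
        velocity = gamma * velocity + learning_rate * params_gradient
        params = params - velocity

Because consecutive gradients that point in the same direction accumulate in velocity, steps along consistent directions grow while oscillating components cancel out.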

Adagrad (Adaptive Gradient Algorithm)

The learning rate for each parameter is inversely proportional to the square root of the sum of the squares of all its previous gradients. Hence a large accumulated gradient results in a lower learning rate, while a small accumulated gradient keeps the learning rate high.
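A minimal sketch of the Adagrad update, again with the same placeholder helpers; grad_squared_sum is an assumed per-parameter accumulator and eps is a small constant for numerical stability:

eta, eps = 0.01, 1e-8                       # base learning rate and stability constant (illustrative values)
grad_squared_sum = np.zeros_like(params)    # per-parameter sum of squared gradients

for i in range(epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=64):
        params_gradient = evaluate_gradient(loss_function, batch, params)
        grad_squared_sum += np.square(params_gradient)
        # the effective learning rate shrinks as the accumulated squared gradient grows
        params = params - (eta / np.sqrt(grad_squared_sum + eps)) * params_gradient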

RMSprop (Root Mean Square Propagation)

RMSprop divides the learning rate by an exponentially decaying average of squared gradients. [γ = 0.9, learning rate = 0.001]

eps, gamma, eta = 1e-8, 0.9, 0.001
expected_grad = 0                   # exponentially decaying average of squared gradients

for epoch in range(epochs):
    for batch in get_batches(data, batch_size=64):
        params_gradient = evaluate_gradient(loss_function, batch, params)
        expected_grad = gamma * expected_grad + (1 - gamma) * np.square(params_gradient)
        RMS_grad = np.sqrt(expected_grad + eps)
        params = params - (eta / RMS_grad) * params_gradient

Adam (Adaptive Moment Estimation)

A learning rate is maintained for each network weight and adapted as learning happens, using estimates of the first and second moments of the gradients.

alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8
m_t, v_t, t = 0, 0, 0

for i in range(epochs):
    t += 1
    param_gradient = evaluate_gradient(loss_function, data, params)
    m_t = beta1 * m_t + (1 - beta1) * param_gradient           # first moment (mean) estimate
    v_t = beta2 * v_t + (1 - beta2) * (param_gradient ** 2)    # second moment (uncentered variance) estimate
    m_cap = m_t / (1 - beta1 ** t)                             # bias-corrected first moment
    v_cap = v_t / (1 - beta2 ** t)                             # bias-corrected second moment
    params = params - (alpha * m_cap) / (np.sqrt(v_cap) + epsilon)

Computationally less heavy
Memory efficient
Hyper-parameters require little tuning.
