Optimizers in Neural Networks
Optimizers are algorithms that modify a network's attributes, such as its weights and learning rate, in order to minimize the loss.
Gradient Based
Batch Gradient Descent (applicable to both regression and classification)
It computes the gradient of the loss function with respect to the parameters over the entire training dataset.
for i in range(epochs):
    params_gradient = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_gradient
Slow
Computation heavy
Memory intensive
Stochastic Gradient Descent
In SGD, the parameters are updated for each individual training example and its label.
for i in range(epochs):
    np.random.shuffle(data)
    for sample in data:
        params_gradient = evaluate_gradient(loss_function, sample, params)
        params = params - learning_rate * params_gradient
Fast
High variance
Memory efficient
Mini-batch SGD
Mini-batch gradient descent combines the best of both batch gradient descent and SGD: parameters are updated for every mini-batch of n training samples (n = 32, 64, …, usually a power of 2).
for i in range(epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=64):
        params_gradient = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_gradient
Fast
Lower variance
Memory efficient
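As a concrete illustration, the mini-batch loop above can be made runnable on a toy linear-regression problem. The data, learning rate, and variable names below are illustrative choices, not from the article:

```python
import numpy as np

# Toy data: y = 3x + small noise. The true weight 3.0 is an arbitrary choice.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 1))
y = 3.0 * X[:, 0] + 0.01 * rng.normal(size=256)

w = 0.0
lr, batch_size, epochs = 0.1, 64, 50

for epoch in range(epochs):
    idx = rng.permutation(len(X))              # shuffle each epoch, as in the pseudocode above
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        pred = w * X[batch, 0]
        # Gradient of mean squared error with respect to w on this mini-batch
        grad = 2 * np.mean((pred - y[batch]) * X[batch, 0])
        w -= lr * grad

print(w)  # w should approach 3.0
```

Because each update uses only 64 samples, every gradient is cheap to compute, yet averaging over the batch keeps the update far less noisy than single-sample SGD.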
In all three optimizers above, the learning rate is constant, and its value strongly affects performance: too low and convergence is slow, too high and the loss can oscillate or diverge. The graph below summarizes how the loss behaves at different learning-rate values.
So there are optimizers that adapt the learning rate on the fly. Let's discuss those.
Momentum based
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
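A minimal sketch of the momentum update: a velocity term accumulates an exponentially decaying sum of past gradients, so consistent directions build up speed while oscillating directions cancel out. The helper name and the values of the learning rate and γ below are illustrative, not from the article:

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update; gamma controls how much past gradients persist."""
    velocity = gamma * velocity + lr * grad   # decaying accumulation of past gradients
    params = params - velocity                # move along the accumulated direction
    return params, velocity
```

For example, minimizing f(x) = x² (gradient 2x) with repeated calls to `momentum_step` drives x toward 0, with the characteristic damped oscillation around the minimum.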
Adagrad (Adaptive Gradient Algorithm)
The learning rate of each parameter is scaled by the inverse square root of the sum of the squares of all its past gradients. Hence a large accumulated gradient results in a small effective learning rate, while a small accumulated gradient results in a large one.
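A minimal sketch of the Adagrad update, assuming a per-parameter running sum of squared gradients; the function name and default values are illustrative:

```python
import numpy as np

def adagrad_step(params, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One Adagrad update; grad_sq_sum accumulates squared gradients and never decays."""
    grad_sq_sum = grad_sq_sum + np.square(grad)
    # Effective learning rate shrinks as the accumulated squared gradient grows
    params = params - (lr / (np.sqrt(grad_sq_sum) + eps)) * grad
    return params, grad_sq_sum
```

Because the accumulator only grows, the effective learning rate decays monotonically, which is Adagrad's main weakness and the motivation for RMSprop below.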
RMSprop (Root Mean Square Propagation)
RMSprop divides the learning rate by an exponentially decaying average of squared gradients. [γ = 0.9, learning rate = 0.001]
eps, gamma, eta = 1e-8, 0.9, 0.001
expected_grad = 0
for epoch in range(epochs):
    for batch in get_batches(data, batch_size=64):
        params_gradient = evaluate_gradient(loss_function, batch, params)
        # Exponentially decaying average of squared gradients
        expected_grad = gamma * expected_grad + (1 - gamma) * np.square(params_gradient)
        RMS_grad = np.sqrt(expected_grad + eps)
        params = params - (eta / RMS_grad) * params_gradient
Adam (Adaptive Moment Estimation)
A separate learning rate is maintained for each network weight and adapted as learning progresses.
alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8
m_t, v_t = 0, 0
for t in range(1, epochs + 1):
    params_gradient = evaluate_gradient(loss_function, data, params)
    m_t = beta1 * m_t + (1 - beta1) * params_gradient             # first moment (mean) estimate
    v_t = beta2 * v_t + (1 - beta2) * np.square(params_gradient)  # second moment estimate
    m_cap = m_t / (1 - beta1**t)   # bias correction for zero initialization
    v_cap = v_t / (1 - beta2**t)
    params = params - (alpha * m_cap) / (np.sqrt(v_cap) + epsilon)
Computationally less heavy
Memory efficient
Hyper-parameters require little tuning.