The Less The Loss, The Better… But How?

You can find this article's notebook on my Github.

Güldeniz Bektaş · Published in Analytics Vidhya · Jan 23, 2021 · 9 min read


Deep learning has achieved great success in recent years. As the data got bigger and the models got deeper, the optimization algorithms we use to reduce a model's loss have had to keep up.

Whether it's machine learning or deep learning, we always want to minimize the loss of our model. A smaller difference between the actual and predicted values increases the reliability of our future predictions.

The goal of optimization is to reduce the training error, but while doing so we should be careful about overfitting.

Loss indicates how well the model is doing: the lower the loss, the closer the predictions are to the actual values. By minimizing the loss, we improve the performance of the trained model. As the title of this article says, the less the loss, the better.

The process of minimizing (or maximizing) any mathematical expression is called optimization.

Optimizers are algorithms used to change attributes of the model, such as the weights and the learning rate, in order to reduce the loss.

The learning rate determines how big the steps will be in each iteration. If the steps are too big, we may miss the minimum; if they are too small, it takes a long time to find it. You can read this article for more detailed information about this topic.

I will mention the following optimization algorithms in this article.

  1. Stochastic Gradient Descent
  2. Mini Batch - Stochastic Gradient Descent
  3. Momentum
  4. AdaGrad
  5. AdaDelta
  6. RMSprop
  7. Adam
  8. Gradient Descent

I have already written about the 8th one, Gradient Descent, in a separate article. I will leave the link for you to read it.

Let’s start then!

1. Stochastic Gradient Descent

Unlike gradient descent, it uses randomly selected samples from the original dataset instead of the whole dataset in each iteration. Using the entire dataset reaches the minimum along a less random path, but nowadays this is very costly as datasets keep growing.

The SGD update rule is as follows:

θ = θ − α · ∂J(θ; x(i), y(i))/∂θ

where (x(i), y(i)) is the randomly chosen training example at that iteration.

Since it progresses by taking a random sample from the dataset in each iteration, the path to the minimum will be noisier. But that does not matter, as long as we reach the minimum.

Even though it requires more iterations to reach the minimum than typical Gradient Descent, it is still computationally much less expensive.
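
To make the update rule concrete, here is a minimal NumPy sketch of a single SGD step on one randomly drawn sample. The gradient(theta, x_i, y_i) function is a placeholder I assume returns ∂J/∂θ for that sample; it is not from the original notebook.

import numpy as np

def sgd_step(theta, x, y, gradient, lr=0.01):
    """One SGD update using a single randomly chosen sample."""
    i = np.random.randint(len(x))          # pick a random sample
    grad = gradient(theta, x[i], y[i])     # noisy gradient estimate for that sample
    return theta - lr * grad               # theta = theta - alpha * grad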

Advantages of SGD

  1. The memory requirement is lower than for Gradient Descent.
  2. Decreases overfitting by focusing on only a portion of the training set each step.

Disadvantages of SGD

  1. May get stuck at local minima.
  2. May take too long to complete one epoch.

2. Mini Batch Stochastic Gradient Descent

We already know about Gradient Descent and Stochastic Gradient Descent, but let's go over them again. Then I will mention Batch Gradient Descent, which I also want you to know about.

Gradient Descent performs its improvement step by examining the errors over the entire training dataset before each update.

Stochastic Gradient Descent, instead of using the entire dataset at each step, performs the optimization step with a random sample it draws from the data.

Batch Gradient Descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.[1]

Mini-Batch Gradient Descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.[2]

The MB-SGD algorithm is an extension of the SGD algorithm, and it overcomes SGD's problem of large time complexity. MB-SGD takes a batch of points, i.e. a subset of the dataset, to compute the derivative.

It is observed that the derivative of the loss function for MB-SGD is almost the same as the derivative of the loss function for GD after some number of iterations. But the number of iterations to achieve minima is large for MB-SGD compared to GD and the cost of computation is also large.[3]
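
As a rough illustration (not code from the notebook), one epoch of mini-batch SGD differs from plain SGD only in how many samples feed each gradient estimate; gradient(theta, x_batch, y_batch) below is a placeholder assumed to return the average gradient over the batch.

import numpy as np

def minibatch_sgd_epoch(theta, x, y, gradient, lr=0.01, batch_size=32):
    """One pass over the data, updating theta once per mini-batch."""
    indices = np.random.permutation(len(x))        # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]  # indices of this mini-batch
        theta = theta - lr * gradient(theta, x[batch], y[batch])
    return theta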


Advantages of MB-SGD

  1. It takes less time to reach the minima compared to the SGD algorithm.
  2. It is computationally more efficient than SGD.

Disadvantages of MB-SGD

  1. May get stuck at local minima.
  2. Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.[4]

3. Momentum

Since MB-SGD updates the parameters at every iteration, the path it takes will oscillate. Momentum also takes previous gradients into account when updating the parameters.

A major disadvantage of the MB-SGD algorithm is that updates of weight are very noisy. SGD with momentum overcomes this disadvantage by denoising the gradients. Updates of weight are dependent on noisy derivative and if we somehow denoise the derivatives then converging time will decrease. The idea is to denoise derivative using exponential weighting average that is to give more weightage to recent updates compared to the previous update. It accelerates the convergence towards the relevant direction and reduces the fluctuation to the irrelevant direction. One more hyperparameter is used in this method known as momentum symbolized by ‘γ’.[5]

V(t) = γ · V(t−1) + α · ∂J(θ)/∂θ
θ = θ − V(t)

The momentum term γ is usually set to 0.9 or a similar value. Momentum at time ‘t’ is computed using all previous updates, giving more weight to recent updates than to older ones. This speeds up convergence.

You can think of it simply as a ball rolling down a hill: its speed changes according to the slope.
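
Putting the two momentum equations together, here is a minimal sketch of one update step, where grad stands in for ∂J(θ)/∂θ:

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """SGD with momentum: V(t) = gamma * V(t-1) + alpha * grad, then theta = theta - V(t)."""
    velocity = gamma * velocity + lr * grad   # exponentially weighted history of gradients
    theta = theta - velocity                  # move against the smoothed gradient
    return theta, velocity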

4. AdaGrad

While other optimization algorithms keep the learning rate constant, AdaGrad assigns an adaptive learning rate to each weight. It performs smaller updates for parameters associated with frequently occurring features and larger updates for parameters associated with infrequent features, which makes the algorithm well suited for sparse data. AdaGrad has been used for training large-scale neural nets at Google, and it was also used to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.[6]

I will not go into the mathematics of this too much. If you want to learn more about that topic, you can read more at Source 6.
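
For the intuition without the derivation, here is a rough sketch of the AdaGrad update (my own illustration, not part of the notebook): each parameter's effective learning rate shrinks as its squared gradients accumulate.

import numpy as np

def adagrad_step(theta, grad_accum, grad, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients and scale the step per parameter."""
    grad_accum = grad_accum + grad ** 2                       # running sum of squared gradients
    theta = theta - lr * grad / (np.sqrt(grad_accum) + eps)   # smaller steps for frequently updated parameters
    return theta, grad_accum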

Advantages of AdaGrad

  1. No need to tune the learning rate manually, since it adapts automatically during training.

Disadvantages of AdaGrad

  1. As the number of iterations becomes very large, the learning rate decreases to a very small value, which leads to slow convergence.

5. AdaDelta

With AdaGrad, it takes too long to converge to the minimum because the learning rate becomes very small as the number of iterations increases. AdaDelta adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, AdaDelta continues learning even when many updates have been done. Compared to AdaGrad, the original version of AdaDelta does not even require an initial learning rate to be set, as it has been eliminated from the update rule.
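
A minimal sketch of one AdaDelta step, assuming the standard formulation with decay rate ρ and running averages of squared gradients and squared updates (not code from the notebook):

import numpy as np

def adadelta_step(theta, eg2, edx2, grad, rho=0.95, eps=1e-6):
    """AdaDelta: replace the learning rate with a ratio of running RMS values."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                   # decaying average of squared gradients
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad  # step computed without a learning rate
    edx2 = rho * edx2 + (1 - rho) * delta ** 2                # decaying average of squared updates
    return theta + delta, eg2, edx2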


6. RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class. RMSprop divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting γ to 0.9, while a good default value for the learning rate η is 0.001. RMSprop was developed from the need to resolve AdaGrad's radically diminishing learning rates.
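
A minimal sketch of one RMSprop step, using Hinton's suggested defaults (γ = 0.9, η = 0.001); grad again stands in for the current gradient:

import numpy as np

def rmsprop_step(theta, eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: divide the learning rate by a decaying RMS of the gradients."""
    eg2 = gamma * eg2 + (1 - gamma) * grad ** 2       # exponentially decaying average of squared gradients
    theta = theta - lr * grad / (np.sqrt(eg2) + eps)  # per-parameter adaptive step
    return theta, eg2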


7. ADAM

Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum. Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients mt, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.[7]
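
A minimal sketch of one Adam step combining the two ideas, using the commonly cited defaults (β1 = 0.9, β2 = 0.999); this is my own illustration, not code from the notebook:

import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment plus RMSprop-style second moment,
    both bias-corrected. `t` is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v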

Implementation Time!

I used Google Colab for this work. You can use this link to use Colab GPU free of charge!

Now it's time for the implementation. We will soon see and understand the differences between the optimization algorithms.

We will use the MNIST dataset for this part.

This implementation was adapted from here.

# importing libraries
import keras
from keras.datasets import mnist
from keras.models import load_model
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
from keras import optimizers
from keras.callbacks import ReduceLROnPlateau
import tensorflow as tf
from keras.layers import *
import matplotlib.pyplot as plt

# download dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# reduce the learning rate when the validation loss stops improving
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=1e-5)

# setting necessary parameters
batch_size = 128
num_classes = 10
epochs = 20
w_l2 = 1e-5

Preparing the train and test sets:

# reshaping according to the backend's image data format
img_rows, img_cols = 28, 28
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# scaling pixel values to [0, 1]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

print(f"x_train shape : {x_train.shape}")
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# converting class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

The output shows x_train shape : (60000, 28, 28, 1), with 60000 train samples and 10000 test samples.

# building a convolutional neural network
from keras import regularizers

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), kernel_regularizer=regularizers.l2(w_l2),
                 input_shape=input_shape))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), kernel_regularizer=regularizers.l2(w_l2)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, kernel_regularizer=regularizers.l2(w_l2)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

Stochastic Gradient Descent:

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(),
              metrics=['accuracy'])
model.summary()

# training the model
hist_SGD = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                     verbose=1, validation_data=(x_test, y_test), callbacks=[reduce_lr])

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss: ', score[0])
print('Test accuracy: ', score[1])

After training, my results were:

Test loss: 0.037460893392562866

Test accuracy: 0.9887999892234802

ADAM

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
model.summary()

# training the model
hist_adam = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                      verbose=1, validation_data=(x_test, y_test), callbacks=[reduce_lr])
score = model.evaluate(x_test, y_test, verbose = 0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

After training the model:

Test loss: 0.039476025849580765

Test accuracy: 0.9930999875068665

RMSprop

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.RMSprop(),
              metrics=['accuracy'])
model.summary()

hist_RMSprob = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                         verbose=1, validation_data=(x_test, y_test), callbacks=[reduce_lr])
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

After training:

Test loss: 0.048933401703834534

Test accuracy: 0.9932000041007996

AdaGrad

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adagrad(),
              metrics=['accuracy'])
model.summary()

hist_adagrad = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                         verbose=1, validation_data=(x_test, y_test), callbacks=[reduce_lr])
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

After training:

Test loss: 0.048635631799697876

Test accuracy: 0.9932000041007996

AdaDelta

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model.summary()

hist_adadelta = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                          verbose=1, validation_data=(x_test, y_test), callbacks=[reduce_lr])
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

After training:

Test loss: 0.04866880923509598

Test accuracy: 0.9932000041007996

Plotting The Results

I ran into an issue with the code below: in TensorFlow 2+, the metric key ‘acc’ became ‘accuracy’, which is why the original code didn't work for me. I changed ‘acc’ to ‘accuracy’.

hists = [hist_adam, hist_SGD, hist_RMSprob, hist_adadelta, hist_adagrad]
plot_history(hists, attribute='accuracy', axis=(-1, 21, 0.965, 1.0), loc='lower right')
plot_history(hists, attribute='loss', axis=(-1, 21, 0.009, 0.09), loc='lower right')
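
Note that plot_history is a helper from the notebook this implementation was adapted from, not a Keras function. A minimal sketch of what such a helper might look like, assuming it plots the validation curve of each History object (only the signature matches the calls above; the body is my assumption):

import matplotlib.pyplot as plt

def plot_history(hists, attribute='accuracy', axis=None, loc='lower right'):
    """Plot the validation curve of `attribute` for each Keras History object."""
    names = ['Adam', 'SGD', 'RMSprop', 'AdaDelta', 'AdaGrad']  # order of `hists` above
    for hist, name in zip(hists, names):
        plt.plot(hist.history['val_' + attribute], label=name)
    if axis is not None:
        plt.axis(axis)           # (xmin, xmax, ymin, ymax)
    plt.xlabel('epoch')
    plt.ylabel(attribute)
    plt.legend(loc=loc)
    plt.show()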

As you can see in the plots above, the Adam optimizer performs better than the other optimizers, though only by a small margin.

You can find this article's notebook on my Github.

REFERENCES

Source 1

Source 2

Source 3

Source 4

Source 5

Source 6

Source 7
