Bharti Kindra
Sep 25, 2022 · 5 min read
  • Basics of Deep Learning 1/5
Image source: https://www.databricks.com/glossary/neural-network

This series of articles is based on the first course I took in Deep Learning, at Kaggle, which gave me basic insights into constructing a sequential neural network for regression and classification. It was also my introduction to Keras and TensorFlow. Basic concepts, as well as their implementation in Keras, a deep learning API, are explained from scratch.

The topics covered in this series of five articles are:

  1. What is a sequential layer
  2. Stochastic Gradient Descent
  3. Optimization
  4. Overfitting and Underfitting
  5. Dropout
  6. Batch Normalization

In this particular article, Stochastic Gradient Descent has been discussed in detail.

Sequential Neural Network

The basic unit of a neural network is a perceptron, which accepts inputs from the user or from other perceptrons, applies a defined function, and returns an output. The most basic network is the Sequential Neural Network.

It is a plain stack of layers where each layer has exactly one input tensor and one output tensor.

Figure 1 shows a sequential network with one input layer and one output. Since every neuron in a layer is connected to every neuron in the adjacent layer, this is also a dense neural network. Mathematically, this network is: y = b + w₁x₁ + w₂x₂.

Figure 1: Sequential Network

In Keras, this will be implemented as:
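A minimal sketch of this single-unit network, assuming the tensorflow.keras API (the variable name model is illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A single dense unit with two inputs: y = b + w1*x1 + w2*x2
model = keras.Sequential([
    layers.Dense(units=1, input_shape=[2])
])

model.summary()  # 2 weights + 1 bias = 3 trainable parameters
```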

The Sequential class in Keras lets us add as many layers as we need. Here, the model summary shows the shape and number of parameters in the layer: 2 weights and 1 bias. The parameters of a layer are defined in terms of a vector as:

weights = [b, w1, w2]; input = [1, x1, x2]. Then y = weightsᵀ · input.

The ultimate aim of the network is to find the values of these parameters.

The network starts by assigning random values to the weights and bias, as shown in Figure 2. It then uses iterative methods to reach the actual values.

Figure 2: Initialisation of parameters

Now that we have created 1 layer, we can build a sequential model by adding more layers.

Figure 3: Sequential Model

Since it is a dense network, input_shape is only defined for the first layer; each subsequent layer automatically takes the number of neurons in the previous layer as its input shape. To understand the number of parameters:

for the dense_908 layer, input = 11 and output = 20, which implies 20 × 11 = 220 weights and 20 biases, i.e., 240 parameters.
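As a hedged sketch of such a stacked model, assuming 11 input features and a 20-unit hidden layer as in the summary above (the extra layers and activations are illustrative additions, not taken from the figure):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # input_shape is given only for the first layer
    layers.Dense(units=20, activation='relu', input_shape=[11]),  # 11*20 + 20 = 240 parameters
    layers.Dense(units=20, activation='relu'),                    # 20*20 + 20 = 420 parameters
    layers.Dense(units=1),                                        # 20*1  + 1  =  21 parameters
])

model.summary()
```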

Stochastic Gradient Descent

Once the layers have been defined, the model automatically initialises the parameters. The next part is to quantify the objective; this is done using a loss function, which defines how close the predicted values are to the actual values in regression, and how accurately the model predicts the class in classification. The objective is then to minimise the loss.

Different problems require different loss functions. Keras provides a variety of loss functions to use; check the documentation. In this article we use the Mean Absolute Error (MAE) for regression and accuracy for classification. These are defined as:

MAE = (1/n) ∑ᵢ |y_predᵢ − y_actualᵢ|

Accuracy = (# correctly predicted samples in the test set) / (total number of samples)
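As a sketch, the loss is attached to the model built above via compile; the strings below use Keras' built-in names, and the classification variant is shown only as a commented alternative:

```python
# Regression: minimise the mean absolute error
model.compile(optimizer='sgd', loss='mae')

# Binary classification would instead use a cross-entropy loss
# and track accuracy as a metric:
# model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
```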

SGD is an iterative process that uses the loss for a given set of parameters and shifts the values of the parameters in the direction of maximum negative change, i.e., along the negative gradient. This is called gradient descent (a good read).

Stochastic Gradient Descent is an improvement on general gradient descent.

SGD is like general GD, but it uses a random subset of the training data in every iteration. Because of this, the fluctuations in the loss value across iterations are larger in SGD, but the results are similar and the overall process is faster.

The subset used in an iteration is called a minibatch, while one complete pass over the training data is referred to as an epoch.

Hyperparameters for SGD

SGD is a class of gradient descent methods, and there are several ways to execute it. These can be defined by changing some hyperparameters:

Learning Rate

Mathematically, Gradient Descent is given as,

w → w′ = w − η ∇L(w)

Here, w is the vector of weights, L is the loss, and η is the learning rate. It defines the amount of change in the weights at each step, and it is generally a small floating-point number, often as small as 0.001.
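As a toy illustration of this update rule (the quadratic loss and all values here are made up for the example, not part of Keras):

```python
import numpy as np

w_true = np.array([1.0, 2.0])   # the parameters we want to recover
w = np.random.randn(2)          # random initialisation
eta = 0.001                     # learning rate

# Loss L(w) = ||w - w_true||^2, whose gradient is 2 * (w - w_true)
for step in range(10_000):
    grad = 2 * (w - w_true)
    w = w - eta * grad          # w -> w' = w - eta * grad L(w)
```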

Figure 4: The parabola represents the value of the loss, where the axis is the hyperplane of weights.

As shown in Figure 4, the arrows represent updated values of the weights. The amount of change is proportional to the gradient; therefore, the change decreases near the optimal point (star).

This may seem like an infinite process, as the change keeps decreasing near the target.

This suggests that, for better performance, it is better to have a large learning rate initially and a smaller rate later. In Keras this can be implemented using learning-rate schedules (or the LearningRateScheduler callback). Several ready-made schedules are available, such as:

  1. ExponentialDecay, which takes as input an initial learning rate, the number of decay steps, and a decay rate.
Figure 5: Learning rate for different values of the decay rate

2. PiecewiseConstantDecay, which behaves like a step function. It takes as input the boundaries and a vector of learning-rate values, one for each interval defined by the boundaries (a code sketch for both schedules follows Figure 6). For example,

boundaries = [100, 200, 300]

learning rates = [0.1, 0.05, 0.01, 0.001]

Figure 6: PiecewiseConstantDecay
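A sketch of both schedules, assuming the tf.keras.optimizers.schedules API; the decay_steps value and the optimizer they are plugged into are illustrative choices:

```python
import tensorflow as tf

# Learning rate decays exponentially with the number of training steps
exp_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.9)

# Learning rate drops in steps at the given boundaries
pw_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[100, 200, 300],
    values=[0.1, 0.05, 0.01, 0.001])

# Either schedule can be passed in place of a fixed learning rate
optimizer = tf.keras.optimizers.SGD(learning_rate=pw_schedule)
```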

Momentum

The number of epochs needed can be further reduced by using momentum while updating the values of the weights. The updated weights with momentum are given by:

δ(i+1) = β δ(i) + ∇L(w(i))

w(i+1) = w(i) − η δ(i+1) = w(i) − η ∇L(w(i)) − η β δ(i)

Here, in addition to the original gradient term, we have a term that is a moving average of previous updates. Thus, it provides acceleration if the change keeps pointing in one direction. On the other hand, if at some step the weights are updated in the opposite direction, momentum provides friction. This article provides a good visualization of the impact of different values of momentum and learning rate.
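In Keras, momentum is an argument of the SGD optimizer; a minimal sketch (the 0.9 value is a common choice, not taken from the article):

```python
import tensorflow as tf

# SGD with momentum: keeps a moving average of past gradients
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss='mae')
```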

Batch Size

Since SGD uses a mini-batch in every iteration, the size of the mini-batch needs to be defined.

Epoch

The number of epochs, i.e., the number of complete passes over the training data, also needs to be defined.
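Both of these are passed to fit when training; as a sketch, where X_train, y_train and the chosen values are placeholders:

```python
# Train with mini-batches of 32 samples, for 50 passes over the data
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=50,
)
```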