[DL] 2. Feed Forward Network and Gradient Descent


1. Neural Network

A neural network is composed of three parts: (1) an input layer 𝒙, (2) hidden layers 𝒚, and (3) an output layer 𝒛. The nodes in figure 1 represent the input, hidden, and output variables, whereas the edges connecting nodes in neighboring layers denote the weights 𝒘.

Since the flow of data (information) is from input to output, we call this a feed-forward network.

figure 1. from Bishop

The mathematical representation of the nodes in each layer is as follows.

figure 2. Definition of activation a and node z
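
In case the image does not render, the definitions in figure 2 follow Bishop's standard notation: each unit j first forms a weighted sum of its inputs, called the activation a_j, and then passes it through the activation function h to produce the node value z_j:

a_j = \sum_i w_{ji} z_i + w_{j0}, \qquad z_j = h(a_j)

where the z_i are the outputs of the previous layer (or the inputs x_i for the first hidden layer) and w_{j0} is a bias term.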

The activation function h should be non-linear and differentiable so that gradients (errors) can be backpropagated, which we will discuss in a later chapter. Typical examples of activation functions include tanh and ReLU.
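
For concreteness, these two functions are defined as (standard definitions, not taken from the original figures):

\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}, \qquad \mathrm{ReLU}(a) = \max(0, a)

Note that ReLU is differentiable everywhere except at a = 0, where a subgradient is used in practice.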

Why should the activation function h be non-linear?

figure 3. Rearranging the equation for node y by stacking the previous layers

Suppose there are N layers in the neural network; the output 𝒚 is then the result of passing the input 𝒙 through the preceding N layers. If all of our activation functions are linear, then each layer computes a linear transformation of its input, and the composition of N linear transformations is itself a single linear transformation: W_N(\cdots(W_2(W_1 𝒙))) = (W_N \cdots W_2 W_1)𝒙. Applying this N times, as shown in figure 3, we eventually end up with the representation of one linear layer. In other words, if we use a linear function as the activation function, the point of stacking layers vanishes. Therefore, we use non-linear activation functions in neural networks.
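
A quick NumPy check of this collapse (an illustrative sketch, not from the original post; the matrices and shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with linear (identity) activations.
W1 = rng.standard_normal((4, 3))  # first-layer weights
W2 = rng.standard_normal((2, 4))  # second-layer weights
x = rng.standard_normal(3)        # input vector

# Passing x through two linear layers...
two_layers = W2 @ (W1 @ x)

# ...is exactly one linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```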

2. What makes our learning fail?

There are various causes, but typical reasons include:

  • Too few hidden units (low model capacity)
  • The optimization algorithm cannot find appropriate weight values
  • Overfitting
  • Underfitting
  • A lack of deterministic relationships between inputs and outputs

3. How to train Feed-Forward Networks

Error Functions

Suppose our learning algorithm for the given task is based on supervised learning. Then we have a training set consisting of pairs of input vectors {x(𝒏)} and corresponding target vectors t(𝒏).

Let the error function E(w) of the weights w, which measures the network's predictions against the targets, be defined as in figure 4.

figure 4. Error function E(w)
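
Assuming figure 4 shows Bishop's sum-of-squares error (the standard choice in PRML chapter 5), it reads:

E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\| y(x_n, w) - t_n \right\|^2

i.e. half the squared distance between the network output y(x_n, w) and the target t_n, summed over all N training examples.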

E(w) grows as our model makes more wrong predictions 𝒚. Therefore, our aim is to minimize E(w), which corresponds to making fewer wrong predictions. In other words, the optimal weight values w are those that minimize the error function.

How do we obtain the optimal weights w?

figure 5. from Bishop, the error function E(w) over weight space

Given that the error function E(w) is smooth in weight space (this is why our activation function needs to be differentiable), the smallest value of E(w) occurs at a point where its gradient with respect to w vanishes, i.e. \nabla E(w) = 0.

In the figure above, there are two critical points, A and B, where the derivative of E(w) with respect to w is zero. Since B is the point where E takes its smallest value, we call it the global minimum, whereas A is a local minimum.

Gradient Descent Optimization

figure 6. Weight update in the Gradient Descent Optimization
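
Assuming figure 6 shows the standard gradient-descent update rule (as in Bishop), it reads:

w^{(\tau+1)} = w^{(\tau)} - \eta \, \nabla E(w^{(\tau)})

where \tau indexes the iteration and \eta > 0 is the learning rate (step size).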

We will discuss why w is updated in the direction of the negative gradient in the next chapter.

E(w) can be evaluated using the entire training set, meaning that it is the sum of the errors evaluated on every example from the first to the last in the dataset.
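
In symbols (a standard decomposition, matching Bishop's notation):

E(w) = \sum_{n=1}^{N} E_n(w)

where E_n(w) is the error contributed by the n-th training example.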

Stochastic Gradient Descent (SGD)

In SGD, we use only one example, or a subset of the training dataset, to update the parameters at each iteration. When a subset is used, the method is called minibatch SGD. Since we randomly select the subset of training samples out of the whole training set, the method is stochastic. Gradient descent optimization, on the other hand, uses ALL training samples to make one update, as sketched below.
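
A minimal NumPy sketch of the two update schemes, assuming a linear least-squares model for illustration (the model, function names, and hyperparameters are illustrative, not from the original post):

```python
import numpy as np

def gd(X, t, lr=0.1, iters=500):
    """Full-batch gradient descent: every update sees ALL samples."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - t) / len(t)  # gradient over the whole set
        w -= lr * grad                     # one update per full pass
    return w

def minibatch_sgd(X, t, lr=0.1, batch_size=8, epochs=50, seed=0):
    """Minibatch SGD: each update sees only a random subset of samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(t))  # shuffle once per epoch
        for start in range(0, len(t), batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - t[idx]) / len(idx)
            w -= lr * grad  # one (noisy) update per minibatch
    return w

# Toy data: t = X @ [2, -3] plus a little noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
t = X @ np.array([2.0, -3.0]) + 0.01 * rng.standard_normal(200)

print(gd(X, t))             # both should approach [ 2., -3.]
print(minibatch_sgd(X, t))
```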

In practice, SGD is more commonly used than full-batch gradient descent because it typically converges to a good solution much faster. The parameter trajectory under SGD oscillates more and is not as smooth as the one produced by gradient descent, but the noisy gradients still take us to (a close approximation of) the optimal parameter values.

figure 7. gradients from GD (left) and SGD (right), images from here

4. References

[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

[2] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

[3] http://www.holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html

Any corrections, suggestions, and comments are welcome.

The contents of this article are based on Bishop [1] and Goodfellow [2].
