In-depth explanation of feedforward in neural networks, mathematically

Aung Kyaw Myint · Published in Analytics Vidhya · 6 min read · Nov 25, 2019


To explain the feedforward process, let’s look at the basic model of an artificial neural network, where we have only a single hidden layer. The inputs are each connected to the neurons in the hidden layer, and the neurons in the hidden layer are each connected to the neurons in the output layer, where each neuron represents a single output. We can look at it as a collection of mathematical functions. Each input is connected mathematically to a hidden layer of neurons through a set of weights we need to modify, and each hidden layer neuron is connected to the output layer in a similar way.

There’s no limit to the number of inputs, the number of hidden neurons in a layer, or the number of outputs, nor are there any correlations between these numbers, so we can have n inputs, m hidden neurons and k outputs. Looking more closely, we can see that each input is multiplied by its corresponding weight, and the results are summed at the next layer’s neurons together with a bias. The bias is an external parameter of the neuron and can be modeled by adding an external fixed-value input. This entire summation will usually go through an activation function on its way to the next layer or to the output.
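
As a minimal sketch (my own illustration, not from the article), assuming a sigmoid activation and made-up numbers, a single hidden neuron’s computation could look like this in NumPy:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical numbers: 4 inputs feeding one hidden neuron
x = np.array([0.5, -1.2, 3.0, 0.7])   # input vector
w = np.array([0.1, 0.4, -0.2, 0.3])   # one weight per input
b = 0.05                              # bias, modeled as a fixed external input

z = np.dot(x, w) + b    # weighted sum of inputs plus bias
a = sigmoid(z)          # activation passed on to the next layer
print(a)
```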

Our goal is to design the system in such a way that it gives us the correct output y for a specific input x. Essentially, what we really want is to find the optimal set of weights connecting the input to the hidden layer and the optimal set of weights connecting the hidden layer to the output.

To do this, we need to start the training phase, in which we find the best set of weights for our system. This phase includes two steps: feedforward and backpropagation. In the feedforward part, we calculate the output of the system. The output is compared to the correct output, giving us a measure of the error. In the backpropagation part, we change the weights as we try to minimise the error, and then start the feedforward pass again, until we find the best set of weights for the system.

Mathematical explanation of feedforward process

Calculating the value of the hidden states

To make the calculations easier, we will decide to have n inputs, 3 neurons in a single hidden layer and two outputs. In practice, we can have thousands of neurons in a single hidden layer. We will use W_1 as a set of weights from x to h and W_2 as a set of weights from h to y. Since we have only one hidden layer, we will have only two steps in each feedforward cycle.

Notice that both the hidden layer and the output layer are displayed as vectors, as they are both represented by more than a single neuron.

Other than the use of non-linear activation functions, all the calculations involve linear combinations of inputs and weights; in other words, we will use matrix multiplication. Links for linear combination and matrix multiplication.
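
To make the dimensions concrete, here is a small sketch of the shapes involved, with n = 4 chosen arbitrarily and random weights standing in for the ones training would find:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4                                  # number of inputs (arbitrary choice)
x = rng.standard_normal(n)             # input vector x, shape (n,)
W_1 = rng.standard_normal((n, 3))      # weights from the inputs to the 3 hidden neurons
W_2 = rng.standard_normal((3, 2))      # weights from the hidden layer to the 2 outputs

print(x.shape, W_1.shape, W_2.shape)   # (4,) (4, 3) (3, 2)
```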

Step 1 (Finding h)

If we have more than one neuron in the hidden layer, h is actually a vector. Each component of the input vector is connected to each neuron in the hidden layer. The vector h′ of the hidden layer is calculated by multiplying the input vector by the weight matrix W_1 in the following way:

$\bar{h}' = \bar{x}\,W_1$

Using vector by matrix multiplication, we can look at this computation the following way

$\bar{h}' = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ \vdots & \vdots & \vdots \\ W_{n1} & W_{n2} & W_{n3} \end{pmatrix}$

Equation 1
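
In NumPy, this vector-by-matrix multiplication is a single call; the sketch below uses the same arbitrary n = 4 and random stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)            # input vector, shape (n,)
W_1 = rng.standard_normal((n, 3))     # weight matrix, shape (n, 3)

h_prime = x @ W_1                     # h' = x W_1, shape (3,): one value per hidden neuron
print(h_prime)
```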

After finding h′, we need an activation function (Φ) to finalize the computation of the hidden layer’s values and to make sure the values of h do not explode and grow too large. This activation function can be a hyperbolic tangent, a sigmoid or a ReLU function.
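
For reference, the three activation functions mentioned above can be written as short NumPy functions (a sketch, not code from the article):

```python
import numpy as np

def tanh(z):
    # hyperbolic tangent: squashes values into (-1, 1)
    return np.tanh(z)

def sigmoid(z):
    # sigmoid: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # ReLU: passes positive values through, zeroes out negative ones
    return np.maximum(0.0, z)

h_prime = np.array([0.8, -1.5, 2.3])   # example h' values
h = sigmoid(h_prime)                   # h = Phi(h')
print(h)
```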

We can use the following two equations to express the final hidden vector h¯:

$\bar{h} = \Phi(\bar{x}\,W_1)$

or

$\bar{h} = \Phi(\bar{h}')$

Since $W_{ij}$ represents the weight component in the weight matrix connecting neuron i of the input to neuron j of the hidden layer, we can also write these calculations in the following way (notice that in this example we have n inputs and only 3 hidden neurons):

$h_1 = \Phi(x_1 W_{11} + x_2 W_{21} + \cdots + x_n W_{n1})$
$h_2 = \Phi(x_1 W_{12} + x_2 W_{22} + \cdots + x_n W_{n2})$
$h_3 = \Phi(x_1 W_{13} + x_2 W_{23} + \cdots + x_n W_{n3})$

Equation 2

More information on the activation functions and how to use them can be found here.

Step 2 (Finding y)

The process of calculating the output vector is mathematically similar to that of calculating the vector of the hidden layer. We use, again, a vector by matrix multiplication, which can be followed by an activation function. The vector is the newly calculated hidden layer and the matrix is the one connecting the hidden layer to the output.

Essentially, each new layer in a neural network is calculated by a vector-by-matrix multiplication, where the vector represents the inputs to the new layer and the matrix is the one connecting these new inputs to the next layer.

In our example, the input vector is $\bar{h}$ and the matrix is $W_2$, therefore $\bar{y} = \bar{h}\,W_2$. In some applications it can be beneficial to use a softmax function (if we want all output values to be between zero and one, and their sum to be one).

$\text{softmax}(\bar{y})_i = \dfrac{e^{y_i}}{\sum_{j} e^{y_j}}$

Softmax Function
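
A minimal sketch of Step 2 in NumPy, with a simple softmax implementation; h and W_2 are random stand-ins:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.random(3)                    # hidden-layer vector from Step 1
W_2 = rng.standard_normal((3, 2))    # weights from the hidden layer to the 2 outputs

y = h @ W_2                          # y = h W_2, shape (2,)
p = softmax(y)                       # values in (0, 1) that sum to 1
print(y, p, p.sum())
```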

To get a good approximation of the output y, we need more than one hidden layer, possibly even thousands of them. Essentially, you can look at these neurons as building blocks that can be stacked.
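
The stacking idea can be sketched as a loop over layers, where each layer is just another vector-by-matrix multiplication followed by an activation; the layer sizes and weights below are arbitrary placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 3, 2]           # input, two hidden layers, output (arbitrary sizes)
weights = [rng.standard_normal((a, b))
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

a = rng.standard_normal(layer_sizes[0])   # input vector
for W in weights:
    a = sigmoid(a @ W)                    # each layer: multiply by its matrix, then activate
print(a)                                  # final output vector
```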

Here, we didn’t emphasise the bias input. The bias does not change any of these calculations. Simply consider it a constant input, usually 1, that is also connected to each of the neurons of the hidden layer by a weight. The only difference between the bias and any other input is the fact that it remains the same while each of the other inputs changes. And just as with all the other inputs, the weights connecting it to the next layer are updated as well.
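
Including the bias simply adds a constant term to each neuron’s sum before the activation; a sketch with made-up values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)            # input vector
W_1 = rng.standard_normal((4, 3))     # input-to-hidden weights
b_1 = rng.standard_normal(3)          # one bias weight per hidden neuron

h = sigmoid(x @ W_1 + 1.0 * b_1)      # the bias "input" is the constant 1
print(h)
```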

Our goal is to find the best set of weights that will give us the desired outputs for a specific input. In the training phase, we actually know the output of a given input. We calculate the output of the system in order to adjust the weights. We do that by finding the error and trying to minimise it. Each iteration of the training phase will decrease the error just a bit, until we eventually decide that the error is small enough.

The two error functions that are most commonly used are the Mean Squared Error (MSE) (usually used in regression problems) and the cross entropy (usually used in classification problems).

In the above calculations we used a variation of the MSE.
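
Both error functions are one-liners in NumPy; the sketch below uses made-up target and prediction vectors:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # cross entropy for probability outputs (e.g. after a softmax)
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0])          # correct output (one-hot for the second class)
y_pred = np.array([0.25, 0.75])        # network output after a softmax

print(mse(y_true, y_pred))             # typically used in regression
print(cross_entropy(y_true, y_pred))   # typically used in classification
```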

The next articles will focus on the backpropagation process, or what we also call stochastic gradient descent with the use of the chain rule.

A link to an in-depth explanation of backpropagation can be found here.

Content Credit: Udacity Deep Learning Program
