Feed forward and back propagation back-to-back — Part 3 (Neural Network’s forward pass)

Bruno Osiek · Published in Analytics Vidhya · 7 min read · Aug 1, 2020

Preface

In part 1 of this series (Linear Equation as a Neural Network building block) we saw what linear equations are and also had a glimpse of their importance in building neural nets.

In part 2 (Linear equation in multidimensional space) we saw how to work with linear equations in vector space, which makes it easy to work with many variables.

Now I will show you how one linear equation can be embedded into another one (mathematically this is known as Function Composition) to structure a neural network. I will then proceed to show how linear combinations of weight matrices and feature vectors handle all the maths involved in the feed forward pass, ending this story with a working example.

This series has a mantra (note of comfort) that I will repeat below in case the eventual reader hasn’t read part 1.

Mantra: A note of comfort to the eventual reader

I won’t let concepts like gradient and gradient descent, calculus and multivariate calculus, derivatives, the chain rule, linear algebra, linear combination and linear equation become boulders blocking your path to understanding the math required to master neural networks. By the end of this series, hopefully, the reader will perceive these concepts as the powerful tools they are and see how simply they apply to building neural networks.

Function composition

If linear equations are neural networks’ building blocks, function composition is what binds them. Great! But what is function composition?

Let’s consider the two linear equations below:

Equation 1: Function composition
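Written out, using the same coefficients that appear in figure 1 below, the two equations are g(x) = a·x + b and f(x) = a₁·g(x) + b₁, so f is a linear equation whose input is the output of g.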

What is different here? The difference is that in order to calculate the value of f(x), for any given x, we first need to compute the value of g(x). This simple concept is known as function composition.

The above definition and notation, although correct, are not the ones commonly used. Function composition is normally defined as an operation in which the result of function g(x) is applied to function f(x), yielding a new function h(x). Thus: h(x) = f(g(x)). Another notation is (f ∘ g)(x) = f(g(x)). With it, the above equations are written as follows:

Equation 2: h(x) written as a composition of f(x) and g(x)
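Substituting g into f and expanding (still using the coefficients of figure 1): h(x) = f(g(x)) = a₁·(a·x + b) + b₁ = (a₁·a)·x + (a₁·b + b₁). Note that the composition of two linear equations is itself a linear equation, a fact worth keeping in mind for the discussion of activation functions later in this story.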

Schematically, the above function composition can be drawn as illustrated in figure 1.

Figure 1: Function composition described as a network

In the above picture x is the input, or the independent variable. This input is multiplied by the angular coefficient a and added to b, yielding g(x). In turn, g(x) is multiplied by a₁ and added to b₁, resulting in f(x).
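If you prefer to see this as code, here is a minimal Python sketch of figure 1; the coefficient values are arbitrary, chosen only for illustration:

```python
def g(x, a=2.0, b=1.0):
    """Inner linear equation: g(x) = a*x + b."""
    return a * x + b

def f(u, a1=0.5, b1=3.0):
    """Outer linear equation: f(u) = a1*u + b1."""
    return a1 * u + b1

def h(x):
    """The composition (f o g)(x) = f(g(x))."""
    return f(g(x))

print(h(4.0))  # f(g(4)) = 0.5 * (2*4 + 1) + 3 = 7.5
```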

I find this really cool! Aren’t we getting closer to a neural network?

Believe me: if you understood these two concepts (linear equation and function composition), you have understood, mathematically, 80% of what a feed-forward neural network is. What remains is to understand how to add more independent variables (x₁, x₂, …, xₙ) so that our neural network can handle many (probably the majority) of the real-world problems one deals with.

We saw in part 2 of this series how easy it is to work in vector space when dealing with linear equations in n dimensions. To do so we need two additional simple linear algebra operations: vector and matrix transpose, and the dot product.

Linear Algebra: Vector and Matrix transpose

Vector transpose is the transformation of a column vector into a row vector and vice-versa, i.e., a row vector into a column vector. Formally:

Equation 3: Transpose of a vector

Similarly matrix transpose consists of turning rows into columns and vice-versa as exemplified below.

Equation 4: Transpose of a matrix
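If you want to check the transpose numerically, numpy’s .T attribute does exactly this (the values below are arbitrary):

```python
import numpy as np

u = np.array([[1], [2], [3]])   # a 3x1 column vector
print(u.T)                      # its transpose, a 1x3 row vector: [[1 2 3]]

W = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # a 3x2 matrix
print(W.T)                      # its transpose, a 2x3 matrix
```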

Linear Algebra: Dot product

The dot product of vector(u)=(u₁, u₂,…, uₙ) and vector(v)=(v₁, v₂,…, vₙ) is given by:

Equation 5: Dot product of 2 vectors

And the dot product between matrix W and vector(u) is:

Equation 6: Dot product between matrix and vector

From the above formula we can see that both vectors need to have the same dimension n, and that the dot product of two vectors is a real number (a scalar). In the case of a matrix and a vector, the number of columns in the matrix has to be equal to the number of entries in the (column) vector, and this dot product yields a column vector with one entry per matrix row.
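Again in numpy, with arbitrary values, np.dot covers both cases:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(np.dot(u, v))              # u1*v1 + u2*v2 + u3*v3 = 32.0, a scalar

W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])  # a 2x3 matrix: 3 columns match u's 3 entries
print(np.dot(W, u))              # a vector with one entry per matrix row: [7. 11.]
```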

A 2-dimension function composition

Figure 2 is the schema contained in figure 1 with the addition of x₂.

Figure 2: A 2-dimension function composition

Mathematically, g now takes both inputs: g(x₁, x₂) = a₂x₁ + a₃x₂ + b, while f composes with it exactly as before.

Now let’s call the vector of angular coefficients of function g(x₁, x₂) vector(aᵍ) = (a₂, a₃) and the vector of inputs vector(x) = (x₁, x₂). We can then rewrite g(x₁, x₂) as vector(aᵍ)·vector(x) + b.

Considering vector(aᶠ) = (a₁, 0), we can likewise rewrite f(x₁, x₂) as vector(aᶠ)·(g(x₁, x₂), x₂) + b₁, which reduces to a₁·g(x₁, x₂) + b₁ since the second coefficient is zero.

In the above two equations the multiplication of the vectors is in fact the dot product of these vectors, yielding a scalar as explained above.
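A short numpy sketch of this 2-dimension composition; the coefficient values are arbitrary, chosen only to make the numbers concrete:

```python
import numpy as np

a_g = np.array([0.4, 0.6])   # vector(a^g) = (a2, a3)
b   = 1.0                    # intercept of g
a1  = 0.5                    # the non-zero entry of vector(a^f) = (a1, 0)
b1  = 2.0                    # intercept of f

x = np.array([3.0, 5.0])     # vector(x) = (x1, x2)

g = np.dot(a_g, x) + b       # g(x1, x2) = a2*x1 + a3*x2 + b, a scalar
f = a1 * g + b1              # f(g(x1, x2)); the zero coefficient means x2
                             # does not enter f directly
print(g, f)                  # 5.2 4.6
```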

Neural Network Feed Forward Pass

There is little left to add to reach a neural network’s feed forward pass. It simply consists of submitting the input vector to (f ∘ g)(x₁, x₂) = f(g(x₁, x₂)) and computing the result, as exemplified in figure 2 above.

In real neural networks there is still one additional function composition, which is to submit the outputs of both g(x) and f(x) to a squashing function. One of the most common is the Sigmoid Function σ(x), described below, which converts any output into the range (0, 1):

Equation 7: Sigmoid function
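Written out, the sigmoid is σ(x) = 1 / (1 + e⁻ˣ), which is a one-liner in Python:

```python
import numpy as np

def sigmoid(x):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```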

The need for this squashing function, known in neural network terms as the activation function, will be clarified in another post in this series. For now, suffice it to say that this composition enables the network to learn non-linear patterns.

An example

In this example we will work with a 2-layer network. The first layer is the HIDDEN layer and the second is the OUTPUT layer. We do not count the INPUT layer when stating the number of layers a network has. The network is illustrated below.

Figure 3: A 2-layer neural network

Here we change the nomenclature from the one we were using to one closer to what is normally found in neural network documentation: instead of the letter a for the angular coefficient, we will now use the letter w (for weight).

The above network receives two inputs: x₁ and x₂. The HIDDEN layer has two nodes (H1 and H2) while the OUTPUT layer has one node (O1). All nodes have a fixed input named bias, represented in figure 3 by b₁, b₂ and b₃.

Between the INPUT and the HIDDEN layers there are 4 weights. The superscript 1 in w¹ refers to these weights, just as the superscript 2 in w² refers to the weights between the HIDDEN and the OUTPUT layers. The subscripts in w¹₁₁ identify the weight (or angular coefficient) between x₁ and H1, while w¹₂₁ is the angular coefficient between x₂ and H1. Similarly, w¹₁₂ is the weight between x₁ and H2 and w¹₂₂ the weight between x₂ and H2.

As we are working in vector space we can represent all the mentioned weights in a matrix as follows:
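Following the subscript convention above, the natural arrangement is to place each hidden node’s incoming weights in a row: (w¹₁₁, w¹₂₁) as the first row of W¹ and (w¹₁₂, w¹₂₂) as the second, so that the dot product of W¹ with vector(x) produces the inputs of H1 and H2 in one go.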

With this, g(x) becomes the equation below, where the multiplication of W¹ and vector(x) is in fact the dot product between them, and vector(b¹) = (b₂, b₃):

The result of node H1 is thus:

Similarly, considering vector(b²) = (b₁, 0), we have that W², the weight matrix between the HIDDEN layer and the OUTPUT layer, is:

Yielding f(x) equal to:

With the output of the network being:
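To make the whole pass concrete, here is a short numpy sketch of this exact 2-layer network. The weight and bias values are arbitrary (in a real network they would be learned by back propagation), and the sigmoid activation is applied to every node’s output as discussed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values (in a real network these are learned).
W1 = np.array([[0.15, 0.20],   # weights into H1: (w1_11, w1_21)
               [0.25, 0.30]])  # weights into H2: (w1_12, w1_22)
b1 = np.array([0.35, 0.35])    # hidden-layer biases, vector(b1) = (b2, b3)

W2 = np.array([[0.40, 0.45]])  # weights from H1 and H2 into O1
b2 = np.array([0.60])          # output-node bias, the non-zero entry of vector(b2)

x = np.array([0.05, 0.10])     # inputs x1 and x2

hidden = sigmoid(np.dot(W1, x) + b1)       # the HIDDEN layer: g(x) plus activation
output = sigmoid(np.dot(W2, hidden) + b2)  # the OUTPUT layer: f(g(x)) plus activation
print(hidden, output)          # roughly [0.593 0.597] and [0.751]
```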

Conclusion

We saw that the feed forward pass of a neural network consists basically of the composition of linear equations in vector space. The network shown in figure 3 of the example is also known as a multi-layer perceptron (MLP), which has achieved interesting results in image and text classification.

We learnt in part 1 of this series how to find the angular coefficient of a linear equation in 2-dimensional space. In a neural network with many layers and multiple dimensions, the process of computing all the weights (angular coefficients) and biases is known as back propagation, which is a subject of this series.

Watch this space for updates!
