Neural Network: A Complete Beginner's Guide from Scratch

A detailed explanation of the mathematics and concepts behind neural networks

Imdadul Haque Milon
Gadictos
13 min read · Aug 21, 2019


Neural networks have become a crucial part of modern technology and have influenced our daily lives in ways we never imagined. From e-commerce and classification problems to autonomous driving, they have touched everything. In this article, we are going to discuss all the important aspects of neural networks in the simplest way possible, and at the end of the tutorial we also provide Python code for all the described parts.

Motivation

Animal brains, even small ones like the brain of a pigeon, were more capable than digital computers with huge processing power and storage space. This puzzled scientists for many years and turned their attention to architectural differences. Traditional computers process data sequentially and exactly, with no fuzziness. Animal brains, on the other hand, although apparently running at much slower rhythms, seem to process signals in parallel, and fuzziness is a feature of their computation.

The basic unit of a biological brain is the neuron. Although neurons come in various forms, their job is to transmit electrical signals from one end to the other, from the dendrites along the axon to the terminals. These signals are then passed from one neuron to another. This is how our body senses light, touch, pressure, heat, and so on. Signals from specialized sensory neurons are transmitted along our nervous system to our brain, which itself is mostly made of neurons too. Now, the question is: why are biological brains so capable even though they are much slower and consist of relatively few computing elements compared to modern computers?

Let's look at how a biological neuron works. It takes an electrical input and pops out another electrical signal. But can we represent neurons as linear functions? The answer is no! A biological neuron doesn't produce an output that is a simple linear function of its input, i.e., something of the form output = a · input + b.

So, neurons don't react readily to input; instead, they suppress it until it has grown so large that it triggers an output. Here comes the idea of activation functions.

Activation Function

A function that takes the input signal and generates an output signal, but takes into account some kind of threshold is called an activation function. There are many such activation functions.

Here, we can see that for the step function the output is zero for low input values, but once the input reaches the threshold, the output jumps up. We can improve on the step function in many ways. The S-shaped function shown above, called the sigmoid or logistic function, is another very popular activation function; its equation is σ(z) = 1 / (1 + e⁻ᶻ).

Another very important and widely used activation function is ReLU, the Rectified Linear Unit, whose equation is f(z) = max(0, z).

Here is a brief table of common activation functions:

Activation function | Equation
Step                | f(z) = 1 if z ≥ threshold, else 0
Sigmoid (logistic)  | σ(z) = 1 / (1 + e⁻ᶻ)
Tanh                | f(z) = tanh(z)
ReLU                | f(z) = max(0, z)
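As a minimal sketch, these three activation functions can be written in Python with NumPy (the function names are my own choices, not from the original code):

```python
import numpy as np

def step(z, threshold=0.0):
    # Output jumps from 0 to 1 once the input reaches the threshold
    return np.where(z >= threshold, 1.0, 0.0)

def sigmoid(z):
    # S-shaped curve that squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive inputs through unchanged, zeroes out negative ones
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 1.0, 3.0])
print(step(z))     # [0. 1. 1. 1.]
print(sigmoid(z))  # [0.119 0.5   0.731 0.953]
print(relu(z))     # [0. 0. 1. 3.]
```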

Neurons

The basic computational unit of a neural network is also called a neuron. It receives input from other nodes or from an external source and generates an output.

Each input has an associated weight (w), assigned on the basis of its relative importance compared to the other inputs. The node applies an activation function, e.g. the sigmoid, to the weighted sum of its inputs. If the combined signal is not large enough, the sigmoid threshold function suppresses the output signal; otherwise, the neuron fires.
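Putting that together, a single artificial neuron can be sketched as a weighted sum followed by the sigmoid; the inputs and weights below are illustrative values only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights):
    # Weighted sum of the inputs, then the sigmoid activation as the threshold
    z = np.dot(weights, inputs)
    return sigmoid(z)

x = np.array([1.0, 0.0, 1.0])   # illustrative inputs
w = np.array([0.9, 0.8, 0.1])   # illustrative weights
print(neuron(x, w))             # sigmoid(1.0) ~ 0.73
```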

Neural Network

In a biological neural network, electrical signals are collected by the dendrites, and these combine to form a stronger signal. If the signal is strong enough to pass the threshold, the neuron fires a signal down the axon towards the terminals to pass it on to the next neuron's dendrites.

The important thing to notice is that each neuron takes input from many neurons before it and also provides signals to many more. One way to replicate this in an artificial model is to have layers of neurons, with each neuron connected to every neuron in the preceding and subsequent layers. The following diagram illustrates this idea:

Here we can see a neural network with three layers, each with several artificial neurons, or nodes. Each node is connected to every node in the preceding and following layers. This is how we take the idea from the biological brain and apply it to build a neural architecture for computers. But how does this architecture actually learn?

The most obvious thing to adjust is the strength of the connections between nodes. Within a node, we could adjust the summation of the inputs or the shape of the sigmoid threshold function, but that's more complicated than simply adjusting the strength of the connections between the nodes. The diagram on the top right shows the connected nodes, but this time a weight is shown associated with each connection. A low weight will de-emphasize a signal, and a high weight will amplify it.

Next, we will see how signals are calculated in a neural network, flowing from the inputs through the different layers to become the output. This is called the forward propagation part of a neural network.

Neural Network: Forward Propagation

Suppose we have a Boolean function represented by F(x, y, z) = xy + z̄. Its values are given below, and we will use them to demonstrate the calculations of the neural network.

x y z | F(x, y, z)
0 0 0 | 1
0 0 1 | 0
0 1 0 | 1
0 1 1 | 0
1 0 0 | 1
1 0 1 | 0
1 1 0 | 1
1 1 1 | 1
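As a small aside, this table can be generated with a few lines of Python (a helper sketch of my own, not part of the original code):

```python
from itertools import product

def F(x, y, z):
    # F(x, y, z) = xy + NOT z, i.e. the OR of the two Boolean terms
    return int((x and y) or (not z))

for x, y, z in product([0, 1], repeat=3):
    print(x, y, z, '->', F(x, y, z))
```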

Let's use the row (1, 0, 1) → 0 to demonstrate forward propagation.

Here, we have a neural network with three layers. The first layer is called the input layer and the last layer is called the output layer. The layers in between are called hidden layers; we use one hidden layer for simplicity. The input and hidden layers contain three nodes each, and the output layer contains a single node. We now assign weights to the synapses between the input and hidden layers. Since this is the first time we're forward propagating, the weights are chosen randomly between 0 and 1. This is called random initialization of the weights, and it is very important.
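A minimal sketch of this random initialization with NumPy, assuming the 3-3-1 layer sizes described above (the variable names are my own):

```python
import numpy as np

np.random.seed(42)  # fixed seed only so the example is reproducible

n_input, n_hidden, n_output = 3, 3, 1

# Weights drawn uniformly from [0, 1), as described above
W1 = np.random.rand(n_hidden, n_input)   # synapses: input layer -> hidden layer
W2 = np.random.rand(n_output, n_hidden)  # synapses: hidden layer -> output layer

print(W1)
print(W2)
```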

Now, for a single neuron or node, we take all the connected inputs, multiply each by its associated weight, and sum them. The node then applies an activation function, e.g. the sigmoid, to this weighted sum to introduce non-linearity.

We repeat this process for every node in every layer. Let's focus on the first node of the hidden layer. All the nodes in the input layer are connected to it. Those input nodes have values of 1, 0, and 1 with associated weights of 0.9, 0.8, and 0.1 respectively. We sum the products of the inputs with their corresponding weights to arrive at the first value for the hidden layer: (1 × 0.9) + (0 × 0.8) + (1 × 0.1) = 1.0. We do the same for the other nodes of the hidden layer.

We write these sums smaller inside the circles because they are not the final values. We can now calculate each node's final output value using the activation function σ(z) = 1 / (1 + e⁻ᶻ).

Applying σ(z) to the three hidden-layer weighted sums gives the hidden-layer outputs (for the first node, σ(1.0) ≈ 0.73).

We add these to our neural network as the hidden layer results.

Then, we calculate the weighted sum of the hidden layer results with the second set of weights (also determined at random) to determine the output sum.

Finally, we apply the sigmoid activation function to get the final output result, which turns out to be 0.77.

Because we used a random set of initial weights, the value of the output neuron is off the mark; in this case by +0.77 (since the target is 0).

We can see that we are not even close to our target value. That's because we initialized the weights randomly, and we now have to calibrate them. The process we will use to calibrate the weights is called backpropagation, which we will cover next. But before diving into backpropagation, we need to introduce the idea of computing the forward propagation with matrix operations.
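Before switching to matrices, here is the whole forward pass written node by node. Only the first hidden node's weights (0.9, 0.8, 0.1) come from the walkthrough above; the remaining weights are placeholders I made up, so the output will not be exactly the 0.77 quoted in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = [1.0, 0.0, 1.0]            # the input row (1, 0, 1) from the truth table

W1 = [[0.9, 0.8, 0.1],         # weights into hidden node 1 (from the article)
      [0.3, 0.5, 0.7],         # weights into hidden node 2 (placeholder values)
      [0.6, 0.2, 0.4]]         # weights into hidden node 3 (placeholder values)
W2 = [0.5, 0.9, 0.2]           # hidden -> output weights (placeholder values)

# Hidden layer: weighted sum then sigmoid, one node at a time
hidden = []
for weights in W1:
    z = sum(w * xi for w, xi in zip(weights, x))
    hidden.append(sigmoid(z))

# Output layer: weighted sum of the hidden activations, then sigmoid
z_out = sum(w * h for w, h in zip(W2, hidden))
output = sigmoid(z_out)
print(hidden, output)
```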

Matrix vs Neural Network

A matrix is nothing but a table or a grid of numbers. For example, the weights between the 3-node input layer and the 3-node hidden layer form a 3 × 3 matrix:

W = | w₁₁ w₁₂ w₁₃ |
    | w₂₁ w₂₂ w₂₃ |
    | w₃₁ w₃₂ w₃₃ |

Here, the matrix values are the weights of the neural network, and we can represent the inputs of the network with another matrix (a column vector):

x = | x₁ |
    | x₂ |
    | x₃ |

When we multiply these two matrices we get

W·x = | w₁₁x₁ + w₁₂x₂ + w₁₃x₃ |
      | w₂₁x₁ + w₂₂x₂ + w₂₃x₃ |
      | w₃₁x₁ + w₃₂x₂ + w₃₃x₃ |

This is exactly the weighted sum we computed between the input and the hidden layer. So, we can calculate the hidden layer output as

a = σ(W·x)

where W is the weight matrix and x is the input vector. This is much easier and faster to compute, since it doesn't require calculating every node individually. This technique is called vectorization.
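In NumPy, that vectorized version is a single matrix product (again, every weight except the first row is a placeholder):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[0.9, 0.8, 0.1],
              [0.3, 0.5, 0.7],
              [0.6, 0.2, 0.4]])
x = np.array([1.0, 0.0, 1.0])

a = sigmoid(W @ x)   # all three hidden-node outputs computed at once
print(a)
```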

So, the general equations will be

zᵢ[ˡ] = Σⱼ wᵢⱼ[ˡ] aⱼ[ˡ⁻¹] + bᵢ[ˡ],  aᵢ[ˡ] = σ(zᵢ[ˡ]),  with a[⁰] = x

where zᵢ[ˡ] is the weighted sum of a single node, l denotes the layer number, and i denotes the node number within a layer.

Now, these equations are for a single training example! But in general we will have many such examples, like the values of the Boolean function shown above. Let

X = [ x⁽¹⁾  x⁽²⁾  …  x⁽ᵐ⁾ ]

where x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾ are the different training examples, stacked as columns, when we have m training examples. Then

Z[ˡ] = W[ˡ] A[ˡ⁻¹] + b[ˡ],  A[ˡ] = σ(Z[ˡ]),  A[ˡ] = [ a[ˡ]⁽¹⁾  a[ˡ]⁽²⁾  …  a[ˡ]⁽ᵐ⁾ ]

where a[ˡ]⁽ᵐ⁾ denotes the output of layer l for example m.
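A sketch of this batched version, with the eight truth-table inputs stacked as the columns of X, placeholder weights, and zero biases (the bias terms are an assumption for completeness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# All eight truth-table inputs, one example per column: shape (3, m) with m = 8
X = np.array([[0, 0, 0, 0, 1, 1, 1, 1],
              [0, 0, 1, 1, 0, 0, 1, 1],
              [0, 1, 0, 1, 0, 1, 0, 1]], dtype=float)

W1 = np.random.rand(3, 3)    # input -> hidden weights (placeholders)
b1 = np.zeros((3, 1))
W2 = np.random.rand(1, 3)    # hidden -> output weights (placeholders)
b2 = np.zeros((1, 1))

A1 = sigmoid(W1 @ X + b1)    # hidden activations for all m examples, shape (3, m)
A2 = sigmoid(W2 @ A1 + b2)   # network outputs for all m examples, shape (1, m)
print(A2)
```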

Neural Network: Back Propagation

To improve our model, we first have to quantify how wrong our model's predictions are compared to the target values. Then, we adjust the weights so that the error decreases. Similar to forward propagation, backpropagation calculations occur at each layer but, as the name indicates, in the backward direction. We begin by changing the weights between the hidden layer and the output layer.

Cost Function

To quantify how wrong our model is, we first have to calculate the error between the predicted values and the target values of the model. To do this we use a cost function. A cost function could simply be the sum of the differences between the target values and the predicted values:

E = Σᵢ (targetᵢ − outputᵢ)

Let's assume we have target values 2, 3, 5, and 9 and output values 1, 5, 3, and 6 respectively.

Then the total error becomes

E = (2 − 1) + (3 − 5) + (5 − 3) + (9 − 6) = 1 − 2 + 2 + 3 = 4

There is a problem with this error function: the second and third terms cancel each other, so we are not getting the actual error. To make our model more accurate we have to use a different cost function. What about the sum of the absolute values of the errors?

E = Σᵢ |targetᵢ − outputᵢ|

Then the total error becomes

E = |2 − 1| + |3 − 5| + |5 − 3| + |9 − 6| = 1 + 2 + 2 + 3 = 8

and nothing cancels. The reason this isn't popular is that the slope isn't continuous near the minimum, and this makes gradient descent not work so well, because we can bounce around the V-shaped valley that this error function has. The slope doesn't get smaller as we get closer to the minimum, so our steps don't get smaller, which means they risk overshooting. A better option is to use the sum of the squares of the errors. So, we will calculate the error for each output neuron using the squared error function and sum them to get the total error:

Eₜₒₜₐₗ = Σᵢ ½ (targetᵢ − outputᵢ)²

For example, the target output for our network is 0 but the neural network output is 0.77, so its error is

E = ½ (0 − 0.77)² ≈ 0.296

Cross-entropy is another very popular cost function; for a single sigmoid output with target y and prediction a, its equation is

C = −[ y ln(a) + (1 − y) ln(1 − a) ]
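The cost functions above, evaluated on the same small example (targets 2, 3, 5, 9 and outputs 1, 5, 3, 6) and on the network's single output, might look like this sketch:

```python
import numpy as np

t = np.array([2.0, 3.0, 5.0, 9.0])   # target values
o = np.array([1.0, 5.0, 3.0, 6.0])   # output (predicted) values

print(np.sum(t - o))                 # 4.0  -- the -2 and +2 cancel each other
print(np.sum(np.abs(t - o)))         # 8.0  -- no cancellation, but a V-shaped valley
print(np.sum(0.5 * (t - o) ** 2))    # 9.0  -- sum of squared errors

# Squared error for our network's single output (target 0, prediction 0.77)
print(0.5 * (0.0 - 0.77) ** 2)       # ~0.296

# Binary cross-entropy for the same output
y, a = 0.0, 0.77
print(-(y * np.log(a) + (1.0 - y) * np.log(1.0 - a)))   # ~1.47
```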

The Backward Pass

With backpropagation, our goal is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole. As with forward propagation, we will derive the equations for a single neuron to update its weights and then expand the concept to the rest of the network.

Consider a neuron with inputs x₁, x₂, and x₃ and associated weights w₁, w₂, and w₃ respectively. We know from the forward propagation part that

z = w₁x₁ + w₂x₂ + w₃x₃,  a = σ(z)

which we call the output of the network. We have a target value for the network, and for an untrained network whose weights are not calibrated there will be a big error. We denote this error as Eₜₒₜₐₗ, where

Eₜₒₜₐₗ = ½ (target − a)²

Our job is to find out how to adjust the weights to decrease this error. Consider w₁: we want to know how much a change in w₁ affects the total error. If we look closely, we can see that the error is affected by the output a, the output is affected by z, and z is affected by the weight w₁. By applying the chain rule, we know that

∂Eₜₒₜₐₗ/∂w₁ = (∂Eₜₒₜₐₗ/∂a) · (∂a/∂z) · (∂z/∂w₁)

If we break down each piece of the equation, first, how much the error changes with respect to the output:

∂Eₜₒₜₐₗ/∂a = −(target − a)

Next, we have to find out how much the output changes with respect to its total net input. Since a = σ(z),

∂a/∂z = σ(z)(1 − σ(z)) = a(1 − a)

Finally, we determine how much the total net input z changes with respect to w₁:

∂z/∂w₁ = x₁

Putting all the pieces together,

∂Eₜₒₜₐₗ/∂w₁ = −(target − a) · a(1 − a) · x₁

To adjust the weight we then use the formula

w₁ ← w₁ − η · ∂Eₜₒₜₐₗ/∂w₁

where η is the learning rate. This rule for updating the weights is called gradient descent.
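For the single neuron above (sigmoid activation, squared-error cost), the gradient and the gradient descent update can be sketched as follows; the learning rate and the starting weights are assumptions for illustration, not values from the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])   # inputs x1, x2, x3
w = np.array([0.9, 0.8, 0.1])   # starting weights w1, w2, w3 (illustrative)
target = 0.0
lr = 0.5                        # learning rate (assumed hyperparameter)

for _ in range(100):
    # Forward pass
    a = sigmoid(np.dot(w, x))

    # Chain rule: dE/dw = dE/da * da/dz * dz/dw
    dE_da = -(target - a)       # from E = 1/2 * (target - a)^2
    da_dz = a * (1.0 - a)       # derivative of the sigmoid
    dz_dw = x                   # derivative of the weighted sum w.r.t. each weight
    grad = dE_da * da_dz * dz_dw

    # Gradient descent update
    w = w - lr * grad

print(w, sigmoid(np.dot(w, x)))  # the output has moved toward the target of 0
```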

Now, let's work through an example of updating the weight w₁₀ shown in the figure: we compute ∂Eₜₒₜₐₗ/∂w₁₀ with the chain rule, exactly as above, and then apply the gradient descent update.

Similarly, we can update the other weights, and in general this is a long process.

Prediction

We have done all the hard work so far so that we can predict new data using our neural network. The dataset we work on is generally split into two parts: one part, called the training data, is where we do all the training; the other, called the test data, is where we test our network. We have derived the equations for training, and by using them we obtain a calibrated set of weights. We then use this set of weights to predict the result for new data using the equation

ŷ = σ(W · Xₜₑₛₜ + b)

where W and b are the calibrated weights and bias respectively and Xₜₑₛₜ is the test set split from our dataset.
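A sketch of the prediction step for the single-layer form of the equation above (for the two-layer network one would simply chain two such steps); the 0.5 decision threshold and the numbers are my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(W, b, X_test):
    # One forward pass over the test examples (stacked as columns of X_test)
    probs = sigmoid(W @ X_test + b)
    return (probs >= 0.5).astype(int)   # assumed 0.5 decision threshold

# Illustrative calibrated parameters and two test examples
W = np.array([[0.4, -0.6, 0.2]])
b = np.array([[0.1]])
X_test = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
print(predict(W, b, X_test))
```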

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.

Originally published at Gadictos.
