
Understanding Neural Networks

Divya Budale
Published in Analytics Vidhya
Jan 19, 2020 · 11 min read


Voice recognition, image processing, and facial recognition are some examples of artificial intelligence applications driven by deep learning, which is built on the workings of neural networks.

Neural networks were first proposed in 1943 by Warren McCulloch and Walter Pitts, two University of Chicago researchers.

What is a Neural Network?

Neural networks are computer programs designed to mimic the operation of the human brain. Each unit in the network, known as a neuron, can only perform a basic calculation. But by connecting numerous neurons together, the computational power of the whole network becomes far greater than that of any individual part. The process of forming connections across neurons within a neural network and using data to adjust them is referred to as training; through it, the network becomes smarter, similar to how humans learn information.

In terms of structure, patterns are introduced to the network through the input layer, which has one input for each component of the input data; from there they are passed to the hidden layer. It is in the hidden layer that all the processing actually happens, through a system of connections characterized by weights and biases. When input is received, a neuron calculates the weighted sum of its inputs, adds a bias to it, and, according to the result and a pre-set activation function, decides whether it should be fired (activated). At the end of this process, the last hidden layer is linked to the output layer, which has one neuron for each possible desired output.

[Figure: basic structure of a neural network]
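To make this concrete, here is a minimal Python sketch of a single neuron's computation, assuming a sigmoid activation; all function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # Weighted sum of the inputs, plus the bias,
    # passed through the activation function
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, 0.1, 0.9])    # inputs
w = np.array([0.4, -0.2, 0.7])   # weights: importance of each input
b = -0.3                         # bias
print(neuron_output(x, w, b))    # a value between 0 and 1
```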

How does the Neural Network work?

To understand how a neural network works, we need to understand the different types of neurons that we can include in our network.

The first type of neuron is the perceptron; even though more modern models are available, it is beneficial to understand the perceptron first. The second important type is the sigmoid neuron.

Perceptron

A perceptron takes several binary inputs x1, x2, … and produces a single binary output, i.e. either 0 or 1. The three main steps that a perceptron follows are:

  1. Inputs x1, x2, x3, … are introduced to the perceptron; it can take a few or many inputs.
  2. Weights w1, w2, w3, …, which are real numbers expressing the importance of the respective inputs to the output, are introduced.
  3. The neuron's output, 0 or 1, is determined by whether the weighted sum ∑j wj xj is less than or greater than a threshold value.

To put it in mathematical terms:
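\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}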

To simplify the above equation, we can make the following changes:

  1. Replace the summation ∑j wj xj with a dot product w · x, where w and x are vectors whose components are the weights and inputs respectively.
  2. Move the threshold to the other side of the inequality and replace it with the bias, defined as b ≡ −threshold.
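With these changes, the perceptron rule can be rewritten as:

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}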

Bias can be described as a measure of how easy it is to get the perceptron to output a 1. If the bias is very negative, the perceptron will find it difficult to output a 1; if it is very positive, outputting a 1 is easy.

One of the strengths of the perceptron is that we can vary the weights and the bias to obtain different decision-making models. We can assign more weight to the inputs that should push the neuron toward a positive output. If we pay attention to the formula, we can observe that a big positive bias makes it very easy to output 1, whereas a very negative bias makes an output of 1 very unlikely. One disadvantage of the perceptron, however, is that a small change in the weights or bias of even a single neuron can drastically flip the output from 0 to 1 or vice versa. This is where a more modern type of neuron comes in handy: the sigmoid neuron. The main difference between a sigmoid neuron and a perceptron is that its inputs and output can be any continuous value between 0 and 1.
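As a small worked example of the perceptron rule, here is a Python sketch (names illustrative); the weights −2, −2 and bias 3 make the perceptron compute a NAND gate, a classic example from Nielsen's book referenced at the end:

```python
import numpy as np

def perceptron(x, w, b):
    # Fire (output 1) only when the weighted sum plus bias is positive
    return 1 if np.dot(w, x) + b > 0 else 0

# Weights and bias chosen so the perceptron behaves like a NAND gate
w = np.array([-2, -2])
b = 3
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", perceptron(np.array([x1, x2]), w, b))
```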

Sigmoid

The sigmoid neuron is similar to the perceptron, but a small change in its weights and bias causes only a small change in its output.

  1. Just like the perceptron, the sigmoid neuron takes inputs x1, x2, x3, …, but instead of being only 0 or 1, these values can be any continuous value between 0 and 1.
  2. Weights w1, w2, w3, … and a bias b are introduced to the network.
  3. The output is not 0 or 1 but σ(w⋅x+b), where σ is called the sigmoid function and is defined as:
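\sigma(z) = \frac{1}{1 + e^{-z}}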

To put it more explicitly:
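\text{output} = \frac{1}{1 + \exp\left( -\sum_j w_j x_j - b \right)}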

The shape of the sigmoid function is a smoothed-out version of a step function.

In fact, if σ had been a step function, the sigmoid neuron would have been a perceptron, since the output would be either 0 or 1.

The smoothness of σ means that small changes Δwj in the weights and Δb in the bias produce only a small change Δoutput in the output of the neuron. In fact, calculus tells us that Δoutput is well approximated by:
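\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b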

How does the Neural Network learn?

The main strength of machine learning systems is their ability to learn and to improve every time they predict an output. In terms of a neural network, this means an algorithm that helps us find weights and biases so that the output from the network approximates y(x) for all training inputs x. For this purpose, we define a cost function:
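C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2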

where w is the collection of all weights in the network, b is the collection of all biases, n is the total number of training inputs, a is the vector of outputs from the network when x is the input, the sum ∑ runs over all training inputs, and C is the quadratic cost function, also known as the mean squared error or MSE.

The aim of the algorithm is to find the weights and biases that make the cost as small as possible, and for this purpose we will use gradient descent.

Let us suppose we’re trying to minimize some function C(v). This could be any real-valued function of many variables, v = v1, v2, … To minimize C(v), it helps to imagine C as a function of just two variables, which we’ll call v1 and v2.

One way of approaching this is to use calculus to try to find the minimum analytically. We could compute derivatives and try using them to find the places where C reaches its minimum or maximum value. This might work when C is a function of just one or a few variables, but it turns into a nightmare when we have many more, and for neural networks we will often want far more variables. Using calculus to minimize the cost that way just won’t work.

Instead of computing derivatives to calculate where the minimum is located, we can start at a random point and try to make a small move: mathematically speaking, move Δv1 in the direction of v1 and Δv2 in the direction of v2, and calculate the change ΔC in our function. We can express the change in the function as:
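\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2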

We need to choose Δv1 and Δv2 so as to make ΔC negative. To do so, it helps to define Δv to be the vector of changes in v, Δv ≡ (Δv1, Δv2). We’ll also define the gradient of C to be the vector of partial derivatives:
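\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T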

With these definitions, we can write ΔC in terms of Δv and the gradient, ∇C as below:
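\Delta C \approx \nabla C \cdot \Delta v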

The above equation shows how we can choose Δv so as to make ΔC negative.

In particular, suppose we choose Δv=−η∇C

where η is a small, positive parameter known as the learning rate. Then the above equation tells us that:
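\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \| \nabla C \|^2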

Because ∥∇C∥² ≥ 0, this guarantees that ΔC ≤ 0, i.e., C will always decrease, never increase.

Summing up, the gradient descent algorithm works by repeatedly computing the gradient ∇C and then moving in the opposite direction.

The amount that we move in any direction is set by the learning rate, which defines how fast we approach the minimum. To make gradient descent work correctly, we need to choose the learning rate to be small enough that the above equation remains a good approximation. If we don’t, we might end up with ΔC > 0, which obviously would not be good! At the same time, we don’t want the learning rate to be too small, since that would make the changes Δv tiny and the gradient descent algorithm very slow. In practical implementations, the learning rate is often varied so that the equation remains a good approximation while the algorithm isn’t too slow.
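To see the update rule v → v − η∇C in action, here is a minimal gradient descent sketch in Python; the quadratic example function and all names are illustrative assumptions, not from the article:

```python
import numpy as np

def gradient_descent(grad, v, eta=0.1, steps=100):
    # Repeatedly step in the direction opposite the gradient
    for _ in range(steps):
        v = v - eta * grad(v)
    return v

# Example: minimize C(v) = v1^2 + v2^2, whose gradient is 2v
grad_C = lambda v: 2 * v
v = np.array([3.0, -4.0])
print(gradient_descent(grad_C, v))  # approaches the minimum at (0, 0)
```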

There are a number of challenges in applying the gradient descent rule. One of the main issues is the amount of time needed to train with a large number of training samples. An algorithm called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient ∇C by computing the gradients for a small sample of randomly chosen training inputs. By averaging over this small sample, it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.
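For a mini-batch of randomly chosen training inputs X1, …, Xm, the gradient is estimated as:

\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j}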

Stochastic gradient descent works by randomly picking out a small number m of training inputs, which we refer to as a mini-batch. We compute the gradient for this mini-batch, then pick out another randomly chosen mini-batch and train with that. The process is repeated until we’ve exhausted the training inputs, which completes what is known as an epoch of training. At that point we start over with a new training epoch.
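Here is a minimal sketch of one such training loop in Python, assuming a hypothetical per-example gradient function grad_Cx; all names are illustrative:

```python
import numpy as np

def sgd(data, grad_Cx, v, eta=0.1, m=10, epochs=5):
    data = list(data)
    for _ in range(epochs):
        # One epoch: shuffle, then walk through the data in mini-batches
        np.random.shuffle(data)
        for k in range(0, len(data), m):
            batch = data[k:k + m]
            # Average the per-example gradients over the mini-batch
            g = np.mean([grad_Cx(x, v) for x in batch], axis=0)
            v = v - eta * g   # one gradient descent step per mini-batch
    return v
```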

Backpropagation in Neural Networks

As seen above, neural networks can learn their weights and biases using the gradient descent algorithm. But to compute the gradient of the cost function, we need another algorithm, called backpropagation. The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network.

Assumptions about the cost function for backpropagation to work

Before stating the assumptions, let us define the cost function:
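C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2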

where n is the total number of training examples; the sum runs over individual training examples x; y = y(x) is the corresponding desired output; L denotes the number of layers in the network; and a^L = a^L(x) is the vector of activations output from the network when x is the input.

The first assumption we need is that the cost function can be written as an average:
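C = \frac{1}{n} \sum_x C_x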

over the cost functions Cx for individual training examples x; for the quadratic cost, a single example’s cost is Cx = ½∥y − a^L∥².

With this assumption, backpropagation lets us compute the partial derivatives ∂Cx/∂w and ∂Cx/∂b for a single training example. We then calculate ∂C/∂w and ∂C/∂b by averaging over training examples.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network, as shown below:
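C_x = \frac{1}{2} \| y - a^L \|^2 = \frac{1}{2} \sum_j \left( y_j - a_j^L \right)^2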

Four Fundamental Equations behind backpropagation

Backpropagation computes the partial derivatives of the cost function with respect to the weights and biases. But to do so, we have to introduce an intermediate quantity, δ^l_j, which we call the error in the jth neuron of the lth layer. Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error δ^l in every layer and the gradient of the cost function. The four equations are described below:

An equation for the error in the output layer, δ^L: the components of δ^L are given by
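\delta_j^L = \frac{\partial C}{\partial a_j^L} \, \sigma'(z_j^L) \quad \text{(BP1)}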

The above equation can be written in matrix-based form as:
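\delta^L = \nabla_a C \odot \sigma'(z^L) \quad \text{(BP1a)}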

An equation for the error δ^l in terms of the error in the next layer, δ^l+1:
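\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \quad \text{(BP2)}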

An equation for the rate of change of the cost with respect to any bias in the network:
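\frac{\partial C}{\partial b_j^l} = \delta_j^l \quad \text{(BP3)}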

An equation for the rate of change of the cost with respect to any weight in the network:
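\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \quad \text{(BP4)}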

Backpropagation Algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. The algorithm proceeds in five steps:

  1. Input x: Set the corresponding activation a¹ for the input layer.
  2. Feedforward: For each l=2,3,…L compute
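z^l = w^l a^{l-1} + b^l \quad \text{and} \quad a^l = \sigma(z^l)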

3. Output error δL: Compute the vector
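\delta^L = \nabla_a C \odot \sigma'(z^L)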

4. Backpropagate the error: For each l=L−1,L−2,…,2 compute
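\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l)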

5. Output: The gradient of the cost function is given by
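\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \quad \text{and} \quad \frac{\partial C}{\partial b_j^l} = \delta_j^l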

Since we compute the error vectors δ^l backward, starting from the final layer, the algorithm is known as backpropagation. The backward movement is a consequence of the fact that the cost is a function of the outputs of the network. To understand how the cost varies with earlier weights and biases, we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
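Putting the five steps together, here is a compact sketch of the algorithm in Python/NumPy, assuming sigmoid activations and the quadratic cost; it mirrors the structure of the implementation in Nielsen's book, though the names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    # Gradients of the quadratic cost C_x = 0.5 * ||y - a^L||^2
    # for a single training example (x, y).
    # 1. Input: set the activation for the input layer
    activation = x
    activations = [x]   # activations, layer by layer
    zs = []             # weighted inputs z = w a + b, layer by layer

    # 2. Feedforward: compute z^l and a^l for each layer
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # 3. Output error (BP1): delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_w = [np.zeros_like(w) for w in weights]
    nabla_b = [np.zeros_like(b) for b in biases]
    nabla_w[-1] = np.outer(delta, activations[-2])   # BP4
    nabla_b[-1] = delta                              # BP3

    # 4. Backpropagate the error (BP2), from layer L-1 down to layer 2
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_w[-l] = np.outer(delta, activations[-l - 1])
        nabla_b[-l] = delta

    # 5. Output: the gradient of the cost for this example
    return nabla_w, nabla_b
```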

References:

Neural Networks and Deep Learning, by Michael Nielsen

Neural Networks: What, How and Why, by Euge Inzaugarat
