Simple Neural Network from Scratch

Coding a simple neural network to solve the XOR problem in Python, without any ML library

Shubham Chouksey
The Startup
8 min read · May 26, 2020


Fig 1: Simple neural network with a single hidden layer with 5 units, the hidden units use sigmoid activation and the output unit uses linear activation.

Let’s walk through some aspects of this diagram.

  1. The neural network is divided into three layers: the input layer, the hidden layer, and the output layer.
  2. The values in the input layer are multiplied by a weight matrix, W¹.
  3. The nodes in the hidden layer sum their inputs and add a bias term, bₕ.
  4. The outputs of the hidden layer nodes are multiplied by a weight vector, W².
  5. The output layer sums its inputs and adds another bias term, bᴼ.

Exercise: We are going to write code for a neural network that performs the exclusive OR (XOR) operation, which is not a linearly separable function. The truth table of XOR looks like this:

    x₁   x₂   XOR(x₁, x₂)
    0    0    0
    0    1    1
    1    0    1
    1    1    0

In words, the XOR function is 0 if both inputs are the same and 1 if the inputs are different; hence the name exclusive OR.

The following elements are required:

  • An input weight tensor
  • A hidden layer with five units using sigmoid activation (as in Fig 1)
  • An output weight tensor
  • An output unit with linear activation

There are 4 possible test (input) cases. Test your code for all cases.

Learning in neural networks: Backpropagation

Now that we have a promising representation, we need to determine whether it can be trained. The answer is not only yes, but that we can do so in a computationally efficient manner, using a clever algorithm known as backpropagation.

The backpropagation algorithm was developed independently multiple times. The earliest work on this algorithm was by Kelley (1960) in the context of control theory and Bryson (1961) in the context of dynamic programming. Rumelhart, Hinton and Williams (1986) demonstrated empirically that backpropagation can be used to train neural networks. Their paper marks the start of the modern history of neural networks and set off a new wave of enthusiasm for the field.

The backpropagation algorithm requires several components. First, we need a loss function to measure how well our representation matches the function we are trying to learn. Second, we need a way to propagate changes in the representation through the complex network. For this we will use the chain rule of calculus to compute gradients of the representation. In the general case, this process requires automatic differentiation methods.

The point of backpropagation is to learn the optimal weights for the neural network. The algorithm proceeds iteratively through a series of small steps. Once we have the gradient of the loss function, we can update the tensor of weights.

It should be evident that the backpropagation algorithm is a form of gradient descent. The weights are updated in small steps, following the gradient downhill.
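In symbols, each step nudges every weight a small amount in the direction of the negative gradient, where α is the learning rate (called lr in the code later on):

$$W \leftarrow W - \alpha\,\frac{\partial J(W)}{\partial W}$$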

How do we find the gradients?

Let’s work out backpropagation for the simple neural network above, with a single hidden layer of five units. This neural network, including the loss function, is shown in Figure 1 above. There are only three layers: an input layer, a five-unit hidden layer, and a single-unit output layer. There are only two weight tensors for this network. Further, the hidden units use sigmoid activation and the output unit uses linear activation. These activation functions have simple partial derivatives.

First, we need to work out the forward propagation relationships. We can compute the outputs of the hidden layer as follows.
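In the notation of Fig 1, with σ denoting the sigmoid function, one consistent way to write this is:

$$S_j = \sigma\!\Big(\sum_i W^1_{ji}\, x_i + b_h\Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad j = 1, \dots, 5$$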

In the same way, the result from the output layer can be computed as follows, since the activation function for this layer is linear.
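Under the same notation, the output is just a weighted sum of the hidden outputs plus the output bias:

$$S_6 = \sum_{j=1}^{5} W^2_j\, S_j + b_O$$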

To perform backpropagation, we need to fill out the gradient vector by computing ∂J(W)/∂W for each weight in the model.

To keep things simple in this example we will just use a non-normalized squared error loss function. This corresponds to maximum likelihood estimation (ignoring the normalization constants) under a Gaussian noise model.
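For a single training case this loss can be written as below; the factor of ½ is an assumption of convenience so that it cancels when differentiating, and does not change where the minimum lies:

$$J(W) = \tfrac{1}{2}\,\big(y - S_6\big)^2$$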

We want to compute the gradients of the loss with respect to the input weight tensor W¹ and the output weight tensor W².

Let’s start with the easier case: the partial derivatives with respect to the output tensor. Applying the chain rule, the gradient with respect to W² factors into two pieces. The first partial derivative in the chain is the derivative of the loss with respect to the output S₆. The second, given the linear activation of the output unit, is simply the corresponding hidden-layer output. Multiplying the two components of the chain gives us the gradient for each output weight.
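Putting these pieces together with the loss defined above (writing S_j for the output of hidden unit j, j = 1, …, 5):

$$\frac{\partial J}{\partial W^2_j} = \frac{\partial J}{\partial S_6}\,\frac{\partial S_6}{\partial W^2_j}, \qquad \frac{\partial J}{\partial S_6} = -(y - S_6), \qquad \frac{\partial S_6}{\partial W^2_j} = S_j$$

$$\Longrightarrow\quad \frac{\partial J}{\partial W^2_j} = -(y - S_6)\, S_j$$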

The partial derivatives with respect to the input tensor are a bit more complicated. To apply the chain rule we must work backwards from the loss function, through the output unit, to the hidden units. The chain now has three factors. The right-most factor, the derivative of a hidden output with respect to its input weights, must account for the non-linearity: given the sigmoid activation it works out to S_j(1 − S_j) times the corresponding input. The middle factor, the derivative of S₆ with respect to a hidden output, is just the corresponding output weight, since the output unit is linear. And we have already computed the left-most factor, ∂J(W)/∂S₆. Multiplying all three partial derivatives, we find the gradient for each input weight.
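Written out under the same assumptions (x_i is the i-th input feeding hidden unit j, and W²_j is the output weight on that unit):

$$\frac{\partial J}{\partial W^1_{ji}} = \frac{\partial J}{\partial S_6}\,\frac{\partial S_6}{\partial S_j}\,\frac{\partial S_j}{\partial W^1_{ji}} = -(y - S_6)\; W^2_j\; S_j\,(1 - S_j)\, x_i$$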

where S₆ and S₁, …, S₅ are computed using the forward-propagation relationships given above.

Now, with the theory and formulas above in hand, let’s move on to solving the XOR problem in Python.

Importing the necessary libraries:
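Only NumPy is strictly needed for the network itself; Matplotlib is assumed here as well, for the cost plot at the end:

```python
import numpy as np                 # array math for the network
import matplotlib.pyplot as plt    # used later to plot the training cost
```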

Here, we are going to use the sigmoid activation function for the hidden layer:
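A minimal sigmoid helper (the function name is my choice, but it matches how the hidden activations are described above):

```python
def sigmoid(z):
    """Element-wise sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))
```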

The partial derivative of the sigmoid has a convenient form: σ′(z) = σ(z)(1 − σ(z)). Therefore, the derivative can be computed directly from the sigmoid’s own output.
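A sketch of that helper (the name sigmoid_prime is an assumption; it simply reuses sigmoid):

```python
def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```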

To calculate S₁, …, S₅ and S₆ we are going to create a forward function; let’s take a look below.
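One plausible version of such a forward function, assuming the hidden bias bₕ is carried by the leading 1 in each input row (so it lives in the first column of W¹) and the output bias bᴼ by a 1 prepended to the hidden activations (so it lives in the first element of W²):

```python
def forward(x, W1, W2):
    """Forward pass for a single input row x = [1, x1, x2].

    Returns the hidden activations (with a leading 1 for the output bias)
    and the linear output S6.
    """
    z_hidden = W1 @ x                         # pre-activations of the 5 hidden units
    S_hidden = sigmoid(z_hidden)              # S1 ... S5
    S_hidden = np.hstack(([1.0], S_hidden))   # prepend 1 so W2[0] acts as b_O
    S6 = W2 @ S_hidden                        # linear output unit
    return S_hidden, S6
```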

Using backpropagation to train the neural network, we need the gradient for the input weight tensor and the output weight tensor. For that we create a backprop function, using the formulas given above for W¹ and W², which returns the gradients of both weight tensors.
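A sketch of such a backprop function, applying the gradient formulas derived earlier to a single training case (S_hidden[0] is the constant 1 that carries the output bias):

```python
def backprop(x, y_target, W1, W2):
    """Gradients of J = 0.5 * (y_target - S6)**2 with respect to W1 and W2."""
    S_hidden, S6 = forward(x, W1, W2)
    dJ_dS6 = -(y_target - S6)                # dJ/dS6 for the squared-error loss
    grad_W2 = dJ_dS6 * S_hidden              # dJ/dW2: linear output unit
    S = S_hidden[1:]                         # S1 ... S5 (drop the bias entry)
    delta = dJ_dS6 * W2[1:] * S * (1.0 - S)  # back-propagated error at the hidden units
    grad_W1 = np.outer(delta, x)             # dJ/dW1: one row per hidden unit
    return grad_W1, grad_W2
```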

Initialize a NumPy array X for the training data and y for the target values, as shown below.

Note: the first column of X holds the bias inputs for the hidden layer (bₕ), and the first column of y holds the bias for the output layer (bᴼ).
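A sketch of the data setup. It handles the output bias slightly differently from the note above: rather than a bias column in y, the output bias is carried by the 1 prepended to the hidden activations inside forward, so y holds only the four target values.

```python
# Each row of X is one XOR case; the leading 1 is the bias input for the hidden layer.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# XOR targets for the four rows of X.
y = np.array([0.0, 1.0, 1.0, 0.0])
```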

Initialize random input and output weight tensors using randn from numpy.random.
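For example (the shapes follow from five hidden units, two inputs plus a bias, and one output unit; the fixed seed is an assumption added only so that runs are repeatable):

```python
np.random.seed(42)            # assumption: fixed seed for reproducibility
W1 = np.random.randn(5, 3)    # input weight tensor: 5 hidden units x (bias + 2 inputs)
W2 = np.random.randn(6)       # output weight tensor: output bias + 5 hidden activations
```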

Initialize the learning rate and the number of iterations, and declare an empty list of costs, which we will use while training our simple neural network. With a small learning rate the network learns well but takes longer to train; you can experiment with different learning rates as well.
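The exact values are not critical; something like the following works as a starting point:

```python
lr = 0.1          # learning rate: small enough to stay stable, try other values
n_iter = 10000    # number of training epochs (assumed value)
costs = []        # mean cost per epoch, filled in during training
```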

To start the training, we iterate through a range of epochs. At each epoch we update the input and output weight tensors using the given learning rate, append the mean cost over all training cases to the costs variable, and print the cost at every multiple of 1000 iterations.
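A sketch of that training loop, updating the weights one training case at a time:

```python
for epoch in range(n_iter):
    epoch_costs = []
    for i in range(X.shape[0]):
        grad_W1, grad_W2 = backprop(X[i], y[i], W1, W2)
        W1 -= lr * grad_W1                        # gradient-descent step for W1
        W2 -= lr * grad_W2                        # gradient-descent step for W2
        _, S6 = forward(X[i], W1, W2)
        epoch_costs.append(0.5 * (y[i] - S6) ** 2)
    costs.append(np.mean(epoch_costs))
    if epoch % 1000 == 0:
        print(f"iteration {epoch}: cost = {costs[-1]:.6f}")
```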

After training completes, let’s run the four XOR test cases through the forward function defined above.
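For example:

```python
for x_case, target in zip(X, y):
    _, S6 = forward(x_case, W1, W2)
    print(f"inputs {x_case[1:]} -> prediction {S6:.4f} (target {target:.0f})")
```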

Output: if training has converged, the four predictions should come out close to 0, 1, 1, and 0 respectively.

Let’s analyze the cost during training; have a look at the plot below.
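The plot can be reproduced with Matplotlib:

```python
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("mean squared-error cost")
plt.title("Training cost vs. iteration")
plt.show()
```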

The plot shows that before roughly 5,000 iterations the cost drops steeply; after that, the curve flattens into an almost straight line, ending with a small, minimized cost (error) value. Sounds good!

So, that’s all folks! This is a simple architecture and a working model of a neural network; in real-world scenarios networks are much larger and more complex, with many more parameters and hidden layers. The intent of this blog is to focus on the mathematics behind neural networks. I hope you enjoyed it and learned something valuable.

Keep safe, happy learning, and thank you!

Refer to the full code file below:
