# Building Neural Networks with Python Code and Math in Detail — II

## The second part of our tutorial on neural networks from scratch. From the math behind them to step-by-step implementation case studies in Python. Launch the samples on Google Colab.

Author(s): Pratik Shukla, Roberto Iriondo

Last updated, June 30, 2020

In the first part of our tutorial on neural networks, we explained the basic concepts about neural networks, from the math behind them to implementing neural networks in Python without any hidden layers. We showed how to make satisfactory predictions even in case scenarios where we did not use any hidden layers. However, there are several limitations to single-layer neural networks.

In this tutorial, we will dive in-depth on the limitations and advantages of using neural networks in machine learning. We will show how to implement neural nets with hidden layers and how these lead to a higher accuracy rate on our predictions, along with implementation samples in Python on Google Colab.

# 1. Limitations and Advantages of Neural Networks

## Limitations of single-layer neural networks:

• They can only predict linearly separable data. If we have non-linear data, then training our single-layer neural network will lead to low accuracy in our prediction rate.
• The decision boundary of a single-layer neural network must be a hyperplane, which means that if our data is distributed in 3 dimensions, then our decision boundary must be a 2-dimensional plane.

To overcome such limitations, we use hidden layers in our neural networks.
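To make the first limitation concrete, here is a minimal sketch that searches a coarse grid of candidate lines for one that separates the OR targets, and then does the same for the XOR targets. The helper name `separable_on_grid`, the grid range, and the bias term `b` are assumptions made for illustration; they are not part of the tutorial's own code.

```python
import itertools
import numpy as np

# The four binary input points used throughout this tutorial.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
or_targets = np.array([0, 1, 1, 1])   # OR gate: linearly separable
xor_targets = np.array([0, 1, 1, 0])  # XOR gate: not linearly separable

def separable_on_grid(X, y, steps=21):
    """Return True if some line w1*x1 + w2*x2 + b = 0 on a coarse grid
    puts every point on the correct side (hypothetical helper)."""
    grid = np.linspace(-2.0, 2.0, steps)
    for w1, w2, b in itertools.product(grid, repeat=3):
        predicted = (X @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(predicted, y):
            return True
    return False

print(separable_on_grid(X, or_targets))
print(separable_on_grid(X, xor_targets))
```

The search finds a separating line for OR and none for XOR. For XOR this is not a limitation of the grid: no single straight line can separate those four points.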

## Advantages of single-layer neural networks:

• Single-layer neural networks take less time to train compared to a multi-layer neural network.
• Single-layer neural networks have explicit links to statistical models.
• The outputs in single-layer neural networks are weighted sums of inputs. It means that we can easily interpret the output of a single-layer neural network.

## Advantages of multilayer neural networks:

• They can be used to classify non-linearly separable data.
• Multilayer neural networks are more reliable compared to single-layer neural networks.

# 2. How to select the number of neurons in a hidden layer?

• The number of hidden nodes should be less than twice the size of the nodes in the input layer.

For example: If we have 2 input nodes, then our hidden nodes should be less than 4.

a. 2 inputs, 4 hidden nodes: Figure 1: A neural net with 2 inputs, and 4 hidden nodes.

b. 2 inputs, 3 hidden nodes: Figure 2: A neural net with 2 inputs, and 3 hidden nodes.

c. 2 inputs, 2 hidden nodes: Figure 3: A neural network with 2 inputs, and 2 hidden nodes.

d. 2 inputs, 1 hidden node: Figure 4: A neural net with 2 inputs, and 1 hidden node.
• The number of hidden nodes should be 2/3 the size of input nodes, plus the size of the output node.

For example: If we have 2 input nodes and 1 output node then the hidden nodes should be = floor(2*2/3 + 1) = 2

a. 2 inputs, 2 hidden nodes: Figure 5: A neural net with 2 inputs, and 2 hidden nodes.
• The number of hidden nodes should be between the size of input nodes and output nodes.

For example: If we have 3 input nodes and 2 output nodes, then the hidden nodes should be between 2 and 3.

a. 3 inputs, 2 hidden nodes, 2 outputs: Figure 6: A neural net with 3 inputs, 2 hidden nodes, and 2 outputs.

b. 3 inputs, 3 hidden nodes, 2 outputs: Figure 7: A neural net with 3 inputs, 3 hidden nodes, and 2 outputs.
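The three rules of thumb above can be evaluated side by side. This is only a sketch: the function name `hidden_node_suggestions` and the dictionary keys are hypothetical, and the rules are starting points rather than guarantees.

```python
import math

def hidden_node_suggestions(n_inputs, n_outputs):
    """Evaluate the three rules of thumb listed above (hypothetical helper)."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        "below_twice_inputs": 2 * n_inputs,  # exclusive upper bound
        "two_thirds_rule": math.floor(2 * n_inputs / 3 + n_outputs),
        "between_in_and_out": (low, high),   # inclusive range
    }

print(hidden_node_suggestions(2, 1))
print(hidden_node_suggestions(3, 2))
```

For 2 inputs and 1 output, the two-thirds rule gives 2 hidden nodes, and for 3 inputs and 2 outputs the third rule gives the range 2 to 3, matching the examples above.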

## How many weight values do we need?

1. For a hidden layer: Number of inputs * No. of hidden layer nodes
2. For an output layer: Number of hidden layer nodes * No. of outputs
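As a quick check of the weight counts, the sketch below counts the weights of a network with one hidden layer and no bias terms. The helper `count_weights` is hypothetical. The hidden-layer count is taken as inputs times hidden nodes, which agrees with the "6 for hidden layer" and "8 for hidden layer" comments in the full listings later in this tutorial.

```python
def count_weights(n_inputs, n_hidden, n_outputs):
    """Count weights in a network with one hidden layer and no bias terms."""
    hidden_layer = n_inputs * n_hidden
    output_layer = n_hidden * n_outputs
    return hidden_layer, output_layer, hidden_layer + output_layer

print(count_weights(2, 3, 1))  # (6, 3, 9)
print(count_weights(2, 4, 1))  # (8, 4, 12)
```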

# 3. The General Structure of an Artificial Neural Network (ANN):

Figure 8: General structure for an artificial neural network with three layers, an input layer, a hidden layer, and an output layer.

## Summarization of an artificial neural network:

1. Take the inputs.
2. Assign random weights in the hidden layer and the output layer.
3. Run the code for training.
4. Find the error in prediction.
5. Update the weight values of the hidden layer and output layer by gradient descent algorithm.
6. Repeat the training phase with updated weights.
7. Make predictions.

## Execution in phase-1:

We first define some terms we are going to use in these derivatives:

a. Finding the first derivative:

b. Finding the second derivative:

c. Finding the third derivative:

Notice that we already derived these derivatives in the first part of our tutorial.
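The three derivatives of this phase can be written out with the chain rule as follows. This is a sketch under assumed notation chosen to mirror the variable names in the code later in this tutorial: $E$ is the error, $t$ the target output, $\mathrm{in}_o$ and $\mathrm{out}_o$ the input and output of the output node, $\mathrm{out}_{h_j}$ the output of hidden node $j$, $w_{o_j}$ the output-layer weight attached to it, and $\sigma$ the sigmoid function.

```latex
\begin{aligned}
E &= \tfrac{1}{2}\,(\mathrm{out}_o - t)^2, \qquad
\mathrm{out}_o = \sigma(\mathrm{in}_o), \qquad
\mathrm{in}_o = \sum_j \mathrm{out}_{h_j}\, w_{o_j} \\[4pt]
\frac{\partial E}{\partial w_{o_j}}
  &= \frac{\partial E}{\partial \mathrm{out}_o}\cdot
     \frac{\partial \mathrm{out}_o}{\partial \mathrm{in}_o}\cdot
     \frac{\partial \mathrm{in}_o}{\partial w_{o_j}} \\[4pt]
\text{(a)}\quad \frac{\partial E}{\partial \mathrm{out}_o} &= \mathrm{out}_o - t \\
\text{(b)}\quad \frac{\partial \mathrm{out}_o}{\partial \mathrm{in}_o}
  &= \sigma(\mathrm{in}_o)\,\bigl(1 - \sigma(\mathrm{in}_o)\bigr) \\
\text{(c)}\quad \frac{\partial \mathrm{in}_o}{\partial w_{o_j}} &= \mathrm{out}_{h_j}
\end{aligned}
```

In the code, these correspond to `derror_douto`, `douto_dino`, and `dino_dwo`.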

## Execution in phase-2:

Represented as:

a. Finding the first derivative:

Here we are going to use the chain rule to find the derivative.

Using the chain rule again.

The step below is similar to what we did in the first part of our tutorial on neural networks. Figure 16: Expanding our result for the first derivative, resulting in the output weight.

b. Finding the second derivative:

c. Finding the third derivative:
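The chain of derivatives for the hidden-layer weights can be sketched as follows. The notation is an assumption chosen to match the variable names in the code: $x_i$ is input feature $i$, $w_{h_{ij}}$ the weight from input $i$ to hidden node $j$, $\mathrm{in}_{h_j} = \sum_i x_i\, w_{h_{ij}}$ and $\mathrm{out}_{h_j} = \sigma(\mathrm{in}_{h_j})$ the input and output of hidden node $j$, and $w_{o_j}$ the output-layer weight attached to that node.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_{h_{ij}}}
  &= \frac{\partial E}{\partial \mathrm{out}_{h_j}}\cdot
     \frac{\partial \mathrm{out}_{h_j}}{\partial \mathrm{in}_{h_j}}\cdot
     \frac{\partial \mathrm{in}_{h_j}}{\partial w_{h_{ij}}} \\[4pt]
\text{(a)}\quad \frac{\partial E}{\partial \mathrm{out}_{h_j}}
  &= \frac{\partial E}{\partial \mathrm{in}_o}\cdot
     \frac{\partial \mathrm{in}_o}{\partial \mathrm{out}_{h_j}}
   = (\mathrm{out}_o - t)\,\sigma(\mathrm{in}_o)\bigl(1 - \sigma(\mathrm{in}_o)\bigr)\, w_{o_j} \\
\text{(b)}\quad \frac{\partial \mathrm{out}_{h_j}}{\partial \mathrm{in}_{h_j}}
  &= \sigma(\mathrm{in}_{h_j})\,\bigl(1 - \sigma(\mathrm{in}_{h_j})\bigr) \\
\text{(c)}\quad \frac{\partial \mathrm{in}_{h_j}}{\partial w_{h_{ij}}} &= x_i
\end{aligned}
```

These terms correspond to `derror_douth`, `douth_dinh`, and `dinh_dwh` in the code.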

# 4. Implementation of a multilayer neural network in Python

📚 Multilayer neural network: A neural network with a hidden layer. 📚

For more definitions, check out our article on terminology in machine learning.

Below we are going to implement the “OR” gate without the bias value. As we will see, adding hidden layers in a neural network helps us achieve higher accuracy in our models.

# Graphical representation:

Figure 22: Inputs on the graph; notice that dots of the same color have the same output.

# Implementation in Python:

a. Import required libraries:

b. Define input features:

Next, we take the input values for which we want to train our neural network. Here we use two input features. In real-world data sets, the number of input features is usually much higher.

c. Define target output values:

For each set of input features, we want a specific output, called the target output. We are going to train a model that gives us the target output for our input features.

Figure 25: Defining our target output, and reshaping our target output into a vector
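For reference, steps b and c can be written as the following short snippet, taken from the full listing at the end of this section. The shape comments are added here for illustration.

```python
import numpy as np

# Four input samples with two features each (all combinations of 0 and 1).
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# OR-gate targets, reshaped from a (1, 4) row into a (4, 1) column vector.
target_output = np.array([[0, 1, 1, 1]]).reshape(4, 1)

print(input_features.shape)  # (4, 2)
print(target_output.shape)   # (4, 1)
```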

d. Assign random weights:

Next, we are going to assign random weights to the input features. Note that our model is going to modify these weight values to be optimal. At this point, we are taking these values randomly. Here we have two layers, so we have to assign weights for them separately.

The other variable is the learning rate. We use the learning rate (LR) in the gradient descent algorithm to scale each weight update. A small LR makes the updates more stable but slows convergence, while a large LR can overshoot the minimum; here we use 0.05.

Figure 26: Defining the weights for our neural net, along with our learning rate (LR)
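The weight and learning-rate definitions, as they appear in the full listing at the end of this section, look like this. Note that the listing for this example uses fixed starting values rather than randomly drawn ones.

```python
import numpy as np

# 2 inputs x 3 hidden nodes = 6 hidden-layer weights.
weight_hidden = np.array([[0.1, 0.2, 0.3],
                          [0.4, 0.5, 0.6]])

# 3 hidden nodes x 1 output = 3 output-layer weights.
weight_output = np.array([[0.7], [0.8], [0.9]])

# Learning rate used by gradient descent.
lr = 0.05

print(weight_hidden.shape, weight_output.shape)  # (2, 3) (3, 1)
```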

e. Sigmoid function:

Once we have our weight values and input features, we send them to the main function that predicts the output. The weighted sums can take any real value, but here we want to classify data, so we need an output between 0 and 1. For such an output, we use a sigmoid function.

f. Sigmoid function derivative:

In a gradient descent algorithm, we need the derivative of the sigmoid function.
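Both functions are short. The definitions below are the ones from the full listing; the finite-difference comparison at the end is an added sanity check (not part of the original code) that the derivative formula matches the slope of the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    # Derivative of the sigmoid, expressed through the sigmoid itself.
    return sigmoid(x) * (1 - sigmoid(x))

# Compare the formula against a central finite difference.
x = np.array([-2.0, 0.0, 2.0])
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(sigmoid(0.0))      # 0.5
print(sigmoid_der(0.0))  # 0.25
print(np.allclose(numeric, sigmoid_der(x)))  # True
```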

g. The main logic for predicting output and updating the weight values:

We are going to understand the following code step-by-step.

## How does it work?

b. Next, we find the input for the hidden layer. Defined by the following formula: Figure 31: Finding the input for our neural network’s hidden layer.

We can also represent it as matrices to understand in a better way.

The first matrix here is input features with size (4*2), and the second matrix is weight values for a hidden layer with size (2*3). So the resultant matrix will be of size (4*3).

The intuition behind the final matrix size:

The row size of the final matrix is the same as the row size of the first matrix, and the column size of the final matrix is the same as the column size of the second matrix in multiplication (dot product).
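The shape rule can be checked directly with NumPy. This small sketch uses the input features and the starting hidden-layer weights from the full listing.

```python
import numpy as np

input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # size (4*2)
weight_hidden = np.array([[0.1, 0.2, 0.3],
                          [0.4, 0.5, 0.6]])                  # size (2*3)

# (4*2) dot (2*3) gives (4*3): one row per sample, one column per hidden node.
input_hidden = np.dot(input_features, weight_hidden)

print(input_hidden.shape)  # (4, 3)
print(input_hidden)
```

Each row of `input_hidden` holds the three hidden-node inputs for one sample. For example, the sample `[1, 1]` gives `[0.5, 0.7, 0.9]`, the column-wise sum of the two weight rows.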

In the representation below, each of those boxes represents a value.

c. Afterward, we have an input for the hidden layer, and it is going to calculate the output by applying a sigmoid function. Below is the output of the hidden layer:

d. Next, we multiply the output of the hidden layer with the weight of the output layer: Figure 34: Formula representing the output of our hidden layer, with the weight of the output layer.

The first matrix shows the output of the hidden layer, which has a size of (4*3). The second matrix represents the weight values of the output layer, which has a size of (3*1), so the resulting matrix has a size of (4*1).

Figure 35: Representation of the hidden layer, and our output layer.

e. Afterward, we calculate the output of the output layer by applying a sigmoid function. It can also be represented in matrix form as follows.

f. Now that we have our predicted output, we find the mean squared error between the target output and the predicted output.

Figure 37: Finding the mean between our target output and our predicted output.
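Steps b through f amount to one forward pass followed by the error calculation. The sketch below runs them once with the starting weights from the full listing. Note that the listing computes half the squared difference per sample and sums the results, rather than taking a mean.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target_output = np.array([[0], [1], [1], [1]])
weight_hidden = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
weight_output = np.array([[0.7], [0.8], [0.9]])

input_hidden = np.dot(input_features, weight_hidden)  # (4, 3)
output_hidden = sigmoid(input_hidden)                 # (4, 3)
input_op = np.dot(output_hidden, weight_output)       # (4, 1)
output_op = sigmoid(input_op)                         # (4, 1)

# Half squared error per sample, summed over the four samples.
error_out = 0.5 * np.power(output_op - target_output, 2)

print(output_op.shape)
print(error_out.sum())
```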

g. Next, we begin the first phase of training. In this step, we update the weight values for the output layer. We need to find out how much the output weights affect the error value. To update the weights, we use a gradient descent algorithm. Notice that we have already found the derivatives we will use during the training phase.

g.a. Matrix representation of the first derivative. Matrix size (4*1).

`derror_douto = output_op - target_output`

g.b. Matrix representation of the second derivative. Matrix size (4*1).

`douto_dino = sigmoid_der(input_op)`

g.c. Matrix representation of the third derivative. Matrix size (4*3).

`dino_dwo = output_hidden`

g.d. Matrix representation of transpose of `dino_dwo`. Matrix size (3*4). Figure 42: Matrix representation of our variable dino_dwo, see the implementation for details.

g.e. Now, we are going to find the final matrix of output weight. For a detailed explanation of this step, please check out our previous tutorial. The matrix size will be (3*1), which is the same as the `output_weight` matrix.
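Putting steps g.a through g.e together, the following sketch computes the output-layer gradient once from the starting weights and prints the shape of each intermediate matrix. The setup values are those of the full listing.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target_output = np.array([[0], [1], [1], [1]])
weight_hidden = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
weight_output = np.array([[0.7], [0.8], [0.9]])

# Forward pass.
output_hidden = sigmoid(np.dot(input_features, weight_hidden))
input_op = np.dot(output_hidden, weight_output)
output_op = sigmoid(input_op)

# Phase 1 derivatives.
derror_douto = output_op - target_output                    # (4, 1)
douto_dino = sigmoid_der(input_op)                          # (4, 1)
dino_dwo = output_hidden                                    # (4, 3)
derror_dwo = np.dot(dino_dwo.T, derror_douto * douto_dino)  # (3, 1)

print(derror_douto.shape, douto_dino.shape, dino_dwo.shape, derror_dwo.shape)
```

The transpose of `dino_dwo` has size (3*4), so the dot product with the (4*1) element-wise product sums the contributions of all four samples for each output weight.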

Hence, we have successfully found the derivative values. Next, we update the weight values accordingly with the help of a gradient descent algorithm.

Nonetheless, we also have to find the derivative for phase-2. Let’s first find that, and then we will update the weights for both layers in the end.

h. Phase -2. Updating the weights in the hidden layer.

Since we have already discussed how we derived the derivative values, we are just going to see matrix representation for each of them to understand it better. Our goal here is to find the weight matrix for the hidden layer, which is of size (2*3).

h.a. Matrix representation for the first derivative.

`derror_dino = derror_douto * douto_dino`

h.b. Matrix representation for the second derivative.

`dino_douth = weight_output`

h.c. Matrix representation for the third derivative.

`derror_douth = np.dot(derror_dino , dino_douth.T)`

h.d. Matrix representation for the fourth derivative.

`douth_dinh = sigmoid_der(input_hidden)`

h.e. Matrix representation for the fifth derivative.

`dinh_dwh = input_features`

h.f. Matrix representation for the sixth derivative.

`derror_dwh = np.dot(dinh_dwh.T, douth_dinh * derror_douth)`

Notice that our goal was to find a gradient matrix for the hidden weights with the size of (2*3), and the result above has exactly that size.
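One way to gain confidence in the hidden-layer derivative is to compare it with a numerical gradient. The sketch below is an added check, not part of the original tutorial: the helper `total_error` and the finite-difference loop are assumptions made for illustration, while the analytic part follows steps h.a through h.f.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target_output = np.array([[0.0], [1.0], [1.0], [1.0]])
weight_hidden = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
weight_output = np.array([[0.7], [0.8], [0.9]])

def total_error(w_hidden):
    """Summed half squared error as a function of the hidden weights
    (hypothetical helper)."""
    out = sigmoid(np.dot(sigmoid(np.dot(input_features, w_hidden)), weight_output))
    return (0.5 * (out - target_output) ** 2).sum()

# Analytic gradient, following steps h.a to h.f.
input_hidden = np.dot(input_features, weight_hidden)
output_hidden = sigmoid(input_hidden)
input_op = np.dot(output_hidden, weight_output)
output_op = sigmoid(input_op)
derror_dino = (output_op - target_output) * sigmoid_der(input_op)
derror_douth = np.dot(derror_dino, weight_output.T)
derror_dwh = np.dot(input_features.T, sigmoid_der(input_hidden) * derror_douth)

# Numerical gradient by central differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(weight_hidden)
for i in range(weight_hidden.shape[0]):
    for j in range(weight_hidden.shape[1]):
        bump = np.zeros_like(weight_hidden)
        bump[i, j] = eps
        numeric[i, j] = (total_error(weight_hidden + bump)
                         - total_error(weight_hidden - bump)) / (2 * eps)

print(derror_dwh.shape)  # (2, 3)
print(np.allclose(derror_dwh, numeric, atol=1e-7))
```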

h.g. Updating the weight values :

We will use the gradient descent algorithm to update the values. It takes three parameters.

1. The original weight: we already have it.
2. The learning rate (LR): we assigned it the value of 0.05.
3. The derivative: Found on the previous step.

Since we have all of our parameter values, this will be a straightforward operation. First, we are updating the weight values for the output layer, and then we are updating the weight values for the hidden layer.
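A single gradient descent update can be sketched as follows. The helper `total_error` is hypothetical and only used to compare the error before and after the update; the gradient and update lines follow the full listing. Both gradients are computed before either weight matrix is modified.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target_output = np.array([[0], [1], [1], [1]])
weight_hidden = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
weight_output = np.array([[0.7], [0.8], [0.9]])
lr = 0.05

def total_error(w_hidden, w_output):
    """Summed half squared error for the given weights (hypothetical helper)."""
    out = sigmoid(np.dot(sigmoid(np.dot(input_features, w_hidden)), w_output))
    return (0.5 * np.power(out - target_output, 2)).sum()

error_before = total_error(weight_hidden, weight_output)

# Forward pass and both gradients.
input_hidden = np.dot(input_features, weight_hidden)
output_hidden = sigmoid(input_hidden)
input_op = np.dot(output_hidden, weight_output)
output_op = sigmoid(input_op)
derror_dino = (output_op - target_output) * sigmoid_der(input_op)
derror_dwo = np.dot(output_hidden.T, derror_dino)
derror_douth = np.dot(derror_dino, weight_output.T)
derror_dwh = np.dot(input_features.T, sigmoid_der(input_hidden) * derror_douth)

# Gradient descent: new weight = original weight - LR * derivative.
weight_hidden -= lr * derror_dwh
weight_output -= lr * derror_dwo

error_after = total_error(weight_hidden, weight_output)
print(error_before, error_after)
```

After one update the error is slightly lower than before. Training repeats this step many times.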

i. Final weight values:

Below, we show the updated weight values for both layers. Our predictions are based on these values.

j. Making predictions:

j.a. Prediction for (1,1).

Target output = 1

Explanation:

First of all, we take the input values for which we want to predict the output. The `result1` variable stores the dot product of the input values and the hidden layer weights. We obtain the hidden layer output by applying a sigmoid function to it, and the result is stored in the `result2` variable. This is the input feature for the output layer. We calculate the input for the output layer (`result3`) by multiplying `result2` with the output layer weights. To find the final output value (`result4`), we take the sigmoid of that.
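The four steps can be wrapped in a small function. This is a sketch: the `predict` helper is hypothetical, and the weights used here are the untrained starting values from the listing (an assumption made to keep the example short), so the printed value is not the trained prediction.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(single_point, weight_hidden, weight_output):
    """Run the four prediction steps for one input point (hypothetical helper)."""
    result1 = np.dot(single_point, weight_hidden)  # input to the hidden layer
    result2 = sigmoid(result1)                     # output of the hidden layer
    result3 = np.dot(result2, weight_output)       # input to the output layer
    result4 = sigmoid(result3)                     # final prediction
    return result4

# Untrained starting weights, used only to illustrate the data flow.
weight_hidden = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
weight_output = np.array([[0.7], [0.8], [0.9]])

prediction = predict(np.array([1, 1]), weight_hidden, weight_output)
print(prediction)
```

With the trained weights from the training loop passed in instead, the same function reproduces the predictions discussed here.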

Notice that the predicted output is very close to 1. So we have managed to make accurate predictions.

j.b. Prediction for (0,0).

Target output = 0

Note that the predicted output is very close to 0, which matches the target output.

k. Final error value :

After 200,000 iterations, we have our final error value — the lower the error, the higher the accuracy of the model.

As shown above, we can see that the error value is 0.0000000189. This value is the final error value in prediction after 200,000 iterations.

## Putting it all together:

```python
# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0, 1, 1, 1]])

# Reshape our target output into a column vector:
target_output = target_output.reshape(4, 1)
print(target_output.shape)
print(target_output)

# Define weights:
# 6 for hidden layer
# 3 for output layer
# 9 total
weight_hidden = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
weight_output = np.array([[0.7], [0.8], [0.9]])

# Learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

for epoch in range(200000):
    # Input for hidden layer:
    input_hidden = np.dot(input_features, weight_hidden)

    # Output from hidden layer:
    output_hidden = sigmoid(input_hidden)

    # Input for output layer:
    input_op = np.dot(output_hidden, weight_output)

    # Output from output layer:
    output_op = sigmoid(input_op)

    # ==========================================================
    # Phase 1

    # Calculating Mean Squared Error:
    error_out = (1 / 2) * np.power(output_op - target_output, 2)
    print(error_out.sum())

    # Derivatives for phase 1:
    derror_douto = output_op - target_output
    douto_dino = sigmoid_der(input_op)
    dino_dwo = output_hidden

    derror_dwo = np.dot(dino_dwo.T, derror_douto * douto_dino)

    # ==========================================================
    # Phase 2

    # derror_w1 = derror_douth * douth_dinh * dinh_dw1
    # derror_douth = derror_dino * dino_outh

    # Derivatives for phase 2:
    derror_dino = derror_douto * douto_dino
    dino_douth = weight_output
    derror_douth = np.dot(derror_dino, dino_douth.T)
    douth_dinh = sigmoid_der(input_hidden)
    dinh_dwh = input_features
    derror_dwh = np.dot(dinh_dwh.T, douth_dinh * derror_douth)

    # Update weights:
    weight_hidden -= lr * derror_dwh
    weight_output -= lr * derror_dwo

# Final hidden layer weight values:
print(weight_hidden)

# Final output layer weight values:
print(weight_output)

# Predictions:

# Taking inputs:
single_point = np.array([1, 1])
# 1st step:
result1 = np.dot(single_point, weight_hidden)
# 2nd step:
result2 = sigmoid(result1)
# 3rd step:
result3 = np.dot(result2, weight_output)
# 4th step:
result4 = sigmoid(result3)
print(result4)

# =================================================
# Taking inputs:
single_point = np.array([0, 0])
# 1st step:
result1 = np.dot(single_point, weight_hidden)
# 2nd step:
result2 = sigmoid(result1)
# 3rd step:
result3 = np.dot(result2, weight_output)
# 4th step:
result4 = sigmoid(result3)
print(result4)

# =================================================
# Taking inputs:
single_point = np.array([1, 0])
# 1st step:
result1 = np.dot(single_point, weight_hidden)
# 2nd step:
result2 = sigmoid(result1)
# 3rd step:
result3 = np.dot(result2, weight_output)
# 4th step:
result4 = sigmoid(result3)
print(result4)
```

Notice that the data we used in this example was linearly separable, which means that a single line can separate the outputs with value 1 from the outputs with value 0.

Figure 56: Graph showing data being linearly separable, allowing to classify outputs with 1 value or 0 values.

# 5. Comparison with a single-layer neural network

The final error value for the following code is:

As we can see, the error value is much higher than the error we found in our neural network implementation with hidden layers, which is one of the main reasons to use hidden layers in a neural network.

```python
# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0, 1, 1, 1]])

# Reshape our target output into a column vector:
target_output = target_output.reshape(4, 1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1], [0.2]])
print(weights.shape)
print(weights)

# Define learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Main logic for neural network:
# Running our code 10000 times:
for epoch in range(10000):
    inputs = input_features

    # Feedforward input:
    pred_in = np.dot(inputs, weights)

    # Feedforward output:
    pred_out = sigmoid(pred_in)

    # Backpropagation
    # Calculating error:
    error = pred_out - target_output
    x = error.sum()

    # Going with the formula:
    print(x)

    # Calculating derivative:
    dcost_dpred = error
    # Sigmoid derivative evaluated at the pre-activation input pred_in.
    dpred_dz = sigmoid_der(pred_in)

    # Multiplying individual derivatives:
    z_delta = dcost_dpred * dpred_dz

    # Multiplying with the 3rd individual derivative:
    inputs = input_features.T
    weights -= lr * np.dot(inputs, z_delta)

# Predictions:

# Taking inputs:
single_point = np.array([1, 0])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)

# ====================================
# Taking inputs:
single_point = np.array([0, 0])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)

# ====================================
# Taking inputs:
single_point = np.array([1, 1])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)
```

# Graphical Representation Of Data Points :

As figure 59 shows, we have 2 inputs and 1 output. In this example, we are going to use 4 hidden perceptrons. The red dots have an output value of 0, and the black dots have an output value of 1. The targets follow the XOR pattern (`[0,1,1,0]` in the code for this example), so we cannot classify them using a single straight line.

# Implementation in Python:

b. Define input features:

c. Define the target output:

d. Assign random weight values:

In figure 64, notice that we are using NumPy's random function to generate random values.

`numpy.random.rand(x,y)`: Here x is the number of rows, and y is the number of columns. It generates output values over [0,1). It means 0 is included, but 1 is not included in the value generation.
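A short sketch of the weight initialization for this example is shown below. The shapes follow the "8 for hidden layer, 4 for output layer" comment in the full listing. The call to `np.random.seed(0)` is an addition for reproducibility and does not appear in the original code.

```python
import numpy as np

np.random.seed(0)  # added here so repeated runs give the same values

# 2 inputs x 4 hidden nodes = 8 weights; 4 hidden nodes x 1 output = 4 weights.
weight_hidden = np.random.rand(2, 4)
weight_output = np.random.rand(4, 1)

print(weight_hidden.shape, weight_output.shape)  # (2, 4) (4, 1)
print(weight_hidden.min() >= 0, weight_hidden.max() < 1)  # True True
```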

e. Sigmoid function:

f. Finding the derivative of the sigmoid function:

g. Training our neural network:

h. Weight values of hidden layer: Figure 69: Displaying the final values of our weights in the hidden layer.

i. Weight values of output layer: Figure 70: Displaying the final weight values for our output layers.

j. Final error value :

After training our model for 200,000 iterations, we finally achieved a low error value. Figure 71: Low error value of the model trained during 200,000 iterations

k. Making predictions from the trained model :

k.a. Predicting output for (0.5, 2).

The predicted output is closer to 1.

k.b. Predicting output for (0, -1)

The predicted output is very near to 0.

k.c. Predicting output for (0, 5)

The predicted output is close to 1.

k.d. Predicting output for (1, 1.2)

The predicted output is close to 0.

Based on the output values, our model predicts these values well.

We can separate our data in the following way as shown in Figure 76. Note that this is not the only possible way to separate these values.

To conclude, using a hidden layer in our neural networks helps us reduce the error rate when we have non-linearly separable data. Even though the training time increases, our goal is to make high-accuracy predictions, and the hidden layer lets us achieve that.

## Putting it all together:

```python
# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0, 1, 1, 0]])

# Reshape our target output into a column vector:
target_output = target_output.reshape(4, 1)
print(target_output.shape)
print(target_output)

# Define weights:
# 8 for hidden layer
# 4 for output layer
# 12 total
weight_hidden = np.random.rand(2, 4)
weight_output = np.random.rand(4, 1)

# Learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Main logic:
for epoch in range(200000):
    # Input for hidden layer:
    input_hidden = np.dot(input_features, weight_hidden)

    # Output from hidden layer:
    output_hidden = sigmoid(input_hidden)

    # Input for output layer:
    input_op = np.dot(output_hidden, weight_output)

    # Output from output layer:
    output_op = sigmoid(input_op)

    # ==========================================================
    # Phase 1

    # Calculating Mean Squared Error:
    error_out = (1 / 2) * np.power(output_op - target_output, 2)
    print(error_out.sum())

    # Derivatives for phase 1:
    derror_douto = output_op - target_output
    douto_dino = sigmoid_der(input_op)
    dino_dwo = output_hidden

    derror_dwo = np.dot(dino_dwo.T, derror_douto * douto_dino)

    # ==========================================================
    # Phase 2

    # derror_w1 = derror_douth * douth_dinh * dinh_dw1
    # derror_douth = derror_dino * dino_outh

    # Derivatives for phase 2:
    derror_dino = derror_douto * douto_dino
    dino_douth = weight_output
    derror_douth = np.dot(derror_dino, dino_douth.T)
    douth_dinh = sigmoid_der(input_hidden)
    dinh_dwh = input_features
    derror_dwh = np.dot(dinh_dwh.T, douth_dinh * derror_douth)

    # Update weights:
    weight_hidden -= lr * derror_dwh
    weight_output -= lr * derror_dwo

# Final values of weight in hidden layer:
print(weight_hidden)

# Final values of weight in output layer:
print(weight_output)

# Taking inputs:
single_point = np.array([0, -1])
# 1st step:
result1 = np.dot(single_point, weight_hidden)
# 2nd step:
result2 = sigmoid(result1)
# 3rd step:
result3 = np.dot(result2, weight_output)
# 4th step:
result4 = sigmoid(result3)
print(result4)

# Taking inputs:
single_point = np.array([0, 5])
# 1st step:
result1 = np.dot(single_point, weight_hidden)
# 2nd step:
result2 = sigmoid(result1)
# 3rd step:
result3 = np.dot(result2, weight_output)
# 4th step:
result4 = sigmoid(result3)
print(result4)

# Taking inputs:
single_point = np.array([1, 1.2])
# 1st step:
result1 = np.dot(single_point, weight_hidden)
# 2nd step:
result2 = sigmoid(result1)
# 3rd step:
result3 = np.dot(result2, weight_output)
# 4th step:
result4 = sigmoid(result3)
print(result4)
```

# 7. Conclusion

• Information is stored across the network's weights instead of in a database.
• These networks can learn from examples, and we can predict the output for similar events.
• In case of failure of one neuron, the network can detect the fault and still produce output.
• Neural networks can perform multiple tasks in parallel processes.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Published via Towards AI

# Citation

`Shukla, et al., “Building Neural Networks with Python Code and Math in Detail — II”, Towards AI, 2020`

# BibTex citation:

`@article{pratik_iriondo_2020,  title={Building Neural Networks with Python Code and Math in Detail — II},  url={https://towardsai.net/building-neural-nets-with-python},  journal={Towards AI},  publisher={Towards AI Co.},  author={Pratik, Shukla and Iriondo, Roberto},   year={2020},  month={Jun}}`


## References:

 Neural Networks from Scratch with Python Code and Math in Detail — I, Pratik Shukla, Roberto Iriondo, https://towardsai.net/neural-networks-with-python
