Hello Neural Network — The Beginner’s Guide to Understanding Neural Networks From Scratch

Published in

Praemineo

17 min read · Dec 4, 2018

Preface note

Greetings, geeks! Having a solid grasp of deep learning techniques feels like acquiring a superpower these days. But where should you start? What should you learn? What are the core concepts that actually make up this complex yet intriguing field? Yes, there are frameworks like TensorFlow and Keras, but they usually present neural networks as a black box, without explaining how the network works internally or how it arrives at its predictions. So I came up with the idea of starting from scratch.

Building from scratch helps you understand how a neural network works behind the scenes. I’ll be sure to explain everything along the way, and I encourage you to reach out if you have any questions! I will assume no prior knowledge of neural networks, but you will need to know some fundamentals of Python programming and basic calculus.

Without delay, let’s dive into building our first neural network model.

What we are doing

In this article I’ll show you how to create and train a neural network. We’ll be creating the simplest neural network possible. I am going to talk through the code that I wrote to predict an exam result from students’ sleep and study hours.

In our path to understanding neural networks, we are going to answer three questions: What, How and Why?

What is a Neural Network?
How Do Neural Networks Work?
Why do we use weights, a bias, a cost function, an activation function, forward propagation, and backward propagation?

So what is a neural network?

In simple words, a neural network is a computer simulation of the way biological neurons work within a human brain. The amazing thing about a neural network is that you don’t have to program explicit rules into it: it learns them from examples, much like a human brain!

Over time, the error in its output is used to improve the accuracy of the neural network model.

“AI is the new electricity.” Andrew Ng

human brain vs neural network, identifying patterns

What Are The Main Components Of Neural Network?

The first building block of a neural network is, well, the neuron. A neuron is like a function: it takes a few inputs and returns an output, and its goal is generally to perform a given task with the least possible error.

Neural networks are typically organized in layers. Layers are made up of a number of interconnected ‘nodes’ (connection points).

A basic neural network generally consists of 3 layers:

Input Layer: This layer accepts input in the form of an array of numbers.
Hidden Layers: Hidden layers are intermediary processing units. They transform the inputs into intermediate values, which lets the network capture relationships that a direct input-to-output mapping would miss.
Output Layer: Results of the hidden layer are then fed to the output layer, which produces the final prediction.

Let’s understand all these things with the help of a story.

There is a student named Perceptron. He is an engineering student, and his exams are approaching. Like many engineering students, he prefers to study at the last moment. He wants to predict his test score based on how many hours he sleeps and how many hours he studies the night before the exam.

Perceptron’s friend Raj has some data about other students (how many hours they slept and studied the night before the exam, and their test scores). Perceptron asks his friend for that information. Using it, he wants to predict the score and give his prediction to Raj, and Raj will tell him how close the prediction is. Turning inputs into a prediction like this is the feed-forward step.

So he starts working out a mathematical formula for calculating the test score from Raj’s dataset. He takes the first record, i.e. 5 hrs of study and 3 hrs of sleep, for which the student scored 75 marks. Somehow he needs to convert these 5 hrs and 3 hrs into 75 marks, so he comes up with an idea: multiply the study hours and the sleep hours by some numbers and add the results to get the test score.

Study * W1 + Sleep * W2

But there is one problem: which numbers?
So he starts multiplying 5 and 3 by some random numbers. These numbers are called weights.

First he picks w1=7 and w2=3 at random:

5*7 + 3 * 3 = 44

He gets 44. Raj tells him that this is lower than the target result, so he knows he needs to increase the values of w1 and w2.

When the end result is bad, he goes back through the whole process and tries to correct it. This process is called back-propagation.

We can backpropagate the error through the network and adjust the weights to minimize it. The error is calculated with the help of a cost function. In our story, Raj plays the role of the cost function, because every time he tells Perceptron how wrong his prediction is.

5*10 + 3 * 10 = 80

This time he gets 80, which is too high, so he changes the values of w1 and w2 again:

5*9 + 3 * 10 = 75

This time he gets the perfect output. Now he knows that if he multiplies his study hours by 9 and his sleep hours by 10, he will get his test score.

But suddenly he remembers that 5 of those marks came from cheating, so he can’t attribute them to his hours of study. He has to add that value separately. This is called the bias.

In the simplest words, the bias is the output of the neural net when it has absolutely zero input.

So our function output = w*input (y = mx) needs a constant term added to it (y = mx + c). When we change the weight w1, we change the gradient of the function, making it steeper or flatter. But what about shifting the function vertically? That is what the constant term is for.

Now he writes the equation again with a small adjustment:

5*8+3*10+5 = 75

Finally, he gets the correct result.
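Perceptron’s guesses can be written as a tiny Python function. This is only a sketch of the story above; the function name predict_score and its parameter names are my own, chosen for illustration.

```python
def predict_score(study_hours, sleep_hours, w1, w2, bias):
    """Weighted sum of the two inputs plus a constant bias term."""
    return study_hours * w1 + sleep_hours * w2 + bias

# Perceptron's three guesses without a bias, then the final one with it
print(predict_score(5, 3, w1=7, w2=3, bias=0))    # 44, too low
print(predict_score(5, 3, w1=10, w2=10, bias=0))  # 80, too high
print(predict_score(5, 3, w1=9, w2=10, bias=0))   # 75
print(predict_score(5, 3, w1=8, w2=10, bias=5))   # 75, with the bias
```

Changing a weight changes how much one input counts, while the bias shifts every prediction by the same amount.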

An artificial neuron learns through a similar process.
Let’s understand this with the help of another example.

How does an artificial neuron learn?

Let’s say we need our neuron to learn the relationship between the number pairs (2, 8), (5, 20) and (8, 32). Learning means finding a value for w that represents the relationship between the numbers in each pair.

target value = 8

Let’s put W = 3 randomly

predicted value = 6

Now we compare the result predicted by our artificial neuron with the ground truth value and calculate the difference, which is basically the error of our prediction.

When the weight is 3 we are off by 2 from the expected output (target value = 8). We can refer to this value as the cost of this operation. Now we apply this operation to all the number pairs.

cost = 2 (8–6)

cost = 5 (20–15)

cost = 8 (32–24)

Let’s calculate the total cost

Cost function (y = target value, yhat = predicted value): cost = y - yhat, summed over all pairs

Total cost = 2+5+8 = 15
Now the goal is to minimize this cost of the network.

What affects the cost? In this case only one variable: the weight.
So we need to adjust it in such a way that we get our desired output.

If we put w=4, our total cost will be 0. That means our model has found the perfect value of w, and with the help of this value we can predict the result.
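The same search can be sketched in a few lines of Python. The input/target pairs are inferred from the costs listed above (for example, 8 - 6 with w = 3 implies an input of 2), and the function name total_cost is my own. I take the absolute difference so that overshooting also counts as a cost, which is an assumption on my part.

```python
# (input, target) pairs inferred from the costs listed in the text
pairs = [(2, 8), (5, 20), (8, 32)]

def total_cost(w):
    """Sum of the absolute differences between target and prediction."""
    return sum(abs(target - w * x) for x, target in pairs)

for w in [3, 4, 5]:
    print("w =", w, "total cost =", total_cost(w))
```

The cost is smallest at w = 4, so that is the value the neuron should learn.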

The neural network is a kind of technology that is not an algorithm. It is a network that has weights on it, and you can adjust these weights so that it learns. You teach it through trials — Howard Rheingold

Understanding The Process With the help of Code

Now that you’ve gotten a basic intro, let’s jump into the code.

Prerequisites:

A good Python IDE; I recommend Jupyter Notebook
Python 3, as we’re writing Python code
NumPy (pip3 install numpy): NumPy stands for ‘Numerical Python’. It is a Python package that provides a powerful N-dimensional array object, which is much faster than plain Python lists for numerical work

When we step into a neural network problem, first we need to ask two questions.

1. What is your goal?
2. What resources do you currently have to accomplish that goal?

So let’s break down our problem according to the two questions above.
Goal — Predict test score.
Resources — Sleep hours and study hours dataset

This is our input dataset, sleep hours and study hours | Test Score is our target dataset

Now, in order to find the value of yhat (the predicted score), we need weights, inputs, a bias, and an activation function. Let’s find these values one by one.

Let’s generate some random values for weight 1, weight 2 and the bias.
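A minimal sketch of this step, assuming the values are drawn from a standard normal distribution with NumPy (the article does not say which distribution was used):

```python
import numpy as np

# Draw the starting weights and bias at random.
# Assumption: standard normal draws; any small random values would do.
w1 = np.random.randn()
w2 = np.random.randn()
b = np.random.randn()
```

Because the values are random, the numbers you see will differ from the ones printed here on every run.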

print(w1)

0.5002893521983375

print(w2)

-0.47546268114414475

We have all the values; now we need an activation function.

Activation Function

As the name implies, an activation function is a mathematical function applied to a neuron’s weighted sum that determines how strongly the neuron is “activated”, i.e. what it outputs.

The activation functions are typically non-linear. Non-linear mappings applied to inputs are able to capture interesting properties of the input.

But why do we need an activation function for our example?
Because Perceptron is using random weights and a random bias to predict his test score, he sometimes gets a predicted value above 100 or below 0. A score out of 100 can’t be negative or more than 100, right? We need some function to restrict the value of yhat (the predicted score) to a certain range, in this case between 0 and 100. That’s why we use an activation function. In simple words, the activation function here is a function that limits the output signal to a finite range.

We are going to use the sigmoid function because it squashes every input value into the range between 0 and 1.

Implementation of the activation function in Python
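A minimal sketch of the sigmoid function using NumPy, based on its standard definition 1 / (1 + e^(-x)):

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

Large positive inputs land near 1 and large negative inputs near 0, which is why our scores are scaled to the range 0 to 1.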

Now we have everything we need. Let’s create a single neuron in order to predict our test score.

The first layer of neurons represents all input values. Then each neuron on the following layer (the hidden layer) takes the sum of all the neurons on the previous layer, each multiplied by the weight that connects it to that neuron on the hidden layer.

First, multiply the neurons by the weights and find the sum.

Forward Propagation

The value of each output neuron can be calculated as follows:

yhat = sigmoid(x1*w1 + x2*w2 + b)

I am feeding 3 hrs of sleep and 5 hrs of study into our neural network.
But first, we need to account for the differences in the units of our data. Both of our inputs are in hours, but our output is a test score between 0 and 100. Neural networks are smart, but not smart enough to guess the units of our data, which is why we need to convert these values to the same scale.

Sleep = 0.3 | Study = 0.5 (input values) || Score (target value) = 0.75
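The forward pass for one neuron can be sketched as below. The function name forward and the random initialisation are my assumptions, so the printed value will not match the number shown in this article; it will simply be some value between 0 and 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x1, x2, w1, w2, b):
    """One neuron: weighted sum of the inputs plus bias, then sigmoid."""
    z = x1 * w1 + x2 * w2 + b
    return sigmoid(z)

# Random starting parameters (assumption: standard normal draws)
w1, w2, b = np.random.randn(3)

sleep, study = 0.3, 0.5   # 3 h of sleep and 5 h of study, divided by 10
yhat = forward(sleep, study, w1, w2, b)
print(yhat)               # a first, essentially random, prediction
```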

0.2506244358881565

3 hours of sleep and 5 hours of study should give an output of 75, but our computer is telling us we will score 25. That is fine, considering this is just a random guess.

So first we need to teach the computer how wrong the result is.

We can do this with the help of a cost function, just like Raj did by telling Perceptron each time how far below or above the target his score was.
There are many available cost functions, and the nature of our problem should dictate our choice. In this example, I am using a simple sum of squares error as our cost function.

The sum of squares error takes the difference between each predicted value (yhat) and the actual value (target_value) and squares it. Squaring makes every error positive, so errors in opposite directions cannot cancel out, and it penalizes large errors more than small ones.
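For a single prediction, the squared error can be sketched like this. The function name cost is my own; the two numbers are the prediction and the scaled target from this example.

```python
def cost(prediction, target):
    """Squared error for a single prediction."""
    return (prediction - target) ** 2

prediction = 0.2506244358881565  # the network's first random guess
target = 0.75                    # the scaled target score
print(cost(prediction, target))  # roughly 0.2494
```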

0.2493759540320219

So we get a cost of about 0.249, and we have to reduce it.

Now let’s understand how we can reduce the cost. There are a few ways, but one of the most popular and efficient is gradient descent.

Let’s recall differentiation from calculus class.

Differentiation is the action of computing a derivative. The derivative of a function y = f(x) of a variable x is a measure of the rate at which the value y of the function changes with respect to the change of the variable x. It is called the derivative of f with respect to x.

How does this work?

You start with an initial weight (a random weight w1). At this point, the gradient descent algorithm calculates the gradient of the loss curve at the starting point, which is the derivative (slope) of the curve. The sign of the slope gives you the direction of your next step: you move against it, downhill. (Read this great article for better understanding)

You repeat this process incrementally, one step at a time, reducing the cost function until your algorithm converges to a minimum.

1- Gradient descent first picks a random value of x.
2- It then updates x iteratively until it reaches convergence, following this equation:

x_new = x - learning_rate * f'(x)
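Here is a minimal sketch of those two steps on a toy function. The function f(x) = (x - 3)², the starting point and the learning rate are all illustrative choices of mine, not part of the exam example.

```python
def f(x):
    return (x - 3) ** 2        # toy function with its minimum at x = 3

def f_prime(x):
    return 2 * (x - 3)         # derivative of f

x = 10.0                       # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    x = x - learning_rate * f_prime(x)   # step against the slope

print(x)                       # very close to 3
```

Each step shrinks the distance to the minimum, because the update always moves x in the direction where f decreases.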

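So if we find the derivative of our cost function with respect to the prediction yhat:

d/dyhat (yhat - y)² = 2 * (yhat - y) | (the simple power rule: the derivative of x² is 2x)

Prediction = 0.25 | target_value = 0.75, so the derivative is 2 * (0.25 - 0.75) = -1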

Learning Rate

Learning rate (or step size) is a user-specified value, usually a small positive number (values between 0.001 and 1 are common). It determines how strongly the weights respond to an error. If the learning rate is too high, the changes in the weights are too dramatic and the updates can overshoot the minimum; if it is too low, learning becomes very slow.

Train the network
Training our neural network means learning the values of our parameters. Our random prediction = 0.25 | Target value = 0.75
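The numbers listed below are consistent with applying the gradient descent update directly to the prediction for 20 steps with a learning rate of 0.1. Both of those values are my inference from the listed outputs, not something stated in the article. A sketch under that assumption:

```python
prediction = 0.2506244358881565   # starting guess
target = 0.75
learning_rate = 0.1               # inferred from the listed outputs

history = []
for step in range(20):
    # derivative of (prediction - target)**2 with respect to prediction
    cost_derivative = 2 * (prediction - target)
    prediction = prediction - learning_rate * cost_derivative
    history.append(prediction)
    print(prediction)
```

With this learning rate each step closes 20% of the remaining gap, so the prediction approaches 0.75 quickly at first and then more slowly.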

0.35049954871052524
0.4303996389684202
0.49431971117473616
0.545455768939789
0.5863646151518311
0.6190916921214649
0.6452733536971719
0.6662186829577375
0.6829749463661899
0.696379957092952
0.7071039656743616
0.7156831725394893
0.7225465380315914
0.7280372304252731
0.7324297843402185
0.7359438274721748
0.7387550619777399
0.7410040495821919
0.7428032396657536
0.7442425917326029

Error = predicted value - target value => 0.744 - 0.75 = -0.006

Ideally, we want our cost to be zero, that is, no difference between the estimated and the expected value.

As you can see, the prediction moves closer to the target with every step, so our cost is decreasing. Our goal is to find the set of weights and biases that minimizes the cost function.

Now let’s apply all of this to our study and sleep data and predict the test score.

Building a Neural Network 🧠

Let’s wrap up everything…

Now that we know how a single neuron works, we can connect them together to form a network in the form of layers.

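A simple neural network could have an input layer with 2 nodes, since our data has two features (X1 and X2), 1 hidden layer with 3 neurons, and an output layer with one node. To keep the code short, the model we train in the steps that follow is the simplest case: a single neuron with two inputs, two weights (w1 and w2) and one bias (b).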

We need to do the below steps to build our Neural Network model.

Step 1: Initialise the model’s parameters

Perform the below steps in a loop until we get minimum cost/optimal parameters

Step 2: Implement forward propagation
Step 3: Compute loss
Step 4: Implement backward propagation to get the gradients
Step 5: Update parameters

Then we merge all the above steps into one function called train().
Once we have built train() and learned the right values of w1, w2, and b, we can make predictions on new data.

Step 1: Initialise the model’s parameters

First, let’s assign some random values to w1, w2 and b.

An iteration is one pass of a batch of data through the algorithm. In this example, one batch is a single pair of numbers such as [0.3, 0.5].

Now let’s extract each batch from our data one by one and feed it into the network.

Normalization: if features come in different value ranges, it is important to normalize them, for example by dividing each dimension by its standard deviation. In this example the scaling is simpler: the values below are consistent with dividing the hours by 10 and the scores by 100.

After normalization, this is our study and sleep hours input data:

x= [[0.5, 0.2],
[1, 0.4],
[1.5, 0.5 ],
[2, 0.6],
[3, 1 ]]

Test Score

y= [0.65, 0.75, 0.85,0.90,0.96]
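The scaled arrays above are consistent with dividing the raw hours by 10 and the raw scores by 100 (for example, 5 hours of study and 2 hours of sleep become [0.5, 0.2]). That divisor is my reading of the numbers, not something the article states. A sketch with NumPy:

```python
import numpy as np

# Raw data: [study hours, sleep hours] per student, and scores out of 100
raw_x = np.array([[5, 2], [10, 4], [15, 5], [20, 6], [30, 10]], dtype=float)
raw_y = np.array([65, 75, 85, 90, 96], dtype=float)

x = raw_x / 10.0    # hours scaled down by a factor of 10 (assumed)
y = raw_y / 100.0   # scores scaled into the range 0..1

print(x[0])         # first student: [0.5 0.2]
print(y)
```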

Our input data X

If we want to use the 1st input set, we pass x[0], which gives the value [0.5, 0.2].

Step 2: Implement forward propagation

i = i % len(x) makes sure our index stays in range.

Step 3: Compute loss

We want to see how good our predicted result is compared to the desired output.

We evaluate the network with an error function (mean squared error is a common choice) to see how well we are doing with our current weight values.

Now we find the derivative of the cost function with respect to our weights.

cost_derivative_pred = derivative of the cost function with respect to prediction

Step 4: Implement backward propagation to get the gradients

Once we have calculated the cost, we start from the output layer, comparing the desired results with the predicted results, and then trace back one layer at a time. The weights can be adjusted by various methods, of which gradient descent is the most popular.

Here we need the chain rule to find the derivatives, because the cost depends on each parameter (w1, w2, b) only indirectly: the parameters determine the weighted sum, the weighted sum determines the prediction, and the prediction determines the cost.

For each layer, we need the derivative of the error with respect to its input because it’s going to be the derivative of the error with respect to the previous layer’s output. This is very important, it’s the key to understanding backpropagation!

How backpropagation works technically is outside the scope of this tutorial, but here’s the best source I’ve found for understanding it: here

cost_derivative_w1 = how much the cost changes when w1 is changed
cost_derivative_w2 = how much the cost changes when w2 is changed
cost_derivative_b = how much the cost changes when b is changed

Now we have cost_derivative_w1, cost_derivative_w2 and cost_derivative_b. With the help of these values we will update w1, w2 and b and reduce the cost.
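For a single sigmoid neuron with a squared-error cost, the chain rule can be sketched as follows. The function name gradients is my own, and the intermediate names other than cost_derivative_pred, cost_derivative_w1, cost_derivative_w2 and cost_derivative_b are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(x1, x2, target, w1, w2, b):
    """Derivatives of the squared-error cost with respect to w1, w2 and b."""
    z = x1 * w1 + x2 * w2 + b
    pred = sigmoid(z)

    cost_derivative_pred = 2 * (pred - target)   # d cost / d pred
    pred_derivative_z = pred * (1 - pred)        # d pred / d z (sigmoid slope)
    cost_derivative_z = cost_derivative_pred * pred_derivative_z

    # z depends on w1 through x1, on w2 through x2, and on b through 1
    cost_derivative_w1 = cost_derivative_z * x1
    cost_derivative_w2 = cost_derivative_z * x2
    cost_derivative_b = cost_derivative_z * 1
    return cost_derivative_w1, cost_derivative_w2, cost_derivative_b

print(gradients(0.5, 0.2, 0.65, w1=0.0, w2=0.0, b=0.0))
```

All three gradients come out negative here, which says that increasing w1, w2 and b would lower the cost for this sample.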

The network compares its own prediction to the correct label. This tells it how right or wrong it was.

Now that we have spread this information back, we can adjust the weights of the connections between neurons. What we are doing is pushing the loss as close as possible to zero for the next time we use the network for a prediction.

Step 5: Learning-Update parameters

Once we have computed our gradients, we multiply them by the learning rate and subtract the result from the current parameters, so that the network guesses a little more correctly next time.

When the actual result is different than the expected result then the weights applied to neurons are updated.

Once we are done adjusting the weights and have reached the input layer, we run forward propagation again.

We need to repeat these steps over multiple epochs until our cost is minimal.

Now let’s train our final model by running the function train() over 10000 epochs and see the results.
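Putting steps 1 to 5 together, a train() function for a single neuron could look like the sketch below. The signature, the learning rate of 0.1 and the per-sample update order are my assumptions, so the learned values will not match the article’s numbers exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x, y, w1, w2, b, learning_rate=0.1, iterations=10000):
    """Train one neuron with per-sample gradient descent."""
    for i in range(iterations):
        idx = i % len(x)                   # keep the index in range
        x1, x2 = x[idx]
        target = y[idx]

        # Step 2: forward propagation
        pred = sigmoid(x1 * w1 + x2 * w2 + b)

        # Step 3: the loss is (pred - target)**2, so its derivative is
        cost_derivative_pred = 2 * (pred - target)

        # Step 4: backward propagation via the chain rule
        cost_derivative_z = cost_derivative_pred * pred * (1 - pred)

        # Step 5: update the parameters
        w1 = w1 - learning_rate * cost_derivative_z * x1
        w2 = w2 - learning_rate * cost_derivative_z * x2
        b = b - learning_rate * cost_derivative_z
    return w1, w2, b

x = [[0.5, 0.2], [1, 0.4], [1.5, 0.5], [2, 0.6], [3, 1]]
y = [0.65, 0.75, 0.85, 0.90, 0.96]

# Step 1: random starting parameters, then train
w1, w2, b = train(x, y, *np.random.randn(3))
print(w1, w2, b)
```

Because the starting values are random, each run ends with slightly different parameters, but the predictions they produce should be similar.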

After it has done this process 10,000 times, we can check how well our network has learned

We got the value of w1,w2, and b

print(w1)

1.106947390053921

print(w2)

0.7472646398217485

print(b)

-0.23103905546069953

Well done! You just made your first neural network. Now you can predict your test score with the help of this model.

let’s try.

Predictions
Suppose you study for 6 hrs and sleep for 4 hrs.

0.695830184548361

You will get a score of about 0.70, i.e. roughly 70 out of 100.
Let’s check our model with Raj’s data.

Number of hours of study  5
Number of hours of sleep  2
Actual score    ->  0.65
Predicted score -> 0.614

Number of hours of study  10
Number of hours of sleep  4
Actual score    ->  0.75
Predicted score -> 0.769

Number of hours of study  15
Number of hours of sleep  5
Actual score    ->  0.85
Predicted score -> 0.858

Number of hours of study  20
Number of hours of sleep  6
Actual score    ->  0.9
Predicted score -> 0.917

Number of hours of study  30
Number of hours of sleep  10
Actual score    ->  0.96
Predicted score -> 0.98
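A comparison like the one above can be sketched with the parameters printed earlier. The function name predict and the division of the hours by 10 are my assumptions. The results land close to the listing but may differ in the second or third decimal place, presumably because the listing came from a different training run.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Learned parameters as printed earlier in the article
w1 = 1.106947390053921
w2 = 0.7472646398217485
b = -0.23103905546069953

def predict(study_hours, sleep_hours):
    """Scale the raw hours by 1/10 (assumed) and run the forward pass."""
    x1, x2 = study_hours / 10.0, sleep_hours / 10.0
    return sigmoid(x1 * w1 + x2 * w2 + b)

# (study hours, sleep hours, actual scaled score) from Raj's data
raj_data = [(5, 2, 0.65), (10, 4, 0.75), (15, 5, 0.85), (20, 6, 0.90), (30, 10, 0.96)]
for study, sleep, actual in raj_data:
    print(study, sleep, actual, round(predict(study, sleep), 3))
```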

So What’s Next?

Fortunately, there is still much to learn about neural networks and deep learning. I’ll be writing more on these topics soon. Thanks for reading!

Please share this post on social media and give me some CLAPS👏 if you enjoyed it or found it useful. Please feel free to leave feedback in the comments so I know how to improve for the next post. All your feedback, questions and ideas🗣 are greatly appreciated, and feel free to contact me at sohel@praemineo.com