How Neural Networks and Backpropagation Work

Tomer Kordonsky
Published in Artificialis
18 min read · Aug 13, 2021

What Can Deep Learning Models Do, and What Can't They Do?

We all know the term "Artificial Intelligence". We have all heard about AI and about object detection. With every passing day we hear more and more about new AI programs entering our world: medicine, schools, the military, and even government. But what's so special about it? What makes this term so famous and so scary at the same time? Well, I'll tell you one thing: the name is just too intimidating. AI is just a dumb machine that isn't even that complicated or that scary. All it does is calculations, like y=mx+b, and that's it. Now, most of you are thinking to yourselves: "So maybe school does teach us something useful." Well, the answer is still no, since it's much more than y=mx+b. But this equation is the core of the field.

An object detection model's output (object detection is one type of deep learning model)

So what can y=mx+b predict?

This simple function can predict different diseases, insurance prices, house prices, and almost everything that has to do with numbers.

What can't y=mx+b predict?

Hold on, so we can predict anything, right? I mean, in the end, everything in the world can be converted into numbers, right? The answer is yes and no. Yes, anything in the world can be converted into numbers, but no, we can't predict everything. Some things in this magical world don't follow maths equations. Take the most famous idea: predicting the stock market. We could predict stocks if nothing ever happened, but it is never like that; real-world events can majorly influence the stock market, and they break the maths equation. We can't make reliable AI predictions while real-time events are taking place that can change the true results: maths can't predict whether the stock market will collapse tomorrow. Maths can't predict how and when you will die, because many events throughout your life can influence and change the outcome, producing different possibilities that no AI algorithm will ever be able to predict.

Now that you understand the general idea behind AI, we can start to dig deep into the core of it, into the maths behind this field, and start to create huge projects!!!

The Essence of Neural Networks

Before we dig into this, we need to understand the general theory behind the whole Deep Learning field.

Harvard Logo

Imagine that you’re a student at Harvard, and you want to know if you’re going to pass your finals or not according to your previous knowledge, successes, motivation, and characteristics as a university student at Harvard. You use your hacking skills to hack into Harvard’s database and manage to find yourself. Ahh….what did you find? Your ID, your name, your average grades, your classes names, (these are converted into numbers, somehow..) how smart you are, Wha….h-how? . Your previous knowledge that you gave them before enrolling in the university and how much did you study. H-How?! How do they know how much did you study?! This is scary af, Harvard administration are scary people with a lot of stalkers, HELP!! So you got your inputs, we are going to call them X, each input will have its own number, X₁=Your ID, X₂=Your name and so on. So in the end we have 7 inputs(X₁, X₂, X₃, X₄, X₅, X₆, X₇) and we want to predict if you’re going to pass your finals or not according to those inputs that you found. Remember that I told you that you can predict many things by y=mx+b? So let’s try it. Remember that y is what we want to predict, y will be either 0(Will not pass) or 1(Will pass). Let’s write the equation:

y = X₁ + X₂ + X₃ + X₄ + X₅ + X₆ + X₇

Wait! You told us that the typical equation is y=mx+b, not y=x!! I know, I know, hold up, calm down and keep reading. Now, where is the 'm'? Where is the slope? Well, let me introduce you to the world of weights. No, not your mass. In the world of Deep Learning, weights (or W) represent how important each X is. For example, your ID isn't important at all: who cares about it, and how is it going to help you predict whether you're going to pass or fail your exam? So its weight will be 0, since it's not going to help us at all! But X₇ (how much you studied) can influence a lot, so let's assign it a weight of 1. And so on for the other Xs.

In the end we get something like this:

y = X₁⋅0 + X₂⋅0 + X₃⋅0.3 + X₄⋅0.3 + X₅⋅0.5 + X₆⋅0.7 + X₇⋅1

But Tomer, where is b? Leave it for now; we will get to it later. So we got our first function!! But how does it help? Well, it doesn't help you just yet, so let's solve this function first. Let's say that X₁=0.31, X₂=0.67, X₃=0.87, X₄=0.8, X₅=0.8, X₆=0.7, X₇=0.9. Weird numbers, but yeah. I'll give you 15 seconds to solve it. If you got y=2.291, congratulations, you managed to pass your third-grade exam! Be proud of yourself! But wait, y needs to be 1 or 0, and this value is way over 1, so did I lie to you?!! Nahh, calm down. The value that you got is stored in a neuron; a neuron is a "node" (a variable) that stores the sum of your inputs ⋅ weights. Each neuron has its own activation function, and that activation function returns a value of 0 or 1: your output! For this example let's use the threshold activation function (if the value is greater than or equal to 1 it returns 1; if the value is lower than 1 it returns 0). Congratulations!!! You're going to pass the finals according to this equation!! You can finally calm down and relax… or not? Well, sometimes neural networks can be wrong, since real-world events can happen, so we can't be sure. GO BACK TO STUDYING!

A simple example of a neural network
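If you'd rather see that as code, here's a minimal sketch in plain Python (using the exact numbers from the example above) of the weighted sum plus the threshold activation:

```python
# A minimal sketch of the single "neuron" above: weighted sum + threshold.
inputs  = [0.31, 0.67, 0.87, 0.8, 0.8, 0.7, 0.9]  # X1..X7 from the example
weights = [0, 0, 0.3, 0.3, 0.5, 0.7, 1]           # how important each X is

# The neuron stores the sum of input * weight pairs
y = sum(x * w for x, w in zip(inputs, weights))   # ~2.291

# Threshold activation: 1 if the stored value is at least 1, otherwise 0
output = 1 if y >= 1 else 0
print(y, output)  # ~2.291 1
```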

The Neural Network

Woah! Hold up, that was too much; I just threw too many words at you (X₁, y, neuron, activation function, weights, and so on). Let's summarize all of it.

So why is it called a neural network, and not a weights network or something else? Why take a term from the medical field and use it for programming, why? Well, this simple neural network is very similar to the network of neurons in our brain.

Neural Network inside our brain

Our brain consists of billions of interconnected neurons, each one passing a message to another neuron, and at the end a specific action is performed. For example, you want to reach into your pocket to grab your wallet. A command is sent to your brain, the neurons dedicated to this command fire up, and the output of all of these interconnections is a command to move your hand and grab your wallet.

Does it sound familiar by any chance? It gets an input (a command), it combines the input with its weights (fires up), and then the activation function returns the answer (moves your hand). Oh? It sounds the same. Well, it basically is the same thing, and this is how it got its name: from our brain!

Alright, enough talking about biological stuff, it's too much for my head and it's not interesting (for me). Let's explain the neural network.

The Input

  • Can also be called features, X, or parameters
  • Can be anything from numbers to text to images and even voice
  • The input itself must be converted into numbers because, as you saw, we are creating mathematical functions
  • Each feature is a new X, as we saw; there can be an unlimited number of features

Neuron

  • The neuron stores the sum of the inputs ⋅ weights

Activation Function

  • Gives an output of 1 or 0 based on its input

Alright Tomer, you still didn't tell us where the b is?! WHERE IS THE B?!?!! y=mx+b. Well, this is when I introduce you to a new name: bias. The bias input will always be equal to 1, and its weight plays the role of b in the y=mx+b function. The bias is our saviour when the weighted inputs alone can't give us the right answer: it shifts the function up or down so the neuron can still produce the right output. * The bias is treated as part of the input.
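To make the bias concrete, here's a tiny sketch with made-up numbers: without a bias, a neuron whose inputs are all 0 is stuck outputting 0, no matter what its weights are.

```python
# Why the bias matters: with all-zero inputs, the weighted sum is stuck at 0.
inputs  = [0, 0]
weights = [0.5, -0.3]

without_bias = sum(x * w for x, w in zip(inputs, weights))  # always 0
with_bias    = without_bias + 1 * 0.8   # bias input 1, bias weight 0.8
print(without_bias, with_bias)          # 0.0 0.8
```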

Alright Tomer, you told us that y=mx+b can predict many things, but it can only separate points with a single line. It can create a linear line; that's it, just a line. But what if we need to predict the XOR function, a pretty simple function that can't be classified by one line, only by two!

XOR Problem

How can a neural network solve real-life problems but not an XOR problem?! WHAT AREN'T YOU TELLING US!! Well, the real answer is that we don't use only one equation (one perceptron): we use many. For the XOR problem, two perceptrons are enough to create two lines, and this is why it is called a "deep" neural network, because later in my blogs you will see that we use millions of perceptrons to solve pretty simple problems.

Now, since there will be so many perceptrons, we are going to use layers. There are 3 kinds of layers in a neural network: the Input Layer (the layer where you give your inputs), the Hidden Layer (the layer where all the magic happens: the activation functions and a lot of neurons!), and the Output Layer (the layer that gives us the output).
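Here's a sketch of the XOR idea in Python: a two-layer perceptron whose weights I hand-picked for illustration (in a real network they would be learned). Each hidden perceptron draws one "line", and the output perceptron combines them:

```python
def step(z):
    # Threshold activation: fires (1) when the weighted sum reaches 0
    return 1 if z >= 0 else 0

def xor(x1, x2):
    # Hidden layer: two perceptrons, each drawing one line
    h1 = step(x1 + x2 - 0.5)      # behaves like OR
    h2 = step(-x1 - x2 + 1.5)     # behaves like NAND
    # Output layer: combine the two lines
    return step(h1 + h2 - 1.5)    # behaves like AND

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor(a, b))  # 0, 1, 1, 0
```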

Deep Neural Network

The picture above represents a simple neural network architecture:

  • 3 features (three neurons in the Input Layer)
  • 24 activation functions (8 neurons in each hidden layer, and there are 3 hidden layers)
  • 4 outputs (each output neuron can output 1 or 0, so there can be 4 outputs: for example, whether the picture is a dog, a cat, a snake, or a cow)
  • The neural network is going to create 24 "lines", one per neuron in the hidden layers, but… but… together they can form a non-linear boundary, which means even a circle! Or something more complex than that! This is the beauty of maths with neural networks: we can create unreal "lines" out of y=mx+b!

So what did we learn?

  • A perceptron may be able to separate the inputs with a straight line
  • A single perceptron is not sufficient to model the XOR function
  • The solution to the XOR problem is to expand beyond the single-layer architecture by adding an additional layer of units known as a hidden layer (a processing layer). This kind of architecture is called a Multilayer Perceptron (MLP)
  • A neural network is a network with many inputs, several hidden layers with several units each, and an output layer at the end
  • The number of output units (output neurons) depends on the classes (outputs) we want to predict. Example: if we are predicting whether an image is a cat/dog/bird/cow, we would have 4 neurons at the output

Gradient Descent

Tomer, we understood you, we understood the idea and everything, but how does it work? How do we know the values of the weights that the neural network needs? I heard something about training a neural network… Yeah, training! What's training? Well, I'll tell you how, just wait, we need to cover everything first. Let's start with the loss function! Okay, another function, I'm out!! BYE!!! -_- Hold up there! This is a pretty simple function, nothing to worry about. It just tells us how wrong our function is. What? It just tells us how wrong our function is. WHAT!? Alright, alright, my mistake!

In the field of Deep Learning, we don't write the algorithm; we give input and output to the model. Wha? But if we know the output, what's the point? Well, welcome to training. Like in your Harvard classes: the professor teaches you the subject and then gives you homework. This is when you "train" (study), when you figure out how to solve those problems; you teach yourself how to solve those functions. You already have the answers in the textbook, but you need to show how to build the function, and this is when you start to train yourself. You have a base function, y=mx+b, and you start to play with the 'm', adjusting the numbers, adding more information to the function, until you reach the needed answer. It's the same with the weights: you change the weights until the function gives you the needed answer, or the closest one. The loss function tells us the distance between the real output that we provided at the start and the output that the function gave us. This means the lower the loss, the better; the higher, the worse. We will call the loss function J and the weights W.

A graph of the loss function

We can see that the loss function forms a parabola, which means the minimum point of the parabola is the best weight, since we want the lowest value of J (the distance between the real value and the function's value). One more thing: from now on, let's call the real value y and the function's value ŷ (y-hat). So we can all agree that to find the best weight we need to find the minimum point of the parabola, and then we have the best function! But what happens if we have multiple minimum points? This is when Gradient Descent comes to the rescue.

Gradient Descent is the standard algorithm used to minimize the error (loss function) with respect to the weights of the neural network.

The slope is also called the gradient. Now, to find the minimum point we want to minimize the error by changing the weight. If our slope is negative, we want to increase W, which will cause the loss function to decrease. And if the slope is positive, we want to decrease W, which will also cause the loss function to decrease.

And so we update the weight with this technique (also called the Weight Update Rule): W_new = W_old - η ⋅ ∂E/∂W, where E = the loss function and η = the learning rate.

Weight Update Rule

Woah! What’s η? Or Learning rate? Well if we will look at the function we can understand that it can increase and decrease the gradient descent. The learning rate decides how fast we update the weights, in other words, the step size of the update.

Examples of learning rates
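Here's a minimal sketch of gradient descent on a toy parabola-shaped loss, J(w) = (w - 3)²; the starting weight, learning rate, and step count are made-up values for illustration:

```python
def grad(w):
    # Derivative (slope/gradient) of the toy loss J(w) = (w - 3)**2
    return 2 * (w - 3)

w   = 0.0   # made-up starting weight
eta = 0.1   # learning rate: the step size of each update

for _ in range(50):
    w = w - eta * grad(w)  # Weight Update Rule: W_new = W_old - eta * dJ/dW
print(w)  # approaches 3.0, the minimum point of the parabola
```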

The Forward Propagation

So let’s combine everything and start to use real numbers!!

So the professor in the Data Science class gave you homework with this dataset:

Input:

  • X₁ = 0.5, X₂ = -0.5
  • X₁ = 0.3, X₂ = 0.4
  • X₁ = 0.7, X₂ = 0.9

And this output:

  • y₁ = 0.9, y₂ = 0.1
  • y₁ = 0.9, y₂ = 0.9
  • y₁ = 0.1, y₂ = 0.1

Your task is to create a neural network with the given dataset; for the input X₁ = 0.5, X₂ = -0.5 we want the output to be y₁ = 0.9, y₂ = 0.1.

So what have we got here?

  • Two neurons in the input layer (we have X₁ and X₂)
  • Two neurons in the hidden layer (one for each input neuron)
  • Two neurons in the output layer (we have y₁ and y₂)
  • Four biases (one for each neuron in the hidden and output layers)

And since this is our first network, we will initialize the weights randomly (later, in my future blogs, you will learn different techniques). The architecture of our model is going to look like this:

An example of our neural network

So let’s start with the first neuron in the hidden layer:

Our first step

Let's sum the products of the inputs and the weights to find out what the first neuron stores:

z₁ = First Neuron = ∑(Xᵢ ⋅ Wᵢ) = 0.5 ⋅ 0.1 + (-0.5) ⋅ (-0.2) + 1 ⋅ 0.01 = 0.16

Now let’s do it for the second neuron in the hidden layer:

Second step

z₂ = Second Neuron = ∑(Xᵢ ⋅ Wᵢ) = 0.5 ⋅ 0.3 + (-0.5) ⋅ 0.55 + 1 ⋅ (-0.02) = -0.145

Now, remember that each neuron has its own activation function? For this example we will use the sigmoid activation function:

Activation functions are usually represented as σ, and the value passed into them is represented as z
The sigmoid activation function
  • The activation function scales the net input value to a value between 0 and 1.

σ₁ = 1 / (1 + e^(-z)) = 1 / (1 + e^(-0.16)) = 0.5399

σ₂ = 1 / (1 + e^(-z)) = 1 / (1 + e^(0.145)) = 0.4638

Now, the hidden neuron’s output becomes the input to the next neuron:

Third Step

ŷ₁ = Output Neuron = ∑(Xᵢ ⋅ Wᵢ) = 0.5399 ⋅ 0.37 + 0.4638 ⋅ 0.9 + 1 ⋅ 0.31 = 0.9271

Fourth Step

ŷ₂ = Output Neuron = ∑(Xᵢ ⋅ Wᵢ) = 0.5399 ⋅ (-0.22) + 0.4638 ⋅ (-0.12) + 1 ⋅ 0.27 = 0.0955

And now, again, another activation function for each of the output neurons:

σ₃ = 1 / (1 + e^(-z)) = 1 / (1 + e^(-0.9271)) = 0.7164

σ₄ = 1 / (1 + e^(-z)) = 1 / (1 + e^(-0.0955)) = 0.5238
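Here's the whole forward pass as a short Python sketch that reproduces the numbers we just computed by hand (same inputs, same randomly initialized weights from the diagram):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x1, x2 = 0.5, -0.5  # first sample's inputs

# Hidden layer: weighted sum of the inputs plus the bias term
z1 = x1 * 0.1 + x2 * (-0.2) + 1 * 0.01     #  0.16
z2 = x1 * 0.3 + x2 * 0.55   + 1 * (-0.02)  # -0.145
s1, s2 = sigmoid(z1), sigmoid(z2)          # ~0.5399, ~0.4638

# Output layer: the hidden activations become the inputs
y1_hat = sigmoid(s1 * 0.37    + s2 * 0.9     + 1 * 0.31)  # ~0.7164
y2_hat = sigmoid(s1 * (-0.22) + s2 * (-0.12) + 1 * 0.27)  # ~0.5238
print(y1_hat, y2_hat)
```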

But Tomer, the outputs that we got aren't the outputs that we want… what about the gradient descent that you told us about! We need to change the weights! Indeed. Now that we have finished our forward propagation, we can start our backpropagation, the most important thing in the neural network!

The Backpropagation

Backpropagation is a method to train the neural network by adjusting the weights of the neurons, with the purpose of reducing the output error (the loss function, also called the cost function). We are going to use the Mean Squared Error (MSE) as our loss function.
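As a quick sketch, here's the MSE written with the ½ factor that makes its derivative clean (scaling conventions differ between texts; this is the one that matches the derivative we use below). Plugging in our forward-pass outputs gives the network's current error:

```python
def mse(y_true, y_pred):
    # One-half the sum of squared differences between targets and outputs
    return 0.5 * sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

# Targets for the first sample vs. what forward propagation gave us
print(mse([0.9, 0.1], [0.7164, 0.5238]))  # ~0.1067
```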

We are going to find the derivative of the error (loss function) with respect to our output, then the derivative of the output with respect to its net input value, and finally the derivative of the net (neuron) input value with respect to the weight.

What we are going to do

For example, let’s say that:

y = z + 2

z = w + 4

Our objective is to find the derivative of y with respect to w. If we find the derivative of y with respect to z and multiply it by the derivative of z with respect to w, we end up with the derivative of y with respect to w, because the dz terms cancel out: dy/dw = (dy/dz) ⋅ (dz/dw) = 1 ⋅ 1 = 1.

What we are going to do
What we will get

We can do the same thing with our neural network.

  • Find the derivative of our error with respect to our output
  • Find the derivative of our output with respect to our net’s input value
  • Find the derivative of our net-input value with respect to our weight.
What we need to do
What we will get (Chain Rule): ∂E/∂W = (∂E/∂ŷ) ⋅ (∂ŷ/∂z) ⋅ (∂z/∂W)

What we just did is called the Chain Rule; so basically, the Chain Rule rules this neural network (Haha! Not funny! Nerd!)

Let’s start with finding the first term!

The derivative of our error with respect to our output

E = the error function, or in other words, the error.

We need to find how much the error changes with respect to the output.

The Mean Squared Error equation: E = ½ ∑ⱼ₌₁ᴾ (yⱼ - ŷⱼ)², where P = the number of neurons in the output layer
We can write it like that, where yⱼ is the real value and ŷⱼ is what our network got. Since we have two output neurons, both of them appear in the sum.

First of all, the term for the second output neuron will be zero: it does not contain ŷ₁, and since we are differentiating with respect to ŷ₁ (the output of the first neuron), the derivative of any term that does not contain ŷ₁ is zero. So we can cancel it out.

What we canceled

When you differentiate this term, the inner derivative of (y₁ - ŷ₁) with respect to ŷ₁ is -1, which is where the negative sign comes from.

This is what you need to get in the end: eᵏⱼ = -(yⱼ - ŷⱼ), where e stands for the derivative

After a bit of calculus, it's time to plug in the numbers.

eᵏⱼ = -(0.9 - 0.7164) = -0.1836

Now that we finished the first term, it’s time for the second term!

The derivative of our output with respect to our net’s input value

We need to find how much the output changes with respect to the net input.

S = Activation function = Sigmoid function

Sigmoid Function

The derivative is the output of the sigmoid function times (1 minus the output of the sigmoid function): σ′(z) = σ(z) ⋅ (1 - σ(z))

This is what you need to get in the end

After a bit of calculus, it's time to plug in the numbers.

eᵏⱼ = 0.7164 ⋅ (1 - 0.7164) = 0.2031

Finally, it's time for the third term!

The derivative of our net-input value with respect to our weight

We need to find how much the net input changes with respect to the weight.

The sum of the weights and the activation function’s result
This is what you need to get in the end

eᵏⱼ = 0.5399

It's time to multiply everything!!

Gradient = -0.1836 ⋅ 0.2031 ⋅ 0.5399 ≈ -0.0201

This is what we need to get after everything!

Now it's time to update the weight for the neuron. All of these maths equations just to update something called a weight….. why!? WHY?! Well, welcome to the dark side of AI :D

Weight Update Rule

The professor told you that the learning rate is 1.2, so η will be equal to 1.2.

W = 0.37 - 1.2 ⋅ (-0.0201) = 0.3941
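Here's the same chain-rule computation as a few lines of Python, so you can check each intermediate value against the hand calculation above:

```python
y1, y1_hat, s1, eta = 0.9, 0.7164, 0.5399, 1.2

dE_dyhat = -(y1 - y1_hat)         # -0.1836  first term: derivative of the error
dyhat_dz = y1_hat * (1 - y1_hat)  # ~0.2031  second term: sigmoid derivative
dz_dw    = s1                     #  0.5399  third term: the neuron's input

gradient = dE_dyhat * dyhat_dz * dz_dw  # ~ -0.0201
w_new = 0.37 - eta * gradient           # ~  0.3941
print(gradient, w_new)
```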

Now it's time to adjust the hidden layer neuron's weight! This time the neuron has two paths: it affects both the first and the second output neuron, so we need to sum the error contributions of both output neurons when computing its gradient.

Remember the chain rule equation with the sum? Now P equals 2, since there are 2 neurons in the output layer.

Our backpropagation equation
  • The derivative of our net input with respect to the first weight is just the input, which means it is equal to 0.5. Z₁ = X₁(0.1) + X₂(-0.2) + bias
First Step
  • The derivative of the sigmoid activation function with respect to the net input: 0.5399 ⋅ (1 - 0.5399) = 0.2484
Sigmoid for Z₁
Second step
  • Now we need to sum up the derivatives coming from the two output neurons.
Chain Rule

ŷ₁ = 0.5399(0.37) + 0.4638(0.9) + 0.31, so the derivative of ŷ₁'s net input with respect to our hidden neuron's output (0.5399) is its connecting weight, 0.37

ŷ₂ = 0.5399(-0.22) + 0.4638(-0.12) + 0.27, so the derivative of ŷ₂'s net input with respect to it is -0.22

Now we sum the contributions from both output neurons. For the first output neuron the term is (-0.1836 ⋅ 0.2031) ⋅ 0.37 ≈ -0.0138, and for the second it is (-(0.1 - 0.5238) ⋅ 0.2494) ⋅ (-0.22) ≈ -0.0233, where 0.2494 is the second output's sigmoid derivative, 0.5238 ⋅ (1 - 0.5238). Summing gives ≈ -0.0370, and after multiplying everything we get:

Gradient = -0.0370 ⋅ 0.2484 ⋅ 0.5 = -0.0045954

Now back to the Weight Update Rule

Weight Update Rule

W = 0.1 - 1.2 ⋅ (-0.0045954) = 0.1055
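And here's the hidden-layer update as a sketch, with the sum over both output neurons made explicit (0.4238 comes from -(0.1 - 0.5238), and 0.2494 is the second output's sigmoid derivative):

```python
# Error flowing back from each output neuron: dE/dy_hat * sigmoid derivative
delta_out = [-(0.9 - 0.7164) * 0.7164 * (1 - 0.7164),   # output neuron 1
             -(0.1 - 0.5238) * 0.5238 * (1 - 0.5238)]   # output neuron 2
w_from_h1 = [0.37, -0.22]  # weights connecting hidden neuron 1 to each output

# Sum over both paths, then apply this neuron's sigmoid derivative and input
summed   = sum(d * w for d, w in zip(delta_out, w_from_h1))  # ~ -0.0370
gradient = summed * 0.2484 * 0.5                             # ~ -0.0046
w_new = 0.1 - 1.2 * gradient                                 # ~  0.1055
print(w_new)
```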

Now, if you still didn’t understand anything in the backpropagation, please visit this site.

A similar procedure happens for all the other neurons. We only updated the weights for the upper neurons; now it's your turn to update the weights for the other neurons. I'll give you a few hours.

This is what you need to get

Remember, we did all of this only for the first sample! We need to proceed, of course, to the second sample and the third one. Therefore, we're going to have three iterations to complete one epoch: we take the second sample, then the third sample, and then we're finished with one epoch.

  • Epoch: one complete pass through all the samples.

After repeating that for many epochs (e.g., 25), our neural network is expected to reach the minimum error point and can be considered trained.
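To tie everything together, here's a hedged sketch of the full training loop: forward propagation, backpropagation, and weight updates over all three samples (one epoch), repeated for 25 epochs. It starts from the same weights as our diagram; the loop structure and variable names are my own illustration, not the only way to write it:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hidden and output neurons: [weight for input 1, weight for input 2, bias weight]
W_h = [[0.1, -0.2, 0.01], [0.3, 0.55, -0.02]]
W_o = [[0.37, 0.9, 0.31], [-0.22, -0.12, 0.27]]
eta = 1.2

samples = [([0.5, -0.5], [0.9, 0.1]),
           ([0.3,  0.4], [0.9, 0.9]),
           ([0.7,  0.9], [0.1, 0.1])]

for epoch in range(25):                 # one epoch = one pass over all samples
    for x, y in samples:                # three iterations per epoch
        # Forward propagation
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W_h]
        o = [sigmoid(w[0] * h[0] + w[1] * h[1] + w[2]) for w in W_o]
        # Backpropagation: delta = dE/dy_hat * sigmoid derivative
        d_o = [-(y[k] - o[k]) * o[k] * (1 - o[k]) for k in range(2)]
        d_h = [sum(d_o[k] * W_o[k][j] for k in range(2)) * h[j] * (1 - h[j])
               for j in range(2)]
        # Weight Update Rule: w = w - eta * gradient
        for k in range(2):
            for j in range(2):
                W_o[k][j] -= eta * d_o[k] * h[j]
            W_o[k][2] -= eta * d_o[k]   # bias input is 1
        for j in range(2):
            for i in range(2):
                W_h[j][i] -= eta * d_h[j] * x[i]
            W_h[j][2] -= eta * d_h[j]

# After training, compare each target against the network's outputs
for x, y in samples:
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W_h]
    print(y, [round(sigmoid(w[0] * h[0] + w[1] * h[1] + w[2]), 3) for w in W_o])
```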

After doing the professor's homework, not sleeping all night, and falling asleep on your notebook

Good morning! Are you ready for the next class?! You learned a lot! You learned the theory and the maths behind a neural network, and now you can cheat on your maths homework without needing to think for yourself; let the neural network do it! You did a great job!! The professor himself gave you an A+ for that assignment, and I'm proud of you! The next class will be about different loss functions. I can't wait to see you!

What now?

You can stick with me, as I’ll publish more and more blogs, guides and tutorials.
Until the next one, have a great day!
Tomer

Where to find me:
Artificialis: our Discord community server, full of AI enthusiasts and professionals
Newsletter: weekly updates on my work, news in the world of AI, tutorials, and more!
Our Medium publication: artificial intelligence, health, life. Blogs, articles, tutorials, all in one.

Don’t forget to give us your 👏 !
