Convolutional Neural Networks for Deep Learning pt. 2

Dnstoichkov · Published in Accedia · Mar 13, 2024

In the previous post, we uncovered the inner workings of the forward pass of a convolutional neural network. This included scanning through the pixels of an image, extracting its different characteristics, stripping away any unnecessary information, computing the neurons’ weighted sums, and making a prediction. As mentioned, because the weights start out random, that initial prediction is essentially random as well.
In this blogpost we will demystify the actual “magic” behind the learning process of a CNN. The word “magic” is written in quotes because the backpropagation process might seem like something extraordinary, but it is mostly calculus and algebra that tweaks the neurons’ weights and biases so that the network can become more accurate.

Summing up all the steps of backpropagation

The process of backpropagation begins with calculating the loss of the result from the forward propagation (or, in other words, quantifying how badly the neural network made its prediction). Then we apply the chain rule from calculus together with an optimization algorithm (which in our case would be Gradient Descent), which reduces the value of the loss by tweaking the weights and biases of each neuron.
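Here is a minimal NumPy sketch of that whole loop for a single sigmoid neuron with one input; the numbers and variable names are made up purely for illustration:

```python
import numpy as np

# One training step for a single sigmoid neuron: forward pass,
# loss, chain rule, and a gradient descent update.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 0.5, 1.0      # one input and the expected output
w, b = 0.1, 0.0           # starting weight and bias
learning_rate = 0.1

# Forward pass
prediction = sigmoid(w * x + b)
loss = (target - prediction) ** 2           # squared error for one example

# Backward pass: the chain rule gives d(loss)/dw and d(loss)/db
dloss_dpred = -2 * (target - prediction)
dpred_dz = prediction * (1 - prediction)    # derivative of the sigmoid
dloss_dw = dloss_dpred * dpred_dz * x
dloss_db = dloss_dpred * dpred_dz

# Gradient descent: nudge the parameters against the gradient
w -= learning_rate * dloss_dw
b -= learning_rate * dloss_db
```

Each of these pieces (the bias, the loss, the chain rule, the update rule) is unpacked in the sections below.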

The bias term

Quick FYI, in the previous blogpost the term “bias” was purposely not mentioned in order to reduce complexity, but now we will uncover it.

Just like the weights, the bias is a real, learnable value and can be any number. It is added to the weighted sum before applying an activation function like the Sigmoid squishification function (if you remember it from the previous blogpost). The bias is used for offsetting the weighted sum, so that you can define at what level the weighted sum needs to be for the neuron to fire (or activate). For example, you might not want the neuron to fire as soon as the weighted sum is greater than 0; you might want it to fire only once the weighted sum exceeds, say, 50.
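A tiny sketch of that idea (the numbers here are arbitrary examples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.9, 0.3, 0.5])
weights = np.array([0.4, 0.8, 0.2])
weighted_sum = np.dot(weights, inputs)      # 0.7

# With no bias, the activation is already well above 0.5:
print(sigmoid(weighted_sum))                # ~0.67

# A strongly negative bias offsets the sum, so the same neuron
# now needs a much larger weighted sum before it "fires":
bias = -50.0
print(sigmoid(weighted_sum + bias))         # ~0.0
```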

There are different initialization techniques for setting the initial value of the bias and the weights, like zero initialization where you set all values to zero, or random initialization where you set all values to small random numbers. There’s also the Xavier/Glorot Initialization, which is commonly used, along with other types of initializations.
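A quick sketch of these three techniques in NumPy (the layer sizes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 128   # example layer sizes

# Zero initialization: every value starts at 0. Common for biases,
# but problematic for weights, since all neurons would then learn
# exactly the same thing.
w_zero = np.zeros((fan_in, fan_out))

# Random initialization: small random numbers break that symmetry.
w_random = rng.normal(0.0, 0.01, size=(fan_in, fan_out))

# Xavier/Glorot initialization: the range is scaled to the layer sizes,
# which keeps signals from shrinking or exploding from layer to layer.
limit = np.sqrt(6.0 / (fan_in + fan_out))
w_xavier = rng.uniform(-limit, limit, size=(fan_in, fan_out))
```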

Calculating the loss

As mentioned earlier, the loss function is used for calculating exactly how far the result of the prediction is from the expected output. Looking up this term online you can find both “loss function” and “cost function”, and you might get confused about what they mean. Some people draw a distinction between the two terms, using “cost function” for the error over an entire training set, whereas “loss function” is the error for a single training example. Very often you will see both terms used interchangeably, so when you come across this, for example in a blogpost, read further to understand whether the author is using the term for a specific data point or in a more general manner. In this blogpost we will be using the term “loss function”, and we will focus on specific training examples.

Two of the most used loss functions are Mean Squared Error (MSE) and Cross-Entropy Loss. Mean Squared Error is mostly used in regression tasks, such as when the network predicts a price or any other continuous value, while Cross-Entropy Loss is used for classification tasks, like deciding whether there is a dog or a cat in an image. There are two separate types of Cross-Entropy Loss: Binary Cross-Entropy (BCE, a.k.a. Log Loss), used when the network must predict whether the result is or isn’t a given class, and Categorical Cross-Entropy (a.k.a. Softmax Loss), used when classifying among several predefined classes. Mean Squared Error is not typically used in image recognition, but we will still cover it.

Mean Squared Error

This is the formula for Mean Squared Error:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{y}_i)^2$$

It is called Mean Squared Error because it finds the average of a set of errors. In the formula above, 1 over n is 1 over the number of training examples, and it is multiplied by the sum of the squared differences across all training examples. The difference is calculated by subtracting the predicted output (y hat) from the actual result (Y), and the result is then squared. Squaring the difference between predicted and actual emphasizes and penalizes larger errors, so the larger the error is, the more it will be “exaggerated”. A larger error is a more serious indication that something is wrong with the predicted output.

It is not always necessary to multiply by 1/n, e.g. when you are training a model on a very small dataset, but most commonly that’s not the case, because lots of training examples are being used.
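The formula translates into just a couple of lines of NumPy (the values below are made-up example data):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # (1/n) * sum of squared differences between actual and predicted
    return np.mean((y_true - y_pred) ** 2)

actual = np.array([3.0, 5.0, 2.5])
predicted = np.array([2.5, 5.0, 4.0])
print(mean_squared_error(actual, predicted))  # (0.25 + 0 + 2.25) / 3 = ~0.833
```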

Cross-Entropy Loss

As mentioned, the other commonly used loss function, mainly for classification problems, is Cross-Entropy Loss.

For each class, we sum up the actual probability of the class times the logarithm of the predicted probability:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Just FYI, when we know which is the correct class, we just set its probability to 1 and we set the probabilities of all other classes to 0. We then add the minus sign in front of the equation because the sum itself would always be negative (the logarithm of a probability between 0 and 1 is negative), so we flip it to a positive number.
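In code this might look as follows; the three classes and the probabilities are invented for the example:

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # -sum over classes of: actual probability * log(predicted probability).
    # A tiny epsilon keeps log() away from log(0).
    eps = 1e-12
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])    # one-hot target: [dog, cat, bird], correct class is "cat"
y_pred = np.array([0.1, 0.7, 0.2])    # the network's predicted probabilities

print(cross_entropy(y_true, y_pred))  # -log(0.7) = ~0.357
```

Notice that only the correct class contributes to the sum, since the other actual probabilities are 0: the better the predicted probability for the correct class, the closer the loss is to 0.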

The goal is for the value calculated by the loss function to be as small as possible. This value can be decreased using an optimization algorithm, such as Gradient Descent, that “tweaks” each weight in the neural network so that we gradually lower the loss and end up with a network that can make successful predictions.

Gradient descent and other optimization algorithms

Once we know what the loss of the neural network is, it’s time to change the weights and biases of the individual neurons so that we minimize that loss. A way to do that is by using optimization algorithms such as gradient descent. As with the loss functions, there are several different variants of gradient descent, such as Stochastic Gradient Descent, Batch Gradient Descent, Mini-Batch Gradient Descent and others. Each variant has its own advantages and is suitable for different scenarios, so choosing the right algorithm depends on the nature of your data, the size of your dataset, the problem at hand and other factors. So, let’s look at the core idea that all of these variants share.

Gradient descent is an algorithm that finds an efficient way of reaching a minimum value of a function. In the case of deep learning, the function we want to minimize is the loss function. You can think of gradient descent as a person walking down a valley, who wants to get to the bottom in the most efficient way possible. In our case the valley is a representation of the loss function, where the bottom of the valley corresponds to a loss value very close or equal to zero.

The chain rule

When changing the individual weights you can’t just go around and tweak values randomly, because some values hold more power than others and that can throw things out of whack. Remember, the idea of backpropagation is to minimize the cost function, so sometimes an increase in one weight can minimize the cost function more than a decrease in another. That’s why we need to calculate the partial derivative of the cost function with respect to each weight:

$$\frac{d\,cost}{dw_i}$$

· d means derivative.

· cost is the cost function.

· w with subscript i is the weight of the i-th neuron.

What we basically want to find out here is the ratio between a really small change in the weight and the resulting change in the cost, so that we know whether this change of the weight would have a significant impact on the cost, or whether it would be negligible.
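You can see this “ratio” directly by nudging a weight a tiny bit and watching the cost, as in this toy one-weight example (everything here is invented for illustration):

```python
import numpy as np

def cost(w, x=0.5, target=1.0):
    prediction = 1.0 / (1.0 + np.exp(-(w * x)))  # sigmoid(w * x)
    return (target - prediction) ** 2

w = 0.37
h = 1e-6                                  # a really small change in the weight
gradient = (cost(w + h) - cost(w)) / h    # change in cost / change in weight
print(gradient)   # a large magnitude means this weight has a big impact
```

In practice the derivative is computed analytically with the chain rule rather than by nudging, but the meaning is the same.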

Gradient Descent formula

The typical formula for gradient descent is the following:

$$w_{new} = w_{old} - \alpha \cdot \frac{d\,cost}{dw_{old}}$$

Here we calculate the new weight by subtracting from the old weight alpha (the learning rate) times the derivative of the cost function with respect to the old weight.

So suppose we have a weight of, let’s say, 0.37, we’ve set the learning rate to 0.1, and the derivative of the cost function with respect to the current weight is 0.055. Then we would end up with 0.37 - 0.1 * 0.055, and the new weight would be 0.3645.
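Plugging those numbers into code:

```python
old_weight = 0.37
learning_rate = 0.1
gradient = 0.055    # d(cost)/d(weight) for this weight

new_weight = old_weight - learning_rate * gradient
print(new_weight)   # 0.3645
```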

What is “learning rate”?

The learning rate is a constant that you predefine yourself, and it controls how quickly gradient descent moves toward the minimum (in this case, how quickly it minimizes the cost function). You can’t set a very large value, because then the network would overshoot the minimum with each step and might never be able to reach it; if the value is too small, on the other hand, training becomes unnecessarily slow.
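Here is a toy demonstration of what “too large” looks like, minimizing the simple cost f(w) = w², whose derivative is 2w:

```python
def descend(learning_rate, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 2 * w   # gradient descent update for f(w) = w**2
    return w

print(descend(0.1))   # ~0.012: steadily walks down toward the minimum at 0
print(descend(1.1))   # ~38: every step overshoots, so w keeps growing instead
```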

Final Thoughts

You can use many different formulas and algorithms when performing gradient descent, but what we have described here is probably the most straightforward process. We started by calculating the cost function, quantifying how badly our network has performed; then we used the chain rule to calculate the derivatives of the cost function with respect to the weights. Finally, we updated the weights using the gradient descent algorithm. We also shed some light on what the bias term is and what the learning rate is. Keep in mind that depending on the situation you might want to use a different kind of gradient descent, a different cost function, etc., so you really need to gain knowledge and experience in the field of convolutional neural networks in order to make informed decisions. Also, nowadays there are libraries and frameworks that wrap up all of these algorithms, so that you can easily implement them in code and get a complete convolutional neural network working in about 10 lines of code.
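For instance, here is a sketch of a small CNN in Keras (assuming TensorFlow is installed; the layer sizes are arbitrary examples), where the loss function, the optimizer and the whole backpropagation machinery each take a single line:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

# The loss function and the gradient descent optimizer from this post:
model.compile(optimizer="sgd", loss="categorical_crossentropy")

# model.fit(x_train, y_train, epochs=5) would then run the forward pass
# and backpropagation for us, over and over, until the loss goes down.
```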
