Understanding Weight Update in Neural Networks

Simon Palma
7 min read · Apr 20, 2023



Introduction

Neural networks (NNs) can learn to recognize patterns and make predictions based on input data. At the heart of a NN are its weights: numerical values that determine the strength of connections between neurons. During training, the network adjusts these weights to minimize its error on a given task. This weight-update process is crucial to the accuracy of a NN’s predictions.

In this article, we’ll take a closer look at how weights are updated in an oversimplified NN. We’ll start by discussing the basic principles behind weight update, including the concept of gradient descent. We’ll then dive into the math behind weight update, explaining how the network uses the gradient of its loss function to adjust its weights.

Disclaimer

It’s worth noting that there are many resources available that cover the topic of weight update in NNs. From academic papers to blog posts to online courses, there is a wealth of information out there for anyone looking to learn more about this critical aspect of deep learning.

Also note that this article is primarily the outcome of a small milestone in my learning process, since I am seeking to deepen my understanding of various concepts in machine learning/deep learning. That said, I will be more than happy if readers find this resource useful and insightful.

I suspect that the added value of this article lies in the details around the calculation of the partial derivatives. In most of the resources I have read, the answers to the equations are provided rather than elaborated step by step.

Objective

At the heart of weight update in NNs is the desire to minimize a function that we refer to as the loss function or the cost function. Roughly speaking, this function measures the discrepancy between the network’s predictions and the true values of the output data. By reducing the value of the loss function, we can make the network’s predictions more similar to the true values.

For simplicity, we will use binary classification as the task the network is trained on. In a binary classification task, the loss function typically optimized is the cross-entropy function, since (combined with a sigmoid output) it yields a convex problem with a unique minimum. This function measures the difference between the predicted probability distribution of the network’s output and the true distribution of the data.

Cross-entropy formula: L(y, a) = −[ y·log(a) + (1 − y)·log(1 − a) ]

In the cross-entropy formula, y denotes the true value and a the predicted value. Also, for simplicity, we will consider a single true value y and, with it, a single prediction a; a single weight w; a single bias b; and a single input x of a single dimension. I will leave “more complex” networks for future articles.
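As a concrete sketch, the cross-entropy for a single true value and a single prediction can be computed like this (a minimal Python illustration of my own; the function name and the eps guard against log(0) are not from the article):

```python
import math

def cross_entropy(y, a, eps=1e-12):
    # Binary cross-entropy for a single true label y and prediction a.
    # eps keeps a away from exactly 0 or 1, where log would blow up.
    a = min(max(a, eps), 1 - eps)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))
```

For example, a confident correct prediction (y = 1, a = 0.9) gives a small loss, while an unsure one (y = 1, a = 0.5) gives a larger loss.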

(left) Usual representation of a NN | (right) Our oversimplified representation of a NN

Forward Pass

The simple version of the NN takes an input and computes a linear transformation on it. Then, the result of this linear transformation is passed through a non-linear function. In our simplified case, the output of this non-linear function is the value of the prediction. This process happens left to right when looking at the illustration, and we call it the forward pass.

Operations inside the simplified NN: z = w·x + b, then a = σ(z)
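In code, the forward pass of this one-weight network can be sketched as follows (a minimal Python illustration, assuming, as later in the article, a sigmoid non-linearity; the values are made up):

```python
import math

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1 / (1 + math.exp(-z))

def forward(x, w, b):
    z = w * x + b     # linear transformation
    a = sigmoid(z)    # non-linear function -> prediction
    return z, a

z, a = forward(x=2.0, w=0.5, b=0.1)
```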

Cost Calculation

Once the linear and non-linear operations have been applied to the input, we obtain a final value. This final value is then compared to the real value to obtain a measure of their difference. The more similar the real and predicted values, the smaller the value of the cost function; inversely, the more different they are, the greater its value.

Backward Pass

If the predicted value were very similar or equal to the real one, the network would be doing its job accurately and we could ideally stop there. However, this is rarely the case, especially when using non-simplified NNs. Thus, to reduce the distance between the predicted and real values, we must modify whatever is modifiable in our system. In our reduced example, these are the weight w and the bias b.

To update these values, the gradient descent algorithm is typically used. It aims to find the set of parameter values that minimizes the loss function.

To update the weights, we first compute the gradient of the loss function with respect to the weight.

Gradient of the loss with respect to the weight: ∂L/∂w

The gradient of the loss with respect to the weight and with respect to the bias cannot be computed directly. However, it is possible to do this calculation step by step using the chain rule. This process happens from right to left, and we call it the backward pass.
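Written out with the chain rule, the decomposition we will follow is (my own notation, consistent with the formulas below):

```latex
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w},
\qquad
\frac{\partial L}{\partial b}
  = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial b}
```

Since z = w·x + b, the last factors are simply ∂z/∂w = x and ∂z/∂b = 1, so the real work lies in the first two factors.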

First, we compute the gradient of the loss with respect to the prediction, since the cost is a function of the predicted value; in other words, there is a direct relationship between them.

Gradient of the loss function with respect to the predicted value: ∂L/∂a = −y/a + (1 − y)/(1 − a)

The next step involves the non-linear part of the network: we compute the gradient of the non-linear function with respect to z (the output of the linear transformation). For this illustration I will use the sigmoid function as the non-linear function; I will not go into any detail about the sigmoid function itself in this article.

Sigmoid function (non-linear function): a = σ(z) = 1 / (1 + e^(−z))

And the gradient goes as follows:

Gradient of the non-linear function with respect to z: ∂a/∂z = σ(z)(1 − σ(z)) = a(1 − a)

Note that this would look different if another non-linear function were used.
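A quick numerical sanity check of the sigmoid-derivative identity (a standalone Python sketch, not from the article; z and the step h are made up):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Central finite difference vs. the closed form sigma'(z) = sigma(z) * (1 - sigma(z)).
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
```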

So far we are able to compute the gradient of the loss function with respect to z (linear transformation of the input) thanks to the two gradients we previously computed.

Gradient of the loss with respect to z: ∂L/∂z = (∂L/∂a)(∂a/∂z) = a − y
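The remarkably simple result ∂L/∂z = a − y is easy to verify numerically (my own Python sketch; the values of y, z, and h are made up):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_of_z(y, z):
    # Cross-entropy expressed directly as a function of z, with a = sigmoid(z).
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Finite-difference slope vs. the closed form dL/dz = a - y.
y, z, h = 1.0, 0.3, 1e-6
numeric = (loss_of_z(y, z + h) - loss_of_z(y, z - h)) / (2 * h)
analytic = sigmoid(z) - y
```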

With this, we can finally compute the gradients that we initially were looking for, which are (a) the gradient of the loss function with respect to the weight and (b) the gradient of the loss function with respect to the bias.

First we compute the gradient with respect to the weight as follows:

Gradient of the loss function with respect to the weight: ∂L/∂w = (∂L/∂z)(∂z/∂w) = (a − y)·x

Similarly we can compute the gradient of the loss with respect to the bias:

Gradient of the loss with respect to the bias: ∂L/∂b = (∂L/∂z)(∂z/∂b) = a − y
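Both gradients can be computed, and checked against a finite difference, in a few lines (a Python sketch of my own; x, w, b, and y are made-up values):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, w, b, y = 2.0, 0.5, 0.1, 1.0
a = sigmoid(w * x + b)

# Analytic gradients from the derivation above.
dL_dw = (a - y) * x
dL_db = a - y

# Finite-difference check on the weight gradient.
def loss(w_, b_):
    a_ = sigmoid(w_ * x + b_)
    return -(y * math.log(a_) + (1 - y) * math.log(1 - a_))

h = 1e-6
numeric_dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
```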

That’s it! All needed gradients have been computed.

As a reminder, the gradient tells us how much the loss function changes when we make small adjustments to the weight. All that’s left is updating the values of the weight and the bias.

We adjust the weight in the direction of the negative gradient, meaning we subtract a small multiple of the gradient from each weight. The size of that multiple is controlled by a small factor called the learning rate α, which determines how quickly the network converges to the optimal weights.

Weight and Bias Update

This is what the weight and bias updates look like for our simplified NN, with a sigmoid as the non-linear function applied to the linear transformation:

Weight and bias update: w := w − α(a − y)x, b := b − α(a − y)
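Putting everything together, repeated gradient-descent iterations for our simplified NN can be sketched as follows (my own Python illustration; x, y, the initialization, α, and the iteration count are made up):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y = 2.0, 1.0      # single input and its true label
w, b = -0.5, 0.0     # initial parameters
alpha = 0.5          # learning rate

losses = []
for _ in range(100):
    a = sigmoid(w * x + b)  # forward pass
    losses.append(-(y * math.log(a) + (1 - y) * math.log(1 - a)))
    w -= alpha * (a - y) * x  # weight update
    b -= alpha * (a - y)      # bias update
```

Plotting (or printing) the recorded losses shows them shrinking as the prediction approaches the true label.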

And that is how we update the parameters of a NN.

One important note: on a NN of a bigger size, computing the gradients and updating the weights is repeated many times, with each iteration ideally bringing the network closer to the optimal set of weights. The step of propagating the gradients backwards through the network via the chain rule is known as backpropagation.

Conclusion

In conclusion, weight update is a critical process in neural networks that enables them to learn from input data and make accurate predictions. By adjusting the numerical values that determine the strength of connections between neurons, a NN can minimize its error on a given task using gradient descent. The math behind weight update uses the gradient of the loss function with respect to each parameter to adjust the weights.

Final Words

It’s worth noting that in most cases, we don’t need to reproduce the backpropagation algorithm from scratch as there are tools available that abstract away the implementation details and allow us to implement it easily. However, it is beneficial and interesting to have a thorough understanding of how the process works.
