A worked example on updating weights by gradient descent

For CS610 Applied Machine Learning

James Koh, PhD
MITB For All
8 min read · May 10, 2024

--

Gradient descent is one of the fundamental concepts which, I dare say, any machine learning course would teach (or at least, assume prior knowledge of). We use it to solve an optimization problem, particularly when no closed-form solution exists. (Even if a closed-form solution does exist, the problem can still be solved by gradient descent.)

Objective

Two questions might come to your mind when you see the following math equations (note that for simplicity, I replace the calligraphic L with a simple italic L):

Gradient descent algorithm to update weights w based on the gradient of the loss L.
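Written out in plain text, this is the standard gradient descent step, where η denotes the learning rate:

w ← w − η ∇w L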
  1. How is the equation derived?
  2. What is the intuition behind the gradient being proportional to the error?

In this article, we will answer the above two questions, using a worked example with concrete numbers. After all, this isn’t a course on abstract math!

Back To Basics

Just a quick recap for those who have forgotten the ∇ operator learnt during your undergraduate days. Suppose we have a function y(x), where x has two dimensions as represented by x₁ and x₂.

Example of computing ∇y. All equations, unless otherwise stated, are typed by the author.

Taking yet another step backwards, ∂y/∂x₁ is the partial derivative of y with respect to x₁. Physically, it is the rate of change of y along the direction of x₁. Mathematically, you simply have to perform the univariate differentiation of y with respect to x₁ while treating all other variables as constants. For example:
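As an illustration (a representative example, since the original figure is not reproduced here): if y = x₁²x₂ + x₂, then ∂y/∂x₁ = 2x₁x₂ (treating x₂ as a constant), ∂y/∂x₂ = x₁² + 1 (treating x₁ as a constant), and so ∇y = [2x₁x₂, x₁² + 1]ᵀ.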

Derivations

First, let’s see how the derivation actually comes about. We need to be clear about the definitions before even taking one step further. The positive class is stated as y = 1, which is straightforward. However, some materials describe the negative class as y = 0, while others describe it as y = −1.

In this article, I will treat the negative class as y = −1 to be consistent with the lecture notes.

We will start by considering just a single sample xᵢ. If this is a positive sample, we want the prediction that it is a positive class to be as close to 1 as possible. That is, we want to maximize the probability p(yᵢ=1 | xᵢ). On the other hand, if it is a negative sample, we want the above prediction to be as close to 0 as possible. The cross entropy loss gives us just this.

You would often see the formula below where the negative class is represented as y = 0. Here, p is the probability of a sample belonging to the positive class. It is easy to see that p=1 when y=1, and p=0 when y=0, will lead to zero loss.

Cross entropy loss when negative class is represented as y = 0. Note that this form should not be used when the negative class is y = −1. In addition, note that this is for a single sample; if training on a batch, all that’s required is to sum across the different samples.
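Written out, that single-sample loss (with y ∈ {0, 1}) is:

L = −[ y log p + (1 − y) log(1 − p) ]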

If we represent the negative class as y = −1 instead of zero, the following loss may be used. (Notice how this is analogous to the formula indicated in slide 19, though the difference here is that this is just for a single sample.)

Cross entropy loss when negative class is represented as y = -1.
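In plain text, this loss reads (the δ notation is explained immediately below):

Lᵢ = −δ(yᵢ=1) log p(1|xᵢ) − δ(yᵢ=−1) log p(−1|xᵢ)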

Like it or not, as postgraduate students you should feel at perfect ease interpreting mathematical notation such as the above, no differently from the way you read an English sentence.

δ(y=1) is an indicator function (written here in the style of a Dirac delta); it is 1 if and only if y=1, and 0 otherwise. Likewise, δ(y=−1) is 1 if and only if y=−1. Meanwhile, p(1|xᵢ) represents the probability that the observed sample xᵢ belongs to class 1. You may find it clearer when written as p(y=1|xᵢ), though it is common practice to shorten the notation since the meaning is obvious from the context.

For a binary classification problem, p(1|xᵢ) + p(−1|xᵢ) = 1, and we can express both probabilities as functions of the model weights w and the observed feature vector x:

Probability of a sample belonging to the positive or negative class, as a function of the weights w.
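In plain text, with σ denoting the sigmoid function, these are the familiar logistic regression probabilities:

p(1|xᵢ) = σ(wᵀxᵢ) = 1 / (1 + exp(−wᵀxᵢ))

p(−1|xᵢ) = 1 − σ(wᵀxᵢ) = 1 / (1 + exp(wᵀxᵢ))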

We can then use a clever way of expressing the loss as

Cross entropy loss, where yi takes the value of 1 or -1.
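The cleverness lies in the identity σ(−z) = 1 − σ(z), which lets both cases collapse into a single expression:

Lᵢ = −log p(yᵢ|xᵢ) = −log σ(yᵢ wᵀxᵢ) = log(1 + exp(−yᵢ wᵀxᵢ))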

The loss will be summed across all samples of the dataset, leading to:

Loss that will be used for updating the model weights is obtained by summing across samples.
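That is:

L = Σᵢ log(1 + exp(−yᵢ wᵀxᵢ))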

Next, I will use a simple example to show how differentiation can be extended from a univariate case to a multivariate case. Consider the function sin(wᵀx), where w and x are vectors. When we take the gradient with respect to w, we could either:

  1. Expand the formula and perform the partial derivatives one component at a time (see first two lines in the figure), or
  2. Perform the differentiation with respect to the vectors as a whole (see last two lines in the figure).
Example of finding the gradient of f w.r.t. w
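In plain text, the figure amounts to the following (taking f = sin(wᵀx) with w and x having two components, as a representative reconstruction):

∂f/∂w₁ = cos(w₁x₁ + w₂x₂) · x₁ and ∂f/∂w₂ = cos(w₁x₁ + w₂x₂) · x₂

Stacking the components gives ∇w f = cos(wᵀx) x, so the two approaches agree.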

Now, you may be asking: how is the transpose taken care of? The short answer is that you do not need to worry about that level of detail. The long answer is that you can show up for consultations and I will work out the above using indicial notation to prove it, though indicial notation is something you would probably deal with only while doing a PhD.

Along this same spirit, let’s find the gradient of L with respect to w:

Step-by-step derivation to find gradient of L w.r.t. w
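In plain text, the key steps are (reconstructed from the loss derived above):

∇w L = Σᵢ ∇w log(1 + exp(−yᵢ wᵀxᵢ)) = Σᵢ [ −yᵢ xᵢ exp(−yᵢ wᵀxᵢ) / (1 + exp(−yᵢ wᵀxᵢ)) ] = −Σᵢ yᵢ xᵢ [1 − p(yᵢ|xᵢ)]

where the last step uses exp(−z) / (1 + exp(−z)) = 1 − σ(z).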

Note that for clarity, I’ve denoted the gradient operator here with the subscript w. The equation is actually the same as what you see in your notes.

If you’ve followed up till this point, give yourself a pat on the back!

In the next section, we will see the intuition behind the [1−p] component within the gradient.

Worked Example

Now, let’s consider a training dataset with only two samples, and a very simple model which makes a prediction based on a single feature x. (Note that in general, the iᵗʰ sample is expressed as xᵢ, where each xᵢ has m different features and is expressed as a vector. Here, we simply consider each sample to be represented by a scalar, given that there is only one feature of interest.)

Example of a dataset with one positive sample and one negative sample

The loss function can be plotted as follows:

Loss, as a function of weight w, for the particular dataset.

All you need to do is apply the formulas given in the earlier section, duplicated here for your convenience:

Replicated equation for your convenience

You can write a Python helper function for convenience. Here’s a snippet for you to use.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_loss(x, y, w):
    # usually we use the mean, but to be consistent with the derivation, the sum is used
    return -np.sum(np.log(sigmoid(y * x * w)))

x = np.array([1.6, 0.2])  # feature value of each of the two samples
y = np.array([1, -1])     # corresponding labels

# sweep w across a range of values and record the loss at each point
holder = []
for w in np.arange(-1, 4, 0.1):
    holder.append((w, cross_entropy_loss(x, y, w)))

holder = np.array(holder)
plt.plot(holder[:, 0], holder[:, 1])
plt.xlabel("w")
plt.ylabel("loss")
plt.show()

Now, let’s consider specifically what happens at a particular w, say, w = 0.5.

Predicted probabilities of each sample belonging to the positive class, at a given w. (Note that the reason I put many leading zeros is to make it extremely clear that it refers to the sample number and not the feature dimension.)
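Concretely, with w = 0.5 (these numbers are reconstructed from the discussion that follows):

p(1|x₀₀₀₁) = σ(0.5 × 1.6) = σ(0.8) ≈ 0.690

p(1|x₀₀₀₂) = σ(0.5 × 0.2) = σ(0.1) ≈ 0.525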

Ideally, we want the probability of the first sample to be as close to 1 as possible, while at the same time having the probability of the second sample be as close to 0 as possible.

At this point, do you see that there is a trade-off?

To have p(1|x₀₀₀₁) close to 1, we want w to be very positive. On the other hand, to have p(1|x₀₀₀₂) close to 0, w has to be very negative. This is why the loss can never reach 0 in our given example (as is typically the case in real life).

Applying the equation, you can compute the loss to be 1.12 when w = 0.5. Note that this is the loss itself, NOT the gradient.
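You can verify this with the helper function from the snippet above:

print(cross_entropy_loss(x, y, 0.5))  # prints roughly 1.12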

Finally, let’s go all the way back to the very first equation you see in this post:

Gradient descent algorithm to update weights w based on the gradient of the loss L.

The gradient would be negative, and the corresponding update would increment w. From the graph, this is exactly what is desired.

Applying the gradient descent update with actual numbers.
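In numbers (reconstructed from the values above; the learning rate η = 0.1 is my own choice for illustration, and may differ from the figure):

∇w L = −(1)(1.6)[1 − 0.690] − (−1)(0.2)[1 − 0.475] ≈ −0.496 + 0.105 = −0.391

w ← 0.5 − 0.1 × (−0.391) ≈ 0.539

Here 1 − p(y₂|x₀₀₀₂) = 1 − p(−1|x₀₀₀₂) = 1 − 0.475, since the second sample’s label is −1 and p(−1|x₀₀₀₂) = 1 − 0.525.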

You can plug in the numbers yourself to see what happens when w is large, say, w = 2.0. The resulting gradient would be positive, causing w to be decreased in the next update.
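If you would rather not do the arithmetic by hand, here is a small sketch (the gradient helper below is my own addition, not part of the original snippet; it reuses sigmoid, x and y from the code above):

def gradient(x, y, w):
    # dL/dw = -sum_i y_i * x_i * [1 - p(y_i | x_i)]
    return -np.sum(y * x * (1 - sigmoid(y * x * w)))

print(gradient(x, y, 0.5))  # roughly -0.391: negative, so w increases
print(gradient(x, y, 2.0))  # roughly +0.057: positive, so w decreases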

Notice that [1 − 0.69] is proportional to the extent to which our prediction on the first sample (which is positive) misses the truth. If the feature and the label are both positive (as is the case for the first sample), this causes the gradient contribution to be negative, which means that we have an inclination to make w larger (such that the next time a sample with this feature is observed, we are more inclined to predict it as the positive class). For this particular combination of yᵢ and xᵢ, all else being equal, the larger the [1−p] factor, the more aggressively we want to increment w in the next iteration.

Note that you have to consider −yᵢxᵢ [1−p(yᵢ|xᵢ)] as a whole. For example, the contribution from the second sample towards the gradient is positive, which means that there is an inclination to make w smaller (or more negative). Look at the equation and ask yourself why this should be so: does decreasing w make the model more inclined to give the correct prediction for the second sample?

Keep in mind that at the end of the day, it is the collective effect of all training samples that determines whether we increase or decrease w in each iteration. Also, keep in mind that in general, x is multidimensional, and we are training the weights for a vector w rather than a scalar.

And this is just logistic regression. Things get more fun when it comes to multi-layer neural networks!

If you have any questions, post in the comments and I will clarify.

Conclusion

From this article, you have learnt

  1. how the gradient associated with a binary classification problem is derived, and
  2. how actual numbers can be plugged in, giving results which are physically meaningful and intuitive.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
