Gradient Descent and Disappointed Girlfriend

Published in

DataComics

10 min readNov 28, 2018

Motivation

I have been wanting to learn the math behind the algorithms behind the models I use in my day to day life for a while. There have been multiple failed attempts. While reading I usually start understanding math and in the middle, I get lost. Then I take a break and do something else and come back to my system just to find that I am completely lost. Since teaching is the best way to learn I thought of writing an explainer for one of the coolest algorithms that I have come across, Gradient Descent with a style this blog has always embraced.

But Before That

This blog requires some basic level of understanding of math in Algebra and geometry. This blog also requires an in-depth understanding of movies and stereotypes. I know that is a tough ask but here is some good news, this blog doesn’t expect you to understand your girlfriend. It doesn’t even expect you to have one. And finally…

And also smoking is injurious to your health.

The fictitious Girlfriend

The protagonist of this story is an impish boy in his high teens who has got a mobile phone recently. He also has a girlfriend. She is happy that the boy finally listened to her and got himself a mobile phone. The boy has always found it very difficult to understand the mental state of his girl even while having an in-person discussion. He feels there is a lot of mood swings involved even during a short conversation. If It is that difficult to figure out the mood by looking at the face or listening to the tone of the voice, With a mobile phone now, imagine how hard it will be for him to figure out the mood while texting. How is that boy suppose to know that hmmmm, hmmm, hmm, hm are four different things meant in four different moods? A lot of misunderstandings happened in the past just because he was unable to understand the mood of his girlfriend. He believes just by understanding the air around and curating text that will comfort his girlfriend a lot of bad blood can be avoided.

The Anger and The Disappointment

After experiencing it several different times and doing some research the boy finally figured out that anger and disappointment are two different things. In the past, there have been several instances where the boy misunderstood anger for disappointment and disappointment for anger. Poor guy used to take the literal meaning of the text from his girlfriend which reads “I am not angry”.

He believed that if he builds a classifier which could label the texts from his girlfriend whether it means anger or disappointment he will be able to comfort her better and prevent the backlash. So he starts collecting data. He pings all his friends who have girlfriends and asks them for labeled data for text messages from their respective girlfriends. Since he is a boy with some manners he doesn’t ask for the actual texts, instead, he asks only for a couple of attributes about the text. The length of the text message, and time difference in seconds between this text and the previous text which was received. After getting a good amount of sample data he decides to build a binary classifier which will help him classify those text messages into anger or disappointment.

This is how the scattered plot of collected labeled data looks.

You can see that the anger and disappointment texts have got grouped separately together in the form of clusters. When in anger you get text messages with a lot of text very frequently and when in disappointment you get messages with very little texts that too with a lot of time interval between two consecutive texts from your girlfriend. So probably “hmmm ok do as you wish” response sent without much time interval means anger and “hm” sent after a huge gap is probably mean disappointment.

The Fine Line

The boy has to build a classifier to separate out the clusters so that next time when he receives a text from his girlfriend he will be able to classify it well. Since we have only two attributes the feature space is only two dimensional. (Time difference and length of text). So a line is enough to separate out the cluster and learn to classify the texts next time. So after the equation of the line is learned, from next time we just need to plug the values of x — time difference and y-Length of the time to find the Y — (0 for anger, 1 for disappointment) next time.

In order to find the line of best fit by equations a lot of calculation needs to be performed. The line equation has two main constants (weights) which have to be learned. One is the slope of the line and other is the y-intercept. The slope in a way defines an angle in which the line is tilted and the y-intercept is the point where the line meets the y-axis when x is 0. Find the equations of slope m and intercept b below.

Now don’t get angry or disappointed. We are not going to compute either slope or intercept. For a simple reason, that computation takes time and its little complex. Here we have only two-dimensional feature space, imagine if we would have gathered other features like the number of full stops in the texts number of smiley in text received etc. In that case, even the plane or entity that separates the feature space in order to classify them will be multidimensional and equation to solve it will be much more complex and very difficult to compute.

You can be wrong but correct your mistakes, okay?

Imagine by plugging random numbers we accidentally found the constants of the equations for slope and intercept, how nice it would be? But think of our luck and think about how lucky we have been in the past? Even if we were to try all possible combinations that were there our luck is aligned in such a way that we find the right numbers only when a handful of it is left to be tried. Say a number between 0–200 there a 1000 possible values(including decimals)for slope and intercept. So we have to try 1000 * 1000 (a million) combinations before getting to the exact numbers. These number of combinations available will only grow as the number of dimensions increase. Now we are stuck in the dark without knowing how to solve the equation and whether the boy’s girlfriend is angry or disappointed.

This problem puzzled scientist for quite some time in history and solution to this is something quite brilliant.

What if you choose some random values for the constants and plug it to compute the result and compare it how much it is different from that of values we have in training data (loss) and update the weights in such a way that in the next iteration the loss is reduced comparatively.

This idea is central to the entire algorithm. Update weights in such a way that the loss is minimized and keep updating the weights until the loss cannot be further reduced. This also in a way says that In order to form the equation you need not find the exact values for the weights that will solve the equation you just need to get to the set of weights for which the loss is the least. Which is pretty close to solving the problem.

The Analogy

This loss function can be analogous to the misunderstanding between you and your girlfriend. You need not completely understand your girlfriend, which is simply not possible. You simply don’t have enough compute power. You just need to understand her better than anyone else would do. If you have reached there you have got your personal life sorted out.

In theory, you just have to update the weights in a way so that it will help you reach the sweet spot where you can find the enlightenment tree, sitting under which you can be assured that no one can understand your girlfriend better. In the next section, we shall use the algorithm to actually get there and see what are the challenges we overcome on the way.

The Descent

To do this with math one needs to understand a couple of mathematical concepts, which is a field in itself. 1. Linear Algebra 2.Differential Calculus.

Linear Algebra

It is a field of mathematics that deals with linear equations and matrices. To put in simple words it is algebra done with matrices primarily for data transformation and faster computation.

Calculus

This is a branch of mathematics that deals with changes, the rate of change and understand how things are related to each other. We use this in our problem to find how the error (loss) is related to weight and alter the weights in such a way that it results in minimizing the loss.

Epoch

Epoch is nothing but a set of steps repeated continuously until a condition is met. It's just a fancy word for iteration, which is indeed just a fancy word for repeating things which is not a fancy thing to do.

First, it starts with initializing random weights. The line equation is computed with random weights and a line is drawn using those weights separating the two types of emotions.

Then the loss is computed. In our case loss is nothing but the sum of squared errors. Ie (Actual — Predict)² summed over all instance of x and y.

Then differential calculus is used to update weights in such a way the loss is minimized in the next epoch.

These steps are repeated until the loss cannot be further improved by updating weights.

You can see how this is done in action below. You can notice that the line of separation gets better with the epoch.

The loss drops down until a point then it stabilizes, over this point, it cannot be improved.

The Anti Climax

The minimum loss that the boy was able to achieve using gradient descent starting from a random weight is close to 18. The boy was pretty happy. The boy feels that he has reached a point where no one can understand his girlfriend better than he does. Suddenly he gets a text from his girlfriend which reads “Even my dog understands me better than you do.”

The boy's face grew totally pale not understanding what’s going around him. He feels even after reaching the lowest point of the loss function there is something that he is missing. Let’s see what is the thing that the boy is missing.

The Local Phenomena

Let’s bring back the old sweet spot curve that we saw.

The boy is of course in a minimal point in a particular area of the curve but it is not the lowest point in the curve. The boy is in a place that is called local minima. In other words from the boy’s point of view, the boy thinks that he has reached the point where one could understand his girlfriend the best. But he is unaware of the fact that he has traversed only one part of the curve and there is scope for further improvement in other areas of the curve. This happens because since initial weights are randomly chosen and you descent from the point downwards and reach a place where you falsely assume that no further improvement can be made. To overcome this the same process is repeated starting with various random initial weights and finally, the sweet spot is discovered (Global Maxima).

Repeating our experiment with various random initial weight for one of the weights we were able to reach the sweet spot where the loss was close to 10. The boy has finally attained enlightenment.

The boy finally manages to build the classifier that will help him find out the mood which his girlfriend is in when she texts him next time with a good accuracy.

After 8 years…

The boy is busy in his workplace. He gets a few texts messages and calls from his girlfriend. She is disappointed and on the verge of going into a depression. Our protagonist instead of mistaking it for anger and switching off his mobile, he immediately without a second thought goes to her place and comforts her thereby saving a potential break up and saving his career by not becoming a film director.