In our last blog post, we learnt about ordinary least squares (OLS) for optimising a linear regression model. We covered the basics of linear regression model, skimmed through the idea of cost function & gradient descent and created our very first linear regression model by using python package statsmodel. Finally we ended the post by discussing the assumptions under which the model operates. In this blog post we’ll dive deeper into cost functions and gradient descent and try to understand how the OLS uses it to come up with optimal beta estimates for a model.
Suppose you buy a packet of chips for $5 and your friend without looking at the bill predicts the price of the chips packet you bought to be $3, then he/she is off by $2. Therefore, the cost associated with the prediction is $2. Now this is the cost or error in prediction that we want to reduce by learning and that is the focus of this blog post. Therefore, in very simplistic terms a cost function is the measure of, how bad is your model’s performance?
In the example above we used the difference between the actual and the predicted value to find the cost or error in a model, similarly we can make use of other functions to determine the cost of a model. The choice of function is determined by the kind of problems you’re trying to solve and each function has different properties. Let’s have a look at some commonly used cost functions.
Residual Sum of Square ( RSS ) : It is the sum of square of residuals (huh!, but you may ask, what is residual). Well, residual is the difference between the actual and the predicted value. In the example above $2 was the residual. OLS uses RSS for estimating the beta values.
Mean Square Error( MSE ) : It is the average of square of residuals.
Logistic regression cost function : It is a cost function used for classification problem using logistic regression. We will look into it in a greater detail while exploring classification problem. For now, I just wanted to make you familiar with the idea of different cost functions available to us.
There are many more cost functions out there but these are the most commonly used cost functions.
Now that we understand the concept of cost functions, let us look into gradient descent algorithm.
So, suppose you are enjoying your vacation on a snowy mountain and one afternoon you decide to go skiing. You gear up with all the necessary equipments and reach the top of a hill to launch yourself down the slope. You give yourself a slight push and ski your way down to the bottom of the hill. On top of the hill, since it’s a steep slope, you gain some momentum and cover a large distance in no time. On the contrary at the end of the hill, you start loosing that momentum and start slowing down. Now that the slope of the hill is almost zero, you know that you’re at the bottom and can not ski any further without any extra efforts. Well, gradient descent algorithm works exactly in the same manner.
As the name suggest, we ‘descend’ down the ‘slope’ or gradient. To get a better understanding let us work with an example.
X = [1, 2, 3, 4] Y = [2, 4, 6, 8]
We can see from the data above, that the function takes the form of:
If we give the above data to OLS function we defined in our last post, we’d expect
For the purpose of illustration, since beta_0 is zero, let us just focus on beta_1. Therefore, the function take the form of Y = beta_1 * X. Now let us calculate the cost associated with different beta values.
Given that the algorithm estimated a value of beta_1 = 2, let us calculate the cost associated with the model Y = 2 * X.
Similarly, let us calculate the cost associated with the model if the model would have estimated a beta_1 = 1.
Lastly, let us calculate the cost associated with the model if the model would have estimated a beta_1 = 3.
Likewise we can calculate the cost for any beta value.
beta_1 = [-1, 0, 1, 2, 3, 4, 5]cost = [270, 120, 30, 0, 30, 120, 270]
Now that we have cost associated with seven different beta values, let’s try to visualise it to get a better understanding of effect of beta on cost and predictions.
Well, from the visualisations above we can uncover a few interesting points as follows :
- Firstly, with the combination of the graphs we can see that as we move from beta = -1 to beta = 2, the predictions become more accurate and the cost associated with the model decreases to the point of RSS = 0. Then, as we increase the beta value further from 2 to 5, the predictions starts to get inaccurate and the cost associated with the model starts to increase.
- Secondly, the further we are from a good estimate, steeper is the slope of the cost function. Similarly in other words, as we move closer to a good estimate, the slope of the cost function decreases. (left graph)
- Lastly, the minimum cost associated with a model is always greater than or equal to zero. Here, from the graph on the right we can see that since our model with beta = 2 (teal colour) overlaps the actual value ( Y ), the cost associated with the model is 0.
Now, that we understand the working of gradient descent at a conceptual level, let us now shift our focus on what computation does the algorithm make to assure that we are at the bottom of the hill i.e. the cost function.
Referring back to the ski example, we know that we are at the bottom of the hill when there is no more slope for you to ski on and hence the skiing comes to a halt. Similarly, the algorithm tries to compute the slope of the cost function (RSS), with respect to beta values, and comes to a halt at the bottom of cost function.
The question you may ask is, how to compute this slope?
For all the math-a-magicians out there, we know that the slope is nothing but the derivative of a function. So we start at an arbitrary point on a cost function and move down the hill bit by bit in the direction of maximum slope. Every time we move we compute the slope again stop when the slope, i.e. the derivative, is equal to zero .
For the purpose of easy illustration we will assume there is only one feature X, however, the algorithm works exactly the same for any number of features.
The algorithm goes as follows :
- Assign random value to betas.
- Calculate predictions using the beta value
3. Compute RSS ( the cost associated with assigned beta value)
4. Calculate partial derivative of RSS with respect to each beta. We use chain rule to do the following computations. Incase we had more features, we’d calculate beta with respect to them as well.
5. Update our estimates or beta values. This step is to make sure we move down the hill (cost function).
In the step above eta(η) is called learning rate. It determines how fast or slow we want our algorithm to move downhill. To interpret the equation think of it as follows : From the point we are at on the cost function (beta), move down the hill (subtraction), by eta times slope. We will discuss more about effects of eta on our algorithm in a separate blog.
Note : The derivative values used in the above equation are same as calculated in step 4. The update on beta values is done simultaneously. So we calculate all the derivatives with respect to each beta first (step 4) and then update the beta values simultaneously in step 5.
6. We repeat step 2, 3, 4 and 5 till our stopping condition is met.
Ideally, we we would like to repeat these steps till the we reach the bottom of the hill, i.e. where slope is equal to zero. However, as we move close to our optimal solution, we’d see that our updates on beta values are minimal. This is because as we get closer to the optimal solution the slope get closer to zero. At this point step 5 can be summarised as follows :
Therefore, it might take a lot of computation and time to reach the point of absolute optimality, however, a lot of time due to lack of resources we do not mind close to optimal or suboptimal solution. Hence, we stop the algorithm once a desired suboptimal level of solution is achieved.
There are many metrics that we can look into while deciding when to stop the algorithm. Two of the most popular ones are as follows :
- Repeat the steps ‘n’ number of times. The algorithm may or may give an optimal solution.
- Repeat the step until the change in cost between current and the previous iteration is less than a threshold value. This is a preferred choice as we can make an assumption that any change in cost in iterations post this point is going to be less that the threshold value and hence will be negligible.
Lastly, gradient decent being a popular algorithm can be used from many packages without having to code it. However, it is always a good practice to understand the algorithm and not use it like a black box.
Congratulations!! That is it. We now understand the fundamentals of gradient descent. There are a few variation of gradient descent but all of them are derived from this fundamental algorithm. We shall look into some of these variation sometime in our future blogs.
Until then, keep rocking !