Beginner’s Guide to Finding Gradient/Derivative of Log Loss by Hand (Detailed Steps)
This tutorial will show you how to find the gradient of logistic regression’s most famous cost function, the log loss. It is designed for someone who, like me, doesn’t have much of a math background. However, you are expected to know a little bit, just a little, about derivatives. You might have studied calculus in high school, and now all you can remember is that the derivative of x is 1 or that the derivative of 2x is 2. If that’s the case, you’ll do just fine.
This tutorial is divided into 3 parts:
- Introduction to Derivatives
- The Chain Rule
- The Tutorial
Intro to Derivatives
Suppose we have the function y = 5x. The gradient of y with respect to x equals 5. Now what about y = 5x + 3x²? The gradient of y with respect to x equals 5 + 6x. What if y depends on more than just the one variable x? Let’s say it has one more variable, for example y = 2z + 4x. If the function has more than one variable, we find the gradient with respect to each variable.
First we treat all the other variables as constants. Let’s say we want to find the gradient of y with respect to z. Substitute the other variable, in our case x, with whatever number you want. I’ll use 5. So the equation becomes y = 2z + 4(5), which simplifies to y = 2z + 20. Differentiating that gives 2 + 0. The same goes for the gradient of y with respect to x: treat z as the constant and we have 2(5) + 4x, which differentiates to 0 + 4.
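If you want to double-check these partial derivatives without the substitution trick, a symbolic math library such as SymPy can do it for you. This little snippet is only an optional sanity check; the rest of the tutorial does not depend on it.

```python
import sympy as sp

# define the symbols and the function y = 2z + 4x
x, z = sp.symbols('x z')
y = 2*z + 4*x

# partial derivative of y with respect to z (x is treated as a constant)
print(sp.diff(y, z))  # prints 2

# partial derivative of y with respect to x (z is treated as a constant)
print(sp.diff(y, x))  # prints 4
```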
Remember! To be safe, substituting a variable with a constant should only happen in your head. It is just a trick to make things easier, and it becomes dangerous if, for example, you have y = ab - 5c. Suppose we want to find the gradient of y with respect to a. Treating b and c as the constants 5 and 3 respectively, we have y = 5a - 5(3). You will get 5 - 0. This is wrong. The correct result is b.
If you do write it down on your worksheet by substituting in constants, be sure to always check and recheck the result. If the substituted number is still present in the result, put it back in its original form; in our case 5 was a substitute for b, so we put b back in place of 5. Next, if there is a zero in the result, which comes from the derivative of a constant, you can leave it as is. For our case, 5 - 0 becomes b - 0, which is just b.
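The same kind of symbolic check catches the mistake from the y = ab - 5c example: the partial derivative with respect to a is b, not 5.

```python
import sympy as sp

# define the symbols and the function y = ab - 5c
a, b, c = sp.symbols('a b c')
y = a*b - 5*c

# partial derivative of y with respect to a (b and c are treated as constants)
print(sp.diff(y, a))  # prints b, not 5
```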
The Chain Rule
Suppose we have 3 functions: SR = R², R = y - yhat, and yhat = wx + b. If we write out the full version of SR we have SR = (y - (wx + b))². If you want to know the gradient of SR with respect to R, we can do what we did earlier: the gradient of R² equals 2R.
But what if we want to know the gradient of SR with respect to w? This is where the chain rule kicks in. First we find the gradient of SR with respect to R. Then we find the gradient of R with respect to yhat. Finally we find the gradient of yhat with respect to w. By multiplying the results together, you get the gradient of SR with respect to w.
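Written out for our example, and filling in the two small derivatives the steps above describe (the gradient of R = y - yhat with respect to yhat is -1, and the gradient of yhat = wx + b with respect to w is x), the chain looks like this:

```latex
\frac{dSR}{dw}
  = \frac{dSR}{dR}\cdot\frac{dR}{d\hat{y}}\cdot\frac{d\hat{y}}{dw}
  = 2R\cdot(-1)\cdot x
  = -2\,\big(y-(wx+b)\big)\,x
```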
I guess now you know why it is called the “Chain” Rule, innit?
Gradient of Log Loss: The Tutorial
As a quick refresher on logistic regression: the cost function is used to evaluate our prediction, and the prediction (a linear equation) is transformed into a probability by the sigmoid function before it can be used inside the cost function.
We calculate the gradient of the cost function to know which direction our loss is moving, up or down, so that we can update the weights and intercept to push the loss the way we want. For log loss, the smaller the better, so we change the weights and intercept to decrease the loss. Three functions are involved:
- Cost Function
- Sigmoid Function
- Linear Equation
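Written out, using the notation the rest of this article relies on, the three functions are:

```latex
\begin{aligned}
L &= -y\log(p) - (1-y)\log(1-p) && \text{cost function (log loss)} \\
p &= \frac{1}{1+e^{-z}} && \text{sigmoid function} \\
z &= wx + b && \text{linear equation}
\end{aligned}
```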
First, let’s formulate what our chains will look like. Remember that we have the functions L, p, and z, and we want to calculate the gradient of L with respect to w (weights) and b (intercept), which reside inside the function z. Using the chain rule we talked about above, we formulate the gradients as follows.
Gradient of Cost Function with respect to w (weights)
Gradient of Cost Function with respect to b (intercept)
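In symbols, the two chains are:

```latex
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial p}\cdot\frac{\partial p}{\partial z}\cdot\frac{\partial z}{\partial w}
\qquad
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial p}\cdot\frac{\partial p}{\partial z}\cdot\frac{\partial z}{\partial b}
```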
Gradient of L w.r.t p
Let’s prepare our cost function before calculating the gradient. For the first partial derivative, we only care about finding the gradient of the cost function with respect to “p”, the sigmoid function. Therefore we replace the formula of the sigmoid function with its name, “p”.
Let’s first differentiate -y log(p). The derivative of log(x) is 1/x, therefore the derivative of log(p) is 1/p, and since -y is treated as a constant, the derivative of -y log(p) with respect to p is -y/p.
Next we calculate the derivative of -(1-y)log(1-p). The derivative of log(1-p) is not as straightforward as the last one, because the function we are differentiating, log(1-p), still has another function, 1-p, inside it. The derivative rule for this kind of function requires us to multiply the derivative of log(1-p), treating 1-p as the variable, by the derivative of 1-p, which is -1. We leave -y and -(1-y) as they are because we treat them as constants; we only care about p.
Multiplying it all together we have:
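```latex
\frac{d}{dp}\Big[-(1-y)\log(1-p)\Big]
  = -(1-y)\cdot\frac{1}{1-p}\cdot(-1)
  = \frac{1-y}{1-p}
```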
Putting the derivatives of -y log(p) and -(1-y)log(1-p) together we have:
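```latex
\frac{dL}{dp} = -\frac{y}{p} + \frac{1-y}{1-p}
```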
Gradient of p w.r.t z
Now it is the turn of the sigmoid function: we differentiate p with respect to z (the linear equation).
The derivative rule for a function of the form 1/(1+e^-z), or 1/f(z) in general, says that the derivative of 1/f(z) equals the derivative of f(z) over f(z)², times -1. Thus the derivative of our sigmoid function equals the derivative of 1+e^-z over (1+e^-z)², times -1. In addition, when differentiating 1+e^-z, note that the derivative of e^-z equals -e^-z, so the two minus signs cancel.
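In symbols:

```latex
\frac{dp}{dz}
  = -\frac{\frac{d}{dz}\left(1+e^{-z}\right)}{\left(1+e^{-z}\right)^{2}}
  = -\frac{-e^{-z}}{\left(1+e^{-z}\right)^{2}}
  = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}
```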
Gradient of z w.r.t w & b
Next up is the third function that we are going to differentiate: the linear equation, or in our case the z function.
We differentiate z with respect to w, therefore we treat b and x as constants. That leaves x as the result.
If we differentiate z with respect to b, we do the same thing: treat w and x as constants. Thus we have 0 + 1.
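In symbols:

```latex
\frac{\partial z}{\partial w} = x
\qquad
\frac{\partial z}{\partial b} = 1
```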
End Result of Gradient L w.r.t w & b
Putting together all the results of the functions we have differentiated, we get the following, which is the derivative/gradient of our logistic regression’s cost function. Below is the gradient of the cost function with respect to w (weights); for the gradient with respect to b (intercept), replace x with 1.
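```latex
\frac{dL}{dw}
  = \left(-\frac{y}{p} + \frac{1-y}{1-p}\right)
    \cdot\frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}
    \cdot x
```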
It’s ready to use, but it’s a long one. We can still simplify those two pieces, dL/dp and dp/dz, so that the whole thing becomes just (p-y)x for the weights and (p-y) for the intercept. I will show you next how to do just that.
dL/dp
We will just do fraction addition in this part: get the two fractions over the same denominator, simplify the numerator, and then eliminate yp and -yp. Step by step:
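```latex
\begin{aligned}
\frac{dL}{dp} &= -\frac{y}{p} + \frac{1-y}{1-p} \\
              &= \frac{-y(1-p) + (1-y)p}{p(1-p)} && \text{same denominator} \\
              &= \frac{-y + yp + p - yp}{p(1-p)} && \text{simplify the numerator} \\
              &= \frac{p-y}{p(1-p)} && \text{eliminate } yp \text{ and } -yp
\end{aligned}
```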
dp/dz
For this part we will need a little algebraic LHS and RHS manipulation. Here we go. First, separate the derivative of our sigmoid function into two factors, e^-z and 1/(1+e^-z)².
Now let’s rewrite 1/(1+e^-z)² in terms of p: since p = 1/(1+e^-z), it is simply p².
Now let’s rewrite e^-z in terms of p: solving p = 1/(1+e^-z) for e^-z gives e^-z = (1-p)/p.
Let’s put the simplified versions of both back in. Cancelling the denominator p against one of the p’s in p², we have:
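```latex
\begin{aligned}
\frac{dp}{dz} &= \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}
               = e^{-z}\cdot\frac{1}{\left(1+e^{-z}\right)^{2}} \\
              &= \frac{1-p}{p}\cdot p^{2} \\
              &= p\,(1-p)
\end{aligned}
```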
Now let’s put the simplified pieces back into our gradient function. The p(1-p) in the denominator of dL/dp cancels against the p(1-p) from dp/dz, and we are left with:
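```latex
\frac{dL}{dw} = \frac{p-y}{p(1-p)}\cdot p(1-p)\cdot x = (p-y)\,x
```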
Replace x with 1 for dL/db, which gives dL/db = p - y.
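If you want to double-check the end result numerically, here is a small Python sketch (the example values are just for illustration) that compares the closed-form gradients (p-y)x and p-y against finite-difference approximations of the log loss:

```python
import numpy as np

def log_loss(w, b, x, y):
    # z = wx + b, p = sigmoid(z), L = -y*log(p) - (1-y)*log(1-p)
    z = w * x + b
    p = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# a single training example and some starting parameters (illustrative values)
x, y = 2.0, 1.0
w, b = 0.3, -0.1

# closed-form gradients from the derivation above
z = w * x + b
p = 1.0 / (1.0 + np.exp(-z))
dL_dw = (p - y) * x
dL_db = p - y

# central finite-difference approximations of the same gradients
eps = 1e-6
dL_dw_num = (log_loss(w + eps, b, x, y) - log_loss(w - eps, b, x, y)) / (2 * eps)
dL_db_num = (log_loss(w, b + eps, x, y) - log_loss(w, b - eps, x, y)) / (2 * eps)

print(dL_dw, dL_dw_num)  # the two numbers should agree to several decimal places
print(dL_db, dL_db_num)
```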
Conclusion
I hope this helps you understand the step-by-step process of finding the gradient of the log loss function. I made it as detailed as I could. Most of the time when looking at the math behind machine learning tutorials, I go “how on earth did they produce this from that?”. I hope this tutorial helps people who have the same struggle.
If you want to know how to implement this function and put it to work with logistic regression, I have a tutorial on how to do that in Python here. I’ll gladly accept any feedback, thank you!