Why do we subtract the slope * alpha in Gradient Descent?
--
If we are going in the direction of the steepest descent, why not add instead of subtract?
Ok, I get that the derivative (gradient) gives the direction of change. But still, why subtract it?
Because your goal is to MINIMIZE the loss function J(θ). The derivative (the slope) points in the direction in which J(θ) increases, so to decrease J(θ) you step the opposite way: θ := θ − α · dJ/dθ.
Here is an example with a simple scalar parameter.
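Below is a minimal sketch of such a scalar example (the quadratic cost J(θ) = (θ − 3)², the starting point, and the learning rate are my own choices for illustration):

```python
# Hypothetical quadratic cost J(theta) = (theta - 3)**2, minimized at theta = 3.
def J(theta):
    return (theta - 3) ** 2

def dJ(theta):
    # Derivative dJ/dtheta = 2 * (theta - 3)
    return 2 * (theta - 3)

alpha = 0.1   # learning rate
theta = 5.0   # start to the right of the minimum, where the derivative is positive

for step in range(50):
    # Subtracting alpha * dJ(theta):
    # positive derivative -> theta moves left, negative derivative -> theta moves right.
    theta = theta - alpha * dJ(theta)

print(theta)  # converges to ~3.0, the minimizer of J
```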
As you can see from this example, when the derivative is positive, you need to subtract a fraction of that derivative (alpha times the derivative) if you want to minimize the cost function. In a maximization problem, you would instead add alpha times the derivative (the slope); that variant is called gradient ascent.
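To see why the minus sign is the right choice, here is the standard first-order Taylor argument (not spelled out above, but it is the usual justification):

```latex
J\!\left(\theta - \alpha \frac{dJ}{d\theta}\right)
  \approx J(\theta) - \alpha \left(\frac{dJ}{d\theta}\right)^{2}
  \le J(\theta)
```

The squared term is never negative, so for a small enough alpha the subtraction can only decrease (or keep) the cost, while adding the term would push the cost up, which is exactly what you want when maximizing.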
Ok, that covers the scalar case. What about the multi-dimensional one?
The same logic applies in more than one dimension. In that case, the gradient of the function is just the vector of all its partial derivatives, and you subtract alpha times that vector. So basically nothing changes.
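A minimal sketch of the multi-dimensional case (the quadratic cost and the target vector are made up for illustration; the gradient is simply the vector of partial derivatives, and the update is applied element-wise):

```python
import numpy as np

# Hypothetical quadratic cost J(theta) = sum((theta - target)**2),
# minimized at theta == target.
target = np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    # Gradient = vector of partial derivatives: dJ/dtheta_i = 2 * (theta_i - target_i)
    return 2 * (theta - target)

alpha = 0.1
theta = np.zeros(3)  # arbitrary starting point

for step in range(100):
    # Same update rule as the scalar case, just with a vector of partials.
    theta = theta - alpha * grad_J(theta)

print(theta)  # converges to ~[1.0, -2.0, 0.5]
```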
What’s the difference between derivative and gradient?
Both the derivative and the gradient are concepts used in calculus to describe the rate at which a function changes. But their meanings are slightly different.
The derivative is a scalar that tells us the rate of change of a function at a point. Scalar means that it has only a magnitude and no direction. The gradient, on the other hand, is a vector that points in the direction of the function's steepest increase. Geometrically, it is perpendicular to the function's contours at a given point, and its magnitude is the rate of change in that direction.
In short, the gradient is a vector that has both a magnitude and a direction, while the derivative is a scalar that only has a magnitude.
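As a concrete illustration (my own example, not from the text), take f(x, y) = x² + y². Its gradient is the vector of partial derivatives; it points away from the origin, i.e. in the direction of steepest increase, and it is perpendicular to the circular contours of f:

```python
import numpy as np

# f(x, y) = x**2 + y**2 has partial derivatives df/dx = 2x and df/dy = 2y.
def grad_f(x, y):
    return np.array([2 * x, 2 * y])

g = grad_f(1.0, 2.0)
print(g)                   # [2. 4.]  -> a vector: it has a direction and a magnitude
print(np.linalg.norm(g))   # ~4.47    -> rate of change along the steepest direction

# The contour of f through (1, 2) is the circle x**2 + y**2 = 5.
# Its tangent direction there is (-2, 1), and the gradient is orthogonal to it:
tangent = np.array([-2.0, 1.0])
print(np.dot(g, tangent))  # 0.0 -> the gradient is perpendicular to the contour
```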