An intuitive approach to Gradient Descent

Sayantanee Sen
5 min read · May 29, 2022


Photo by Simon Fitall on Unsplash

In this article we will try to understand the gradient descent algorithm and the mathematics behind it.

Let's take a simple linear regression problem where we want to predict a person's weight based on his or her height. To start with any supervised machine learning problem, we first need training data that gives us the relation between the entities: height and weight in this case. An entity that helps in the prediction process is called an independent variable or feature. There can be one or more independent variables; for instance, in our example both the height and the age of a person could be used to derive the weight (for now we will focus on just one feature, i.e. univariate linear regression). The entity we are predicting is called the dependent or target variable. We will use "x" to denote the feature, "y" to denote the target variable, and "m" to denote the number of samples in the training data. Let's say we have data of 10 people to train our model; in that case, m = 10. Finally, h(x) will denote the hypothesis function, which determines the predicted "y".
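For univariate linear regression the hypothesis is simply a straight line, with θ0 as the intercept and θ1 as the slope:

hθ(x) = θ0 + θ1x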

The cost function is used to measure the prediction accuracy. Let's say hθ(xi) is the predicted value of the target variable (for the ith record) and yi is the actual value which is given (for the ith record). The difference between these two gives us the error (or the loss). The difference is squared to cancel out any negative sign, the squared error components are summed up for each value of i where i = 1, 2, 3, …, m, and the sum is finally divided by the total number of training samples, which gives us the mean squared error (MSE). Our goal is to minimize this error term: J(θ0, θ1).

Note: The mean is halved just for computational simplicity (the 1/2 cancels the 2 that appears when the squared term is differentiated).
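Written out, the (halved) mean squared error cost function described above is:

J(θ0, θ1) = (1 / 2m) · Σ ( hθ(xi) − yi )², where the sum runs over i = 1, 2, …, m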

Visual intuition of cost function in a 2d space

(2D space: 2 variables; 1 independent feature "x" and 1 target variable "y")

For simplicity, let's say θ0 = 0. That makes the cost function dependent only on θ1, i.e. hθ(xi) = θ1·xi. Now if we plot θ1 against the cost function J(θ1), we get a graph like the one below.
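To make this concrete, here is a small sketch in Python that evaluates J(θ1) over a range of θ1 values while keeping θ0 = 0. The height/weight values below are made up purely for illustration, and the plotting range is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy training data (heights in metres, weights in kg), illustrative values only
x = np.array([1.5, 1.6, 1.7, 1.8, 1.9])
y = np.array([55.0, 60.0, 65.0, 70.0, 75.0])
m = len(x)

def cost(theta1):
    """Halved mean squared error J(theta1), with theta0 fixed at 0."""
    predictions = theta1 * x                     # h(x) = theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

# Evaluate the cost over a range of theta1 values and plot the bowl-shaped curve
theta1_values = np.linspace(0, 80, 200)
J_values = [cost(t) for t in theta1_values]

plt.plot(theta1_values, J_values)
plt.xlabel("theta1")
plt.ylabel("J(theta1)")
plt.title("Cost J(theta1) with theta0 = 0")
plt.show()
```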

Visual intuition of cost function in a 3d space

A quick recap of high school derivatives (Optional Read)

In simple words, the derivative of a function in calculus is the change in the output value with respect to a slight change in the input value. For example, if a car covers distance "S" in time "T", then at any point in time "t" the velocity of the car can be calculated as the slight change in distance "ds" divided by the slight change in time "dt", as dt approaches 0.
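In the language of limits, this instantaneous velocity is:

v(t) = ds/dt = lim (dt → 0) [ s(t + dt) − s(t) ] / dt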

As dt approaches 0, the slope of the line passing through the two points converges and the line becomes tangent to the graph

Partial derivatives are the same as derivatives, except that the function has more than one variable. Each component of the partial derivative is calculated by keeping the rest of the variables constant. As you may have already inferred, this is going to be connected to our problem of one-feature vs. multiple-feature linear regression.
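For example, take f(x, y) = x²y. Holding y constant and differentiating with respect to x gives ∂f/∂x = 2xy, while holding x constant gives ∂f/∂y = x². This is exactly what we will do with the cost function J(θ0, θ1): differentiate it with respect to one parameter at a time, treating the other as a constant.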

Goal: Minimizing the Cost function using Gradient Descent

The gradient descent algorithm helps find a local minimum (or maximum) of a given function. The algorithm starts from a random point on the graph of the cost function and then, based on the slope at that point (the derivative), decides in which direction it should move next. If the slope is positive, θ1 decreases; if the slope is negative, θ1 increases, until the algorithm converges to the local minimum, where the slope finally becomes 0 (at θ1 = 1 in the plot above).

Mathematically, the gradient descent algorithm (in its standard form) can be described as:
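repeat until convergence:

θj := θj − α · ∂J(θ0, θ1)/∂θj    (for j = 0 and j = 1)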

Note: if the derivative ∂J(θ0, θ1)/∂θj is positive, θj will decrease, as θj := θj − α·(some positive value), and if the derivative is negative, θj will increase, as θj := θj − α·(some negative value).

"α" here determines how big the steps are and is called the learning rate: the larger the value of α, the larger the increase or decrease. A very large value of α may never reach the local minimum, as there is a high chance that the algorithm will overshoot the minimum and keep oscillating, or even diverge. On the other hand, a very small learning rate means the number of steps increases tremendously and the algorithm becomes extremely slow. Hence, it is preferred to choose an optimal value of "α".

An important note here: while computing the parameters during each step of gradient descent, all the parameters should be updated simultaneously (as in the code sketch further below).

Another important point to note is that we don't really have to change the learning rate to reach the local minimum, since the slope keeps decreasing as the algorithm converges towards the local minimum. This results in a smaller value of the derivative term in the update rule above, which in turn means smaller steps.
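To tie everything together, here is a minimal sketch of the full algorithm for our height/weight example. The toy data, the learning rate α = 0.3, and the number of iterations are illustrative assumptions; the updates themselves follow the rule above, with both parameters updated simultaneously:

```python
import numpy as np

# Toy training data (heights in metres, weights in kg), illustrative values only
x = np.array([1.5, 1.6, 1.7, 1.8, 1.9])
y = np.array([55.0, 60.0, 65.0, 70.0, 75.0])
m = len(x)

theta0, theta1 = 0.0, 0.0   # start from an arbitrary point
alpha = 0.3                 # learning rate (assumed value)
iterations = 20000          # number of gradient descent steps (assumed value)

for _ in range(iterations):
    predictions = theta0 + theta1 * x       # h(x) = theta0 + theta1 * x
    errors = predictions - y                # h(x_i) - y_i

    # Partial derivatives of J(theta0, theta1) = (1 / 2m) * sum(errors ** 2)
    grad_theta0 = np.sum(errors) / m
    grad_theta1 = np.sum(errors * x) / m

    # Simultaneous update: both new values are computed from the current ones
    theta0, theta1 = theta0 - alpha * grad_theta0, theta1 - alpha * grad_theta1

final_cost = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)
print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2f}, J = {final_cost:.4f}")
```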

Gradient descent visual intuition in 3d space

Think about this: you are standing on top of a mountain peak and you need to get down to find your friends, who have all the supplies. You will look around and then figure out which direction to step in so that your destination becomes closer and closer. That's pretty much what gradient descent does.

Happy Learning !!
