Gradient Descent Application in 30 mins (NO BS)

Jackie Hu
Apr 7 · 6 min read

Let’s first define a simple sinusoidal model:

Our model example
import numpy as np

def sin_model(x, theta):
    """
    Predict the estimate of y given x, theta_1, theta_2

    Keyword arguments:
    x -- the vector of values x
    theta -- a vector of length 2
    """
    theta_1 = theta[0]
    theta_2 = theta[1]
    return theta_1 * x + np.sin(theta_2 * x)

Intuition behind Gradient Descent:

  1. The optimal theta value minimizes the loss function.
  2. The way we find that theta is by taking the derivative of the loss function with respect to theta and stepping against it.

The first step is to pick a loss function for our model.

Suppose we use mean squared error (MSE) as the loss function: sum((y_hat - y_obs) ** 2) / n.
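As a quick numeric sanity check of that formula (a toy example with made-up values, not from the model above):

```python
import numpy as np

# toy observed values and predictions
y_obs = np.array([1.0, 2.0])
y_hat = np.array([1.0, 4.0])

# MSE: ((1 - 1)**2 + (4 - 2)**2) / 2 = 2.0
mse = np.mean((y_hat - y_obs) ** 2)
```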

def sin_MSE(theta, x, y):
    """
    Compute the numerical value of the l2 loss of our sinusoidal model given theta

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    y_hat = sin_model(x, theta)
    return np.mean((y - y_hat) ** 2)

Taking the derivative of this loss function with respect to theta_1 and theta_2 gives the following:

def sin_MSE_dt1(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_1

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    theta_1 = theta[0]
    theta_2 = theta[1]

    dt_1 = -2 * (y - theta_1 * x - np.sin(theta_2 * x)) * x
    return np.mean(dt_1)

def sin_MSE_dt2(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_2

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    theta_1 = theta[0]
    theta_2 = theta[1]

    dt_2 = -2 * (y - theta_1 * x - np.sin(theta_2 * x)) * x * np.cos(theta_2 * x)
    return np.mean(dt_2)
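For reference, applying the chain rule to the mean squared loss of this model gives the two partials in math form:

```latex
\frac{\partial L}{\partial \theta_1}
  = \frac{1}{n}\sum_{i=1}^{n} -2\bigl(y_i - \theta_1 x_i - \sin(\theta_2 x_i)\bigr)\, x_i

\frac{\partial L}{\partial \theta_2}
  = \frac{1}{n}\sum_{i=1}^{n} -2\bigl(y_i - \theta_1 x_i - \sin(\theta_2 x_i)\bigr)\, x_i \cos(\theta_2 x_i)
```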

Finally, we put them together into one overall gradient function:

def sin_MSE_gradient(theta, x, y):
    """
    Returns the gradient of l2 loss with respect to vector theta

    Keyword arguments:
    theta -- the vector of values theta
    x -- the vector of x values
    y -- the vector of y values
    """
    return np.array([sin_MSE_dt1(theta, x, y), sin_MSE_dt2(theta, x, y)])
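A useful habit when deriving partials by hand is to check them against a central finite-difference approximation. This sketch restates the model and loss so it runs standalone; numerical_gradient is a helper name of my own, not from this article:

```python
import numpy as np

def sin_model(x, theta):
    # same model as above: theta_1 * x + sin(theta_2 * x)
    return theta[0] * x + np.sin(theta[1] * x)

def sin_MSE(theta, x, y):
    # mean squared error of the sinusoidal model
    return np.mean((y - sin_model(x, theta)) ** 2)

def numerical_gradient(f, theta, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# compare against the analytic partials at an arbitrary theta
x = np.linspace(-3, 3, 50)
y = 1.5 * x + np.sin(2 * x)
theta = np.array([1.0, 1.5])

num_grad = numerical_gradient(lambda t: sin_MSE(t, x, y), theta)
dt1 = np.mean(-2 * (y - theta[0] * x - np.sin(theta[1] * x)) * x)
dt2 = np.mean(-2 * (y - theta[0] * x - np.sin(theta[1] * x)) * x * np.cos(theta[1] * x))
```

If the analytic partials are right, num_grad should agree with (dt1, dt2) to several decimal places.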

Implementing Gradient Descent Function

There are many ways to write a gradient descent function; what matters most is understanding the logic behind these implementations and finding the one that makes the most sense for your own analysis.

Some intuitions to keep in mind:

  • We are trying to find the combination of theta_1 and theta_2 that leads to an overall minimum of the loss function.
  • Gradient descent nudges theta in the negative gradient direction until theta converges.

Let’s get started!

  1. The step-by-step guide to implementing Gradient Descent
  • Here, df is the gradient of the function we are minimizing (i.e. the loss function), and initial_guess is the arbitrary starting parameter vector for that function.
def gradient_descent(df, initial_guess, alpha, n):
    guesses = [initial_guess]
    guess = initial_guess
    while len(guesses) < n:
        guess = guess - alpha * df(guess)
        guesses.append(guess)
    return np.array(guesses)
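To see this loop in action on something where we know the answer, here is a toy run on f(θ) = (θ - 3)², whose gradient is 2(θ - 3) and whose minimizer is θ = 3 (a made-up example, not the article's data):

```python
import numpy as np

def gradient_descent(df, initial_guess, alpha, n):
    # same loop as above: keep stepping against the gradient,
    # recording every guess, until we have n of them
    guesses = [initial_guess]
    guess = initial_guess
    while len(guesses) < n:
        guess = guess - alpha * df(guess)
        guesses.append(guess)
    return np.array(guesses)

# gradient of f(theta) = (theta - 3)**2
df = lambda theta: 2 * (theta - 3)

guesses = gradient_descent(df, 0.0, 0.1, 100)
```

The last guess should sit very close to 3, and the returned array lets you inspect the whole trajectory.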

As you might notice, our sin_MSE_gradient function takes three parameters (theta, x, y). To use it in the gradient_descent function, we wrap it as follows so that it takes only one parameter (theta):

def mse_loss_derivative_single_arg(theta):
    x = data['x']
    y_obs = data['y']

    return sin_MSE_gradient(theta, x, y_obs)

Then let’s try running the gradient descent function with initial theta vector [0, 0], alpha = 0.0001, and max_iteration = 100:

gradient_descent(mse_loss_derivative_single_arg, np.array([0, 0]), 0.0001, 100)

2. Gradient Descent for a finite number of iterations

Update Function

At each time step, we use the gradient and alpha to update the current theta. We also save the current theta in theta_history, along with the average squared loss (computed with the current theta) in loss_history.

  • Let’s first define the initial theta value, which can really be any point on the graph:
def init_theta():
    """Creates an initial theta [0, 0] of shape (2,) as a starting point for gradient descent"""
    return np.zeros((2,))

The Gradient Descent Function:

def grad_desc(loss_f, gradient_loss_f, theta, data, num_iter=20, alpha=0.1):
    """
    Run gradient descent update for a finite number of iterations and static learning rate

    Keyword arguments:
    loss_f -- the loss function to be minimized (used for computing loss_history)
    gradient_loss_f -- the gradient of the loss function to be minimized
    theta -- the vector of values theta to use at first iteration
    data -- the data used in the model
    num_iter -- the max number of iterations
    alpha -- the learning rate (also called the step size)

    Return:
    theta -- the optimal value of theta after num_iter of gradient descent
    theta_history -- the series of theta values over each iteration of gradient descent
    loss_history -- the series of loss values over each iteration of gradient descent
    """
    theta_history = []
    loss_history = []

    # initial x and y values (taken from the data argument, not a global)
    x = data['x']
    y = data['y']

    # append the original theta and loss
    theta_history.append(theta)
    loss_history.append(loss_f(theta, x, y))

    # main update loop
    for iteration in range(num_iter):
        # update and append theta
        theta_cur = theta - alpha * gradient_loss_f(theta, x, y)
        theta_history.append(theta_cur)

        # calculate and append loss using theta_cur
        cur_loss = loss_f(theta_cur, x, y)
        loss_history.append(cur_loss)

        # reset theta
        theta = theta_cur
    return theta, theta_history, loss_history

# running gradient descent
theta_start = init_theta()
theta_hat, thetas_used, losses_calculated = grad_desc(
    sin_MSE, sin_MSE_gradient, theta_start, part_1_data, num_iter=20, alpha=0.1
)
for b, l in zip(thetas_used, losses_calculated):
    print(f"theta: {b}, Loss: {l}")

Experimenting more on Gradient Descent

Decaying alpha (learning rate) gradient descent

Intuition:

With a decaying learning rate, instead of a constant 𝛼, the learning rate at iteration 𝑡 of gradient descent becomes 𝛼/(𝑡+1).

You can recycle most of the code above; just replace 𝛼 with 𝛼/(iteration + 1) inside the loop.
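A minimal sketch of that change, again on the toy quadratic f(θ) = (θ - 5)² so it runs standalone (the function and variable names here are my own, not the article's):

```python
def grad_desc_decay(df, theta, num_iter=100, alpha=0.4):
    # identical update loop, except the step size shrinks as alpha / (t + 1)
    for t in range(num_iter):
        theta = theta - (alpha / (t + 1)) * df(theta)
    return theta

# gradient of f(theta) = (theta - 5)**2, minimized at theta = 5
df = lambda theta: 2 * (theta - 5)

theta_hat = grad_desc_decay(df, 0.0)
```

Because the step sizes 𝛼/(𝑡+1) shrink but their sum still grows without bound, the iterate keeps creeping toward the minimum rather than stalling.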

SGD: Stochastic gradient descent

Intuition:

Update function for theta per each batch size

In the update rule above, 𝑏 (the batch size) is much smaller than 𝑛, the total size of the data. We will use a static learning rate and a batch_size argument to represent the size of the mini-batch sampled at each iteration.

The major change of the code from above is the update function as we iterate:

for iteration in range(num_iter):
    # sample a mini-batch each iteration
    sample = data.sample(batch_size)
    out_arr_x = sample['x']
    out_arr_y = sample['y']

    # update and append theta
    theta_cur = theta - alpha * gradient_loss_f(theta, out_arr_x, out_arr_y)
    theta_history.append(theta_cur)

    # calculate and append loss using theta_cur
    cur_loss = loss_f(theta_cur, out_arr_x, out_arr_y)
    loss_history.append(cur_loss)

    # reset theta
    theta = theta_cur

Note: for cur_loss, we calculate and save the loss on each mini-batch, rather than on the whole data set.
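Putting the pieces together on synthetic data: this sketch uses plain NumPy index sampling in place of data.sample, and a simple linear model y_hat = θ·x fit to y = 2x as a stand-in, so the snippet runs on its own (all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data from y = 2x (no noise, for a clear target)
x = np.linspace(0, 1, 100)
y = 2 * x

def grad(theta, xb, yb):
    # gradient of the mini-batch MSE for the model y_hat = theta * x
    return np.mean(-2 * (yb - theta * xb) * xb)

theta, alpha, batch_size = 0.0, 0.5, 10
for iteration in range(200):
    # sample a mini-batch each iteration (stand-in for data.sample)
    idx = rng.integers(0, len(x), size=batch_size)
    theta = theta - alpha * grad(theta, x[idx], y[idx])
```

Each update sees only 10 of the 100 points, yet theta still converges to 2, which is the whole appeal of SGD on large datasets.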

Visualize and Compare Performance

Let’s visualize our functions and see how each one performs at converging to the global minimum.

plt.plot(np.arange(len(loss)), loss, label='Static Alpha')
plt.plot(np.arange(len(loss)), loss_decay, label='Decaying Alpha')
plt.plot(np.arange(len(loss)), loss_stoch, label='SGD')
plt.xlabel('Iteration #')
plt.ylabel('Avg Loss')
plt.title('Avg Loss vs Iteration # vs Learning Rate Type')
plt.legend();
#3D plot (gradient descent with static alpha)
plot_3d(thetas[:, 0], thetas[:, 1], loss, average_squared_loss, sin_model, x, y)
gradient descent with static alpha
#3D plot (gradient descent with decaying alpha)
plot_3d(thetas_decay[:, 0], thetas_decay[:, 1], loss_decay, average_squared_loss, sin_model, x, y)
gradient descent with decaying alpha
#3D plot (stochastic gradient descent)
plot_3d(thetas_stoch[:, 0], thetas_stoch[:, 1], loss_stoch, average_squared_loss, sin_model, x, y)
stochastic gradient descent

Gradient descent gives us a generic way to minimize a loss function when we cannot solve for the minimizing value of θ analytically. As our models and loss functions increase in complexity, we will turn to gradient descent as our tool of choice to fit models.

Because computing the gradient over a large dataset is expensive and slow, stochastic GD produces updates much faster than batch gradient descent. It can make significant progress towards the optimal θ in the time batch gradient descent takes to finish a single update.

Decaying-alpha GD, meanwhile, shrinks the learning rate over time, which helps prevent overshooting the minimum as we approach it and can therefore land on a more accurate final θ.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by Jackie Hu

🌞By day grad @BerkeleyISchool 🌑By night, writing, 3DRendering, DJing, reading about cool stuff (She/her)
