# Gradient Descent Application in 30 mins (NO BS)

Let’s first define a simple sinusoidal model:

```python
import numpy as np

def sin_model(x, theta):
    """
    Predict the estimate of y given x, theta_1, theta_2

    Keyword arguments:
    x     -- the vector of values x
    theta -- a vector of length 2
    """
    theta_1, theta_2 = theta
    return theta_1 * x + np.sin(theta_2 * x)
```

1. The optimal theta value is the one that minimizes the loss function.
2. We find it by taking the derivative of the loss function with respect to theta and moving theta against that gradient.

The first step is to pick a loss function for our model.

Suppose we use Mean Squared Error (MSE) as the loss function: the average of `(y_hat - y_obs) ** 2` over all n observations.

```python
def sin_MSE(theta, x, y):
    """
    Compute the numerical value of the l2 loss of our sinusoidal model given theta

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    y_hat = sin_model(x, theta)
    return np.mean((y - y_hat) ** 2)
```

Taking the partial derivatives of this loss function with respect to theta_1 and theta_2 gives the following:
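With the model plugged in, the loss is the mean of (yᵢ − θ₁xᵢ − sin(θ₂xᵢ))², and the chain rule yields the two partials:

```latex
\frac{\partial L}{\partial \theta_1}
  = \frac{1}{n}\sum_{i=1}^{n} -2\,\bigl(y_i - \theta_1 x_i - \sin(\theta_2 x_i)\bigr)\, x_i
\qquad
\frac{\partial L}{\partial \theta_2}
  = \frac{1}{n}\sum_{i=1}^{n} -2\,\bigl(y_i - \theta_1 x_i - \sin(\theta_2 x_i)\bigr)\, x_i \cos(\theta_2 x_i)
```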

```python
def sin_MSE_dt1(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_1

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    theta_1, theta_2 = theta
    dt_1 = -2 * (y - theta_1 * x - np.sin(theta_2 * x)) * x
    return np.mean(dt_1)

def sin_MSE_dt2(theta, x, y):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_2

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    theta_1, theta_2 = theta
    dt_2 = -2 * (y - theta_1 * x - np.sin(theta_2 * x)) * x * np.cos(theta_2 * x)
    return np.mean(dt_2)
```

Finally, we put them together into one overall gradient function:

```python
def sin_MSE_gradient(theta, x, y):
    """
    Returns the gradient of l2 loss with respect to vector theta

    Keyword arguments:
    theta -- the vector of values theta
    x     -- the vector of x values
    y     -- the vector of y values
    """
    return np.array([sin_MSE_dt1(theta, x, y), sin_MSE_dt2(theta, x, y)])
```

There are many ways to write a gradient descent function; the most important thing is to understand the logic behind these implementations and pick the one that makes the most sense for your own analysis.

Some intuitions to keep in mind:

• We are trying to find the combination of theta_1 and theta_2 that leads to an overall minimum of the loss function.
• Gradient descent nudges theta in the direction of the negative gradient until theta converges.
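The nudge in the second bullet is the standard gradient descent update rule, with learning rate 𝛼:

```latex
\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta}\, L\!\left(\theta^{(t)}\right)
```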

Let’s get started!

1. The step-by-step guide to implementing Gradient Descent
• where `df` is the gradient of the function we are minimizing (i.e. the loss function) and `initial_guess` is the (arbitrary) starting parameter vector.
```python
def gradient_descent(df, initial_guess, alpha, n):
    guesses = [initial_guess]
    guess = initial_guess
    while len(guesses) < n:
        guess = guess - alpha * df(guess)
        guesses.append(guess)
    return np.array(guesses)
```

As you might notice, our `sin_MSE_gradient` function takes three parameters (theta, x, y). To use it in the `gradient_descent` function, we wrap it so that it takes only one parameter (theta):

```python
def mse_loss_derivative_single_arg(theta):
    x = data['x']
    y_obs = data['y']
    return sin_MSE_gradient(theta, x, y_obs)
```

Then let’s try running the gradient descent function with initial theta vector [0, 0], alpha = 0.0001, and max_iteration = 100:

```python
gradient_descent(mse_loss_derivative_single_arg, np.array([0, 0]), 0.0001, 100)
```

2. Gradient Descent for a finite number of iterations

At each time step, use the gradient and `alpha` to update the current `theta`. We also save the current `theta` in `theta_history`, along with the average squared loss (computed with the current `theta`) in `loss_history`.

• Let’s first define the initial theta value, which can really be any point on the graph:

```python
def init_theta():
    """Creates an initial theta [0, 0] of shape (2,) as a starting point for gradient descent"""
    return np.zeros((2,))
```

```python
def grad_desc(loss_f, gradient_loss_f, theta, data, num_iter=20, alpha=0.1):
    """
    Run gradient descent update for a finite number of iterations and static learning rate

    Keyword arguments:
    loss_f -- the loss function to be minimized (used for computing loss_history)
    gradient_loss_f -- the gradient of the loss function to be minimized
    theta -- the vector of values theta to use at first iteration
    data -- the data used in the model
    num_iter -- the max number of iterations
    alpha -- the learning rate (also called the step size)

    Return:
    theta -- the optimal value of theta after num_iter of gradient descent
    theta_history -- the series of theta values over each iteration of gradient descent
    loss_history -- the series of loss values over each iteration of gradient descent
    """
    theta_history = []
    loss_history = []

    # initial x and y values
    x = data['x']
    y = data['y']

    # append the original theta and loss
    theta_history.append(theta)
    loss_history.append(loss_f(theta, x, y))

    # main update loop
    for iteration in range(num_iter):
        # update and append theta
        theta_cur = theta - alpha * gradient_loss_f(theta, x, y)
        theta_history.append(theta_cur)

        # calculate and append loss using theta_cur
        cur_loss = loss_f(theta_cur, x, y)
        loss_history.append(cur_loss)

        # reset theta
        theta = theta_cur

    return theta, theta_history, loss_history


# running gradient descent
theta_start = init_theta()
theta_hat, thetas_used, losses_calculated = grad_desc(
    sin_MSE, sin_MSE_gradient, theta_start, part_1_data, num_iter=20, alpha=0.1)
for b, l in zip(thetas_used, losses_calculated):
    print(f"theta: {b}, Loss: {l}")
```

# Experimenting more on Gradient Descent

## Decaying alpha (learning rate) gradient descent

Intuition:

With a decaying learning rate, instead of a constant 𝛼, the learning rate at iteration 𝑡 becomes 𝛼/(𝑡+1).

You can recycle most of the code above; just update 𝛼 to 𝛼/(iteration + 1) on each iteration.
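As a sketch of that change (the toy quadratic loss and its gradient below are stand-ins so the example runs on its own; only the `alpha / (iteration + 1)` line differs from the static-alpha version):

```python
import numpy as np

def grad_desc_decay(gradient_f, theta, num_iter=200, alpha=0.1):
    """Gradient descent with learning rate alpha / (t + 1) at iteration t."""
    theta_history = [theta]
    for iteration in range(num_iter):
        # the only change from static-alpha gradient descent:
        decayed_alpha = alpha / (iteration + 1)
        theta = theta - decayed_alpha * gradient_f(theta)
        theta_history.append(theta)
    return theta, theta_history

# illustration on a toy quadratic loss L(theta) = ||theta - 3||^2,
# whose gradient is 2 * (theta - 3); the minimizer is theta = [3, 3]
toy_gradient = lambda theta: 2 * (theta - 3.0)
theta_hat, history = grad_desc_decay(toy_gradient, np.zeros(2), num_iter=200, alpha=0.3)
```

Because the step size shrinks as 1/(𝑡+1), later iterations make smaller and smaller adjustments, which damps oscillation around the minimum.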

## Stochastic (mini-batch) gradient descent

Intuition:

In mini-batch gradient descent, the batch size 𝑏 is much smaller than 𝑛, the total size of the data. We will use a static learning rate and an argument `batch_size` to represent the size of the mini-batch sampled at each iteration.

The major change from the code above is the update loop:

```python
for iteration in range(num_iter):
    # sample batch_size rows each iteration
    sample = data.sample(batch_size)
    out_arr_x = sample['x']
    out_arr_y = sample['y']

    # update and append theta
    theta_cur = theta - alpha * gradient_loss_f(theta, out_arr_x, out_arr_y)
    theta_history.append(theta_cur)

    # calculate and append loss using theta_cur
    cur_loss = loss_f(theta_cur, out_arr_x, out_arr_y)
    loss_history.append(cur_loss)

    # reset
    theta = theta_cur
```

Note: `cur_loss` is now calculated and saved per mini-batch, instead of over the whole data set.
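Pieced together as a standalone sketch (the toy linear dataset, `toy_loss`, and `toy_gradient` below are stand-ins so it runs on its own; the loop body follows the fragment above):

```python
import numpy as np
import pandas as pd

def mini_batch_grad_desc(loss_f, gradient_loss_f, theta, data,
                         num_iter=300, alpha=0.5, batch_size=32):
    """Mini-batch gradient descent: each update uses a random sample of the data."""
    theta_history, loss_history = [theta], []
    for iteration in range(num_iter):
        # sample batch_size rows each iteration
        sample = data.sample(batch_size)
        x_batch = sample['x'].to_numpy()
        y_batch = sample['y'].to_numpy()

        # update theta using the gradient on the batch only
        theta = theta - alpha * gradient_loss_f(theta, x_batch, y_batch)
        theta_history.append(theta)

        # loss is computed per batch, not over the whole data set
        loss_history.append(loss_f(theta, x_batch, y_batch))
    return theta, theta_history, loss_history

# toy data: y = 2x, fitted with the one-parameter model y_hat = theta[0] * x
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
data = pd.DataFrame({'x': x, 'y': 2.0 * x})
toy_loss = lambda th, x, y: np.mean((y - th[0] * x) ** 2)
toy_gradient = lambda th, x, y: np.array([np.mean(-2.0 * (y - th[0] * x) * x)])
theta_hat, _, _ = mini_batch_grad_desc(toy_loss, toy_gradient, np.zeros(1), data)
```

Each step only touches `batch_size` rows, so updates are cheap; the gradient is noisier than full-batch gradient descent, but on average it still points downhill.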

## Visualize and Compare Performance

Let’s visualize our functions and see how each one performs at converging to the global minimum.

```python
plt.plot(np.arange(len(loss)), loss, label='Static Alpha')
plt.plot(np.arange(len(loss)), loss_decay, label='Decaying Alpha')
plt.plot(np.arange(len(loss)), loss_stoch, label='SGD')
plt.xlabel('Iteration #')
plt.ylabel('Avg Loss')
plt.title('Avg Loss vs Iteration # vs Learning Rate Type')
plt.legend();
```

```python
# 3D plot (gradient descent with static alpha)
plot_3d(thetas[:, 0], thetas[:, 1], loss, average_squared_loss, sin_model, x, y)

# 3D plot (gradient descent with decaying alpha)
plot_3d(thetas_decay[:, 0], thetas_decay[:, 1], loss_decay, average_squared_loss, sin_model, x, y)

# 3D plot (stochastic gradient descent)
plot_3d(thetas_stoch[:, 0], thetas_stoch[:, 1], loss_stoch, average_squared_loss, sin_model, x, y)
```

Gradient descent gives us a generic way to minimize a loss function when we cannot solve for the minimizing value of θ analytically. As our models and loss functions increase in complexity, we will turn to gradient descent as our tool of choice to fit models.

Because computing gradients over a large dataset is expensive and slow, stochastic GD produces updates much faster than batch gradient descent; it can make significant progress towards the optimal θ in the time batch gradient descent takes to finish a single update.

Decaying-alpha GD, meanwhile, shrinks the learning rate over time, which helps prevent overshooting the minimum with too large a step size and can therefore converge more precisely.

## Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by

## Jackie Hu 