# Gradient Descent

# Learn by Example (Part 1)

# What is gradient descent?

- Gradient descent (GD) is an algorithm applied to another algorithm’s cost function in order to minimize it, using the gradients of that function with respect to its coefficients (i.e. the derivative of the cost function with respect to each coefficient).
- By running GD we obtain the coefficient values that achieve the minimum cost (error).
- These coefficients are then used by the machine learning model when predicting values.
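As a minimal sketch of the idea (a toy cost, not the worked example below), gradient descent on a one-parameter cost *f(b) = (b − 3)²* could look like this; the function name and stopping tolerance here are illustrative choices, not part of the example that follows:

```python
# Minimal gradient-descent sketch on a toy cost f(b) = (b - 3)^2,
# whose derivative is f'(b) = 2 * (b - 3). The true minimum is at b = 3.
def gradient_descent(grad, b0=0.0, lr=0.1, tol=1e-4, max_iter=1000):
    b = b0
    for _ in range(max_iter):
        step = lr * grad(b)   # step-size = learning-rate * gradient
        if abs(step) <= tol:  # stop when the step becomes tiny
            break
        b = b - step          # move against the gradient
    return b

best_b = gradient_descent(lambda b: 2 * (b - 3))
print(round(best_b, 3))  # close to 3.0
```

The same loop structure (gradient, step, update, stop when the step is tiny) is exactly what the worked example below does for the intercept of a linear model.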

# Example:

**Suppose that:**

We have the following linear system

- Target Values : y = [1.4, 1.9, 3.2]
- Inputs : x = [0.5, 2.3, 2.9]
- Our equation for linear regression (equation-1.1):

*𝑦̂ = 𝛽0 + 𝛽1 · 𝑥*

*Where:*

*(𝑦̂) is the predicted output*

*𝑠𝑙𝑜𝑝𝑒 = 𝛽1*

*𝑖𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡 = 𝛽0*

- And the cost function (sum of squared residuals, equation-1.2):

*𝑆𝑆𝑅 = Σ (𝑦 − 𝑦̂)²*

*Where:*

*(𝑦) is the actual output*

Tip: to get the idea of gradient descent, we will first work on the intercept coefficient alone, then on both coefficients together (intercept & slope). So a fixed value for the slope will be used (*𝛽1 = 0.64*).

- For our example, the cost (sum of squared residuals, SSR) in equation-1.2 can be written as follows, substituting equation-1.1 for the predicted (𝑦̂) and plugging in the x-values and the 𝛽1 value (equation-1.3):

*𝑆𝑆𝑅 = (1.4 − (𝛽0 + 0.64 · 0.5))² + (1.9 − (𝛽0 + 0.64 · 2.3))² + (3.2 − (𝛽0 + 0.64 · 2.9))²*
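As a quick sanity check of equation-1.3, a small sketch (using the example x, y and 𝛽1 values above, independent of the CSV dataset used later) that evaluates the SSR at a given intercept:

```python
# Evaluate the cost (SSR) of equation-1.3 at a given intercept (beta-0),
# with the slope fixed at 0.64, for the example data above.
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]
slope = 0.64

def ssr(intercept):
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

print(round(ssr(0.0), 5))  # cost at beta-0 = 0
```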

- Plotting the cost against the intercept gives the cost curve.

- Now we will **minimize the cost function using gradient descent with respect to *𝛽0*** by taking the derivative of equation-1.3 (using the chain rule), which gives equation-1.4:

*𝑑𝑆𝑆𝑅/𝑑𝛽0 = −2(1.4 − (𝛽0 + 0.64 · 0.5)) − 2(1.9 − (𝛽0 + 0.64 · 2.3)) − 2(3.2 − (𝛽0 + 0.64 · 2.9))*
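A quick numerical check of this chain-rule derivative (a sketch assuming the same example data): a central finite difference of the cost should match the analytic slope of equation-1.4, which is −5.704 at *𝛽0 = 0*:

```python
# Check the chain-rule derivative of equation-1.4 against a central
# finite difference of the cost at an arbitrary intercept.
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]
slope = 0.64

def ssr(b0):
    return sum((yi - (b0 + slope * xi)) ** 2 for xi, yi in zip(x, y))

def dssr(b0):
    # equation-1.4: d(SSR)/d(beta-0) = sum of -2 * (y_i - (b0 + slope * x_i))
    return sum(-2 * (yi - (b0 + slope * xi)) for xi, yi in zip(x, y))

h = 1e-6
numeric = (ssr(0.0 + h) - ssr(0.0 - h)) / (2 * h)
print(round(dssr(0.0), 4), round(numeric, 4))  # both ≈ -5.704
```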

- Equation-1.4 represents the **slope at any point on the cost curve.**
- Our target is the point of minimum slope on the cost curve, which corresponds to a tangent at the curve’s bottom, or at the point nearest to it.
- At the point where the cost is minimum, we can read off the corresponding *𝛽0* value.
- To do that, we start the calculations with a random value of *𝛽0*, say *𝛽0 = 0*, and a `learning-rate` (*𝛼 = 0.1*).
- The learning rate determines the step size gradient descent takes toward the minimum cost (the curve’s bottom) and the best value of *𝛽0*.
- *𝛼* also keeps gradient descent from overshooting the curve’s bottom. It is a hyper-parameter with no fixed value, generally between 0.1 and 0.001.
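To see why *𝛼* matters, a small sketch (again assuming the example data above) comparing a moderate and an overly large learning rate on this cost; the value 0.4 is an illustrative “too large” choice, not from the tutorial:

```python
# Compare learning rates on the example cost: a moderate alpha settles
# near the best intercept, while a large one overshoots the curve's
# bottom and the intercept oscillates further away each step.
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]
slope = 0.64

def dssr(b0):
    return sum(-2 * (yi - (b0 + slope * xi)) for xi, yi in zip(x, y))

def run(alpha, iters=20):
    b0 = 0.0
    for _ in range(iters):
        b0 = b0 - alpha * dssr(b0)  # New Intercept = Old Intercept - Step
    return b0

print(round(run(0.1), 4))  # settles near the minimum-cost intercept
print(round(run(0.4), 4))  # overshoots further each step
```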

## Calculations:

1. Get the slope of the cost curve at the arbitrary point where (*𝛽0 = 0*), as in fig.2, by substituting into equation-1.4 (curve slope = −5.7 at point-1).

2. Compute the step size gradient descent will use to jump to the next point on the curve: step = *𝛼* × slope.

3. Compute the new *𝛽0*: new *𝛽0* = old *𝛽0* − step.

4. Repeat steps 1 to 3 with the value of the new *𝛽0* again and again until **the absolute value of the step is equal to or less than 0.001** (|step| ≤ 0.001); at this value the change in both the step and the new *𝛽0* becomes too small to matter.

5. At (|step| ≤ 0.001), gradient descent reaches its goal: the best value of the system’s intercept coefficient (*𝛽0*) with the minimum error (cost).

**And hence our target is achieved.**

*The following code does this:*

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Importing our dataset from CSV file:
dataset = pd.read_csv('./Datasets/weights.csv')
x = dataset['weight']
y = dataset['height']

# Get relation between "intercept" coef. and squared error:
def sqr_err():
    sqError = []
    incpt = np.arange(0, 2, 0.1)
    slope = 0.64
    for i in incpt:
        y_pred = i + slope * x
        residual = sum((y_pred - y)**2)
        sqError.append(residual)
    plt.plot(incpt, sqError)
    plt.title('Intercept vs Squared_Error')
    plt.xlabel('Intercept')
    plt.ylabel('Squared Error')
```

The part above plots the cost curve as a function of *𝛽0*.

```python
# Draw Squared_Residual / Intercept curve:
sqr_err()

# Initial value of intercept
i = 0
learningrate = 0.1

while True:
    # Mark a certain value of Squared_Error to get the slope at
    # any point on the cost-curve
    y_pred = i + 0.64 * x
    cost = sum((y_pred - y)**2)
    plt.plot(i, cost, c='r', marker='o')
    print(f'Intercept= {round(i, 4)}')
    print(f'Squared-Residual: {round(cost, 4)}')

    # Draw tangent (equation-1.4 gives its slope)
    costSlope = -2*(1.4 - (i + 0.64 * 0.5)) - 2*(1.9 - (i + 0.64 * 2.3)) - 2*(3.2 - (i + 0.64 * 2.9))
    print(f'Cost_Slope: {round(costSlope, 4)}')
    line_x_values = np.arange(0, 1.2, 0.1)
    # Tangent passes through (0, tangent_intrcept) and (i, cost)
    # and has slope "costSlope"
    tangent_intrcept = cost - (i * costSlope)
    tangent = tangent_intrcept + costSlope * line_x_values
    plt.plot(line_x_values, tangent)

    stepSize = learningrate * costSlope
    if abs(stepSize) <= 0.001:
        print(f'Absolute value of Step Size is less than (0.001): {round(abs(stepSize), 4)}')
        break
    print(f'Step Size: {round(stepSize, 4)}')
    print('\n')
    # New Intercept = Old Intercept - Step Size
    i = i - stepSize

plt.show()
```

The second part of this tutorial will show how to use gradient descent to get both linear-system coefficients (*𝛽0* & *𝛽1*) of equation-1.1 with the same approach, targeting the minimum cost.

Part-2: