# Gradient Descent: The Backbone of Most Popular Machine Learning Algorithms

Before starting off with Gradient descent, let me summarize machine learning in 4 points:

- Define a model.
- Make predictions on Training data.
- Calculate the error in the model’s prediction.
- Minimize the error by optimizing the model’s weights(coefficients).

The task of optimizing our model’s performance by minimizing its prediction error is the most important step in any machine learning algorithm. This is where our model trains on the given data (AKA training data).

Gradient descent is the most popular and widely used optimization algorithm. It falls under the category of *first-order optimization algorithms*. There are many variations of gradient descent (Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, etc.). In this post, I’ll be discussing Batch Gradient Descent.

For a better understanding of how gradient descent works, let’s consider the example of Linear Regression.

Linear Regression is a supervised machine learning algorithm. In this algorithm, we train our model to predict the value of a dependent variable **y**, given some independent variables **X**.

Say our regression model is of the form:

**y_pred = mX + b**

where m and b are the weights/coefficients to be optimized in order to minimize the cost/error.

Here, you may notice that the regression model looks exactly like a linear equation, where **m** is the slope of the line and **b** is the Y-intercept.
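As a quick sketch (with arbitrary example values chosen for m and b), the prediction step is just this linear equation evaluated for every sample:

```python
import numpy as np

# Toy illustration: predictions from y_pred = m*X + b
# (m and b are arbitrary here; in practice they are learned).
m, b = 1.5, 0.5
X = np.array([1.0, 2.0, 3.0])
y_pred = m * X + b  # element-wise: one prediction per sample
```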

Now, let’s define our cost function. (Note that the cost function is a function of m and b, i.e., the weights. So, in order to minimize the cost function, we need to optimize the coefficients.)

Here, I’m considering the squared-error function to calculate the error.

Cost, **J = (1/(2M)) * ∑(y - y_pred)²**

Where,

M is the number of training samples

y is the actual value of the dependent variable

y_pred is the predicted value of the dependent variable

We have to consider the error across all M samples, hence the summation. Squaring the error term (**y - y_pred**) ensures that the error is never negative.
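As a sketch, the cost and its gradient can be computed together. Here I assume the model is written in matrix form (predictions are `X @ weights`, with the intercept folded into the weights via a column of ones in X); this is also the shape of return value that the `costFunction` called later in the post would need:

```python
import numpy as np

def costFunction(X, y, weights):
    # Squared-error cost J = (1/(2M)) * sum((y - y_pred)^2) and its gradient.
    M = X.shape[0]
    y_pred = X @ weights                     # predictions for all M samples
    error = y_pred - y                       # residuals, shape (M, 1)
    J = float(np.sum(error ** 2) / (2 * M))  # scalar cost
    grad = (X.T @ error) / M                 # gradient w.r.t. weights, shape (n, 1)
    return J, grad
```

At the true weights the residuals vanish, so both the cost and the gradient are zero.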

When our model finds the best fit for the data, it looks something like this.

*How gradient descent minimizes the cost function is quite simple and intuitive!!*

Firstly, we start off with randomly initializing the coefficients of our model. Then we run Gradient Descent on it.

The first-order derivative of a curve at some point **P** gives the slope of that curve at that point. Gradient descent uses the slope of the cost function to find its way towards a minimum of that curve, which may be a local minimum or the global minimum.

Consider this curve. The slope of the curve at the point P(2, 7.39) is positive. If we subtract the value of the slope (scaled by a small step size) from the X coordinate of P, the new X coordinate and its corresponding Y value on the curve give us a new point Q(X, Y) that is closer to the minimum of the curve. If this process is repeated for a certain number of iterations, until we end up at the minimum (i.e., the slope of the curve becomes zero), our solution is said to have converged. This process is called gradient descent.
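A minimal 1-D sketch of this idea, using the hypothetical curve f(x) = x² (so the slope at any point is 2x) and scaling each step by a small step size alpha:

```python
def descend_1d(x0, alpha=0.1, tolerance=1e-8, max_iter=10000):
    # Repeatedly step against the slope until the updates become negligible.
    x = x0
    for _ in range(max_iter):
        slope = 2 * x              # derivative of f(x) = x**2 at the current point
        x_new = x - alpha * slope  # move downhill by a fraction of the slope
        if abs(x_new - x) <= tolerance:
            break
        x = x_new
    return x
```

Starting at x0 = 2 (as in the figure), the iterates shrink geometrically towards the minimum at x = 0.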

Python code:

```python
import numpy as np

def GradientDescent(X, y, alpha=0.001):
    iteration = 1
    m, n = X.shape
    weights = np.zeros((n, 1))
    Jhist = []
    tolerance = 1e-6
    while True:
        # costFunction is assumed to return the current cost J and the
        # gradient of J with respect to the weights.
        J, grad = costFunction(X, y, weights)
        weights = weights - alpha * grad
        if iteration % 100 == 0:
            print('Iteration #{} - {}'.format(iteration, J))
        if iteration != 1 and abs(Jhist[-1] - J) <= tolerance:
            print('Iteration #{} - {}'.format(iteration - 1, Jhist[-1]))
            print('Iteration #{} - {}'.format(iteration, J))
            print('Converged!!')
            break
        Jhist.append(J)
        iteration += 1
    return weights, Jhist
```

Here, alpha is the learning rate, a hyper-parameter that scales the step size at every iteration.

The algorithm terminates when the decrease in cost between two consecutive iterations is less than the tolerance (threshold) value.

The value of alpha is something we need to tune to get the best results.

If we plot Jhist against the number of iterations for different values of alpha, we get a graph something like this.
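Although the original plot isn’t reproduced here, the data behind such a plot can be generated with a sketch like this (a toy 1-D least-squares problem with arbitrarily chosen alphas): a well-chosen alpha drives the cost down far faster than a tiny one.

```python
import numpy as np

def cost_history(alpha, n_iter=200):
    # Toy problem: fit y = 2x with the squared-error cost, recording J per iteration.
    X = np.array([[1.0], [2.0], [3.0]])
    y = 2.0 * X
    w = np.zeros((1, 1))
    hist = []
    for _ in range(n_iter):
        error = X @ w - y
        hist.append(float(np.sum(error ** 2) / (2 * len(X))))
        w -= alpha * (X.T @ error) / len(X)   # gradient descent update
    return hist
```

Plotting `cost_history(0.001)` against `cost_history(0.1)` would show the contrast in convergence speed.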

Gradient descent takes small baby steps towards the minimum of the curve until it finally converges at the minimal solution.

This is the basic working behind gradient descent. But, there might be one problem with this algorithm.

There is a risk that the algorithm may converge at a local minimum instead of the global minimum. For the best estimate of the variable **y**, we need our gradient descent algorithm to converge at the global minimum, not a local minimum.

Fortunately, this is not a big problem here, since the squared-error cost function is bowl-shaped (i.e., it is convex, or concave upwards). Hence our gradient descent algorithm is sure to converge at the global minimum (since there are no local minima).

As our algorithm slowly approaches the minimum, our model gradually adjusts to the training data so that the error is minimized.

Since it’s not easy to visualize 3D plots for checking gradient descent, it’s better to plot contour plots.

Points to note:

1. Data in real-life scenarios may not be so simple that we can fit a linear model to it; sometimes we have to fit a non-linear model to the data.

2. For fitting a non-linear model, we usually prefer polynomial regression over linear regression.

3. Running Batch Gradient Descent on huge data sets is inefficient and computationally slow. That is why variations like Stochastic Gradient Descent and Mini-Batch Gradient Descent exist; they do a pretty good job of optimizing on large data sets.
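As a hedged sketch of the mini-batch idea from point 3 (the function name and toy setup are mine, not from any particular library): each update uses the gradient from a small random batch instead of the whole data set, so one pass over the data performs several cheap updates.

```python
import numpy as np

def minibatch_epoch(X, y, weights, alpha=0.01, batch_size=2, rng=None):
    # One pass over the data, updating on one small shuffled batch at a time.
    rng = rng if rng is not None else np.random.default_rng(0)
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ weights - y[idx]
        grad = X[idx].T @ error / len(idx)   # gradient from this batch only
        weights = weights - alpha * grad
    return weights
```

Repeating such epochs drives the weights towards the same minimum as batch gradient descent, at a fraction of the per-update cost on large data sets.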

Hello reader!!

This is my first blog post on Medium. Hope you like it, and find it pretty interesting. In case of any doubts or suggestions please drop your responses in the response section below.

Follow me for more such Machine Learning and Deep Learning content.

=D