Kucharlapatiaparna

GRADIENT DESCENT
Gradient descent is an iterative optimization algorithm used to find a local minimum of a differentiable function (the cost function); in simple terms, it is an algorithm for finding the minimum of a convex function.

The more the cost is minimized, the better the predictions the model will be able to make.
Gradient descent can be applied to many machine learning and deep learning algorithms, including linear regression, logistic regression,
neural networks, etc.

Cost function:
The cost function (or loss function) is the function that has to be minimized or maximized by varying the decision variables.
It measures how well the model fits the training data and is defined in terms of the difference between the predicted and actual values.
Properties of the cost function:
1) It is continuous.
2) It is a convex function.
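
As a concrete example, mean squared error (MSE) is a commonly used cost function for regression. Here is a minimal Python sketch (the data values are purely illustrative):

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: the average squared difference between
    actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

# Predictions close to the actual values give a small cost.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.1, 7.3])
print(mse_cost(y_true, y_pred))  # ~0.047, a small cost means a good fit
```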

  • Local minimum:

A point on the curve that is lower than its preceding and succeeding points is called a local minimum.

  • Global minimum:

The point that is lowest among all points on the curve is called the global minimum.

In gradient descent, we use these notions of local and global minima to drive the loss function down.

  • Why Gradient Descent:
    We use the gradient descent algorithm to minimize the cost function, i.e. to find the values of the parameters/coefficients a, b, c, etc.
    that give the least error, so that the model can predict the target variable accurately on unseen data.
  • How it works:
    1) Randomly pick a starting point xi.
    2) Compute xi+1 such that xi+1 is closer to the optimum, by stepping against the gradient:
    xi+1 = xi − η · f′(xi), where η is the learning rate (step size) and f′(xi) is the gradient at xi.
    3) Repeat step 2 until convergence (see the sketch below).
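
Putting these steps together, here is a minimal Python sketch of gradient descent on the convex function f(x) = (x − 3)², whose derivative is 2(x − 3); the learning rate, starting point, and tolerance are illustrative choices:

```python
def gradient_descent(grad, x_init, learning_rate=0.1, n_iters=100, tol=1e-6):
    """Repeatedly step opposite to the gradient until the update becomes tiny."""
    x = x_init
    for _ in range(n_iters):
        x_new = x - learning_rate * grad(x)  # xi+1 = xi - learning_rate * gradient
        if abs(x_new - x) < tol:             # convergence check
            return x_new
        x = x_new
    return x

# f(x) = (x - 3)**2 has its minimum at x = 3; its derivative is 2 * (x - 3).
x_min = gradient_descent(grad=lambda x: 2 * (x - 3), x_init=0.0)
print(x_min)  # approximately 3
```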

Types of Gradient Descent

  • Batch Gradient Descent:
    It uses the entire dataset to compute the gradient of the cost function for every update. Because the whole dataset is processed for each step, the computation
    is very slow, so it is not suitable for large datasets and works best on smaller ones.
    To overcome this problem we use Stochastic Gradient Descent and Mini-Batch Gradient Descent.
  • Stochastic Gradient Descent:
    Stochastic means a system or process based on probability. In SGD, a few randomly selected samples are used instead of the whole dataset for each iteration. In its basic form it uses only a single sample, i.e. a batch size of one, for each update.
    The data is randomly shuffled and a sample is selected to perform each iteration.

SGD is much faster to compute per update, it is memory efficient, and it can handle large datasets that cannot fit into memory.
Due to its noisy updates, SGD has the ability to escape local minima and converge towards a global minimum (a short sketch follows the drawbacks below).

* A drawback of SGD is that its updates are noisy and have high variance, which makes the optimization process less stable.
* SGD requires more iterations to converge, since it updates the parameters one training example at a time.
* Due to the noisy updates, SGD may not converge to the exact global minimum and can end up with a suboptimal solution.
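
Here is a minimal sketch of SGD for simple linear regression (y ≈ a·x + b) with a squared-error cost; the learning rate, epoch count, and toy data are illustrative assumptions, and one randomly chosen sample is used per update:

```python
import numpy as np

def sgd_linear_regression(x, y, learning_rate=0.01, n_epochs=50, seed=0):
    """Fit y ~ a*x + b by updating (a, b) with one random sample at a time."""
    rng = np.random.default_rng(seed)
    a, b = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):          # shuffle the sample order each epoch
            error = (a * x[i] + b) - y[i]          # prediction error for one sample
            a -= learning_rate * 2 * error * x[i]  # gradient of the squared error w.r.t. a
            b -= learning_rate * 2 * error         # gradient of the squared error w.r.t. b
    return a, b

# Toy data generated from y = 2x + 1 with a little noise.
x = np.linspace(0, 5, 50)
y = 2 * x + 1 + np.random.default_rng(1).normal(0, 0.1, size=50)
print(sgd_linear_regression(x, y))  # estimates close to (2, 1)
```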

  • Mini-Batch Gradient Descent:
    It is a combination of both batch and stochastic gradient descent. It splits the training dataset into several mini-batches and then performs one update per mini-batch.
    The batch size is chosen based on the training dataset, so it is suitable for large datasets while needing a smaller number of iterations; a sketch follows below.
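
Here is a minimal sketch of mini-batch gradient descent on the same linear-regression setup; the batch size of 32, learning rate, and epoch count are illustrative assumptions. Note how a batch size of 1 behaves like SGD, and a batch size equal to the dataset size behaves like batch gradient descent:

```python
import numpy as np

def minibatch_gd(x, y, batch_size=32, learning_rate=0.01, n_epochs=100, seed=0):
    """Fit y ~ a*x + b using the averaged gradient of one mini-batch per update."""
    rng = np.random.default_rng(seed)
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_epochs):
        order = rng.permutation(n)                   # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # indices of one mini-batch
            error = (a * x[idx] + b) - y[idx]
            a -= learning_rate * 2 * np.mean(error * x[idx])  # averaged gradient w.r.t. a
            b -= learning_rate * 2 * np.mean(error)           # averaged gradient w.r.t. b
    return a, b

x = np.linspace(0, 5, 200)
y = 2 * x + 1 + np.random.default_rng(1).normal(0, 0.1, size=200)
print(minibatch_gd(x, y))                      # estimates close to (2, 1)
print(minibatch_gd(x, y, batch_size=1))        # batch_size = 1 behaves like SGD
print(minibatch_gd(x, y, batch_size=len(x)))   # batch_size = n behaves like batch gradient descent
```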