The solution to an optimization problem can be done by selecting different methods. Moreover, the user can navigate on the surface or curve to establish an initial point and find the optimal or critical point, which can be observed on the plotted function.
1.Single Value Differentiation
2. Minima and Maxima
3. Gradient descent algorithm
4. Steps for Gradient descent algorithm
5. Types of Gradient Descent algorithms
6. Implementation of Stochastic Gradient Descent
For Optimization problems Differentiation is very important, Let’s see some maths,
Single Value Differentiation
Differentiation allows us to find rates of change. Intuitively dy/dx means how much does y changes as x changes.
Geometrically the derivative [f’(x) or dy/dx] of the function y = f(x) at the point P(x, y) (when exists) is equal to the slope (or gradient) of the tangent line to the curve y = f(x) at P(x, y).
Tangent: In geometry, the tangent line to a plane curve at a given point is the straight line that “just touches” the curve at that point. Mathematically tangent is given by,
dy/dx is the slope of the tangent to f(x).
Minima and Maxima
A maximum is a high point and a minimum is a low point.
In a smoothly changing function, a maximum or minimum is always where the function flattens out (except for a saddle point).
Where does it flatten out? Where the slope is zero.
Where is the slope zero? The Derivative tells us!
And there is an important technical point that the function must be differentiable.
A stationary point on a curve is defined as one at which the derivative vanishes i.e. a point (x0, f(x0)) is a stationary point of f(x) if [dx/df]=0.
Clearly, the derivative of the function has to go to 0 at the point of Local Maximum; otherwise, it would never attain a maximum value with respect to its neighbors.
One can see that the slope of the tangent drawn at any point on the curve i.e. dy/dx changes from a negative value to 0 to a positive value, near the point of local minima.
First, we find the points that are maxima and minima using the following steps.
- Find the derivative of the function.
- Set the derivative equal to 0 and solve for x.
- Plug the value you found for x into the function to find the corresponding y value. This is your maximum or minimum point.
But some functions are very complex we can’t find maxima and minima by solving that function is non-trivial shown below,
Gradient descent algorithm
Some functions are very complex to solve, so the computer science technique to solve those types of functions is the Gradient descent algorithm.
Gradient descent is an optimization algorithm that is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update parameters in a model. Parameters can vary according to the algorithms.
Think of a valley you would like to descend when you are blind-folded. Any sane human will take a step and look for the slope of the valley, whether it goes up or down. Once you are sure of the downward slope you will follow that and repeat the step again and again until you have descended completely (or reached the minima).
This is exactly what happens in gradient descent. The inclined and/or irregular is the cost function when it is plotted and the role of gradient descent is to provide direction and the velocity (learning rate) of the movement in order to attain the minima of the function i.e where the cost is minimum.
A gradient is a vector-valued function that represents the slope of the tangent of the graph of the function, pointing the direction of the greatest rate of increase of the function. It is a derivative that indicates the incline or the slope of the cost function.
Consider the objective function f: Rd→Rf that takes any multi-dimensional vector x=[x1,x2,…, xd] as its input. The gradient of f(x) with respect to xx is defined by the vector of partial derivatives.
each element ∂f(x)/∂xi of the gradient indicates the rate of change for f at the point x with respect to the input xi only.
f(x) gives the rates of change of ff at the point xx in all possible directions, to minimize f, we are interested in finding the direction where f can be reduced fastest.
For Linear regression, we use a cost function known as the mean squared error or MSE.
Steps for Gradient descent algorithm
- First, we take a function we would like to minimize and very frequently given the gradient, calculate the change in the parameters with the learning rate.
- Re-calculate the new gradient with the new value of the parameter.
- Repeat step 1 until convergence.
Here is the formula of the gradient descent algorithm:
Convergence is a name given to the situation where the loss function does not improve significantly, and we are stuck in a point near to the minima.
where α, alpha, is the learning rate, or how rapidly do we want to move towards the minimum. We can always overshoot if the value of α is too large.
The gradient is a vector-valued function, and as a vector, it has both a direction and a magnitude. The Gradient descent algorithm multiplies the gradient by a number (Learning rate or Step size) to determine the next point.
One of the problems is the value of the learning rate. The learning rate is a randomly chosen number that tells how far to move the weights. A small learning rate and you will take forever to reach the minima, a large learning rate and you will skip the minima.
The parameters are updated by taking the gradient of the parameters and then the learning rate is multiplied which suggests how quickly we should go towards the minimum. The learning rate is a hyper-parameter and while choosing its value you should be careful.
Typically, the value of the learning rate is chosen manually, starting with 0.1, 0.01, or 0.001 as the common values, and then adapt it whether the gradient descent is taking too long to calculate (you need to increase the learning rate), or is exploding or being erratic (you need to decrease the learning rate).
Types of Gradient Descent Algorithms
There are three variants of gradient descent based on the amount of data used to calculate the gradient:
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
Batch Gradient Descent:
Batch Gradient Descent, aka Vanilla gradient descent, calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated.
Batch gradient descent is not often used, because it represents a huge consumption of computational resources, as the entire dataset needs to remain in memory.
Stochastic Gradient Descent:
Stochastic gradient descent (SGD) performs a parameter update for each observation. So instead of looping over each observation, it just needs one to perform the parameter update. SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error rate, which can sometimes jump around instead of decreasing.
Mini-Batch Gradient Descent:
It is a combination of both bath gradient descent and stochastic gradient descent. Mini-batch gradient descent performs an update for a batch of observations. It is the algorithm of choice for neural networks, and the batch sizes are usually from 50 to 256.
Implementation of Stochastic Gradient Descent for Linear Regression
- Take the Boston data set from sklearn.
- Write the SGDRegressor from scratch.
- You don’t need to split the data into train and test, you consider whole data for this implementation.
- Get weights( coefs_ and intercept ) from your model and the MSE value.
- Don’t forget to standardize the data, and choose the appropriate learning rate.
- Train your model using SGDRegressor with the same parameters, and find the MSE on the same data.
- Compare these two results.
- You can choose any other metric other than MSE to compare them. They both should be the same.
For Linear Regression, we use a cost function known as the mean squared error or MSE.
Now we will apply partial derivative with respect to m and c and will equate it to zero to find the least value of m and c for which our cost function gets the lowest value as possible.
First, we want to import all the required Libraries
Importing the Boston data set from sklearn.
SGDRegressor for Linear Regression using sklearn Library
Optimal Weights and Intercept using sklearn SGD
Visualizing the results
Error plot for sklearn SGD
We are also implemented using without using Library to know the full code visit my GitHub link here.
comparing the Error plot for Sklearn SGD and Self Implemented SGD
As from the above, we can see that mean of the differences in the prediction of the two models is at 0 As we can see above intercept and weight(coef) is almost same for sklearn SGD and self-implemented SGD.
To understand the full code please visit my GitHub link.
I also Implemented Gradient descent using different Data sets to understand the full code please visit my GitHub link.
To Know detailed information Linear Regression please visit my previous blog link here.
- Applied AI
- Data Camp
Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…