About Gradient Descent

Optimization is a key part of almost any problem with practical applications. It always involves a function that outputs a single real number, which lets you compare candidate solutions. Such a function is known as a loss or cost function. In machine learning, you’ll generally want to tune the parameters of a model so that they minimize some measure of error. How that error is defined will depend on your methods and application.

Gradient descent is one of the most common optimization methods employed in machine learning. Conceptually it’s very simple. Choose a starting point, calculate the gradient of your loss function, and take a step in the opposite direction. Recalculate the gradient at your new location, take another step, and repeat until convergence. The gradient is essentially a multidimensional slope, and it points in the direction of steepest ascent. Naturally, the negative gradient points in the direction of steepest descent. How you calculate the gradient and what size of step to take are much more complicated questions, and there is a wide range of gradient descent algorithms that approach them in different ways.
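
To make that loop concrete, here is a minimal sketch of plain gradient descent in NumPy, fitting the slope of a one-dimensional linear model. The data, learning rate, and stopping rule are all illustrative choices, not anything prescribed by a particular library.

```python
import numpy as np

# Toy problem: fit the slope w of y = w * x by minimizing mean squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # the "true" slope is 3.0

def gradient(w):
    # d/dw of mean((w*x - y)^2) = 2 * mean((w*x - y) * x)
    return 2.0 * np.mean((w * x - y) * x)

w = 0.0        # starting point
eta = 0.1      # learning rate (step size)
for _ in range(1000):
    g = gradient(w)
    if abs(g) < 1e-8:      # crude convergence check
        break
    w -= eta * g           # step in the direction opposite the gradient

print(w)   # ends up close to 3.0
```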

Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent are the three main classes of gradient descent algorithms. The names refer to how the gradient is estimated. Batch gradient descent calculates the gradient of your loss function using all of your training data. While this is true to the original formulation of gradient descent, it can be prohibitively slow for very large datasets.

Stochastic gradient descent instead calculates the gradient from a single observation at a time. As long as the observations are first randomly shuffled, you iterate through all of them, and the learning rate is decreased appropriately, stochastic gradient descent can be shown to converge to essentially the same solution as batch gradient descent. It takes a less direct path toward the minimum, requiring more iterations than batch gradient descent, but each iteration is much cheaper. You may also have to reshuffle and iterate through the entire training set several times, i.e. run several epochs, in order to reach convergence.

Mini-batch gradient descent is a happy medium between the two. This involves randomly shuffling your training data, splitting the observations into equally sized batches, and computing each gradient from one batch at a time. The batch size has to be chosen based on how many observations you have and any time constraints. As with all things in life, balance is key, and so mini-batch gradient descent is the most commonly used of the three.
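
The three variants really only differ in how many observations feed each gradient estimate. The sketch below makes that explicit on the same toy problem as before, with the batch size selecting batch, stochastic, or mini-batch behavior; the function name and hyperparameters are illustrative, not from any particular library.

```python
import numpy as np

def fit(x, y, batch_size, eta=0.05, epochs=50, seed=0):
    """Fit the slope of y = w * x with (mini-)batch gradient descent.

    batch_size = len(x)      -> batch gradient descent
    batch_size = 1           -> stochastic gradient descent
    1 < batch_size < len(x)  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    w = 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            grad = 2.0 * np.mean((w * xb - yb) * xb)
            w -= eta * grad
    return w

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

print(fit(x, y, batch_size=len(x)))  # batch
print(fit(x, y, batch_size=1))       # stochastic
print(fit(x, y, batch_size=32))      # mini-batch; all three land near 3.0
```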

The learning rate, usually represented as η or α, is essentially how large of a step you take on each iteration of gradient descent. Too large a learning rate and you may end up overshooting your minimum, although a somewhat large learning rate can be useful for jumping out of shallow local minima. Too small a learning rate and the algorithm will take prohibitively long to converge. Smart implementations of gradient descent adjust the learning rate between iterations to balance convergence speed and accuracy. A few commonly used adaptive learning rate algorithms include Adagrad, Adadelta, RMSProp, and Adam. While I won’t (or can’t) go into detail about how these algorithms work, it may prove useful to recognize what they are for.
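
To give a flavor of what “adaptive” means, here is a rough sketch of an RMSProp-style update in NumPy: a running average of squared gradients rescales the step for each parameter. The decay rate, epsilon, and toy loss are conventional or illustrative choices, not the exact defaults of any particular library.

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, decay=0.9, eps=1e-8, steps=500):
    w = np.asarray(w0, dtype=float)
    cache = np.zeros_like(w)            # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        cache = decay * cache + (1.0 - decay) * g ** 2
        # Parameters with a history of large gradients get a smaller effective step.
        w -= eta * g / (np.sqrt(cache) + eps)
    return w

# Toy quadratic loss f(w) = 100 * w[0]^2 + w[1]^2, so raw gradients differ by 100x.
grad = lambda w: np.array([200.0 * w[0], 2.0 * w[1]])
print(rmsprop(grad, w0=[1.0, 1.0]))     # both coordinates head toward 0
```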

If you have worked at all with machine learning, you have probably used gradient descent without implementing any of it yourself. Gradient descent algorithms are a key part of most model-fitting functions. Scikit-learn exposes stochastic gradient descent directly through estimators like SGDClassifier, which fit linear models (including linear SVMs) that way. XGBoost (extreme gradient boosting) fits its models with gradient boosting, which can itself be viewed as a form of gradient descent in function space. TensorFlow uses back-propagation to compute the gradient of the loss with respect to a neural network’s weights, and then uses a gradient descent variant to tune those weights iteratively. Gradient descent is so commonly used in machine learning that it is really crucial for any data scientist to at least have a high-level understanding of how it works.
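
As a quick illustration of the scikit-learn case, here is a hinge-loss (linear SVM style) classifier fit by stochastic gradient descent on synthetic data; the dataset and hyperparameters are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic data just for demonstration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hinge loss + SGD is essentially a linear SVM trained by gradient descent.
clf = SGDClassifier(loss="hinge", learning_rate="optimal", max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```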

P.S. — Gradient descent’s convergence rate is highly dependent on the scaling of your features. If a model is taking excessively long to fit, try scaling all your features.
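
In scikit-learn, for example, that can be as simple as putting a StandardScaler in front of the estimator; the pipeline below is just a sketch on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before fitting,
# which generally helps gradient descent converge faster.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge"))
model.fit(X, y)
```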