Optimization is a key part of almost any problem with practical applications. It always involves a function that maps each candidate solution to a single real number, so that different solutions can be compared. Such a function is known as a loss or cost function. In machine learning, you'll generally want to tune the parameters of a model so that they minimize an error rate; how the error rate is defined will depend on your methods and your application.

Gradient descent is one of the most common optimization methods employed in machine learning. Conceptually, it's very simple: choose a starting point, calculate the gradient of your loss function, and take a step in the opposite direction. Recalculate the gradient at your new location, take another step, and repeat until convergence. The gradient is essentially a multidimensional slope, and it points in the direction of steepest ascent; naturally, the negative gradient points in the direction of steepest descent. How you calculate the gradient and what size step to take are much more complicated questions, and there is a wide range of gradient descent algorithms that approach them in different ways.
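The loop described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the one-dimensional loss f(x) = (x - 3)², its gradient 2(x - 3), and the learning rate are all arbitrary choices made for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step opposite the gradient from a starting point."""
    x = start
    for _ in range(steps):
        # The negative gradient points toward steepest descent,
        # so subtracting a scaled gradient moves us downhill.
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # approaches the true minimum at x = 3
```

With a fixed step size this converges because each update shrinks the distance to the minimum by a constant factor; too large a learning rate would instead make the iterates diverge.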

Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent are the three main classes of gradient descent algorithms. These terms describe how the gradient is determined. Batch gradient descent calculates the gradient of your loss function using all of your training data. While this is true to the original formulation of the gradient descent concept, it is prohibitively slow for very large datasets.
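As a concrete sketch, here is batch gradient descent fitting a least-squares linear regression: every update computes the gradient over the entire training set. The synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the text.

```python
import numpy as np

# Synthetic regression problem: y = X @ [2, -1] plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)

w = np.zeros(2)
learning_rate = 0.1
for _ in range(500):
    # Gradient of mean squared error, computed over ALL observations.
    grad = 2 / len(X) * X.T @ (X @ w - y)
    w -= learning_rate * grad

print(np.round(w, 2))  # should approach the true weights [2, -1]
```

Note that each of the 500 updates touches all 200 rows; with millions of rows, this full pass per step is exactly the cost the text warns about.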

Stochastic gradient descent instead calculates the gradient based on only a single observation in your training set. As long as the observations are first randomly shuffled and you iterate through all of them, stochastic gradient descent can be shown to converge to the same solution as batch gradient descent. It takes a less direct path toward the minimum, requiring more iterations than batch gradient descent, but each iteration takes much less time. You may also have to reshuffle and iterate through the entire training set several times in order to reach convergence.
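The same regression problem can sketch the stochastic variant: reshuffle the data each pass and update the weights one observation at a time. The learning rate and number of passes are illustrative assumptions; in practice both are tuned.

```python
import numpy as np

# Same synthetic regression problem as before.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)

w = np.zeros(2)
learning_rate = 0.02
for epoch in range(20):                  # several passes over the data
    order = rng.permutation(len(X))      # reshuffle before each pass
    for i in order:
        # Gradient estimated from a SINGLE observation.
        grad = 2 * X[i] * (X[i] @ w - y[i])
        w -= learning_rate * grad

print(np.round(w, 2))  # should wander toward [2, -1]
```

Each update is far cheaper than a full-batch step, but the path is noisy, which is why multiple shuffled passes (epochs) are usually needed.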

Mini-batch gradient descent is a happy medium between the two. This involves randomly shuffling your training data, splitting the observations into equally sized batches, and computing the gradient one batch at a time. The batch size has to be chosen based on how many observations you have and any time constraints. As with all things in life, balance is key, and thus mini-batch gradient descent is the most commonly used form of gradient descent.
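Carrying the same regression example through, a mini-batch sketch shuffles once per epoch and steps through equal slices of the data. The batch size of 20 is an arbitrary illustrative choice.

```python
import numpy as np

# Same synthetic regression problem as before.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)

w = np.zeros(2)
learning_rate, batch_size = 0.1, 20
for epoch in range(50):
    order = rng.permutation(len(X))      # reshuffle before each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient averaged over one mini-batch of observations.
        grad = 2 / len(Xb) * Xb.T @ (Xb @ w - yb)
        w -= learning_rate * grad

print(np.round(w, 2))  # should approach [2, -1]
```

Each step is cheap like the stochastic version but far less noisy, since averaging over a batch smooths the gradient estimate; that trade-off is why this variant dominates in practice.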