Gradient Descent: An Algorithm for Deep Learning Optimisation
Optimisation algorithms, used to minimise a cost function against training data, are the cornerstone of many mathematical methods, from simple curve fitting to deep machine learning.
Within the realm of machine learning, gradient descent and its variants are undoubtedly between the most popular optimisation algorithms. In this article we want to explore the mathematical foundations of gradient descent alongside the practicalities of its implementation — using sampling code running in TensorFlow. Stay tuned.
Let’s start with a typical classification problem like training a model to assign labels to data. For this we’ll use a neural network that will be trained with a set of pre-labelled data. One of the key ingredients during the supervised learning process is the cost function that has to be optimised.
The cost function for a neural network depends on the weights and biases that we want to shape, and it measures how well the model performs with respect to the training dataset. The most common cost function can be expressed as a sum of the squared differences:
where ŷ, or ‘y hat’, represents the prediction of the neural network, y^i the value from the data point i and the sum runs over m data points in the training set.
We now have to identify the weights and biases that minimise the cost function. At the beginning of the training we would have set random values for our parameters and, through a process of back propagation and iteration, we can adjust the weights and biases to reduce the value of the cost function. This is where gradient descent enters.
But before going any further, why gradient descent? Why do we need to optimise the cost function in the first place? Why not run lots of possible and random weights?
We could try out thousands of values for the weights on the cost function and see which ones work best. While this could work for one weight, once we increase the number of layers and synapses we can’t escape the problems that arise within a high-dimensional space. It becomes computationally infeasible to get the minimum value of the cost function in an acceptable amount of time.
Let’s assume that we’re working with the following neural network that has 16 weights:
To try 10,000 values for each weight would imply to run 10⁶⁴ combinations (10,000¹⁶). In a real case scenario, we would have hundreds or thousands of weights. Not very efficient right.
So, what is the gradient descent? At a theoretical level, gradient descent is a first order iterative method for finding the minimum of a function, and thus is very well suited to use with continuous and convex cost functions.
In each iteration the weights of the neural network are modified by a value proportional to the negative gradient of the cost function at the current point, to get closer to the function minima:
Where the w’ on the left side is the vector of neural network weights after a learning step and ∇J(w) is the first-order derivative, the gradient. The proportionality constant 𝜼 is normally referred to as the learning rate.
The equivalent expression for the weight of a single neuron w_j is:
Where we now have the partial derivative of the cost function.
For our simple cost function, we can perform the calculation explicitly in the simplest case of a linear transformation:
However, in most situations the gradient will be calculated numerically.
The learning rate
The learning rate will determine how many iterations will be needed to adjust the parameters with respect to the cost function, because it controls how fast the weights are updated at each interval. Smaller rates will take longer to converge to a minima. Too large a rate may lead to high fluctuations around a minima, or diverge from the optimal solution.
Non-convex error functions
A word of advice — the gradient descent works well for a convex function, but if the surface of the cost function that we’re trying to minimise is not convex, the gradient descent may be stuck in a local minima without ever reaching the global minima (this is the optimal solution for the relationship between cost function and weights and biases).
We can visualise it as a mountain range with different valleys. Independently of the number of iterations, the gradient might end up moving around the same valley, without necessarily reaching the lowest point. Or, the gradient may converge to another stationary point that we’re not interested in, a saddle point, which can be visualised as a plateau in our mountain example.
Now the fun part, let’s see how to run the gradient descent inside a neural network. For simplicity, we’ll import from scikit-learn the Iris dataset, which is perfect for testing purposes, and will split the dataset into two sets; one for training, and another for testing:
Since we’re implementing a neural network we should normalise the data so let’s transform the data with MinMaxScaler:
As good practice, we should vectorise each label (if we don’t do this the model might conclude that the labels have some kind of hierarchy):
In the following step we initialise two variables, one for the number of data points, the other for the number of labelling classes, and at this point we pre-define the two tensors as placeholders (upon execution, those placeholders will be fed with data):
We now need to pass some key parameters. These are the learning rate that we mentioned before, the number of epochs to iterate the gradient descent (the back-propagation process), a regularisation factor to avoid overfitting, and the number of nodes for each network layer.
In a real case-scenario we would have to find the optimal values of the parameters for our neural network.
After we’ve defined the tensors for the weights and the biases:
We can now build the neural network and define the cost function. The prediction (y_hat) will be computed as the output of the last fully connected layer.
As the cost function we have used the mean squared error, as described before, but in a real application of a classification problem like this one other functions like the cross entropy/softmax would be more appropriate.
The real action, and the most time-consuming element lies in the training loop:
Beyond Gradient Descent
To avoid being stuck in undesirable stationary points or dealing with computationally expensive matrixes, there are more advanced methods to run or optimise the gradient. To run the gradient let’s look at Stochastic GD and Mini Batch GD, while to see how to optimise the gradient we can use Adam (Adaptive Moment Estimation).
Stochastic Gradient Descent
SGD may be considered an approximation of the GD, where each iteration is evaluated on a single data sample instead of on the whole training set. Obviously, finding the best fit parameters based on a single sample reduces the overhead costs significantly for large datasets, and it tends to reach convergence faster. The main disadvantage is that the journey to find a global minima tends to be more erratic and noisy.
We can see mini-batch as a compromise between the simple gradient descent and the stochastic version. Instead of running just a training sample or the full dataset to minimise the error function, it uses a small subset of the data, chosen randomly. Thus taking the best of both methods.
Adam, which is not an acronym, derives its name from adaptive moment estimation. As with the above methods, it’s a first order gradient but in contrast to them it allows the learning rates of each of the network weights to be adapted following the learning process itself. It can be seen as a fusion between two other popular methods: AdaGrad (Adaptive Gradient Algorithm), well suited for sparse data, and RMSProp (Root Mean Square Propagation), which works well for online and non-stationary problems.
In this article we’ve introduced some of the key concepts of gradient descent:
- We saw how the gradient descent method minimises the cost function to ensure the convergence to a global minima,
- We worked through a mathematical example of a simple linear transformation to show how to go from a cost function to the gradient descent,
- We’ve seen that the learning rate defines how fast or slow the algorithm moves towards the optimal weights values,
- We’ve gone through a code sample that illustrates how to integrate gradient descent with a neural network.
For anyone planning to explore machine learning, having a grasp on how to minimise a cost function will be essential, and we hope that this article has helped to explain and explore some of the core ideas behind gradient descent.