Debunking loss functions in deep learning

Dmitrij Tichonov
9 min read · Oct 24, 2017


At the heart of every supervised machine learning algorithm is a training phase, and training cannot happen without an optimisation algorithm; in deep learning, the gradients that drive this optimisation are computed by backpropagation. There are other classes of algorithms that could train, or rather optimise, your model, but let's speak solely about the family of algorithms called gradient descent. Where does learning start in this whole training pipeline, and why do we even need it?

In general terms, a supervised learning procedure requires labeled sample data: features mapped to some answer. A model is what describes, or tries to predict, the mapping between features and true values. Each model has parameters which we can tweak. Depending on what values we set those parameters to, we might get an accurate model, which can in turn bring another problem, such as poor generalisation due to overfitting. Luckily, tweaking model parameters is not done manually; it happens through a process we humanly call learning, but in reality it is the effect of an optimisation algorithm that runs for many iterations and has quite a few hyperparameters. First, let's look at what optimisation is.

Optimisation

Optimisation is the task of finding extremum values of a function. We can further divide optimisation into minimisation and maximisation: in other words, finding the parameters that yield the smallest or largest value of a function. To keep it simple, let's say we need to minimise a function.

Very broadly, we would follow these steps when using gradient descent as the optimisation algorithm:

  1. Initialise W and B with some random values
  2. Compute the gradient of the function with respect to each parameter, in this case W and B
  3. Take a small step for each parameter in the direction where the function decreases, since we are minimising

Given the iterative nature of gradient descent, with enough iterations W and B should converge to values where the function takes on its smallest value. Of course, convergence depends on a few hyperparameters such as the learning rate, the number of iterations and even the initial starting point, but we are not going to talk about that here.
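The three steps above can be sketched in a few lines of NumPy for a tiny linear model. All names here (w, b, lr, the toy data) are illustrative, not taken from the article:

```python
import numpy as np

# Toy data: y = 2x + 1 with a little noise (an illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.01 * rng.normal(size=100)

# 1. Initialise W and B with some random values
w, b = rng.normal(), rng.normal()
lr = 0.1  # learning rate, one of the hyperparameters mentioned above

for _ in range(500):
    y_hat = w * x + b
    # 2. Compute the gradient of the mean squared error w.r.t. w and b
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    # 3. Take a small step downhill for each parameter
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach 2 and 1, the values used to generate the data
```

With enough iterations the parameters settle near the values that generated the data, which is exactly the convergence behaviour described above.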

How does an optimisation algorithm know how to update model parameters?

Given the input data, the correct answers and the predictions, we need some way of telling how badly we did at predicting. If we look at these items independently, the input data and correct answers are fixed data points that we cannot change during the optimisation process. The prediction, on the other hand, depends on the model, and the model depends on its parameters. So we need some measure of quality: a function that incorporates all of this data, responds to tweaks of the model parameters, and tells us whether we are doing better or worse compared to the actual answers. Through this measure of quality, the optimisation algorithm can focus specifically on the model parameters and compare how well it is doing as it tweaks them.

Introducing loss

Let's say we have just a single labeled example and denote it as x and y. During the forward pass, with model parameters W and B, we would more formally write

ŷ = f(x; W, B)

f(…) is also referred to as a hypothesis. After the model has done its magic, it gives us its prediction ŷ. Having our prediction, we want to know how well we did, and that's where learning starts. With a single labeled example, ŷ is fed into a quality-measure function to see how far away we are from the true value y. This quality measure is called a loss function and is usually denoted as L. Here is an example of a loss function:

L(ŷ, y) = (ŷ − y)²

Loss is usually defined on a single point and measures the error.

Going from loss to cost

We have our loss function, but the downside is that it works only with a single training sample. When training a deep learning algorithm we tend to work with batches of multiple examples, so let's say we have N training examples. If you combine the N individual losses into a single equation, you get a cost. The combination tends to be the average of all individual losses, expressed through the model parameters:

C(W, B) = (1/N) Σᵢ L(ŷᵢ, yᵢ)
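Putting the averaging into code, here is a minimal sketch of a cost function for a small linear model; the data and parameter values are illustrative:

```python
import numpy as np

def cost(w, b, x, y):
    """Cost: average of per-example squared-error losses, a function of W and B."""
    y_hat = w * x + b          # model prediction for every example
    losses = (y_hat - y) ** 2  # one loss per example
    return losses.mean()       # averaging makes the scale independent of N

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])   # generated by y = 2x + 1
print(cost(2.0, 1.0, x, y))     # perfect parameters -> cost is 0
print(cost(0.0, 0.0, x, y))     # bad parameters -> large cost
```

Notice the signature: x and y are passed in as fixed data, while w and b are the arguments the optimiser is free to change.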

Mathematically, the parameters of a function are the things that can be changed, the things we would like to tweak. In the case of a cost function, the parameters being optimised are the weights and biases, W and B, because that is what the optimisation algorithm actually optimises; everything else is fixed. X and Y might seem like parameters but are actually fixed during training (they are inputs to the model but not parameters of the cost function). The training algorithm can't go and change them; the only things it is allowed to update are the weights and biases.

Promise of a cost function

As you can see, there is a deep tie between the backpropagation algorithm and the cost function. The goal of backpropagation is to compute the partial derivative of the cost function with respect to any weight or bias in the model, or more generally, with respect to any model parameter. For backpropagation to work, the cost function must satisfy these assumptions:

  1. Be an average of the losses
  2. Be a function of model outputs
  3. Assign high values to incorrect predictions and low values to correct ones

Let's look at each of these individually and think about why it needs to keep that promise.

Averaging is used so that the cost function does not depend on the number of training examples; sometimes we might use 100 examples and sometimes 1,000 or even more, but the cost function's scale should not depend on that.

If the output of the model were not taken into account, there would be no way to compare correct values with predicted ones. In essence, the output of the model is a combination of the inputs and the model parameters, so without it, optimising the model parameters would not be possible.

Backpropagation works with gradients of the cost function with respect to the model parameters, so by assigning a higher value to an incorrect prediction we get a steeper gradient at that point and move downhill more quickly than if the cost did not assign error values this way. Overall, if the error surface is relatively flat, learning will be slow.

House of commons

We start off with a cost function used specifically for regression problems.

Mean Squared Error

In the context of neural networks, MSE does not require the output to be piped through an activation function; it just uses raw values. Squaring is done to emphasise large mismatches between ŷ and y.
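A minimal sketch, assuming the usual definition MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: average squared mismatch, computed on raw outputs."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.5])
print(mse(y_hat, y))  # squaring emphasises the 0.5 miss over the two 0.1 misses
```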

TODO: Show error surface plot during training X, Y, MSE

Another category of loss functions is used specifically for classification problems. The simplest case is when we have only two prediction classes, which we can express using 0 and 1, with everything in between being the probability that the prediction is actually 1. We usually use a cut-off point of 0.5, meaning that everything below 0.5 is labelled as 0 and everything above it is labelled as 1.

Binary Cross Entropy

This requires the output to be piped through a sigmoid function before going into BCE. Sigmoid works very well here, as it produces an output strictly between 0 and 1; otherwise, if the output values could reach 0 or 1, BCE could not be computed, as log 0 is not defined.
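A minimal sketch of the sigmoid-then-BCE pipeline; the raw outputs and labels are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Squash a raw model output into (0, 1) -- never exactly 0 or 1."""
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_hat, y):
    """Binary cross entropy; y_hat must lie in (0, 1) so the logs are defined."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([2.0, -1.0, 0.5])   # raw model outputs
y = np.array([1.0, 0.0, 1.0])    # true labels
print(bce(sigmoid(z), y))        # lower is better
```

Feeding the raw z values straight into bce would be meaningless (and could hit log of a non-positive number), which is why the sigmoid step is mandatory here.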

TODO: Show error surface plot during training X, Y, BCE

There is also a class of loss functions which, in the probabilistic framework, are maximised and are very closely related to binary cross entropy.

Likelihood

To compare it to binary cross entropy, the appropriate likelihood is the Bernoulli likelihood. A Bernoulli random variable has a single parameter p, which gives the probability of producing a 1. Let's write the likelihood function in terms of the Bernoulli parameter p:

L(p) = Πᵢ p^(yᵢ) (1 − p)^(1 − yᵢ)

In this case p is our predicted value, so we can rewrite the function using ŷ to look like this:

L = Πᵢ ŷᵢ^(yᵢ) (1 − ŷᵢ)^(1 − yᵢ)

We would like to maximise this function, hence the term maximum likelihood that you sometimes hear. Maximising a product is a bit more difficult, so we take the log of the function and get the log likelihood.

Log likelihood

Now you can see it looks exactly the same as our cross entropy function; the only difference is that since we are maximising the log likelihood, it does not have a negative sign at the beginning.
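A quick numerical check of this relationship, assuming the Bernoulli likelihood above (the labels and probabilities are illustrative):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])      # observed labels
y_hat = np.array([0.8, 0.3, 0.6])  # predicted probabilities, playing the role of p

# Bernoulli likelihood: product of p^y * (1-p)^(1-y) over examples
likelihood = np.prod(y_hat ** y * (1 - y_hat) ** (1 - y))

# Taking the log turns the awkward product into a sum
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The same quantity as (summed) binary cross entropy, minus the sign flip
bce_sum = -log_likelihood

print(np.log(likelihood), log_likelihood)  # these two agree
```

Maximising log_likelihood and minimising bce_sum drive the parameters to exactly the same place.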

Next, if we have several outputs from our model, each output giving the probability of a certain label, we move into the realm of multi-class classification, and yet another cost function needs to be used.

Categorical Cross Entropy

It is quite similar to BCE but has an extra summation over all output nodes, denoted in the formula as J.

This cost function works in tandem with the softmax activation function. Softmax is a generalised sigmoid for K outputs. It is required because individually the outputs would not sum to 1, so we need to normalise them, and that is what softmax does for us.
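A minimal sketch of softmax followed by categorical cross entropy on a single example; the class count and raw outputs are illustrative:

```python
import numpy as np

def softmax(z):
    """Generalised sigmoid: K raw outputs -> K probabilities summing to 1."""
    e = np.exp(z - z.max())  # subtracting the max improves numerical stability
    return e / e.sum()

def cce(y_hat, y):
    """Categorical cross entropy for one example: sum over the J output nodes."""
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # raw outputs for 3 classes
y = np.array([1.0, 0.0, 0.0])   # one-hot true label: class 0
p = softmax(z)
print(p.sum())   # ~1.0 -- softmax has normalised the outputs
print(cce(p, y))
```

Because y is one-hot, only the log-probability of the true class survives the sum, so CCE reduces to BCE when K = 2.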

TODO: Show error surface plot during training X, Y, CCE

This is of course not an exhaustive list of all cost functions in use, but it certainly covers the most common ones.

Loss cost customs

Now let's say you want to design your own cost function. Sometimes, depending on the problem you are trying to solve, this is the more sensible thing to do, so as a toy example let's try exactly that.

Let's say we have a model which predicts the price of a financial instrument, and our P/L is directly tied to the prediction of that price. We need to design a cost function such that if the future brings losses, the model should predict losses rather than profits, and if the future holds profits, it should predict profits rather than losses. At the same time, moving further away from the true value is seen as riskier behaviour, regardless of whether it's a loss or a profit.

Since this is a regression problem, we could simply use MSE as our cost function. The problem with MSE in this case is that it breaks one of our requirements. To illustrate, let's say the true return is 0.01 and compare two predictions, −0.01 and 0.03: one predicts a loss, the other a profit.

( 0.01 − (−0.01) )² = ( 0.01 − 0.03 )² = 0.0004

Here the true value is a profit, so predictions of profits should be treated differently from predictions of losses; in other words, the sign matters (negative for losses, positive for profits). Both incorrect predictions are bad from a risk perspective, but predicting −0.01, a loss, when the true value is a profit is intuitively far worse, yet MSE scores the two predictions identically.

The loss function should therefore have two parts: if the prediction and the true value have the same sign, the loss is a monotonically increasing function of the distance from the true value; if the signs differ, we should penalise the prediction much more harshly as it moves further away.

For it to be a cost function, we just need to average over all of the losses.
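One plausible way to implement such a two-part loss; the penalty multiplier is an illustrative assumption, not the exact formula used for the plots below:

```python
import numpy as np

PENALTY = 4.0  # extra multiplier when the predicted sign is wrong (illustrative)

def risk_averse_loss(y_hat, y):
    """Squared error, scaled up when prediction and truth disagree in sign."""
    base = (y - y_hat) ** 2
    wrong_sign = np.sign(y_hat) != np.sign(y)
    return np.where(wrong_sign, PENALTY * base, base)

def risk_averse_cost(y_hat, y):
    """Average the per-example losses to get a cost."""
    return np.mean(risk_averse_loss(y_hat, y))

y = np.array([0.01, 0.01])
y_hat = np.array([0.03, -0.01])    # same squared error, but the second flips the sign
print(risk_averse_loss(y_hat, y))  # the sign-flipping prediction now costs more
```

Unlike plain MSE, this sketch breaks the tie from the earlier example: the loss-predicting mistake is penalised several times harder than the profit-predicting one.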

Let's have a look at how it compares to MSE if we plot them together.

Comparing MSE to our custom loss: when the sign is crossed, our loss penalises much more heavily, whether the true value is a profit and we predict a loss, or the true value is a loss and we predict a profit.

As you can see, our designed loss function is much more risk-averse for this specific scenario than MSE. If the true value were 0.06 and we predicted −0.01, MSE would barely be increasing at that point, while our loss function would signal that the prediction was quite bad.

In effect, because we structured our loss function to be more risk-averse, the model should converge more quickly to the desired behaviour when optimised. MSE would also eventually converge to correct P/L predictions, but it would take many more iterations and expose us to unnecessary risk more often.

