A Walk-through of Cost Functions

Li Yin
Machine Learning for Li
4 min read · Mar 9, 2017

Mean Squared Error (MSE)

This is one of the simplest and most effective cost functions that we can use. It can also be called the quadratic cost function or sum of squared errors.

The title pretty much spells out the equation for us:
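$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

(here $y_i$ is the true value, $\hat{y}_i$ is our estimate, and $n$ is the number of examples)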

We can see from this that the difference between our estimate of y and the true value of y is taken first and then squared. The square isn’t there for no reason: it keeps every error positive and makes the cost a quadratic function of our estimate.

You may know that a quadratic function, when plotted, always has a sort of ‘u’ shape, making it convex.

This shows us that in the future when we need to use something like gradient descent, we won’t run into the major problem of getting stuck in a local optimum.

We then sum each of the results and find the average.

Let’s say we have the following dataset and want to predict the label ‘happiness_scale‘:

(columns: minutes_exercise_pday, job_satisfaction, age, happiness_scale)

We run the features through our neural network (the specifics are unimportant for now), and we get the following estimates of our labels:

These estimates look pretty wrong to me, but how wrong exactly? Let’s use the mean squared error to tell us how wrong our neural network actually is:
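As a minimal sketch in Python (using NumPy), with hypothetical happiness scores and network estimates standing in for the values shown above:

```python
import numpy as np

# Hypothetical true happiness scores and the network's (bad) first-pass estimates
y_true = np.array([7.0, 3.0, 9.0, 5.0])
y_pred = np.array([4.2, 6.1, 2.8, 8.0])

# Mean squared error: square each difference, then take the average
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.2f}")
```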

This value can then guide our gradient descent process, which will eventually reduce the cost function to its minimum.

This means that our neural network would be able to accurately predict the answers if we gave it the same data again, and hopefully predict well on data it hasn’t seen before.

Gradient descent isn’t something I want to go into in too much detail today in terms of the mathematics and how it’s performed, but I will in the near future in a separate post.

As mentioned above, if you want to learn more about gradient descent, I have provided some resources at the end of this article!

Cross Entropy

This cost function originally stems from information theory, where it concerns the transmission of bits and how many bits are lost in the process.

We can define cross entropy as a measure of the difference between two probability distributions p and q, where p is our true output and q is our estimate of that true output.

This measure carries over to our neural networks, where it is extremely effective because their outputs are so often probabilities.
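For two discrete distributions, the cross entropy is usually written as:

$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$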

We can see above that each p(x) is weighed against log q(x), which is what measures how far our estimate is from the true output.

Cross entropy works best when the outputs are normalized (forced between 0 and 1) so that they can be treated as probabilities. This normalization requirement is common among cost functions that work with probabilities.
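Here is a minimal sketch of that computation in Python, assuming p and q are already normalized probability vectors (the small epsilon simply guards against taking log(0)):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum(p * log(q)) for discrete distributions."""
    q = np.clip(q, eps, 1.0)       # avoid log(0) when an estimate is exactly zero
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])      # true (one-hot) distribution
q = np.array([0.7, 0.2, 0.1])      # the network's estimated probabilities
print(cross_entropy(p, q))         # roughly 0.36: smaller means a better estimate
```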

Kullback-Leibler (KL) Divergence

We should also note another common cost function used that is very similar to cross entropy, called KL Divergence. In fact, it’s pretty much a mutated cross entropy, and can also be referred to as relative entropy:
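$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)} = H(p, q) - H(p)$$

Here H(p, q) is the cross entropy from the previous section and H(p) is the entropy of the true distribution, which shows just how closely the two quantities are related.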

The KL divergence will still measure the difference between probability distributions p and q.

However, the difference to note is that, in information theory terms, it measures the extra number of bits needed to encode data from p when we use a code optimized for q instead.

This means that when applied to our data, the KL divergence will never be less than 0, and it is equal to 0 only if p = q. Also note that the KL divergence is not a true distance, since it is not symmetric in p and q.
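A minimal sketch in Python (again assuming normalized probability vectors) makes both properties easy to check:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) = sum(p * log(p / q)) for discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, p))  # 0.0: the divergence vanishes when the distributions match
print(kl_divergence(p, q))  # > 0
print(kl_divergence(q, p))  # a different value: not symmetric, so not a true distance
```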

Hinge Loss

The function max(0, 1-t) is called the hinge loss function. It is equal to 0 when t ≥ 1. Its derivative is -1 if t < 1 and 0 if t > 1. It is not differentiable at t = 1, but we can still use gradient descent by using any subderivative at t = 1.
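As a minimal sketch in Python, where t stands for the margin (for example, the true ±1 label multiplied by the model’s raw score):

```python
import numpy as np

def hinge_loss(t):
    """Hinge loss max(0, 1 - t); it is zero once t >= 1."""
    return np.maximum(0.0, 1.0 - t)

def hinge_subgradient(t):
    """Subderivative with respect to t: -1 where t < 1, 0 where t >= 1 (0 is a valid choice at t = 1)."""
    return np.where(t < 1.0, -1.0, 0.0)

t = np.array([-0.5, 0.0, 1.0, 2.0])
print(hinge_loss(t))         # [1.5 1.  0.  0. ]
print(hinge_subgradient(t))  # [-1. -1.  0.  0.]
```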
