Deep learning losses for classification & probabilities

Alvaro Durán Tovar
Deep Learning made easy
3 min read · Dec 1, 2020

A unified view


Classification & Probability

Classification & probability? Why mix both? Well, because classification can indeed be thought of as fitting a probability distribution, so bear with me. I'll comment a bit on how to fit distributions and then relate that to classification.

Probability

The loss used for fitting a probability distribution is the negative log likelihood, which I explain in more detail here. Given a distribution defined by some parameters (obtained from the output of a NN in this case), it asks: what is the probability of observing the target/label Y? The parameters are then updated through backpropagation in the direction that increases the probability of observing Y under the updated parameters.

The observations Y obviously won't change. Since we want something that fits the process generating those observed Y as well as possible (for future inference), we change the parameters to increase the likelihood of obtaining those observations if we were to sample from the probability distribution.

One last time: we want a probability distribution, parameterized from past data, that predicts some future data Y given some input X.

Each probability distribution has a different likelihood function, so I'm not going to go into more detail here, but we can reduce the general case to the following, where "p" is the likelihood function of the distribution (I'm omitting its parameters; some examples are Gaussian, categorical, Poisson…):
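In symbols (the standard negative log likelihood), for N observations y₁ … y_N:

$$\text{NLL} = -\sum_{i=1}^{N} \log p(y_i)$$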

And we are lucky, because deep learning frameworks already provide them!
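As a minimal sketch (assuming PyTorch; the toy data and hyperparameters are made up for illustration), here is how you could fit a Gaussian by minimizing the negative log likelihood that torch.distributions gives you:

```python
import torch
from torch.distributions import Normal

# Toy observations drawn from a Gaussian we pretend not to know (mean 3, std 2).
data = 3.0 + 2.0 * torch.randn(1000)

# Parameters we want to learn (std is kept positive via exp of a log-std).
mu = torch.zeros(1, requires_grad=True)
log_std = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_std], lr=0.05)

for _ in range(500):
    dist = Normal(mu, log_std.exp())
    loss = -dist.log_prob(data).mean()   # negative log likelihood of the observations
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_std.exp().item())   # should end up close to 3.0 and 2.0
```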

Classification

Classification can be seen as using probability distributions: for the binary case we have the Bernoulli distribution, for multi-class we have the categorical distribution, and you can think of the binary case as a special case of multi-class classification.
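A quick way to convince yourself of that last point (a sketch assuming PyTorch's torch.distributions): a Bernoulli over {0, 1} gives the same log probabilities as a two-class categorical.

```python
import torch
from torch.distributions import Bernoulli, Categorical

p = torch.tensor(0.7)                                    # probability of class "1"
y = torch.tensor(1.0)                                    # an observed label

bernoulli = Bernoulli(probs=p)
two_class = Categorical(probs=torch.stack([1 - p, p]))   # classes 0 and 1

print(bernoulli.log_prob(y))         # log(0.7)
print(two_class.log_prob(y.long()))  # same value
```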

If we have a probability distribution, we know we can fit it with the negative log likelihood, as mentioned above.
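To make that concrete, here is a minimal training-step sketch (assuming PyTorch; the layer sizes and dummy batch are just placeholders) where the network's output parameterizes a categorical distribution and the loss is its negative log likelihood:

```python
import torch
from torch.distributions import Categorical

model = torch.nn.Linear(10, 3)              # 10 features -> 3 classes (made-up sizes)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 10)                     # dummy batch of inputs
y = torch.randint(0, 3, (32,))              # dummy integer labels

logits = model(x)                           # the NN output defines the distribution
dist = Categorical(logits=logits)

loss = -dist.log_prob(y).mean()             # negative log likelihood of the labels
opt.zero_grad()
loss.backward()
opt.step()
```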

You probably also already know about cross entropy and binary cross entropy. More weird terms! Why?! :’(… Don’t worry: although cross entropy and NLL aren’t the same thing in general, for our case they are (very well explained here), or better said, the math formulas simplify to the exact same thing in the end, therefore cross entropy = binary cross entropy = NLL. They just expose different APIs, but underneath they are the same. Don’t you believe me? Take a look at the following code:
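Something along these lines (a minimal sketch assuming PyTorch's torch.nn.functional; the numbers are arbitrary):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Bernoulli

torch.manual_seed(0)

# Multi-class: cross entropy on logits == NLL on log-softmax outputs.
logits = torch.randn(4, 3)                  # batch of 4, 3 classes
targets = torch.tensor([0, 2, 1, 2])
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))              # True

# Binary: binary cross entropy == Bernoulli negative log likelihood.
logit = torch.randn(4)
y = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce = F.binary_cross_entropy_with_logits(logit, y)
bernoulli_nll = -Bernoulli(logits=logit).log_prob(y).mean()
print(torch.allclose(bce, bernoulli_nll))   # True
```

In practice you'd call F.cross_entropy or F.binary_cross_entropy_with_logits directly, since they take raw logits and handle the numerics for you.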

Why so many options for doing the same thing? Because depending on the use case you'll want one or the other. Binary or multi-class? Raw logits or normalized probabilities? …

Why the “negative” in “negative log likelihood”? Because the frameworks usually only implement minimization, so we maximize the likelihood by minimizing its negative.
