Binary Crossentropy at its core!

It is a loss function that is widely used in Deep Learning, but the sad part is that most resources only tell you the name of the function & maybe the situation in which it can be used; no one explains what this function actually is, when it should be used in reality, & how it works internally. This blog aims to explain Binary CrossEntropy in complete depth, covering every formula & concept used in it.

Harshit Dawar
Analytics Vidhya
5 min read · Oct 4, 2020


Source: Unsplash via Shahadat Rahman

Binary Crossentropy is the loss function used when a classification problem involves only 2 categories.

It is self-explanatory from the name: Binary means 2 quantities, which is why the function is constructed in a way that fits the problem of classifying between 2 categories.

Before starting on the internal working of this loss/cost/error function, I would suggest you read my blog on the significance of Mean Squared Error (link just below this paragraph); it helps you build a more fundamental base for understanding how this function works.

Role of Binary CrossEntropy!

It is a loss/cost/error function that works when the problem has discrete outputs; more specifically, when there are only 2 discrete categories, this function is the best choice in Deep Learning.

Internal Working of Binary Crossentropy!

Since this function is used with discrete quantities, the Probability Mass Function (PMF), which returns a probability, is used in this scenario, instead of the Probability Density Function (PDF), which returns a density and applies to the continuous values present when Mean Squared Error is used (as mentioned in the blog above).

PMF used in this function is represented by the equation given below:

PMF for Binary CrossEntropy! [Image by Author]
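For readers who cannot view the image, the equation shown is presumably the Bernoulli PMF (the standard form, reconstructed here from the surrounding text), where x takes the value 0 or 1 and μ is the probability of the outcome 1:

```latex
P(x \mid \mu) = \mu^{x}\,(1-\mu)^{1-x}, \qquad x \in \{0, 1\}
```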

In the equation above, x is constant because it is already present in the data, & μ is the variable.

Therefore the likelihood (our desire, in this case the probability of each record falling into its category, which we want to maximize; for more information refer to the blog mentioned above) can be represented from the PMF as:

Likelihood (product of all x in PMF)! [Image by Author]
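Assuming N independent records x(1), ..., x(N), the likelihood shown in the image should be the product of the PMF over all records:

```latex
L(\mu) = \prod_{i=1}^{N} \mu^{x_i}\,(1-\mu)^{1-x_i}
```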

Now, to simplify the calculations, we take the log of this function, because minimizing/maximizing using derivatives then becomes easy. Taking the log before processing is allowed because log is a monotonically increasing function, so the location of the maximum does not change; it simply makes the working easier.

Log-Likelihood will be:

Log-Likelihood! [Image by Author]
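Since the log of a product is the sum of the logs, the log-likelihood shown in the image works out to:

```latex
\log L(\mu) = \sum_{i=1}^{N} \Big[\, x_i \log \mu + (1-x_i) \log (1-\mu) \,\Big]
```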

Since we want to maximize our desire, i.e. the probability of each record falling into a specific category, the value of μ that achieves this has to be found from the above log-likelihood equation.

In order to maximize the likelihood, our old friend calculus will help us here; I believe you all remember that the way to find the maximum value is by taking the derivative & setting it to 0. Therefore let's proceed: after taking the partial derivative of the above log-likelihood function with respect to μ and setting it to 0, the output is:

Image by Author!
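Reconstructing the step in the image: differentiating the log-likelihood with respect to μ, setting the result to 0, and solving gives the sample mean of the x values:

```latex
\frac{\partial \log L}{\partial \mu}
= \sum_{i=1}^{N} \left[ \frac{x_i}{\mu} - \frac{1-x_i}{1-\mu} \right] = 0
\quad\Longrightarrow\quad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```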

The above equation resembles the mean of all the x values, but in reality it computes a probability, because each x(i) has the value either 1 or 0. For example, in a coin toss, if we are looking for heads, then x(i) will be 1 if a head appears, otherwise 0.

In this way, the above equation calculates the actual probability of the desired outcome across all the events.

Now, there is one very important concept: maximizing the likelihood & minimizing the negative log-likelihood (which represents the error between the prediction & the actual value) are exactly the same thing.

Therefore the negative log-likelihood will be:

Negative Log-Likelihood! [Image by Author]
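This is simply the log-likelihood above with its sign flipped:

```latex
-\log L(\mu) = -\sum_{i=1}^{N} \Big[\, x_i \log \mu + (1-x_i) \log (1-\mu) \,\Big]
```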

To turn the above function into Binary Crossentropy, only 2 variables have to be renamed: μ becomes y_pred(i), the class corresponding to the maximum probability for record i (the class into which y(i) is classified based on the maximum probability); it plays exactly the same role as μ in the equation above. The second change is that x(i) is written as y(i), the actual value in the data. Binary Crossentropy then minimizes this error.

Therefore now we have:

Image by Author!
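Writing y_pred(i) as ŷ(i), the renamed function in the image presumably reads:

```latex
-\sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \,\Big]
```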

In reality, the average error is taken into consideration; dividing the above function by the number of samples gives the real Binary Crossentropy cost/error/loss function.

Finally, the Binary Crossentropy function:

Binary Cross-Entropy Function! [Image by Author]
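Written out, with ŷ(i) denoting y_pred(i), the final averaged form is:

```latex
\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \,\Big]
```

As a quick sanity check, here is a minimal NumPy sketch of this formula (the function name bce and the clipping constant eps are my own choices, added to avoid log(0); they are not from the derivation above). It should agree closely with library implementations such as Keras' BinaryCrossentropy:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Binary Crossentropy: the mean negative log-likelihood over all samples."""
    # Clip predictions away from exactly 0 or 1 to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # actual labels y(i)
y_pred = np.array([0.9, 0.2, 0.8, 0.6])   # predicted probabilities y_pred(i)
print(bce(y_true, y_pred))                # ~0.2656
```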

This concludes the explanation & internal working of Binary CrossEntropy, a very important loss function in Deep Learning.

I hope my article explains everything related to the topic with all the deep concepts & explanations. Thank you so much for investing your time in reading my blog & boosting your knowledge. If you like my work, then I request you to give this blog a round of applause!

