# Understand Cross Entropy Loss in Minutes

There is binary cross entropy loss and multi-class cross entropy loss. Let’s talk about the multi-class cross entropy loss first, and the binary one will hopefully be an afterthought. You already know your Softmax: your model predicts a vector of probabilities such as `[0.7, 0.2, 0.1]`. The entries sum to 100% (70% + 20% + 10%), and the first entry is the most likely. Your true label says `[1, 0, 0]`: definitely a cat, not a dog at entry 2, and definitely not a bird at entry 3. So how well did your model’s probability prediction do? A little linear algebra will help, the dot product in particular! The dot product is the sum of the entry-by-entry products of two vectors. In pseudocode: `[0.7, 0.2, 0.1] dot [1, 0, 0].transpose`, a 1-by-3 vector dotted with a 3-by-1 vector, which gives a 1-by-1 result, a single number: `0.7 * 1 + 0.2 * 0 + 0.1 * 0 = 0.7`. Note that the incorrect entries always contribute zero, because anything multiplied by zero is zero. And the true label only ever has one entry equal to 1 (`[1,0,0]`, `[0,1,0]`, or `[0,0,1]`): it can only be a cat, a dog, or a bird. It can’t have it all!
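Here is a minimal sketch of that dot product in Python, using the hypothetical cat/dog/bird numbers from the example above:

```python
# Dot product of the Softmax output with the one-hot label.
probs = [0.7, 0.2, 0.1]   # Softmax output: (cat, dog, bird)
label = [1, 0, 0]         # one-hot ground truth: it's a cat

# Multiply entry-by-entry, then sum. Only the true-class
# probability survives, because the other labels are zero.
similarity = sum(p * t for p, t in zip(probs, label))
print(similarity)  # 0.7
```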

To calculate how similar two vectors are, calculate their dot product! (so say we at Uniqtech :)

Why are we doing random calculations? Aren’t we doing cross entropy loss? The dot product above is one way to measure the similarity between two vectors. In fact, it is used in movie and book recommendation systems, where it is called collaborative filtering! More about that in a different article. Cross entropy is another way to measure how good your Softmax output is, that is, how similar your Softmax output vector is to the true label vector (`[1,0,0]`, `[0,1,0]`, or `[0,0,1]`, for example, if there are only three classes). Just a reminder: Softmax is for multi-class classification tasks. For regression tasks you can use the mean squared error (MSE) loss. For multi-class classification tasks, cross entropy loss is a great candidate, and perhaps the most popular one!

See the screenshot below for the cross entropy loss formula. It is from an Udacity video, which we will also link to. But let’s go through it together for a few minutes first.

S(y) is the output of your Softmax function. It is a prediction, so we can also call it y_hat. L is the ground truth! The one-hot-encoded label of the true class: only one entry is one, the rest are zeroes. For each entry in your output vector:

- Step 1: take the log of the Softmax entry. The entry is usually less than one, so its log is negative. For example, log base 2 of 0.7 is -0.5145731728297583, and 2 to the -0.5145731728297583th power is 0.7.
- Step 2: multiply by the corresponding ground-truth entry: `log(0.7) * 1`.
- Step 3: do the same for every other entry: `log(0.2) * 0`, which is of course zero, and `log(0.1) * 0`, which is also zero.
- Step 4: because of the big sigma (summing symbol) in front of `L_i * log(S_i)`, sum all the terms up, which gives -0.5145731728297583.
- Step 5: multiply by -1, because of the big negative sign in front, turning the loss into a positive number: 0.5145731728297583.
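The steps above can be sketched in a few lines of Python. We use log base 2 here only to match the worked numbers; in practice most frameworks use the natural log:

```python
import math

probs = [0.7, 0.2, 0.1]  # Softmax output, S(y) a.k.a. y_hat
label = [1, 0, 0]        # one-hot ground truth, L

# Steps 1-3: log of each Softmax entry, weighted by the label.
# Entries where the label is 0 contribute nothing to the sum.
terms = [t * math.log2(p) for p, t in zip(probs, label)]

# Step 4: sum (the big sigma); Step 5: negate to make it positive.
loss = -sum(terms)
print(loss)  # 0.5145731728297583
```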

It’s easy to remember that the log applies only to the Softmax output: the one-hot-encoded ground truth almost always has zero entries, and we cannot take the log of zero.

Just like with mean squared error, we can take the average of the losses; see the 1/N in the screenshot below. This keeps the loss well scaled even when the number of data points is large. In the screenshot below, y is the ground truth and y_hat is the prediction. It is the same thing as above, just with a 1/N to calculate the average.
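A minimal sketch of the averaged version, over a hypothetical batch of two examples (natural log this time, as is conventional):

```python
import math

# Hypothetical batch: two predictions and their one-hot labels.
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
batch_labels = [[1, 0, 0], [0, 1, 0]]

def cross_entropy(probs, label):
    # -sum(L_i * log(S_i)) for a single example
    return -sum(t * math.log(p) for p, t in zip(probs, label))

# The 1/N: sum the per-example losses, divide by the batch size.
losses = [cross_entropy(p, t) for p, t in zip(batch_probs, batch_labels)]
avg_loss = sum(losses) / len(losses)
```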

Binary Cross Entropy aka Log Loss!

Binary cross entropy loss looks more complicated, but it is actually easy if you think of it the right way. Remember there can only be two states of the world in binary classification: either y, the ground truth, is one, or it is zero.

If y=0, the first term is zero, so we only calculate `(1 - y) * log(1 - y_hat)`, and `(1 - y)` is 1! If y=1, the second term is zero, and only the first term takes effect.
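That two-term formula, `-(y * log(y_hat) + (1 - y) * log(1 - y_hat))`, can be sketched directly (the probabilities 0.9 below are made-up examples):

```python
import math

def binary_cross_entropy(y, y_hat):
    # y is the 0/1 ground truth; y_hat is the predicted
    # probability of class 1. Exactly one term is nonzero.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))  # small loss: confident and correct
print(binary_cross_entropy(0, 0.9))  # large loss: confident and wrong
```

Notice how the loss explodes when the model is confidently wrong; that is exactly the behavior we want from a loss function.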

Where does the cross entropy function fit in your deep learning pipeline? It sits right after the Softmax function: it takes the Softmax output and the true label as its inputs.

Now that you know a lot about Cross Entropy Loss you can easily understand this video below by a Google Deep Learning practitioner.

Remember, the goal of cross entropy loss is to measure how well the probability distribution output by Softmax matches the one-hot-encoded ground truth label of the data. One-hot encoded just means that in each column vector only one entry is 1 and the rest are zeroes. It is usually used to encode categorical data where the classes are mutually exclusive: an object cannot simultaneously be a cat, a dog, and a bird, for example.
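One-hot encoding is simple enough to sketch in a couple of lines, using the same made-up cat/dog/bird classes:

```python
classes = ["cat", "dog", "bird"]

def one_hot(name):
    # 1 at the position of the true class, 0 everywhere else
    return [1 if c == name else 0 for c in classes]

print(one_hot("dog"))  # [0, 1, 0]
```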

Follow us for more articles like this!

Why the log in the formula? Intuition tells us that to measure the similarity between two vectors we can take their dot product. The cross entropy formula, though, is rooted in information theory: it measures how efficiently information can be encoded, specifically by a code that gives more frequent messages fewer bits. It turns out that the number of bits for a message with probability y is calculated with a log: log(1/y).
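A quick sketch of that `log(1/y)` idea, with a few illustrative probabilities:

```python
import math

# Bits needed to encode an event with probability p: log2(1/p).
# Frequent (high-probability) events need few bits; rare events need many.
for p in (0.5, 0.25, 0.01):
    print(p, math.log2(1 / p))
```

A coin flip (p = 0.5) costs exactly 1 bit, while a 1-in-100 event costs about 6.6 bits, which is why being surprised (assigning low probability to the true class) is penalized so heavily.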


## Data Science Bootcamp

#### Introductory Data Science, Machine Learning and Artificial Intelligence for Bootcamp and Nanodegree Graduates. By a bootcamp grad for bootcamp grads.
