Learning Day 57/Practical 5: Loss function — CrossEntropyLoss vs BCELoss in PyTorch; Softmax vs sigmoid; Loss calculation
3 min read · Jun 11, 2021
CrossEntropyLoss vs BCELoss
1. Difference in purpose
- CrossEntropyLoss is mainly used for multi-class classification, though binary classification is also possible
- BCE stands for Binary Cross Entropy, and BCELoss is used for binary classification
- So why don’t we use CrossEntropyLoss for all cases, for simplicity? The answer is in (3)
2. Difference in detailed implementation
- When CrossEntropyLoss is used for binary classification, it expects 2 output features. Eg. logits=[-2.34, 3.45], argmax(logits) → class 1
- When BCELoss is used for binary classification, it expects 1 output feature. Eg 1. logits=[-2.34] < 0 → class 0; Eg 2. logits=[3.45] > 0 → class 1
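The shape difference can be sketched with PyTorch’s built-in losses (using BCEWithLogitsLoss here so raw logits can be passed directly; the batch sizes and targets are illustrative):

```python
import torch
import torch.nn as nn

# Binary classification with CrossEntropyLoss: 2 output features per sample
ce = nn.CrossEntropyLoss()
logits_2 = torch.tensor([[-2.34, 3.45]])   # shape (batch=1, num_classes=2)
target_2 = torch.tensor([1])               # class index (long)
pred_class = logits_2.argmax(dim=1)        # tensor([1])
ce_loss = ce(logits_2, target_2)

# Binary classification with BCEWithLogitsLoss: 1 output feature per sample
bce = nn.BCEWithLogitsLoss()
logit_1 = torch.tensor([3.45])             # shape (1,), a single logit
target_1 = torch.tensor([1.0])             # float target in {0., 1.}
bce_loss = bce(logit_1, target_1)

print(pred_class.item())                   # 1 (logit > 0 → class 1)
```

Note the target types differ as well: CrossEntropyLoss takes class indices (long), while BCEWithLogitsLoss takes float targets of the same shape as the logits.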
3. Difference in purpose (in practice)
- A major difference is that if you want the output in the form of a probability, you should use BCELoss.
- That’s because we can use sigmoid to process the single output while using BCELoss as the loss function. Eg 1. σ(-2.34)=p₁=8.79% (the probability of class 1 is 8.79%, so p₀=1−8.79%=91.21%, and the sample is likely class 0); Eg 2. σ(3.45)=p₁=96.9% (likely class 1)
- We cannot use sigmoid on 2 output features while using CrossEntropyLoss as the loss function. σ([-2.34, 3.45])=[8.79%, 96.9%] does not make sense: the two values do not sum to 1. It does not mean p₀=8.79% (if so, p₁ would be 1−8.79%=91.21%), and it also does not mean p₁=96.9%.
- For CrossEntropyLoss, softmax is the more suitable method for getting a probability output. However, for binary classification with only 2 values, the output from softmax tends to be extreme, something like [0.1%, 99.9%] or [99.9%, 0.1%], because of its formula. Eg. softmax([-2.34, 3.45])=[0.3%, 99.7%]. It does not represent a meaningful probability.
- So softmax is only suitable for multi-class classification.
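The arithmetic above can be checked directly, using the same illustrative logits:

```python
import torch

logits = torch.tensor([-2.34, 3.45])

# Sigmoid on a single logit gives a well-defined class-1 probability
p1 = torch.sigmoid(torch.tensor(3.45))        # ≈ 0.969 → likely class 1
p1_low = torch.sigmoid(torch.tensor(-2.34))   # ≈ 0.0879 → likely class 0

# Applying sigmoid elementwise to TWO logits does not give a valid
# distribution: the values do not sum to 1
elementwise = torch.sigmoid(logits)
print(elementwise.sum())                      # ≈ 1.057, not 1

# Softmax normalises across the two logits so they do sum to 1;
# with 2 well-separated logits the result is pushed to the extremes
probs = torch.softmax(logits, dim=0)          # ≈ [0.003, 0.997]
print(probs)
```

With two logits, softmax depends only on their difference (3.45 − (−2.34) = 5.79 here), which is why the output is so one-sided.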
The above leads to — Sigmoid vs softmax
- Sigmoid at the output is more suitable for binary classification (and probably for multi-label classification as well).
- Softmax at the output is more suitable for multi-class classification.
For BCE, use BCEWithLogitsLoss(); for CE, use CrossEntropyLoss()
- Use BCEWithLogitsLoss() instead of BCELoss(), since the former already includes a sigmoid layer, so you can pass the logits directly to the loss function. The technical reason, from the PyTorch docs:
This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability. (ref)
- CrossEntropyLoss() likewise already includes the softmax step inside (more precisely, it combines log-softmax with a negative log-likelihood loss), so it too takes raw logits.
Loss calculation in PyTorch
- For loss calculation in PyTorch (BCEWithLogitsLoss() or CrossEntropyLoss()), the loss output, loss.item(), is the average loss per sample in the loaded batch.
- So the total loss per batch = loss × batch_size = loss.item() × x.size(0), and the average loss for the entire dataset is total_loss / total_size.
- This is useful when plotting train_loss and val_loss in the same plot. You want to compare losses for the train and val sets at a meaningful scale, since their dataset sizes usually differ, so per-sample average loss is the most suitable.
criterion = nn.BCEWithLogitsLoss()
model.train()
train_loss = 0
for step, (x, y) in enumerate(train_loader):
    logits = model(x)
    loss = criterion(logits, y)            # average loss per sample in this batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item() * x.size(0)  # accumulate total loss over samples
ave_loss = train_loss / len(train_loader.dataset)  # per-sample average over the whole dataset