Learning Day 57/Practical 5: Loss function — CrossEntropyLoss vs BCELoss in Pytorch; Softmax vs sigmoid; Loss calculation

De Jun Huang
3 min read · Jun 11, 2021

CrossEntropyLoss vs BCELoss

1. Difference in purpose

  • CrossEntropyLoss is mainly used for multi-class classification, although binary classification is doable too
  • BCE stands for Binary Cross Entropy and is used for binary classification
  • So why don’t we use CrossEntropyLoss for all cases for simplicity? The answer is in (3)

2. Difference in detailed implementation

  • When CrossEntropyLoss is used for binary classification, it expects 2 output features (one logit per class). E.g. logits=[-2.34, 3.45], argmax(logits) → class 1
  • When BCELoss is used for binary classification, it expects 1 output feature. E.g. 1: logit=-2.34 < 0 → class 0; E.g. 2: logit=3.45 > 0 → class 1 (a logit of 0 corresponds to a sigmoid output of 0.5, so the sign of the logit decides the class)
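
The difference in expected shapes can be sketched in a few lines. This is a minimal illustration, assuming PyTorch is installed; the variable names are made up for the example, and BCEWithLogitsLoss is used for the single-output case so that the raw logit can be passed directly.

```python
import math
import torch
import torch.nn as nn

# CrossEntropyLoss: 2 logits per sample, integer class index as target.
ce_logits = torch.tensor([[-2.34, 3.45]])    # shape (batch=1, classes=2)
ce_target = torch.tensor([1])                # int64 class index, shape (1,)
ce_pred = torch.argmax(ce_logits, dim=1).item()
ce_loss = nn.CrossEntropyLoss()(ce_logits, ce_target).item()

# BCEWithLogitsLoss: 1 logit per sample, float target with the same shape.
bce_logits = torch.tensor([[3.45]])          # shape (batch=1, 1)
bce_target = torch.tensor([[1.0]])           # float, same shape as the logits
bce_pred = int(bce_logits.item() > 0)        # sign of the logit decides the class
bce_loss = nn.BCEWithLogitsLoss()(bce_logits, bce_target).item()

print("CrossEntropyLoss: predicted class", ce_pred, "loss", ce_loss)
print("BCEWithLogitsLoss: predicted class", bce_pred, "loss", bce_loss)
```

Note the target types: a class index (integer) for CrossEntropyLoss, and a float of the same shape as the logits for BCEWithLogitsLoss.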

3. Difference in purpose (in practice)

  • A major difference: if you want the output in the form of a probability, BCE is the more natural choice.
  • That’s because we can apply sigmoid to the single output while using BCELoss as the loss function. E.g. 1: σ(-2.34)=p₁=8.79% (the probability of the output being class 1 is 8.79%, so p₀=1 − 8.79%=91.21% and it is likely to be class 0); E.g. 2: σ(3.45)=p₁=96.9% (likely to be class 1)
  • We cannot apply sigmoid to 2 output features trained with CrossEntropyLoss. σ([-2.34, 3.45])=[8.79%, 96.9%] does not make sense because the two values do not sum to 1. It does not mean p₀=8.79% (if so, p₁ would be 1 − 8.79%=91.21%). It also does not mean p₁=96.9%.
  • For CrossEntropyLoss, softmax is the more suitable way to get probability output. However, for binary classification with only 2 logits, the softmax output depends only on the difference between them (p₁=σ(z₁ − z₀)), so it saturates quickly towards something like [0.1%, 99.9%] or [99.9%, 0.1%]. E.g. softmax([-2.34, 3.45])=[0.3%, 99.7%]. These are valid probabilities, but the second output is redundant, and a single sigmoid output expresses the same thing more directly.
  • So softmax is best kept for multi-class classification.
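
The numbers quoted above can be reproduced with the standard library alone. The sketch below defines its own `sigmoid` and `softmax` helpers (they are not PyTorch functions) and also checks that a two-class softmax is the same as a sigmoid applied to the difference of the two logits, which is why it saturates so quickly here.

```python
import math

def sigmoid(z):
    """Logistic function for a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax over a list of logits, shifted by the max for stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

p_neg = sigmoid(-2.34)             # about 8.79%
p_pos = sigmoid(3.45)              # about 96.9%
probs = softmax([-2.34, 3.45])     # about [0.3%, 99.7%]

# Sigmoid applied element-wise to two logits does not sum to 1.
elementwise_sum = p_neg + p_pos

# Two-class softmax equals sigmoid of the logit difference.
same_as_sigmoid = abs(probs[1] - sigmoid(3.45 - (-2.34))) < 1e-12

print(f"sigmoid(-2.34) = {p_neg:.4f}, sigmoid(3.45) = {p_pos:.4f}")
print("softmax =", [round(p, 4) for p in probs])
print("element-wise sigmoid sum =", round(elementwise_sum, 4))
print("softmax[1] == sigmoid(difference):", same_as_sigmoid)
```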

The above leads to — Sigmoid vs softmax

  • Sigmoid at the output is more suitable for binary classification (and for multi-label classification, where each label is an independent yes/no decision).
  • Softmax at the output is more suitable for multi-class classification, where exactly one class is correct.
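
To make the multi-label remark concrete, here is a small sketch with three made-up logits. The helper functions and the 0.5 threshold are assumptions for illustration: sigmoid scores each label independently, while softmax forces the scores to compete and sum to 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]   # hypothetical outputs for 3 labels/classes

# Multi-label view: one independent probability per label, thresholded at 0.5.
label_probs = [sigmoid(z) for z in logits]
labels = [int(p > 0.5) for p in label_probs]

# Multi-class view: one distribution over classes, pick the largest.
class_probs = softmax(logits)
predicted_class = class_probs.index(max(class_probs))

print("sigmoid per label:", [round(p, 3) for p in label_probs], "->", labels)
print("softmax over classes:", [round(p, 3) for p in class_probs], "->", predicted_class)
```

With sigmoid, several labels can be switched on at once; with softmax only one class wins.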

For BCE, use BCEWithLogitsLoss(); for CE, use CrossEntropyLoss()

  • Use BCEWithLogitsLoss() instead of BCELoss(), since the former already includes a sigmoid layer, so you can pass the logits directly to the loss function. Technical reason:

This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability. (ref)

  • CrossEntropyLoss() already includes the softmax step (it combines LogSoftmax and NLLLoss), so it should also be given raw logits.
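
The stability point in the quoted note can be illustrated without PyTorch. The sketch below compares a naive "sigmoid, then log" computation with one commonly used stable rearrangement of the same formula. Both functions are hypothetical helpers written for this example, and the stable form is not claimed to be PyTorch's exact internal code.

```python
import math

def naive_bce(logit, target):
    """Sigmoid first, then binary cross entropy on the resulting probability."""
    p = 1.0 / (1.0 + math.exp(-logit))
    try:
        return -(target * math.log(p) + (1 - target) * math.log(1 - p))
    except ValueError:
        # math.log(0.0) raises: p was rounded to exactly 0.0 or 1.0
        return math.inf

def stable_bce(logit, target):
    """Same loss computed from the logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Moderate logit: both versions agree.
naive_ok = naive_bce(3.45, 1.0)
stable_ok = stable_bce(3.45, 1.0)

# Confident but wrong prediction: sigmoid(40) rounds to 1.0, so log(1 - p) breaks.
naive_bad = naive_bce(40.0, 0.0)
stable_bad = stable_bce(40.0, 0.0)

print("moderate logit:", naive_ok, stable_ok)
print("large logit:   ", naive_bad, stable_bad)
```

The naive version loses the loss value exactly where it matters most, on confident mistakes, while the logit-based form still returns a finite number.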

Loss calculation in PyTorch

  • For loss calculation in PyTorch (BCEWithLogitsLoss() or CrossEntropyLoss()) with the default reduction='mean', the loss output, loss.item(), is the average loss per sample in the loaded batch
  • So the total loss per batch = loss.item()*batch_size = loss.item()*x.size(0), and the average loss for the entire dataset is total_loss/total_size
  • This is useful when plotting train_loss and val_loss in the same plot. You want to compare the loss for the train and val sets on a meaningful scale, since the two sets usually differ in size. The average loss per sample is the most suitable.
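
import torch.nn as nn

# model and train_loader are assumed to be defined elsewhere
train_loss = 0.0
criterion = nn.BCEWithLogitsLoss()  # default reduction='mean'
for step, (x, y) in enumerate(train_loader):
    logits = model(x)
    loss = criterion(logits, y)  # y: float tensor with the same shape as logits
    # loss.item() is the batch mean, so scale by the batch size to get the batch total
    train_loss += loss.item() * x.size(0)
# average loss per sample over the whole dataset
ave_loss = train_loss / len(train_loader.dataset)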
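
Why multiply by the batch size instead of simply averaging the per-batch losses? The arithmetic can be checked with plain Python. The per-sample losses below are made-up numbers, and the `batches` helper is hypothetical; it only mimics a loader whose last batch is smaller than the others.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, the last one possibly shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 10 made-up per-sample losses, batch size 4 -> batches of 4, 4 and 2 samples.
per_sample_losses = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 3.0, 5.0]

total_loss = 0.0
batch_means = []
for batch in batches(per_sample_losses, 4):
    batch_mean = sum(batch) / len(batch)     # what loss.item() would report
    batch_means.append(batch_mean)
    total_loss += batch_mean * len(batch)    # same idea as loss.item() * x.size(0)

weighted_avg = total_loss / len(per_sample_losses)   # average loss per sample
naive_avg = sum(batch_means) / len(batch_means)      # over-weights the small last batch
true_avg = sum(per_sample_losses) / len(per_sample_losses)

print("weighted:", weighted_avg, "naive:", naive_avg, "true:", true_avg)
```

The weighted version matches the true per-sample average, while the plain mean of batch means is skewed by the short final batch.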


link1: BCEWithLogitsLoss() for probability calculation

link2: Loss calculation in pytorch

