Learning Day 57/Practical 5: Loss function — CrossEntropyLoss vs BCELoss in PyTorch; Softmax vs sigmoid; Loss calculation
3 min read · Jun 11, 2021
CrossEntropyLoss vs BCELoss
1. Difference in purpose
- CrossEntropyLoss is mainly used for multi-class classification, though binary classification is also possible
- BCE stands for Binary Cross Entropy, and BCELoss is used for binary classification
- So why don’t we use CrossEntropyLoss for all cases, for simplicity? The answer is in (3)
2. Difference in detailed implementation
- When CrossEntropyLoss is used for binary classification, it expects 2 output features. Eg. logits=[-2.34, 3.45], argmax(logits) → class 1
- When BCELoss is used for binary classification, it expects 1 output feature. Eg 1. logits=[-2.34] < 0 → class 0; Eg 2. logits=[3.45] > 0 → class 1
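The shape difference can be sketched with PyTorch’s built-in losses (using BCEWithLogitsLoss here so raw logits can be passed directly; the batch sizes and targets are illustrative):

```python
import torch
import torch.nn as nn

# Binary classification with CrossEntropyLoss: 2 output features per sample
ce = nn.CrossEntropyLoss()
logits_2 = torch.tensor([[-2.34, 3.45]])   # shape (batch=1, num_classes=2)
target_2 = torch.tensor([1])               # class index (long)
pred_class = logits_2.argmax(dim=1)        # tensor([1])
ce_loss = ce(logits_2, target_2)

# Binary classification with BCEWithLogitsLoss: 1 output feature per sample
bce = nn.BCEWithLogitsLoss()
logit_1 = torch.tensor([3.45])             # shape (1,), a single logit
target_1 = torch.tensor([1.0])             # float target in {0., 1.}
bce_loss = bce(logit_1, target_1)

print(pred_class.item())                   # 1 (logit > 0 → class 1)
```

Note the target types differ as well: CrossEntropyLoss takes class indices (long), while BCEWithLogitsLoss takes float targets of the same shape as the logits.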
3. Difference in purpose (in practice)
- A major difference is that if you want the output in the form of a probability, you should use BCELoss.
- That’s because we can use sigmoid to process the single output while using BCELoss as the loss function. Eg 1. σ(-2.34)=p₁=8.79% (the probability of class 1 is 8.79%, so p₀=1−8.79%=91.21%, and the sample is likely class 0); Eg 2. σ(3.45)=p₁=96.9% (likely class 1)
- We cannot use sigmoid on 2 output features while using CrossEntropyLoss as the loss function. σ([-2.34, 3.45])=[8.79%, 96.9%] does not make sense: the two values do not sum to 1. It does not mean p₀=8.79% (if so, p₁ would be 1−8.79%=91.21%), and it also does not mean p₁=96.9%.
- For CrossEntropyLoss, softmax is the more suitable method for getting a probability output. However, for binary classification with only 2 values, the output from softmax tends to be extreme, something like [0.1%, 99.9%] or [99.9%, 0.1%], because of its formula. Eg. softmax([-2.34, 3.45])=[0.3%, 99.7%]. It does not represent a meaningful probability.
- So softmax is only suitable for multi-class classification.
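The arithmetic above can be checked directly, using the same illustrative logits:

```python
import torch

logits = torch.tensor([-2.34, 3.45])

# Sigmoid on a single logit gives a well-defined class-1 probability
p1 = torch.sigmoid(torch.tensor(3.45))        # ≈ 0.969 → likely class 1
p1_low = torch.sigmoid(torch.tensor(-2.34))   # ≈ 0.0879 → likely class 0

# Applying sigmoid elementwise to TWO logits does not give a valid
# distribution: the values do not sum to 1
elementwise = torch.sigmoid(logits)
print(elementwise.sum())                      # ≈ 1.057, not 1

# Softmax normalises across the two logits so they do sum to 1;
# with 2 well-separated logits the result is pushed to the extremes
probs = torch.softmax(logits, dim=0)          # ≈ [0.003, 0.997]
print(probs)
```

With two logits, softmax depends only on their difference (3.45 − (−2.34) = 5.79 here), which is why the output is so one-sided.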
The above leads to — Sigmoid vs softmax
- Sigmoid at the output is more suitable for binary classification (and probably for multi-label classification as well).
- Softmax at the output is more suitable for multi-class classification.
For BCE, use BCEWithLogitsLoss(); for CE, use CrossEntropyLoss()
- Use BCEWithLogitsLoss() instead of BCELoss(), since the former already includes a sigmoid layer, so you can pass the logits directly to the loss function. The technical reason, from the PyTorch docs:
This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability. (ref)
- CrossEntropyLoss() likewise already includes the softmax step inside (more precisely, it combines log-softmax with a negative log-likelihood loss), so it too takes raw logits.
Loss calculation in PyTorch
- For loss calculation in PyTorch (BCEWithLogitsLoss() or CrossEntropyLoss()), the loss output, loss.item(), is the average loss per sample in the loaded batch.
- So the total loss per batch = loss × batch_size = loss.item() × x.size(0), and the average loss for the entire dataset is total_loss / total_size.
- This is useful when plotting train_loss and val_loss in the same plot. You want to compare losses for the train and val sets at a meaningful scale, since their dataset sizes usually differ, so per-sample average loss is the most suitable.
criterion = nn.BCEWithLogitsLoss()
model.train()
train_loss = 0
for step, (x, y) in enumerate(train_loader):
    logits = model(x)
    loss = criterion(logits, y)            # average loss per sample in this batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_loss += loss.item() * x.size(0)  # accumulate total loss over samples
ave_loss = train_loss / len(train_loader.dataset)  # per-sample average over the whole dataset