Understanding Loss Functions in Computer Vision!

Sowmya Yellapragada · Published in ML Cheat Sheet · Feb 18, 2020

Computer vision is the field of computer science that focuses on automated extraction of information from digital images.

In the past decade, innovation in deep learning, the easy availability of vast amounts of data and the accessibility of GPU ($$) units have pushed the field of computer vision into the limelight. It has even started to achieve superhuman performance in some tasks such as face verification and handwritten text recognition. (In fact, automated face verification for flight boarding has become increasingly common these days.)

In recent times we have seen many innovations in network architecture, activation functions, loss functions and more in the field of computer vision.

As discussed in my previous article, loss functions play a pivotal role in a model’s performance. Choosing the right loss function can help your model learn to focus on the right set of features in the data for optimal and faster convergence.

This article particularly aims to summarize some of the important loss functions used in computer vision.

You can find PyTorch implementations of all the loss functions discussed here at this link.

Pixel-wise loss function

As the name suggests, this kind of loss function computes the loss between the prediction and the target images pixel by pixel. Most of the loss functions discussed in the previous article, such as MSE (L2) loss, MAE (L1) loss, and cross-entropy loss, can be applied between every pair of corresponding pixels in the prediction and target images.

Since these loss functions evaluate the class prediction for each pixel individually and then average over all pixels, they implicitly give equal weight to every pixel in the image. They are particularly useful in semantic segmentation, where the model needs to learn dense pixel-level predictions.

Variations of these loss functions are also used in models such as U-Net, where a weighted pixel-wise cross-entropy loss was adopted to tackle the class imbalance problem in image segmentation.

Class imbalance is a common problem in pixel-level classification tasks. It arises when the various classes in the image data are unbalanced. Since pixel-wise losses average the loss across all pixels, training can be dominated by the most prevalent class.
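As a quick illustration, here is a minimal PyTorch sketch of pixel-wise cross-entropy for segmentation, including a class-weighted variant; the tensor shapes and class weights are made up for the example.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a batch of 4 predictions over 3 classes for 64x64 images.
logits = torch.randn(4, 3, 64, 64)           # (batch, classes, H, W)
targets = torch.randint(0, 3, (4, 64, 64))   # per-pixel class indices, (batch, H, W)

# Plain pixel-wise cross-entropy: every pixel contributes equally to the loss.
pixel_ce = nn.CrossEntropyLoss()
loss = pixel_ce(logits, targets)

# Weighted variant: up-weight rare classes so the loss is not dominated
# by the most prevalent class (the weights here are illustrative only).
class_weights = torch.tensor([0.2, 1.0, 3.0])
weighted_ce = nn.CrossEntropyLoss(weight=class_weights)
weighted_loss = weighted_ce(logits, targets)
```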

Perceptual loss function

Introduced by Johnson et al. (2016), the perceptual loss function is used when comparing two images that look similar, like the same photo shifted by one pixel or the same image at different resolutions. In these cases, although the images are very similar, pixel-wise loss functions output a large error value. The perceptual loss function, on the other hand, compares high-level perceptual and semantic differences between images.

Consider an image classification network such as VGG, trained on millions of images from the ImageNet dataset. The first layers of the network tend to extract low-level features (such as lines, edges or color gradients), whereas the final convolution layers respond to more complex notions (such as specific shapes and patterns). According to Johnson et al., the low-level features captured in the first few layers are useful for comparing images that are very similar.

For example, suppose you built a network that produces a super-resolved version of an input image. During training, the target is the corresponding high-resolution image, and the goal is to compare the network's output against that target. To do this, both images are passed through a pre-trained VGG network and the outputs of the first few blocks are extracted, capturing their low-level feature information. These feature tensors can then be compared using a simple pixel-wise loss.
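A minimal sketch of this idea in PyTorch, assuming a frozen torchvision VGG16 is used as the loss network; the specific cut-off (the first two convolutional blocks, `features[:9]`) is an arbitrary choice for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen loss network: a pre-trained VGG16, truncated after its early blocks.
vgg = models.vgg16(pretrained=True).features[:9].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target):
    # pred, target: (batch, 3, H, W) images, already normalized the way VGG expects.
    # Compare low-level feature activations instead of raw pixels.
    return F.mse_loss(vgg(pred), vgg(target))
```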

Loss network pre-trained for image classification

Mathematical representation of perceptual loss

Here, V_j(Y) denotes the activations of the j-th layer of the VGG network when processing the image Y, with shape (C_j, H_j, W_j). We compare the activations for the ground-truth image Y and the predicted image Ŷ using a squared L2 loss, normalized by the size of the feature maps.
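In this notation, the per-layer perceptual loss is:

$$\mathcal{L}_j(\hat{Y}, Y) = \frac{1}{C_j H_j W_j} \left\lVert V_j(\hat{Y}) - V_j(Y) \right\rVert_2^2$$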

If you want to use multiple feature maps of the VGG network as part of your loss computation, simply sum the L_j values over the chosen layers j.

Content and style loss functions (neural style transfer)

Style transfer is the process of rendering the semantic content of an image in a different style. Given a content image (C) and a style image (S), the goal of a style transfer model is to generate an output image with the content of C and the style of S.

Here we discuss one of the simplest formulations of the content and style losses used to train such models. Many variants have been used in later research; one of them, the texture loss, is discussed in the next section.

Mathematical representation of content/style loss

It has been found that CNNs capture information about the content in the higher levels, and the lower levels are more focused on individual pixel values.

And so, we take one or more of the top layers of the CNN, compute the activation maps for the original content image (C) and the predicted output (P), and compare them with an L2 loss; this gives the content loss.

Similarly, the style loss is computed as the L2 distance between the lower-level feature maps of the predicted image (P) and the style image (S). The resulting net loss function is then a weighted combination of the two, as shown below.
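One minimal way to write this down (the exact layers and normalization constants vary across implementations; here A^h and A^l denote activation maps from a chosen higher and lower layer, respectively):

$$\mathcal{L}_{\text{content}}(C, P) = \left\lVert A^{h}(C) - A^{h}(P) \right\rVert_2^2, \qquad \mathcal{L}_{\text{style}}(S, P) = \left\lVert A^{l}(S) - A^{l}(P) \right\rVert_2^2$$

$$\mathcal{L}_{\text{total}} = \alpha\, \mathcal{L}_{\text{content}} + \beta\, \mathcal{L}_{\text{style}}$$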

alpha and beta here are hyperparameters that can be tuned.

Note: Optimizing only the style and content losses leads to highly pixelated and noisy outputs. To solve this issue, a total variation loss was introduced to ensure spatial continuity and smoothness in the generated image.

Texture loss function

The texture loss was first introduced by Gatys et al. (2016) as the style-loss component for image style transfer.

It can be seen as an improvement over the perceptual loss, adapted particularly for capturing image style. Gatys et al. found that a style representation of an image can be extracted by looking at the correlations between the feature maps of a given VGG layer, computed across spatial positions. This is done by calculating the Gram matrix.

The Gram matrix for layer l of the VGG network has entries G^l_{ij} given by the inner product of the vectorized feature maps F_i and F_j of that layer. It captures the tendency of features to co-occur in different parts of the image.
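A short PyTorch sketch of the Gram matrix computation; the normalization by the layer size is a common convention and varies between implementations.

```python
import torch

def gram_matrix(feature_map):
    # feature_map: (batch, channels, H, W) activations from one VGG layer.
    b, c, h, w = feature_map.shape
    f = feature_map.reshape(b, c, h * w)       # vectorize each feature map
    gram = torch.bmm(f, f.transpose(1, 2))     # pairwise inner products between feature maps
    return gram / (c * h * w)                  # optional normalization by layer size
```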

Mathematical representation of Texture Loss

Here, G^l and A^l are the style representations of layer l for the model output and the target image, respectively. N_l is the number of distinct feature maps in layer l and M_l is the size of each feature map (width × height). Finally, E_l is the texture loss for layer l.

The net texture loss is the weighted sum of the per-layer texture losses.
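Following Gatys et al., with w_l denoting the weight given to layer l, these can be written as:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2, \qquad \mathcal{L}_{\text{texture}}(a, x) = \sum_{l} w_l\, E_l$$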

Here a is the original image and x is the predicted image.

Note: Although the math here appears a bit complicated, texture loss is simply the perceptual loss applied to the Gram matrices of the feature maps.

Topology-aware loss function

Another interesting loss function from recent literature, introduced by Mosinska et al. (2017), is the topology-aware loss function. It can be thought of as an extension of the perceptual loss, applied to segmentation mask predictions.

Mosinska et al argued that the pixel-wise losses used in image segmentation problems, such as the cross-entropy loss, rely only on local measures and don’t account for the characteristics of the topology such as the number of connected components or holes. As a result of this, traditional segmentation models such as the U-Net tend to misclassify thin structures. This is because misclassification of a thin layer of pixels has low cost in terms of pixel-wise loss. As an improvement to the pixel-wise loss, they suggest introducing a penalty term that is based on the feature maps generated by the VGG-19 network (similar to perceptual loss), to account for the topology information.

From the paper: (c) segmentation obtained after detecting neuronal membranes using a pixel-wise loss; (d) segmentation obtained using the topology-aware loss.

This approach is also particularly useful in road segmentation from satellite imagery when there are occlusions, for example, due to trees.

Mathematical representation of topology-aware loss

Here, on the RHS, l_{m,n}(·) denotes the m-th feature map in the n-th layer of the VGG19 network, and μ is a scalar weighing the relative importance of the pixel-wise loss and the topology term.
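Up to the normalization constants used in the paper, the combined objective can be sketched as follows, with y the ground-truth mask and ŷ the prediction:

$$\mathcal{L}(\hat{y}, y) = \mathcal{L}_{\text{pixel}}(\hat{y}, y) + \mu \sum_{n}\sum_{m} \big\lVert l_{m,n}(y) - l_{m,n}(\hat{y}) \big\rVert_2^2$$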

Contrastive Losses / Triplet Losses

Triplet loss was introduced by Florian Schroff et al. in FaceNet (2015) with the purpose of building a face recognition system from a limited, small dataset (for example, face recognition systems in offices). Traditional CNN classification architectures tend to fail in such a scenario.

Schroff et al. focused on the fact that in a small sample space for face recognition, we not only have to correctly identify matching faces but also accurately differentiate two different faces. To tackle this, the FaceNet paper leverages the concept of a Siamese network.

In a Siamese network, we pass an image A through the network and transform it into a smaller representation called an embedding. Then, without updating any weights or biases of the network, we repeat the process for a different image B and extract its embedding. If image B is of the same person as image A, their embeddings should be very similar; if they are of different people, their embeddings should be very different.

To reiterate, the Siamese network aims to ensure that an image of a specific person (anchor) is closer to all other images of the same person (positive) than it is to any image of any other person (negative).

To train such a network, they introduced the triplet loss function. Consider a triplet — [anchor, positive, negative] (see image). Triplet loss is defined w.r.t these three images as follows —

  1. Define distance metric d = L2 norm
  2. Compute the distance between embeddings of anchor image and the positive image = d(a, p)
  3. Compute the distance between embeddings of anchor image and the negative image = d(a, n)
  4. Triplet loss = max(d(a, p) - d(a, n) + offset, 0)

Mathematical representation of triplet loss

Here, x^a -> anchor, x^p -> positive and x^n -> negative
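Writing f for the embedding network and α for the offset (margin) from step 4, the FaceNet triplet loss over a set of triplets is, with [z]_+ denoting max(z, 0):

$$\mathcal{L} = \sum_{i} \Big[ \big\lVert f(x_i^a) - f(x_i^p) \big\rVert_2^2 - \big\lVert f(x_i^a) - f(x_i^n) \big\rVert_2^2 + \alpha \Big]_+$$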

Note: To achieve fast convergence, it is crucial to sample the right triplet choices for loss computation. The FaceNet paper discusses two approaches to do this — Offline triplet generation and Online triplet generation. We’ll reserve a detailed discussion on this topic for some other time. But meanwhile, you can refer to the FaceNet paper.

GAN Loss

Generative Adversarial Networks (GANs), first proposed by Ian Goodfellow et al. (2014), are by far the most popular solution for image generation tasks. GANs are inspired by game theory and use an adversarial scheme so that they can be trained in an unsupervised manner.

A GAN can be treated as a two-player game, in which we pit the generator (which, say, produces a super-resolved image) against another network, the discriminator. The discriminator's task is to evaluate whether an image comes from the original dataset (a real image) or was generated by the other network (a fake image). The discriminator is updated like any other deep learning network, while the generator uses the discriminator as its loss function, meaning that the generator's loss is implicit and learned during training. For most machine learning models, convergence means the minimization of the chosen loss function on the training dataset; in a GAN, convergence would signal the end of the two-player game, so instead an equilibrium between the generator and discriminator losses is sought.

In a GAN, the generator and discriminator are the two players and take turns updating their model weights. Below we summarize some of the loss functions used for GANs.

1. Min-Max Loss function
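For reference, the standard min-max objective from Goodfellow et al., with generator G, discriminator D, data distribution p_data and noise prior p_z, is:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$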

However, in practice, it was found that this loss function saturates for the generator. That is, if the generator cannot learn as quickly as the discriminator, the discriminator wins, the game ends, and the model cannot be trained effectively.

2. Non-Saturating GAN Loss

Non-Saturating GAN Loss is a modification to the generator loss to overcome the saturation problem, with a subtle change. Instead of minimizing the log of the inverted discriminator probabilities for generated images, the generator maximizes the log of the discriminator probabilities for generated images.
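Equivalently, the generator minimizes:

$$\mathcal{L}_G = -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]$$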

3. Least squares GAN loss

Introduced by Xudong Mao et al. (2016), this loss function is especially useful when the generated images are very different from real images, a situation in which the standard GAN loss can produce very small or vanishing gradients and, in turn, little or no update to the generator.
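With the common 0/1 label choice (the paper parameterizes the target labels more generally), the least-squares objectives for the discriminator and generator are:

$$\mathcal{L}_D = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[(D(x) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[D(G(z))^2\big], \qquad \mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)) - 1)^2\big]$$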

4. Wasserstein GAN Loss

Introduced by Martin Arjovsky et al. (2017). They observed that the traditional GAN seeks to minimize the distance between the distributions of real and generated images, as measured by divergences such as the Jensen-Shannon divergence. Instead, they propose modeling the problem with the Earth-Mover's (Wasserstein) distance, which measures the distance between two probability distributions in terms of the cost of turning one distribution into the other.

A GAN using the Wasserstein loss changes the notion of the discriminator into a critic that is updated more often (e.g. five times per generator update). The critic scores images with a real value instead of predicting a probability, and its weights must be kept small. The losses are constructed so that the scores for real and fake images are pushed maximally apart. The benefit of the Wasserstein loss is that it provides a useful gradient almost everywhere, allowing for continued training of the models.
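A minimal sketch of the Wasserstein critic and generator losses in PyTorch (weight clipping, e.g. to ±0.01 as in the original paper, would be handled separately in the training loop):

```python
import torch

def critic_loss(real_scores: torch.Tensor, fake_scores: torch.Tensor) -> torch.Tensor:
    # The critic maximizes the gap between real and fake scores,
    # so we minimize the negated gap.
    return fake_scores.mean() - real_scores.mean()

def generator_loss(fake_scores: torch.Tensor) -> torch.Tensor:
    # The generator tries to push the critic's scores for its samples up.
    return -fake_scores.mean()
```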

5. Cycle Consistency Loss

Image-to-image translation is an image synthesis task that requires generating a new image as a controlled modification of a given image, for example, translating horses to zebras (or the reverse), or paintings to photographs (or the reverse).

Cycle consistency loss was introduced by Jun-Yan Zhu et al. (2018) in the context of image-to-image translation. Training a model for this task typically requires a large dataset of paired examples, which is difficult to obtain. CycleGAN is a technique that enables training without paired examples: the models are trained in an unsupervised manner using collections of images from the source and target domains that do not need to be related in any way.

The CycleGAN is an extension of the GAN architecture that involves the simultaneous training of two generator models and two discriminator models. One generator takes images from the first domain as input and outputs images for the second domain, and the other generator takes images from the second domain as input and generates images for the first domain. Discriminator models are then used to determine how plausible the generated images are and update the generator models accordingly.

Cycle consistency is the idea that an image output by the first generator could be used as input to the second generator and the output of the second generator should match the original image. The reverse is also true.

The CycleGAN encourages cycle consistency by adding an additional loss measuring the difference between the output of the second generator and the original image, and vice versa. This loss acts as a regularization term for the generator models, guiding the image generation in the new domain toward a consistent translation of the original image.
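A minimal sketch of the cycle-consistency term, assuming two generator modules G_ab (domain A to B) and G_ba (domain B to A) as placeholders; CycleGAN uses an L1 penalty for this term:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(real_a, real_b, G_ab, G_ba):
    # Translate A -> B -> A and B -> A -> B, then compare the reconstructions
    # with the original images using mean absolute error (L1).
    reconstructed_a = G_ba(G_ab(real_a))
    reconstructed_b = G_ab(G_ba(real_b))
    return F.l1_loss(reconstructed_a, real_a) + F.l1_loss(reconstructed_b, real_b)
```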

That concludes this glossary of some of the important loss functions in computer vision. Thank you for reading, and I hope you found it helpful.

Follow me on LinkedIn.

You may also reach out to me via sowmyayellapragada@gmail.com
