An Overview of Deep Learning for Major Computer Vision Tasks

Michael Avendi · Published in How to AI · Jul 16, 2020

Master Computer Vision With PyTorch

Computer vision (CV) has been disrupted by deep learning and convolutional neural networks (CNNs) in recent years. You can now implement many CV algorithms fairly quickly using deep learning libraries such as PyTorch, TensorFlow, and Keras. In this post, I will provide an overview of major CV tasks and applications. You can find more details on how to solve such problems using deep learning and PyTorch in my book, PyTorch Computer Vision Cookbook. All of the book's implementation scripts are also available at this link.

The outline of this post is as follows:

  1. Image Classification
  2. Object Detection
  3. Image Segmentation
  4. Style Transfer
  5. GANs
  6. Video Classification

Image Classification

Image classification (also called image recognition) is probably the most widely used task in computer vision. In this task, we assume that each image contains one main object, and we want to automatically classify that object into one of a set of pre-defined categories. In the context of image recognition, you will encounter binary classification and multi-class classification.

Binary Classification

In binary image classification, we want to classify images into one of two categories. For instance, we may want to know whether a medical image is normal or malignant. Thus, you can assign label=0 to a normal image and label=1 to a malignant image. That is why it is called binary classification. In the following, you can see an example of binary image classification for histopathologic cancer image patches.

Sample image patches of Histopathologic Cancer and their binary labels

Typically, there are thousands of these patches per patient, and clinicians have to go through them one by one. Just imagine how beneficial an automatic tool that quickly labels thousands of such images would be to clinicians.

Deep learning models based on CNNs are currently the state of the art for solving such problems. A block diagram of a CNN model is shown in the following figure.

You can learn to develop and train a binary image classification model using PyTorch and deep learning in Chapter 2 of my book.
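As a rough illustration, here is a minimal sketch of a small CNN for binary patch classification in PyTorch. The layer sizes and the 96x96 input resolution are assumptions chosen for illustration, not the exact architecture from the book.

```python
# A minimal sketch of a CNN for binary patch classification.
# The 96x96 input size and layer widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(64 * 12 * 12, 1)  # 96 -> 48 -> 24 -> 12 after three poolings

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.flatten(1)
        return self.fc(x)  # raw logit; pair with BCEWithLogitsLoss during training

model = BinaryCNN()
logit = model(torch.randn(4, 3, 96, 96))  # a batch of 4 dummy patches
prob = torch.sigmoid(logit)               # probability of label=1 (malignant)
```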

Multi-class Classification

On the other hand, the goal of multi-class image classification is to automatically assign a label to an image from a fixed set of more than two categories. Again, the assumption here is that the image contains a dominant object. For instance, the following figure shows a few samples from a dataset with 10 categories.

We may assign label 5 to dogs, label 2 to cars, label 0 to airplanes, and label 9 to trucks. As you may note, there may be more than one object in an image; however, the label corresponds to the dominant object.

This task also has many applications in industry, from autonomous vehicles to medical imaging, wherever objects in images need to be identified automatically. In Chapter 3 of my book, you can learn to fine-tune a pre-trained resnet18 model for multi-class image classification using PyTorch.
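As a minimal sketch of this idea, the snippet below loads a pre-trained resnet18 from torchvision and replaces its final layer for a 10-class problem. The number of classes, batch size, and optimizer settings are illustrative assumptions.

```python
# A minimal sketch of fine-tuning a pre-trained resnet18 for 10 classes.
# Hyperparameters and the dummy batch are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10
model = models.resnet18(pretrained=True)                  # load ImageNet weights
model.fc = nn.Linear(model.fc.in_features, num_classes)   # replace the classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# one illustrative training step on a dummy batch
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```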

Object Detection

Object detection is the process of finding locations of specific objects in images. Similar to image classification, depending on the number of objects in images, we may deal with single-object or multi-object detection problems.

Single-object Detection

In single-object detection, we are interested in finding the location of an object in a given image. In other words, we know the class of the object and only want to locate it in the image. The location of the object can be defined by a bounding box, that is, four numbers specifying the coordinates of its top-left and bottom-right corners.

As an example, the following image depicts the location of the fovea (a small pit) in an eye image using a green bounding box:

This task can be formulated as a regression problem: a CNN model predicts the numbers that define the bounding box (two corner points, i.e., four numbers), as shown in the following figure.

You can learn to develop a single-object detection model in PyTorch from Chapter 4 of my book.
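The sketch below shows the regression formulation: a backbone CNN with a four-number output head, trained against ground-truth box coordinates. The choice of a resnet18 backbone and the Smooth L1 loss are assumptions for illustration, not necessarily the book's exact model.

```python
# A minimal sketch of single-object detection as bounding-box regression:
# a resnet18 backbone with a 4-number head (x1, y1, x2, y2).
# Backbone, loss, and image size are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class BoxRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # predict the two corners
        self.model = backbone

    def forward(self, x):
        return self.model(x)

model = BoxRegressor()
images = torch.randn(2, 3, 256, 256)
targets = torch.tensor([[20., 30., 120., 140.],
                        [50., 60., 200., 210.]])   # ground-truth corner coordinates
loss = nn.SmoothL1Loss()(model(images), targets)   # a common regression loss for boxes
```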

Multi-object Detection

On the other hand, multi-object detection is the process of locating and classifying the objects present in an image. In other words, it is a simultaneous classification and localization task. Identified objects are shown with bounding boxes in the image, as shown in the following figure.

As you can see, each object is identified and labeled with a category label and located by a bounding box.

Two families of methods for general object detection are region proposal-based approaches and regression/classification-based approaches. A popular regression/classification-based approach named YOLOv3 is shown in the following figure.

In Chapter 5 of my book, you can learn to develop the YOLOv3 algorithm for object detection using PyTorch.
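To get a quick feel for the task's output format, here is a minimal sketch that runs inference with torchvision's pre-trained Faster R-CNN, a region proposal-based detector. This is only an illustration of multi-object detection; it is not the YOLOv3 model developed in the book.

```python
# A minimal sketch of multi-object detection inference with torchvision's
# pre-trained Faster R-CNN (region proposal-based), used here purely to
# illustrate the boxes/labels/scores output; it is not YOLOv3.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)            # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])           # a list with one dict per input image

# each prediction contains bounding boxes, class labels, and confidence scores
boxes = predictions[0]["boxes"]
labels = predictions[0]["labels"]
scores = predictions[0]["scores"]
```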

Image Segmentation

Object segmentation is the process of finding the boundaries of target objects in images. There are many applications for segmenting objects in images. As an example, by outlining anatomical objects in medical images, clinical experts can learn useful information about patients.

Depending on the number of objects in images, we can deal with single-object or multi-object segmentation tasks.

Single-object Segmentation

In single-object segmentation, we are interested in automatically outlining the boundary of one target object in an image. The boundary of the object is usually defined by a binary mask. From the binary mask, we can overlay a contour on the image to outline the object boundary. As an example, the following figure depicts an ultrasound image of a fetus, a binary mask corresponding to the fetal head, and the segmentation of the fetal head overlaid on the ultrasound image:

The goal of automatic single-object segmentation is to predict a binary mask given an image. Again, CNN models can be designed in an encoder-decoder form to solve this task. A block diagram of an encoder-decoder is shown in the following figure.

In Chapter 6 of my book, you will learn to implement an encoder-decoder architecture for single-object segmentation using PyTorch.
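As a rough sketch of the encoder-decoder idea, the snippet below downsamples a grayscale image with convolutions and pooling, then upsamples back to the input resolution to predict a one-channel mask. The layer sizes and 128x128 input are assumptions for illustration, not the book's exact architecture.

```python
# A minimal sketch of an encoder-decoder for single-object (binary) segmentation,
# predicting a one-channel mask the same size as the input.
# Layer widths and the 128x128 input size are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                     # downsample by 2
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                     # downsample by 2 again
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),  # upsample by 2
            nn.ConvTranspose2d(16, 1, 2, stride=2),              # upsample to input size
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))                     # raw logits for the mask

model = EncoderDecoder()
ultrasound = torch.randn(1, 1, 128, 128)                         # grayscale dummy input
mask_logits = model(ultrasound)                                  # shape: (1, 1, 128, 128)
loss = nn.BCEWithLogitsLoss()(mask_logits, torch.zeros_like(mask_logits))
```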

Multi-object Segmentation

On the other hand, in multi-object segmentation, we are interested in automatically outlining the boundaries of multiple target objects in an image. The boundaries of the objects are usually defined by a segmentation mask that is the same size as the image. In the segmentation mask, all the pixels that belong to a target object are assigned the same pre-defined label. For instance, in the following figure, you can see a sample image with two types of targets: babies and chairs.

The corresponding segmentation mask is shown in the middle of the figure. As we can see, the pixels belonging to the babies and chairs are labeled differently and colored in yellow and green, respectively.

The goal of multi-object segmentation is to predict a segmentation mask given an image, such that each pixel in the image is labeled with its object class. In Chapter 7 of my book, you will learn to develop a multi-object segmentation algorithm using PyTorch.
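To illustrate the per-pixel class prediction, here is a minimal sketch that runs torchvision's pre-trained DeepLabV3 and takes the argmax over class channels to get a mask. This is just an off-the-shelf example of semantic segmentation, not the exact model developed in the book.

```python
# A minimal sketch of multi-object (semantic) segmentation inference with
# torchvision's pre-trained DeepLabV3; used only to illustrate per-pixel labels.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(pretrained=True)   # pre-trained on 21 Pascal VOC categories
model.eval()

image = torch.rand(1, 3, 320, 320)            # a dummy RGB image
with torch.no_grad():
    logits = model(image)["out"]              # shape: (1, 21, 320, 320)

# the segmentation mask labels each pixel with its most likely class
mask = logits.argmax(dim=1)                   # shape: (1, 320, 320)
```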

Style Transfer

If you want to do something fun with images, try neural style transfer. In neural style transfer, we take a regular image, called the content image, and an artistic image, called the style image. Then, we generate a new image that has the content of the content image and the artistic style of the style image. By using the masterpieces of great artists as the style image, you can generate interesting images with this technique.

As an example, check out the following figure:

The image on the left is converted to the image on the right using a style image (middle).

In the style transfer algorithm, we keep the model parameters fixed and instead update the input to the model during training. This twist is the intuition behind the neural style transfer algorithm. A block diagram of the style transfer algorithm is shown in the following figure.

In Chapter 8 of my book, you can learn to implement the neural style transfer algorithm using PyTorch.
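The snippet below sketches only the key twist described above: the network weights stay frozen, and the optimizer updates the input image itself. The VGG19 backbone is a common choice for style transfer, but the loss here is a dummy placeholder; a real implementation compares intermediate VGG features of the generated image against the content and style images.

```python
# A minimal sketch of the core twist in neural style transfer:
# freeze the model parameters and optimize the input image instead.
# The loss below is a placeholder for illustration only.
import torch
from torchvision import models

vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)                   # keep the model parameters fixed

content_img = torch.rand(1, 3, 256, 256)      # dummy content image
generated = content_img.clone().requires_grad_(True)   # the "parameters" being trained

optimizer = torch.optim.Adam([generated], lr=0.01)      # optimize the image, not the network

for step in range(10):
    features = vgg(generated)
    # a real implementation computes content and style losses from VGG features
    loss = features.pow(2).mean()             # placeholder loss, illustration only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # updates the pixels of `generated`
```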

GANs

Do you want more fun with images? Try generative adversarial networks (GANs). A GAN is a framework for generating new data by learning the distribution of the training data. The following figure shows a block diagram of a GAN for image generation.

The generator generates fake data when given noise as input, and the discriminator classifies images as real or fake. During training, the generator and the discriminator compete with each other in a game: the generator tries to generate better-looking images to fool the discriminator, and the discriminator tries to get better at distinguishing real images from fake ones.

In Chapter 9 of my book, you will learn to develop a GAN to generate new images using PyTorch.
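Here is a minimal sketch of the two competing networks and their losses for small flattened grayscale images. The fully connected layers, 28x28 image size, and 100-dimensional noise vector are illustrative assumptions, not the book's exact architecture.

```python
# A minimal sketch of a GAN's generator and discriminator and their two losses.
# Network sizes and the 28x28 flattened image format are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim = 100

generator = nn.Sequential(            # noise vector -> flattened 28x28 fake image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)

discriminator = nn.Sequential(        # flattened image -> real/fake logit
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

criterion = nn.BCEWithLogitsLoss()
noise = torch.randn(16, latent_dim)
fake_images = generator(noise)

# discriminator step: label fakes as 0 (real training images would be labeled 1)
d_loss_fake = criterion(discriminator(fake_images.detach()), torch.zeros(16, 1))

# generator step: try to make the discriminator label the fakes as 1
g_loss = criterion(discriminator(fake_images), torch.ones(16, 1))
```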

Video Classification

Images are still; there is no motion in a static image. The real fun comes from motion, and that is where videos come into play. A video is, in fact, a collection of sequential frames (images) that are played one after another. Check out a short clip of The Matrix in the following.

Similar to image classification, you can think of video classification. Due to the large number of frames in a video, the task is daunting, but it is doable with the help of deep learning and PyTorch. A block diagram of an RNN-based video classification model is shown in the following figure.

In Chapter 10 of my book, you can learn to build a video classification algorithm using PyTorch.
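As a rough sketch of the RNN-based approach, the snippet below encodes each frame with a resnet18 feature extractor and aggregates the per-frame features over time with an LSTM. The clip length, feature dimensions, and number of classes are illustrative assumptions.

```python
# A minimal sketch of RNN-based video classification: a CNN encodes each frame,
# and an LSTM aggregates the per-frame features over time.
# Clip length, dimensions, and class count are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class VideoClassifier(nn.Module):
    def __init__(self, num_classes=10, feature_dim=512, hidden_dim=256):
        super().__init__()
        cnn = models.resnet18(pretrained=True)
        cnn.fc = nn.Identity()                 # use resnet18 as a 512-d feature extractor
        self.cnn = cnn
        self.rnn = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                 # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (hidden, _) = self.rnn(feats)       # keep only the final hidden state
        return self.fc(hidden[-1])             # class logits per video clip

model = VideoClassifier()
clip = torch.randn(2, 16, 3, 224, 224)         # 2 dummy clips of 16 frames each
logits = model(clip)                           # shape: (2, 10)
```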

Summary

In this post, I provided an overview of major computer vision tasks and how to solve them. You can find more details about each task and its implementation here.
