Semantic Segmentation using DeepLabv3

Mageshwaran R · Published in Technovators · Nov 24, 2020

Semantic Segmentation is a challenging problem in computer vision, where the aim is to label each pixel in an image such that pixels with the same label share certain characteristics. Many deep learning approaches have evolved to tackle this problem, starting with FCN (Fully Convolutional Networks) up to the DeepLab family of models.

Photo by Trust "Tru" Katsande on Unsplash

Here, we’ll explore the state-of-the-art deep learning model “DeepLabv3” and fine-tune it on the face segmentation dataset.

Topics Covered

  • Face Segmentation Dataset exploration and preparation
  • Fine-tuning DeepLabv3
  • Inference on Face Segmentation model

In my previous blog post, I did a comprehensive survey on image segmentation, covering the basics and types of segmentation, the datasets and metrics used for segmentation, and traditional image-processing and deep learning based approaches. If you're completely new to segmentation or interested in learning more about different architectures, please refer to that post for insights that will be helpful when implementing segmentation solutions.

Face Segmentation Dataset exploration and preparation

Face/Head Segmentation is the task of segmenting different areas of the face/head, such as ears, hair, nose, and eyes. It has applications in Facial Expression Recognition, Facial Alignment, and AR applications like Face Animoji.

For this blog post, we’ll use Mut1ny’s Face/Head Segmentation dataset, which is available for free for non-commercial purposes; you can get the dataset by submitting a form on their site.

Dataset Exploration

This dataset contains over 16.5k (16,557) fully pixel-level labeled segmentation images. Facial images from different ethnicities, ages, and genders are included, making it a well-balanced dataset. There is also a wide variety of facial poses and camera angles, providing good coverage of head orientations from -90 to 90 degrees.

Example images and masks from mut1ny face/head segmentation dataset, source

For each real image there exists a paired PNG RGB label image, which encodes the 11 different labeled areas of the face using the following RGB values:

Face Segmentation labels

Let’s start exploring the dataset to understand the distribution of images,

Data Exploration

Steps:

  1. The dataset is organized into multiple folders based on type: real images, multiple faces, gender, and the name of the person/model. Example folder name: femalealison1, meaning it contains “Alison” images, part 1, belonging to the “female” class.
  2. We need to group these folders/images into 4 broad categories: male, female, real images, and multiple faces. So we iterate over each folder to determine its category and copy it to the relevant folder, as sketched below.
  3. We need to understand the distribution of this dataset. It looks like the dataset is evenly distributed to avoid gender, ethnicity, and age bias, and it has about 2,500 real images, which are really helpful for segmenting faces in natural scenes.
  4. Organize the dataset into a single folder; this is helpful for the data preparation we will encounter in the next section.
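
As a concrete illustration of steps 1, 2, and 4, here is a minimal sketch; the directory names and the prefix-based category rules are assumptions, so adapt them to your copy of the dataset:

```python
import os
import shutil
from collections import Counter

# Assumed paths -- adjust to where you extracted the mut1ny dataset
SRC_DIR = "mut1ny_dataset"
DST_DIR = "organized_dataset"

def category_of(folder_name):
    """Map a raw folder name (e.g. 'femalealison1') to a broad category."""
    name = folder_name.lower()
    if name.startswith("female"):
        return "female"
    if name.startswith("male"):
        return "male"
    if "real" in name:
        return "real"
    if "multi" in name:
        return "multiple_faces"
    return "other"

counts = Counter()
for folder in sorted(os.listdir(SRC_DIR)):
    src = os.path.join(SRC_DIR, folder)
    if not os.path.isdir(src):
        continue
    cat = category_of(folder)
    dst = os.path.join(DST_DIR, cat, folder)
    shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+
    counts[cat] += len(os.listdir(src))

print(counts)  # rough per-category file count, to eyeball the distribution
```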

Data preparation

Now that we have analyzed the distribution and organized the dataset, let’s start to clean up the dataset and make it ready for model training.

Removing duplicate and near-duplicate images

After manually exploring the dataset, I found that there are many near-duplicate images, which may be due to different viewing angles of the same subject or consecutive frames of a video.

Why do we need to care about near-duplicates?

  1. They introduce bias into your dataset
  2. They reduce the ability of your model to generalize to images outside your training distribution

How can we remove them?

  1. First detect the near-duplicates. To do this, we hash the images to obtain a numerical representation of each image.
  2. Remove the near-duplicates: once we compute the hashes, we can treat images with the same hash value as duplicates and remove them.

Detect and Remove Near Duplicates

You can find a detailed tutorial on detecting and removing duplicate images from the dataset in pyimagesearch.
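
Below is a rough sketch of the same idea using the imagehash library (a different toolchain from the pyimagesearch tutorial, which builds its own difference hash with OpenCV); the directory path and the exact-hash-match criterion are assumptions:

```python
import os
from PIL import Image
import imagehash  # pip install imagehash

IMAGE_DIR = "organized_dataset"  # assumed location of the images

hashes = {}
duplicates = []
for root, _, files in os.walk(IMAGE_DIR):
    for fname in files:
        if not fname.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        path = os.path.join(root, fname)
        # dHash gives a compact fingerprint that is stable under small changes
        h = imagehash.dhash(Image.open(path))
        if h in hashes:
            duplicates.append(path)  # same hash already seen -> near-duplicate
        else:
            hashes[h] = path

print(f"Found {len(duplicates)} near-duplicates")
# for path in duplicates:
#     os.remove(path)  # uncomment to actually delete them
```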

Subsampling

After cleaning up the dataset we have around 14k samples. Let’s create subsamples for faster experimentation.

sub-sampling

For this subsampling, we follow the strategies below:

  1. Use all samples from “real images”, since almost all of them are unique, real-world/natural images
  2. Randomly sample 50 images from “multi-person images”, since it contains the same scenes from different viewing angles and lighting conditions
  3. Randomly sample 10 images from each of the other categories, i.e. images of the same person, like female: Alison or male: Gabriel (a sketch follows this list)
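
A minimal sketch of this strategy, with hypothetical folder names based on the organization step above:

```python
import os
import random
import shutil

random.seed(42)
DST = "subsampled_dataset"

def sample_folder(folder, k=None):
    """Copy k random images (or all, if k is None) from folder into DST."""
    images = [f for f in os.listdir(folder) if f.lower().endswith((".jpg", ".png"))]
    chosen = images if k is None or k >= len(images) else random.sample(images, k)
    os.makedirs(DST, exist_ok=True)
    for f in chosen:
        shutil.copy(os.path.join(folder, f), DST)
    return len(chosen)

total = 0
total += sample_folder("organized_dataset/real")                  # keep all real images
total += sample_folder("organized_dataset/multiple_faces", k=50)  # 50 multi-person images
for person_dir in ["organized_dataset/female/femalealison1",
                   "organized_dataset/male/malegabriel1"]:         # ...and so on per person
    total += sample_folder(person_dir, k=10)                       # 10 per person
print("subsampled images:", total)
```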

After this subsampling, we have 2,103 samples that can be used for model training.

Fine-tuning DeepLabv3

DeepLab is a real-time state-of-the-art semantic segmentation model designed and open-sourced by Google.

DeepLabv3 made a few advancements over DeepLabv2 and DeepLab (DeepLabv1). It comprises the following key components:

  • A ResNet architecture as its backbone network, with some modifications addressed in the following points
  • Dilated (atrous) convolutions to produce high-resolution feature maps (present in the DeepLab family since DeepLabv1)
  • Atrous Spatial Pyramid Pooling (ASPP) to represent objects at multiple scales (from DeepLabv2)
  • Cascaded and parallel modules of atrous convolutions for multi-scale representation
  • Global average pooling in ASPP on the last feature map to capture global context
ASPP module in DeepLabv3, source
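
To make the atrous/ASPP idea concrete, here is a heavily simplified, illustrative module in PyTorch. It is not torchvision's implementation, just a sketch of parallel dilated convolutions plus global pooling, with assumed channel sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """A stripped-down ASPP-style block: parallel atrous convolutions at several
    dilation rates plus global average pooling, concatenated and fused.
    Illustrative only -- torchvision's DeepLabHead contains the real thing."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.global_pool(x), size=x.shape[-2:],
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))

# e.g. a 2048-channel ResNet feature map -> 256-channel multi-scale features
print(SimpleASPP(2048, 256)(torch.randn(1, 2048, 32, 32)).shape)  # [1, 256, 32, 32]
```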

For this tutorial, we’ll use the PyTorch framework for model building and torchvision to load the DeepLabv3 model.

Data Loader

We use the torch DataLoader to load the dataset, and apply some transformations using the transforms module from torchvision.

DataLoader

Steps:

  • Create a train and test split
  • Apply transformations: resize the image to 256x256, convert the NumPy array to a torch Tensor, and normalize the image
  • Create a DataLoader with a specific batch size (a minimal sketch follows)
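
Here is a minimal sketch of such a dataset and loader. It assumes images and RGB label masks sit in parallel folders sharing file stems, and that the RGB masks have already been mapped to integer class indices; the folder names and the 80/20 split are assumptions:

```python
import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms

class FaceSegDataset(Dataset):
    """Minimal face-segmentation dataset over image/mask pairs."""
    def __init__(self, image_dir, mask_dir, size=256):
        self.image_dir, self.mask_dir, self.size = image_dir, mask_dir, size
        self.files = sorted(os.listdir(image_dir))
        self.img_tf = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        name = self.files[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        mask = Image.open(os.path.join(self.mask_dir,
                                       os.path.splitext(name)[0] + ".png"))
        mask = mask.resize((self.size, self.size), Image.NEAREST)
        # Assumed: the RGB mask has already been converted to per-pixel class
        # indices; otherwise map each of the 11 label colours to an integer here.
        target = torch.from_numpy(np.array(mask)).long()
        return self.img_tf(image), target

dataset = FaceSegDataset("subsampled_dataset/images", "subsampled_dataset/masks")
n_test = int(0.2 * len(dataset))
train_set, test_set = random_split(dataset, [len(dataset) - n_test, n_test])
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
test_loader = DataLoader(test_set, batch_size=4)
```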

Model Building and Fine-tuning

As mentioned earlier, we’ll load the deeplabv3_resnet101 model (DeepLabv3 with a ResNet-101 backbone), which was pre-trained on the COCO dataset.

Model Building
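
A minimal sketch of one common way to set this up: load the pre-trained model from torchvision and swap the classifier head for one with our class count. The class count of 12 (11 face regions plus background) is an assumption; use whatever your mask encoding defines:

```python
import torch
from torchvision import models
from torchvision.models.segmentation.deeplabv3 import DeepLabHead

NUM_CLASSES = 12  # assumed: 11 labeled face areas + background

def create_model(num_classes=NUM_CLASSES):
    # DeepLabv3 with a ResNet-101 backbone, pre-trained on a subset of COCO
    model = models.segmentation.deeplabv3_resnet101(pretrained=True, progress=True)
    # Replace the classifier head so it predicts our number of classes
    model.classifier = DeepLabHead(2048, num_classes)
    return model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = create_model().to(device)
```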

The next step is to create a training pipeline, which includes the following:

  • Define loss function, optimizer, and model evaluation metrics
  • Fine-tune for a specified number of epochs
  • Save the best model weights based on test loss
Fine Tuning
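
A compressed sketch of such a loop, reusing the model, device, train_loader, and test_loader from the earlier sketches; cross-entropy loss and Adam are placeholder choices, not necessarily the exact settings used in the original gist:

```python
import copy
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # pixel-wise classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 10
best_loss, best_weights = float("inf"), copy.deepcopy(model.state_dict())

for epoch in range(num_epochs):
    # --- train ---
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(images)["out"]  # torchvision segmentation models return a dict
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

    # --- evaluate ---
    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for images, targets in test_loader:
            images, targets = images.to(device), targets.to(device)
            outputs = model(images)["out"]
            test_loss += criterion(outputs, targets).item() * images.size(0)
    test_loss /= len(test_loader.dataset)

    # keep the weights with the lowest test loss
    if test_loss < best_loss:
        best_loss, best_weights = test_loss, copy.deepcopy(model.state_dict())
    print(f"epoch {epoch + 1}: test loss {test_loss:.4f}")

torch.save(best_weights, "deeplabv3_face_seg.pt")
```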

A recap of what we have done so far:

  1. We created a data loader which loads and transforms our dataset
  2. Then we loaded a pre-trained DeepLabv3 model from torchvision
  3. We created a training pipeline by specifying the loss function, optimizer, and a few hyperparameters
  4. Finally, we fine-tuned the model on our face segmentation dataset and saved the best weights

Inference on Face Segmentation model

We now have a DeepLabv3 model fine-tuned on the face segmentation dataset. Let’s quickly build an inference pipeline that loads new images, transforms them, and performs inference to get the segmented output.

Inference

We follow the below steps to infer an image on the fine-tuned model,

  1. Load the model and image to the device
  2. Resize the image (optional, since DeepLab can handle arbitrary image sizes), convert the ndarray to a float tensor on the device, and normalize the image
  3. Perform inference and visualize the result (see the sketch after the note below)

Note: If you save the whole model object with torch.save(model, PATH), you need the same directory structure (and the model class definition) available in order to load it.
Instead, use torch.save(model.state_dict(), PATH) and load the weights by instantiating the model class and calling load_state_dict.
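
Putting those steps together, here is a minimal sketch that reuses the create_model helper and the saved state_dict from the earlier sketches; the input file name is hypothetical:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = create_model()  # same architecture used for training
model.load_state_dict(torch.load("deeplabv3_face_seg.pt", map_location=device))
model.to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),  # optional; DeepLab handles arbitrary sizes
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("test_face.jpg").convert("RGB")  # hypothetical input image
batch = preprocess(image).unsqueeze(0).to(device)   # add a batch dimension

with torch.no_grad():
    logits = model(batch)["out"]                    # (1, num_classes, H, W)
pred = logits.argmax(dim=1).squeeze(0).cpu().numpy()  # per-pixel class indices

plt.subplot(1, 2, 1); plt.imshow(image); plt.axis("off")
plt.subplot(1, 2, 2); plt.imshow(pred);  plt.axis("off")
plt.show()
```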

Here is an example prediction on a multi-person image,

Inference on DeepLabv3 face segmentation model

Please feel free to pull the code from my GitHub repository for your own experimentation.

References

  1. Face/Head Segmentation dataset
  2. DeepLabv3 fine-tuning


Happy Learning!!! 😍

Mageshwaran R
Technovators

AI Engineer | NLP | Computer Vision. An avid reader of Neuroscience, Psychology, and Decision Making. https://mageshwaran.com