Human Image Segmentation: Experience from Deelvin

Alexander Statsenko
Deelvin Machine Learning
6 min read · Oct 20, 2020

Hi there! This article is about a new tool developed by the Deelvin team for human image segmentation. This tool allows one to change the image background as shown in the example below.

This is how it works: the neural network identifies a person in a video or a picture, separates them from the scene, and replaces the background. Using this tool you can, for example, hide the mess in your room when you are on a video call or streaming. Remote workers are often expected to keep their background free of distractions during video calls, and this tool comes in very handy in such cases. You can also use it to create a meme by placing a character from a movie in a funny setting.

Description of the task

Let’s consider the situation when one needs to segment a single person in an image. The subject is close to the camera: a selfie, a webcam frame, or an ordinary photo where the person is not too far away. If the person is holding an object, it should be segmented together with the person.

Let’s consider the approaches Semantic Segmentation and Instance Segmentation.

The difference is that Semantic Segmentation labels all people in the image as a single class, while Instance Segmentation separates each person as an individual object.

We chose Semantic Segmentation because the alternative is redundant for our task: we usually have one person in the image, and even when there are several, there is no need to separate them from each other. We just need to select all people and change the background, which is exactly the Semantic Segmentation formulation.
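To make the distinction concrete, here is a minimal sketch of the semantic segmentation approach using an off-the-shelf model (torchvision’s pre-trained DeepLabV3). This is not Deelvin’s model, and the input file name is hypothetical; the person class index follows the standard torchvision label set:

```python
# A minimal sketch of the semantic-segmentation approach using an
# off-the-shelf model (torchvision's DeepLabV3); this is NOT Deelvin's
# network, only an illustration of extracting the "person" class.
import torch
from torchvision import models, transforms
from PIL import Image

# Class index 15 is "person" in the PASCAL VOC label set used by
# torchvision's pretrained segmentation models.
PERSON_CLASS = 15

model = models.segmentation.deeplabv3_resnet50(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("selfie.jpg").convert("RGB")  # hypothetical input
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"][0]  # (num_classes, H, W) logits

# Every pixel whose argmax class is "person" goes into one mask;
# all people in the frame end up together, which is exactly the
# semantic (not instance) behaviour described above.
mask = output.argmax(0) == PERSON_CLASS
```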

Datasets

I will now describe several open datasets that can be used for the task of human image segmentation.

COCO

COCO is one of the most popular segmentation datasets. Besides the “person” class, it contains many others, such as “apple”, “horse”, and “car”. Thanks to the annotations, you can select only images with people and train the neural network on them. Below is an example of an image from this dataset.

The disadvantage of this dataset is that the annotation in some images is not precise enough. In the example, you can see that parts of the head, arms, and legs are not covered by the mask. In addition, for our task there is no need to segment people who are far from the main subject.
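If you do want to work with COCO, filtering it down to person images is straightforward. Below is a short sketch using the standard pycocotools API; the annotation path is illustrative:

```python
# Keep only COCO images that contain people and merge their masks
# into one binary "all people" mask, as the semantic formulation needs.
import numpy as np
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")  # path is illustrative

person_cat = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_cat)  # only images with people
print(f"{len(img_ids)} images contain at least one person")

img_info = coco.loadImgs(img_ids[0])[0]
anns = coco.loadAnns(
    coco.getAnnIds(imgIds=img_info["id"], catIds=person_cat))

mask = np.zeros((img_info["height"], img_info["width"]), dtype=np.uint8)
for ann in anns:
    mask |= coco.annToMask(ann)  # union of all person instances
```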

Supervisely Person Dataset

An alternative is the Supervisely Person Dataset. It contains over 5,000 images of people, and the masks are more accurate than in COCO, as seen in the image below.

However, the masks strictly outline the people themselves and exclude any objects in their hands. Imagine this situation: you have a video where you are relaxing with a cocktail in your hand, you change the background to the beach, and the cocktail disappears from your hand. This should not happen.

Besides, 5,000 images are not enough, and in some of them the people are far from the camera, which does not match our task.

Human image dataset developed by Deelvin

Several well-known open-access segmentation datasets have been described above. Unfortunately, none of them suits our task, so we decided to create our own dataset.

Compiling this dataset took a lot of time, and it now contains over 40,000 samples. It covers a wide range of cases: long and short hair, different clothing, objects in hands, as well as people wearing glasses, hats, and so on.

This number of images is enough for training, but many more variations can be covered through augmentation.

Augmentation

Augmentation in this context is the application of various transformations to images. Examples can be seen below.

The first and second images (labeled ‘Original’ and ‘Brightness Contrast’ respectively) differ in brightness and contrast. To the naked eye they look similar, but to a neural network they are completely different. So if you apply Brightness Contrast to every image, the number of samples doubles. And since there are many augmentation methods, the number of samples can be multiplied many times over, and consequently the model learns better.

Augmentation is good not only for increasing the number of training images; it has a second advantage: it brings the data closer to real life. In real life, image quality is often degraded by artifacts, which hurts the accuracy of the neural network’s predictions (by the way, there is an article on this topic in the Deelvin blog).

For example, if a user does not have a good enough camera, motion blur appears in the video when people move. This significantly reduces the accuracy of a neural network that has never “seen” such images in the train dataset. But if we apply Motion Blur augmentation, similar distortions appear during training and the accuracy of the model increases.
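The article does not name the augmentation library Deelvin used, but both transforms mentioned above are available, for example, in albumentations. A minimal sketch (the file names are hypothetical):

```python
# An augmentation pipeline with the albumentations library, covering
# the two transforms discussed above plus a flip; only an illustration.
import albumentations as A
import cv2

transform = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3,
                               contrast_limit=0.3, p=0.5),
    A.MotionBlur(blur_limit=7, p=0.3),  # simulates blur from camera motion
    A.HorizontalFlip(p=0.5),
])

image = cv2.imread("person.jpg")  # hypothetical sample
mask = cv2.imread("person_mask.png", cv2.IMREAD_GRAYSCALE)

# Geometric transforms (like the flip) are applied to image and mask
# together, so the segmentation labels stay aligned with the pixels.
augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```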

Neural network

After preparing the dataset, we experimented with various neural network architectures. As a result, we arrived at our own implementation of a convolutional neural network built on an encoder-decoder architecture. The encoder is pre-trained on the ImageNet dataset, and the decoder contains custom layers that help maximize accuracy.
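The exact layers of the Deelvin network are not public, so the following is only an illustrative sketch of the general idea: an ImageNet-pretrained encoder (here a ResNet-34 backbone, chosen arbitrarily) feeding a simple upsampling decoder that outputs a one-channel person mask:

```python
# An illustrative encoder-decoder segmentation sketch; the decoder
# layers here are placeholders, not Deelvin's custom layers.
import torch
import torch.nn as nn
from torchvision import models

class PersonSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(pretrained=True)  # ImageNet weights
        # Keep everything up to the last residual stage as the encoder.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # A simple decoder: upsample back to input resolution and
        # predict a single-channel person/background mask.
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1),  # logits for the person class
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = PersonSegNet()
logits = net(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
```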

The model was trained for several days on an RTX 2080 Ti GPU, and the final accuracy exceeded 0.98 IoU (the maximum value is 1.0).

Blue curve: train dataset; red curve: validation dataset

As the graph above shows, at the beginning of training the results on the train dataset are lower than on the validation dataset (although usually the opposite happens). This is because the network learns more slowly on augmented images, which appear only in the train dataset. By the end, the curves converge, which means the reported numbers are reliable.

The IoU (intersection over union) metric grows as more pixels are assigned to the correct class. The graph shows that the network has learned very well; now we need to see how it handles real data.
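For reference, here is how IoU is computed for a binary person mask (a small self-contained helper, not Deelvin’s evaluation code):

```python
# IoU between two boolean masks; 1.0 means a perfect match.
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union > 0 else 1.0

# Two nearly identical masks give an IoU close to 1.0.
a = np.zeros((4, 4), dtype=bool); a[1:3, 1:3] = True
b = np.zeros((4, 4), dtype=bool); b[1:3, 1:4] = True
print(iou(a, b))  # 4 / 6, roughly 0.67
```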

Results

As these two images demonstrate, the result is quite sharp, although there are small flaws at the edges, and hair is clearly harder for the network to segment. It is interesting to see the quality on other images, including photos of people with long hair.

The model did pretty well in these cases. In the picture with the person holding a folder, not only is the hand-held object segmented correctly, but the network also very accurately segments a thin strand of hair. These are tricky cases, and the model did a great job!

Conclusion

As a result of this work, we have an effective tool that accurately segments people in photos and videos. We post all our developments on the Deelvin website, where the described model will appear in the upcoming release in two weeks. We will, of course, announce the release in due course.
