Semantic Segmentation of Human Face and some observations regarding the behaviour of neural networks.

Dibya Haldar
Bobble Engineering
Aug 10, 2020

Neural networks are driven by data and one of the most common approaches to control neural networks is to control the data being fed to them.

I would like to admit early on that my experience with neural networks is not extensive by any means, and I am here to share my observations about the few experiments I have carried out on them.

The Network Architectures

There are 3 popular neural network architectures used for semantic segmentation: a) FCN, b) U-Net, c) SegNet.

I shall briefly explain the underlying structure common to the above-mentioned architectures.

A Semantic Segmentation architecture takes an image as its input and returns an image at its output, and the two usually have the same dimensions. The architecture can usually be divided into 2 mirroring halves: the first half can be denoted as the Encoder half and the second half as the Decoder half. The Encoder half has a number of pooling layers (downsampling layers) which compress the information in the input image towards the end of the encoder half. Similarly, the Decoder half has a number of upsampling layers which expand the information to ultimately match the dimensions of the segmented output (which we will refer to as the mask from here on).
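
The shape flow described above can be sketched in a few lines of numpy. This is only an illustration of the encoder/decoder symmetry, not real model code: the convolutional layers are omitted, and the stage count (3) and input size (64×64) are arbitrary choices of mine.

```python
import numpy as np

def downsample(x):
    """2x2 max pooling: each encoder stage halves the spatial resolution."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling: each decoder stage doubles the resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

img = np.random.rand(64, 64)   # stand-in for a single-channel input image

# Encoder half: three pooling stages compress 64x64 down to 8x8
enc = img
for _ in range(3):
    enc = downsample(enc)
print(enc.shape)               # (8, 8): the compressed representation

# Decoder half mirrors the encoder: three upsampling stages restore 64x64
dec = enc
for _ in range(3):
    dec = upsample(dec)
print(dec.shape)               # (64, 64): same dimensions as the input/mask
```

The point to notice is the mirror symmetry: every pooling stage in the encoder is matched by an upsampling stage in the decoder, so the output mask lines up pixel-for-pixel with the input image.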

The SegNet architecture is peculiar in its handling of the upsampling operation. In SegNet, the index in the next layer to which the information in the current layer is written during upsampling is not learned; instead, the pooling indices from the Encoder half are saved and reused in the Decoder half.

Figure 1: This figure depicts the use of pooling indices reused during upsampling in SegNet.

Using pooling indices during upsampling in SegNet helps to preserve the Spatial Locality of Segmentation boundaries, which may be lost if the upsampling indices are learned during the training process.
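
To make the mechanism concrete, here is a toy numpy sketch of SegNet-style pooling and unpooling (real implementations do this per channel on the GPU; the function names below are my own). The pooling step records the flat position of each maximum, and the unpooling step scatters values back to exactly those positions, which is why spatial locality is preserved.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also returns the flat index of each max
    (SegNet-style). x: (H, W) array with H and W even."""
    H, W = x.shape
    pooled = np.zeros((H // 2, W // 2), dtype=x.dtype)
    indices = np.zeros((H // 2, W // 2), dtype=np.int64)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = x[i:i + 2, j:j + 2]
            k = np.argmax(window)          # position 0..3 within the 2x2 window
            di, dj = divmod(k, 2)
            pooled[i // 2, j // 2] = window[di, dj]
            indices[i // 2, j // 2] = (i + di) * W + (j + dj)  # flat index in x
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Scatter pooled values back to the positions recorded during pooling;
    every other output position stays zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1, 0, 0, 0],
              [0, 0, 0, 5],
              [0, 2, 0, 0],
              [0, 0, 3, 0]], dtype=float)
p, idx = max_pool_with_indices(x)
y = max_unpool(p, idx, x.shape)
# Each maximum returns to its exact original position, so here y == x.
```

A learned upsampling layer, by contrast, has no memory of where each maximum came from, and boundary pixels can drift by a pixel or two as a result.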

The IoU Score

The term IoU stands for Intersection over Union. It is used to measure the accuracy of semantic segmentation models. The IoU score is the ratio of the number of pixels (an approximation of area) in the intersection of the Output Mask and the Ground Truth Mask to the number of pixels in the union of the Output Mask and the Ground Truth Mask. This metric works because the intersection in the numerator rewards the Output Mask for covering the pixels present in the Ground Truth Mask, while the union in the denominator penalises the Output Mask for every pixel it places outside the Ground Truth Mask.
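
For binary masks, the score takes only a few lines of numpy (the convention of returning 1.0 when both masks are empty is my own choice; libraries differ on this edge case):

```python
import numpy as np

def iou_score(pred_mask, gt_mask):
    """Intersection over Union between two binary masks of equal shape."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: define IoU as perfect
        return 1.0
    return intersection / union

# Toy example: ground truth covers 4 pixels; the prediction hits 3 of them
# and adds 1 spurious pixel outside -> IoU = 3 / 5 = 0.6
gt   = np.array([[1, 1, 0],
                 [1, 1, 0]])
pred = np.array([[1, 1, 1],
                 [1, 0, 0]])
print(iou_score(gt, pred))   # 0.6
```

Note how both failure modes lower the score: the missed ground-truth pixel shrinks the intersection, and the spurious pixel grows the union.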

The Data Collection Process

For Data Collection, we created a cropper app to manually mark the boundaries of head, hair and face of people belonging to the demographics of our user base. After getting the images cropped, we got them validated for correctness.

For head segmentation, we collected pictures of people in different environments and under different lighting conditions. We also collected pictures of people wearing turbans, caps, shawls, and flowers, and included these accessories as part of their head crops. We also added many images without any human head in them for negative learning.

For hair segmentation, we collected pictures of people with various hair colours and hair textures. We also applied augmentations for representing more lighting conditions and hair colours. We also got specific images cropped of people having discontinuous hair. For some people, part of the hair appears from behind their neck, so the entire hair is not present in a continuous stretch. More on this point later.
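
The post does not describe the exact augmentation pipeline, but a minimal sketch of one lighting augmentation might look like the following. The function name and the scale range (0.6–1.4) are my own illustrative choices; note that the mask needs no corresponding change, since brightness does not move segmentation boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)

def lighting_augment(image, low=0.6, high=1.4):
    """Randomly scale pixel intensities to simulate different lighting.
    image: uint8 array of shape (H, W, 3). The paired segmentation mask
    is left untouched, because a brightness change does not move any
    object boundary."""
    factor = rng.uniform(low, high)
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Tiny random image standing in for a real photo
photo = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
augmented = lighting_augment(photo)
```

Geometric augmentations (flips, crops) work the same way, except that there the identical transform must be applied to the mask as well.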

For face segmentation, we collected pictures in a similar fashion to head segmentation, though we did not add negative images this time. We also added augmentations to represent more lighting conditions.

The Training Process

At Bobble, we had to segment the human head, the scalp hair and the face. But we did not segment all the parts together. We separately segmented the hair, the face, and the entire head along with accessories like turbans, scarves, flowers, etc.

However, when I was training the Neural Network based model for these three types (head, hair, face), I achieved different accuracies for each of the tasks.

After training the models, I was surprised when I looked at the IoU accuracies of the trained models. While the head segmentation and face segmentation models had accuracies of 90% and 92% respectively, the maximum accuracy I could achieve during hair segmentation was only 80%.

On examining the images in which the model did poorly at hair segmentation, we found that the hair which appeared from behind people’s necks was not being identified as hair by the model. Even collecting more data of this type did not make the model learn; it remained stubbornly adamant. Consequently, I got curious and looked for papers in which hair segmentation was implemented. Those papers also reported IoU scores around 81%. Of course these accuracies are data dependent, and neural networks trained for one demographic usually do not generalise to other demographics. Another thing I found surprising was that the validation loss fell only to 0.0016 for hair segmentation even after a lot of effort, while it easily fell to 0.0009 for face segmentation, indicating that the model found it harder to learn the boundaries of hair than the boundaries of a face.

It is then that the entire problem unravelled before me. Hair segmentation was a more difficult problem to solve than face segmentation. The reason: probability. The probability space for the different types of scalp hair is much larger than that for the different types of human faces. Different human faces have much more in common with each other than they have differences: all human faces have similarly shaped noses, eyes, mouths, and ears, and the structure and shape of all human faces are quite similar. But can we say the same about human hair? Human scalp hair arrives in different shapes: some straight, some wavy, others curly. The number of hair styles is also ever increasing. Though all hair has the common feature of starting out from the boundaries of the human face, some hair also appears from behind the neck. Some people have super long hair, while others may have hair on only a very small part of their head. Some bring their hair towards the front; for others it stays behind their head.

Figure 2: Segmentation models inference on my own face. Image captured using laptop front camera.

Conclusion

The above commentary only showcases that the probability space for different kinds of human hair is huge and it is difficult for a neural network to generalise all these different types and styles of hair under a single category. It may require much more data to generalise better, or it may require larger neural networks than those used currently for semantic segmentation. Whatever the case may be, it is a problem that Data Scientists and Engineers may solve soon.

I only wanted to give my two cents about how two similar problems may have different levels of complexity and the same tools may not be adequate for solving both.

Dibya Kanti Haldar,

Bobble AI
