NeuroNuggets: Segmentation with Mask R-CNN

Sergey Nikolenko
Published in Neuromation · 11 min read · Mar 27, 2018

In the third post of the NeuroNuggets series, we continue our study of basic problems in computer vision. I remind the reader that in this series we discuss the demos available on the recently released NeuroPlatform, concentrating not so much on the demos themselves but rather on the ideas behind each deep learning model. This series is also a great chance to meet the new Neuromation deep learning team that has started working at our new office in St. Petersburg, Russia.

In the first installment, we talked about the age and gender detection model, which is basically image classification. In the second, we presented object detection, a more complex computer vision problem where you also have to find where an object is located. Today, we continue with segmentation, the most detailed problem of this kind, and consider the latest addition to the family of R-CNN models, Mask R-CNN. I am also very happy to present the co-author of this post, Anastasia Gaydashenko:

Segmentation: Problem Setting

Segmentation is a logical next step of the object detection problem that we talked about in our previous post. It still stems from the same classical computer vision conundrum: even with great feature extraction, simple classification is not enough; you also have to understand where to extract these features. Given a photo of your friends, a landscape scene, or basically any other image, can you automatically locate and separate all the objects in the picture? In object detection, we looked for bounding boxes, i.e., rectangles that enclose the objects. But what if we require more detail and want to label the exact silhouettes of the objects, excluding the background? This problem is called segmentation. Generally speaking, we want to go from the pictures in the first row to the pictures in the second row of the image below:

Formally speaking, we want to label each pixel of an image with a certain class (tree, road, sky, etc.) as shown in the image. The first question, of course, is why? What’s wrong with regular object detection? Actually, segmentation is widely applied: in medical imaging, content-based image retrieval, and so on. It avoids a big problem of regular object detection: overlapping bounding boxes for different objects. If you see three heavily overlapping bounding boxes, are these three different hypotheses for the same object (in which case you should choose one) or three different objects that happen to occupy the same rectangle (in which case you should keep all three)? Regular object detection models can’t really decide.

And if the shape of the object is far from rectangular, segmentation provides much better information (this is very important for medical imaging). For instance, Google used semantic image segmentation to create the Portrait Mode in its signature Pixel 2 phone.

These pictures also illustrate another important point: the different flavours of segmentation. What you see above is called semantic segmentation; it’s the simpler version, when we simply want to classify all pixels to categories such as “person”, “airplane”, or “background”. You can see in the picture above that all people are labeled as “person”, and the silhouettes blend together into a big “cuddle pile”, to borrow the expression used in certain circles.

This leads to another, more detailed type of segmentation: instance segmentation. In this case, we would want to separate the people in the previous photo and label them as “person 1”, “person 2”, etc., as shown below:

Here, all the people in the photo are marked in different colors, and each color corresponds to a different instance. Note also that they are labeled with probabilities that reflect the model’s confidence in a particular class label; the confidence is very high in this case, but generally speaking, it is a desirable property of any AI model to know when it is not sure.

Convolutions and Convolutional Neural Networks

So how can a computer vision system parse these images in such an accurate and humanlike way? Let’s find out. The first thing we need to study is how convolution works and why we use it. And yes, we return to CNNs in each and every post, because they are really important and we keep finding new things to say about them. But before we get to the new things, let’s briefly go through the old, concentrating on the convolutions themselves this time.

Initially, the idea of convolution came from biology or, more specifically, from studies of the visual cortex. David Hubel and Torsten Wiesel studied the lower layers of the visual cortex in cats. Cat lovers, please don’t watch this:

When Hubel and Wiesel moved a bright line across a cat’s retina, they noticed an interesting effect: the activity of neurons in the brain changed depending on the orientation of the line, and some of the neurons fired only when the line was moving in a particular direction. In simple terms, that means that different regions of the brain react to different simple shapes. To model this behaviour, researchers had to invent detectors for simple shapes, elementary components of an image. That’s how convolutions appeared.

Formally speaking, convolution is just a scalar product of two matrices taken as vectors: we multiply them componentwise and sum up the results. The first matrix is called the “kernel”; it represents a simple shape, like this one:

The second matrix is just some patch from the picture where we want to find the pattern shown in the kernel. If the convolution result is large, we decide that the target shape is indeed present in this patch. Then we can simply run the convolution through all patches in the image and detect where the pattern occurs.
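Here is a minimal NumPy sketch of this “slide, multiply, and sum” procedure; the toy image with a diagonal line and the diagonal kernel are made up for illustration, not taken from the figures in this post:

```python
# A minimal sketch of convolution as pattern matching, using NumPy.
# The 3x3 kernel below is a made-up detector of a diagonal line;
# real CNN kernels are learned from data.
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # componentwise product, then sum
    return out

# A toy image containing a diagonal line, and a kernel that looks for one.
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=float)
kernel = np.eye(3)  # 3x3 diagonal pattern

response = convolve2d(image, kernel)
print(response)  # the largest value appears where the diagonal matches the kernel
```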

For example, let us try to find the filter above in a picture of a mouse. Here you can see the part where we would expect the filter to light up:

And if we multiply the corresponding submatrix with the kernel and sum all the values, we indeed get a pretty big number:

On the other hand, a random part of the picture does not produce anything at all, which means that it is totally different from the filter’s pattern:

Filters like this let us detect some simple shapes and patterns. But how do we know which of these forms we want to detect? And how can we recognize a plane or an umbrella from these simple lines and curves?

This is exactly where the training of neural networks comes in. The shapes defined by the kernels (the filters) are exactly what a CNN learns from the training set. On the first layers these are usually simple lines and gradients, but then, with each layer of the model, these shapes are combined with one another into more and more recognizable patterns; the result of applying a kernel across the whole image is called a feature map:

The other operation necessary for CNNs is pooling. It is mostly used to reduce computational costs and suppress noise. The main idea is to cover a matrix with small submatrices and leave only one number in each, thus reducing the dimension; usually, the result is just a maximum or average of the values in the small submatrix. Here is a simple example:

This kind of operation is also sometimes called downsampling, as it reduces the amount of information we store.
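Here is a minimal NumPy sketch of 2×2 max-pooling on a made-up matrix; deep learning frameworks provide this as a built-in layer, but the operation itself is just this:

```python
# A minimal sketch of 2x2 max-pooling with NumPy: we keep only the largest
# value of each non-overlapping 2x2 block, halving both dimensions.
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "toy example: dimensions must be even"
    # reshape into 2x2 blocks and take the maximum of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
], dtype=float)

print(max_pool_2x2(x))
# [[4. 2.]
#  [2. 8.]]
```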

The R-CNN models and Mask R-CNN

Above, we have seen a simplified explanation of the techniques used to create convolutional neural networks. We have not touched upon learning at all, but that part is pretty standard. With these simple tools, we can teach a computer model to recognize all sorts of shapes; moreover, CNNs operate on small patches, so they are perfectly parallelizable and well suited for GPUs. We recommend our previous two posts for more details on CNNs, but let us now move on to higher-level matters, that is, segmentation.

In this post, we do not discuss all modern segmentation models (there are quite a few) but go straight to the model used in the segmentation demo on the NeuroPlatform. This model is called Mask R-CNN, and it is based on the general architecture of R-CNN models that we discussed in the previous post about object detection; hence, a brief reminder is again in order.

It all begins with the R-CNN model, where R stands for “region-based”. The pipeline is pretty simple: we take a picture and apply an external algorithm (called selective search) to it, searching for all kinds of objects. Selective search is a heuristic method that extracts regions based on connectivity, color gradients, and coherence of pixels. Next, we classify all extracted regions with a neural network:

Due to the high number of proposals, R-CNN was extremely slow. In Fast R-CNN, a RoI (region of interest) projection layer was added to the neural network: instead of putting each proposed region through the whole network, Fast R-CNN runs the whole image through the network once, finds the neurons in the feature map that correspond to a particular region, and then applies the remaining part of the network to each such set of neurons. Like this:

The next step was to invent the Region Proposal Network that could replace selective search; the Faster R-CNN model is now a complete end-to-end neural network.
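Both Fast and Faster R-CNN rely on this RoI pooling step to cut each proposal out of the shared feature map and resize it to a fixed shape. torchvision exposes it as a standalone operator; in the hedged sketch below, a random tensor stands in for the feature map and two hand-written boxes stand in for region proposals (Mask R-CNN itself uses a refined variant, RoIAlign, available as torchvision.ops.roi_align):

```python
# A sketch of the RoI pooling idea using torchvision.ops.roi_pool.
# The feature map and boxes below are random placeholders, not outputs
# of a real network.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)      # [batch, channels, H, W]
# Two regions of interest, given as (batch_index, x1, y1, x2, y2)
# in the coordinate system of the feature map.
rois = torch.tensor([
    [0,  5.0,  5.0, 25.0, 30.0],
    [0, 10.0, 20.0, 40.0, 45.0],
])

pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): one fixed-size tensor per region
```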

You can read about all these steps in more detail in our object detection post. But we’re here to learn how Faster R-CNN can be converted to solve the segmentation problem! Actually, the idea is extremely simple but nevertheless efficient and functional: the authors just added to the original Faster R-CNN model a parallel branch that predicts an object mask. Here is what the resulting Mask R-CNN model looks like:

The top branch in the picture predicts the class of a region, and the bottom branch tries to label each pixel of the region to construct a binary mask (i.e., object vs. background). It only remains to understand where this binary mask comes from.
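Before we get to that, here is a rough structural illustration of the two parallel branches applied to each pooled region, sketched in PyTorch; this is not the actual Mask R-CNN code, and the layer sizes are placeholders:

```python
# A conceptual sketch of the two parallel branches: one predicts a class
# for the region, the other predicts a small mask for every class.
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        # classification branch: pooled region -> class scores
        self.cls_branch = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),
        )
        # mask branch: a small fully convolutional network -> per-class mask logits
        self.mask_branch = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),  # upsample 7x7 -> 14x14
            nn.Conv2d(256, num_classes, 1),  # one mask logit map per class
        )

    def forward(self, pooled_regions):                    # [num_rois, C, 7, 7]
        class_logits = self.cls_branch(pooled_regions)    # [num_rois, num_classes]
        mask_logits = self.mask_branch(pooled_regions)    # [num_rois, num_classes, 14, 14]
        return class_logits, mask_logits

head = TwoBranchHead()
class_logits, mask_logits = head(torch.randn(4, 256, 7, 7))
print(class_logits.shape, mask_logits.shape)
```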

Fully Convolutional Networks

Let us take a closer look at the segmentation part. It is based on a popular architecture called Fully Convolutional Network (FCN):

The FCN model can be used for both image detection and segmentation. The idea is pretty straightforward, and the network is actually even simpler than usual, but it’s still a deep and interesting idea.

In standard deep CNNs for image classification, the last layer is usually a vector of the same size as the number of classes; it shows the “scores” of the different classes, which can then be normalized to give class probabilities. This is what happens in the “class box” in the picture above.

But what if we stop at some middle layer of the CNN and, instead of flattening into a vector, do some more convolutions, so that on the last convolutional layer we get the same number of feature maps as there are classes? Then, after proper training, we get “class scores” in every pixel of the last layer, i.e., a kind of “heatmap” for every class! Here is how it works, with regular classification on top and the fully convolutional approach on the bottom:
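The core of this trick in code is simply to replace the final fully connected classifier with a 1×1 convolution; here is a hedged PyTorch sketch in which a toy two-layer stack of convolutions stands in for a real backbone:

```python
# A sketch of the "fully convolutional" idea: the output is a coarse
# per-class score map instead of a single score vector.
import torch
import torch.nn as nn

num_classes = 21  # e.g. 20 object classes + background, as in PASCAL VOC

backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
# 1x1 convolution: one score map ("heatmap") per class
classifier = nn.Conv2d(128, num_classes, kernel_size=1)

image = torch.randn(1, 3, 224, 224)
heatmaps = classifier(backbone(image))
print(heatmaps.shape)  # torch.Size([1, 21, 56, 56]): a coarse score map per class
```

These maps are much coarser than the input image, which is exactly why the upsampling operations described next are needed.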

For segmentation via this network, we will use the inverses of convolution and pooling. Meet… deconvolution and unpooling!

In deconvolution, we basically do convolution but with the kernel matrix transposed, so that the output is a window rather than a single number. Here are two popular ways to do deconvolution (white squares are zero paddings), animated for your viewing convenience:

To understand unpooling, recall the pooling operation we discussed above. In max-pooling, we take the maximum value from each submatrix. Now we also want to remember the coordinates of the cells these maxima came from and then use them to “invert” max-pooling: we create a matrix of the same shape as the initial one and put the maxima into the corresponding cells, reconstructing the other cell values with approximations based on the known cells. Some information is lost, of course, but usually this kind of upsampling works pretty well:

Through the use of deconvolution and unpooling, we can construct pixel-wise predictions for each class, that is, segmentation masks for the image!
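Here is a minimal PyTorch sketch of both upsampling operations; the layer parameters are chosen only to show how the shapes change, and note that this particular unpooling implementation fills the non-maximum cells with zeros rather than approximations:

```python
# Transposed ("de")convolution and max-unpooling in PyTorch.
import torch
import torch.nn as nn

# Transposed convolution: each input value spreads into a window,
# and with stride 2 the spatial size doubles instead of shrinking.
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                            padding=1, output_padding=1)
x = torch.randn(1, 1, 4, 4)
print(deconv(x).shape)  # torch.Size([1, 1, 8, 8])

# Max-unpooling: pooling with return_indices=True remembers where every
# maximum came from, and MaxUnpool2d puts the values back there
# (the other cells are filled with zeros in this implementation).
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
y = torch.tensor([[[[1., 3., 2., 0.],
                    [4., 2., 1., 1.],
                    [0., 1., 5., 6.],
                    [2., 2., 7., 8.]]]])
pooled, indices = pool(y)
restored = unpool(pooled, indices)
print(pooled.squeeze())    # tensor([[4., 2.], [2., 8.]])
print(restored.squeeze())  # maxima back in their original positions, zeros elsewhere
```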

Segmentation demo in action

We have seen how Mask R-CNN does segmentation, but it is even better to see the result for yourself. We follow the usual steps to get the model working.

1. Log in at https://mvp.neuromation.io
2. Go to “AI Models”:

3. Click “Add more” and “Buy on market”:

4. Select and buy the Object Segmentation demo model:

5. Launch it with the “New Task” button:

6. Try the demo! You can upload your own photo for segmentation. We chose this image:

7. And here you go! The model shows bounding boxes as well, but now it gives much more than just bounding boxes:
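If you would also like to experiment with Mask R-CNN locally, here is a hedged sketch using torchvision’s pretrained model; this is not the NeuroPlatform model, the exact loading arguments may differ between torchvision versions, and the image path is a placeholder:

```python
# Running a pretrained Mask R-CNN locally with torchvision.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# On older torchvision versions, use pretrained=True instead of weights="DEFAULT".
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("your_photo.jpg").convert("RGB")   # placeholder path
tensor = transforms.ToTensor()(image)                 # [3, H, W], values in [0, 1]

with torch.no_grad():
    output = model([tensor])[0]

# The output dict contains boxes, class labels, confidence scores,
# and one soft mask per detected instance.
keep = output["scores"] > 0.5
print(output["boxes"][keep].shape)   # [N, 4] bounding boxes
print(output["labels"][keep])        # COCO class indices
print(output["masks"][keep].shape)   # [N, 1, H, W] masks with values in [0, 1]
```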

Sergey Nikolenko
Chief Research Officer, Neuromation

Anastasia Gaydashenko
Junior Researcher, Neuromation
