# Detailed Explanation of RoI Pooling vs RoI Align vs RoI Warping

## These feature extraction methods help you save information in your images!

In the last article, we went through the steps of building an object detector using Mask R-CNN, which outputs a pixel-by-pixel mask over each detected object. One step where Mask R-CNN differs from Faster R-CNN is the feature extraction method applied after the image has passed through the convolutional network. This step is called RoI Pooling in Faster R-CNN and RoI Align in Mask R-CNN.

What is the difference between RoI Pooling and RoI Align?

Both methods turn a region proposal of arbitrary size into a fixed-size feature map, and they differ in how much information is lost along the way. Let’s go through these two feature extraction methods before we arrive at RoI Warping, which sits between the two.

# RoI Pooling

Recall that after the convolutional neural network has extracted features, we have a set of feature maps and are ready to obtain region proposals through a region proposal method. Let’s use VGG16 as the network, with a scale factor of 32 between the input image and the feature map.

Now, when the region proposals are generated, they are simply coordinates from the original image. Once these coordinates are converted to coordinates on the feature map, they undergo a quantization process where information is lost to a certain extent. Picture this:

Already, mapping our region proposal from the image to the feature map involves a slight loss of information and some unwanted gain of information, because the coordinates are taken as integers rather than floating-point values. How so? Take a look below.

The green part is the effect of quantization of the coordinates and as such we gained some new (unnecessary) data while the red part indicates the loss of information.
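To make the mapping step concrete, here is a minimal Python sketch. The box `(296, 192, 200, 145)` is a hypothetical region proposal, chosen only so that the quantized region comes out as 4x6 like the example, and `map_to_feature_map` is an illustrative name rather than a library function.

```python
import math

def map_to_feature_map(box, scale=32):
    """Map an (x, y, width, height) box from image pixels to feature-map cells.

    Returns the exact floating-point box and the quantized (floored) box.
    """
    exact = tuple(v / scale for v in box)
    quantized = tuple(math.floor(v) for v in exact)
    return exact, quantized

# Hypothetical region proposal in image coordinates: x, y, width, height.
proposal = (296, 192, 200, 145)
exact, quantized = map_to_feature_map(proposal)
print(exact)      # (9.25, 6.0, 6.25, 4.53125)
print(quantized)  # (9, 6, 6, 4)
```

The gap between the two tuples is exactly the information that is gained or lost: the origin shifts by a quarter of a cell, and roughly half a cell of height is cut off.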

Note that the size of the region proposal is 4x6 at this point in the feature map. The following step performs quantization once more to get it to a fixed size of 3x3 (this could just as well be a 7x7 pooling layer) before the features are fed into the final layer. To do this, we divide the 4 from 4x6 by the 3 from the pooling layer and round down, which gives 1, and do the same for the 6, which gives 2. Each bin is therefore 1x2, and this is how we separate our RoI into bins before performing max-pooling.

This is what our RoI will look like after quantization. From here, it is much simpler, as we only need to take the maximum value from each bin: max-pooling!

As seen, the last row of data is lost, and this is the effect of quantization. Max-pooling takes the maximum value within each bin (the colored pairs here) and outputs the results as a 3x3 matrix. Because the last convolutional layer of VGG16 produces 512 feature maps, the output has a shape of 3x3x512. Generally, this is how RoI Pooling works. The key takeaway is that there is some loss of information after the two stages of quantization.
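The pooling stage can be sketched in a few lines of plain Python. This is a simplified illustration for a single channel, assuming the RoI is at least as large as the output grid in each dimension; the feature-map values are made up, and `roi_max_pool` is a hypothetical helper.

```python
def roi_max_pool(feature_map, x, y, width, height, out_size=3):
    """Max-pool a quantized RoI of one channel into an out_size x out_size grid.

    feature_map is a list of rows; (x, y) is the RoI's top-left cell.
    Assumes width >= out_size and height >= out_size.
    """
    bin_w = width // out_size   # second quantization: 6 // 3 = 2
    bin_h = height // out_size  # 4 // 3 = 1, so the last row is never read
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            top, left = y + i * bin_h, x + j * bin_w
            cells = [feature_map[r][c]
                     for r in range(top, top + bin_h)
                     for c in range(left, left + bin_w)]
            row.append(max(cells))
        pooled.append(row)
    return pooled

# Toy single-channel feature map (made-up values): 4 rows x 6 columns.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, x=0, y=0, width=6, height=4))
# [[1, 3, 5], [7, 9, 11], [13, 15, 17]]
```

Note that the largest values in the toy map (18 to 23, in the last row) never reach the output, which is the information loss described above.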

Check out this article for reference.

# RoI Align

Since RoI Pooling performs two stages of quantization, which causes a “huge” loss of information by the time the input is fed into the final layer, RoI Align is designed to solve this problem by NOT performing quantization at either stage, neither mapping nor pooling. To understand this, you need a little understanding of **bilinear interpolation.** Using the example above:

Notice that neither the coordinates nor the size are quantized anymore. We use the floating-point values directly this time to prevent loss of information. This is the mapping stage. Now onto the pooling stage.

Since we are still using a 3x3 pooling layer (again, you can change this to a 7x7 layer), we divide the RoI on the feature map into a 3x3 grid, like this:

Here’s where the RoI Align method differs from the RoI Pooling method. To perform max-pooling, we need values to pool from, so we sample data from within each box of the grid. RoI Align takes 4 data points per box, and to get them we first need 4 sampling points at which to perform bilinear interpolation. Here’s how to get the 4 sampling points:

You can calculate the coordinates of each of the 4 points using the formula below:

X = X_coord_box + (width / pooling_layer_size) * sampling_point_id

Y = Y_coord_box + (height / pooling_layer_size) * sampling_point_id
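As a rough illustration, the formula can be written as a small Python function. Several things here are assumptions rather than statements from the text: `width` and `height` are taken to be the size of a single box in the grid, the divisor is 3, and the sampling point ids are 1 and 2, which places the points at one third and two thirds of the box. The function name and the example box are hypothetical.

```python
def sampling_points(x_box, y_box, width, height, divisions=3):
    """Return the 4 sampling points (x, y) inside one box of the RoI grid.

    Each coordinate is the box origin plus (size / divisions) * id,
    with ids 1 and 2 (an assumption made for this sketch).
    """
    ids = (1, 2)
    return [(x_box + (width / divisions) * i,
             y_box + (height / divisions) * j)
            for j in ids for i in ids]

# Hypothetical box: top-left at (9.0, 6.0), 3.0 wide and 1.5 tall.
points = sampling_points(9.0, 6.0, 3.0, 1.5)
print(points)
# [(10.0, 6.5), (11.0, 6.5), (10.0, 7.0), (11.0, 7.0)]
```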

Now, we are ready to get the data points from these sampling points. To perform bilinear interpolation to get the data points, there is a handy tool for us to utilize:

Input the coordinates and values of each box in the grid and you’re good to go! Once we have all four data points, the remaining step is the same as before: max-pooling! Taking the maximum value from each of the 9 boxes gives us our final 3x3 matrix.
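If you would rather compute the interpolation yourself, the sketch below shows one way to do it in plain Python. It assumes that the value `feature_map[r][c]` sits exactly at coordinates `x = c`, `y = r` and that the sampling point lies inside the map; other conventions (such as values at cell centers) shift the coordinates by half a cell. `bilinear` and `pool_box` are illustrative names, and the 2x2 map is made up.

```python
import math

def bilinear(feature_map, x, y):
    """Bilinearly interpolate a single-channel feature map at float (x, y).

    Assumes feature_map[r][c] is located at x=c, y=r, and that
    0 <= x <= cols - 1 and 0 <= y <= rows - 1.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom

def pool_box(feature_map, points):
    """Max-pool the interpolated values at a box's sampling points."""
    return max(bilinear(feature_map, x, y) for x, y in points)

fmap = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear(fmap, 0.5, 0.5))  # 1.5
four_points = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
print(pool_box(fmap, four_points))  # 2.25
```

Running `pool_box` once per box, over all 9 boxes, would produce the final 3x3 output for one channel.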

As you’ll see later, even a slight gain of information will give us better precision!

As you can see, with RoI Align there is much less information loss, because we try to fully utilize the feature maps to pool data from. This comes at the expense of more calculations, of course! This is where RoI Warping comes in.

# RoI Warping

Thanks for reading this far. If you followed what’s happening above, this part is a piece of cake! RoI Warping performs quantization only at the mapping stage and keeps bilinear interpolation at the pooling stage.
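One way to see how the three methods relate is to compare the box size each one ends up pooling over. The sketch below is only an illustration: the proposal `(296, 192, 200, 145)` is a hypothetical box, and `bin_size` with its two flags is a made-up helper that switches quantization on or off at each stage.

```python
import math

def bin_size(box, scale=32, out_size=3,
             quantize_mapping=False, quantize_pooling=False):
    """Return (mapped_box, bin_width, bin_height) for an (x, y, w, h) image box."""
    mapped = tuple(v / scale for v in box)
    if quantize_mapping:
        # Stage 1 quantization: snap the mapped box to whole cells.
        mapped = tuple(float(math.floor(v)) for v in mapped)
    _, _, w, h = mapped
    bin_w, bin_h = w / out_size, h / out_size
    if quantize_pooling:
        # Stage 2 quantization: snap the bin size to whole cells.
        bin_w, bin_h = float(math.floor(bin_w)), float(math.floor(bin_h))
    return mapped, bin_w, bin_h

proposal = (296, 192, 200, 145)  # hypothetical box in image pixels
methods = {
    "RoI Pooling": dict(quantize_mapping=True, quantize_pooling=True),
    "RoI Warping": dict(quantize_mapping=True, quantize_pooling=False),
    "RoI Align": dict(quantize_mapping=False, quantize_pooling=False),
}
for name, flags in methods.items():
    print(name, bin_size(proposal, **flags))
```

With these numbers, RoI Pooling and RoI Warping both start from the same quantized 6x4 region, but only RoI Pooling also rounds the bin height down to 1; RoI Align keeps the original 6.25x4.53125 region and fractional bins throughout.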

Using RoI Warping, according to this article here, does not give a tremendous improvement to the Average Precision of the model, but RoI Align does! Think about all the information that RoI Align recovers compared with RoI Pooling; this alone makes the model more precise. Here is what the original author on the topic wrote about the Average Precision of the models.

Now that we understand how RoI Pooling, RoI Align and RoI Warping work, I believe understanding any other R-CNN model should be straightforward. I hope I have given a clear explanation of how these actually work. Full credit definitely goes to the original author of the article I referred to.

Next up, I definitely want to do some hands-on experiments or projects with image processing. See you in the next article!

# References

Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. "Mask R-CNN." ICCV 2017 (the original Mask R-CNN paper).