The Best of ICCV 2017

Computer vision experts from academia and industry gathered in Venice in late October for the 16th International Conference on Computer Vision (ICCV 2017). I attended this year’s ICCV on behalf of Onfido’s research team. In this post I share some of my experience of the conference and review its two award-winning papers: Mask R-CNN and Focal Loss.

ICCV 2017 was held in Venice, the city of canals and gondolas.

Since its first edition in 1987, the International Conference on Computer Vision (ICCV) has established itself as one of the most important events in the field. The 16th edition, ICCV 2017, was held in Venice, Italy, between the 22nd and 29th of October.

2143 valid submissions were made this year, which marks an increase of 26% over the previous edition of the conference in 2015. Out of all submitted papers, a total of 621 papers were accepted (overall acceptance rate 28.9%) with 45 oral presentations (2.09%), 56 spotlight papers (2.61%), and 520 posters (24.26%).

As is customary, the conference was accompanied by a number of co-located events, including a record 44 workshops (63% more than the previous edition), 9 tutorials, a doctoral consortium, industrial exhibitions, and demos.

On the first day, I attended a tutorial on Generative Adversarial Networks, organised by Ian Goodfellow. There were many insightful talks by Ian Goodfellow, Sanjeev Arora and Alexei Efros, among others. I particularly enjoyed David Pfau’s talk on the connections between adversarial training and reinforcement learning.

On the second day of the conference, I attended a tutorial named Beyond Supervised Learning, which featured some very interesting talks by Jitendra Malik, Vladlen Koltun, Michal Irani, Alexei Efros and others.

The first slide of the talk by Alexei Efros in the ICCV workshop: Beyond Supervised Learning. If only there was a best slide award!

Best Paper: Mask R-CNN

The best paper award went to Kaiming He et al. for Mask R-CNN, a paper that addresses the problem of instance segmentation.

A Venetian mask. Was it a coincidence that the best paper award went to “Mask” R-CNN?

Instance segmentation can be thought of as the combination of semantic segmentation and object detection. More specifically, the aim is to label, at pixel level, every instance of every object in an image.

At a high level, Mask R-CNN performs instance segmentation by combining the state of the art in object detection and semantic segmentation. The best modern object detection frameworks are based on the two-stage framework of Regions with CNN features (R-CNN) proposed by Girshick et al. R-CNN splits the task into two stages. The first stage produces a number of region proposals using a process known as Selective Search, which starts with an over-segmentation of the input image and merges similar regions into object region proposals. The second stage then classifies each region proposal as an object class or as background, using a CNN to extract deep features and a set of per-class SVM classifiers. A bounding box regressor is then used to regress a tighter bounding box for the object from the CNN features.

The R-CNN approach was later enhanced by Girshick in the Fast R-CNN paper, which includes multiple changes to R-CNN to improve its speed. The R-CNN framework is slow partly because it needs to run a forward pass of a CNN for every single region proposal. This results in thousands of passes per image, and since many proposals overlap, much of that computation is redundant. Instead, the Fast R-CNN framework does only one pass of convolutional feature extraction, and pools the features within each region proposal into a fixed-size feature vector using a process known as RoI Pooling. An RoI in this context is a rectangular window into a convolutional feature map. RoI Pooling works by dividing the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. The RoI Pooling layer is simply the special case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level.

Note that RoI Pooling includes quantising the real valued RoI location, x, to the nearest location on a coarse grid, [x]. Similar quantisation is performed when dividing the RoI into bins. This quantisation introduces misalignment between the RoI and the extracted features. This is illustrated in the figure below:

RoI Pooling results in quantisation of the RoI location.
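The quantisation described above can be made concrete with a minimal sketch in plain Python. This is an illustration under stated assumptions rather than the Fast R-CNN implementation: the function name `roi_max_pool`, the inclusive `(x0, y0, x1, y1)` RoI convention, the round-half-up rule and the toy 6×6 feature map are all hypothetical choices made for this example, and the RoI is assumed to lie inside the feature map.

```python
import math

def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature: list of rows (list of lists of numbers).
    roi: (x0, y0, x1, y1), real-valued feature-map coordinates (assumed convention).
    """
    # First quantisation: snap the real-valued corners to the integer grid.
    x0, y0, x1, y1 = (int(math.floor(v + 0.5)) for v in roi)
    h = max(y1 - y0 + 1, 1)
    w = max(x1 - x0 + 1, 1)
    pooled = []
    for i in range(out_h):
        # Second quantisation: bin edges are floored / ceiled to whole cells.
        r_start = y0 + math.floor(i * h / out_h)
        r_end = y0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + math.floor(j * w / out_w)
            c_end = x0 + math.ceil((j + 1) * w / out_w)
            window = [feature[r][c]
                      for r in range(r_start, r_end)
                      for c in range(c_start, c_end)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Toy feature map whose value encodes its position: feature[r][c] = 6 * r + c.
feature = [[6 * r + c for c in range(6)] for r in range(6)]

# Two different real-valued RoIs that snap to the same integer window.
pooled_a = roi_max_pool(feature, (0.6, 0.4, 4.4, 3.3), 2, 2)
pooled_b = roi_max_pool(feature, (1.4, 0.0, 3.6, 2.6), 2, 2)
print(pooled_a, pooled_b)
```

Both RoIs produce the same pooled output even though their boundaries differ by most of a cell, which is the misalignment that matters for pixel-level masks.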

Another insight used in Fast R-CNN was to combine the multiple tasks performed by R-CNN into a single multi-tasking network. The architecture of the Fast R-CNN network is shown in the figure below.

The architecture of Fast R-CNN. The network includes two parallel branches, one for classification and one for bounding box regression.

The network first processes the whole image to produce a convolutional feature map. Then, for each object proposal an RoI Pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected layers that finally branch into two sibling output layers: one that produces softmax probability estimates over all classes to replace the SVMs in R-CNN and another layer to regress the refined bounding boxes directly from the same features.
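The two sibling output layers can be sketched in a few lines of plain Python. This is a simplified illustration, not the Fast R-CNN code: the names `linear`, `softmax` and `detection_heads` and the tiny hand-picked weights are hypothetical, and the box head here emits a single set of four values rather than one set per class.

```python
import math

def linear(weights, bias, x):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def detection_heads(roi_feature, cls_w, cls_b, box_w, box_b):
    """Apply both sibling heads to the same pooled RoI feature vector."""
    class_probs = softmax(linear(cls_w, cls_b, roi_feature))
    box_offsets = linear(box_w, box_b, roi_feature)
    return class_probs, box_offsets

# Hypothetical 2-D RoI feature and weights: 3 classes (index 0 = background).
roi_feature = [1.0, 2.0]
cls_w, cls_b = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0]
box_w = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.0, 0.2]]
box_b = [0.0, 0.0, 0.0, 0.0]

probs, offsets = detection_heads(roi_feature, cls_w, cls_b, box_w, box_b)
print(probs, offsets)
```

The point of the sketch is that both outputs are computed from the same shared feature vector, so the classification and regression tasks are trained jointly in one network.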

The Fast R-CNN framework still uses Selective Search to find region proposals. This is a rather slow process. In Faster R-CNN, Ren et al. replaced this mechanism with a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.

Following the string of innovations in the R-CNN family of algorithms, we can now look at how Mask R-CNN builds on and adapts these innovations to achieve state-of-the-art in instance segmentation. As mentioned previously, instance segmentation is essentially object detection plus semantic segmentation. Mask R-CNN extends the Faster R-CNN object detection framework by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

The mask prediction branch predicts an m×m mask from each RoI using a Fully Convolutional Network. During training, a multi-task loss on each sampled RoI is defined as L = L_cls + L_box + L_mask. The classification loss L_cls and bounding box loss L_box are identical to those used by Fast R-CNN and Faster R-CNN. The mask branch has a K×m×m dimensional output for each RoI, which encodes K binary masks of resolution m×m, one for each of the K classes. A per-pixel sigmoid is applied to this output, and L_mask is defined as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask; the other mask outputs do not contribute to the loss. This definition decouples mask and class prediction: the network can generate a mask for every class without competition among classes. This is in contrast to the common practice when applying Fully Convolutional Networks to semantic segmentation, where a per-pixel softmax makes the masks of different classes compete.
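As a rough illustration of the mask loss just described, the sketch below computes the average per-pixel binary cross-entropy on the k-th mask only. The function names and the tiny K=2, m=2 example are hypothetical, and the code is not numerically hardened against very large logits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_loss(mask_logits, gt_class, gt_mask):
    """Average binary cross-entropy over the m x m mask of the ground-truth class.

    mask_logits: K masks, each an m x m grid of raw scores.
    gt_class: index k of the RoI's ground-truth class.
    gt_mask: m x m grid of 0/1 target labels.
    """
    logits_k = mask_logits[gt_class]  # masks of the other classes are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits_k, gt_mask):
        for z, t in zip(logit_row, target_row):
            p = sigmoid(z)  # per-pixel sigmoid, no softmax across classes
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

gt_mask = [[1, 0], [1, 0]]
logits = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0: uninformative prediction
    [[5.0, -5.0], [5.0, -5.0]],  # class 1: confident prediction
]
print(mask_loss(logits, 0, gt_mask))  # ln(2) per pixel, since sigmoid(0) = 0.5
```

Changing the logits of class 1 leaves the loss for an RoI of class 0 unchanged, which is the decoupling between mask and class prediction.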

To achieve good instance segmentation results, Mask R-CNN makes one more alteration to the Faster R-CNN architecture to ensure pixel-to-pixel alignment of the masks. As previously mentioned, Fast R-CNN and Faster R-CNN use RoI Pooling to pool the convolutional features within each region proposal into a fixed-size feature vector. This effectively quantises the RoI boundaries, which leads to poor segmentation results. To address this problem, the authors proposed a simple, quantisation-free layer called RoI Align, which faithfully preserves exact spatial locations, to replace the RoI Pooling layer. Instead of rounding the real-valued boundaries of the RoI to the nearest location on a coarse grid, as done in RoI Pooling, RoI Align uses bilinear interpolation to compute the values of the input features at four regularly sampled locations in each RoI bin, and aggregates the result (using max or average). This way, RoI Align avoids any quantisation of the RoI or the bin locations. This is illustrated in the figure below:

Unlike RoI Pooling, the proposed RoI Align operation avoids quantisation by using bilinear interpolation to compute the values at given locations within each bin.
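A minimal sketch of the idea in plain Python follows. It is an illustration only, not the authors' implementation: the names `bilinear` and `roi_align`, the `(x0, y0, x1, y1)` RoI convention, the choice of average aggregation and the treatment of `feature[r][c]` as a sample located at point (r, c) are all assumptions made for this example.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a real-valued point (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, out_h, out_w, samples=2):
    """Average samples x samples interpolated values per bin, with no rounding."""
    x0, y0, x1, y1 = roi
    bin_h = (y1 - y0) / out_h  # real-valued bin size
    bin_w = (x1 - x0) / out_w
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    y = y0 + (i + (sy + 0.5) / samples) * bin_h
                    x = x0 + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature, y, x))
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled

# Toy feature map whose value encodes its position: feature[r][c] = 6 * r + c.
feature = [[float(6 * r + c) for c in range(6)] for r in range(6)]

aligned = roi_align(feature, (0.6, 0.4, 4.4, 3.3), 2, 2)
shifted = roi_align(feature, (0.7, 0.4, 4.5, 3.3), 2, 2)
print(aligned[0][0], shifted[0][0])
```

With the default `samples=2`, each bin is sampled at four points. On this toy map, shifting the RoI by 0.1 of a cell shifts the pooled values by 0.1 as well, so sub-cell changes in the RoI location are reflected in the output rather than rounded away.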

RoI Align seems like a very simple, even trivial change. However, the authors report significant improvement to their results in ablation studies. The figure below illustrates some results with Mask R-CNN.

Mask R-CNN results on the COCO test set. Masks are shown in colour, and bounding box, category, and confidences are also shown.

For more results and experiments as well as application of Mask R-CNN to other tasks such as human body keypoint detection, see the original paper.

Best Student Paper: Focal Loss

The best student paper award also went to a paper from Facebook AI Research (FAIR), the same lab behind Mask R-CNN. The paper, “Focal Loss for Dense Object Detection” by Lin et al., looks at object detection using a single-stage method.

The R-CNN family of object detection methods reviewed above (i.e. R-CNN, Fast R-CNN and Faster R-CNN) are examples of two-stage algorithms where a classifier is applied in the second stage to a sparse set of candidate object locations generated in the first stage. Such two-stage methods have consistently outperformed their single-stage counterparts, such as YOLO and SSD, which are applied over a dense sampling of possible object locations. Such single-stage algorithms have lower accuracy but can potentially be faster and easier to train.

The authors of the Focal Loss paper argued that the lower accuracy of single-stage detectors is primarily due to a large imbalance between background and foreground samples during training. More specifically, they observed that a large number of candidate object locations are in fact background patches which are easily classified, but which still receive a considerable loss during training under the standard cross-entropy criterion.

Cross Entropy Loss. An “easy” example with p_t=0.6 receives a loss of 0.5 (blue circle) under the Cross Entropy criterion, while a “hard” example with p_t=0.1 receives a loss of approximately 2.3 (blue square).

They argued that when using the cross-entropy loss, the cost resulting from the large number of “easy” background examples in the training set can dominate the overall cost. This is not a problem for two-stage detectors, since the first stage already filters out many of the easy background examples, either by Selective Search (as in R-CNN and Fast R-CNN) or by a Region Proposal Network (as in Faster R-CNN). To address this problem, they proposed adding a term to the loss such that the easy examples receive a significantly lower loss compared to the harder examples. Concretely, defining the cross-entropy loss as CE(p_t) = −log(p_t), where:

p_t = p for a foreground example, and p_t = 1 − p for a background example,

and p is the model’s estimated probability for the foreground class, the authors formulate their loss as:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

α_t in the above equation is a weighting factor to balance the foreground and background classes. The main contribution of the paper is the addition of the term (1 − p_t)^γ to the loss function. For γ>0, this term changes the loss function such that the loss received by well-classified samples (p_t > 0.5) is considerably smaller, while the loss received by the harder examples (p_t < 0.5) does not change much.

Comparison of Cross Entropy Loss with Focal Loss (with γ=5). An “easy” example with p_t=0.6 (shown with circles) receives a loss of 0.5 under the Cross Entropy criterion, whereas the corresponding loss for this example under the Focal Loss criterion is 0.005, i.e. two orders of magnitude smaller. On the other hand, the loss for a “hard” example with p_t=0.1 (shown with squares) is reduced by only a relatively small amount, from 2.3 to roughly 1.4.
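The numbers in the caption can be reproduced with a short sketch in plain Python, taking the focal loss as the cross-entropy −log(p_t) scaled by α_t (1 − p_t)^γ. The function names are hypothetical, α_t is set to 1 to match the figure, and the count of 1000 easy examples is a made-up value used only to illustrate how the balance of the total loss shifts.

```python
import math

def p_t(p, is_foreground):
    """Probability assigned to the true class, given foreground probability p."""
    return p if is_foreground else 1.0 - p

def cross_entropy(pt):
    return -math.log(pt)

def focal_loss(pt, gamma, alpha_t=1.0):
    # (1 - pt) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - pt) ** gamma * math.log(pt)

easy, hard, gamma = 0.6, 0.1, 5.0
print(round(cross_entropy(easy), 3), round(focal_loss(easy, gamma), 3))  # 0.511 0.005
print(round(cross_entropy(hard), 3), round(focal_loss(hard, gamma), 3))  # 2.303 1.36

# Hypothetical imbalance: 1000 easy background examples against one hard example.
n_easy = 1000
ce_easy_share = n_easy * cross_entropy(easy) / (
    n_easy * cross_entropy(easy) + cross_entropy(hard))
fl_easy_share = n_easy * focal_loss(easy, gamma) / (
    n_easy * focal_loss(easy, gamma) + focal_loss(hard, gamma))
print(ce_easy_share, fl_easy_share)
```

In this toy setup the single hard example accounts for well under 1% of the total cross-entropy, but for roughly a fifth of the total focal loss, so it has far more influence on the gradient.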

To evaluate the effectiveness of their proposed loss, the authors designed and trained a simple dense detector, called RetinaNet, shown in the figure below. It is made of a feedforward ResNet followed by a Feature Pyramid Network (FPN) to extract rich, multi-scale features. This is in turn followed by two branches, one for classification and one for bounding box regression.

RetinaNet architecture

In extensive ablation studies, the authors show that RetinaNet trained with the Focal Loss outperforms a similar model trained without it, and that it is fairly robust to the exact values of α_t and γ. Furthermore, RetinaNet trained using the proposed Focal Loss outperformed all previous single-stage and two-stage detectors on the bounding box detection task of the COCO challenge. See the original paper for details of their experiments.

The Focal Loss paper was presented in the 4th oral session.

This post reviewed the best papers of ICCV 2017. We also briefly reviewed the developments in recent years that provided the basis for the Mask R-CNN work. In summary, some insights can be drawn from the collection of works reviewed here:

  • Sharing features between multiple tasks can lead to better results and faster inference times.
  • Multi-tasking objective functions are extremely effective when the multiple tasks are related.
  • Using the right operations for a given task is crucial. While RoI Pooling is fine for object detection, it does not provide sufficient alignment accuracy for pixel-level segmentation. Replacing it with a simple operation that avoids misalignment significantly improved the segmentation results in Mask R-CNN.
  • Label imbalance in training data can lead to large decreases in performance. Handling this imbalance appropriately can improve performance considerably.

The success of the papers reviewed in this post confirms once again that designing powerful deep learning algorithms is not just about having lots of data and compute power. It is also important to fully understand the problem at hand and adapt the approach to the requirements of the specific problem.