Paper Summary: Instance-aware Semantic Segmentation via Multi-task Network Cascades

Mike Plotz Sage · 4 min read · Nov 23, 2018

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/07.

Instance-aware Semantic Segmentation via Multi-task Network Cascades (2015) Jifeng Dai, Kaiming He, Jian Sun

I just never get tired of the variations, it seems. Yesterday I learned about something called “RoI warping” and I had to go down the rabbit hole… and here I am. This is another segmentation paper, so there will be faint echoes of U-Net, but the lineage hews closer to Faster R-CNN. The nice thing about reading a family of related papers is that they build on one another, so the summaries can get shorter, which I’ll aim for here.

There are a few new things in the paper. The first thing to note is that the problem at hand is “instance-aware” semantic segmentation (which is not strictly novel to this paper, but none of the papers I’ve covered so far do it). This means that we’re not just producing per-pixel class labels, we’re also tracking object instances, which afaict means a bounding box in addition to a pixel mask. The second thing is what the authors call Multi-task Network Cascades (MNCs), which is a fancy way of saying there are multiple sub-networks with dependencies — the output of one stage feeds into the next, like, well, a cascade. The third and final new thing is the differentiable cropping and warping layer that drew me to this paper in the first place: the RoI warping layer.

The 3-stage architecture. The stages share convolutional features and are otherwise laid out sequentially. Each has its own loss function; the paper makes a big deal of the fact that the losses are not independent of one another, so you have to be careful about propagating gradients correctly, but honestly this network doesn’t seem particularly more complex than the others we’ve considered here. The stages (there’s a rough sketch of stages 2 and 3 after this list):

  • Proposing boxes. This is the region proposal network (RPN) of Ren et al 2015’s Faster R-CNN
  • Mask-level instances. These are still class-agnostic; includes 14x14 RoI pooling (i.e. adaptive pooling, though the term isn’t used in this paper); FC+relu to 256 dimensions, followed by another FC layer yielding a 28²-d (i.e. 28x28) mask
  • Categorizing instances. RoI pooling features are masked by the previous stage. From this point there are two pathways, one masked and one not (to handle examples with low IoU? or negative examples?), each with two 4096-d FC layers (that’s a lot of parameters!). These are concatenated and put through an (N+1)-way softmax (the +1 is for a background class)
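Here’s a rough PyTorch-flavored sketch of how I picture stages 2 and 3 wiring together. To be clear, this is my reconstruction, not the authors’ code: the 14x14 pooling, 28²-d mask, 256-d and 4096-d FC sizes come from the paper, but the module structure, the crude roi_warp stand-in, and the detail of resizing the mask to 14x14 before masking the features are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def roi_warp(feats, rois, size):
    """Hypothetical stand-in for the paper's RoI warping layer: crop each RoI
    from the shared feature map and bilinearly resize it (assumes batch size 1)."""
    crops = []
    for (x1, y1, x2, y2) in rois.tolist():
        crop = feats[:, :, int(y1):int(y2) + 1, int(x1):int(x2) + 1]
        crops.append(F.interpolate(crop, size=size, mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)  # [num_rois, C, 14, 14]

class Stages2And3(nn.Module):
    def __init__(self, num_classes, channels=512):
        super().__init__()
        # stage 2: class-agnostic mask head (FC+relu to 256-d, then a 28^2-d mask)
        self.mask_head = nn.Sequential(
            nn.Linear(channels * 14 * 14, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28),
        )
        # stage 3: two pathways (masked and unmasked features), each 2x 4096-d FC
        def pathway():
            return nn.Sequential(
                nn.Linear(channels * 14 * 14, 4096), nn.ReLU(),
                nn.Linear(4096, 4096), nn.ReLU(),
            )
        self.masked_path, self.unmasked_path = pathway(), pathway()
        self.classifier = nn.Linear(2 * 4096, num_classes + 1)  # +1 for background

    def forward(self, feats, rois):
        pooled = roi_warp(feats, rois, (14, 14))
        mask = torch.sigmoid(self.mask_head(pooled.flatten(1))).view(-1, 1, 28, 28)
        mask14 = F.interpolate(mask, size=(14, 14), mode="bilinear", align_corners=False)
        masked = (pooled * mask14).flatten(1)  # mask the RoI features with stage 2's output
        h = torch.cat([self.masked_path(masked),
                       self.unmasked_path(pooled.flatten(1))], dim=1)
        return self.classifier(h), mask        # class logits, instance mask
```

The real model also regresses box coordinates in stage 3 and, of course, uses the differentiable RoI warping layer described below rather than my crude crop-and-resize.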

RoI warping. I had a little trouble following the math here, honestly. Not because it’s complicated exactly, I just didn’t quite get why the bilinear interpolation term works as intended (update: reading Jaderberg 2015 clears things up quite a bit, stay tuned for tomorrow’s summary). In any case, the idea is to define a cropping and warping transform, which can be expressed as a separable bilinear interpolation, mapping a parameterized rectangular region of the input onto a fixed HxW rectangle. The result can then be max pooled and treated like an RoI pooling layer.
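For what it’s worth, here’s the shape of the thing as I understand it (my notation, not the paper’s, and I’m glossing over the exact mapping from output cells back into the box):

F_RoI(x', y') = Σ_(x,y) κ(x_s(x') − x) · κ(y_s(y') − y) · F(x, y),   with κ(d) = max(0, 1 − |d|)

Here (x_s(x'), y_s(y')) is the real-valued point in the input feature map that output cell (x', y') maps back to inside the proposed box. The κ term is the piece that confused me: it’s nonzero only for the grid cells nearest the sample point and weights them linearly, which is what makes the whole crop-and-warp differentiable with respect to both the features and the box coordinates.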

The authors draw a comparison to the Spatial transformer networks from Jaderberg et al 2015, which coincidentally I was planning to summarize next. (Incidentally in trying to understand how this layer works I stumbled across tf.image.crop_and_resize (code) and https://github.com/longcw/RoIAlign.pytorch, which do this.)
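To make the operation concrete, here’s a minimal usage sketch of the TensorFlow op (the shapes are arbitrary; boxes are in normalized [y1, x1, y2, x2] coordinates, and the op just crops each box and bilinearly resizes it):

```python
import tensorflow as tf

feats = tf.random.normal([1, 64, 64, 256])    # [batch, H, W, C] feature map
boxes = tf.constant([[0.1, 0.2, 0.6, 0.8]])   # one RoI, normalized [y1, x1, y2, x2]
box_indices = tf.constant([0])                # which batch element each box crops from
pooled = tf.image.crop_and_resize(feats, boxes, box_indices, crop_size=[14, 14])
print(pooled.shape)                           # (1, 14, 14, 256)
```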

Another new thing the authors do is repeat the 2nd and 3rd stages of the network at inference time, to refine the masks and bounding boxes. Going back and training with the full 5-stage network gave some additional accuracy gains in their testing, unsurprisingly. My take: I’m a little surprised they got good results from a 5-stage network that was only trained in the 3-stage configuration. I suppose this is the result of good API design, for lack of a better term: the sub-networks are modular enough to be glued together without additional training.
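In pseudocode-ish Python, the 5-stage inference is just the following (the stage functions are hypothetical placeholders for the pieces described above):

```python
def five_stage_inference(shared_feats, rpn, mask_head, classify_head):
    boxes = rpn(shared_feats)                                  # stage 1: box proposals
    masks = mask_head(shared_feats, boxes)                     # stage 2: class-agnostic masks
    scores, boxes = classify_head(shared_feats, boxes, masks)  # stage 3: classify + box regression
    masks = mask_head(shared_feats, boxes)                     # stage 4: stage 2 again, on refined boxes
    scores, boxes = classify_head(shared_feats, boxes, masks)  # stage 5: stage 3 again
    return boxes, masks, scores
```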

Again a grab bag of details:

  • Non-maximum suppression on candidate boxes. This is a recurring theme and I finally had to ask myself how this discrete step can fit into a differentiable network. The authors helpfully point out that NMS is similar to max pooling and maxout, which are “implemented as routers of forward/backward pathways.” I periodically have to remind myself that in deep learning, mostly differentiable is good enough.
  • Producing the final masks is pretty cool — they do “mask voting,” which starts with another round of non-max suppression; then, for each non-suppressed instance, they find similar suppressed instances and do a pixel-wise weighted average of their masks (the weights are the classification scores). There’s a rough sketch of this after the list.
  • Ablation results: significant gain from training the sub-networks jointly (over and above just sharing conv features); some gain from 5-stage training; big gains on MS COCO with ResNet-101.
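Since NMS and mask voting keep coming up, here’s a small NumPy sketch of both. It’s simplified: real implementations are vectorized, and I’m treating “similar suppressed instances” as simply the instances each kept box suppressed, which isn’t exactly the paper’s criterion.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against many; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = scores.argsort()[::-1]
    keep, suppressed_by = [], {}
    while order.size > 0:
        i = order[0]
        keep.append(i)
        ious = iou(boxes[i], boxes[order[1:]])
        suppressed_by[i] = order[1:][ious >= iou_thresh]  # remember who i suppressed
        order = order[1:][ious < iou_thresh]
    return keep, suppressed_by

def mask_voting(keep, suppressed_by, masks, scores):
    """For each kept instance, average its mask with the masks it suppressed,
    weighted pixel-wise by classification score."""
    voted = {}
    for i in keep:
        group = np.concatenate([[i], suppressed_by[i]]).astype(int)
        w = scores[group][:, None, None]
        voted[i] = (masks[group] * w).sum(0) / w.sum()
    return voted
```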

And that’s it! More warping and transforming tomorrow; I’ll go into the math in a bit more detail.

Max Jaderberg et al 2015 “Spatial transformer networks” https://arxiv.org/abs/1506.02025

Shaoqing Ren et al 2015 “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” https://arxiv.org/abs/1506.01497
