(PPS) Deformable Convolutional Networks

Kevin Shen · Published in Mini Distill · Jun 18, 2018

The basic idea of this paper is to give the convolution and pooling layers the ability to model different orientations and scales of objects in images. They do this by making the shape of the convolution filter learnable. For intuition, consider the hypothetical benefit of a deformable convolutional filter. Suppose we want to detect a cat. Using conventional filters, we would have to learn a separate filter for each scale or orientation of the cat. Now suppose we can learn a set of deformable filters. This would only require us to learn two filters: a cat detector and a deformation filter that tells us how to deform the cat filter to find the cat in the image. The deformation filter molds the cat filter to suit the situation. For example, if the cat in the image is rotated 180 degrees, we can rotate the cat filter 180 degrees. Given these two filters, we would be able to detect cats at any scale or orientation. But hang on a minute: aren't we just pushing the problem back, since we'd now need a deformation filter at every scale and orientation? Perhaps, but intuitively it feels like it's much easier to predict the scale or orientation of an object than to identify the object itself. Therefore, maybe we can find a one-size-fits-all deformation filter. This is the motivation for deformable convolutional networks.

I believe most computer vision papers on invariance/equivariance (capsule networks included) understate the distinction between a model which has the capacity to be invariant/equivariant and a model which will actually learn invariance/equivariance when trained using the method proposed in the paper. Most papers fall into the first camp. The authors propose a method with theoretical efficiency gains over existing models, but the way they train the model gives us no reason to believe it will actually achieve those gains in practice. Capsule networks are one such example. Deformable convolutions are another. Sometimes the authors worsen the problem by claiming their new model removes the need for data augmentation. Just because your model can efficiently represent different scales and orientations, it doesn't mean it will learn those scales and orientations for free during training. The model might still need to see augmented data!

As hinted, this paper falls into the first camp. While we have theoretical reasons to suspect that a deformable convolution framework can yield efficiency gains over regular convolutions, it's not clear these gains will be realized in practice. But without further ado, here is the deformable convolution:

Usually a convolution samples from a 3x3 square in the input feature map to compute the value at a single location in the output feature map. Deformable convolution instead computes offsets so that the filter samples from arbitrary points around that 3x3 grid in the input map. The offsets themselves are computed using a convolution layer. All in all, we have two filters for each layer: the usual convolution filter, which computes the output map from the input map, and the deformation filter, which computes the offsets of the convolution filter given the input map. The forward pass occurs sequentially: first compute the offsets using the deformation filter, then compute the output feature map using the offset convolution filter.
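To make this concrete, here's a minimal sketch of one such layer in PyTorch, leaning on torchvision's `deform_conv2d` to do the bilinear sampling. The layer name, 3x3 kernel, and zero-initialized offset filter are my own illustrative choices, not the paper's exact configuration (though the paper does also initialize offsets to zero, so training starts from a regular convolution):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    """One deformable conv layer: a deformation filter plus a regular filter."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        # Deformation filter: predicts a (dy, dx) offset for each of the
        # k*k sampling points, at every output location -> 2*k*k channels.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        # Zero-init so all offsets start at 0, i.e. a regular convolution.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        # The usual convolution filter.
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.padding = padding

    def forward(self, x):
        # Step 1: compute offsets from the input feature map.
        offsets = self.offset_conv(x)  # shape (N, 2*k*k, H, W)
        # Step 2: convolve at the offset sampling locations
        # (bilinear interpolation happens inside deform_conv2d).
        return deform_conv2d(x, offsets, self.weight, padding=self.padding)

x = torch.randn(1, 8, 32, 32)
layer = DeformableConv2d(8, 16)
print(layer(x).shape)  # torch.Size([1, 16, 32, 32])
```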

Both filters are learned using backpropagation, and a couple of tricks have to be employed to make everything differentiable. In particular, the offset is treated as a continuous value, and pixel values are bilinearly interpolated when the convolution for the output feature map is performed. For example, if the offset along one axis is 0.25, you take 0.75 times the value of the original sampling pixel plus 0.25 times the value of its neighbor; in 2D, this interpolation is done bilinearly over the four nearest pixels.
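As a sanity check on that weighting, here's a tiny NumPy sketch of the 1D case (the function name and values are just for illustration); the interpolated output is a smooth function of the offset, which is exactly what makes the offsets learnable by backpropagation:

```python
import numpy as np

def sample_1d(row, pos):
    """Linearly interpolate `row` at a fractional position `pos`."""
    lo = int(np.floor(pos))
    frac = pos - lo  # fractional part of the offset, e.g. 0.25
    # Each neighbor is weighted by (1 - its distance to the sample point).
    return (1 - frac) * row[lo] + frac * row[lo + 1]

row = np.array([10.0, 20.0, 30.0])
print(sample_1d(row, 1.25))  # 0.75*20 + 0.25*30 = 22.5
```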

The authors report moderate improvements over previous state-of-the-art on detection and segmentation tasks, although it's kind of weird that they didn't explicitly evaluate the model's ability to handle rotations.
