Notes on “Deformable Convolutional Networks”

Felix Lau
3 min read · Mar 22, 2017


Dai, Jifeng, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. “Deformable Convolutional Networks.” arXiv preprint arXiv:1703.06211 [cs.CV]. http://arxiv.org/abs/1703.06211

  • This paper introduces a new form of convolution and pooling: deformable convolution and deformable RoI pooling. The authors claim these modules can be swapped into existing networks easily. They effectively give the network a dynamic, learnable receptive field.
  • Motivation: the authors argue that CNNs are inherently limited in modeling large, unknown geometric transformations and rely on data augmentation to learn them. In fully convolutional networks (FCNs), being able to dynamically adjust the receptive field is especially desirable.

Deformable Convolution

Note that the arrows in the offset field correspond to how the blue squares are shifted in the input feature map.
  • Deformable convolution consists of two parts: a regular conv layer and another conv layer that learns a 2D offset for each sampling location. In this diagram, the regular conv layer is fed the blue squares instead of the green squares. A minimal code sketch of the whole operation follows below.
  • If you are confused (like I was), you can think of deformable convolution as a “learnable” dilated (atrous) convolution in which the dilation rate is learned and can differ for each input location. Section 3 of the paper is a great read if you’d like to learn more about how deformable convolution relates to other techniques.
  • Since the offsets are fractional rather than integer, bilinear interpolation is used to sample from the input feature map. The authors point out that this can be computed efficiently (see Table 4 for forward-pass times).
  • The 2D offsets are encoded in the channel dimension (e.g. a 3×3 kernel has N = 9 sampling locations, so it is paired with an offset conv layer of 2N = 18 channels, one (x, y) pair per location).
  • Note that the offsets are initialized to 0, and the learning rate for the offset layers is not necessarily the same as that of the regular convolution layers (though it is by default in this paper).
  • The authors empirically show that deformable convolution is able to “expand” the receptive field for bigger objects. They measure “effective dilation”, the mean distance between the sampling locations (i.e. the blue squares in Fig. 2), and find that deformable filters centered on larger objects have a larger “receptive field”. See below.
From Fig. 5: red dots are the sampling locations (from the learned offsets) of a deformable convolution filter, and green squares are the corresponding outputs. The filter on the larger object has a larger receptive field.
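To make the mechanics concrete, here is a minimal PyTorch-style sketch of the idea (my own toy version, not the authors’ implementation, which fuses sampling and convolution in CUDA). An ordinary conv layer predicts the 2N offset channels (zero-initialized, so training starts from a regular grid), the input is resampled at the fractional positions with bilinear interpolation via F.grid_sample, and the regular kernel weights are then applied to the resampled values. All class and variable names are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableConv2dSketch(nn.Module):
    """Toy 3x3 deformable convolution (stride 1, zero padding via grid_sample)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        n = k * k                                              # N sampling locations
        # Offset branch: 2N channels, one (dy, dx) pair per kernel location.
        self.offset_conv = nn.Conv2d(in_ch, 2 * n, k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)                # offsets start at 0,
        nn.init.zeros_(self.offset_conv.bias)                  # i.e. a regular conv
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        b, c, h, w = x.shape
        k, n = self.k, self.k * self.k
        offsets = self.offset_conv(x).view(b, n, 2, h, w)      # learned Δp_k

        ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                                torch.arange(w, device=x.device), indexing="ij")
        base_y, base_x = ys.float(), xs.float()                # output positions p

        r = k // 2
        locs = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
        sampled = []
        for i, (dy, dx) in enumerate(locs):
            # Sampling position p + p_k + Δp_k; Δp_k is fractional, so we read
            # the feature map with bilinear interpolation (grid_sample).
            py = base_y + dy + offsets[:, i, 0]
            px = base_x + dx + offsets[:, i, 1]
            grid = torch.stack([px / (w - 1) * 2 - 1,           # grid_sample wants
                                py / (h - 1) * 2 - 1], dim=-1)  # (x, y) in [-1, 1]
            sampled.append(F.grid_sample(x, grid, align_corners=True))

        # Apply the regular kernel weights to the resampled values
        # (expressed here as a 1x1 conv over the N stacked samples).
        stacked = torch.cat(sampled, dim=1)                     # (B, N*C, H, W)
        w1x1 = self.weight.permute(0, 2, 3, 1).reshape(self.weight.size(0), -1, 1, 1)
        return F.conv2d(stacked, w1x1, self.bias)
```

Sampling each of the N locations separately like this is much slower than the fused kernel the paper benchmarks in Table 4, but it makes the p + p_k + Δp_k structure explicit.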

Deformable RoI Pooling

  • Deformable RoI pooling also consists of two parts: a regular RoI pooling layer and a fully connected layer that learns one offset per pooling bin.
  • Instead of predicting raw offsets (in pixels), the offsets are normalized (i.e. divided) by the width and height of the RoI so that they are invariant to RoI size.
  • There is a curious constant scalar gamma which further scales the normalized offsets (the paper sets it empirically to 0.1). The sketch below shows how the normalization and gamma scaling fit together.
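As a hedged illustration of that normalization (names and shapes are my assumptions, not from the paper’s code), the offset branch can be sketched like this: a fully connected layer predicts one normalized offset per bin, which is then scaled by gamma and by the RoI’s width and height to recover pixel offsets.

```python
import torch
import torch.nn as nn

class DeformableRoIPoolOffsets(nn.Module):
    """Sketch of the offset branch of deformable RoI pooling (k x k bins)."""
    def __init__(self, feat_dim, k=7, gamma=0.1):
        super().__init__()
        self.k, self.gamma = k, gamma
        # fc layer on top of the regular RoI-pooled features predicts one
        # normalized (dx, dy) offset per bin; zero-initialized like the conv case.
        self.fc = nn.Linear(feat_dim, 2 * k * k)
        nn.init.zeros_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)

    def forward(self, pooled_feat, roi_wh):
        # pooled_feat: (num_rois, feat_dim) flattened regular RoI-pooled features
        # roi_wh:      (num_rois, 2) width and height of each RoI in pixels
        norm_offsets = self.fc(pooled_feat).view(-1, self.k * self.k, 2)
        # Scaling by gamma and by the RoI size means the prediction is a
        # *relative* shift, invariant to how large the RoI happens to be.
        return self.gamma * norm_offsets * roi_wh[:, None, :]   # pixel offsets per bin
```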

Open Questions / Comments / Thoughts:

  • I’m quite impressed and surprised that the offsets do not need to be specifically regularized. (Maybe batch normalization indeed solves everything, contrary to http://nyus.joshuawise.com/batchnorm.pdf)
  • I’d be interested in results where the offsets are simply determined by an affine transformation of the regular grid. This should significantly lower the number of parameters while still providing some form of “deformability”. Maybe there is already prior work on this? A toy sketch of this idea appears after this list.
  • It’s impressive that the implementation seems to be quite efficient!
  • There is no experiment on the effect of reducing the extent of data augmentation (especially scale) when using deformable convolution. It would be even more convincing if there were.
  • I’m curious about the results of applying deformable offsets to regular max pooling, which the authors didn’t mention.
  • It’s awesome to see more fundamental research into convolution, the basic building block of deep learning. Recent examples include grouped convolution and separable convolution.
  • I am wondering what this means in terms of model interpretability.
  • The experiments are quite well done, but some of the figures are not very self-explanatory (e.g. Fig. 5).
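As a toy illustration of the affine idea above (entirely my own sketch, not something from the paper), the 2N offsets could be derived from just six affine parameters per location:

```python
import torch

def affine_offsets(theta, kernel_size=3):
    """Derive the 2N deformable-conv offsets from a 6-parameter affine transform
    of the regular sampling grid, instead of learning all 2N offsets freely.
    theta: (..., 2, 3) affine matrices, e.g. predicted by a conv head with 6 channels."""
    r = kernel_size // 2
    ys, xs = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij")
    grid = torch.stack([ys, xs], dim=-1).float().reshape(-1, 2)    # (N, 2) regular grid p_k
    # Transformed positions A @ p_k + t; the offset is the difference from p_k.
    transformed = grid @ theta[..., :2].transpose(-1, -2) + theta[..., 2].unsqueeze(-2)
    return transformed - grid                                      # (..., N, 2) offsets

# With theta = [[1, 0, 0], [0, 1, 0]] (identity), all offsets are zero, recovering a
# regular convolution; a pure scaling matrix recovers dilated convolution.
```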

Let me know your thoughts in the comments below! Follow me on Twitter if you’d like to read more of these paper notes.
