Paper Summary: Spatial Transformer Networks

Mike Plotz
4 min read · Nov 17, 2018

--

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/08.

Spatial Transformer Networks (2016) https://arxiv.org/abs/1506.02025 Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

Continuing on the warping theme, this paper does indeed share many similarities with the previous RoI warping paper. So the problem with CNNs is that they don’t efficiently learn the invariances we’d like them to (translation, scaling, rotation, and other distortions). Efficiency is the key word here: they do learn invariances (see the discussion on “image pyramids” and scale invariance in the Fast R-CNN paper, e.g.), but they require lots of training data and lots of network parameters to do it. Since we expect many of our real-world datasets to contain regularities that can be represented by simple transforms, it seems reasonable to expect that we can gain from encoding these priors into our network architectures.

Enter the Spatial Transformer module. The idea is similar to attention in that the spatial transformer can act like a cropping function, giving subsequent layers access to only a part of the input. This paper might be closest to Gregor et al 2015, which uses a differentiable attention mechanism with Gaussian kernels, but the approach here is more flexible.

The spatial transformer module consists of three parts:

  • A localisation network (I’ll follow the paper in using British spellings), which can be any network architecture you like, provided it outputs the transformation parameters θ in its final regression layer
  • A grid generator, which takes the θ from the localisation network and defines a parameterized transformation that maps target (output) coordinates (x_i^t, y_i^t) back onto the source (input) feature coordinates (x_i^s, y_i^s) — note that this is a backwards mapping! (An affine example follows this list.)
  • A differentiable sampler. The formula goes like this: for each channel, for each output location, range over the entire input and add up the input pixels weighted by the sampling kernels (more on this shortly).
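
For the common affine case the grid generator’s pointwise mapping is just a 2×3 matrix applied to homogeneous target coordinates (the paper works in coordinates normalized to [−1, 1]):

(x_i^s, y_i^s)^T = \mathcal{T}_\theta(G_i) = A_\theta \, (x_i^t, y_i^t, 1)^T, \quad A_\theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}

Attention-style cropping (translation plus isotropic scaling) is the special case A_θ = [[s, 0, t_x], [0, s, t_y]].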

So to recap, we generate transform parameters from the input convolution features, we map each output pixel back onto coordinates in the input space, and sample from the input using these mapped coordinates. The transform defined by the grid generator can be any parameterized differentiable transformation: affine, translation + isotropic scaling, piecewise affine, splines, etc.
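
To make that pipeline concrete, here’s a minimal PyTorch sketch of an affine ST module (my own illustration, not the authors’ code; the localisation net architecture and layer sizes are arbitrary choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        # Localisation network: any net that regresses the 6 affine params θ
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),
        )
        # Initialise the final regression layer to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                          # transformation parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
        return F.grid_sample(x, grid, align_corners=False)          # differentiable (bilinear) sampler

x = torch.randn(4, 1, 28, 28)
warped = SpatialTransformer(1)(x)  # output size is a free choice; here it matches the input
```

Starting the regression layer at the identity transform keeps the module from warping wildly early in training.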

This is basically a more flexible version of the RoI warping layer from yesterday’s paper. I found the math quite a bit easier to follow this time around, which might be because of the presentation, or might be because multiple passes generally help understanding. In any case, here’s the formula for the output pixels V_i^c (i is just a pixel index, c is the channel) in terms of the input coordinates and the input features U_nm^c:
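
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)

(This is the bilinear-kernel special case; the general form replaces the two max factors with a generic sampling kernel k(x_i^s − m; Φ_x) k(y_i^s − n; Φ_y).)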

Note that for each i the kernel functions (the max(0, …) factors) are uniformly 0 except for the 4 input pixels closest to (x_i^s, y_i^s). This can be efficiently implemented by replacing the double sum with a sum over the kernel support. Gradients w.r.t. θ can be calculated since the partial derivatives ∂V_i^c/∂U_nm^c, ∂V_i^c/∂x_i^s, and ∂V_i^c/∂y_i^s are readily available, and ∂x_i^s/∂θ, ∂y_i^s/∂θ fall straight out of the grid generator’s transform.
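
For the bilinear kernel the coordinate gradient is a simple piecewise (sub)gradient, roughly (the y case is symmetric):

\partial V_i^c / \partial x_i^s = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, \max(0, 1 - |y_i^s - n|) \cdot \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases}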

The sampling kernels can be any function with subgradients defined — in the above we assumed bilinear sampling kernels, which you can think of as a bilinear interpolation of the 4 closest input pixels, but you could go even simpler and just copy the nearest pixel (the paper shows this too; I assume the gradients are less informative in this case).
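
If it helps to see the sampler spelled out, here’s a tiny NumPy sketch of both kernels for a single output pixel (my own illustration, not the authors’ code), using the kernel-support trick mentioned above so only 4 input pixels are touched:

```python
import numpy as np

def sample_bilinear(U, xs, ys):
    """Bilinear sampling of feature map U (H, W) at source coords (xs, ys)."""
    H, W = U.shape
    x0, y0 = int(np.floor(xs)), int(np.floor(ys))
    out = 0.0
    for n in (y0, y0 + 1):          # only the 4 pixels in the kernel support
        for m in (x0, x0 + 1):
            if 0 <= n < H and 0 <= m < W:
                w = max(0.0, 1 - abs(xs - m)) * max(0.0, 1 - abs(ys - n))
                out += w * U[n, m]
    return out

def sample_nearest(U, xs, ys):
    """Integer sampling kernel: just copy the nearest input pixel."""
    H, W = U.shape
    n, m = int(round(ys)), int(round(xs))
    return U[n, m] if (0 <= n < H and 0 <= m < W) else 0.0
```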

These modules can simply be dropped into CNNs, they’re very fast, and they’re even interpretable to some extent. E.g. the network the authors used for the CUB-200-2011 birds dataset has two ST modules in parallel, one of which learns to track the bird’s head, while the other tracks the body — you can think of this as doing rudimentary pose estimation, not just cropping. It occurs to me that you might get good results cascading ST modules in a similar way to yesterday’s Multi-task Network Cascades, depending on the task.

In any case, here’s a video of some of their results: https://goo.gl/qdEhUu.

Karol Gregor et al 2015 “DRAW: A Recurrent Neural Network For Image Generation” https://arxiv.org/abs/1502.04623
