Paper Summary: U-Net: Convolutional Networks for Biomedical Image Segmentation

Mike Plotz
Nov 19, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/03.

U-Net: Convolutional Networks for Biomedical Image Segmentation (2015) https://arxiv.org/abs/1505.04597 Olaf Ronneberger, Philipp Fischer, Thomas Brox

This is a classic paper based on a simple, elegant idea — support pixel-level localization by concatenating pre-downsample activations with the upsampled features later on, at multiple scales — but again there are some surprises in the details of this paper that go a bit beyond the architecture diagram. (Note: localization refers to per-pixel output, not l10n.)

Localization and image segmentation (localization with some extra stuff like drawing object boundaries) are challenging for typical CNN image classifier architectures since the standard approach throws away spatial information as you get deeper into the network. You can get per-pixel output by scaling back up to output the full size in each forward pass (as in Long 2014) or you can use a sliding window approach (Ciresan 2012 — good results, but slow). Both of these approaches exhibit this sort of Heisenbergian trade-off between spatial accuracy and the ability to use context. This paper’s authors found a way to do away with the trade-off entirely.

Here’s the U-Net architecture they came up with (Figure 1 in the paper):

The intuition is that the max pooling (downsampling) layers give you a large receptive field but throw away most spatial information, so a reasonable way to reintroduce good spatial information is to add skip connections across the U. The architecture has two phases, a “contracting path” and an “expansive path.” Each section of the contracting path applies two 3x3 convolutions with ReLUs, followed by downsampling (a 2x2 max pool with stride 2). The expansive path is basically the same, but — and here’s the big U-Net idea — each 2x2 up-convolution is concatenated with the cropped feature activations from the opposite side of the U. The crop is needed because the convolutions are unpadded (“valid”), so the contracting-side feature maps are larger than their expansive-side counterparts. The whole thing ends with a 1x1 convolution to output class labels.
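Those valid convolutions have a concrete consequence for sizes. A quick sketch of my own (not the authors' code) that traces the spatial dimension through the U reproduces the 572 → 388 input/output sizes from the paper's architecture diagram:

```python
def unet_output_size(input_size: int, depth: int = 4) -> int:
    """Trace the spatial size through U-Net's valid (unpadded) convolutions."""
    size = input_size
    for _ in range(depth):   # contracting path
        size -= 4            # two 3x3 valid convs lose 2 pixels each
        size //= 2           # 2x2 max pool, stride 2
    size -= 4                # two convs at the bottom of the U
    for _ in range(depth):   # expansive path
        size *= 2            # 2x2 up-convolution doubles the size
        size -= 4            # two more 3x3 valid convs
    return size

print(unet_output_size(572))  # -> 388, matching the paper's diagram
```

This also shows why the skip connections need cropping: at each level the contracting-side feature map is a few pixels larger than the upsampled map it gets concatenated with.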

As I mentioned above, there were some additional details needed to get good results overall:

  • The authors used an overlapping tile strategy to apply the network to large images, and used mirroring to extend past the image border
  • Data augmentation included elastic deformations
  • The loss function included per-pixel weights both to balance overall class frequencies and to draw a clear separation between objects of the same class (more on this below)
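The mirror-extrapolation past the image border can be sketched with NumPy's reflect padding (a sketch, not the authors' code; in practice the pad width would be chosen to cover the network's receptive-field margin):

```python
import numpy as np

# Toy 4x4 "image": mirror it past the border, as the paper does for
# tiles near the image edge. mode="reflect" mirrors without repeating
# the edge pixel, matching the paper's overlap-tile figure.
img = np.arange(16, dtype=float).reshape(4, 4)
padded = np.pad(img, pad_width=2, mode="reflect")
print(padded.shape)    # (8, 8)
print(padded[0, :4])   # [10. 9. 8. 9.]: mirrored rows and columns
```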

Data augmentation: along with the usual shift, rotation, and color adjustments, they added elastic deformations. This was done by sampling random displacement vectors on a coarse 3x3 grid and interpolating them bicubically to get per-pixel displacements. (Oddly enough, the only mention of dropout in the paper is in the data augmentation section, where the authors frame it as a form of implicit augmentation, rather than in the architecture description.)
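A rough, dependency-free sketch of the elastic deformation idea is below. The paper interpolates the coarse grid bicubically; to keep this NumPy-only I use nearest-neighbor upsampling, which is cruder, and `alpha` is just an illustrative magnitude, not the paper's 10-pixel standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

def elastic_deform(img, grid=3, alpha=2.0):
    """Sample a coarse grid of random displacement vectors, upsample to
    per-pixel displacements, and resample the image at the displaced
    coordinates (nearest-neighbor everywhere, for simplicity)."""
    h, w = img.shape
    # coarse displacement fields, one per axis, shape (2, grid, grid)
    coarse = rng.normal(scale=alpha, size=(2, grid, grid))
    # nearest-neighbor upsample of the displacement grid to full resolution
    ys = np.arange(h) * grid // h
    xs = np.arange(w) * grid // w
    dy = coarse[0][np.ix_(ys, xs)]
    dx = coarse[1][np.ix_(ys, xs)]
    # read each output pixel from its displaced (rounded, clipped) source
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(yy + dy).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xx + dx).astype(int), 0, w - 1)
    return img[src_y, src_x]

img = np.arange(64.0).reshape(8, 8)
warped = elastic_deform(img)
print(warped.shape)  # (8, 8)
```

The same displacement field would be applied to the label mask so image and labels stay aligned.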

Pixel weighting:

The basic idea is to add a class weight (to upweight rarer classes) plus a separation term computed via morphological operations: for each pixel, find the distance to the borders of the two nearest objects of interest and upweight the pixel when both distances are small. This encourages the network to learn to draw thin background boundaries between touching objects.
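The separation term from the paper is w0 · exp(−(d1 + d2)² / (2σ²)), with w0 = 10 and σ ≈ 5 pixels, where d1 and d2 are the distances to the two nearest objects. A brute-force sketch (real code would use a distance transform, and I omit the class-balancing term for brevity):

```python
import numpy as np

def separation_weights(objects, shape, w0=10.0, sigma=5.0):
    """Per-pixel separation weight w0 * exp(-(d1 + d2)^2 / (2 sigma^2)),
    where d1, d2 are distances to the two nearest objects. `objects` is
    a list of (row, col) coordinate arrays, one array per object.
    w0=10 and sigma=5 are the values reported in the paper."""
    h, w = shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pts = np.stack([yy.ravel(), xx.ravel()], axis=1)  # (h*w, 2)
    # distance from every pixel to the nearest pixel of each object
    dists = np.stack([
        np.sqrt(((pts[:, None, :] - obj[None, :, :]) ** 2).sum(-1)).min(1)
        for obj in objects
    ])  # (n_objects, h*w)
    dists.sort(axis=0)                 # sort per pixel across objects
    d1, d2 = dists[0], dists[1]        # two smallest distances
    return (w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma**2))).reshape(h, w)

# Two one-pixel "cells" with a small gap: the gap gets the largest weight.
objs = [np.array([[5, 2]]), np.array([[5, 6]])]
wmap = separation_weights(objs, (10, 10))
print(wmap[5, 4] > wmap[5, 0])  # True: pixels between the cells dominate
```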

The data augmentation and class weighting made it possible to train the network on only 30 labeled images! They also used a batch size of 1, favoring large input tiles over large batches to make the best use of GPU memory, and compensated with a high momentum of 0.99 so that each gradient update effectively reflected many recent samples. So, pretty cool ideas, appealingly intuitive, though if I’m reading the results correctly it appears that this approach is still far behind human performance. The next paper I’ll summarize uses a U-Net architecture (that’s how I ended up reading this one), and the idea seems to be pretty common in image segmentation even ~3 years later.
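A toy calculation shows why 0.99 momentum compensates for batch size 1: the velocity is an exponentially weighted sum of past per-sample gradients, with an effective horizon of about 1 / (1 − 0.99) = 100 samples:

```python
import numpy as np

m = 0.99  # momentum from the paper
v = 0.0   # velocity accumulator
for grad in np.ones(300):  # 300 identical unit per-sample gradients
    v = m * v + grad       # exponentially weighted sum of past gradients
print(round(1 / (1 - m)))  # 100: effective number of samples per update
print(90 < v < 100)        # True: v approaches the 1/(1-m) = 100 limit
```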

Ciresan et al 2012 “Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images” https://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images

Long et al 2014 “Fully Convolutional Networks for Semantic Segmentation” https://arxiv.org/abs/1411.4038
