Understanding Wide Residual Networks

Michael Poli
Sep 9, 2018 · 4 min read

Today, as the first post in this series on why ResNets work and how they can be improved, I would like to take a look at a fairly recent paper: Wide Residual Networks (Sergey Zagoruyko, Nikos Komodakis, 2016).

The idea is clever but fairly simple to understand. In classic ResNet architectures, the focus has typically been on improving the representational capability of the model by increasing the number of layers, making the network deeper and thinner. This, in turn, leads to the problem of “diminishing feature reuse”.

What does that mean? Recall that in a CNN we call the inputs to a layer feature maps and the “weight matrices” kernels or filters. By making a network deeper and thinner, we stack more convolutional layers, each operating on a different set of feature maps. If we instead widen the network, more weights end up working on the same set of inputs. But there’s more. The problem of diminishing feature reuse was formulated in another paper, Highway Networks, whose authors describe how, in very deep ResNets, gradients can flow through the skip “identity” connections during backpropagation without ever passing through a block’s weights, effectively rendering a large portion of the model useless. Perhaps a post for another time.

[Figure: Skip connection in ResNets]
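To make the skip connection concrete, here is a minimal sketch of a basic residual block in PyTorch; the layer ordering and channel count (16) are illustrative assumptions, not the exact architecture from either paper.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # two 3x3 convolutions that preserve the spatial size (padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # the skip "identity" connection: the gradient can reach x directly
        # through this addition, bypassing the block's weights
        return torch.relu(out + x)

block = BasicBlock(16)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # torch.Size([1, 16, 32, 32])
```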

Thin networks have historically been more popular because they tend to be less computationally expensive. Circuit theory is often cited when discussing model complexity: it has been shown that deeper circuits require fewer components overall to express more complicated functions of the input. In other words, more depth can express a richer set of functions of the input.

Additionally, the paper reintroduces dropout into ResNets. Widening the network increases the number of parameters, and with more parameters more regularization is often required to avoid overfitting.
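The paper places the dropout layer inside the residual block, between the two convolutions. A minimal sketch of a widened block with this placement might look as follows; the pre-activation (BN-ReLU-conv) ordering, dropout rate, and channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WideBlock(nn.Module):
    def __init__(self, base_channels, k=2, drop_rate=0.3):
        super().__init__()
        width = base_channels * k  # widening factor k multiplies the number of feature maps
        self.bn1 = nn.BatchNorm2d(width)
        self.conv1 = nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
        self.dropout = nn.Dropout(drop_rate)  # regularizes the extra parameters
        self.bn2 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.dropout(out)  # dropout between the two convolutions
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x  # identity skip connection

block = WideBlock(base_channels=16, k=2)
x = torch.randn(1, 32, 16, 16)  # the block already operates at width 16 * 2 = 32
print(block(x).shape)  # torch.Size([1, 32, 16, 16])
```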


First, a bit of notation. The authors use k to denote the widening factor of a block and l to denote the deepening factor (the number of convolutions in a block). They call architectures with k = 1 “thin” and those with k > 1 “wide”. The number of parameters scales linearly with l and quadratically with k. Why is that?

The parameter count of a convolutional layer is (filter size)² * #input channels * #output channels. Assuming we keep the channel dimension (the so-called width) the same before and after each convolution, adding N layers simply adds N such terms, so the total number of parameters grows linearly with depth. Widening is different: if we want to double the width (k = 2) instead of keeping it constant throughout the convolution block (#output channels = #input channels * k), the first convolution costs (filter size)² * n_c * 2n_c parameters to widen the feature maps, and every subsequent convolution in the block operates at the new width, costing (filter size)² * 2n_c * 2n_c = 4 * (filter size)² * n_c², a factor of k² more than its thin counterpart. That is, if we consider a block of type B(3,3) and for the sake of simplicity keep stride = 1 and padding = 0, we obtain a first (wider) output of size:

(input size - filter size + 1, input size - filter size + 1, 2 * input channels)

after the first convolution and then a second and final output of size:

(input size - 2 * filter size + 2, input size - 2 * filter size + 2, 2 * input channels).
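As a quick sanity check on those formulas, the short sketch below builds the two convolutions of such a block with stride = 1 and padding = 0; the specific numbers (input size 32, n_c = 16, k = 2) are arbitrary example values.

```python
import torch
import torch.nn as nn

input_size, filter_size, n_c, k = 32, 3, 16, 2

# first convolution widens the feature maps from n_c to k * n_c
conv1 = nn.Conv2d(n_c, k * n_c, kernel_size=filter_size, stride=1, padding=0)
# second convolution operates at the new width
conv2 = nn.Conv2d(k * n_c, k * n_c, kernel_size=filter_size, stride=1, padding=0)

x = torch.randn(1, n_c, input_size, input_size)
out1 = conv1(x)
out2 = conv2(out1)

print(out1.shape)  # (1, 32, 30, 30): input_size - filter_size + 1 = 30
print(out2.shape)  # (1, 32, 28, 28): input_size - 2 * filter_size + 2 = 28
```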

For additional background, please refer to this guide on convolution arithmetic: A guide to convolution arithmetic for deep learning (Vincent Dumoulin, Francesco Visin, 2016).

With both convolutions of a B(3,3) block operating at width k * n_c, k = 4 ends up increasing the total number of parameters of the block by a factor of k² = 16, as the simplified calculation below shows:
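A back-of-the-envelope version of that calculation, assuming both convolutions of the block run at width k * n_c and ignoring biases and batch-norm parameters:

```python
# parameters of a B(3,3) block whose two convolutions both operate at width k * n_c
def b33_params(n_c, k=1, filter_size=3):
    width = k * n_c
    return 2 * filter_size ** 2 * width * width  # two 3x3 convs, width -> width

n_c = 16
thin = b33_params(n_c, k=1)
wide = b33_params(n_c, k=4)
print(thin, wide, wide / thin)  # 4608 73728 16.0
```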
