A gentle dive into the anatomy of a Convolution layer.

Apil Tamang
5 min read · Nov 21, 2017

--

In the realm of deep learning, the convolutional network is the workhorse behind many awe-inspiring results. Since the publication of AlexNet in 2012 (1), it is hard to cite a ground-breaking computer-vision architecture that does not use convolutions at its very foundation.

With existing deep-learning frameworks, writing a convolution layer is often a one-line statement that abstracts away many structural details. And yet, sometimes it is a good idea to take a step back and peel open some abstractions. This blog is an attempt to elaborate on a specific anatomical feature of the convolution layer that is unfortunately overlooked in most blogs and discussions on the matter.

Many convolution architectures start off with an exterior convolution unit that maps an RGB input image into a series of interior filters. In most popular deep-learning frameworks, the code may look like the following:

out_1 = Conv2d(input=image, filters=32, kernel_size=(3,3), strides=(1,1))
relu_out = relu(out_1)
pool_out = MaxPool(relu_out, kernel_size=(2,2), strides=2)

To many, it is well understood that the result of the above is a stack of feature maps 32 layers deep. What is not always understood is how an image with 3 channels is mapped onto these 32 layers. Nor is it apparent exactly how the max-pool operator is applied. For instance, is the max-pool applied over all the filter layers at once, effectively producing a single pooled map at the end? Or is it applied to each filter's output independently, producing the same 32 layers of pooled maps?
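The snippet above is framework-agnostic pseudocode; as a purely illustrative sketch, here is how the same three steps might look in PyTorch, with a 224x224 input assumed only to make the shapes concrete. Inspecting the layer's weight tensor and the pooled output answers both questions:

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # a hypothetical RGB input (batch of 1)

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(conv.weight.shape)   # torch.Size([32, 3, 3, 3]) -> 32 filters, each made of 3 kernels (one per channel)
out = pool(torch.relu(conv(image)))
print(out.shape)           # torch.Size([1, 32, 111, 111]) -> the max-pool acts on each of the 32 maps independently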

The How?

A picture is worth a thousand words, so here is a diagram showing all the actions packed into the above code snippet.

Application of the convolution layer.

The most notable observation from the figure above is that each filter (i.e. filter-1, filter-2, etc.) in Step-1 actually comprises a set of three convolution kernels (Wt-R, Wt-G, and Wt-B). Each of these kernels is reserved, respectively, for the Red (R), Green (G), and Blue (B) channels of the input image.

During the forward propagation, the R, G, and B pixel values of the image are convolved with the Wt-R, Wt-G, and Wt-B kernels respectively to produce three intermediate activation maps (not shown in the figure). The outputs of the three kernels are then summed to produce one activation map per filter.
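To make the per-channel arithmetic concrete, here is a minimal NumPy sketch of that step for a single filter (bias, stride, and padding are omitted, and the function name is mine; the loop implements the sliding-window cross-correlation that deep-learning frameworks actually compute):

import numpy as np

def single_filter_response(image, kernels):
    # Naive forward pass for ONE filter: slide each channel's kernel over
    # that channel, then sum the three per-channel maps into one activation map.
    #   image:   (H, W, 3) array holding the R, G, B channels
    #   kernels: (3, k, k) array holding Wt-R, Wt-G, Wt-B
    H, W, _ = image.shape
    k = kernels.shape[-1]
    out = np.zeros((H - k + 1, W - k + 1))
    for c in range(3):                                    # one kernel per input channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i+k, j:j+k, c]
                out[i, j] += np.sum(patch * kernels[c])   # add this channel's contribution
    return out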

Subsequently, each of these activation maps is passed through the ReLU function and finally run through the max-pooling layer, the latter being primarily responsible for reducing the spatial dimensions of the output. What we have at the end is a set of activation maps whose spatial dimensions are roughly half those of the input image, and whose depth now spans 32 (our chosen number of filters) two-dimensional tensors.
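Continuing the same sketch, the ReLU and the 2x2 max-pool can be written as follows; applying max_pool_2x2 to each of the 32 activation maps independently is exactly what produces the 32 half-resolution maps:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_2x2(activation_map):
    # Take the max over non-overlapping 2x2 windows, halving both spatial dimensions.
    H, W = activation_map.shape
    H, W = H - H % 2, W - W % 2                    # drop a trailing odd row/column, if any
    blocks = activation_map[:H, :W].reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))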

The output from a convolution layer often serves as the input to the subsequent convolution layer. Thus, if our second convolution unit started as follows:

conv_out_2 = Conv2d(input=pool_out, filters=64)

then the framework needs to instantiate 64 filters, each filter using a set of 32 unique kernels.
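In the PyTorch framing used in the earlier sketch, the shape of the second layer's weight tensor makes this explicit (the 3x3 kernel size is an assumption here, since the snippet above leaves it out):

import torch.nn as nn

conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)

print(conv2.weight.shape)    # torch.Size([64, 32, 3, 3]) -> 64 filters, each holding 32 kernels
print(conv2.weight.numel())  # 64 * 32 * 3 * 3 = 18432 weights (plus 64 bias terms)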

The Why?

Another subtle but important point that escapes scrutiny is why we used 32 filters for our first convolution layer. In many popular architectures, the number of filters grows larger (e.g. 64 for the second layer, 128 for the third, and so on) as we go deeper into the network.

Matt Zeiler, in (2), employs the deconvolution operator to visualize how the kernels at different layers and depths of a deep convolutional architecture get tuned during training. The general consensus is that, in an optimally trained convolution network, the filters at the very edge (closest to the image) become sensitive to basic edges and patterns, while the filters in deeper layers become sensitized to gradually higher-order shapes and patterns. The phenomenon is well summarized in these diagrams extracted from the paper:

Visualization of filter activations on the first and second (outermost) layers.
Visualization of filter activations on the third layer.
Visualization of filter activations on the 4th and 5th layers.

Another question I wondered about for a considerable time is why different filters, even within a given layer, get tuned to specific shapes or patterns. After all, there is nothing extraordinary about the weights of any one kernel that would have guaranteed the observed outcome. And that is precisely the point: the process of stochastic gradient descent (SGD) automagically adjusts the weights so that the kernels acquire the specialized behaviors described above. It is only important that:

  • the kernels (or weight matrices) be initialized randomly, so that each kernel is driven towards a unique solution (see the sketch after this list), and
  • we define enough filters to capture the various features in our dataset, while striking a balance against the incurred computational cost.
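As a sketch of the first point, a random start for our 32 first-layer filters might be produced like this (He-style initialization is assumed here; frameworks differ in the exact scheme):

import numpy as np

rng = np.random.default_rng(seed=0)

# 32 filters, each a set of three 3x3 kernels, drawn from a zero-mean Gaussian
# scaled by the fan-in so that no two kernels start from the same point.
fan_in = 3 * 3 * 3
filters = rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(32, 3, 3, 3))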

Finally, many papers also suggest that visualizing the filter activations provides a window into the performance of a convolutional architecture. A balanced and performant network displays activations like those discussed above, marked by the emergence of well-defined edge and shape detectors, while a network that over-fits, under-fits, and/or generalizes poorly often fails to show them. Hence, it is always a good idea to probe the network using the process described in (2) to see whether an experimental convolutional network is learning well.
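Even without the full deconvnet machinery of (2), a quick look at the learned first-layer filters is a useful sanity check. Here is a minimal, hypothetical helper in the same PyTorch framing as the earlier sketches, using matplotlib for display:

import matplotlib.pyplot as plt

def show_first_layer_filters(conv_layer, n=8):
    # conv_layer.weight is assumed to have shape (out_channels, 3, k, k),
    # as in the PyTorch sketches above.
    weights = conv_layer.weight.detach().cpu().numpy()
    _, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i, ax in enumerate(axes):
        f = weights[i].transpose(1, 2, 0)               # reshape to (k, k, 3), an RGB patch
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)  # rescale weights to [0, 1] for display
        ax.imshow(f)
        ax.axis("off")
    plt.show()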

References:

  1. ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
  2. Visualizing and Understanding Convolutional Networks, Matthew D. Zeiler, Rob Fergus, https://arxiv.org/abs/1311.2901
