6. Introduction to Deep Learning with Computer Vision — 3x3 is a lie! 1x1 convolutions

Inside AI
Deep-Learning-For-Computer-Vision
3 min read · Oct 17, 2019

Written by Nilesh Singh and Praveen Kumar.

Prerequisites: Filters & Kernels, Channels & Features

So far, we’ve looked at a very specific type of convolution with a kernel size of 3, more commonly called the 3x3 convolution. We also said that this is the only type of convolution we are going to use. But what if I told you that we were lying?

There is another very important kind of convolution, one that is, under some circumstances, even more important than our 3x3 feature extractor.

Before looking at what it is, let’s look at the problem we run into with 3x3 convolutions alone.

Let’s say we have an image of some arbitrary size, and that we are running a series of feature extractions on it using 3x3 kernels. We know that it is generally in the model’s best interest to increase the number of channels at each convolutional layer.

So, if we generate 32 channels in the 1st convolution, our GPU holds 32 feature maps after that layer. In the 2nd convolution, if we generate 64 channels, each of those 64 output maps must be built from all 32 maps of the previous step, so the layer needs 32x64 distinct 3x3 kernel planes.

no. of 3x3 kernel planes in layer 2 = 32x64

If the 3rd convolution generates 128 channels, that layer alone needs 64x128 kernel planes, and this product keeps growing layer after layer as the channel counts climb.
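To make the growth concrete, here is a rough cost sketch (an illustration, not a benchmark): a standard 3x3 convolution with C_in input channels and C_out output channels applies C_in x C_out distinct 3x3 kernel planes at every spatial position. The spatial size of 224x224 below is an assumed example value.

```python
H, W = 224, 224  # assumed spatial size, kept constant for simplicity

def conv3x3_macs(c_in, c_out, h=H, w=W):
    """Multiply-accumulates for one 3x3 conv layer (padding edges ignored):
    every pixel needs c_in * c_out 3x3 dot products."""
    return c_in * c_out * 3 * 3 * h * w

# The channel progression from the text: 3 -> 32 -> 64 -> 128
layers = [(3, 32), (32, 64), (64, 128)]
total = 0
for c_in, c_out in layers:
    macs = conv3x3_macs(c_in, c_out)
    total += macs
    print(f"{c_in:>3} -> {c_out:<3}: {macs:,} MACs")
print(f"total: {total:,} MACs")
```

Each time the channel counts double, the cost of the layer roughly quadruples, which is why deep stacks of wide 3x3 layers get expensive so quickly.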

We have networks with hundreds of such layers; just try to imagine the amount of computation and memory the GPU has to handle. Even a state-of-the-art Titan X GPU is tiny when confronted with such numbers.

What do we do?

No seriously, what do we do?

Well, we can thank Google and the University of North Carolina for releasing their paper and introducing the world to the magic of 1x1 convolutions.

A 1x1 convolution is used to reduce the dimensionality of the channel axis. We can pass it 1000 channels and ask for 10 channels back, and it will learn to compress them while losing as little useful information as possible.
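A minimal NumPy sketch of that channel reduction (with random, untrained weights purely for illustration): a 1x1 convolution is just a learned weight matrix applied across the channel axis at every pixel, so it changes the channel count while leaving the spatial size untouched.

```python
import numpy as np

def conv1x1(x, weights):
    """1x1 convolution (no bias): x has shape (C_in, H, W),
    weights has shape (C_out, C_in). Each output channel is a
    weighted sum of all input channels, computed independently
    at every spatial position."""
    return np.einsum('oc,chw->ohw', weights, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 8, 8))   # 1000 input channels, 8x8 spatial
w = rng.standard_normal((10, 1000))     # 10 output channels requested
y = conv1x1(x, w)
print(y.shape)  # (10, 8, 8): channels reduced 1000 -> 10, spatial size unchanged
```

In a real network the weight matrix `w` is learned during training, which is what lets the layer decide which channel combinations are worth keeping.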

Let’s look at an animation explaining the process.

A 1x1 convolution performs a linear traversal of each feature map (channel), top to bottom and left to right. At every position along this traversal, it takes a weighted sum across all the channels. The number of distinct weight sets is decided by the number of channels we expect as output: each output channel gets its own set of weights.

The very same thing happens in the animation above. We start with 10 feature maps (channels) and tell our 1x1 convolution that we want 4 channels as its output. All 10 feature maps are then scanned 4 times, and each scan summarizes them with a different set of weights; this ensures that the 4 output channels capture different features.
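The 10-to-4 scan described above can be written out as an explicit loop (again with random stand-in weights, since the real ones are learned): each of the 4 output maps is built from its own weight vector over all 10 input maps.

```python
import numpy as np

rng = np.random.default_rng(1)
feature_maps = rng.standard_normal((10, 5, 5))  # 10 input channels, 5x5 spatial
weights = rng.standard_normal((4, 10))          # one weight vector per output channel

outputs = np.zeros((4, 5, 5))
for o in range(4):            # the "4 scans" over the input stack
    for c in range(10):       # weighted sum across all 10 channels
        outputs[o] += weights[o, c] * feature_maps[c]

# The loop is equivalent to a single channel-mixing tensor contraction
reference = np.tensordot(weights, feature_maps, axes=([1], [0]))
print(outputs.shape)  # (4, 5, 5)
```

Because each output channel uses a different weight vector, the four results summarize the same 10 inputs in four different ways, which is exactly why the outputs end up holding different features.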

Yeah, trust me when I say, I know this confusion.

Let it sink in for a while. It is a difficult, even counter-intuitive, concept to grasp in one go, and it requires a wee bit of practice as well.

Here are some fantastic resources to read more about it.

Hope you enjoyed it. See you soon again.

NOTE: We are starting a new Telegram group to tackle all your questions and queries. You can openly discuss concepts with other participants and get more insights; this will become more helpful as we move further through the publication. [Follow this LINK to join]

