Convolutional Models Overview

Convolutions, Kernels, Downsampling & Properties

Jake Batsuuri

Published in

Computronium Blog

8 min read · Mar 5, 2021

What does a CNN model look like in code?

from keras import layers
from keras import models
seq_model = models.Sequential()
seq_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
seq_model.add(layers.MaxPooling2D((2, 2)))
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
seq_model.add(layers.MaxPooling2D((2, 2)))
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))

There is a model:

from keras import models
seq_model = models.Sequential()

Models can be sequential and non-sequential.

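from keras import layers, models
input_tensor = layers.Input(shape=(28, 28, 1))  # example input, same shape as used below
output_tensor = layers.Conv2D(32, (3, 3), activation='relu')(input_tensor)
non_seq_model = models.Model(input_tensor, output_tensor)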

Models can consist of layers:

from keras import layers
seq_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

Here the input_shape is (28, 28, 1): the layer takes input tensors of shape (image_height, image_width, image_channels), where image_channels is 1 for grayscale images and would be 3 for RGB images.
The first parameter, 32, is the number of filters. To understand filters, we gotta understand convolutions first…

What are convolutions?

Remember this tidbit:

General Architectural Design Considerations for Neural Networks

Universal Approximation Theorem, Depth, Connections

levelup.gitconnected.com

Dense vs Sparse Connections
And the last key consideration is how each layer is connected to the next. By default, it is connected in a dense or fully connected manner. But using fewer connections than full is popular too. Generally, using fewer connections reduces computation and parameters. Whether to use sparse connections is often problem dependent, which we will study separately when considering CNNs, RNNs and so on.

Fully connected layers have full connections from the previous layer to the current layer, meaning every node in layer i-1 is connected to every node in layer i. This is represented by a matrix multiplication. This way of connecting learns global patterns, which turns out to be less useful for vision tasks. I mean, it's useful, but not as good as convolutional layers, because of a few key properties of convolutional models, which we'll see in a bit.
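To see why a fully connected layer boils down to a matrix multiplication, here is a minimal sketch in plain Python. The function name dense_forward and the toy numbers are made up for illustration; a real Keras Dense layer learns its weights and also applies an activation.

```python
def dense_forward(inputs, weights, biases):
    """Compute one fully connected layer: outputs = weights @ inputs + biases."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        # every output unit looks at every input value
        outputs.append(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
    return outputs

# Toy example (hypothetical numbers): 3 inputs fully connected to 2 units
inputs = [1, 2, 3]
weights = [[1, 0, -1],
           [0.5, 0.5, 0.5]]
biases = [0, 1]

outputs = dense_forward(inputs, weights, biases)
n_connections = len(inputs) * len(weights)  # one weight per input-unit pair
print(outputs, n_connections)
```

The number of weights grows as inputs times units, which is what makes dense layers expensive on image-sized inputs.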

The convolutional layers instead learn local patterns, by sliding small windows, called kernels, over the image and looking at one small subsection at a time. The kernels have:

Size of the kernel: typically (3, 3) or (5, 5) or (7, 7) and so on

Number of kernels: more precisely, the number of filters, where each filter is a set of weights that is convolved with the input image. The output image is sort of an activated version of the input image, activated by the filter weights.
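Here is a rough sketch of what sliding one kernel over an image means, in plain Python rather than Keras. The function convolve2d_valid, the toy image and the hand-picked edge kernel are all my own illustrations; in a real layer the kernel weights are learned. Like most deep learning libraries, the sketch does not flip the kernel.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # weighted sum of the window whose top-left corner is (i, j)
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Toy 5 x 5 image: dark on the left (columns 0-1), bright on the right (columns 2-4)
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked vertical edge detector (hypothetical weights)
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

response = convolve2d_valid(image, kernel)
for row in response:
    print(row)
```

The response is large where the window straddles the edge and zero where the window sees only flat brightness, which is the sense in which a filter "activates" on a feature.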

https://www.amazon.ca/Deep-Learning-Python-Francois-Chollet/dp/1617294438

To make the idea concrete:

model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

Here we input an image of size 28 by 28 with a single channel of grayscale values. The channel dimension is sometimes referred to as depth.

Then we apply 32 filters, each with a kernel of size 3 by 3. Usually you wanna increase the number of filters as you add more layers, to give the model capacity to learn more combinations of features.

What was an input tensor of (28, 28, 1), just one image, becomes a tensor of (26, 26, 32), or 32 different filtered images.

Also note that the relu activation is just relu(x) = max(0, x). Remember:

Hidden Units in Neural Networks

What are the hidden layers in deep neural networks? How are they constructed?

medium.com

The earliest gates were discrete binary gates.
Then there were sigmoidal gates, which allowed for differentiation and backpropagation.
As networks got deeper, these sigmoidal gates proved ineffective. So ReLU, which makes hard decisions based on the input's sign, was adopted into deep neural nets around 2010.
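The relu(x) = max(0, x) definition is small enough to try by hand. A minimal sketch, with a made-up 2 by 2 feature map standing in for a filter's raw output:

```python
def relu(x):
    # keep positive values, clamp everything else to zero
    return max(0, x)

# Hypothetical raw filter responses before activation
feature_map = [[3, -2],
               [-1, 0]]

activated = [[relu(value) for value in row] for row in feature_map]
print(activated)
```

Only the positive response survives, which is the "hard decision based on sign" mentioned above.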

Remember that in math, a convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f. This describes the idea behind filter layers and activations: a learned function g goes over the input image f, and if the convolution is positive, we activate it, saying that a feature is present.
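For reference, here is the textbook definition with f as the input and g as the filter, followed by the discrete 2D sum that convolutional layers typically compute. The symbols I, K and S for the image, kernel and output are my own labels, and most deep learning libraries skip the kernel flip, which strictly speaking makes the second formula a cross-correlation.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

S(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)
```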

Also note that, although the output tensor represents a bunch of images, it is generated depth-wise first: at each window position the layer produces one vector that is 32 values deep, one per filter, and these vectors are then reassembled into the final tensor. Very much like the last stages of solving a Rubik's cube, you create a row or column, then insert it into the right place.

What’s Max Pooling?

Let’s first review what’s happening in code and inside the model in detail:

seq_model = models.Sequential()
seq_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
seq_model.add(layers.MaxPooling2D((2, 2)))
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
seq_model.add(layers.MaxPooling2D((2, 2)))
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))

To see the details:

seq_model.summary()

Which gives us:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
=================================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0

First of all, the output height and width of the first layer are 26 because of border effects: a 3 by 3 kernel cannot be centered on the outermost pixels, so we lose one pixel on each side, and 28 becomes 26.
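The shapes and parameter counts in the summary can be checked by hand. The sketch below is my own bookkeeping, not Keras code: the helper names are hypothetical, the layer list follows the summary (so the last convolution has 64 filters), and the same size formula with a stride of 2 covers the pooling layers discussed next.

```python
def output_size(input_size, window, stride=1):
    # number of positions where a window fits without padding
    return (input_size - window) // stride + 1

def conv2d_params(kernel_size, in_channels, filters):
    # each filter has kernel_size * kernel_size * in_channels weights plus one bias
    return (kernel_size * kernel_size * in_channels + 1) * filters

size, channels = 28, 1
shapes, params = [], []
for kind, arg in [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2), ("conv", 64)]:
    if kind == "conv":
        params.append(conv2d_params(3, channels, arg))  # arg = number of filters
        size, channels = output_size(size, 3), arg
    else:
        params.append(0)                                 # pooling has no weights
        size = output_size(size, arg, stride=arg)        # arg = pool window
    shapes.append((size, size, channels))

for shape, count in zip(shapes, params):
    print(shape, count)
print("total:", sum(params))
```

Each row printed lines up with a row of the summary, and the total matches its "Total params" line.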

But the max pooling layer divides it by 2. What’s happening there?

A max pooling layer downsamples the feature maps. It downsamples by keeping only the max value inside each window. The window is usually 2 by 2 and it steps, or strides, by 2, which has the effect of downsampling by a factor of 2.
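A minimal sketch of 2 by 2 max pooling with a stride of 2, in plain Python. The function name and the toy feature map are illustrative assumptions; like the summary above, where 11 becomes 5, the sketch simply drops a leftover row or column when the size is odd.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value of every non-overlapping 2 x 2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled

# Hypothetical 4 x 4 feature map
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 4 by 4 map shrinks to 2 by 2, and each surviving value is the strongest response in its window.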

The primary reason for downsampling is that at the end of the model we wanna condense the information into a result, such as a classification or regression, which is usually done with a dense connection.

Without downsampling, the convolutional part would end in a (22, 22, 64) tensor, which flattens to 30,976 elements. A dense connection from that to a layer with 512 elements would need 30,976 × 512, or roughly 15.9 million, different connections. Computers are not infinity stones, that is too much compute. So we downsample after almost every convolution step, like so:

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

The model summary:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                36928     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________

So at the end of the convolutional part of the model, we have 64 feature maps of size 3 by 3, which flatten to one vector of 576 elements. That vector has to densely connect to a layer of 64 elements.

If however we had not used a downsampling strategy, like max pooling layers throughout, we would have a model like this:

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

Which gives us a model of:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 22, 22, 64)        36928     
_________________________________________________________________
flatten (Flatten)            (None, 30976)             0         
_________________________________________________________________
dense (Dense)                (None, 64)                1982528   
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
=================================================================
Total params: 2,038,922
Trainable params: 2,038,922
Non-trainable params: 0
_________________________________________________________________

The difference is under 100k parameters vs about 2 million parameters, almost all of it in the first dense layer. You can imagine how much that helps the compute time and training duration of your models.
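Most of that gap sits in the first dense layer, which is easy to verify with a bit of arithmetic. The helper dense_params is my own naming; the 512-unit case is the hypothetical layer mentioned earlier, not one of the models shown.

```python
def dense_params(n_inputs, n_units):
    # one weight per input-unit pair, plus one bias per unit
    return n_inputs * n_units + n_units

pooled_inputs = 3 * 3 * 64       # flattened size with max pooling
unpooled_inputs = 22 * 22 * 64   # flattened size without max pooling

pooled_dense = dense_params(pooled_inputs, 64)
unpooled_dense = dense_params(unpooled_inputs, 64)
wide_connections = unpooled_inputs * 512  # weights only, for a 512-unit layer

print(pooled_inputs, pooled_dense)
print(unpooled_inputs, unpooled_dense)
print(wide_connections)
```

The two dense-layer counts match the "dense (Dense)" rows of the two summaries.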

From one perspective, it seems like downsampling the data loses information. But because we select the max of each window, the important information seems to pass through while the unimportant gets filtered out, which is also a nice feature.

Generally, a nice pattern that I noticed here is that initially we use small kernels and few filters, which learn a few basic features. Then we build on these features by increasing the number of filters, while each pooling step lets the same 3 by 3 kernel cover a larger area of the original image, so the model keeps learning ever bigger and more complex features.

However, the increasing number of filters is computationally heavy, so we apply max pooling or some other downsampling layer to take only the most important features after each convolution, and disregard the weak learnings.
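One way to put a number on "bigger features" is to track how many input pixels a single output value can see after each layer. This is a sketch under my own assumptions: the function name is hypothetical, and the layer list is the conv, pool, conv, pool, conv stack from the model above, written as (window, stride) pairs.

```python
def receptive_fields(layers):
    """Return the input-pixel span seen by one output value after each layer."""
    size, jump = 1, 1   # jump = distance in input pixels between neighbouring outputs
    spans = []
    for window, stride in layers:
        size += (window - 1) * jump
        jump *= stride
        spans.append(size)
    return spans

# (window, stride) for: conv 3x3, pool 2x2, conv 3x3, pool 2x2, conv 3x3
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
spans = receptive_fields(stack)
print(spans)
```

By the last convolution, each value summarizes an 18 by 18 region of the 28 by 28 input, even though every kernel is only 3 by 3.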

What are the properties of the Convolutional Model that make it better than Densely Connected for vision tasks?

Translation Invariance

A convolutional model can learn a certain pattern in the lower right area, then after that point detect it anywhere on the image. Whereas a densely connected model would have to learn the pattern all over again every time it showed up at a new location.

This is not only computationally efficient, it also matches the visual world, which is itself translation invariant.
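A small experiment makes this tangible: the same kernel finds the same pattern wherever it is placed, and the peak response just moves with it. Everything here is a toy of my own making (the helper names, the 2 by 2 diagonal pattern and the hand-picked kernel), not trained weights.

```python
def correlate_valid(image, kernel):
    """Slide `kernel` over `image` without padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def place_pattern(size, pattern, top, left):
    """Return a blank size x size image with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image

def strongest_response(response):
    """Return ((row, col), value) of the largest entry."""
    best = max(max(row) for row in response)
    for i, row in enumerate(response):
        if best in row:
            return (i, row.index(best)), best

pattern = [[1, 0],
           [0, 1]]          # tiny diagonal stroke
kernel = [[1, -1],
          [-1, 1]]          # hypothetical detector for that stroke

lower_right = strongest_response(correlate_valid(place_pattern(6, pattern, 4, 4), kernel))
upper_left = strongest_response(correlate_valid(place_pattern(6, pattern, 0, 0), kernel))
print(lower_right, upper_left)
```

One set of four weights handles both locations, whereas a dense layer would need separate weights for each position.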

Spatial Hierarchy

A convolutional model can learn patterns in a hierarchical fashion, much like we do. The first layers will learn relatively simple patterns, like horizontalness and verticalness etc. Then the second layers will put these together to learn such things as corners. And so on with each new layer.

This also mirrors the visual world.

These properties are important to remember as we think about new problems which might require translation invariance, even in text and sequence data.

Up Next…

Coming up next is probably Convolutional Model Regularization. If you would like me to write another article explaining a topic in-depth, please leave a comment.
