Convolutional Models Overview

Convolutions, Kernels, Downsampling & Properties

Jake Batsuuri
Computronium Blog
8 min read · Mar 5, 2021

What does a CNN model look like in code?

from keras import layers
from keras import models

seq_model = models.Sequential()
seq_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
seq_model.add(layers.MaxPooling2D((2, 2)))
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
seq_model.add(layers.MaxPooling2D((2, 2)))
# 64 filters here, which is what the 36,928 parameters in the model summary correspond to
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))

There is a model:

from keras import models

seq_model = models.Sequential()

Models can be sequential or non-sequential.

from keras.models import Sequential, Model

# input_tensor and output_tensor stand for tensors built beforehand with the functional API
non_seq_model = Model(input_tensor, output_tensor)

Models consist of layers:

from keras import layers

seq_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
  • Here the input_shape is (28, 28, 1): the layer takes input tensors of shape (image_height, image_width, image_channels), where image_channels is 1 for grayscale images and would be 3 for RGB images.
  • The first parameter, 32, is the number of filters. To understand filters, we gotta understand convolutions first…

What are convolutions?

Remember this tidbit:

Dense vs Sparse Connections

And the last key consideration is how each layer is connected to the next. By default, it is connected in a dense or fully connected manner. But using fewer connections than full is popular too. Generally, using fewer connections reduces computation and parameters. Using sparse connections is often problem dependent, which we will study separately when considering CNNs and RNNs and so on.

Fully connected layers have full connections from the previous layer to the current layer, meaning every node in layer i-1 is connected to every node in layer i. This is represented by a matrix multiplication. This way of connecting learns global patterns, which is useful, but not as good as convolution layers in vision tasks, because of a few key properties of convolutional models which we’ll see in a bit.

The convolutional layers instead learn local patterns by sliding a small window of weights, called a kernel, over the image, so each output value depends only on a small patch of the input. The kernels have:

  • Size of the kernel: typically (3, 3) or (5, 5), sometimes (7, 7) and so on
https://setosa.io/ev/image-kernels/
  • Number of kernels, more precisely the number of filters, where each filter is a set of weights that is convolved with the input image. The output image is sort of an activated version of the input image, activated by the filter weights.
https://www.amazon.ca/Deep-Learning-Python-Francois-Chollet/dp/1617294438
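To see what a single filter does, here is a minimal sketch in plain Python, with no Keras involved. The function name conv2d_valid, the toy 5 by 5 image and the hand-written vertical-edge kernel are all my own assumptions for illustration; a real Conv2D layer learns its kernel weights during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of lists) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch whose top-left corner sits at (i, j).
            total = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh)
                for n in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 5x5 image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Hypothetical hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)  # each row prints [0, 3, 3]
```

The output is only 3 by 3 because a 3 by 3 kernel fits in just 3 positions along each side of a 5 by 5 image, and the nonzero columns mark where the window overlaps the edge.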

To make the idea concrete:

model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

Here we input an image of size 28 by 28 with a single grayscale channel; the number of channels is sometimes referred to as depth.

Then we apply 32 filters, each with a 3 by 3 kernel. Usually you wanna increase the number of filters as you add more layers, to give the model capacity to learn more combinations of features.

What was an input tensor of (28, 28, 1), just one image, becomes a tensor of (26, 26, 32), or 32 different filtered images. The height and width shrink from 28 to 26 because no padding is applied by default.
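The shape and size of that first layer can be checked with a bit of arithmetic. This is a rough sketch, assuming Keras's defaults of stride 1 and no padding; the helper names conv2d_output_shape and conv2d_param_count are made up for this example and are not part of Keras.

```python
def conv2d_output_shape(height, width, kernel_size, filters):
    """Output shape of a stride-1 convolution without padding."""
    kh, kw = kernel_size
    return (height - kh + 1, width - kw + 1, filters)


def conv2d_param_count(kernel_size, in_channels, filters):
    """Each filter has one weight per kernel cell per input channel, plus one bias."""
    kh, kw = kernel_size
    weights = kh * kw * in_channels * filters
    biases = filters
    return weights + biases


shape = conv2d_output_shape(28, 28, (3, 3), 32)
params = conv2d_param_count((3, 3), in_channels=1, filters=32)
print(shape)   # (26, 26, 32)
print(params)  # 320
```

The 320 is the same number the model summary reports for the first Conv2D layer further down: 3 × 3 × 1 × 32 weights plus 32 biases.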

Also note that the relu activation is just relu(x) = max(0, x). Remember:

The earliest gates were discrete binary gates.

Then there were sigmoidal gates, which allowed for differentiation and backpropagation.

As networks got deeper, these sigmoidal gates proved ineffective, so ReLU was adopted into deep neural nets around 2010. It makes a hard decision based on the input’s sign.
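One commonly cited reason sigmoids struggle in deep networks is that their gradient shrinks toward zero for large inputs, while ReLU's gradient stays at 1 for any positive input. Here is a small sketch of that in plain Python; the function names and sample inputs are my own, chosen only for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Hard decision on the sign of the input.
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-5.0, -1.0, 0.5, 5.0):
    print(x, relu(x), relu_grad(x), round(sigmoid_grad(x), 4))
```

The sigmoid gradient never exceeds 0.25 and is already below 0.01 at x = 5, so multiplying many such gradients together across layers makes the signal fade quickly.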

Remember that in math, a convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f. This explains the idea behind filter layers and activations: a learned function g slides over the input image f, and wherever the result is positive, the ReLU lets it through, saying that a feature is present.
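Written out, the continuous definition and a discrete 2D version commonly used for images look roughly like this. The symbols I (input image), K (kernel) and S (output feature map) are my own labels. The discrete form shown skips the kernel flip, which is what deep learning libraries typically compute under the name convolution.

```latex
% Continuous convolution of f and g
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

% Discrete 2D form (cross-correlation) with image I and kernel K
S(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)
```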

Also note that, although the output tensor represents a bunch of images, it is generated depth-wise first: at each position of the sliding window, the 32 values along the depth axis are computed, and these 32-deep columns are then reassembled into the final tensor. Very much like the last stages of solving a Rubik’s cube, you create a row or column then insert it into the right place.

What’s Max Pooling?

Let’s first review what’s happening in code and inside the model in detail:

seq_model = models.Sequential()
seq_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
seq_model.add(layers.MaxPooling2D((2, 2)))
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
seq_model.add(layers.MaxPooling2D((2, 2)))
# 64 filters here, which is what the 36,928 parameters in the summary below correspond to
seq_model.add(layers.Conv2D(64, (3, 3), activation='relu'))

To see the details:

seq_model.summary()

Which gives us:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928
=================================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0

First of all, the output height and width of the first layer are 26 rather than 28 because of border effects: a 3 by 3 window can only be centered on pixels that are not on the border. Subtract one from each side, and 28 becomes 26.

But the max pooling layer divides it by 2. What’s happening there?

A max pooling layer downsamples the feature maps by keeping only the max value in each window. The window is usually 2 by 2 and it steps, or strides, by 2, which halves the height and width.
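Here is a small plain-Python sketch of that operation on a single 4 by 4 feature map. The function max_pool2d and the sample values are assumptions of mine for illustration; Keras's MaxPooling2D does the same thing per channel on whole batches.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window, stepping by `stride`."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + m][j + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 7]]
```

When the side length is odd, the leftover row and column are simply dropped in this sketch, which is why an 11 by 11 map pools down to 5 by 5.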

The primary purpose of downsampling is that at the end of the model we wanna condense the information into a result, such as a classification or regression, which is usually done with a dense connection.

A dense connection from a flattened (22, 22, 64) tensor, which would be 30,976 elements, to a layer with 512 elements would need 15,859,712 different connections, almost sixteen million. Computers are not infinity stones, that is too much compute. So we downsample after almost every convolution step like so:
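The arithmetic behind that dense connection is quick to check. This is just multiplication, with variable names I picked for readability:

```python
# Flattening a (22, 22, 64) feature tensor gives one long vector.
flattened = 22 * 22 * 64

# A hypothetical dense layer with 512 units connects to every element of it.
dense_units = 512
weights = flattened * dense_units
params = weights + dense_units  # one bias per unit

print(flattened)  # 30976
print(weights)    # 15859712
print(params)     # 15860224
```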

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

The model summary:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928
_________________________________________________________________
flatten (Flatten)            (None, 576)               0
_________________________________________________________________
dense (Dense)                (None, 64)                36928
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________

So at the end of the convolutional part of our model, we have 64 feature maps of 3 by 3 each, which flatten to one vector of 576 elements. That vector has to densely connect to a layer of 64 elements.
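The way the shapes shrink layer by layer can be traced by hand. Below is a rough sketch that assumes every convolution uses a 3 by 3 kernel without padding and every pooling window strides by its own size, as in the model above; the trace_shapes helper and the tuple format for layers are made up for this example.

```python
def trace_shapes(input_shape, layers):
    """Follow (height, width, channels) through conv and pool layers."""
    h, w, c = input_shape
    shapes = []
    for kind, arg in layers:
        if kind == "conv":
            # arg is the number of filters; a 3x3 kernel trims one pixel per side.
            h, w, c = h - 2, w - 2, arg
        elif kind == "pool":
            # arg is the window size; leftover rows and columns are dropped.
            h, w = h // arg, w // arg
        shapes.append((h, w, c))
    return shapes


layers = [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2), ("conv", 64)]
shapes = trace_shapes((28, 28, 1), layers)
for shape in shapes:
    print(shape)

h, w, c = shapes[-1]
flattened = h * w * c
print(flattened)  # 576
```

The printed shapes go (26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64), (3, 3, 64), matching the Output Shape column of the summary.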

If however we had not used a downsampling strategy, like max pooling layers throughout, we would have a model like this:

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

Which gives us a model of:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 64)        18496
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 22, 22, 64)        36928
_________________________________________________________________
flatten (Flatten)            (None, 30976)             0
_________________________________________________________________
dense (Dense)                (None, 64)                1982528
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650
=================================================================
Total params: 2,038,922
Trainable params: 2,038,922
Non-trainable params: 0
_________________________________________________________________

The difference is fewer than 100k parameters vs about 2 million parameters, which you can imagine really helps the compute time and training duration of your models.
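Both totals can be reproduced from the layer definitions in the two code listings. This is a back-of-the-envelope sketch; conv_params and dense_params are helper names I made up, and they simply count weights plus biases.

```python
def conv_params(kernel, in_channels, filters):
    # kernel x kernel weights per input channel, plus one bias, for each filter
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    # one weight per input, plus one bias, for each unit
    return (inputs + 1) * units


# The three Conv2D layers are identical in both models.
conv_stack = conv_params(3, 1, 32) + conv_params(3, 32, 64) + conv_params(3, 64, 64)

# With pooling the Flatten output is 3 * 3 * 64; without it, 22 * 22 * 64.
with_pooling = conv_stack + dense_params(3 * 3 * 64, 64) + dense_params(64, 10)
without_pooling = conv_stack + dense_params(22 * 22 * 64, 64) + dense_params(64, 10)

print(with_pooling)     # 93322
print(without_pooling)  # 2038922
```

Nearly all of the difference sits in the first Dense layer, since the convolution layers cost the same either way.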

From one perspective, it seems like by downsampling the data you’re essentially losing information. But because we select the max of each window, the important information seems to pass through while the unimportant gets filtered out, which is also a nice feature.

Generally, a nice pattern that I noticed here is that initially we use small kernels and few filters, which learn a few basic features. Then we build on these features and increase the number of filters, while each kernel effectively covers a larger area of the original image, so the model keeps learning ever bigger and more complex features.

However, increasing the number of filters and kernel sizes is computationally heavy, so we apply max pooling or some other downsampling layer after each convolution to keep only the strongest activations and disregard the weak ones.

What are the properties of the Convolutional Model that make it better than Densely Connected for vision tasks?

Translation Invariance

A convolutional model can learn a certain pattern in the lower right area, then after that point detect it anywhere on the image. Whereas a densely connected model would have to relearn the pattern for every new location it appears in.

This is not only computationally efficient, it also matches the visual world, which is itself translation invariant.
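A toy demonstration of this in plain Python: the same kernel finds the same pattern whether it sits in the top left or the bottom right, and only the location of the peak response changes. Everything here (the X-shaped pattern, the helper names, pretending the kernel was already learned) is an assumption for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def place(pattern, size, top, left):
    """Draw `pattern` onto an otherwise empty size x size image."""
    image = [[0] * size for _ in range(size)]
    for m, row in enumerate(pattern):
        for n, value in enumerate(row):
            image[top + m][left + n] = value
    return image


def argmax2d(grid):
    """Return ((row, col), value) of the first largest entry."""
    best = max(max(row) for row in grid)
    for i, row in enumerate(grid):
        for j, value in enumerate(row):
            if value == best:
                return (i, j), best


pattern = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]  # a small X shape
kernel = pattern  # pretend the layer has already learned to detect the X

top_left = conv2d_valid(place(pattern, 6, 0, 0), kernel)
bottom_right = conv2d_valid(place(pattern, 6, 3, 3), kernel)

print(argmax2d(top_left))      # ((0, 0), 5)
print(argmax2d(bottom_right))  # ((3, 3), 5)
```

Strictly speaking the feature map shifts along with the input here, and it is the later pooling and dense layers that turn "found it somewhere" into a location-independent answer.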

Spatial Hierarchy

A convolutional model can learn patterns in a hierarchical fashion, much like we do. The first layers will learn relatively simple patterns, like horizontalness and verticalness. Then the next layers will put these together to learn such things as corners. And so on with each new layer.
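One way to put numbers on this hierarchy is the receptive field: how many input pixels along one side can influence a single value in a given layer. The sketch below uses the usual recurrence for stacked convolution and pooling layers, applied to the layer names from the summary above; the receptive_fields helper is my own.

```python
def receptive_fields(layers):
    """layers: list of (name, kernel_size, stride). Returns (name, field) pairs."""
    field, jump = 1, 1  # jump = distance in input pixels between neighbouring outputs
    result = []
    for name, kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
        result.append((name, field))
    return result


layers = [
    ("conv2d", 3, 1),
    ("max_pooling2d", 2, 2),
    ("conv2d_1", 3, 1),
    ("max_pooling2d_1", 2, 2),
    ("conv2d_2", 3, 1),
]

fields = receptive_fields(layers)
for name, field in fields:
    print(name, field)
```

Under these assumptions a value in the first layer sees a 3 by 3 patch, while a value in the last convolution sees an 18 by 18 patch of the 28 by 28 input, which is why deeper layers can represent bigger patterns.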

This also mirrors the visual world.

These properties are important to remember as we think about new problems which might require translation invariance, even in text and sequence data.

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:
1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Convolutional Models Overview
Convolutions, Kernels, Downsampling & Properties

Up Next…

Coming up next is probably Convolutional Model Regularization. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
