5. Introduction to Deep Learning With Computer Vision — Kernels, Channels & Neural Architecture


by Praveen Kumar and Nilesh Singh

Prerequisites:

Filters and kernels

Channels and features

Datasets

Hello everyone, welcome back. Till now we’ve been trying to get the hang of the basics and get used to some common terminology. This article covers the core concepts and officially marks the beginning of our journey. While building our model, we’ll be interacting directly with many of the entities discussed here, so they’re tangible in a sense. For that very reason, we will first revisit previously covered topics and build this article gradually, using those concepts as the base.

Concepts from previous sessions

In the above animation, what do you see? Hard to remember?

Alright, so let’s imagine that our image is divided into a 5x5 matrix, the block on the right. This is the base image for our kernel (recall that we’ll only be using 3x3 kernels). Now, the aim of this animation is to convince you that the white matrix forming on the left is not the kernel; rather, it is the output of the shadowed mask (the kernel) moving over our original 5x5 matrix.

So, what is that white block that gets formed?

That block is actually our output image, formed via an operation called convolution (explained later) that involves the kernel and the base image.

The new image (3x3) retains some of the features of the original image (5x5) because the kernel is forced to extract only a specific set of features. What exactly did we do here? Let’s break it down into simpler steps.

  1. First, we choose a kernel, which is used to extract features from an image. How does a kernel extract features? By the magic of convolution.
  2. The kernel then creates a new image with the extracted features; the size of the new image is reduced by 2 (why? explained later).
  3. This new image now contains only those extracted features and is fed forward to other kernels waiting in line, which extract even finer features.

Seems complicated? Well, fret not, it gets easier with time as you learn to muscle your way in.

So don’t worry. Let’s continue…

In a neural architecture, we have numerous kernels, and each of those kernels produces one image, each with a different set of features. These images accumulate in our network and become an eternal source of sorrow for our computer hardware. This is the main reason people prefer high-end GPUs and large amounts of RAM: the GPU processes the images quickly, while RAM stores them before and after processing. But remember that no amount of GPU power is ever enough. We have thousands (sometimes millions) of kernel images, so even with the kind of hardware provided by Colab, it’s quite plausible that, more often than not, you’ll run into out-of-memory messages.
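To get a feel for the numbers, here is a back-of-the-envelope Python sketch. The kernel count and input resolution below are made up purely for illustration:

# Rough memory estimate for the feature maps of one hypothetical layer:
# 512 kernels applied to a 224x224 input, stored as 4-byte float32 values.
num_kernels = 512
height, width = 224, 224
bytes_per_value = 4  # float32

feature_map_bytes = num_kernels * height * width * bytes_per_value
print(f"{feature_map_bytes / 1024**2:.0f} MiB")  # ~98 MiB, for a single input image

Multiply that by the batch size and the number of layers, and it becomes easy to see why memory runs out.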

What about the image size? And is there a general rule?

Each time we perform a convolution, a new feature-extracted image is produced, and its size is reduced.

By how much?

It depends on the kernel. Say the size of the kernel is K x K and the size of the image is M x M.

Then the size of the new image after convolution will be:

(M - (K - 1)) x (M - (K - 1))

In our case,

image size = 5x5

kernel size = 3x3

then, output image size = (5 - (3 - 1)) x (5 - (3 - 1)) => (5 - 2) x (5 - 2) => 3x3

We lose 2 pixels along both the x-axis and the y-axis when we perform convolution with a 3x3 kernel (now would be an apt time to go back and look at that first image).
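The same rule is easy to express in code. Here is a minimal Python sketch (the function name is our own, just for illustration):

def conv_output_size(image_size, kernel_size):
    """Spatial size after a valid convolution (no padding, stride 1)."""
    return image_size - (kernel_size - 1)

print(conv_output_size(5, 3))  # 3, matching our 5x5 image and 3x3 kernel
print(conv_output_size(7, 3))  # 5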

It is important to highlight that only the image’s width and height change under a normal 3x3 convolution, not its depth. Depth remains the same. What is the depth of an image? It’s the number of channels an image possesses. Generally, we have 3 channels (RGB). Hence, in more precise terms, we say:

The size of the image is not simply 5x5 but 5x5x3, where 3 is the number of channels. Now wait…

I’m confused… if the image has channels, then our kernel must have channels too, right? Exactly. Our kernel has the same number of channels as our input image. So the size of our kernel is not just 3x3 but 3x3x3, where the last 3 matches the number of channels in the input image.
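A quick NumPy sketch of those shapes (random values, just to make the dimensions concrete):

import numpy as np

image = np.random.rand(5, 5, 3)   # height x width x channels (RGB)
kernel = np.random.rand(3, 3, 3)  # a kernel spans all input channels

# Convolution also sums across the 3 channels, so this one kernel
# still produces a single-channel 3x3 output.
print(image.shape, kernel.shape)  # (5, 5, 3) (3, 3, 3)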

Wait! First, tell me what is convolution?

You’d be disappointed if you came looking for complicated code and fancy concepts. In actuality, convolution is really boring. It is just a bunch of multiplications and additions. We can define it as:

Convolution is element-wise multiplication between the kernel matrix and a patch of the base image matrix, followed by adding up all the products.

Consider the following image

Our image size is 7x7 and our kernel size is 3x3.

When our kernel is at the top-left corner, it extracts features from that top-left corner of the image. How does it extract them? Look at the calculations at the top right of the image. Each pixel under the kernel is multiplied by the corresponding kernel value, and the final output is obtained by adding up all the products.

That’s all there is to convolution.
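Here is a minimal NumPy sketch of that multiply-and-add loop (strictly speaking, deep learning frameworks implement cross-correlation, i.e. they don’t flip the kernel, and neither do we here):

import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel, multiply element-wise, sum."""
    k = kernel.shape[0]
    out_size = image.shape[0] - (k - 1)
    output = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i + k, j:j + k]
            output[i, j] = np.sum(patch * kernel)  # 9 products + one sum for a 3x3 kernel
    return output

image = np.arange(49, dtype=float).reshape(7, 7)  # toy 7x7 image
kernel = np.ones((3, 3)) / 9.0                    # a simple averaging kernel
print(convolve2d(image, kernel).shape)            # (5, 5)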

Another example of convolution…

We can see in the animation above the values that our 3x3 kernel holds.

Whenever our kernel stops over a 3x3 area, we perform 9 multiplications, and the sum of the resulting 9 products is passed on to the output (green) channel, as shown in the image above.

The values in the output channel can be considered as the “confidence” of finding a particular feature. The higher the value, the higher the confidence; the lower (or more negative) the value, the higher the confidence that the feature is absent.
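To see this confidence idea in action, here is a small sketch with a hypothetical vertical-edge kernel (the kernel values and image are made up for illustration):

import numpy as np

# Negative weights on the left, positive on the right:
# this kernel fires where dark pixels sit next to bright ones.
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])

# A 5x5 image with a sharp vertical edge: dark left, bright right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

output = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        output[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(output)  # large positive values where the edge sits, zeros elsewhere

The output columns that overlap the edge read 3.0 (high confidence), while the flat region reads 0 (no edge found).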

Why do we need layers? & How many?

We said that we need kernels to extract features in order to detect an object or classify an image. Imagine you need to detect numerical digits, which look something like these:

Now, if you were to detect an 8 in these images, how would you do it? As humans, we are very smart, so it is easy for us; but for a computer system, it’s very difficult. Let’s see how machines detect an 8. The following steps are involved.

  1. It first tries to find edges and gradients. Every digit (including 8) is made up of thousands of small edges joined together. Gradients are the intensities or colors in an image; each intensity or color takes a value in the min-max (0-255) range.
  2. Based on these edges and gradients, it tries to find small textures. These basically give us spatial information about the arrangement of colors or image intensity values.
  3. Textures are then combined to form patterns. Patterns for the number 8 could be arcs and curves.
  4. From patterns, we obtain parts of objects. Parts of the object for the number 8 could be two circles.
  5. Parts of objects combine to form the object itself. Here, it’s the 8.

So the process by which a computer vision system detects an object or classifies an image involves these 5 steps (some of which are skipped depending on the images and the task at hand):

Steps involved in detection by a Computer Vision system

In the following image, we are trying to detect 4 objects, namely Faces, Cars, Elephants, and Chairs.

Notice that only 3 of the steps are shown here. Tracing the Faces image from bottom to top, we first obtain the edges and gradients of faces, then directly form parts of objects. Textures and patterns are not shown here, but they are also computed before reaching the parts-of-objects step. And finally, we see faces as the complete object itself. The same procedure is followed for the other 3 objects as well.

Learning of object parts

IMPORTANT NOTE: Not all images need all 5 steps. Some images might require a direct jump from edges & gradients to patterns, because some images are not rich in features for a given class. For example, if you are detecting a CAR, you will need all 5 steps in sequence, but detecting just the TYRE of a CAR would not need all of them. Hence, it depends on the image or the dataset you are working with.

Everything we have seen up to now assumed just one kernel. But in reality, kernels are stacked together to perform all 5 steps; just one kernel cannot do it alone. So we form different layers consisting of kernels and channels, where each layer (or set of layers) is responsible for one of those 5 steps.

Hence, in a neural network architecture, each of these 5 steps is carried out by a set of layers. One of the main tasks of DL researchers is to figure out how many layers it takes to carry out each step. This number is dynamic and varies for each dataset.

Example of a neural architecture (also called a network, or simply an architecture)

Here, the input data is our image matrix. We have several convolution layers, from conv1 to conv5. Ignore the last 3 layers; we will cover them in upcoming articles. Don’t worry about the depth of each convolution layer either. We are yet to cover the concepts of depth and how it works in images and kernels.

In the above image, each convolution layer carries out one of the corresponding 5 steps we just discussed.
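For a taste of what such a stack looks like in code, here is a minimal PyTorch sketch, loosely mirroring conv1 through conv5. The channel counts are invented for illustration, not taken from the figure:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3),    # conv1: edges & gradients
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3),   # conv2: textures
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3),  # conv3: patterns
    nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3), # conv4: parts of objects
    nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3), # conv5: objects
)

(The last 3 layers of the figure, which turn these features into a prediction, are omitted here, just as we are ignoring them in the text.)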

Wait… so if we have more layers, we get more features, and the better our architecture is?

No. As already mentioned above, each image has a certain set of features. If an image has few features and you try to detect too many, it will lead to bad network accuracy, and vice versa.

So the question still remains: how many layers should we use?

This is one of the most interesting questions in computer vision. The number of layers in a network is chosen based on an algorithm called MMZ. It is a rather interesting algorithm, and it takes some time to master.

So, what is MMZ?

You must be wondering whether we are being serious. Yes, we indeed are. MMZ expands to Meri MarZi (Hindi for “my own wish”).

It depends on your wishes and whims how many layers you want in your network. It all depends on the dataset you are working on. No one can tell you what will work best; you have to choose wisely, based on your intuition and understanding of the dataset. If you feel you are working on hard images, you might require more layers, and if you are working on simple datasets such as MNIST (the numerical digits), you require a comparatively smaller number of layers. Take that last statement with a pinch of salt: it is true in most cases but will produce awful results in others.

Always remember: your most valuable treasure is your intuition.

Alright, time to wrap up. Usshhh!! That was intense.

It was intense, but we expect you to digest a lot more than this article as we delve deeper and deeper. Next time, we will cover the receptive field, 1x1 kernels, and MaxPool layers, along with the calculations involving them.

To summarize,

  1. The output of a kernel is produced by the process of convolution.
  2. The output is a new image with reduced dimensions that contains a specific set of features extracted from the original image.
  3. How the output dimensions shrink, calculated from the size of the kernel.
  4. How convolution works, and the 5 main steps involved in object detection by a computer vision system.
  5. The concept and intuition behind the number of layers, as well as the beautiful MMZ algorithm.

Hope you enjoyed it. See you soon again.

NOTE: We are starting a new Telegram group to tackle all questions and queries. You can openly discuss concepts with other participants and gain more insights, which will be all the more helpful as we move further through the publication. [Follow this LINK to join]
