Fundamentals of ConvNets

“Computer vision is the science of programming a computer to gain a high-level understanding of digital imagery.”

Hussain Safwan
The Startup
5 min read · Dec 5, 2020


ConvNets are at the heart of image recognition and computer vision. One of the most fascinating things about them is how a sequence of raw mathematical operations, just multiplications and additions, ends up processing and recognizing complex visuals. In this episode, we talk about a few fundamental concepts of computer vision, define and illustrate the working procedure of ConvNets, and jot down some technical terms associated with them. First, let's talk about the idea, since ideas are bulletproof(!)

The idea

The idea is simple: detect what’s inside an image, a playback video, or a live recording (since all videos are, at the most rudimentary level, a sequence of fast-changing images). Technically speaking, computer vision is the science of programming a computer to gain a high-level understanding of digital imagery by replicating the working strategy of the human visual system. We use a variation of artificial neural networks called the Convolutional Neural Network to achieve this; from here on, I’ll be referring to the technology as CNN. The core task of a CNN is to detect the valuable features in an image, so that the image can be represented mathematically as a vector of numbers which can, in turn, be compared against the vector of a labelled image to decide which class the subject belongs to, or what's in it, depending on the initial requirements.

Now, a CNN may have numerous layers, each of a different type, most of which I’ll not be discussing in this episode. The layer we’ll be focusing on today is the convolution layer.

The Convolution Layer

The convolution layer, as the name suggests, performs the convolution operation. To begin with, let's point out the fact that digital images are arrays of numbers (typically ranging from 0 to 255) denoting the colour intensity of each pixel. For instance, a 1 MB grayscale square image, at one byte per pixel, is in memory an array of size 1024 x 1024. Please note that coloured images use a third dimension called the colour channel; for simplicity, we’ll be dealing with grayscale images in this lecture.
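To make that concrete, here's a tiny NumPy sketch (the sizes and random pixel values are illustrative assumptions, not taken from the figures below):

```python
import numpy as np

# A hypothetical 1024 x 1024 grayscale image: a 2-D array of
# pixel intensities, one byte (0-255) per pixel.
gray = np.random.randint(0, 256, size=(1024, 1024), dtype=np.uint8)
print(gray.nbytes)    # 1048576 bytes, i.e. 1 MB

# A coloured image adds a third dimension, the colour channel (e.g. RGB).
colour = np.random.randint(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
print(colour.shape)   # (1024, 1024, 3)
```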

The primary job of a convolution layer is to detect significant features in an image. We begin with edge detection. An edge (no, not the WWE one!) in an image is a line, straight or curved, on both sides of which there is a sharp change in colour contrast. To detect one, a filter window is slid over the original image, looking for an edge within its scope. Here’s how the convolution is done,

  • Take an image of dimensions nh x nw.
  • Take a filter of dimensions fh x fw.
  • Slide the filter over the image, one step at a time. On each step, multiply every element of the image inside the window element-wise with the corresponding element of the filter, add up the products, and store the sum in a cell of a 2d resultant array (we’ll talk about the dimensions of the resultant array in a short while). A minimal code sketch of these steps follows right after this list.
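Here's what those steps might look like in Python with NumPy: a minimal sketch for a single grayscale channel, with no padding or stride yet (the function name and structure are my own):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the resulting feature map.

    image  : 2-D array of shape (nh, nw)
    kernel : 2-D array of shape (fh, fw), the filter
    """
    nh, nw = image.shape
    fh, fw = kernel.shape
    out_h, out_w = nh - fh + 1, nw - fw + 1     # output dimension = n - f + 1
    result = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + fh, j:j + fw]      # current fh x fw patch
            result[i, j] = np.sum(window * kernel)  # element-wise multiply, then add
    return result
```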

Here’s a pictorial view,

The shaded image at the bottom is the subject, with the two edges marked by red ellipses; the 2d array on the upper left is the numerical form of the image. The convention used is that the lower the value of an individual pixel (a cell in the 2d array), the lighter the shade in the image. Finally, the 2d array on the upper right is the filter. Note that in this example, we use an image of dimensions 5 x 5 and a filter of 3 x 3.

Let’s capture the convolution in action,

For the first value in the resultant matrix,

resultant[0][0] = 10x1 + 10x0 + 0x(-1) + 10x1 + 10x0 + 0x(-1) + 10x1 + 10x0 + 0x(-1) = 30

As a matter of fact, each fh x fw (3 x 3, in this case) subset of the original matrix (the image) collapses into a single cell of the resultant matrix. The resultant has dimensions 3 x 3, which is exactly one more than the difference between the dimensions of the image and the filter.

Hence, output dimension = input dimension - filter dimension + 1
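Reading the filter off the arithmetic above (each of its rows is 1, 0, -1, a vertical edge detector), we can quickly check both the first cell and the formula:

```python
import numpy as np

window = np.array([[10, 10, 0],
                   [10, 10, 0],
                   [10, 10, 0]])          # first 3 x 3 patch of the image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])           # vertical edge-detecting filter

print(np.sum(window * kernel))            # 30, the first cell of the resultant
print(5 - 3 + 1)                          # 3, the resultant's dimension
```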

Take a closer look at the resultant matrix: there is a column of higher values (60) sandwiched between two columns of lower values (30), that is, there are two changes in contrast, corresponding to the two edges in the original image. You might be tempted to say that that’s no edge, that’s a whole shaded region, but let me remind you that we’re dealing with a 5 x 5 image, whereas real images are of much higher resolution.

The Accessories

There are a few factors to consider while convolving,

Padding: Note that the size of the image shrinks upon convolution, from 5 to 3, meaning we’re missing out on potential features at every level of convolution and may finally end up with a single-pixel image. Here’s where the concept of padding comes in: we pad the image with a border of 0s before convolving to preserve the dimensions. This is called same padding, where the dimensions of the resultant remain the same as the original after convolution.
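For a 3 x 3 filter, one line of zeros on each side, p = (f - 1) / 2 = 1, is what same padding amounts to. A quick sketch, reusing the convolve2d function from the earlier code (the zero image is only a stand-in for the 5 x 5 example):

```python
import numpy as np

image = np.zeros((5, 5))                  # stand-in for the 5 x 5 example image
kernel = np.ones((3, 3))                  # stand-in 3 x 3 filter

p = (3 - 1) // 2                          # padding per side for "same" output
padded = np.pad(image, p, mode='constant', constant_values=0)

print(padded.shape)                       # (7, 7)
print(convolve2d(padded, kernel).shape)   # (5, 5), same as the original
```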

Stride: Stride is the size of the step the filter takes at each iteration of the convolution. A stride of 2 means the filter is placed on every 2nd column (and every 2nd row).

Here’s a visual representation of padding and stride,

The illustration depicts a padding of 2 (one line of zeros on each side, i.e. same padding for the 5 x 5 image) and a stride of 2.

Incorporating padding and stride into the equation of the output dimension,

output dimension = floor((n + 2p - f) / s) + 1

where n is the input dimension, f the filter dimension, p the padding added on each side, and s the stride.
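A small helper (mine, not from the post) that evaluates this for the numbers we've used so far:

```python
def conv_output_dim(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

print(conv_output_dim(5, 3))            # 3: plain convolution, as before
print(conv_output_dim(5, 3, p=1))       # 5: same padding
print(conv_output_dim(5, 3, p=1, s=2))  # 3: padding of 1 per side, stride 2
```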

The Algorithm

Finally, as usual, let’s end the discussion with pseudocode for the algorithm,
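One way to write it out is the following runnable Python/NumPy sketch, which puts together the sliding window, padding, and stride described above (the function and variable names are mine):

```python
import numpy as np

def conv_layer(image, kernel, padding=0, stride=1):
    """Convolve a single-channel image with one filter.

    image   : 2-D array (nh x nw), the grayscale input
    kernel  : 2-D array (fh x fw), the filter
    padding : lines of zeros added on each side
    stride  : step taken by the sliding window
    """
    # 1. Pad the image with zeros on every side.
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)

    nh, nw = image.shape
    fh, fw = kernel.shape

    # 2. Output dimension = floor((n + 2p - f) / s) + 1
    #    (nh and nw already include the padding here).
    out_h = (nh - fh) // stride + 1
    out_w = (nw - fw) // stride + 1
    result = np.zeros((out_h, out_w))

    # 3. Slide the window: multiply element-wise, add, store.
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = image[r:r + fh, c:c + fw]
            result[i, j] = np.sum(window * kernel)
    return result
```

With the 5 x 5 image and the 3 x 3 vertical-edge filter from the example, conv_layer(image, kernel, padding=1, stride=1) would return a 5 x 5 feature map, which is exactly the same-padding behaviour discussed earlier.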
