# Convolutional NNs 1/7: The convolution operation

Hello everyone! While my work on the MLOps specialization is ongoing, I also plan to revisit my old notes on CNNs and RNNs from the Deep Learning specialization. This series of articles focuses mainly on course 4 of the specialization, titled Convolutional Neural Networks. I hope you enjoy these notes, and please comment on areas for improvement!

When building an image recognition model with a standard NN, we have so far used only small images (64x64, 24x24, etc.). In real life, the images we want to use are often around 1000 x 1000 pixels, so with three colour channels each image becomes a vector of about (3 million, 1), and the weight matrices are even larger. With this many parameters, we need a very large amount of data to prevent overfitting, along with very high computational resources. This is highly inefficient.
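To make the scale concrete, here is a back-of-the-envelope calculation. The image size comes from the paragraph above; the hidden layer width of 1000 units is only an assumption chosen for illustration.

```python
# Rough parameter count for a fully connected first layer.
height, width, channels = 1000, 1000, 3   # ~1000 x 1000 RGB image
hidden_units = 1000                       # assumed layer width, for illustration only

n_inputs = height * width * channels      # length of the flattened image vector
n_weights = n_inputs * hidden_units       # entries in the first weight matrix

print(n_inputs)    # 3000000
print(n_weights)   # 3000000000
```

Even with this modest hidden layer, the first weight matrix alone would hold about 3 billion values.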

To tackle this problem, we use the convolution operation, which is one of the fundamental building blocks of a CNN. We will illustrate convolutions using the example of edge detection.

# Convolution operation:

We use this operation to preserve the spatial relationship between pixels by learning image features from small squares of input data, i.e., we take a small set of neighbouring pixels, combine them using the filter, and store an output that captures the spatial relationship between those pixels.

# Now what is this ‘spatial relationship’?

To understand this, consider a book placed on a table. How do we determine that the book is different from the table? We humans know instinctively that the book is different from the table because we can see the book in three dimensions and process what we see, i.e., we notice the shadows cast by the book, its height, and many other features.

Similarly, in computer vision, we first need to differentiate an object from its surroundings, which is done by comparing a pixel to its neighbours. In a standard NN, we looked at each pixel individually and tried to map the entire set of pixels to an output using a function. In a CNN, we instead study the differences between a pixel and its surroundings to extract features. This is nothing but the aforementioned “spatial relationship”.

📢 Strictly speaking, the operation we perform is called cross-correlation in mathematics, since a true convolution flips the filter first. But in DL literature, we call it a convolution operation.
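The difference is easiest to see in one dimension. The sketch below uses NumPy's `np.correlate` and `np.convolve` on a made-up signal and filter (both are assumptions for illustration): the convolution gives the same result as cross-correlating with the reversed filter.

```python
import numpy as np

signal = np.array([1, 2, 3, 4, 5])   # illustrative 1D "image"
kernel = np.array([1, 0, -1])        # illustrative 1D filter

# Cross-correlation: slide the filter over the signal as-is.
corr = np.correlate(signal, kernel, mode="valid")

# Mathematical convolution: the filter is reversed before sliding.
conv = np.convolve(signal, kernel, mode="valid")

print(corr)  # [-2 -2 -2]
print(conv)  # [2 2 2]

# Convolution equals cross-correlation with the reversed filter.
print(np.array_equal(conv, np.correlate(signal, kernel[::-1], mode="valid")))  # True
```

Because the filter values in a CNN are learned, flipping them makes no practical difference, which is one reason the two names are used interchangeably in DL.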

To gain more clarity on the process of convolution, let us take a simple example of a 5x5 matrix (image in this case), and a 3x3 filter:

In the above GIF image, we can see that the entire 5x5 matrix is transformed into a 3x3 matrix with the help of a 3x3 filter. Each value of this 3x3 output describes the spatial relationship of the pixels around one position in the middle of the 5x5 matrix [i.e., from position (2,2) to (4,4)]. This means we lose information about the borders of the matrix. We will see how to tackle this loss of information later on, when we discuss padding.
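The sliding-window process can be sketched in a few lines of NumPy. This is a minimal, unoptimised illustration rather than how DL libraries implement it; the function name `cross_correlate2d` and the input values are hypothetical, chosen only to show the shapes.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]   # patch currently under the filter
            out[i, j] = np.sum(window * kernel)  # element-wise product, then sum
    return out

image = np.arange(25).reshape(5, 5)   # illustrative 5x5 "image" with values 0..24
kernel = np.ones((3, 3))              # illustrative 3x3 filter

result = cross_correlate2d(image, kernel)
print(result.shape)   # (3, 3)
```

A 5x5 input and a 3x3 filter leave only 5 - 3 + 1 = 3 valid positions along each axis, which is why the output is 3x3.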

# Example: Edge Detection

Edge detection is one of the easiest applications of the convolution operation, as pixel values differ greatly at edges. When we look at the pixels on the boundary between two objects, say a face and its background, a pixel on the edge of the face has skin-coloured pixels on one side and background-coloured pixels on the other.

We can also go back to the earlier book-on-the-table example: the texture and colour of the book differ from those of the table on which it is placed.

Now, let’s look at a far simpler example to understand the essence of what we have just read:

In this example, you can see that the border lies between the 3rd and 4th columns, where one side is bright and the other side is dark. When this image is passed through a filter, we get bright columns in the middle of the output, indicating that the edge is in the middle of the considered pixels. Now, if the input image is flipped horizontally, we get the opposite output, as we can see below:
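Both cases can be reproduced with a small NumPy sketch. The 6x6 image (10 for bright, 0 for dark) and the 3x3 vertical edge filter are assumed values chosen to match the description, and `cross_correlate2d` is a hypothetical helper, not a library function.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed input: bright left half, dark right half, edge between columns 3 and 4.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# Assumed vertical edge filter: positive on the left, negative on the right.
vertical_edge = np.array([[1, 0, -1]] * 3)

edges = cross_correlate2d(image, vertical_edge)
flipped_edges = cross_correlate2d(np.fliplr(image), vertical_edge)

print(edges)          # every row is 0, 30, 30, 0: a bright band in the middle
print(flipped_edges)  # every row is 0, -30, -30, 0: same edge, opposite sign
```

The sign of the response tells us the direction of the transition: positive for bright-to-dark, negative for dark-to-bright, under this choice of filter.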