A Complete Overview of Convolutional Neural Networks

Abhi Siripurapu
Published in The Startup
9 min read · Dec 1, 2020


From diagnosing cancer to powering Amazon recommendations to playing the ancient game of Go, convolutional neural networks (CNNs) are some of the most versatile forms of AI. So, how do they work? CNNs are world-class at image classification, whether it be finding patterns in images or classifying different pictures.

What makes a CNN different?

CNNs differ from regular neural networks in their ability to understand and analyze spatial information. While other types of NNs only look at individual inputs, CNNs process the image as a whole, in groups of pixels. They do this through something called the convolutional layer (hence the name). Despite the fancy wording, all the convolutional layer does is apply a series of filters to the original image. These filters, aka convolutional kernels, remove unnecessary information and highlight differences. Kernels usually outline one of two things: color or shape. Let’s focus on shape for a moment. Computers understand shape as a change of intensity. Intensity is basically how light or dark a certain area is, a lot like brightness. CNNs understand and classify images by looking for extreme changes in intensity, where an area goes from very dark to very light. These changes indicate the edge of an object, which helps CNNs better visualize and understand the images we feed them. Let’s look at one type of shape filter, the high-pass filter.

High-pass filters and convolutional kernels

High-pass filters highlight abrupt changes in intensity, as we discussed before. Once applied, a high-pass filter will black out pixels that show little change in intensity, but when a pixel is much darker or lighter than its neighbors, the filter will draw a line, making edges much more apparent. So how does it do that?

To start, it’s best to simplify things for our network by converting our images to grayscale (black and white). To grasp how the filter works, we’re going to have to refine our definition of convolutional kernels (remember those?). While it’s true that kernels are filters, a more accurate definition would be that kernels are matrices of numbers that transform an image. To help us out, let me bring in a visual.

What a kernel looks like Credit: Udacity

Here is an example of a kernel used for edge detection. It is a 3x3 matrix whose elements all add up to zero. It’s important that the elements sum to zero because the filter is computing the difference, or change, between adjacent pixels. This difference is calculated by subtracting the values of the surrounding pixels from the center pixel. If we pass the filter over a set of pixels and the sum does not equal zero, that number is marked as positive or negative (depending on whether the elements add up to be positive or negative; by the way, positive is generally associated with light, and negative with dark), and the respective pixels in that kernel are then turned either light or dark. Now let’s expand our definition of kernel convolution even further: kernel convolution is the operation of taking a kernel and passing it over each pixel in an image, highlighting edges. Now that we fully understand high-pass filters, let’s go through an example.

Our example Credit: Udacity

Let’s use the 3x3 group of pixels shown on the right. (If you’re wondering how they are labeled, the computer does that for us: each number indicates how bright the pixel is; the higher the number, the lighter the pixel, and the lower, the darker.) All the kernel does is multiply each value in the group of pixels by the corresponding value in our kernel: 120x0, 140x-1, and so on. This forms a new matrix of values, and when we add them all up, we get 60. As you can see from the picture, that indicates a slight edge. If you wanted, you could tell this just by looking at the matrix of pixels: going from bottom to top, the pixels do get darker, but the change is pretty gradual. The numbers in our kernel are generally referred to as weights, because they determine how much each pixel in the group counts. Notice how the center pixel is given the most priority, then the ones directly adjacent to it, and finally the corner pixels. This is no fluke; remember what I told you before? The difference is calculated by subtracting the surrounding pixels from the middle one. This kernel operation is repeated thousands of times, once for each pixel. But what happens when the kernel can’t nicely fit over all of the pixels, e.g., at the edge of the image? Well, there are a couple of ways to deal with that:

  • Extension
  • Padding
  • Cropping
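The multiply-and-sum walkthrough above can be sketched in a few lines of numpy. The pixel values here are illustrative stand-ins, chosen so that the weighted sum comes out to 60 as in the example; only the 120 and 140 come from the article, and the kernel is a standard zero-sum edge-detection kernel with the center weighted most heavily:

```python
import numpy as np

# Hypothetical 3x3 patch of pixel intensities (0 = black, 255 = white);
# illustrative values, not the exact ones from the figure.
patch = np.array([[120, 140, 120],
                  [225, 220, 205],
                  [255, 250, 255]], dtype=float)

# Edge-detection kernel: weights sum to zero, center weighted most,
# adjacent pixels next, corners least.
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]], dtype=float)

# Multiply each pixel by its weight and add everything up.
response = np.sum(patch * kernel)
print(response)  # 60.0 — a slight edge
```

A response near zero would mean the neighborhood is roughly uniform, i.e., no edge.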

In this article, we’ll go over the most commonly used method, padding. Padding simply marks the missing pixels as zero, allowing our kernel to continue moving around the image. Each number in the new 3x3 matrix is often referred to as a node. If you’re still reading, great job! We’ve managed to completely grasp what a convolutional kernel is. Now let’s take a step back and look at the convolutional layer as a whole.
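Zero padding is a one-liner in numpy. A minimal sketch, using a small hypothetical image:

```python
import numpy as np

# Hypothetical 3x3 image.
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Add a one-pixel border of zeros so a 3x3 kernel can be centered
# on every pixel of the original image, including the edge pixels.
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)
print(padded)
```

The padded result is 5x5: the original 3x3 image surrounded by a ring of zeros.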

The convolutional layer

The convolutional layer isn’t composed of just one kernel/filter, but of many. We create many filters and nodes by changing the weights inside the 3x3 kernel. It’s common to have hundreds of collections of output nodes, each collection corresponding to its own kernel/filter. Let’s visualize this for better understanding.

Our example Credit: Udacity

Let’s take a look at the image above. We have a car in grayscale (notice how the image is interpreted in 2D; this will come up later), to which 4 different filters have been applied. Take a couple of seconds to try and guess their function based on the output. Filters 1 and 2 detect vertical lines on the car, while filters 3 and 4 detect horizontal lines. Each of these filters probably has hundreds of nodes behind it, because, remember, the kernel is passed over every single pixel. But can we glean more from these images? (Hint: look at the shading of the filters.) Notice how in each of the filters, one side appears to be outlined more than the other. This is also no mistake. For those asking why we couldn’t do this all with one filter, let me explain. The first filter recognizes changes from low intensity to high intensity. This only works on one side of the car, though: the right side of the car is darker than the background, leading to an abrupt change in intensity, which outlines an edge in the picture. Now imagine the filter moved to the left side of the car. The filter detects only dark-to-light changes, and since the background is not darker than the car itself, the filter believes there is no edge. That is why we need 4 filters just to find horizontal and vertical lines.
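The four directional filters can be sketched as Sobel-style kernels (my own choice of weights, not necessarily the ones Udacity uses), each responding to only one direction of intensity change. Running them over a toy image of a dark square on a light background shows each one lighting up on a different edge:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over an image (stride 1, no padding) and
    return the map of weighted sums."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# Four one-direction edge filters: each only responds to one kind of
# intensity change, which is why all four are needed.
filters = {
    "dark-to-light, vertical edge":   np.array([[-1, 0, 1]] * 3),
    "light-to-dark, vertical edge":   np.array([[ 1, 0, -1]] * 3),
    "dark-to-light, horizontal edge": np.array([[-1]*3, [0]*3, [1]*3]),
    "light-to-dark, horizontal edge": np.array([[ 1]*3, [0]*3, [-1]*3]),
}

# Toy image: a dark (0) square on a light (255) background.
img = np.full((8, 8), 255.0)
img[2:6, 2:6] = 0.0

# Each filter fires strongly only along "its" side of the square.
for name, k in filters.items():
    print(name, convolve2d(img, k).max())
```

The strongest possible response here is 3x255 = 765, reached where a filter's whole dark column sits on the square and its whole light column sits on the background.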

Notice how certain filters detect certain edges Credit: Udacity

Let’s now take a look at color images. Computers interpret grayscale images in 2D (height, width), but they interpret color images in 3D (height, width, depth). For RGB (red, green, blue), the depth is three. Think of the image as a stack of three 2-dimensional matrices of pixels: since our depth is three, we divide the image into red, green, and blue 2-dimensional matrices, basically making 3 separate images, each in one color. Now, what about our kernel? Well, it’s pretty much the same. It will still move over every pixel, but now the kernel itself will also be a stack of three 2-dimensional matrices. Let’s look at a picture.

A visualization of a color image Credit: Udacity
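A quick sketch of one step of that 3-channel convolution, with random stand-in values for both the image and the kernel weights: the kernel now has shape 3x3x3, and one output node is still just a multiply-and-sum, only over three times as many elements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 RGB image: (height, width, depth) with depth 3.
img = rng.integers(0, 256, size=(4, 4, 3)).astype(float)

# The kernel is also a stack of three 3x3 matrices, one per channel.
kernel = rng.standard_normal((3, 3, 3))

# One output node: multiply each of the 27 values by its weight and
# sum, exactly as in grayscale but with three times the elements.
node = np.sum(img[0:3, 0:3, :] * kernel)
print(node)
```

Sliding this 3x3x3 kernel across the whole image produces a single 2D map of nodes, just as in the grayscale case.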

Now, when we sum our nodes to detect an edge, it’s also the same, just with 3 times as many elements. Here comes the cool stuff once we’ve fed an image through a convolutional layer. Let’s expand on why CNNs are different. While other networks’ nodes are fully connected to every node in the previous layer, each node in a CNN is connected to only a small number of nodes in the previous layer. The cool thing is, this allows us to keep feeding through convolutional layers: after we’ve run through the first one, we can feed the output through a second to find even more intricate details and patterns. But our CNN isn’t complete yet. The more filters we add, the more parameters we need, and this can lead to overfitting: a tendency computers have to become so good at what they’ve been taught that they lose the ability to work with unknown data. Let’s go over a method to fix this: pooling.

Pooling layers

Before we jump into pooling layers, let’s talk about stride. Remember how I said the kernel goes over every pixel in an image? We can control how it does that. Stride is simply how many pixels the filter moves at a time as it goes over the image. If the stride is one, you get an end result roughly the same size as the original. If two, you get an image roughly half the size of the original, and so on. Alright, now let me explain a bit more about pooling layers. Pooling layers take convolutional layers as input and reduce the parameters and dimensions from one convolutional layer to the next. A pooling layer looks over the nodes in the convolutional layer and takes the maximum number from each run of its kernel. This is much easier to visualize, though.

An example of what a pooling layer would do Credit: Udacity

In this example, our pooling layer is using a 2x2 matrix with stride 2. It begins at the far left: of those 4 nodes, the one containing 9 has the greatest value, so the pooling layer stores that. It then moves horizontally and stores 8, goes down to the next row and stores 7, and so on. Once we’ve gone all the way through, we end up with half the width and height, a quarter of the original nodes, greatly simplifying our parameters for the next convolutional layer. It’s not uncommon to have a pooling layer after every convolutional layer.
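Max pooling is short to write out by hand. The node values below are hypothetical, chosen so the pooled outputs start with the 9, 8, 7 from the walkthrough:

```python
import numpy as np

# Hypothetical 4x4 map of nodes from a convolutional layer.
nodes = np.array([[1, 9, 5, 8],
                  [4, 2, 3, 6],
                  [7, 0, 2, 1],
                  [3, 5, 4, 0]], dtype=float)

def max_pool(x, size=2, stride=2):
    """2x2 max pooling with stride 2: keep only the largest value
    in each non-overlapping window."""
    h, w = x.shape
    out = np.zeros((h // stride, w // stride))
    for y in range(0, h - size + 1, stride):
        for x_ in range(0, w - size + 1, stride):
            out[y // stride, x_ // stride] = x[y:y+size, x_:x_+size].max()
    return out

print(max_pool(nodes))
# First window {1, 9, 4, 2} keeps 9, the next keeps 8, and so on.
```

The 4x4 input becomes a 2x2 output: each dimension is halved, so only a quarter of the nodes remain.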

Putting it all together

So now that we understand all the layers in a CNN, what moves the information from one layer to the next? Between each layer, a ReLU (Rectified Linear Unit) activation function is applied: it passes positive values through unchanged and replaces negative values with zero.
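ReLU is one line of numpy. A minimal sketch, applied to a few hypothetical node values:

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through, zero out negatives."""
    return np.maximum(0, x)

# Hypothetical node values coming out of a convolutional layer.
nodes = np.array([-60.0, 0.0, 60.0, 255.0])
print(relu(nodes))  # [  0.   0.  60. 255.]
```

Because negative responses are zeroed out, only nodes that actually detected their pattern pass information on to the next layer.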

The CNN as a whole Credit: Udacity

CNNs put input images through many convolutional and pooling layers to understand and identify patterns in the images. The first time a CNN saw a car, it wouldn’t be able to identify what it was, but as it sees hundreds and even thousands of cars, it learns to find the intricate patterns that define a car. So the next time a CNN is shown a car, it recognizes those patterns and identifies the object as a car. If you’re still here, congratulations! You now have the knowledge to understand a basic CNN, but you shouldn’t stop there. Even if you don’t know how to code, I would highly recommend going on GitHub and finding a CNN architecture to play around with. Here’s one of my favorite examples, where you’ll be classifying images from the CIFAR-10 dataset. (Copy and paste the code, preferably into a Jupyter Notebook.)
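To tie the pieces together, here is a toy forward pass through one convolution → ReLU → pooling stage, numpy only, with a random image and random filter weights standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d(image, kernel):
    """Valid convolution, stride 1: weighted sum at every position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Non-overlapping 2x2 max pooling."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Hypothetical 10x10 grayscale image and one random 3x3 filter.
image = rng.standard_normal((10, 10))
kernel = rng.standard_normal((3, 3))

# Convolution -> ReLU -> 2x2 max pooling: one conv stage of a CNN.
feature_map = max_pool(relu(convolve2d(image, kernel)))
print(feature_map.shape)  # (4, 4)
```

A real CNN stacks several such stages (with many filters per stage, and learned rather than random weights) and finishes with fully connected layers that turn the final feature maps into class scores.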


All of the info and visuals in this article came from this Udacity course. The course is a must if you’re interested in AI, though it does require some coding knowledge beforehand.


