The Next Step in Computer Vision: Convolutional Neural Networks

Karthik Mittal
Published in The Startup
7 min read · Oct 12, 2020
Credit: medium.com

The human brain is a vast and capable tool that can easily distinguish between objects like cats and dogs or between plants and animals. Computers, however, struggle with even the simplest image classification tasks. This is because the human brain is extremely difficult to replicate, given the sheer number of interactions between its neurons. Still, ML specialists have made tremendous breakthroughs that allow computers to classify images far more efficiently. The one most widely used today? Convolutional neural networks, or CNNs.

In order to understand how CNNs work, a knowledge of the basics of neural networks is necessary. If you’re interested in exploring the fundamentals, I suggest watching this video. Or if you are more of a reader, I suggest reading this article.

What are CNNs?

Now that we’ve gotten the basics out of the way, let’s understand how exactly a convolutional network actually looks at these images and produces an output.

Credit: towardsdatascience.com

When a computer processes an image, it simply sees an array of pixel values described by RGB channels (e.g. 400 x 400 x 3, where 3 represents the red, green, and blue values for each pixel). The essential premise of a CNN is to feed this array of values into the neural network and output the probability of the image belonging to a certain class.
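To make this concrete, here is a minimal sketch (using a randomly generated image as a stand-in for a real photo) of what the network actually receives:

```python
import numpy as np

# A hypothetical 400 x 400 RGB image: height x width x 3 channels,
# each pixel an intensity in the range 0-255.
image = np.random.randint(0, 256, size=(400, 400, 3), dtype=np.uint8)

print(image.shape)  # (400, 400, 3)
print(image.size)   # 480000 raw numbers -- all the computer "sees"
```

From the computer's perspective, this is nothing but 480,000 integers; the CNN's job is to turn them into class probabilities.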

CNNs try to emulate humans in that they attempt to find the most distinctive features in an image and use them to classify the object. They do this by first looking at rudimentary features like edges and sides; then, through subsequent layers, they build up to more complex features such as a face or legs. By the final layer, they have a deep understanding of what the image looks like and can make an informed decision based on that knowledge.

The Nitty Gritty Terms

These layers are called convolutional layers, and they find these distinct features through filters. The region of the image that a filter is placed over is called the receptive field. Filters are often much smaller than the original picture; for example, if the input image is a 200 x 200 x 3 array of pixel values, a filter might cover only a 5 x 5 area at a time.

Credit: adeshpande3.github.io

As this filter slides across the image, it computes element-wise multiplications between the filter and the receptive field it currently covers. Summing these products produces a single number that represents how strongly the filter responds to the image at that point.

This process repeats across the rest of the image. After sliding a k x k filter over the entire input (with a stride of 1 and no padding), we are left with an image that is k - 1 pixels smaller than the input both horizontally and vertically: an n x n input produces an (n - k + 1) x (n - k + 1) output. This new image is called the activation map.

What are These Filters?

So, what are these filters that perform such seemingly magical tasks? They are often known as feature identifiers: filters that detect elements like edges or curves.

Credit: adeshpande3.github.io

Let’s pretend that our first filter in our first convolutional layer can detect the curves in an object. In this scenario, this filter would then have a pixelated structure similar to the shape of a curve, like the image shown above.

Now it becomes clear how these filters can detect significant curves and lines. If a region of the image closely resembles the curve encoded in the filter, like the one shown above, the element-wise multiplication produces an extremely large value. By looking at the areas with high outputs, we can identify the places that resemble curves and draw conclusions from them.
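A tiny numeric example makes this matching behavior obvious. The 3 x 3 "diagonal" filter below is illustrative (not the article's curve filter): the response is large where the patch matches the filter's shape and zero where it does not:

```python
import numpy as np

# Toy filter that "looks for" a diagonal stroke.
filt = np.array([[0.,  0., 30.],
                 [0., 30.,  0.],
                 [30., 0.,  0.]])

matching_patch = np.array([[0., 0., 1.],
                           [0., 1., 0.],
                           [1., 0., 0.]])   # contains the same diagonal
blank_patch = np.zeros((3, 3))              # contains nothing

print(np.sum(filt * matching_patch))  # 90.0 -- strong activation
print(np.sum(filt * blank_patch))     # 0.0  -- no activation
```

High values in the activation map therefore mark exactly the locations where the feature appears.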

Credit: medium.com

There are many distinct filters built into these CNNs, from lines that curve to the right to straight edges. These filters are typically applied in the preliminary convolutional layers to find where such features lie.

Using this activation map as the input, subsequent convolutional layers can use their own filters in order to build on the information given. For example, if the first convolutional layer described lines and edges, then the next hidden layer can be used to describe shapes like squares and triangles.
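The stacking idea can be sketched as two convolutions in a row, where the second layer consumes the first layer's activation map (a minimal sketch, assuming a simple valid-mode convolution helper and random filters standing in for learned ones):

```python
import numpy as np

def conv(x, k):
    """Valid-mode convolution: the output shrinks by k - 1 per dimension."""
    m = x.shape[0] - k.shape[0] + 1
    return np.array([[np.sum(x[i:i + k.shape[0], j:j + k.shape[0]] * k)
                      for j in range(m)]
                     for i in range(m)])

layer1 = conv(np.random.rand(32, 32), np.random.rand(5, 5))  # e.g. edges
layer2 = conv(layer1, np.random.rand(5, 5))                  # e.g. shapes
print(layer1.shape, layer2.shape)  # (28, 28) (24, 24)
```

Each layer's filters operate on the previous layer's feature responses, which is what lets later layers describe increasingly abstract patterns.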

The Output Layer

The last layer is undoubtedly the most important part of a neural network as it makes the classifications depending on the input data given.

Credit: superdatascience.com

Depending on the number of classes you want your neural network to distinguish (whether ten classes to classify digits or just two to tell cats from dogs), the network outputs an N-dimensional vector, where N represents the number of classes. This vector contains the probability of the input image belonging to each class; if the network works correctly, one of these probabilities should be much higher than the others!
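This vector is typically produced by a softmax over the final layer's raw scores. Here is a sketch with made-up scores for a hypothetical two-class cat-vs-dog network:

```python
import numpy as np

def softmax(logits):
    """Turn raw final-layer scores into probabilities that sum to 1."""
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical raw scores: the network is much more confident in class 0.
probs = softmax(np.array([2.0, 0.1]))
print(probs)        # one probability clearly larger than the other
print(probs.sum())  # 1.0
```

The predicted class is simply the index of the largest probability.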

By looking at the output of the previous layer (which encodes extremely high-level features), the final layer determines which features correspond to each particular class. For example, to distinguish dogs from cats, it might look for paws for dogs and whiskers for cats.

In the end, you should have a neural network that can pick out the distinctive features of an image and settle on the most pertinent classification. But how do the filters in each layer know what to look for and which values to have?

How does It Learn?

All neural networks have to start somewhere. Before performing these calculations, a network randomizes its weights and filter values because it has no idea what the correct output is. Over time, however, the network learns from training sets of thousands of images with correct labels (e.g. pictures of cats and dogs, each labeled as cat or dog). The next section is fairly high-level and requires some understanding of neural networks, so I would watch the video and read the article linked above before continuing!

Credit: Plos One

After completing its first feedforward pass through the layers, the network uses a loss function to measure how badly the model performed. Understandably, in the first few runs the loss will be extremely high, because the network has no clue what it is doing! Afterwards, using backpropagation, the network tweaks its values, striving to perform better and reduce the loss. Using gradient descent, the network searches for the set of weights and filter values that minimizes the loss. As the image above shows, the network starts out performing inaccurately and, as time progresses, slowly learns and becomes more precise.
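The feedforward/loss/gradient-descent loop is the same idea whether the model has one weight or millions. Here is a deliberately tiny sketch on a one-parameter "network" predicting y = w * x (the data and learning rate are made up for illustration):

```python
import numpy as np

# Labeled training data where the true relationship is y = 2 * x.
x, y_true = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
w = np.random.randn()  # start from a random weight -- the network knows nothing
lr = 0.05              # learning rate: how big each correction step is

for _ in range(200):
    y_pred = w * x                              # feedforward pass
    grad = np.mean(2 * (y_pred - y_true) * x)   # d(squared loss)/dw, chain rule
    w -= lr * grad                              # step against the gradient

print(round(w, 3))  # close to 2.0, the weight that minimizes the loss
```

A real CNN repeats this exact loop over every filter value, with backpropagation computing the gradients layer by layer.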

Now that we understand how a CNN works, let's look at some common applications of CNNs in modern society.

What’s Next for CNNs?

Convolutional neural networks are mainly used in computer vision, training computers to understand the world around us. This is understandable given CNNs' unique functionality. From facial recognition to self-driving cars, CNNs have a vast array of applications that will be interesting to explore in the coming years. The current roadblock for most major tech companies is actually gathering the input data to train these networks; however, companies like Facebook and Google are finding ingenious solutions to this problem.

Credit: towardsdatascience.com

Currently, I believe the most exciting field for CNNs is self-driving cars: using sensors and cameras to map the world around the vehicle and make impactful decisions based on that information. Comment below on what you think is the most impactful technology today!

TL;DR

  • Convolutional neural networks try to emulate the human brain by simulating artificial neurons that produce an output from received input.
  • Using convolutional layers and filters, CNNs are able to classify images. Starting with rudimentary features and working their way up to more abstract ones, CNNs estimate the probability of an image belonging to each class.
  • A CNN trains using the traditional methods of feedforward passes, backpropagation, and gradient descent to figure out how to adjust its weights for the best accuracy.
  • CNNs are mainly being used to allow computers to understand the world around us with applications in facial recognition and driverless vehicles.

Additional Resources

Hi! I am a 16 year old currently interested in the fields of machine learning and biotechnology. If you are interested in seeing more of my content and what I publish, consider subscribing to my newsletter! Check out my September newsletter here! Also, check out my LinkedIn and Github pages.
