Beginner’s Guide to Everything Image Recognition

Alexander Chow
Published in The Startup
Dec 6, 2020 · 7 min read

Image recognition. You’ve probably heard this term thrown around a lot, and now you’ve come to this article to learn about it.

In this article, you will learn the basics of image recognition, convolutional neural networks, and the latest and greatest technologies such as YOLO that are being created and implemented. With that being said, let’s get started!

What is image recognition?

Image recognition, often used interchangeably with computer vision, is a scientific field that deals with the methods by which computers can gain a deep understanding of their surroundings from digital photos or videos. Countless companies around the world are working on ever more capable models that can be deployed to solve pertinent problems in the real world.

Self-Driving Car POV

One example of image recognition in the real world would be autonomous driving. As I’m sure you all know, autonomous cars are cars that are able to drive without the assistance of a person. All of these cars utilize image recognition to recognize different objects around them (e.g. pedestrians, signs, cars, sidewalks, etc.) and manoeuvre themselves accordingly.

Now that we know what it is, how does it work?

At a high level, image recognition/computer vision is usually achieved through the use of neural networks, specifically convolutional neural networks. If you are not familiar with the basics of machine learning and neural networks, I highly suggest reading this link to learn more. Assuming you do know the basics, here’s a brief explanation of what a convolutional neural network is:

I’m sure that this looks pretty complicated, but let me explain the key features: Convolution, ReLU, and Pooling.

Convolution:

One of the most important concepts to understand is that every image or frame of a video is just a matrix of numbers. Essentially, this means that every pixel can be represented by a number. Whether the image is black and white or fully coloured, each pixel can be represented by just one number or a few numbers. For example, each black-and-white pixel has a value from 0–255 (0 = black, 255 = white), and each coloured pixel has three values, one each for red, green, and blue, all ranging from 0–255. For the purpose of simplicity, we will focus on gray-scale images in this article.
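To make this concrete, here is a tiny NumPy sketch (the pixel values are made up for illustration):

```python
import numpy as np

# A tiny 3x3 gray-scale "image": each entry is a pixel intensity from 0-255
gray = np.array([
    [0,   128, 255],
    [64,  200,  32],
    [255,   0, 100],
], dtype=np.uint8)

# A single colour pixel is three values (red, green, blue), each 0-255
rgb_pixel = np.array([255, 0, 0], dtype=np.uint8)  # pure red

print(gray.shape)       # (3, 3) -- a matrix of numbers
print(rgb_pixel.shape)  # (3,)   -- one value per colour channel
```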

Every pixel can be represented by a single number

In this step, a filter is chosen (initially randomized) to move pixel by pixel from left to right and top to bottom. At each position, the filter's values are multiplied element-wise with the pixels it covers and the products are summed up into a single number. After each of these operations, the filter slides over by one pixel.

Consider this simplified example:

Here is the initial matrix that will be convolved over (i.e. the initial image):

Here is the “filter” that will convolve over the matrix above (with weights that are trained):

In the image below, the feature map (convolved feature) is calculated by using convolution.

As you can see, the result of each multiply-and-sum operation is written into what we call a feature map, which is essentially the result. This feature map forms an entirely new image. Many of these filters can be used for things like edge detection, sharpening, blurring, and more. More often than not, filters are used to detect certain features that are useful for classifying objects. Here is an example:

Example of convolution operation

As you can see, different filters will create different feature maps and results, and they will identify different features. In practice, a CNN will "learn" the values of all of its filters, but we still need to specify several hyperparameters, including the filter size, the number of filters, the architecture of the network, etc.
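The convolution step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an optimized implementation, and the image and filter values here are made up for demonstration:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image one pixel at a time.

    At each position, multiply the overlapping values element-wise,
    sum them, and write the result into the feature map.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# A toy 5x5 binary "image" and a 3x3 filter
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

print(convolve2d(image, kernel))
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

Note how the 5x5 image shrinks to a 3x3 feature map, since the 3x3 filter can only fit in 3 positions along each axis.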

ReLU:

As I’m sure you know, ReLU (Rectified Linear Unit) is an activation function that aims to introduce non-linearity to the model. As a short review, activation functions are used to introduce non-linearity, which means that the model will be able to learn more complex patterns. Without activation functions like ReLU, the model will never be able to learn anything other than a linear relationship.

In short, ReLU is an element-wise operation, which means that it is applied per pixel, and it replaces all negative values in the feature map with zero. Essentially, all negative values are zeroed out while positive values pass through unchanged.

Here is an example:
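In NumPy, the entire ReLU step is a single element-wise maximum (the feature-map values below are made up for illustration):

```python
import numpy as np

# A hypothetical feature map produced by a convolution
feature_map = np.array([
    [ 2.0, -1.5,  0.3],
    [-0.7,  4.0, -2.2],
    [ 1.1, -0.1,  0.0],
])

# ReLU: element-wise max(x, 0) -- every negative value becomes zero
rectified = np.maximum(feature_map, 0)

print(rectified)
# [[2.  0.  0.3]
#  [0.  4.  0. ]
#  [1.1 0.  0. ]]
```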

Pooling

Pooling is mainly used to reduce the dimensionality of each feature map while retaining the most important information. There are a few variations of pooling, including Max, Average, Sum, and more. Since the most common type is Max Pooling, we will focus on that.

Max Pooling works by defining a spatial window and taking the largest element from the feature map within that window. Even though Max Pooling has been shown to be the most effective method of pooling, we could still take the average or sum of all the elements in that window as well.

Here is an example of the Max Pooling operation after convolution and ReLU. In this case, we used a 2x2 window.

In this case, we slide our 2x2 window by 2 cells each time so that there is no overlap, and take the maximum value in each window/region. This is done to reduce the dimensionality and make the feature map simpler for the model.
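The windowing described above can be sketched directly (the feature-map values are made up for illustration):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Take the max of each size x size window, moving by `stride` cells."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()  # swap for .mean() or .sum() for
    return out                        # Average or Sum Pooling

# A hypothetical 4x4 rectified feature map
fmap = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

print(max_pool(fmap))
# [[6. 8.]
#  [3. 4.]]
```

The 4x4 map shrinks to 2x2, keeping only the strongest activation in each region.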

Here is an example of the effect of Max and Sum Pooling on a rectified feature map:

Conclusion

Now that we know how all of these different features work, the image that I showed above should make much more sense. After we complete convolution, ReLU, and pooling, we can just put our values for the pixels into a fully connected neural network just like a normal neural network!
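The whole pipeline can be strung together end to end in a short NumPy sketch. Everything here is made up for illustration: the weights are random rather than trained, and the three-class output layer is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))            # toy gray-scale input
kernel = rng.standard_normal((3, 3))  # one filter (random, not trained)

# Convolution: slide the filter, element-wise multiply and sum -> 6x6 map
fmap = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel)
                  for j in range(6)] for i in range(6)])

# ReLU: zero out the negatives
fmap = np.maximum(fmap, 0)

# 2x2 max pooling with stride 2 -> 3x3 map
pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Flatten and feed into a fully connected layer with a softmax output
flat = pooled.reshape(-1)              # 9 values
weights = rng.standard_normal((9, 3))  # 3 hypothetical classes
logits = flat @ weights
probs = np.exp(logits) / np.exp(logits).sum()

print(probs)  # class probabilities, summing to 1
```

A real CNN stacks many such convolution/ReLU/pooling layers and learns the filter and weight values through backpropagation, but the data flow is exactly this.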

Latest and greatest technologies:

Great! Now that we know all about how CNNs work, we can learn about the more exciting stuff: The latest and greatest innovations and implementations!

1. YOLO (You Only Look Once)

In my opinion, this is one of the most interesting implementations and a new take on image recognition. It is beneficial in many ways, including being extremely fast and only looking at the entire image once (hence the name). Its main claim to fame is that it is extremely fast, making it great for real-time webcam use. Here's how it works at a very high level:

By treating object detection as a single regression problem rather than looking through every single pixel one by one, YOLO is able to convert image pixels straight to bounding box coordinates and individual class probabilities. Simply speaking, YOLO divides the image into an S x S grid and predicts B bounding boxes, confidence for those boxes, and C class probabilities in each grid cell. Here’s an explanation of each step:

  1. Divide the image into a (typically) 13x13 or 19x19 grid of cells
  2. Each cell is responsible for predicting a number of bounding boxes (typically 5) in the image, confidence that the bounding box actually encloses an object, and the probability of the enclosed object belonging to a particular class.
  3. Each bounding box can be described using 5 descriptors: bx and by (the center of the bounding box), bw (width), bh (height), and pc (the probability that there is an object in the bounding box); the class c of the enclosed object is predicted alongside.
  4. Any bounding boxes that have too low a confidence score or enclose the same object as another bounding box that has a very high confidence score will be eliminated. This is called non-maximum suppression.
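Step 4, non-maximum suppression, can be sketched on its own. This is a simplified illustration with made-up boxes and scores, not YOLO's actual implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    # Drop low-confidence boxes, then greedily keep the best-scoring box
    # and discard any remaining box that overlaps it too much.
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two heavily overlapping boxes on one object, plus one separate box
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]

print(non_max_suppression(boxes, scores))  # keeps boxes 0 and 2
```

The duplicate box around the first object is suppressed because it overlaps the higher-confidence box too much, leaving one box per object.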

Overall, YOLO utilizes a completely unique approach to object detection that allows for insane speeds and real-time detection. Furthermore, its model is accessible to everyone, making it very interesting to play around with as well. Click this link to find out how to implement it in code!

Conclusion

Hopefully, you now know all the basics of CNNs and how they are used in object detection. You have learned about convolutions, ReLU, Pooling, and even some of the latest and greatest technologies. You should now be able to traverse the internet for image recognition research papers and tutorials, and should be able to understand all of it. Thanks for reading!
