What Is Image Recognition?

Chris Kuo/Dr. Dataman
Published in Dataman in AI
8 min read · Nov 8, 2018

The first question you may have is: what is the difference between computer vision and image recognition? Indeed, computer vision has been vigorously developed by Google, Amazon, and many AI developers, and the two terms “computer vision” and “image recognition” are often used interchangeably. Computer vision (CV) lets a computer imitate human vision and take action. For example, CV can be designed to sense a running child on the road and produce a warning signal to the driver. In contrast, image recognition is the pixel and pattern analysis of an image to recognize the image as a particular object. Computer vision means the system can “do something” with the recognized images. Because this post describes the machine learning techniques for image recognition, I will stick with the term “image recognition”. In a related article, I give a gentle introduction to image data and explain why convolutional autoencoders are the preferred method for dealing with image data.

I think it is helpful to mention the three broad data categories: (1) multivariate data (in contrast with serial data), (2) serial data (including text and voice stream data), and (3) image data. Deep learning has three basic variations to address each data category: (1) the standard feedforward neural network, (2) RNN/LSTM, and (3) the convolutional neural network (CNN). For readers looking for a tutorial on each type, I recommend “Explaining Deep Learning in a Regression-Friendly Way” for (1), “A Technical Guide for RNN/LSTM/GRU on Stock Price Prediction” for (2), and “Deep Learning with PyTorch Is Not Torturing”, “What Is Image Recognition?”, “Anomaly Detection with Autoencoders Made Easy”, and “Convolutional Autoencoders for Image Noise Reduction” for (3). You can bookmark the summary article “Dataman Learning Paths — Build Your Skills, Drive Your Career”.

What is image recognition?

Just as the phrase “What you see is what you get” says, human brains make vision easy. It doesn’t take any effort for humans to tell apart a dog, a cat, or a flying saucer. But this process is quite hard for a computer to imitate: the task only seems easy because God designed our brains incredibly well for recognizing images. A common example of image recognition is optical character recognition (OCR). A scanner can identify the characters in an image and convert the text in the image into a text file. With the same process, OCR can be applied to recognize the text on a license plate in an image.

How does image recognition work?

How do we train a computer to tell one image apart from another? The process of building an image recognition model is no different from the general machine learning modeling process. I list the modeling process for image recognition in Steps 1 through 4.

Modeling Step 1: Extract pixel features from an image

Figure (A)

First, a great number of characteristics, called features, are extracted from the image. An image is actually made of “pixels”, as shown in Figure (A). Each pixel is represented by a number or a set of numbers, and the range of these numbers is called the color depth (or bit depth). In other words, the color depth indicates the maximum number of potential colors that can be used in an image. In an 8-bit greyscale (black and white) image, each pixel has one value that ranges from 0 to 255. Most images today use 24-bit color or higher. In an RGB color image, the color of a pixel is a combination of red, green, and blue, and each of the three values ranges from 0 to 255. This RGB color generator shows how any color can be generated from RGB. So a pixel containing the set of three values RGB(102, 255, 102) refers to the color #66ff66. An image 800 pixels wide and 600 pixels high has 800 × 600 = 480,000 pixels, or 0.48 megapixels (a “megapixel” is 1 million pixels). An image with a resolution of 1024×768 is a grid with 1,024 columns and 768 rows, and therefore contains 1,024 × 768 = 786,432 pixels, or about 0.79 megapixels.
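
As a small illustration of this step, here is a minimal sketch (assuming Pillow and NumPy are installed; the file name dog.jpg is only a placeholder) that reads an image and inspects its pixel values:

```python
from PIL import Image
import numpy as np

# Load an image; "dog.jpg" is only a placeholder file name.
img = Image.open("dog.jpg").convert("RGB")
pixels = np.array(img)          # shape: (height, width, 3)

print(pixels.shape)             # e.g. (600, 800, 3) for an 800x600 image
print(pixels[0, 0])             # one pixel, e.g. [102 255 102], i.e. color #66ff66
print(pixels.size)              # 800 * 600 * 3 = 1,440,000 numbers in total

# Flatten the pixel grid into one long feature vector for a model.
features = pixels.reshape(-1)
```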

Modeling Step 2: Prepare labeled images to train the model

Figure (B)

Once each image is converted to thousands of features, we can use the known labels of the images to train a model. Figure (B) shows many labeled images that belong to different categories such as “dog” or “fish”. The more images we can use for each category, the better a model can be trained to tell whether an image is a dog or a fish image. Here we already know the category that each image belongs to, and we use these labeled images to train the model. This is called supervised machine learning.
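
One common way to prepare such labeled images in code is to put each category in its own folder. Below is a minimal Keras/TensorFlow sketch, assuming a hypothetical folder layout labeled_images/dog/ and labeled_images/fish/:

```python
import tensorflow as tf

# Each subfolder name ("dog", "fish", ...) becomes the label of the images inside it.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "labeled_images/",       # hypothetical folder with one subfolder per category
    image_size=(128, 128),   # resize every image to the same pixel grid
    batch_size=32,
)

print(train_ds.class_names)  # e.g. ['dog', 'fish']
```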

Modeling Step 3: Train the model to be able to categorize images

Figure (C)

Figure (C) demonstrates how a model is trained with the pre-labeled images. The huge networks in the middle can be considered a giant filter. The images in their extracted forms enter the input side and the labels are on the output side. The purpose here is to train the networks such that an image with its features coming from the input will match the label on the right.
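
A minimal training sketch that continues the hypothetical Step 2 example (where train_ds was created). For simplicity it uses a small feedforward network as the “filter” in the middle; the convolutional layers this article focuses on are introduced later:

```python
import tensorflow as tf

# A deliberately small network; the "giant filter" in Figure (C) would be much larger.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),  # scale pixels to 0..1
    tf.keras.layers.Flatten(),                                        # pixel grid -> single list
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),                   # one output per category
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # labels are integer category ids
              metrics=["accuracy"])

# Repeatedly feed the labeled images through the network so the weights
# adjust until the predicted labels match the true labels.
model.fit(train_ds, epochs=10)
```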

Modeling Step 4: Recognize (or predict) a new image to be one of the categories

Once a model is trained, it can be used to recognize (or predict) a new, unknown image. Figure (D) shows a new image being recognized as a dog image. Note that the new image also goes through the same pixel feature extraction process.
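
Continuing the same hypothetical sketch, recognizing a new image is a single predict call; the new image is resized and converted to pixels exactly as in Step 1 (new_photo.jpg is a placeholder name):

```python
from PIL import Image
import numpy as np

# The new, unlabeled image goes through the same pixel extraction and resizing.
new_img = Image.open("new_photo.jpg").convert("RGB").resize((128, 128))
x = np.array(new_img)[np.newaxis, ...]        # add a batch dimension: (1, 128, 128, 3)

probs = model.predict(x)[0]                   # one probability per category
print(train_ds.class_names[int(np.argmax(probs))])   # e.g. "dog"
```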

Convolutional Neural Networks — the algorithm for image recognition

The networks in Figures (C) and (D) imply that the popular models for image recognition are neural network models. Convolutional Neural Networks (CNNs or ConvNets) have been widely applied in image classification, object detection, and image recognition.

A gentle explanation of Convolutional Neural Networks

I will use the MNIST handwritten digit images to explain CNNs. The MNIST images are free-form black and white images of the digits 0 to 9. It is easier to explain the concept with a black and white image because each pixel has only one value (from 0 to 255); note that a color image has three values in each pixel.
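
Keras ships MNIST as a built-in dataset, so the images used in this explanation can be loaded directly:

```python
from tensorflow import keras

# 60,000 training and 10,000 test images of handwritten digits, 28x28 pixels each.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

print(x_train.shape)   # (60000, 28, 28): one greyscale value (0-255) per pixel
print(y_train[:5])     # the digit labels, e.g. [5 0 4 1 9]
```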

The network layers of a CNN are different from those of a typical neural network. There are four types of layers: the convolution layer, the ReLU layer, the pooling layer, and the fully connected layer, as shown in Figure (E). What does each of the four types do? Let me explain.

1. Convolution layer

Figure (F)

The first step a CNN performs is to create many small pieces called features, such as the 2x2 boxes. To visualize the process, I use three colors to represent the three features in Figure (F). Each feature characterizes some shape of the original image.

Each feature then scans through the original image. Wherever there is a perfect match, the score in that box is high; where the match is weak or absent, the score is low or zero. This process of producing the scores is called filtering.

Figure (G)

Figure (G) shows the three features. Each feature produces a filtered image with high and low scores as it scans through the original image. For example, the red box found four areas in the original image that show a perfect match with the feature, so the scores are high for those four areas. The pink boxes are areas that match to some extent. The act of trying every possible match by scanning through the original image is called convolution. The filtered images are stacked together to become the convolution layer.
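
A minimal NumPy sketch of this filtering/convolution step, using a hypothetical 2x2 feature scanned over a tiny black and white image (the values are made up for illustration):

```python
import numpy as np

# A tiny 4x4 "image" with a bright vertical stripe in the middle.
image = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
], dtype=float)

# A hypothetical 2x2 feature that looks for a bright right-hand column.
feature = np.array([
    [0, 1],
    [0, 1],
], dtype=float)

# Slide the feature over every 2x2 window and record a match score per position.
h, w = image.shape
fh, fw = feature.shape
scores = np.zeros((h - fh + 1, w - fw + 1))
for i in range(h - fh + 1):
    for j in range(w - fw + 1):
        window = image[i:i + fh, j:j + fw]
        scores[i, j] = np.sum(window * feature)   # high score = good match

print(scores)   # the "filtered image" produced by this one feature
```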

2. ReLU layer

The Rectified Linear Unit (ReLU) step is the same as in typical neural networks. It rectifies any negative value to zero, which keeps the values well behaved and adds the non-linearity the network needs.
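
In code, the ReLU step is a one-liner applied to each filtered image:

```python
import numpy as np

filtered = np.array([[-2.0, 1.5],
                     [ 0.5, -0.3]])

rectified = np.maximum(filtered, 0)   # every negative score becomes 0
print(rectified)                      # [[0.  1.5]
                                      #  [0.5 0. ]]
```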

3. Max Pooling layer

Figure (H)

Pooling shrinks the image size. In Figure (H) a 2x2 window scans through each of the filtered images and assigns the max value of that 2x2 window to a 1x1 box in a new image. As illustrated in the Figure, the maximum value in the first 2x2 window is a high score (represented by red), so the high score is assigned to the 1x1 box. The 2x2 box moves to the second window where there is a high score (red) and a low score (pink), so a high score is assigned to the 1x1 box. After pooling, a new stack of smaller filtered images is produced.
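
The same pooling step as a small NumPy sketch, shrinking a made-up 4x4 filtered image with a 2x2 window:

```python
import numpy as np

filtered = np.array([
    [0.9, 0.2, 0.1, 0.8],
    [0.3, 0.1, 0.7, 0.2],
    [0.2, 0.6, 0.9, 0.1],
    [0.1, 0.4, 0.3, 0.2],
])

# Take the maximum of each non-overlapping 2x2 window: (4, 4) -> (2, 2).
pooled = filtered.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[0.9 0.8]
                #  [0.6 0.9]]
```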

4. Fully connected layer (the final layer)

Now we flatten the smaller filtered images and stack them into a single list, as shown in Figure (I). The values in this single list jointly predict a probability for each of the final values 0, 1, 2, …, and 9. This part is the same as the output layer in a typical neural network. In our example, “2” receives the highest total score from all the nodes of the single list, so the CNN recognizes the original handwriting image as a “2”.
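
Putting the four layer types together, a minimal Keras CNN for the MNIST digits might look like the sketch below (the layer sizes are illustrative choices, not the ones in the figures; x_train and y_train come from the MNIST loading sketch above):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # one greyscale value per pixel
    layers.Conv2D(8, kernel_size=3, activation="relu"),   # 1. convolution + 2. ReLU
    layers.MaxPooling2D(pool_size=2),                     # 3. max pooling
    layers.Flatten(),                                     # stack the filtered images into one list
    layers.Dense(10, activation="softmax"),               # 4. fully connected: one score per digit
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Add a channel dimension and scale the pixel values to 0..1 before training.
model.fit(x_train[..., None] / 255.0, y_train, epochs=3, batch_size=128)
```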

What is the difference between CNNs and the typical NNs?

A typical neural network flattens the original image into a single list and uses it as the input layer, so the information between neighboring pixels may not be retained. In contrast, a CNN constructs convolution layers that retain the information between neighboring pixels.

Are there any pre-trained CNNs that I can use?

Yes. If you are interested in learning the code, Keras has several pre-trained CNNs including Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet, and MobileNetV2. It is also worth mentioning ImageNet, a large image database that you can contribute to or download for research purposes.
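
A short sketch of using one of those pre-trained models, VGG16 with ImageNet weights, to recognize a hypothetical image file (elephant.jpg is a placeholder name):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")   # downloads the pre-trained ImageNet weights

img = image.load_img("elephant.jpg", target_size=(224, 224))   # VGG16 expects 224x224 input
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])   # top-3 (class id, class name, probability)
```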

Business Applications

Image recognition has wide applications. In the next module, I will show you how image recognition can be applied to claims handling in insurance.
