What Is Image Recognition?
The first question you may have is: what is the difference between computer vision and image recognition? Computer vision has been vigorously developed by Google, Amazon and many other AI developers, and the two terms “computer vision” and “image recognition” are often used interchangeably. Computer vision (CV) lets a computer imitate human vision and then take actions. For example, a CV system can be designed to sense a child running onto the road and produce a warning signal to the driver. In contrast, image recognition is the pixel and pattern analysis of an image to recognize the image as a particular object; computer vision then “does something” with the recognized images. Because this post describes the machine learning techniques for image recognition, I will use the term “image recognition”.
What is image recognition?
Just as the phrase “what you see is what you get” suggests, human brains make vision look easy. It takes no effort for humans to tell apart a dog, a cat or a flying saucer. But this process is quite hard for a computer to imitate: the task only seems easy because our brains are incredibly good at recognizing images. A common example of image recognition is optical character recognition (OCR). A scanner can identify the characters in an image and convert the text in the image to a text file. By the same process, OCR can be applied to recognize the text of a license plate in an image.
How does image recognition work?
How do we train a computer to tell one image apart from another? The process of building an image recognition model is no different from that of any machine learning modeling. I list the modeling process for image recognition in Steps 1 through 4.
Modeling Step 1: Extract pixel features from an image
First, a great number of characteristics, called features, are extracted from the image. An image is actually made of “pixels”, as shown in Figure (A). Each pixel is represented by a number or a set of numbers, and the range of these numbers is called the color depth (or bit depth). In other words, the color depth indicates the maximum number of potential colors that can be used in an image. In an 8-bit greyscale (black and white) image, each pixel has a single value that ranges from 0 to 255. Most images today use 24-bit color or higher. In an RGB color image, the color of a pixel is the combination of red, green and blue, each ranging from 0 to 255. This RGB color generator shows how any color can be generated by RGB. So a pixel contains a set of three values; for example, rgb(102, 255, 102) corresponds to the hex color #66ff66. An image 800 pixels wide and 600 pixels high has 800 × 600 = 480,000 pixels = 0.48 megapixels (a “megapixel” is 1 million pixels). An image with a resolution of 1024×768 is a grid with 1,024 columns and 768 rows, and therefore contains 1,024 × 768 = 786,432 pixels, or about 0.79 megapixels.
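To make the pixel representation concrete, here is a minimal sketch using NumPy arrays to stand in for image data (the pixel values are invented for illustration):

```python
import numpy as np

# An 8-bit greyscale image: one value per pixel, 0 (black) to 255 (white).
grey = np.array([[0, 128], [255, 64]], dtype=np.uint8)  # a tiny 2x2 image

# An RGB image: three values per pixel. rgb(102, 255, 102) is hex #66ff66.
rgb = np.zeros((600, 800, 3), dtype=np.uint8)  # 600 rows, 800 columns
rgb[:, :] = [102, 255, 102]                    # fill every pixel with #66ff66

pixels = rgb.shape[0] * rgb.shape[1]
print(pixels)                              # 480000, i.e. 0.48 megapixels
print("#%02x%02x%02x" % tuple(rgb[0, 0]))  # #66ff66
```

Real code would load the array from a file (e.g. with Pillow), but the shape and value ranges are exactly as described above.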
Modeling Step 2: Prepare labeled images to train the model
Once each image is converted into thousands of features, we can use the known labels of the images to train a model. Figure (B) shows many labeled images that belong to different categories such as “dog” or “fish”. The more images we have for each category, the better the model can be trained to tell whether an image is a dog or a fish image. Here we already know the category each image belongs to, and we use those images to train the model; this is called supervised machine learning.
Modeling Step 3: Train the model to be able to categorize images
Figure (C) demonstrates how a model is trained with the pre-labeled images. The huge network in the middle can be considered a giant filter. The images, in their extracted feature form, enter on the input side and the labels sit on the output side. The goal is to train the network so that an image, with its features coming in from the input, will match the label on the right.
Modeling Step 4: Recognize (or predict) a new image to be one of the categories
Once a model is trained, it can be used to recognize (or predict) an unknown image. Figure (D) shows a new image being recognized as a dog image. Notice that the new image will also go through the pixel feature extraction process.
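The four steps can be sketched end-to-end with a deliberately simple model — a 1-nearest-neighbour classifier on raw pixel features rather than a neural network — just to make the train-then-predict flow concrete (all images and labels below are toy data):

```python
import numpy as np

# Step 1: extract pixel features -- here we simply flatten each image.
def features(img):
    return img.reshape(-1).astype(float)

# Step 2: labeled training images (tiny 4x4 toy "images").
dog  = np.full((4, 4), 200)   # a bright image stands in for "dog"
fish = np.full((4, 4), 30)    # a dark image stands in for "fish"
train_X = [features(dog), features(fish)]
train_y = ["dog", "fish"]

# Step 3: "training" a 1-nearest-neighbour model is just storing examples.

# Step 4: predict a new image by finding its closest training example.
def predict(img):
    x = features(img)
    dists = [np.linalg.norm(x - t) for t in train_X]
    return train_y[int(np.argmin(dists))]

new_img = np.full((4, 4), 180)   # a new, unseen bright image
print(predict(new_img))          # dog
```

A real image recognition model replaces steps 3 and 4 with a trained neural network, but the supervised workflow — features in, labels out — is the same.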
Convolutional Neural Networks — the algorithm for image recognition
The networks in Figures (C) and (D) hint that the popular models are neural networks. Convolutional Neural Networks (CNNs or ConvNets) have been widely applied in image classification, object detection and image recognition.
A gentle explanation of Convolutional Neural Networks
I will use the MNIST handwritten digit images to explain CNNs. The MNIST images are free-form black and white images of the digits 0 to 9. It is easier to explain the concept with black and white images because each pixel has only one value (from 0 to 255); recall that a color image has three values in each pixel.
The network layers of a CNN are different from those of typical neural networks. There are four types of layers: the convolution, the ReLU, the pooling, and the fully connected layers, as shown in Figure (E). What does each of the four types do? Let me explain.
1. Convolution layer
The first step a CNN takes is to create many small pieces called features, like the 2x2 boxes. To visualize the process, I use three colors to represent the three features in Figure (F). Each feature characterizes some shape found in the original image.
Let each feature scan through the original image. If there is a perfect match, there is a high score in that box. If there is a low match or no match, the score is low or zero. This process of producing the scores is called filtering.
Figure (G) shows the three features. Each feature produces a filtered image with high scores and low scores when scanning through the original image. For example, the red box found four areas in the original image that show a perfect match with the feature, so scores are high for those four areas. The pink boxes are the areas that match to some extent. The act of trying every possible match by scanning through the original image is called convolution. The filtered images are stacked together to become the convolution layer.
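The scan-and-score process can be sketched with NumPy. The 2x2 feature and the image values below are invented for illustration; in a real CNN the feature values are learned during training:

```python
import numpy as np

# A tiny black-and-white image (1 = ink, 0 = background; values invented).
image = np.array([
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])

# One 2x2 "feature": a diagonal pattern.
feature = np.array([
    [1, 0],
    [0, 1],
])

# Slide the feature over every 2x2 window and record a match score.
h, w = image.shape
filtered = np.zeros((h - 1, w - 1))
for i in range(h - 1):
    for j in range(w - 1):
        window = image[i:i+2, j:j+2]
        filtered[i, j] = np.sum(window * feature)  # high score = good match

print(filtered)  # the filtered image: one score per window position
```

Running one such scan per feature and stacking the resulting filtered images gives the convolution layer.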
2. ReLU layer
The Rectified Linear Unit (ReLU) step is the same as in typical neural networks: it rectifies any negative value to zero. This keeps the values well-behaved and introduces the non-linearity the network needs.
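The ReLU operation is a one-liner (the score values are made up for illustration):

```python
import numpy as np

def relu(x):
    # Rectify: negative values become zero, positive values pass through.
    return np.maximum(x, 0)

scores = np.array([-3.0, 0.5, -0.2, 2.0])
print(relu(scores).tolist())  # [0.0, 0.5, 0.0, 2.0]
```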
3. Max Pooling layer
Pooling shrinks the image size. In Figure (H), a 2x2 window scans through each of the filtered images and assigns the maximum value within that 2x2 window to a 1x1 box in a new image. As illustrated in the Figure, the maximum value in the first 2x2 window is a high score (represented by red), so that high score is assigned to the 1x1 box. The 2x2 window then moves to the second position, where there is a high score (red) and a low score (pink), so the high score is assigned to the 1x1 box. After pooling, a new stack of smaller filtered images is produced.
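A minimal max-pooling sketch in NumPy (the filtered-image scores are invented; a stride of 2 halves each dimension, as in the figure):

```python
import numpy as np

def max_pool(img, size=2):
    # Slide a size x size window with stride = size, keeping the max value.
    h, w = img.shape
    pooled = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            pooled[i // size, j // size] = img[i:i+size, j:j+size].max()
    return pooled

filtered = np.array([
    [0.9, 0.1, 0.2, 0.8],
    [0.3, 0.2, 0.1, 0.4],
    [0.1, 0.7, 0.9, 0.0],
    [0.2, 0.1, 0.3, 0.2],
])
print(max_pool(filtered))  # the 4x4 image shrinks to 2x2, keeping high scores
```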
4. Fully connected layer (the final layer)
Now we split the smaller filtered images and stack them into a single list, as shown in Figure (I). Each value in the single list contributes to the predicted probability for each of the final values 0, 1, …, 9. This part is the same as the output layer in typical neural networks. In our example, “2” receives the highest total score from all the nodes of the single list, so the CNN recognizes the original handwritten image as “2”.
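The flatten-then-score step can be sketched as a weighted sum per digit. The pooled scores and the weights below are invented for illustration (a real network learns the weights during training, and uses all ten digits rather than three):

```python
import numpy as np

# Flatten the stack of pooled, filtered images into one long list of values.
stack = np.array([[0.9, 0.8], [0.7, 0.9]])   # invented pooled scores
flat = stack.reshape(-1)                     # [0.9, 0.8, 0.7, 0.9]

# Fully connected layer: each output digit gets a weighted sum of the list.
weights = np.array([
    [0.1, 0.0, 0.2, 0.1],   # weights feeding the score for digit "0"
    [0.0, 0.1, 0.0, 0.2],   # weights feeding the score for digit "1"
    [0.9, 0.8, 0.7, 0.9],   # weights feeding the score for digit "2"
])
scores = weights @ flat
print(int(np.argmax(scores)))  # digit "2" gets the highest total score
```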
What is the difference between CNNs and the typical NNs?
Typical neural networks stack the original image into a single list and feed it to the input layer, so the information between neighboring pixels may not be retained. In contrast, CNNs construct the convolution layer, which retains the information between neighboring pixels.
Are there any pre-trained CNNs code that I can use?
Yes. If you are interested in learning the code, Keras offers several pre-trained CNNs, including Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet and MobileNetV2. It is also worth mentioning the large image database ImageNet, to which you can contribute or from which you can download images for research purposes.
Image recognition has wide applications. In the next Module, I will show you how image recognition can be applied to claim handling in insurance.