Can you recognize the people in the picture above? If you’re a fan of fantasy fiction, you’ll instantly recognize Harry, Ron, and Hermione from J.K. Rowling’s worldwide phenomenon, the Harry Potter book series. The picture is a scene from Part 1 of Harry Potter and the Deathly Hallows, in which Harry, Ron, and Hermione are interrogating the thief Mundungus Fletcher, captured moments earlier by the house-elves Dobby and Kreacher. I know this because I’ve watched the movie and read the book so many times! But what would it take for a computer to understand this image as you or I do? Let’s think explicitly about all the pieces of knowledge that have to fall into place for it to make sense:
- You recognize it is an image of a bunch of people and understand that they are in a room.
- You recognize that there are stacks of newspapers and a long wooden table with chairs, so the location is most likely a kitchen or a common living room.
- You recognize Harry Potter from the few pixels that make up the glasses on his face. It helps that he has black hair as well.
- Similarly, you recognize Ron Weasley because of his red hair and Hermione Granger because of her long hair.
- You recognize the other man who is bald and wears old-fashioned clothing, which suggests that he is much older.
- You recognize that the other 2 creatures (Dobby and Kreacher) are not human, even if you’re not familiar with them. You’ve used their height, facial structure, and body proportions, in addition to your knowledge of what people normally look like, to figure it out.
- Harry, Ron, and Hermione are interrogating the bald man. You derive this because you see that their body posture is directed towards him, you sense that their gaze is doubtful, and you also see Hermione holding a wand in her hand (a wand is the magic weapon in the Harry Potter wizarding world).
- You understand that the bald man is feeling scared. You suspect that he is hiding something, given that his hands are covering his chest. You start to reason about the implications of the events that are about to unfold seconds after this scene, and become curious about what secrets will be revealed.
- The 2 creatures are looking towards Harry and Ron. It looks like they are trying to say something. In other words, you are reasoning about the state of mind of those creatures. Whoa, you can be a mind-reader!
I could go on, but the point here is that you’ve used a huge amount of information in that second when you look at the picture. Information about the 2D and 3D structure of the scene, visual elements like people’s identities, their actions, and even their thoughts. You think about the dynamics of the scene and make guesses about what will happen next. All these things come together for you to make sense of the scene.
It is incredible how human brains can unfold an image consisting of just arrays of R,G,B values. How about computers? How can we begin to write an algorithm that can reason about the scene like I just did above? How can we get the right data that can support the inferences we make?
The field of Computer Vision tackles this exact problem, and machine learning researchers have focused extensively on object detection problems over time. Several things make it hard to recognize objects: image segmentation, deformation, lighting, affordances, viewpoint, high dimensionality, etc. In particular, Computer Vision researchers use neural networks to solve complex object recognition problems by chaining together a lot of simple neurons. In a traditional feed-forward neural network, the images are fed into the net, and the neurons process the images and classify them by outputting the likelihood of each label (for example, True or False). Sounds simple, doesn’t it?
But what if the images are deformed, just like the digits above? A feed-forward neural net only works well when the digit is right in the middle of the image, and fails spectacularly when the digit is slightly off position. In other words, the net knows only one pattern. This is clearly not useful in the real world, as real datasets are usually dirty and unprocessed. As such, we need to improve our neural network to handle cases where the input images aren’t perfect.
Thankfully, Convolutional Neural Nets come to the rescue!
The Convolution Process
So what exactly is a Convolutional Neural Network? According to Chris Olah, Research Scientist at Google Brain:
“At its most basic, convolutional neural networks can be thought of as a kind of neural network that uses many identical copies of the same neuron. This allows the network to have lots of neurons and express computationally large models while keeping the number of actual parameters — the values describing how neurons behave — that need to be learned fairly small.”
Note the term being used there: identical copies of the same neuron. This is loosely based on how the human brain works. By reusing the same spot in memory, humans can spot and recognize patterns without having to re-learn the concept. For example, we recognize the digits above no matter where they appear in the image. A plain feed-forward neural network can’t do this. But a Convolutional Neural Net can, because it is built around translation invariance: it recognizes an object as that object even when its appearance varies in some way, such as a shift in position.
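To make the idea of "identical copies of the same neuron" concrete, here is a minimal sketch on a toy 1-D signal. The pattern, the signal length, and the function names are all made up for illustration. It compares a detector with one weight per position against a detector that slides the same small set of weights across the input.

```python
# Toy 1-D "image": a pattern that can appear at different positions.
PATTERN = [1.0, 2.0, 1.0]  # made-up pattern, for illustration only

def place(pattern, offset, length=9):
    """Return a zero signal of `length` with `pattern` written at `offset`."""
    signal = [0.0] * length
    for i, value in enumerate(pattern):
        signal[offset + i] = value
    return signal

def fixed_template_score(signal, template):
    """Dense-style detector: one weight per input position, no sharing."""
    return sum(s * t for s, t in zip(signal, template))

def shared_detector_score(signal, kernel):
    """Convolution-style detector: slide the same kernel, keep the best response."""
    width = len(kernel)
    responses = [
        sum(signal[start + i] * kernel[i] for i in range(width))
        for start in range(len(signal) - width + 1)
    ]
    return max(responses)

centered = place(PATTERN, offset=3)
shifted = place(PATTERN, offset=5)
template = centered  # the dense detector only "knows" the centered position

print(fixed_template_score(centered, template))  # 6.0
print(fixed_template_score(shifted, template))   # 1.0
print(shared_detector_score(centered, PATTERN))  # 6.0
print(shared_detector_score(shifted, PATTERN))   # 6.0
```

The position-specific detector loses most of its response once the pattern moves, while the shared-weight detector responds identically in both cases.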
In very simple terms, the convolution process works like this:
- First, the CNN uses a sliding window search to break an image into overlapping image tiles.
- Then, the CNN feeds each image tile into a small neural network, using the same weights for each tile.
- Next, the CNN saves the results from each tile into a new output array.
- After that, the CNN down-samples the output array to reduce its size.
- Last but not least, after reducing a big image down into a small array, the CNN predicts whether the image is a match or not.
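The first three steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the "small neural network" is reduced to a single weighted sum, and the names `sliding_tiles`, `tiny_network`, and `apply_to_all_tiles`, as well as the example weights, are hypothetical.

```python
def sliding_tiles(image, tile_size, stride=1):
    """Yield (out_row, out_col, tile) for every overlapping square tile."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - tile_size + 1, stride):
        for c in range(0, cols - tile_size + 1, stride):
            tile = [row[c:c + tile_size] for row in image[r:r + tile_size]]
            yield r // stride, c // stride, tile

def tiny_network(tile, weights, bias):
    """Hypothetical one-neuron 'small network': weighted sum of the tile plus bias."""
    total = bias
    for tile_row, weight_row in zip(tile, weights):
        for pixel, weight in zip(tile_row, weight_row):
            total += pixel * weight
    return total

def apply_to_all_tiles(image, weights, bias=0.0, stride=1):
    """Run the SAME weights over every tile and store results in a 2-D output array."""
    size = len(weights)
    out_rows = (len(image) - size) // stride + 1
    out_cols = (len(image[0]) - size) // stride + 1
    output = [[0.0] * out_cols for _ in range(out_rows)]
    for r, c, tile in sliding_tiles(image, size, stride):
        output[r][c] = tiny_network(tile, weights, bias)
    return output

image = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
vertical_edge = [[-1, 1], [-1, 1]]  # made-up weights for illustration
result = apply_to_all_tiles(image, vertical_edge)
for row in result:
    print(row)
```

Because the output array keeps the same arrangement as the tiles, each entry still tells you where in the original image the weights responded strongly.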
There’s a fantastic tutorial by Adam Geitgey that goes into much more detail on how the convolution process works. I definitely suggest you check it out.
CNNs were popularized mostly thanks to the efforts of Yann LeCun, now the Director of AI Research at Facebook. In the early 1990s, LeCun worked at Bell Labs, one of the most prestigious research labs in the world at that time, and built a check-recognition system to read handwritten digits. There’s a very cool video from 1993 in which LeCun shows how the system works. This system was actually an entire process for doing end-to-end image recognition. The resulting paper, which he co-authored with Leon Bottou, Patrick Haffner, and Yoshua Bengio in 1998, introduces convolutional nets as well as the full end-to-end system they built. It’s quite a long paper, so I’ll summarize it quickly here. The 1st half describes convolutional nets, shows their implementation, and mentions everything else related to the technique (which I’ll cover in the CNN Architecture section below). The 2nd half shows how to integrate convolutional nets with language models. For example, as you read a piece of English text, you can build a system on top of English grammar to extract the most likely interpretation that is part of the language. The big takeaway is that you can build a CNN system and train it to simultaneously do recognition and segmentation, and provide the right input for the language model.
CNN Architecture
Let’s discuss the architecture of a Convolutional Neural Network. There is an input image that we’re working with. We perform a series of convolution + pooling operations, followed by a number of fully connected layers. If we are performing multiclass classification, the output is softmax. There are 4 basic building blocks in every CNN: Convolution Layer, Non-Linearity (typically a ReLU activation applied to the output of the convolution layer), Pooling Layer, and Fully-Connected Layer.
1 — Convolution Layer
Here we extract features from the input image:
- We preserve the spatial relationship between pixels by learning image features from small squares of input data. The small weight matrices that slide over these squares are called filters or kernels.
- The matrix formed by sliding the filter over the image and computing the dot product at each position is called a Feature Map. The more filters we have, the more image features get extracted, and the better our network tends to become at recognizing patterns in unseen images.
- The size of our feature map is controlled by depth (the number of filters used), stride (the number of pixels by which the filter slides over the input matrix at each step), and zero-padding (padding the input matrix with 0s around the border).
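As a rough sketch of how depth, stride, and zero-padding interact, the snippet below computes feature maps for a toy 4 x 4 image. It relies on the standard output-size relation, (input size - filter size + 2 x padding) / stride + 1, rounded down. The function names and the filter values are assumptions made for illustration. Like most deep learning libraries, the sketch does not flip the kernel.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Standard output-size relation for one spatial dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def zero_pad(image, padding):
    """Surround the image with `padding` rows/columns of zeros."""
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    padded.extend([0] * width for _ in range(padding))
    return padded

def convolve(image, kernel, stride=1, padding=0):
    """Slide a square kernel over the padded image; dot product at each stop."""
    padded = zero_pad(image, padding)
    k = len(kernel)
    out_rows = feature_map_size(len(image), k, stride, padding)
    out_cols = feature_map_size(len(image[0]), k, stride, padding)
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            row.append(sum(
                padded[top + i][left + j] * kernel[i][j]
                for i in range(k) for j in range(k)
            ))
        feature_map.append(row)
    return feature_map

image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
# Depth 2: two made-up filters, so we get two feature maps.
filters = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
feature_maps = [convolve(image, f, stride=2) for f in filters]
print(feature_maps[0])  # [[2, 6], [6, 2]]
print(feature_maps[1])  # [[2, 2], [2, 2]]
print(feature_map_size(4, 2, stride=1, padding=1))  # 5
```

With stride 2 the 4 x 4 input shrinks to 2 x 2 per filter, while stride 1 with one pixel of zero-padding grows it to 5 x 5.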
2 — Non-Linearity:
For any kind of neural network to be powerful, it needs to contain non-linearity. LeNet, LeCun’s original convolutional net, uses a sigmoid non-linearity, which takes a real-valued number and squashes it into a range between 0 and 1. In particular, large negative numbers are pushed towards 0 and large positive numbers towards 1. However, the sigmoid non-linearity has a couple of major drawbacks: (i) Sigmoids saturate and kill gradients, (ii) Sigmoids have slow convergence, and (iii) Sigmoid outputs are not zero-centered.
A more powerful non-linear operation is ReLU, which stands for Rectified Linear Unit. It is an element-wise operation that replaces all negative values in the feature map with 0. We pass the result from the convolution layer through a ReLU activation function. Almost all CNN-based architectures developed later use ReLU, as in the case of AlexNet, which I discuss below.
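Here is a small sketch of the two non-linearities side by side, including their gradients, to show the saturation problem mentioned above. The function names are my own, and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the sigmoid; it shrinks towards 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    """Replace negative values with 0, keep positive values unchanged."""
    return max(0.0, x)

def relu_gradient(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def relu_feature_map(feature_map):
    """Apply ReLU element-wise to a 2-D feature map."""
    return [[relu(v) for v in row] for row in feature_map]

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_gradient(x), relu(x), relu_gradient(x))

rectified = relu_feature_map([[-3.0, 2.0], [0.5, -1.0]])
print(rectified)  # [[0.0, 2.0], [0.5, 0.0]]
```

At x = 10 the sigmoid gradient is already tiny, which is the "saturate and kill gradients" behavior, whereas the ReLU gradient stays at 1 for any positive input.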
3 — Pooling Layer
After this, we perform a pooling operation to reduce the dimensionality of each feature map. This enables us to reduce the number of parameters and computations in the network, therefore controlling overfitting.
A CNN commonly uses max-pooling, which defines a spatial neighborhood (for example, a 2 x 2 window) and takes the largest element from the rectified feature map within that window. After the pooling layer, our network becomes more tolerant of small transformations, distortions, and translations in the input image.
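A minimal max-pooling sketch follows, assuming a 2 x 2 window with stride 2. The feature map values are invented. The second input nudges each strong activation by one pixel to show the tolerance to small shifts.

```python
def max_pool(feature_map, window=2, stride=2):
    """Keep the largest value in each window x window neighborhood."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled_row = []
        for c in range(0, cols - window + 1, stride):
            pooled_row.append(max(
                feature_map[r + i][c + j]
                for i in range(window) for j in range(window)
            ))
        pooled.append(pooled_row)
    return pooled

original = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# Same activations, each moved by one pixel within its 2 x 2 window.
nudged = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
]
print(max_pool(original))  # [[9, 0], [0, 5]]
print(max_pool(nudged))    # [[9, 0], [0, 5]]
```

The pooled output is identical here only because each activation stays inside the same window; a shift that crosses a window boundary would change the result, so the tolerance is to small shifts rather than arbitrary ones.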
4 — Fully-Connected Layer
After the convolution and pooling layers, we add a couple of fully-connected layers to wrap up the CNN architecture. The output from the convolution and pooling layers represents high-level features of the input image. The FC layers use these features for classifying the input image into various classes based on the training dataset. Apart from classification, adding FC layers also helps to learn non-linear combinations of these features.
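As an illustration of this last stage, the sketch below flattens pooled feature maps, passes them through a single fully-connected layer, and turns the scores into class probabilities with softmax. The weights, biases, and the three-class setup are invented for the example; in a real network they would be learned from the training data.

```python
import math

def flatten(feature_maps):
    """Turn a list of 2-D feature maps into one flat feature vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def fully_connected(inputs, weights, biases):
    """One output per neuron: weighted sum of ALL inputs plus a bias."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

pooled_maps = [[[9, 0], [0, 5]]]   # e.g. the output of a pooling layer
features = flatten(pooled_maps)    # [9, 0, 0, 5]

# Made-up weights for 3 hypothetical classes (not trained values).
weights = [
    [0.1, 0.0, 0.0, 0.2],
    [-0.1, 0.3, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
biases = [0.0, 0.0, 0.0]

scores = fully_connected(features, weights, biases)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```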
From a bigger picture, a CNN architecture accomplishes 2 major tasks: feature extraction (convolution + pooling layers) and classification (fully-connected layers). In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize.
Since LeNet, CNNs have been remodeled in a variety of forms for different contexts of natural language processing, computer vision, and speech recognition. I will cover some notable industry applications later on in this post, but first let’s discuss CNNs’ usage in computer vision a bit. Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing handwritten digits. There are a hundred times as many classes, a hundred times as many pixels, two-dimensional images of three-dimensional scenes, cluttered scenes requiring segmentation, and multiple objects in each image. How did CNNs evolve to cope with these challenges?
In 2012, the Stanford University Computer Vision group helped organize the ILSVRC-2012 competition (ImageNet Large Scale Visual Recognition Challenge), one of the largest challenges in Computer Vision. It is based on ImageNet, a dataset with approximately 1.2 million high-resolution training images. Test images are presented with no initial annotation, and algorithms have to produce labelings specifying what objects are present in the images. The challenge has been held annually since 2010, and teams from leading universities, startups, and big companies have competed to claim state-of-the-art performance on the dataset.
The winner of the 2012 competition, Alex Krizhevsky (NIPS 2012), built a very deep convolutional neural net, known as AlexNet, of the type pioneered by Yann LeCun. Compared to LeNet, AlexNet is deeper, has more filters per layer, and is also equipped with stacked convolutional layers. Looking at AlexNet’s architecture below, you can identify the main differences between it and LeNet:
- The number of processing and trainable layers: AlexNet includes 5 convolutional layers, 3 max-pooling layers, and 3 fully-connected layers. LeNet only has 2 convolutional layers, 2 pooling (subsampling) layers, and 3 fully-connected layers.
- ReLU Non-Linearity: AlexNet uses ReLU whereas LeNet uses a logistic sigmoid. ReLU helps decrease training time for AlexNet, since networks with ReLU units train several times faster than equivalents with saturating units such as the sigmoid.
- The use of dropout: AlexNet uses dropout layers to combat the problem of overfitting to the training data. LeNet does not use dropout.
- Diverse dataset: While LeNet was only trained to recognize handwritten digits, AlexNet was trained to work with the ImageNet data, which are much richer in terms of dimensions, colors, angles, semantics etc.
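Since dropout is named above but not shown, here is a minimal sketch of the idea: during training, each activation is randomly zeroed with some probability, so the network cannot rely on any single unit. This sketch uses the "inverted" variant that rescales the surviving activations at training time, which is a common formulation and an assumption on my part, not necessarily the exact procedure used in AlexNet. The function name and the 0.5 drop probability are illustrative.

```python
import random

def dropout(activations, drop_probability, rng, training=True):
    """Randomly zero activations during training (inverted-dropout variant)."""
    if not training or drop_probability == 0.0:
        return list(activations)  # at test time, pass everything through
    keep_probability = 1.0 - drop_probability
    return [
        # Survivors are scaled up so the expected activation stays the same.
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]

rng = random.Random(0)  # seeded only to make the sketch repeatable
activations = [1.0] * 10
dropped = dropout(activations, 0.5, rng)
print(dropped)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(dropout(activations, 0.5, rng, training=False))  # unchanged at test time
```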
AlexNet became the pioneering “deep” CNN that won the competition with 84.6% top-5 accuracy, while the 2nd-place model, which relied on traditional hand-engineered features instead of deep architectures, only achieved 73.8%.
Since then, this competition has become the benchmark arena where state-of-the-art computer vision models are introduced. In particular, there have been many competing models using deep Convolutional Neural Nets as their backbone architecture. The most popular ones that achieved excellent results on the ImageNet benchmark include ZFNet (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016). These architectures got deeper and deeper year by year.
CNN architectures continue to feature prominently in Computer Vision, with architectural advancements providing improvements in speed, accuracy and training for many of the applications and tasks mentioned below:
- In Object Detection, CNNs are the major architecture behind the most popular models such as R-CNN, Fast R-CNN, and Faster R-CNN. In these models, the net hypothesizes object regions and then classifies them, using a CNN on top of each of these region proposals. This is now the predominant pipeline for many object detection models, deployed in autonomous vehicles, smart video surveillance, facial detection, etc.
- In Object Tracking, CNNs have been used extensively in visual tracking applications. For example, given a CNN pre-trained offline on a large-scale image repository, an online visual tracking algorithm developed by a team at Pohang Institute in Korea can learn a discriminative saliency map to visualize a target spatially and locally. Another instance is DeepTrack, a solution that automatically relearns the most useful feature representations during the tracking process in order to adapt accurately to appearance changes, pose, and scale variations while preventing drift and tracking failures.
- In Object Recognition, a team from INRIA and MSR in France developed a weakly supervised CNN for object classification that relies only on image-level labels, yet can learn from cluttered scenes containing multiple objects. Another instance is FV-CNN, a texture descriptor developed by researchers at Oxford to solve the clutter problem in texture recognition.
- In Semantic Segmentation, Deep Parsing Network is a CNN-based net developed by a group of researchers from Hong Kong to incorporate rich information into the image segmentation process. UC Berkeley’s researchers, on the other hand, built fully-convolutional networks and exceeded the state of the art in semantic segmentation. More recently, SegNet was introduced as a deep fully convolutional neural network that is extremely efficient in terms of memory and computational time for semantic pixel-wise segmentation.
- In Video and Image Captioning, the most important invention has been UC Berkeley’s Long-Term Recurrent Convolutional Nets, which incorporate both CNNs and RNNs (Recurrent Neural Nets) to tackle large-scale visual understanding tasks including activity recognition, image captioning, and video descriptions. It has been deployed heavily by the Data Science team at YouTube to make sense of the huge number of videos uploaded to the platform daily.
CNNs have also found many novel applications outside of Vision, notably Natural Language Processing and Speech Recognition:
- Natural Language Processing: In the domain of Machine Translation, the AI Research team at Facebook used CNNs to achieve state-of-the-art accuracy at 9 times the speed of recurrent neural systems. In the domain of Sentence Classification, Yoon Kim at NYU experimented with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks and improved upon state-of-the-art on 4 out of 7 tasks. In the domain of Question Answering, a few researchers from Waterloo and Maryland explored the effectiveness of CNNs for Answer Selection in end-to-end question answering. They found answers from CNNs are detectably better than previous algorithms.
- Speech Recognition: CNNs are very effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition. Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models / Gaussian Mixture Models have achieved state-of-the-art results in various benchmarks. Researchers at University of Montreal have proposed an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC (Connectionist Temporal Classification), that is competitive with existing baseline systems. The team at Microsoft used CNNs to reduce error rate in speech recognition performance, in particular by building a CNN architecture with local connectivity, weight sharing, and pooling. Their model is capable of being invariant to speaker and environment variations.
Let’s revisit our example of the Harry Potter image and see how I can use a CNN to recognize its features:
- First I pass a sliding window over the entire original image and save each result as a separate, tiny picture tile. By doing this, I turn the original image into multiple equally-sized tiny image tiles.
- Then I feed each image tile into the convolution layer and keep the same neural network weights for every single tile in the same original image.
- Next, I save the results from each tile into a new array in the same arrangement as the original image.
- Then, I use max-pooling to reduce the size of the array. For instance, I can look at each 2 x 2 square of the array and keep the biggest number.
- After being downsampled, the small array is then fed into the fully-connected layer to make a prediction, say, whether it is an image of Harry, Ron, Hermione, the elves, the newspaper, the chair, etc.
- After training, I can confidently make predictions on my image!
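Putting the walkthrough together, a forward pass through such a network might look like the sketch below. Everything here is a stand-in: the 6 x 6 "image" is random numbers rather than real pixels, the kernels and fully-connected weights are random rather than trained, and the label list is simply taken from the example above. It only shows how the stages connect, not how a trained model would behave.

```python
import math
import random

def convolve(image, kernel):
    """Slide one shared kernel over every tile (stride 1, no padding)."""
    k = len(kernel)
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(len(image[0]) - k + 1)
        ]
        for r in range(len(image) - k + 1)
    ]

def relu(feature_map):
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Look at each 2 x 2 square and keep the biggest number."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(image, kernels, fc_weights, labels):
    """Convolution -> ReLU -> max-pooling -> fully-connected -> softmax."""
    features = []
    for kernel in kernels:
        pooled = max_pool_2x2(relu(convolve(image, kernel)))
        features.extend(v for row in pooled for v in row)
    scores = [sum(f * w for f, w in zip(features, ws)) for ws in fc_weights]
    return dict(zip(labels, softmax(scores)))

rng = random.Random(42)  # seeded so the untrained sketch is repeatable
labels = ["Harry", "Ron", "Hermione", "elf", "newspaper", "chair"]
image = [[rng.random() for _ in range(6)] for _ in range(6)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(2)]
# 6 x 6 image with a 3 x 3 kernel gives a 4 x 4 map, pooled to 2 x 2,
# so each of the 2 kernels contributes 4 features (8 in total).
fc_weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in labels]

prediction = predict(image, kernels, fc_weights, labels)
for label, probability in prediction.items():
    print(label, round(probability, 3))
```

With random weights the probabilities are meaningless; training is what adjusts the kernels and fully-connected weights so that the right label gets the highest probability.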
As you can see from this article, Convolutional Neural Networks have played an important part in shaping the history of deep learning. Heavily inspired by the study of the brain, CNNs have performed extremely well in commercial applications of deep learning (vision, language, speech) compared to most other neural networks. They have been used by many machine learning practitioners to win academic and industry competitions. Research into CNN architectures advances at a rapid pace: using fewer weights/parameters, automatically learning and generalizing features from the input objects, being invariant to object position and distortion in image/text/speech, and more. As one of the most popular neural network techniques, CNNs are a must-know for anyone who wants to enter the deep learning arena.
You can find my own code on GitHub, and more of my writing and projects at https://jameskle.com/.