Deep Learning — Convolutional Neural Network Basics 101

Judy Shih
7 min read · Feb 27, 2018


Deep Learning architectures include many types of neural networks. The Convolutional Neural Network (CNN) is one of them, specialized for image recognition.

To talk about convolutional neural networks, we first need to understand what a typical neural network does. Fundamentally, a neural network takes certain inputs and makes classifications, either binary or multiclass. An example of binary classification: someone else picks up your phone and tries to unlock it with facial recognition, and the phone will most likely tell them it is not your face. An example of multiclass classification: your phone camera predicts whether you are making a happy, sad, or angry face.

So why are neural networks so good at making complex classifications? Let's look at some very simple examples. Let's predict whether you are going to pass this course.

Linear Graph

Are you going to pass this course?

X1 = how hard you study.

X2 = how hard you play.

We have two input factors, X1 and X2, representing how hard you study and how hard you play, respectively. The data may end up looking like the graph above. The +'s and o's are the labeled results from everyone in the class, based on how hard they studied and played: + means the student passed, o means the student failed. To predict whether a new student will pass, a straight line can be drawn between the +'s and the o's. In other words, this binary classification can be made with the equation of a straight line with a constant slope.
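The straight-line rule above can be sketched in a few lines of code. The weights and threshold here are illustrative assumptions, not values from the article; any line that separates the two clusters would do.

```python
def predict_pass(study_hours, play_hours):
    """Return True (pass) if the point falls on the '+' side of the line.

    Decision rule (assumed for illustration): study_hours - play_hours > 0,
    i.e. a straight line with constant slope separating the two classes.
    """
    return study_hours - play_hours > 0


# A student who studies more than they play is predicted to pass:
print(predict_pass(8, 2))   # True
print(predict_pass(1, 9))   # False
```

In practice the slope and intercept of the line would be learned from the labeled + and o data rather than chosen by hand.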

Now how about this?

Non-Linear Graph

This nonlinear classification can certainly be done with complicated mathematical equations. However, a simple neural network can do it without breaking a sweat.

If the inputs are simplified to binary values, 1 or 0, the graph also simplifies, as shown on the right-hand side.

Simplified Binary Graph

To make this nonlinear classification with a neural network, we need a simplified network like the graph below:

It consists of an input layer as a column on the left, one or more hidden layers in the middle, and an output layer on the right-hand side. X1 and X2 are the binary inputs, 0 or 1, and each layer has a bias unit whose value is typically 1. Each circle represents a neuron. To calculate a1, you multiply the value of each neuron by its corresponding weight, sum them all up, and pass the sum through an activation function g(x).

a1 = g(-30 + 20X1 + 20X2)

a2 = g(10 - 20X1 - 20X2)

a3 = g(-10 +20a1 +20a2)

The activation function simply transforms the value of x into 1 or 0: if x > 0, g(x) is 1; if x < 0, g(x) is 0. If you do all the calculations, you will end up with the chart below, which is equivalent to the classification we wanted to do earlier.

This is also called an XNOR gate: the output is 1 only when X1 and X2 are equal. (Work through the equations: a1 acts as an AND gate, a2 as a NOR gate, and a3 ORs them together.) This shows neural networks are very good at solving non-linear classifications.
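The three equations above can be run directly to produce the full truth table. This sketch uses the exact weights from the article with the step activation just described:

```python
def g(x):
    """Step activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def network(x1, x2):
    a1 = g(-30 + 20 * x1 + 20 * x2)   # acts as an AND gate
    a2 = g(10 - 20 * x1 - 20 * x2)    # acts as a NOR gate
    a3 = g(-10 + 20 * a1 + 20 * a2)   # ORs a1 and a2 -> XNOR overall
    return a3


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", network(x1, x2))
# 0 0 -> 1
# 0 1 -> 0
# 1 0 -> 0
# 1 1 -> 1
```

The output is 1 exactly when the two inputs agree, matching the XNOR classification of the simplified binary graph.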

Traditional neural networks are robust, but they still have limitations. For one, they are not very good at computer vision, because the geometric relations among the inputs are lost when they are fed into a traditional neural network. This is where the convolutional neural network comes in and takes the main stage.

The main difference between a traditional neural network and a convolutional neural network (or CNN) is that a CNN can detect the edges of an object in its earlier layers, and then detect features built on those edges in later layers. In other words, a CNN preserves the geometric relations among the inputs.

How does CNN work?

So how does a CNN detect edges in an image? Let's take a look at a simple example.

This is a 6-pixel by 6-pixel greyscale image. The value of each pixel represents how bright it is: the higher the number, the brighter the pixel. Translated into a matrix, the image looks something like the one below.

edge

Now we apply a 3 by 3 matrix, which we call a filter, to the image with a matrix convolution operation. This specific filter is also called a vertical edge filter; you will see in a moment why. The asterisk between the two matrices represents the convolution operation.

This convolution operation produces a 4 by 4 matrix. First we take the 3 by 3 portion of the matrix highlighted in red and multiply it by the filter, where multiplication means element-wise multiplication.

The left column: 10 x 1 + 10 x 1 + 10 x 1 = 30

The middle column is just zero: 10 x 0 + 10 x 0 + 10 x 0 = 0

The right column: 10 x -1 + 10 x -1 + 10 x -1 = -30

The sum of all 3 columns is 30 + 0 − 30 = 0, and we put it in the upper left-hand corner of the result.

Next, we move the 3 by 3 window one element to the right, highlighted in green.

Conduct the same multiplication and we end up with 30. When the entire convolution operation is done, you will end up with this 4 by 4 matrix as the result.
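The sliding-window procedure above can be written out in plain Python. This is a minimal sketch of a "valid" convolution (no padding, stride 1), using the article's 6 by 6 image with a bright left half and the vertical edge filter:

```python
# The 6x6 image from the example: bright (10) left half, dark (0) right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# The vertical edge filter from the article.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]


def convolve(img, flt):
    """Slide the filter over the image; at each position, multiply
    element-wise and sum, exactly as in the worked example."""
    out_h = len(img) - len(flt) + 1
    out_w = len(img[0]) - len(flt[0]) + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(img[i + di][j + dj] * flt[di][dj]
                        for di in range(len(flt))
                        for dj in range(len(flt[0])))
            row.append(total)
        out.append(row)
    return out


for row in convolve(image, vertical_edge):
    print(row)   # every row is [0, 30, 30, 0]
```

The band of 30s in the middle columns of the 4 by 4 result marks exactly where the bright-to-dark vertical edge sits in the input.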

If a series of cells in the resulting matrix has values above or below 0, it means the filter has detected one or more vertical edges in the input image.

The concept of these filters is not new. Filters are commonly used in image processing, for example to sharpen or blur an image. Traditionally, the values inside a filter were analyzed and tuned by researchers to perform a specific function. The beauty of a CNN is that the filter values are learned by feeding large numbers of images into the network and running forward and backward propagation.

How about the colour images?

Most images come in RGB colour rather than greyscale, so to convolve a 3D matrix we also need a 3D filter. A CNN can have many filters, and through the supervised learning process each filter takes on a different form so the network can accurately classify the input images.

Notice that in this RGB convolution layer there are only 27 (3 x 3 x 3) weights (or variables) to learn, no matter how big the image is. In a traditional neural network, a 1000 by 1000 pixel image means 1 million inputs, and the first hidden layer usually has to be close to the number of inputs or accuracy suffers. Fewer variables to learn means faster computation. That is why a CNN is fast and accurate at analysing visual imagery.
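The arithmetic behind that comparison is worth writing out. The hidden-layer width below follows the article's "close to the number of inputs" claim and is an assumption for illustration, not a fixed rule:

```python
# Convolutional layer: one 3x3 filter across 3 colour channels.
conv_weights = 3 * 3 * 3        # 27 weights, regardless of image size

# Fully connected layer on a 1000x1000 image.
inputs = 1000 * 1000            # 1 million input values
hidden_units = inputs           # assumed hidden width "close to the inputs"
fc_weights = inputs * hidden_units

print(conv_weights)             # 27
print(fc_weights)               # 1000000000000 (10**12)
```

Even with a much smaller hidden layer, the fully connected weight count still scales with the image size, while the convolutional filter's 27 weights do not.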

Time, and the right learning algorithms made all the difference

Deep learning has enabled many practical applications of machine learning. In some scenarios the results from deep learning are better than humans'. Image recognition techniques can identify indicators of cancer in blood and tumours in MRI scans.

So thanks to deep learning, AI has a bright future: driverless cars, better preventive healthcare, even better movie recommendations.

In the near future, it is not impossible that people will walk alongside a C-3PO or even a Terminator.

References

[1] A. Ng, "Convolutional Neural Networks," Coursera. [Online]. Available: https://www.coursera.org/learn/convolutional-neural-networks. [Accessed 24 Feb 2018].

[2] M. Copeland, "What's the Difference Between Artificial Intelligence, Machine Learning and Deep Learning?," NVIDIA Blog, 29 July 2016. [Online]. Available: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/. [Accessed 24 Feb 2018].
