Revolutionary, Convolutional Neural Nets.

My first steps into the world of A.I.

Part 5: Where is the “eye” of AI?

Medmain
7 min read · Oct 29, 2018

Intro

The architecture of a neural network is built from simple connections. Models are based on this idea of relaying information forwards and backwards. Inside each neural network, several layers are stacked and deeply connected, and each layer performs a simple calculation on the data it receives before passing the result to the next. In this part, we will explore one of the more common layer arrangements, found in a specific network structure called a ConvNet. Just like any neural network, it has an input layer, an output layer, and several hidden layers in between.

Diagram from This Post

This architecture is famously known and often referred to as a Convolutional Neural Network (CNN).

Convo-what-shen-now?

Arguably, the CNN can be considered the biggest breakthrough in machine learning of our century. It is the most well-known neural network amongst its peers and has become highly practical in recent years. To me, this means it’s probably worth spending some time looking into what it’s all about, and why it’s so magical.

Until just a few years ago, deep learning and big data had mostly revolved around one format of data: numbers in spreadsheets. We all know numbers aren’t pleasing to the eye for many, and even less entertaining for most.
Put bluntly, the reason the CNN has been so successful is that, for the first time in history, mankind found a way to abstract image data well enough that a machine could “see” an image in its own terms. The core of this technique follows the same basic logic as ordinary neural nets, with a few small differences.

Let’s begin by looking up the meaning of “Convolve” in a dictionary.

Definition according to Google

From this, we can infer that a CNN, at its core, is about folding or rolling things together.

In mathematical terms, convolution combines two functions to produce a third.

In this case, every “function” within a CNN is represented by a neural layer.

Since CNNs mainly work with images, we must come up with a way to represent an image as data. A simple example is a grayscale image: it only has black, white, and all the shades of gray in between. Each pixel holds a value between 0 and 255 representing its brightness.

Graph obtained from Hack Till Dawn

We can turn this image into a matrix whose size and dimensions match the resolution of the image. Once the image is represented numerically, we can do all sorts of things to it with math.
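
To make this concrete, here is a minimal sketch (using NumPy, my own choice for illustration) of a tiny made-up grayscale image stored as a matrix of brightness values:

```python
import numpy as np

# A tiny 4x4 grayscale "image": each entry is a brightness value
# from 0 (black) to 255 (white).
image = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [ 16,  80, 144, 208],
    [  8,  72, 136, 200],
], dtype=np.uint8)

print(image.shape)    # (4, 4) -- the matrix size matches the image resolution
print(image / 255.0)  # a common preprocessing step: scale brightness into [0, 1]
```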

In common practice, the CNN architecture consists of 4 essential layer types.

Convolution

In the first stage, we are looking at a convolution layer. This layer involves mapping a filter to the entire picture: you take a separate, smaller matrix (the filter) and use it to transform the bigger matrix (the picture). The filter slides over every pixel and performs a calculation, a weighted sum of the surrounding pixels, to produce a new pixel. This method has long been used with a plethora of filter types that can transform the overall appearance of a picture. These filters are also known by their mathematical name, kernels. This post serves as an excellent introduction to convolution and contains an interactive experience for applying this first level of abstraction to an image.

GIF from This Post
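
As a rough sketch of what a convolution layer computes, the snippet below slides a small kernel over an image and takes a weighted sum at each position (CNN libraries typically compute the same thing, without flipping the kernel). The 3×3 edge-detection kernel and the random input are purely illustrative:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter (kernel) over every position of the image and
    compute a weighted sum at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(region * kernel)
    return output

# A classic 3x3 edge-detection kernel -- just one of many possible filters.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.randint(0, 256, size=(6, 6))  # a random 6x6 stand-in "image"
feature_map = convolve2d(image, edge_kernel)
print(feature_map.shape)                         # (4, 4)
```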

Pooling

The next level of abstraction is the pooling layer. Its main purpose is to reduce the complexity of the model without reducing its performance: it speeds up the analysis by summarizing the picture. This way the model can still recognize the important features in a region of the picture without concerning itself with less significant pixels. There are many different types of pooling, and they all aim to capture different features of interest. Max pooling, for example, abstracts by taking the maximum of each filter region, creating a simplified representation of the important features in that image. The stride is the number of pixels the window jumps before taking the pool sample for the next region.

Diagram from Quora
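
Here is a minimal sketch of max pooling, assuming a 2×2 window and a stride of 2 (both common but illustrative choices):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Summarize each size x size region by its maximum value,
    jumping `stride` pixels between regions."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = region.max()
    return pooled

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 1],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
print(max_pool(feature_map))  # [[6. 5.]
                              #  [3. 4.]]
```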

Sliding each filter across the entire picture produces something called an activation map, sometimes called a feature map, which indicates how strongly the neurons fire in response to features in each region.

GIF from KDnuggets

Quite literally, it shows where the neurons “activate” over the most important regions of the picture. When this process is repeated enough times, you begin to see patterns that can be used to distinguish complex object characteristics.

Flatten

The third stage of a CNN is flattening the data. At this point, we have abstracted all the relevant spatial information and are ready to process the data using a purely one-dimensional perspective.

From a matrix, we step down to a one-dimensional array of numbers.

Diagram from Super Data Science

If multiple filters were used, each of the resulting feature maps is flattened into individual nodes that can be consumed by an Artificial Neural Network (ANN).

Diagram from Super Data Science

This is a relatively simple and straightforward concept. All it really does is align everything in a single dimension to reduce complexity.
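
In code, flattening really is just a reshape; the values below are carried over from the hypothetical pooling example above:

```python
import numpy as np

feature_map = np.array([[6, 5],
                        [3, 4]])

flattened = feature_map.flatten()  # equivalently: feature_map.reshape(-1)
print(flattened)                   # [6 5 3 4] -- a 1-D array a dense layer can consume
```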

Programmers are lazy, but machines can be lazy too.

Dense

Lastly, this flattened output is fed into a fully-connected layer, also called a dense layer, which connects to a regular neural network for further processing.

This is the entire process of convolving the input data.

Diagram from Super Data Science

Using this method, we can feed image data into a neural network as if it were any other numerical data, while it still represents an image.

The neural network will then learn patterns from this input through deep learning, and produce some output.
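
Putting the four layer types together, here is a minimal sketch of such a network using the Keras API; the 28×28 grayscale input and the 10 output classes are illustrative assumptions, not something taken from this post:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal CNN: Convolution -> Pooling -> Flatten -> Dense.
# Input size and class count are illustrative assumptions.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # 16 learned 3x3 filters
    layers.MaxPooling2D(pool_size=2),                     # summarize each 2x2 region
    layers.Flatten(),                                     # matrix -> 1-D array
    layers.Dense(10, activation="softmax"),               # one score per class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```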

Beyond the basics

The above process is only a basic template for a CNN. Much like all the other templates and basic models I’ve looked at, this is merely an introduction to the concept and its way of thinking. There are variants and improvements on this basic CNN that are more practical and aimed at solving specific problems.

It’s important to note that you can convolve and pool the image data as many times as you like. When the neural net is more complex, there are more classes as possible outputs, and the last section of the CNN may become more sophisticated. More complex CNNs tend to perform convolution and pooling repeatedly.

Picture from Stanford CS231n Course

This method can be very useful, especially with larger images, since it abstracts more information and features at each deeper level. In the above example, a deep CNN might contain multiple layers of abstraction before the data is flattened and sent into a dense layer.
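
In the same Keras sketch style as before, going deeper simply means stacking more convolution and pooling blocks before flattening; the layer sizes here are again illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Repeated convolution + pooling blocks, each abstracting the image further.
deeper_model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),   # later blocks pick up more abstract features
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
deeper_model.summary()
```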

The below diagram shows the neural network structure of another practical classifier. It comes from a post whose goal is to classify images of traffic signs, and it reaches great accuracy with the help of a CNN and some data augmentation.

Diagram from Alex Staravoitau’s Blog

As the CNN abstracts deeper into the image, the convolved pictures start making less sense to us, but perhaps more sense to the machine. In the first few levels of abstraction, it might recognize simple features such as lines and edges. But as the convolution gets deeper, the machine begins to notice other relationships between shapes in these images. Without some kind of analysis, this could easily look like a complete mess to a human eye.

This paper gives an eye-opening interactive experience for understanding how machines really “see” these pixels and classify an image correctly. Although they are just numbers, when displayed visually like this they really look beautiful. Every pixel passing through the network helps create the meaning of a bigger image; perhaps that is akin to what our brain does when recognizing an object?

That’s it for this part, please look forward to the next!

