Computer Vision Part 3: The core elements

Ilias Mansouri
Mar 20, 2019


Welcome to Part 3 of our blog series on computer vision. In Part 1, we gave an overview of computer vision use cases across different industries, which was followed by Part 2: a summary of the different computer vision techniques. In this third part, we will discuss the main building blocks from which many state-of-the-art algorithms are constructed. The goal is not to provide an in-depth mathematical dissertation, but rather a high-level explanation. We will then use these building blocks to build a neural network capable of classifying images.

Convolution

Let us immediately dive into what is arguably the most influential technique in CV. Below, we see a visual representation of a convolution operation:

Visual representation of the convolution operation between two matrices.

The left matrix is the input, the middle matrix represents a filter, and on the right we see the result of the convolution operation. We place the filter on top of the input, starting at the top-left corner, and multiply each value from the input with its overlapping counterpart in the filter matrix. We then take the sum of those 9 products, which gives us the top-left value of the result matrix. This process is repeated by sliding the filter across the input.
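The sliding-and-summing process above can be sketched in a few lines of plain Python. Note that the input, filter and values here are illustrative, not the ones from the figure, and that (as is common in deep learning libraries) the filter is applied without flipping, which is strictly speaking cross-correlation:

```python
# A minimal sketch of the convolution step described above (no padding,
# stride 1), using plain Python lists. Values are illustrative.

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply each input value with its overlapping filter value
            # and sum the products.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 1, 0, 1],
    [1, 0, 2, 3],
]
edge_kernel = [   # a simple vertical-edge detector
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
print(convolve2d(image, edge_kernel))  # [[0, 0], [-2, -4]]
```

A 4x4 input convolved with a 3x3 filter yields a 2x2 feature map, which is why convolutions without padding shrink the input.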

Not impressed yet? Take a look at the two following filters:

Color and edge filters.

Nothing fancy, right? Let’s see what the result is after applying both filters on an image:

The image below is the result of applying the color and edge filters on the original image above.

Filters in a convolution operation are basically nothing more than feature detectors and the result is also called a feature map.

Activation Units

A rectified linear unit (ReLU) is typically applied after every convolution. Its purpose is to introduce non-linearity: most real-world data is non-linear, while convolution is a purely linear operation. How does it work? ReLU checks the feature map value by value; every negative value is replaced with a zero. Other activation functions, such as sigmoid or tanh, can also introduce non-linearity in neural networks, but ReLU is superior in performance most of the time.
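Applied to a feature map, this amounts to a one-line element-wise operation (values here are made up for illustration):

```python
# A tiny sketch of ReLU applied element-wise to a feature map:
# negative values become zero, positive values pass through unchanged.
def relu(feature_map):
    return [[max(0, v) for v in row] for row in feature_map]

fmap = [[-3, 5], [2, -1]]
print(relu(fmap))  # [[0, 5], [2, 0]]
```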

Pooling

Max pooling, or down-sampling, reduces the dimensionality of an input matrix by simply taking the biggest value in the region covered by the filter mask.

Multiple advantages are associated with this technique. By reducing the size of the input, we also reduce the computational complexity, which results in a lower computational cost. Furthermore, it increases robustness against image distortions, scaling and transformations.
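A common choice, used as an assumption in this sketch, is a 2x2 mask with stride 2, which halves both dimensions of the feature map:

```python
# A minimal max-pooling sketch: each `size` x `size` region of the
# feature map is reduced to its largest value.
def max_pool(feature_map, size=2, stride=2):
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            region = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(region))
        out.append(row)
    return out

fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [1, 2, 9, 8],
    [0, 3, 4, 7],
]
print(max_pool(fmap))  # [[6, 4], [3, 9]]
```

The 4x4 map shrinks to 2x2, yet the strongest activation in each region survives, which is what makes the representation robust to small shifts.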

Fully Connected Layer (FCL)

The convolution and pooling operations act as feature extractors. At a certain point, our network has extracted and learned enough information from the input to make a decision about what those features mean with respect to what we are trying to classify. At this point, we have to take all features into account at once; this is where Fully Connected Layers (FCLs) come into play.

The convolution operation mentioned before extracts features from a region of the image. An FCL is nothing more than a convolution across the whole input instead of a region. Below, we can see what an FCL actually looks like.
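Computationally, an FCL is a matrix-vector product: the feature map is flattened into a vector and every input feature is connected, via a weight, to every output neuron. The weights and biases below are illustrative placeholders, not values from a trained network:

```python
# A sketch of a fully connected layer: each output is the dot product
# of the flattened features with one row of weights, plus a bias.
def fully_connected(features, weights, biases):
    return [
        sum(f * w for f, w in zip(features, row)) + b
        for row, b in zip(weights, biases)
    ]

features = [0.5, -1.0, 2.0]      # flattened feature map
weights = [[0.2, 0.4, 0.1],      # one row of weights per output neuron
           [0.7, -0.3, 0.5]]
biases = [0.1, -0.2]
print(fully_connected(features, weights, biases))  # ≈ [0.0, 1.45]
```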

Softmax Activation (SMA)

We're almost there! As we see above, the last layer performs the actual prediction of our model. Here we introduce softmax, which is essentially logistic regression generalized to multiple classes: it assigns a decimal probability to each class in a multi-class problem, and those probabilities must sum to 1.
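Concretely, softmax exponentiates each class score and normalizes by the total. The max-subtraction below is a standard numerical-stability trick, not something the definition requires:

```python
import math

# A minimal softmax sketch: exponentiate each score (shifted by the
# maximum for numerical stability) and normalize so the outputs sum to 1.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # the probabilities sum to 1
```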

Le CNN Architecture

We have discussed how the 5 main elements function and have built our Convolutional Neural Network (CNN) toolbox. By stacking these tools on top of each other, we obtain a CNN architecture. As you might expect, different architectures exist for different kinds of problems. For example, a use case with biomedical images will probably be tackled with a U-Net architecture, while detecting objects in images could be handled by a ResNet.

Example architecture that uses various convolutional and ReLu layers, followed by an FCL and Sigmoid activation layer.

For example, by stacking the tools from our toolbox three times, followed by an FCL and SMA, we obtain a model which does a good job at classifying images. The image below also visualizes the different layers of such a model.
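To make the stacking concrete, here is a toy forward pass that chains the building blocks from this post: convolution, ReLU, max pooling, flattening, a fully connected layer, and softmax. A real architecture would repeat the conv/ReLU/pool stage several times and learn its weights; the single stage and hand-picked numbers below are purely illustrative:

```python
import math

# A toy end-to-end CNN forward pass: conv -> ReLU -> max pool ->
# flatten -> fully connected -> softmax. Weights are illustrative,
# not a trained model.

def convolve2d(image, kernel):
    k = len(kernel)
    n = len(image) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(n)] for i in range(n)]

def relu(fmap):
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(image, kernel, weights, biases):
    fmap = max_pool(relu(convolve2d(image, kernel)))
    flat = [v for row in fmap for v in row]           # flatten
    logits = [sum(f * w for f, w in zip(flat, row)) + b
              for row, b in zip(weights, biases)]     # fully connected
    return softmax(logits)

image = [[1, 0, 2, 1, 0],
         [0, 3, 1, 0, 2],
         [2, 1, 0, 1, 1],
         [1, 0, 2, 3, 0],
         [0, 2, 1, 0, 1]]
kernel = [[1, 0], [0, -1]]         # 2x2 filter: 5x5 -> 4x4 -> 2x2 pooled
weights = [[0.5, -0.2, 0.1, 0.3],  # two output classes
           [-0.4, 0.6, 0.2, -0.1]]
biases = [0.0, 0.1]
probs = forward(image, kernel, weights, biases)
print(probs)  # two class probabilities summing to 1
```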

Visualization of a CNN consisting of several layers of convolution and ReLus.

It is important to understand that when training a CNN, it actually learns the appropriate values for the feature detectors. This means the CNN will extract the image features that maximize the chance of recognizing the correct patterns in unseen data. Intuitively, the more filters our architecture contains, the more image features can be extracted, but also the more compute power is required.

Finally, as seen above, filters in the early layers detect low-level features such as edges, curves or dots. The deeper we move into the CNN, the more we encounter high-level features such as tires or legs. The last layer consists of an FCL, where those features are used to finally classify the image with a probability score.

The End

In this Part 3, we provided a high-level explanation of the main building blocks from which modern state-of-the-art CV models are built and how architectures are constructed by chaining these elements in a particular way. We also showed how easily an image classifier can be constructed using these building blocks.

Next, in Part 4, we will briefly discuss the different architectures of arguably the most performant image classifiers. Afterwards, we will build further on the building blocks in this post to explain how Object Detection and Segmentation work.
