Computer Vision

Tarun Paparaju
Machine Learning Basics
7 min read · Dec 27, 2018

Computer Vision is the ability of a computer program to interpret images and identify, tag or classify them appropriately. For example, a program that can classify an image of a pet as either a cat or a dog is a computer vision program. The Convolutional Neural Network (CNN) is a specialised form of the classic Artificial Neural Network (or Multi-Layer Perceptron) that is used to solve computer vision problems. A convolutional neural network contains Convolutional Layers and Max Pooling Layers in conjunction with Densely Connected Layers.

The architecture of Convolutional Layers is inspired by the arrangement of neurons in the Visual Cortex of the brain, the region where visual information from the eyes is processed and interpreted. The human visual cortex works very differently from classical Multi-Layer Perceptrons. Research in human vision has shown that neurons in the visual cortex do not focus on every part of the image at once; instead, each neuron has its own receptive field. For example, in the image of a face, one neuron can process the left eye, another can process the mouth, and so on. Each neuron therefore processes only a certain part of the image, decomposing the high-level visual information from its section of the image into low-level features. For example, the eyes, nose and mouth in the image of a face can be broken down into edges, curves, circles etc. All these neurons then share their processed information with neurons in the next layer, where the low-level features are combined and composed into more complex features, so that the image can be interpreted as a certain object based on these complex features.

Unlike Densely Connected Layers, where every neuron in one layer is connected to every neuron in the next layer, Convolutional Layers have neurons that are connected only to certain neurons in the previous layer. The Convolutional Layer takes the spatial information of the flat 2D image and converts it into feature maps using weight matrices called kernels. These feature maps are essential because they contain the features extracted by the layer from the image, hence the name.

Now, we come to the convolution operation. The convolution operation takes two matrices as input: the image matrix and the kernel (weight matrix). The kernel is “shifted” across the entire image part-by-part (like the neurons in the visual cortex), and the dot product of each part of the image with the kernel (that is, the sum of their elementwise, or Hadamard, product) produces one entry of a feature map. (If this were a Densely Connected Layer, the product of the weight matrix would be taken with the entire image at once.) A bias matrix B can also be added to the convolved image, and the entire matrix can be passed through an activation function such as ReLU, Tanh or Sigmoid. (The activation function introduces non-linearity into the network so that it can fit more complex functions.) A small code sketch of this operation follows the figure notes below.

  • A picture demonstrating the convolution operation. The kernel’s dot product with a part of the image is taken, and that dot product replaces the corresponding part of the image in the feature map. Many such kernels or weight matrices are used to perform convolution over the image(s). The feature map produced by each kernel is stacked up into a “pile” of feature maps.
  • Thus convolution turns shallow, wide matrices into denser, narrower ones. (Each feature map produced is smaller than the image, but the output is deeper because multiple feature maps are produced.)
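
To make the operation concrete, here is a minimal NumPy sketch of a single-kernel convolution as described above (stride 1, no padding, ReLU activation). The function name conv2d and the toy shapes are illustrative only, not from any particular library:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Valid (no padding) 2D convolution of a single-channel image
    with a single kernel, stride 1. Returns one feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise product of the kernel with one image patch,
            # summed up (the "dot product" described above), plus bias.
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel) + bias
    # ReLU activation introduces non-linearity.
    return np.maximum(feature_map, 0)

image = np.random.rand(6, 6)        # toy 6x6 grayscale image
kernel = np.random.randn(3, 3)      # one 3x3 kernel (normally learned)
print(conv2d(image, kernel).shape)  # (4, 4) feature map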

The convolution operation is like a similarity measure. If the product of the magnitudes (Frobenius Norms) of the kernel and a certain part of the image is kept constant, the computed dot product increases as the angle between the matrices decreases. Thus, if a kernel is very “similar” to a certain part of an image, the corresponding part of the resulting feature map has a very high activation. These kernels are adjusted (using back-propagation) so that the basic features of an image, such as edges, curves and elementary shapes, are highly activated. A tiny numeric example of this similarity intuition follows the figure notes below.

  • In the above pictures, the effect of convolution is illustrated. The first image shows a desert animal. When the image is convolved with a learned kernel (the matrix shown in the second image), features such as the edges and curves of the animal’s face are extracted. The basic round shapes of the eyes and ears are also well captured. (These basic features receive higher activations and are therefore highlighted over their less activated surroundings.)
  • Many such feature maps are generated and used as features for interpreting the image.
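
Here is that similarity intuition in numbers: a hand-made vertical-edge kernel responds strongly to a patch that contains a vertical edge, and not at all to a flat patch. (The kernel and patch values below are made up purely for illustration.)

```python
import numpy as np

kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])  # a vertical-edge detector

matching_patch = np.array([[9., 5., 1.],
                           [9., 5., 1.],
                           [9., 5., 1.]])  # contains a vertical edge
flat_patch = np.full((3, 3), 5.)           # uniform region, no edge

# The sum of elementwise products is high when the patch
# "looks like" the kernel, and low otherwise.
print(np.sum(kernel * matching_patch))  # 24.0 -> strong activation
print(np.sum(kernel * flat_patch))      # 0.0  -> no activation
```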

A parameter of the convolution operation called the kernel size specifies the dimensions of the weight matrices (which must be square). For example, a kernel size of (3, 3) means that we take the dot product of a (3, 3) weight matrix with each (3, 3) part of the image. Another convolution parameter called the stride determines how far rightward and downward we move from one part of the image to the next. For example, a stride of (1, 1) means that we move our “window” (the part of the image to be convolved) by 1 column to the right and 1 row downwards each time we shift to a different part of the image. The third important concept, padding, is the process of adding extra pixels around an image or feature map before convolution, like the “frame” of a picture. This helps preserve the size of the image or feature map after convolution, since convolution otherwise shrinks the image, and it also preserves information at the image borders. Zero Padding is a type of padding where pixels with zero value are placed around the borders of the image. Same Padding is a type of padding in which just enough pixels are added so that the image after convolution is exactly the same size as the original.
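
The relationship between these three parameters and the output size can be captured in one short formula. The sketch below assumes a square n × n input, a k × k kernel, stride s, and p pixels of padding on each border:

```python
def conv_output_size(n, k, s=1, p=0):
    """Output side length for an n x n input convolved with a k x k
    kernel, stride s, and p pixels of padding on each border."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, 3))        # 26: "valid" convolution shrinks the map
print(conv_output_size(28, 3, p=1))   # 28: same padding preserves the size
print(conv_output_size(28, 3, s=2))   # 13: stride 2 roughly halves the map
```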

Now coming to the concept of Max Pooling. Max Pooling reduces the size of images or feature maps by shifting a window across the image and picking the largest pixel value within each window. A short code sketch of this follows the notes below.

  • Here, we take each 2-by-2 window in the image and pick the largest value in each window. (Like convolution, Max Pooling has a window size and a stride; here both are (2, 2).)
  • Max Pooling helps reduce the size of the feature maps. It has been shown experimentally that max pooling preserves most of the important information in the feature maps whilst reducing the image size.
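
A minimal NumPy sketch of max pooling as described above, with window size and stride both 2 (the function name max_pool is illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling with a size x size window, as in the example above."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    pooled = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()  # keep only the strongest activation
    return pooled

fmap = np.arange(16).reshape(4, 4).astype(float)
print(max_pool(fmap))  # [[ 5.  7.]
                       #  [13. 15.]]
```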

The CNN

Many Convolutional Layers, each immediately followed by a Max Pooling Layer, are stacked together. These layers extract features from the images. The resulting feature tensors are then flattened into a vector of features using a Flatten layer. These features are fed to a stack of densely connected layers, which perform the final classification in classification applications (the features can be used in other ways in other computer vision applications). The densely connected layers combine the features to produce more complex features, which form the basis for classifying images. This stack of Convolutional, Max Pooling and Densely Connected Layers is called a Convolutional Neural Network, or CNN.
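
As a rough illustration, here is what such a stack might look like using the Keras API. The layer sizes, the (28, 28, 1) input shape and the 10 output classes below are assumptions for a toy classification task, not a prescribed architecture:

```python
from tensorflow.keras import layers, models

# Conv -> MaxPool blocks extract features, Flatten turns the final
# feature maps into a vector, and Dense layers do the classification.
# Input shape and class count are assumptions for this sketch.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```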

But Max Pooling layers also introduce something called invariance, which is a problematic shortcoming of CNNs: valuable information about the orientation, angle, lighting, proportions etc. of objects in the image is lost as a result of max pooling, so the network cannot capture the spatial relationships between the different parts of the image.

  • For example, a simple CNN can easily be fooled into believing that the above image is a face as it cannot capture the relative proportions and relative spatial positions of the eyes, mouth, nose etc in the image.

To overcome this problem of invariance, Capsule Networks (proposed by Geoffrey Hinton) were recently introduced to capture additional information about the image, such as the relative orientation, positioning and proportions of the different components of the image. Capsule Networks are beyond the scope of this article.

This concludes this article on Computer Vision, and on CNNs in particular. I hope it helped you better understand the meaning and implementation of Computer Vision.
