Computer Vision :An Introduction

With the growing popularity of Deep learning or say Convolution Neural Networks, its no longer deemed necessary among the deep learning enthusiasts to know the traditional computer vision algorithm from implementation point of view but they are very helpful from the understanding point of view. It will help us understand what actually is going on in a CNN layer. These algorithms are popular in medical imaging, robotics, 3D modelling, gesture recognition etc. All the techniques discussed are unsupervised so their is no ground truth requirement and that’s a great benefit.

The idea of this post is not to explain the traditional algorithm in detail but to give an intuition and basic understanding about them. I tried to keep this blog mathematics free and thus at many places, I have skipped some equations. I would encourage you to derive things yourself.

In computer vision, we represent images as numbers.One can call it a function from n dimension to a number. f(x,y,z)=k where x ,y,z are dimensions and k is intensity. Normally what we come across is a 2D image.

Most basic element of CV is filter. It can perform basic task as averaging to complex tasks as we do with CNNs. Most common filters are Gaussian filters and typical size of filters is 3x3.

I really hope, one have some basic understanding of linear algebra. This post won’t be that mathematics loaded but it is important for fundamental understanding. One can go through a short course on Linear Algebra by 3Blue1Brown.

One of the most basic task of CV is edge detection. We all know what edge generally means. Technically,edges are the high frequency components of image. It is in line with human behaviour as we can classify any object just given the edges.

Edge Detection:

One can also think of edge as sudden change in intensity/brightness in image. We can compute gradient in any direction and if the gradient is high than we can say that there is an edge.

For simplicity, lets consider 1D image. In 1D, we can estimate gradient by calculating the difference between the intensity of image with intensity of right pixel. This is called right gradient. Similarly, calculating this for left to get left gradient.We can also convolute the image with filter [-2,0,2] or any other filter of your choice to get different gradient.

But in real life scenarios with large noise, this method won’t work.Imagine a noisy image, its gradient will be higher than threshold at many points cause of randomness. So, first we have to smooth the input using a Gaussian filter and then find gradient.

But, as the convolution is linear operation, convolution with Gaussian filter and differentiation can be replaced by convolution with differentiated Gaussian. The output will now have a maxima corresponding to every edge which can be detected by further differentiation and zero detection. By applying the same rule again, we can convolute with the double derivative of Gaussian filter to get the edges as the points where it passes 0 and strong derivatives exists on both sides.

In 2D, we get gradient by convolution with “y filter” for getting edges in vertical direction and “x filter” for getting derivative in horizontal direction. We can combine their result to get the actual gradient at that point.

After getting the results, we threshold the gradient to get edges. Higher the threshold, stronger the edges. This may exclude some of the edges but then we can use Canny algorithm to get the result. In Canny algorithm, we find the strong edges first,then we find the weak edges. And then we check which of these weak edges were part of strong edges to get good result.

Here also, as we did in 2D, we can directly perform convolution with the double differentiated Gaussian filter which we call as Laplacian of filter as shown in figure.We can directly check for zero crossing to get the edges.

Many a time, when the edges are too thick, we use thinning to get a cleaner output. This famous example shows the various steps of our edge detection.

Hough Transform :

Next task after edge detection task is localisation of shapes in images. Hough transform’s task is identification of lines from edges generated. It basically uses a voting procedure for this. Each point on edge vote for what they think is the line.This voting procedure is carried out in a parameter space. Parameter with maximum votes is selected. In case of multiple lines, multiple maximas can be chosen.

Hough algorithm transforms from image space to parameter space. Consider a general equation of line y=mx+b, image space is plotted x and y dimension and parameter space in m and b. Note that a line in image space is point in parameter space and similarly, a point in image space is a line in parameter space.So, basically, they are dual of each other. For each point in image space we get a line in parameter space ,thus intersection in Hough space gives the equation of line in image space.

But the problem with this representation is slope can be infinity and then its hard to fit. So, we use parametric form of line equation.

In this case, duality holds in form that a point in image space is a sinusoidal curve in parameter space and a sinusoidal line is a point. Visualisation of which is shown in figure.

In case of circles, consider we know the radius of circle then, the Hough transform can be applied as shown.

Now, if multiple circle of different radius are present, our radius assumption fails. So, consider the general case,where the radius is also treated as a parameter. In this case the parameter image of a point will be a cone. Now,we all know,its hard to handle cones.

To avoid the 3D interpretation on parameter side, we take the gradient of point in image space and from our geometry classes we know radius of circle lies on the normal. So, from any 3 points,we can get the centre of circle as point of their intersection.

Now, consider a case where the input image space is noisy, in this case the parameter space will have jitters and a lot of false maximas exists. So we smooth the input with Gaussian filter.

Space Complexity of Hough Transform is k^n where k is number of possible bins and n is number of dimensions is exponential in dimension.So, the problem with Hough transform is it explodes with increasing bins or dimensions.

Time Complexity is linear in number of points on edge.

Feature detection :

Feature is a piece of information which is relevant for solving the computational task. Feature should be easily repeatable i.e. we should find that feature in new picture at same position and locality describing it should not be too large.In case the locality is too large, a small occlusion of neighbourhood may lead to loss of information. For example, for feature ‘nose’ in human face, you use the information of ear that is too far so in different image if ears are missing, it may not be able to detect ‘nose’ feature.

One way of feature detection is corner detection, we may now call it as Harris corner. In Harris corner we define an error term E(u,v) as shown. u,v are the movement of window in horizontal,vertical directions respectively.

We calculate how the error is changing as we move along the u,v dimension from the pixel under consideration. Now, to approximate this error value, we use taylor-Series expansion. Considering (u,v) as distribution around (0,0), we can reduce the taylor series to a second order moment as shown below.

Using some knowledge from linear algebra, we can convert this final equation in diagonal form. From there, we can conclude that if both eigenvalues are large, it is corner. If only one of them is large, it is an edge.

We don’t even need to calculate eigenvalues. We can just calculate this value R.

Why? Because expanding R using properties such as product of eigenvalues is equal to determinant of matrix and trace of matrix is sum of eigenvalues, we can see that if both eigenvalues are positive and high, R will be large positive .

If R is large positive, it’s a corner.

Harris corners are invariant to intensity and rotation as the eigenvalues remains same but the issue is that the it is variant to scaling. If we will increase size of window, it may detect it as edge than corner.

One way to correct this is that to run it for all the window size and then choose the size which give best result.

SIFT(Scale Invariant Feature Transform):

Here, to make it scale invariant, we can use Laplacian of Gradient (LoG), we can vary Laplace for different Sigma and choose the best result.Our transform will be a function of not only positions but sigma too. Problem with LoG is that it is very expensive.

So, if we remember the Gaussian function, difference of two Gaussian function also gives an approximate LoG .SIFT uses this approximation. We blur the image and take their difference as shown in figure.From the DoG images, consider the x mark,if its value is greater than all the values beside it in same layer as well as upper and lower layer,its a feature.

We also repeat this by varying image size from factors of 2 i.e it is repeated for different octaves.

We remove the points with low contrast and also the edges to get features .

SIFT descriptor

After getting the features, we also need the description as the correlation is variant to rotation, intensity and scaling. SIFT also gives the descriptor for the features.

For a given keyword descriptor, a neighbourhood of 16*16 is chosen, which are then divided into blocks of 4*4. Each block contains the orientation of the feature in one of the desired bins ( normally we take 8). 8 bin histogram is created, that gives 4*4*8=128 features for descriptor.

Now, one way is to match all the descriptor features of the one image with all the descriptor features of other images.But if descriptors are in range of 10k,it can be too heavy. We can use wavelet hashing,locality sensitive hashing to get some speed ups.

We can also use SIFT to get a affine transformation of the one image to another, we need atleast 3 descriptors to match from both image.

These algorithms are just like tip of an iceberg.There’s too much more to it. The aim of this blog post was just to introduce with some of the popular techniques with a little explanation.

If you find this interesting, you can go through this course on computer vision.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.