Computer Vision: An Introduction to Perception

Photography is an art I could never master. Maybe it was my half-hearted attempts or the lack of proper gadgets, but the fact remained that the photos I captured were never up to the mark. I thought it was time I took it to a deeper level and analysed vision from a mathematical point of view. As part of a project under Swarm Robotics, IIT Kharagpur, I kickstarted this summer (2017) by taking a course on Robotics Perception on Coursera. I present a brief “lecture note” of the same in this article.


Most of us are familiar with the isometric view we learned during our Engineering Drawing lectures. It is a 3D view of an object in the real world, with the three mutually perpendicular coordinate axes drawn 120° apart. Isometric views carry a lot of information in themselves and are hence very important in architecture. On the flip side, a perspective projection is the one that mimics the human eye, and is the one obtained in photographs. A stark difference between the two lies in the fact that parallel lines remain parallel in an isometric view, whereas in a perspective view they may meet at a point termed the “vanishing point”. Various sets of parallel lines on a plane may meet at various vanishing points. Joining all of them, we get the horizon, a line (on an image) which is specific to a plane (in the real 3D world).

In computer vision, we generally tend to work with homogeneous coordinates. We represent a point by a line and a line by a plane by adding a third coordinate, i.e. we add an extra dimension. Considering ourselves to be at the origin, we imagine the image to be placed a unit distance away from us along the z axis. Accordingly, the horizontal direction becomes the x axis and the inverted vertical becomes the y axis. For a point on the image, we imagine a ray passing from the origin through the image plane and extending beyond. For a line, we simply imagine two more distinct lines, each passing through the origin and meeting the line on the image plane at two different points; these three lines define a plane. Formulas involving homogeneous coordinates are often simpler and more symmetric than their Cartesian counterparts.
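As a small illustration in NumPy (the helper name is mine): a point (x, y) becomes (x, y, 1), a line ax + by + c = 0 becomes the vector (a, b, c), the line through two points is the cross product of their homogeneous coordinates, and the intersection of two lines is the cross product of the lines.

```python
import numpy as np

def to_homogeneous(x, y):
    """Lift a 2D image point (x, y) to homogeneous coordinates (x, y, 1)."""
    return np.array([x, y, 1.0])

# Line through two points: the cross product of their homogeneous coordinates.
p1 = to_homogeneous(1.0, 2.0)
p2 = to_homogeneous(4.0, 6.0)
line = np.cross(p1, p2)          # (a, b, c) with a*x + b*y + c = 0

# Intersection of two lines: again a cross product.
l1 = np.array([1.0, -1.0, 0.0])  # the line y = x
l2 = np.array([0.0, 1.0, -2.0])  # the line y = 2
point = np.cross(l1, l2)
point = point / point[2]         # normalise so the last coordinate is 1
print(line, point)               # point comes out as (2, 2, 1)
```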

A point in the 3D world is expressed with respect to some coordinate frame which might not be the same as the camera frame. In order to bridge the gap, we use a transformation matrix. We first write the 3D point in homogeneous coordinates by appending a ‘1’ to it. The matrix has two parts: the relative rotation and translation of the world coordinates with respect to the camera coordinates. I present an image below to illustrate the same.

Here the first one is the rotation matrix, and the second one is for translation. The x, y and z axes have been represented by red, green and blue (RGB). The rotation matrix is simply the relative orientation of the coordinate axes. For instance, the y axis of the camera is opposite to the z axis of the world, hence the entry (0, 0, -1). Often the translation vector is augmented with the rotation matrix, and a row with (0, 0, 0, 1) as its elements is added below. This keeps the matrix square and invertible and also brings in support for homogeneous coordinates. We further pre-multiply by another matrix containing the intrinsic camera parameters to get the points in 2D image coordinates. The overall picture looks like

where fx and fy are the focal lengths along the x and y axes respectively, s is the skew factor (accounting for the pixel axes not being perfectly perpendicular), and (px, py) is the principal point, i.e. where the optical axis meets the centre of the image. K is known as the “calibration” matrix.
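In standard notation (the exact layout in the course slides may differ slightly), the projection of a world point (X, Y, Z) to pixel coordinates (u, v) described above reads:

```latex
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
=
\underbrace{\begin{pmatrix}
f_x & s   & p_x \\
0   & f_y & p_y \\
0   & 0   & 1
\end{pmatrix}}_{K}
\begin{pmatrix} R & t \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
```

Here λ is the arbitrary scale that comes with homogeneous coordinates, R is the 3x3 rotation and t the translation discussed above.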

We assume images to be formed by the pinhole camera model, something we have been familiar with since childhood. The image is imagined to be formed at the same distance in front of the hole as the actual screen lies behind it. An interesting visual effect worth discussing here is the dolly zoom. It occurs when we simultaneously move the camera back and zoom in (or vice versa) so that the size of the subject remains the same. As a result, the background, which is out of focus, changes its size.


The equation of the horizon line can be derived from a pair of vanishing points lying on it (in homogeneous coordinates, their cross product gives the line). The horizon corresponds to the plane through the camera centre that remains parallel to the ground plane.

If we have three vanishing points corresponding to three mutually perpendicular directions, the distance from the camera centre (the origin) to the orthocentre of the triangle formed by the three points gives us the focal length. This helps in finding camera parameters just from an image. We can also estimate the heights of objects which are parallel in the real world by taking their relative heights. For instance, if the ends of objects (standing on the same plane) meet at the same vanishing point, we can say they have the same length (assuming the image has been captured horizontally). What if this assumption is invalid? Moreover, how would we know whether the image is horizontal or not? A simple way is to see where the horizon lies. If it runs across the image through the centre, the assumption is valid. If it doesn't, we need the concept of the cross-ratio to determine the distances; they can no longer be computed using simple ratios.

Here, A, B, C and D can be considered to lie on the image plane and A’, B’, C’ and D’ on the ground plane (or vice versa). In case one of the points (say D) is a vanishing point, the corresponding point on the ground plane (D’) lies at infinity. Thus, by taking the limit as D’ tends to infinity, the RHS reduces to A’C’/B’C’. Hence, even a very limited amount of information can help us decipher a lot from an image if we have knowledge of the vanishing points, and, more importantly, without involving any intrinsic parameters of the camera.
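In one common notation for the points on the two lines (consistent with the reduction just described; the labelling in the course figure may differ), the invariance of the cross-ratio reads:

```latex
\frac{AC \cdot BD}{BC \cdot AD} \;=\; \frac{A'C' \cdot B'D'}{B'C' \cdot A'D'}
\;\xrightarrow{\;D' \to \infty\;}\;
\frac{AC \cdot BD}{BC \cdot AD} \;=\; \frac{A'C'}{B'C'}
```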

A projective transformation is an invertible transformation matrix which maps an image viewed from a certain angle to the image as it would appear if viewed from another angle. For a point, we simply pre-multiply its homogeneous coordinates by the matrix to get the new coordinates in the other plane. This matrix (known as the homography matrix) is obtained in a very interesting manner.

Here, we shift the coordinates in such a way that, from this view angle, the 2D image plane lies a unit distance in front of the new origin (along the z axis). When we take homogeneous coordinates into consideration, we treat each point as a ray emanating from the origin and passing through the image plane. Since the transformed original ray and the new ray are supposed to be the same up to a scalar multiple (and hence parallel), their cross product should be the zero vector. This cross product can be written as a matrix whose last row is a linear combination of the first two (the last matrix written at the left-most corner of the image). Hence, ultimately, we are left with six unknowns. If we avoid this complication and look at the matrix directly, it is composed of nine unknowns (a 3x3 homography matrix). Since, in homogeneous coordinates, the last element can be taken to be 1, we get one constraint and hence need eight other equations. Since every point correspondence gives us two equations, we need to know the transformed versions of at least four points in the image to compute the homography matrix. While going the other way round, as usual, we need to take the inverse of this matrix. The inverse is at times broken down into two matrices instead of three, by forcing the diagonal matrix in its Singular Value Decomposition to be the identity (discussed later), thus enforcing orthogonality constraints on it.
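Here is a minimal sketch of the four-correspondence estimation described above (a standard Direct Linear Transform with NumPy; the function and variable names are mine, not from the course):

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography mapping src[i] -> dst[i] (at least 4 points).

    src, dst: (N, 2) arrays of corresponding image points. Each correspondence
    contributes two rows to A, and the homography is the null vector of A
    recovered via SVD.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)       # right singular vector of the smallest singular value
    return H / H[2, 2]             # normalise so the last element is 1

# Example: a known homography recovered from 4 correspondences.
H_true = np.array([[1.2, 0.1, 5.0], [0.0, 0.9, -3.0], [0.001, 0.0, 1.0]])
src = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 80.0], [0.0, 80.0]])
dst_h = (H_true @ np.c_[src, np.ones(4)].T).T
dst = dst_h[:, :2] / dst_h[:, 2:]
print(np.round(homography_dlt(src, dst), 3))
```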

While moving from 3D to 2D, we have four more unknowns (the intrinsic camera parameters). Hence, we need at least six point correspondences in that case.


Detection invariance is the property that lets us find correspondences between points in two images of the same object captured from different scales and angles. This means we should be able to map a point from one image onto the other such that they correspond to the same point in the 3D world, just by looking at the images. Descriptor invariance is the property that lets us say two images are pretty similar by studying a few sets of points and their neighbourhoods. The most important issue with both is maintaining scale invariance, i.e. detecting the same pair of points in both images even when there is a drastic change in scale. We use a method known as SIFT (Scale Invariant Feature Transform) by David Lowe. The notion of scale space can be used in images too. The main idea of a scale-space representation is to generate a one-parameter family of derived signals in which the fine-scale information is successively suppressed. For images, this is obtained by applying a Gaussian blur. If we look at just one pixel across the series of blurred images, we find that its intensity follows a curve similar to

For instance, the Laplacian of Gaussian (LoG), the sum of the unmixed second partial derivatives of the Gaussian, is generally used before finding edges in an image. But its response fades out as the scale increases, so we use a scale-normalised version: we multiply the LoG by σ² so that, just as in the Gaussian, only σ² remains in the denominator. We compare the absolute values of the maxima of the curve for various σ and choose the σ at which the response is highest; this is the intrinsic scale. The effect is drastic and leads to much better results in applications (like edge detection, where we subtract the original image from the normalised one and the absolute value of the result gives us prominent edges). An interesting property of the LoG is that it can be approximated by subtracting one Gaussian-blurred image from another blurred with a different σ (the Difference of Gaussians).
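As a rough illustration of choosing the intrinsic scale, the sketch below (assuming SciPy is available; the names and the toy image are mine) compares the scale-normalised LoG response at the centre of a blob with its Difference-of-Gaussians approximation across σ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

# A toy image with a bright disc of radius ~8 pixels at the centre.
yy, xx = np.mgrid[:64, :64]
image = (np.hypot(xx - 32, yy - 32) < 8).astype(float)

sigmas = np.arange(1.0, 12.0, 0.5)
log_response, dog_response = [], []
for s in sigmas:
    # Scale-normalised LoG: multiply the raw LoG response by sigma^2.
    log_response.append((s ** 2) * gaussian_laplace(image, s)[32, 32])
    # DoG approximation: difference of two Gaussian blurs with nearby sigmas.
    k = 1.6
    dog_response.append((gaussian_filter(image, k * s) - gaussian_filter(image, s))[32, 32])

# |response| peaks near the disc's intrinsic scale (roughly radius / sqrt(2) here).
print(sigmas[np.argmax(np.abs(log_response))], sigmas[np.argmax(np.abs(dog_response))])
```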

Singular Value Decomposition is the factorisation of a matrix A into three parts, A = MDN. Here M and N are orthogonal matrices: the dot product of any two distinct columns is zero and each column has unit length, so the transpose of M times M gives the identity. D is a diagonal matrix with non-increasing values starting from the first element; the last few elements may be zero. What's interesting here is that if we take just the first few (say n) columns of the first matrix, the first n diagonal elements of the second and the first n rows of the third, we can extract certain features of the image. Here are a few examples.
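A quick sketch of this low-rank reconstruction with NumPy (the names and the toy image are mine):

```python
import numpy as np

def low_rank_approximation(A, n):
    """Keep only the n largest singular values of A (a grayscale image, say)."""
    M, d, N = np.linalg.svd(A, full_matrices=False)   # A = M @ diag(d) @ N
    return M[:, :n] @ np.diag(d[:n]) @ N[:n, :]

# A toy "image": a smooth gradient plus a block, well captured by a few singular values.
A = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
A[20:40, 20:40] += 1.0
for n in (1, 2, 5, 20):
    err = np.linalg.norm(A - low_rank_approximation(A, n)) / np.linalg.norm(A)
    print(f"rank {n}: relative error {err:.4f}")
```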

We often use the “Least Squares Fitting” method to obtain a straight line from scattered points. But it gives results far from optimal when the data contains outliers. Hence, we use an alternative known as RANSAC (Random Sample Consensus). Here we choose two random points and draw a straight line through them. We count the number of points which lie within a band around this line, imagined as two parallel lines on either side at a distance ±ε. The line with the maximum number of such points is taken to be the best-fitting straight line.
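A minimal sketch of this procedure (the names, threshold and toy data are mine):

```python
import numpy as np

def ransac_line(points, eps=0.5, iterations=200, rng=np.random.default_rng(0)):
    """Fit a line a*x + b*y + c = 0 to points (an (N, 2) array) by RANSAC.

    Repeatedly pick two random points, form the line through them, count how
    many points lie within a perpendicular distance eps, and keep the best.
    """
    best_line, best_inliers = None, -1
    for _ in range(iterations):
        i, j = rng.choice(len(points), size=2, replace=False)
        # Line through two points as a cross product in homogeneous coordinates.
        l = np.cross([*points[i], 1.0], [*points[j], 1.0])
        l = l / np.hypot(l[0], l[1])                 # make (a, b) unit length
        distances = np.abs(points @ l[:2] + l[2])    # perpendicular distances
        inliers = np.sum(distances < eps)
        if inliers > best_inliers:
            best_line, best_inliers = l, inliers
    return best_line, best_inliers

# Points on y = 0.5 x + 1 with a little noise, plus a few gross outliers.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
pts = np.c_[x, 0.5 * x + 1 + 0.1 * rng.standard_normal(50)]
pts = np.vstack([pts, rng.uniform(0, 10, size=(10, 2))])
print(ransac_line(pts))
```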

In order to compute the rotation between two coordinate systems, three points are enough. We take the cross product of two of the vectors, and then the cross product of the result with one of them; this gives us a mutually orthogonal triad even if the original two vectors were not orthogonal.

Another question which arises is: about which point should the rotation be taken? The answer is the centroid. So we try to minimise the squared norm of A - RB (where R is the rotation matrix). Here A and B are matrices containing the displacement of each point from its respective centroid.

Hence, Z is an orthogonal matrix whose diagonal entries are bounded above by those of the identity, so the expression is maximised when Z is the identity.
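A minimal sketch of this centroid-plus-SVD solution (often called the orthogonal Procrustes or Kabsch method; the function name and test data are mine):

```python
import numpy as np

def best_rotation(A, B):
    """Rotation R minimising ||A - R B|| for 3xN point sets A and B.

    Both sets are first centred at their centroids; R comes from the SVD of
    the cross-covariance matrix, with a sign fix so that det(R) = +1.
    """
    A0 = A - A.mean(axis=1, keepdims=True)
    B0 = B - B.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd(A0 @ B0.T)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    return U @ S @ Vt

# Sanity check with a known rotation about z by 30 degrees plus a translation.
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
B = np.random.default_rng(0).standard_normal((3, 5))
A = R_true @ B + np.array([[1.0], [2.0], [3.0]])
print(np.allclose(best_rotation(A, B), R_true))
```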

Now we'll have a look at how to determine the position of the camera if we have knowledge of the location of certain points in the scene. If we know two points and the angle their rays subtend at the camera, the camera could lie anywhere on a circle in 2D; thinking in 3D, it could lie anywhere on a toroid (the surface obtained by revolving that circle). If we add another point, we get a toroid for every pair of points. Hence, we can safely conclude that the camera lies at the intersection of the three toroids. Breaking it down into simple geometry, we can make use of the cosine rule so that the unknown distances can be recovered.

The derivation is pretty straightforward. We take d2 = u·d1 and d3 = v·d1 and apply the cosine rule to each pair of viewing rays.
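In one common notation (the symbols may differ from the course), with a, b, c the known distances between the three world points and α, β, γ the angles between the corresponding viewing rays, the cosine rule gives:

```latex
\begin{aligned}
d_2^2 + d_3^2 - 2\,d_2 d_3 \cos\alpha &= a^2\\
d_1^2 + d_3^2 - 2\,d_1 d_3 \cos\beta  &= b^2\\
d_1^2 + d_2^2 - 2\,d_1 d_2 \cos\gamma &= c^2
\end{aligned}
```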

Hence, we can eliminate the individual distances in favour of the ratios u and v.
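In the notation assumed above, substituting d2 = u·d1 and d3 = v·d1 and isolating d1² gives:

```latex
d_1^2 \;=\; \frac{a^2}{u^2 + v^2 - 2uv\cos\alpha}
      \;=\; \frac{b^2}{1 + v^2 - 2v\cos\beta}
      \;=\; \frac{c^2}{1 + u^2 - 2u\cos\gamma}
```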

Solving the first equation for the square of u and inserting it into the second, we solve for u and place it back into the first. In this way, we end up with a biquadratic (fourth-degree) equation in v. The other equation is a quadratic in u, so for each of the four solutions of v we get two solutions of u. Thus, we get up to eight results.


Stereoscopic vision gives us two images (A and B) of the same scene, taken from different positions. Let's assume we are able to see camera A from image B and camera B from image A. The image of each camera's centre in the other image is known as an epipole. Every point in one image, together with its camera centre, defines a ray in 3D; projected into the other image, these rays give the epipolar lines. In order to find the relative rotation and translation between the cameras, we need eight corresponding points in the two images. Why? Let's assume camera A to be at the centre of the universe. Its 3D (world) to 2D (image) transformation is then K[I O], where K is the calibration matrix, I the identity matrix and O the zero vector; here I plays the role of the rotation and O of the translation. Accordingly, for camera B, it is K[R t], where t is the vector joining the two camera centres, expressed in camera B's coordinate frame. We then have three vectors in camera B's reference frame: t, the translation; X2, a ray from camera B through a point on the image plane and beyond, meeting the 3D object at a point; and X2 - t, which equals R·X1, where X1 is the same point expressed in camera A's frame.

Taking appropriate vector products and then a scalar product gives us zero. On expanding the matrix on the RHS, we are able to represent it with the help of an essential matrix, by taking the transpose of (R·X1) followed by the transpose of the entire expression. And since we have shown before that the cross-product matrix is skew symmetric, a negative sign gets attached to it. Now, if we change the point X2 in the image, we get a different plane each time. These planes cut camera B's image plane in various lines which all converge to the epipole; these are the epipolar lines. The point X1 in camera A's image plane is known only up to a scaling factor, say p·X1. The epipolar line on camera B's image plane is E·X1, and the one on camera A's image plane is the transpose of X2 multiplied by E. Hence the epipole of camera A satisfies E·e1 = O, where e1 is the epipole; for camera B, the transpose of e2 multiplied by E equals O.

By and large, if a point is at X1 (in 3D) from the point of view of camera A and at X2 from camera B, their relationship is expressed by the bilinear equation: the transpose of X2, times E, times X1 equals zero. If the camera calibration matrix is K, the image points in cameras A and B are given by K·X1 and K·X2 respectively. Replacing X1 by K-inverse times x1 and X2 by K-inverse times x2, we get that the transpose of x2, times F, times x1 equals zero, where F is the transpose of K-inverse, times E, times K-inverse. The matrix F is known as the Fundamental Matrix. Just like E, it maps the epipole to zero (F·e1 = O with e1 nonzero), hence it is a rank-two 3x3 matrix: one of its rows is a linear combination of the other two. Since the points have three homogeneous coordinates, the matrix must be 3x3. It has a total of nine elements, but it is defined only up to a scaling factor, so we are left with eight unknowns. Thus,
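Collecting these relations, and expanding the bilinear constraint into a linear equation in the entries of F (my notation for the pixel coordinates, since the original slide is not shown here):

```latex
\begin{aligned}
&X_2^{\top} E\, X_1 = 0, \qquad E = [t]_{\times} R, \qquad
x_i = K X_i, \qquad F = K^{-\top} E\, K^{-1}, \qquad x_2^{\top} F\, x_1 = 0,\\[4pt]
&\big(u_2 u_1,\; u_2 v_1,\; u_2,\; v_2 u_1,\; v_2 v_1,\; v_2,\; u_1,\; v_1,\; 1\big)
\begin{pmatrix} F_{11} \\ F_{12} \\ \vdots \\ F_{33} \end{pmatrix} = 0,
\qquad x_1 = (u_1, v_1, 1)^{\top},\; x_2 = (u_2, v_2, 1)^{\top}
\end{aligned}
```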

Hence, it gets reduced to a least-squares problem of the form AX = O, where X here stacks the nine entries of F. Thus, we need at least eight point correspondences. We normalise the last element to be 1 and accordingly bring the points back to the image plane.
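A sketch of the resulting eight-point estimation with NumPy (the names are mine; in practice the pixel coordinates are usually normalised first for numerical stability):

```python
import numpy as np

def fundamental_eight_point(x1, x2):
    """Estimate F from >= 8 pixel correspondences x1[i] <-> x2[i] ((N, 2) arrays).

    Build the linear system from x2^T F x1 = 0, take the SVD null vector,
    then enforce the rank-2 constraint on F.
    """
    ones = np.ones(len(x1))
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    # Each correspondence gives one row of coefficients for the entries of F.
    A = np.c_[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, ones]
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value of F.
    U, D, Vt = np.linalg.svd(F)
    F = U @ np.diag([D[0], D[1], 0.0]) @ Vt
    return F / F[2, 2]             # normalise the last element to 1
```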

In order to compute the rotation and translation between the two images, we need to decompose E, which we know is the cross-product matrix of t times R. This cross-product matrix, tx, can be written in the form U·A·(U transpose). Next, we break U into its columns, with the last column being the translation vector. We can further perform an SVD of R and equate it to U, an unknown Y and the transpose of V; U times its transpose gives the identity. So, by and large, we take the SVD of E, breaking it into U, D and the transpose of V. The translation vector is the last column of U, up to sign. The rotation matrix is the product of U, the unknown Y (which can take two values) and the transpose of V, with its determinant, as usual, required to be +1. Thus, we have four possible configurations of rotation and translation for the two images.
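A sketch of this decomposition with NumPy (the names are mine):

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs from an essential matrix E.

    The standard SVD-based decomposition: t is the last column of U (up to
    sign) and R = U W V^T or U W^T V^T, with signs fixed so det(R) = +1.
    """
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    t = U[:, 2]
    candidates = []
    for R in (U @ W @ Vt, U @ W.T @ Vt):
        if np.linalg.det(R) < 0:       # flip sign to keep a proper rotation
            R = -R
        candidates.extend([(R, t), (R, -t)])
    return candidates
```

The physically valid pair among the four is usually picked by checking that the triangulated points lie in front of both cameras.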

Hence, we are able to determine the rotation and translation between the two images simply from the images captured.


I hope the lecture note presented above makes it easier to develop an interest in computer vision. In order to churn out something practical from the knowledge gained above, I plan to work on stereo vision this semester, and will definitely make a post on the outcome!

Course credits:
Kostas Daniilidis, Professor, Computer and Information Science
Jianbo Shi, Professor, Computer and Information Science

School of Engineering and Applied Science
University of Pennsylvania
Coursera.

https://www.coursera.org/learn/robotics-perception