Augmented Reality DIY

Jayant Jain
Published in The Startup · Dec 17, 2019

Hi!! My name is Jayant, and what follows is a guide to make your first augmented reality program using Python and OpenCV. This project was an assignment I had in my Undergrad Computer Vision course, which I have broken down here, step by step.

After following this tutorial, hopefully, you’ll have something that looks like this or even better.

Disclaimer: This will require a basic knowledge of Python and OpenCV, so if you haven’t used OpenCV before, I would recommend going through an introductory tutorial first. I have inserted code snippets for a better understanding of what’s going on. I recommend reading the full blog first, understanding everything, and then trying to write the code yourself with mine as a reference.

In this post, I shall cover-

  1. Aruco Markers & Detection
  2. Perspective Transformations
  3. Augmenting the Object into our Reality
  4. Possible Improvements

Aruco Markers

As you can see in the video, the cut-out in my hand with small black and white squares is an Aruco Marker. This will be my target surface. By positioning the marker, I can control the location and movement of the fox.

various aruco markers, source: https://mecaruco2.readthedocs.io/en/latest/notebooks_rst/Aruco/aruco_basics.html

If you look at one marker carefully, you can see that it has 64 subsquares (8 by 8), each colored either black or white. The pattern is not entirely random. The design tries to ensure that a marker won’t be mistaken for some background element and vice-versa. For the same reason, we wouldn’t use a simple checkerboard as an aruco marker. I chose to use Aruco Markers as my target surface because their simple black and white pattern makes them super easy to detect and identify.

So given the original aruco pattern, how do I find it in my image?

We find all contours in the captured image (a contour is simply the boundary of an object in an image). Here I directly used the function cv2.findContours, but behind the scenes it may be doing something as simple as finding edges and then connecting them to get closed figures.

The cv2.approxPolyDP() function then uses the Douglas Peucker Algorithm to “smoothen” the contours.

Different levels of smoothening
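To make this concrete, here is a rough sketch of this contour stage. The function name and the threshold settings are just illustrative, not the exact code from my repository, and note that cv2.findContours returns (contours, hierarchy) in OpenCV 4.x (older versions also return the image).

```python
import cv2

def find_candidate_contours(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Adaptive thresholding copes better with uneven lighting than a fixed cutoff
    thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    contours, _ = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    approximations = []
    for cnt in contours:
        # Douglas-Peucker simplification; epsilon controls the level of "smoothening"
        epsilon = 0.03 * cv2.arcLength(cnt, True)
        approximations.append(cv2.approxPolyDP(cnt, epsilon, True))
    return approximations
```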

Note that even though our original pattern has a square boundary, the image of this marker may not look square. This is because of the perspective transformation that it undergoes. We’ll dive deeper into this ahead. An easy example would be to think of railway tracks. We know that they are parallel, but when you look along their direction, they appear to converge.

Since we are looking for a quadrilateral, we keep only the contours whose approximation has exactly 4 points and discard the rest. For each remaining quadrilateral, I find its bit-signature by dividing it into an 8 x 8 grid and thresholding the pixel value at the center of each cell.

points of interest for a general convex quadrilateral

The bit-signature we find can then be matched against the bit-signature of the original pattern. All four rotations of the original are considered, and the matching rotation tells us the order of the points, which is crucial for the homography estimation coming up ahead.
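Here is a sketch of how the bit-signature step could look. I’m assuming the candidate corners are already ordered consistently (say, clockwise from the top-left), that the frame has been converted to grayscale, and that reference_bits holds the known 8 x 8 pattern of 0s and 1s; the helper names are my own.

```python
import cv2
import numpy as np

CELLS = 8
SIDE = CELLS * 10  # warp the candidate quadrilateral to an 80x80 canonical square

def read_bit_signature(gray, corners):
    canonical = np.float32([[0, 0], [SIDE, 0], [SIDE, SIDE], [0, SIDE]])
    H = cv2.getPerspectiveTransform(np.float32(corners), canonical)
    warped = cv2.warpPerspective(gray, H, (SIDE, SIDE))
    bits = np.zeros((CELLS, CELLS), dtype=np.uint8)
    for row in range(CELLS):
        for col in range(CELLS):
            # threshold the pixel at the centre of each cell
            cy, cx = row * 10 + 5, col * 10 + 5
            bits[row, col] = 1 if warped[cy, cx] > 127 else 0
    return bits

def match_rotation(bits, reference_bits):
    # Try all four rotations; the one that matches tells us the corner ordering
    for k in range(4):
        if np.array_equal(np.rot90(reference_bits, k), bits):
            return k
    return None
```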

Yay!! We found our aruco marker in the image.

Homography & Perspective Transform

I want to answer questions like: if I had a real fox and placed it on top of my marker, what would the captured image look like?

So, given the unaltered image of me with the marker in my hand and the true Aruco pattern for reference, I wish to augment the fox into my reality by editing the image I have.

Before we do that, we need to learn about different coordinate systems and how they are related to one another.

The World Coordinate System (WCS) is the 3D Cartesian coordinate system in which we describe our object of interest. It can be placed arbitrarily, according to our convenience. The Camera Coordinate System (CCS) is also a 3D Cartesian coordinate system, whose axes are aligned with the camera's orientation.

A translation and rotation are enough to transform world coordinates into camera coordinates. The jump from world coordinates to camera coordinates is governed by the following equation-
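In standard notation, with (U, V, W) the world coordinates of a point, (X, Y, Z) its camera coordinates, R the rotation matrix with columns r1, r2, r3, and t the translation vector, this is

```latex
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
= R \begin{bmatrix} U \\ V \\ W \end{bmatrix} + t
= \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}
  \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}
```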

Next, we have the intrinsic matrix (A), which is a projection from the 3D camera coordinates (X, Y, Z) to the 2D pixel coordinates (u, v). It’s enough to know that the intrinsic matrix depends purely on the camera itself, not on its orientation or position. It takes care of the camera’s focal length, aspect ratio, etc. For now, assume we have the intrinsic matrix.

The equation for a perspective transformation, from WCS to Pixel Coordinates
Point P has World Coordinates (U1, V1, 0) and Pixel Coordinates (u1, v1)
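Written out, with A the intrinsic matrix and s an arbitrary scale factor coming from the homogeneous coordinates, the perspective equation is

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= A \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}
  \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}
```

For a point like P that lies on the marker plane, W = 0, so the column r3 never contributes to the product; keep this in mind for the homography discussion below.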

Our aim is to find the parameters of this perspective equation so that we can place our object in the WCS and find its pixel coordinates, which will give us our desired image.

So far, we’ve located our Aruco Marker in the image. Using its corners, we have four correspondences (pairs (P, P*), where P is a point in the WCS and P* is the corresponding point in pixel coordinates). These 4 point correspondences suffice to solve for the homography.

Now, what is a homography? A homography is a degenerate form of the perspective equation. Using the homography H, we can transform points belonging to a specific planar surface; in this case, the plane W = 0. Note that for all 4 points on the marker, the z component (W) is 0 in our perspective equation, so we can’t get any information about r3. But we do know that r3 should be the cross product of r1 and r2 (the axes are orthogonal). Using this, we find r3 in the extended_RT function.
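Here is a sketch of how that extension could look, given the intrinsic matrix A and a homography H estimated from the four correspondences (for example with cv2.findHomography). The function shares its name with the extended_RT mentioned above, but the body below is my own simplified version.

```python
import numpy as np

def extended_RT(A, H):
    B = np.linalg.inv(A) @ H           # B ~ [r1 r2 t] up to an unknown scale
    scale = 1.0 / np.linalg.norm(B[:, 0])
    if B[2, 2] < 0:                    # keep t_z positive so the marker sits in front of the camera
        scale = -scale
    r1 = scale * B[:, 0]
    r2 = scale * B[:, 1]
    t = scale * B[:, 2]
    r3 = np.cross(r1, r2)              # the missing column: orthogonal to r1 and r2
    Rt = np.column_stack((r1, r2, r3, t))
    return A @ Rt                      # full 3x4 perspective projection matrix
```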

Coming back to the intrinsic matrix, OpenCV has direct functions to help you calibrate your web camera. You just need to follow their documentation. I will just briefly give you an idea of what you’ll have to do.

The documentation describes various types of image distortion and gives the correction equations, which have certain parameters.

To find all these parameters, we provide some sample images of a well-defined pattern (e.g., a chessboard). We find some specific points in them (the square corners of the chessboard), whose coordinates we know both in real-world space and in the image. With this data, a mathematical problem is solved in the background to get the distortion coefficients.

Given these sample images, it finds the distortion coefficients along with the intrinsic matrix (our aim).
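A rough sketch of that calibration procedure, assuming the sample photos show a chessboard with 9 x 6 inner corners and sit in a folder called calib/ (both are just assumptions for illustration):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # real-world grid (z = 0)

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the distortion coefficients and the intrinsic matrix (our aim)
ret, A, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```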

The problem here is that you have to take a lot of photos of a checkerboard.

If you are lazy, then there is another way. You can try to approximate it smartly. The matrix has the form-
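In the usual notation, where fx and fy are the focal lengths expressed in pixels and (cx, cy) is the principal point (typically near the image centre):

```latex
A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
```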

cx and cy are supposed to be half the resolution of the webcam. So for me, the resolution was 1280x640, which gives cx = 640 and cy = 320. The focal length can be approximated following this post. The exact value of f is not crucial to this application, and therefore a smart guess could work.

So, after all this hassle we have the perspective transformation matrix.

But before transforming points to the pixel coordinate system, we first have to decide the world coordinates of the animated fox.

Rendering

I got the fox from the website clara.io, which has a wide range of 3D objects that you can choose from. We will be dealing with the .obj format. It is a very simple way to describe a three-dimensional object. All we need to know is that it lists all the vertex coordinates, texture coordinates, and face information.
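For reference, a minimal .obj reader could look something like this. It only handles the v, vt, and f lines we care about and ignores everything else real .obj files can contain (normals, materials, groups); the function name is mine.

```python
def load_obj(path):
    vertices, tex_coords, faces = [], [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == 'v':
                vertices.append([float(x) for x in parts[1:4]])
            elif parts[0] == 'vt':
                tex_coords.append([float(x) for x in parts[1:3]])
            elif parts[0] == 'f':
                # each face entry looks like "vertex_index/texture_index/..."
                face = [p.split('/') for p in parts[1:]]
                vertex_ids = [int(p[0]) - 1 for p in face]   # .obj indices start at 1
                texture_ids = [int(p[1]) - 1 for p in face if len(p) > 1 and p[1]]
                faces.append((vertex_ids, texture_ids))
    return vertices, tex_coords, faces
```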

What I’ve done is not the best (or even correct) way to render this image. There are many things that need to be taken into consideration, but I took some shortcuts.

The basic idea is that we find world coordinates for all the vertices. For that, we just have to shift the object to the center of the aruco marker by shifting each vertex. We then project all the vertices to their pixel coordinates and draw all the 2D faces on top of the original image.

For color, there was an accompanying texture file and texture coordinates for each face. In the image, you can see each triangular face mapped to the corresponding texture. I have directly picked up a flat color (using the pixel intensity at the centroid of the triangle in the texture). Since no ordering of faces is kept in mind while rendering, the fox appears transparent. No consideration of lighting has been taken either. Maybe all of this can be solved using some library, but for now, this is what we have.
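A sketch of this shortcut-heavy rendering step, assuming projection is the 3 x 4 matrix from the homography extension, the .obj data comes from a loader like the one sketched above, and marker_center and scale are illustrative parameters for placing and sizing the fox:

```python
import cv2
import numpy as np

def render(frame, vertices, tex_coords, faces, projection, texture, marker_center, scale=1.0):
    vertices = np.array(vertices) * scale
    tex_h, tex_w = texture.shape[:2]
    for vertex_ids, texture_ids in faces:
        # shift the face to the marker centre in world coordinates (height axis untouched)
        pts = vertices[vertex_ids] + np.array([marker_center[0], marker_center[1], 0])
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])     # homogeneous world points
        projected = (projection @ pts_h.T).T
        projected = projected[:, :2] / projected[:, 2:3]     # divide out the scale factor
        # pick a flat colour from the texture at the centroid of the face's UV triangle
        if texture_ids:
            uv = np.mean([tex_coords[i] for i in texture_ids], axis=0)
            colour = texture[int((1 - uv[1]) * (tex_h - 1)), int(uv[0] * (tex_w - 1))]
            colour = tuple(int(c) for c in colour)
        else:
            colour = (200, 200, 200)
        cv2.fillConvexPoly(frame, np.int32(projected), colour)
    return frame
```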

Tip: When dealing with images in OpenCV in Python, please take care of pixel coordinates and locations in the image. What I mean is that the intensity at point (x, y) in the image is img[y][x] and not img[x][y]. Such bugs waste a lot of time and are extremely frustrating.

For example, in this 100x100 image, a circle was drawn at (20,50) but the pixel value at image[20][50] is black and image[50][20] is red
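You can reproduce that example in a couple of lines (the circle radius here is arbitrary):

```python
import cv2
import numpy as np

# Draw a filled red circle at point (x=20, y=50) on a black 100x100 image
image = np.zeros((100, 100, 3), dtype=np.uint8)
cv2.circle(image, (20, 50), 5, (0, 0, 255), -1)   # OpenCV points are (x, y)

print(image[20][50])   # [0 0 0]   -> black: this is row 20, column 50
print(image[50][20])   # [0 0 255] -> red:   row 50 (y), column 20 (x)
```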

Summary

This is how it will all come together-

Before running the main file, we hardcode the intrinsic matrix, found either by calibration or by smart guessing.

At each time step, we fetch a frame from the webcam. We try to find the aruco marker in the frame by finding all contours, keeping only the quadrilaterals, and then matching their bit-signatures. If we find the aruco marker in our frame, we use its corners to estimate a homography from the world coordinates to the pixel coordinates. This homography is extended to get the full perspective matrix. We shift our object's vertices to the center of the marker and, using the perspective transform, find its 2D projection in our image.
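Putting it together, the main loop could look roughly like this. The helpers (detect_marker, extended_RT, render) and the marker fields are placeholders standing in for the pieces sketched earlier, not the exact functions in my repository, and A, the object data, and the texture are assumed to be loaded beforehand.

```python
import cv2

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    marker = detect_marker(frame)              # contours -> quads -> bit-signature match
    if marker is not None:
        H, _ = cv2.findHomography(marker.world_corners, marker.pixel_corners)
        projection = extended_RT(A, H)          # A is the hardcoded intrinsic matrix
        frame = render(frame, vertices, tex_coords, faces,
                       projection, texture, marker.center)
    cv2.imshow('Augmented Reality', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```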

Further

One major improvement that I’m currently working on would be not to detect the marker in every frame, but rather to track it. What I mean is that if I know where the marker is in the current frame, I can exploit this information when looking for it in the next frame, instead of searching for it from scratch all over again. This is precisely what tracking is. It will address the instability you see in the video right now, and since it takes less processing, the simulation will be more responsive. Using the Lucas-Kanade algorithm for sparse optical flow should solve our problem, and I plan to cover that in the next part.

Another issue I’d like to work on is more realistic rendering.

Another way to create this augmented reality program would be to use an arbitrary marker (not Aruco; something like your name written on a piece of paper) and SIFT feature descriptors. An advantage is that it would still work even if part of the marker is occluded; the Aruco method fails if a part is outside the frame because the contour can no longer be detected. However, I found that method slower and more unstable.

I would like to thank Juan Gallostra Acín, whose blog really helped me get started with this project and helped me write my first version. He has used SIFT descriptors for feature matching and homography estimation. You should check out his blog too.

I would also like to thank my project partner Manuj Trehan without whom I couldn’t have completed this project.

Links

  1. Juan Gallostra Acín’s blog
  2. Douglas Peucker Algorithm
  3. .obj file description
  4. 3D fox used
  5. GitHub repository
  6. UV Texture Mapping
  7. Original Image of Checkerboard
  8. Approximating the focal length of a camera

PS: This is the first time I have made a tutorial. If you found any part unclear or incorrect, feel free to leave a comment or write to me at jayantjain100@gmail.com so that I can improve these posts. Thank you.
