Augmented reality as a research problem in computer vision

Augmented reality, or simply AR, is an exciting technology that can change the way we view the world. It aims to expand (augment) what we normally see in the physical world (reality) by integrating computer graphics into it. Think of it as pulling a computer image out of your monitor and placing it in the real world right in front of you.

Fig.1: A typical AR application: a man uses a tablet device to visualise a design layout in an office.

The effect shown in Fig.1 is what we call AR. It blurs the line between what is real and what is virtual, and thereby enhances our perception of the 3D world around us.

What makes an app an AR app?

There is a common misconception about what counts as an AR application. AR is not just about “adding” or “overlaying” information or graphics onto the real world; it is about giving computers the ability to “see” the world. When you point your device at something, it should genuinely perceive and understand what is there, and only then overlay graphics on top of the scene. This is called vision-based AR. The computer-generated graphics behave like real objects: as you move around, you should see different sides of an object, and it should change scale as you move towards or away from the scene.

Why is AR a challenging task?

To make AR believable, the computer essentially needs to know how to rotate the virtual objects to match our viewpoint, which is described by the camera pose. That doesn't sound too difficult, right? Well…it is not as easy as it seems, because the camera pose is always unknown: an image or a video file simply does not contain that information.

But what makes camera pose estimation difficult? The answer is perspective projection. This is the effect that creates the illusion of 3D on a 2D picture plane. Fig.2 depicts several images of the same building. Even though you and I can easily tell that these photos show the same scene from different angles, computers are dumb: they cannot perceive or distinguish the images the way we do. So how can we teach a computer to deal correctly with this illusion?
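To see where the illusion comes from, it helps to write the pinhole camera model down: a 3D point X is mapped to a 2D pixel x via x ~ K [R|t] X, where the rotation R and translation t are exactly the unknown camera pose. Here is a minimal numpy sketch of that projection (the intrinsic matrix K and the 3D points are made-up example numbers, not from any real camera):

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N x 3) to 2D pixels via the pinhole model x ~ K [R|t] X."""
    X_cam = X @ R.T + t          # world coordinates -> camera coordinates (the pose)
    x = X_cam @ K.T              # camera coordinates -> homogeneous image coordinates
    return x[:, :2] / x[:, 2:]   # perspective divide: this is where depth is lost

# Made-up intrinsics: focal length 800 px, principal point at (320, 240)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Identity pose: camera at the origin, looking down +Z
R = np.eye(3)
t = np.zeros(3)

X = np.array([[0.0, 0.0, 4.0],   # a point straight ahead, 4 m away
              [1.0, 0.0, 4.0]])  # a point 1 m to its right

print(project(K, R, t, X))  # → [[320. 240.] [520. 240.]]
```

The perspective divide in the last line of `project` throws the depth away, which is why the pose cannot simply be read back out of a single photo: many different (R, t, X) combinations produce the same pixels.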

Fig.2: A sequence of images of the same building, shot under different camera poses.

Thanks to earlier researchers who established the foundations of computer vision, a field of study that teaches computers how to interpret images and do smart things with them, many researchers today build on those foundations to come up with their own solutions for estimating the unknown camera pose. In the research field, this problem is usually referred to as camera pose estimation.

What are the approaches for estimating camera pose?

Currently there are two mainstream computer-vision-based approaches: marker-based and markerless tracking.

Marker-based AR uses predefined patterns for training; the algorithm then recognises these patterns in the real world to recover the camera pose. One advantage of this approach is that AR systems can be very robust thanks to the predefined markers, but the disadvantage is that the markers are fixed and must be trained beforehand.

Fig.3: Marker-based AR uses a predefined pattern for training and recognition.
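One simple way to recognise a predefined pattern in an image is template matching with zero-mean normalized cross-correlation. Real marker systems do considerably more (binary decoding, corner refinement, pose recovery), but this is the core "recognise a known pattern" idea. A toy numpy sketch, with an entirely made-up 4x4 binary marker embedded in a synthetic image:

```python
import numpy as np

def match_score(patch, marker):
    """Zero-mean normalized cross-correlation between an image patch and the marker."""
    a = patch - patch.mean()
    b = marker - marker.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom else 0.0

def find_marker(image, marker):
    """Slide the marker over the image and return the best-matching top-left corner."""
    h, w = marker.shape
    best_score, best_pos = -1.0, None
    for y in range(image.shape[0] - h + 1):
        for x in range(image.shape[1] - w + 1):
            s = match_score(image[y:y + h, x:x + w], marker)
            if s > best_score:
                best_score, best_pos = s, (y, x)
    return best_pos, best_score

# A made-up 4x4 binary marker pattern
marker = np.array([[1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 1]], dtype=float)

# Synthetic 10x10 "camera image" with the marker pasted in at row 3, column 5
image = np.zeros((10, 10))
image[3:7, 5:9] = marker

pos, score = find_marker(image, marker)
print(pos, round(score, 3))  # → (3, 5) 1.0
```

Because the pattern is known in advance, a perfect match (score 1.0) is possible, which is exactly why marker-based tracking is so robust, and why it cannot work on scenes it was never trained on.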

Markerless AR, on the other hand, is a more popular approach for estimating the camera pose. It detects natural features such as planes, edges, blobs, or corner points in both the template image (a region of an image that you want the computer to recognise) and the current image from the camera feed, and matches the correspondences between them. Well-known techniques such as SIFT provide not only a feature detector but also a way to encode these feature points so that they are robust against changes in image scale or even partial occlusion. The technical details deserve a post of their own, so I will leave them for next time.

Fig.4: Markerless AR is usually done by detecting and matching natural features against a “template image”. On the left-hand side is the template image, and on the right-hand side is the camera feed. The circular dots are the detected feature points, the blue lines join the correspondences, and the green box is the detected region in the camera feed.


More and more companies are making strong efforts towards developing and standardising this technology, including Qualcomm, Sony, Google, and a whole bunch of startups. In addition, many open-source toolkits are available for developers to build AR apps or even extend the algorithms. Personally, I think AR is both a cool and a scary technology: cool because it can add huge value to the real world, and scary because it can flood our environment with too much information (ads, useless tags, or graphics).