How Augmented Reality works, simply explained (Mobile Vision Series, Part 2)

Josh Jacob
Hanson Inc.
8 min read · Feb 15, 2019

In our Mobile Vision series, we look at some common computer vision tools and their potential uses in your mobile apps. In the first post we explained how face detection works; in this second post, we explain how augmented reality works and some ways you can make use of it in mobile applications.

Augmented Reality (AR) is certainly one of the hottest topics in mobile vision. It’s a buzzword for games as much as it is for supporting product selection in ecommerce. Most people know it if they see it, but this article talks about what AR is actually doing as it looks through your smartphone’s camera.

Keep in mind that AR is an umbrella term that covers many different companies and technologies. The AR solutions on the market all vary to some degree in capabilities, device requirements, and performance.

What is AR, basically?

So what defines AR? We know that AR apps display the camera feed and composite (aka augment) computer images on top.

What separates AR from flat image “stickers” or other badging is that the composition is related to the 3D space of your physical environment. In other words, as you move your phone, the composition of the imagery changes to mimic what the element would look like in the real world. For this to happen, AR software analyzes your live camera feed and combines it with motion data from the device to build a virtual 3D representation of your environment; computer imagery is then placed in that virtual space and composited on top of the camera feed.

How AR thinks about the real world

There are two essential aspects to how AR works based on the camera feed.

Step 1: The first part was touched on in the first article of this series: image analysis code is run to match visual elements in each frame of the camera feed. In this case, instead of faces, AR image analysis is designed to look for distinctive features that help it build a 3D world parallel to your own. These features can be things like carpet patterns and the lines along the edges of furniture.

Step 2: Once the AR software identifies unique features and can track them frame to frame, the second part of the process takes place. Remember all those games you played where you tilted your phone to steer a race car? Those very same motion sensors are used to provide the AR software with precise motion data.
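AR frameworks read these sensors for you, but a minimal Core Motion sketch (the update rate and handler here are illustrative choices, not how any particular AR engine is configured) shows the kind of raw motion data available to them:

```swift
import CoreMotion

let motionManager = CMMotionManager()
motionManager.deviceMotionUpdateInterval = 1.0 / 60.0  // 60 Hz, matching a typical camera frame rate

motionManager.startDeviceMotionUpdates(to: .main) { motion, _ in
    guard let motion = motion else { return }
    // Rotation rate (radians/second) and gravity-removed acceleration (in g)
    // are the kinds of measurements an AR engine fuses with the camera feed.
    print("rotation:", motion.rotationRate, "acceleration:", motion.userAcceleration)
}
```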

So, as you move and tilt your device, the AR software combines that measured motion with the shifting positions of the same set of tracked points to compute how far each point is from the camera.

Let’s look at two example images to better illustrate. In the first image, our AR system has found two spots along the base of a partition. The large center circle is the camera or device center and the lines illustrate the distance in 3D space between the center and the points. In the second image, the device has rotated to the right, yet the AR system can still track those same points, and with math, it can determine their position based on how far the device actually moved.
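As a simplified illustration of that math (real AR engines fuse many measurements across many frames, but the core idea is triangulation): if the device moves sideways by a baseline b, and a tracked point shifts by d pixels in a camera whose focal length is f pixels, the point’s depth Z is roughly

```latex
Z \approx \frac{f \cdot b}{d}
```

For example, with a focal length of 1500 pixels, a 10 cm sideways move, and a 30-pixel shift, the point is about 1500 × 0.1 / 30 = 5 meters away.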

See It In Action with our Mobile Vision app

Now that we have a good understanding of how an AR engine works, let’s start with a simple demo to see what it can do. (A link to our Hanson Mobile Vision iOS App is at the bottom of the article.)

When starting an AR session, the engine first needs to build an understanding of the environment; this is why you have to move your phone around while pointing it at distinct surfaces. To guide the user through this process, we’ve provided some basic instructions.
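In ARKit terms (our demo app is built for iOS), kicking off a session takes only a few lines. A minimal sketch, assuming an ARSCNView named sceneView is already on screen:

```swift
import ARKit

// Start world tracking on an existing ARSCNView (assumed to be named sceneView).
let configuration = ARWorldTrackingConfiguration()
sceneView.session.run(configuration)

// ARSessionObserver callback: tracking stays limited until the engine has
// seen enough of the environment, which is why we prompt the user to move
// the phone around at the start.
func session(_ session: ARSession, cameraDidChangeTrackingState camera: ARCamera) {
    switch camera.trackingState {
    case .normal:              print("Tracking ready")
    case .limited(let reason): print("Still initializing: \(reason)")
    case .notAvailable:        print("Tracking unavailable")
    }
}
```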

[Video: feature points appearing as an AR session initializes in the Hanson Mobile Vision app]

In this demo, we render a yellow dot wherever the AR engine has discovered a feature point. When the AR session finishes initializing, you’ll see many feature points appear, and as you move your camera around, more and more are added. This continual feed of feature points lets the AR engine keep its 3D model of your environment, and the device’s place within it, up to date.
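Under the hood, ARKit exposes these points on every frame. A sketch of how an app like ours can read them (drawing the yellow dots is omitted here):

```swift
// ARSessionDelegate callback, fired for every processed camera frame.
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    guard let cloud = frame.rawFeaturePoints else { return }
    // Each point is a 3D position, in meters, in the session's world coordinates.
    print("Currently tracking \(cloud.points.count) feature points")
}
```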

Now that we have a variety of feature points, go ahead and tap near one. We’ll place a marker cube at this spot to show you where in the 3D space you tapped. We’ve also added a label to the cube showing its distance from the camera. Notice that if you simply rotate the camera around the marker, the distance doesn’t change, but if you get up and move around, it does.
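Here’s a sketch of how that tap-to-place interaction can be built with ARKit and SceneKit; the cube size and hit-test choice are illustrative, not necessarily what our app does internally:

```swift
import UIKit
import ARKit
import SceneKit

@objc func handleTap(_ gesture: UITapGestureRecognizer) {
    let location = gesture.location(in: sceneView)

    // Ask ARKit which feature point lies under the tapped pixel.
    guard let hit = sceneView.hitTest(location, types: .featurePoint).first else { return }
    let t = hit.worldTransform.columns.3

    // Place a 5 cm marker cube at that 3D position.
    let cube = SCNNode(geometry: SCNBox(width: 0.05, height: 0.05,
                                        length: 0.05, chamferRadius: 0))
    cube.position = SCNVector3(t.x, t.y, t.z)
    sceneView.scene.rootNode.addChildNode(cube)

    // Distance from the camera to the cube, in meters (ARKit's world units).
    if let camera = sceneView.session.currentFrame?.camera {
        let c = camera.transform.columns.3
        let meters = simd_distance(SIMD3(c.x, c.y, c.z), SIMD3(t.x, t.y, t.z))
        print(String(format: "%.2f m away", meters))
    }
}
```

Rotating in place changes the camera’s orientation but not its position, which is why the printed distance only changes when you physically move.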

What makes AR such an engaging experience is that you and your camera live in the real world and can move about. When the AR session was initialized, your camera stayed pretty much in one place. But once the system is running, you can move from that point without disrupting the alignment of virtual and real. While we move about, our marker cube appears to “stay in the same spot” which allows us to view all sides of it.

Next Feature: Plane detection and placing an object

Let’s move on to something a little more practical. Imagine you’re a furniture manufacturer. You know that if customers see your furniture in a store, they’ll fall in love. However, online sales lag for exactly that reason: seeing is believing. Let’s see what AR can do to solve this problem.

The key to custom AR applications is 3D content. In our imaginary furniture scenario, this is made possible by our imaginary furniture designer who works in a CAD system designing all aspects of the products. With content ready, we can turn to the flow of the AR app.

A new feature: While our first demo relied on feature points, we need to leverage another part of AR systems for our furniture to work: plane detection. When an AR system starts to notice many feature points along the same plane, it can alert the app that a flat surface, like a floor, exists. We will rely on this to prompt the user to place our virtual furniture in their AR environment on a level surface. AR tools can also look for vertical surfaces (walls).
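A sketch of how plane detection is switched on and observed in ARKit (horizontal only here; add .vertical to the set for walls):

```swift
// Enable plane detection when starting the session.
let configuration = ARWorldTrackingConfiguration()
configuration.planeDetection = [.horizontal]
sceneView.session.run(configuration)

// ARSCNViewDelegate callback: ARKit adds an ARPlaneAnchor once enough
// feature points line up on the same flat surface.
func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
    guard let plane = anchor as? ARPlaneAnchor else { return }
    print("Found a plane roughly \(plane.extent.x) x \(plane.extent.z) meters")
}
```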

Stepping through the app, we start by asking the user to move the phone to begin the AR scanning. After the AR engine tells us that it found a plane, we can prompt the user to place the chair. Because of the mapping math, the chair appears in scale with the real world. Our chair can be viewed from multiple angles. As an added bonus, the user can tap the chair and see a list of available color options for the chair.
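Here’s a sketch of the placement step, assuming the CAD model has been exported to a SceneKit file; “chair.scn” and the node name “chair” are hypothetical asset names:

```swift
func placeChair(at tapLocation: CGPoint) {
    guard let chairScene = SCNScene(named: "chair.scn"),  // hypothetical asset
          let chair = chairScene.rootNode.childNode(withName: "chair", recursively: true),
          // Hit-test against the planes the engine has already detected.
          let hit = sceneView.hitTest(tapLocation, types: .existingPlaneUsingExtent).first
    else { return }

    // The hit's worldTransform is in real-world meters, so a correctly
    // scaled model appears life-size on the detected floor.
    let t = hit.worldTransform.columns.3
    chair.position = SCNVector3(t.x, t.y, t.z)
    sceneView.scene.rootNode.addChildNode(chair)
}
```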

Obviously, this is just a demo and a real application would include options like: chair rotation and movement, multiple chair placement, and taking a picture to show others what the chair will look like in your room.

One More Feature: Interactivity

The previous demo helped potential customers decide on purchasing our chair. However, in this world of flat-pack shipping, customers who buy our chair still have to wrestle with assembly. While black-and-white, multilingual (or wordless) instructions have guided furniture assemblers for years, maybe AR can help us again.

Just as our chair is an assembly of manufactured parts, our 3D model is a collection of smaller 3D objects. These objects can be manipulated in the AR environment to walk users through assembly steps. Let’s see how that would work.

We begin again by asking our user to scan the area so the AR session can start. Once a flat surface is available, we ask the user to select a location to place the chair. With the chair in the user’s AR world, we can guide them through the assembly steps, starting with what the finished product looks like. Each step animates the relevant parts of the 3D model into place, and the circular arrow lets the user repeat a step if they need to review it. The end result is the assembled chair.
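One way a single step could be animated with SceneKit actions; this assumes the chairNode from the placement sketch above, and the part name and offset are hypothetical:

```swift
// Slide a named part of the chair model from a displaced starting
// position into its final, assembled position.
func animateStep(partNamed name: String, from offset: SCNVector3) {
    guard let part = chairNode.childNode(withName: name, recursively: true) else { return }

    // Start the part away from its resting place...
    let home = part.position
    part.position = SCNVector3(home.x + offset.x, home.y + offset.y, home.z + offset.z)

    // ...then ease it back into place over one second.
    let slideHome = SCNAction.move(to: home, duration: 1.0)
    slideHome.timingMode = .easeInEaseOut
    part.runAction(slideHome)
}

// Example: a (hypothetical) left leg drops in from 20 cm above.
animateStep(partNamed: "leg_left", from: SCNVector3(0, 0.2, 0))
```

Repeating a step is then just a matter of resetting the part and running the action again.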

While we took some assembly shortcuts for this demo, it’s easy to see that a 3D assembly tutorial made possible by AR is a serious upgrade over black and white instructions on paper.

What Else?

As mentioned earlier, there isn’t an industry standard that defines AR, and the technology is continuously evolving. As technology vendors look to expand their offerings, they will keep adding features that go beyond what we described above.

In our previous article about mobile vision, we mentioned how facial detection (looking for anything that looks like a face) is different from facial recognition (looking for a specific face). That nomenclature helps frame these other technologies, although you may see vendors labeling their specific offerings differently:

  • Image Recognition: The ability to look for a specific “flat image” in your environment. The developer includes an image of the item to be recognized in their application; often this is called a “marker”. For example, feed a picture of a chessboard to the AR engine, and when the user points their camera at a real (flat) chessboard, the application can display a virtual board in 3D space and populate it with chess pieces. (See the ARKit sketch after this list.)
  • Image Detection or Object Detection: This involves looking for objects (not flat images) in the camera feed. A Machine Learning (ML) system is used to understand what a particular object could look like when viewed from various angles. As an example, if you wanted an app to recognize coffee mugs, you would “teach” the system what a mug looks like by feeding it many images of coffee mugs. This builds a data model that the AR image analysis engine can use to match against any real coffee mug a user points their camera at. (Faces just happen to be built into these systems, which is why we all know about Snapchat lenses.)
  • 3D Object Detection: Taken to the next level, this works like the above detection of objects but starts with 3D data instead of 2D camera footage. The application has a dataset of what an object looks like in 3D, and the AR engine has the ability to recognize a real-world object of similar shape in its analysis of your environment.
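To make the first item concrete, here is a sketch of what image recognition looks like in ARKit (“AR Resources” is the conventional asset-catalog group name for marker images; the chessboard handling itself is left out):

```swift
// Load marker images from the asset catalog and ask the session to watch for them.
let configuration = ARWorldTrackingConfiguration()
if let markers = ARReferenceImage.referenceImages(inGroupNamed: "AR Resources",
                                                  bundle: nil) {
    configuration.detectionImages = markers
}
sceneView.session.run(configuration)

// ARSCNViewDelegate callback: fires when a marker is spotted in the camera feed.
func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
    guard let imageAnchor = anchor as? ARImageAnchor else { return }
    // The anchor's transform pins the flat image's position and orientation in
    // 3D space, where virtual content (say, chess pieces) could now be attached.
    print("Recognized marker: \(imageAnchor.referenceImage.name ?? "unnamed")")
}
```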

Download and Try it Yourself

The demos shown here are published in the iOS App Store for you to play with, found here: <https://apple.co/2sz49zl>.

Download our demo application and try the AR demos for yourself!

As we continue this series looking at other computer vision technologies, we’ll update the app so you can check out additional demos.

Thank you for joining us on this journey through mobile computer vision. Stay tuned for future articles exploring object detection, image recognition, and more!
