Building the ‘AR-Cloud’: Part Two - The Birth of Computer Vision

The basics of teaching a computer to recognise the world

Note: This is part two of a multi-part series on what we believe is THE biggest and most exciting challenge in computing today. Read the first part of this series on Medium here.

Professor Seymour Papert, co-inventor of the Logo Programming Language

It’s the summer of 1966.

A new professor, Seymour Papert, has just joined MIT’s Artificial Intelligence Laboratory (the forerunner of today’s CSAIL).

With a background in mathematics and psychology, Papert sets his students an ambitious task: to reconstruct the human visual system for computers. The challenges Papert expects his students to overcome include the ability for machines to ‘recognise patterns’, ‘identify objects’ and ‘describe regions in space’.

Fast-forward to 2018, over half a century later, and many of the problems Papert’s students set out to solve are still being researched today. For many people, the ‘Summer Vision Project’ of 1966 represents a major moment in the history of computing.

It was the birth of what we now call ‘computer vision’.

What is computer vision?

Computer vision is fundamentally the task of getting computers to understand images in the same way they currently understand text. What started as a summer project back in 1966 has now developed into a field of computing in its own right, within the broader category of artificial intelligence.

What is less well known is that many of us already benefit from computer vision today.

Snapchat’s ‘Lens’ filters

For example, when you open the Snapchat application and overlay a pair of bunny ears on your head, a computer vision algorithm will be running in the background to identify faces visible within the camera feed.

Many banks around the world use computer vision to ‘read’ your deposited cheque, working out who you want to pay and how much from your squiggly handwriting.

Oil and gas companies even use satellite imagery to monitor their competitors’ performance, predicting demand by identifying oil tanks from space and detecting whether they are empty or full.

However, the challenges Papert outlined back in 1966 are still not ‘solved’. Whilst we have come a tremendous way in the last half-century, computers are still unable to interpret images with the same confidence with which they can interpret words typed on a keyboard.

Because, here’s the thing: teaching computers to recognise the world is hard.

The problem arises because computers don’t perceive images in the same way humans do.

Computer vision vs human vision

Whereas we see a picture of a train, a computer just sees a bunch of 1s and 0s that represent the levels of light falling on a camera sensor.

It’s the role of computer vision to bridge this ‘semantic gap’, turning those 1s and 0s into something meaningful.
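To make the idea concrete, here is a minimal sketch of what a computer actually ‘sees’: a tiny, made-up greyscale image represented as a grid of brightness values. The numbers are trivial for a machine to process, but nothing in them says ‘train’, ‘face’ or ‘street’ — that interpretation is the semantic gap.

```python
# A tiny 4x4 greyscale "image": each number is a brightness value
# (0 = black, 255 = white). To a computer, this grid of numbers is
# all an image is.
image = [
    [  0,   0,   0,   0],
    [  0, 255, 255,   0],
    [  0, 255, 255,   0],
    [  0,   0,   0,   0],
]

# Purely numeric questions are easy for the computer to answer...
brightest = max(value for row in image for value in row)
print(brightest)  # 255

# ...but "what object is this?" requires mapping raw intensities to
# meaningful labels -- bridging the semantic gap is the core task of
# computer vision.
```

A real camera frame is just the same idea at scale: millions of these values per frame (three per pixel for colour), refreshed many times a second.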

Where am I?

At Scape, we’ve initially focused on one subfield of computer vision called ‘localization’: determining the position of a device in the world.

Specifically, we’re focused on providing centimetre-level location accuracy outdoors, at effectively unlimited scale.

As discussed in the previous post, accurately determining the position of a device is vital for many industries including augmented reality, self-driving cars and robotics.

Various types of ‘Fiducial Markers’

In the past, computer vision researchers simplified the task of recognising a camera’s location within a scene by relying on visual tags, or ‘fiducial markers’. These image markers, similar to QR codes, are easy for cameras to recognise thanks to their high contrast and simple patterns.

Having identified a visual marker, a camera system can then calculate where the camera must be in relation to that marker, based on the marker’s apparent shape and size in the image.
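The simplest slice of that calculation is distance. Under the standard pinhole-camera model, a marker of known physical size that appears smaller in the image must be further away. The sketch below illustrates just this one relationship (full pose estimation also recovers orientation from the marker’s perspective distortion); the function name and numbers are illustrative, not any particular library’s API.

```python
def distance_to_marker(focal_length_px, marker_size_m, marker_size_px):
    """Pinhole-camera estimate of how far away a fiducial marker is.

    focal_length_px: camera focal length, expressed in pixels
    marker_size_m:   the marker's real-world edge length, in metres
    marker_size_px:  the marker's apparent edge length in the image, in pixels
    """
    return focal_length_px * marker_size_m / marker_size_px

# A 10 cm marker that appears 50 px wide to a camera with an
# 800 px focal length is about 1.6 m away.
print(distance_to_marker(800, 0.10, 50))  # 1.6
```

Real systems such as OpenCV’s ArUco module solve the full version of this problem, recovering all six degrees of freedom (position and orientation) from the marker’s four corner points.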

Sadly for computer vision researchers, however, the world is not painted with predefined, easy-to-recognise image markers. The real world is far more complex.

Scenes change in appearance during the course of a day.

For one thing, the world outside is constantly changing. If you look out of the window, the scene you see probably looks very different from how it looked a few hours ago, at least from the perspective of a camera. Environmental changes, such as lighting conditions, can drastically alter how a scene appears to a computer.

Seasons and weather can also impact how a scene appears to a camera

Also consider how changing weather conditions, like rain, snow, sun or cloud, can alter the appearance of a location over time.

This combination of environmental factors is the reason why, if you capture a location using an existing augmented reality SDK such as ARCore or ARKit today, your device may not recognise the same scene when you revisit it just minutes later. Indeed, Apple’s ARKit documentation carries the specific notice below, warning that ARKit may be unable to recognise the same scene as environmental conditions change.

Apple’s warning about relocalizing ARKit over periods of time

This is because, using existing SDKs alone, the 1s and 0s your camera perceives look too different to be recognisable.

Turning the world into one gigantic, recognisable marker

Demonstration of Scape’s large-scale localization in-action

At Scape Technologies, we’ve developed our own method of describing the 1s and 0s a camera captures, in a way that is faster, more reliable and more robust than any other method demonstrated to date.

In the same way image markers allow camera devices to recognise a photo or graphic, we use our method of describing the world within our large-scale mapping and localization pipeline, allowing camera devices to be located with centimetre-level precision across an entire city. Unlike other approaches, we also describe your location in geographic coordinates, like a highly accurate, computer-vision-enabled GPS and compass.

What this means in practice for developers building augmented reality apps is that when an AR object is placed in the world using our SDK, the object is truly ‘anchored’ to real-world coordinates, visible for all to see days, weeks or even months after it was placed.
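To illustrate what ‘anchored to real-world coordinates’ might look like in data, here is a hypothetical sketch of a geographically anchored AR object. Every name here (`GeoPose`, `ARAnchor` and their fields) is an assumption for illustration, not Scape’s actual SDK API; the point is simply that a pose expressed in geographic coordinates resolves to the same physical spot on any future visit.

```python
from dataclasses import dataclass

# Hypothetical shape of a visual-positioning result: a pose in
# geographic coordinates rather than an arbitrary session-local frame.
@dataclass
class GeoPose:
    latitude: float     # degrees
    longitude: float    # degrees
    altitude_m: float   # metres above sea level
    heading_deg: float  # compass heading, 0 = north

# A virtual object pinned to that pose. Because the anchor is stored in
# world coordinates, any device that localizes itself can resolve it.
@dataclass
class ARAnchor:
    pose: GeoPose
    content_id: str

anchor = ARAnchor(
    pose=GeoPose(51.5033, -0.1196, 15.0, 270.0),
    content_id="virtual-signpost-01",
)
print(anchor.pose.heading_deg)  # 270.0
```

Contrast this with a session-local anchor, which is only meaningful relative to wherever the device happened to start tracking.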

Whilst the challenges Seymour Papert described back in 1966 are still being worked on today, we believe that accurate location is the ‘keystone’ to unlocking the potential of spatially-aware computing.

Our team are currently preparing research papers and benchmark data to share with the academic community, detailing some of the approaches we have taken in building our pipeline.

In the meantime, we are starting to roll out our computer-vision location services city by city — with an SDK supporting native iOS, native Android, Unity-for-iOS and Unity-for-Android.

If you are a developer working on AR applications that would benefit from accurate, persistent location and heading for city-scale AR experiences, please get in touch.

This article is part two of a series on building the AR or ‘Machine Perception’ Cloud. See part one here.

If you have questions you would like answered, please leave them in the comments and I will address them in future posts.

Edward is co-founder & CEO of Scape Technologies, a computer vision startup in London, working to build a digital framework for the physical world.

Follow Edward and the company on Twitter. We also have a sparky new Instagram page :)

Interested to learn more about Scape Technologies?

We send a newsletter every couple of months, making sense of the AR industry and sharing our progress.

Sign Up to our newsletter.