The big problems Magic Leap must solve in order to really deliver

Beau Cronin
Nov 8, 2014

Success will require breakthroughs in many areas of computer vision.

Magic Leap recently raised half a billion dollars, accompanied by a wave of hyperbole that stood out even by today’s inflated standards (go read this TechCrunch article to calibrate, but be prepared for comparisons to the Apollo program). Those few who have tried early prototypes — and, in many cases, gone on to pour money into the company — are quite sure it’s going to change everything. The company won’t say exactly what they’re building, but the basic outlines are clear. They’re keen to note that it’s not just AR — that is, not the modest augmented reality that has come to market and mostly failed in terms of commercial impact — but of course it is AR by any reasonable definition of the term.

So, coyness and secrecy notwithstanding (“we’re being a little tight-lipped”), I think we can piece together the basic outlines of the Magic Leap system. The core device is a set of transparent glasses — a “head-mounted display” of some sort — that uses trick optics and projection (“Dynamic Digitized Lightfield Signal”) to superimpose presumably arbitrary imagery over portions of the real world. Most intriguingly, the teaser videos and images released by the company suggest that the system will work in a wide range of environments — indoors and out; under natural, artificial, and mixed lighting; and in both intimate and large spaces. The goal is to provide a platform through which content creators — artists, storytellers, teachers, and developers — can interweave virtual people, animals, monsters, and vehicles with the physical environment.

This sounds amazing, but what are the technical challenges that must be addressed to deliver this experience? At the highest level, it decomposes into three problem areas:

  1. The system needs to collect information about the user’s surroundings, and interpret the structure and content of that environment. That is, it must go from streams of raw sensor data to a structured representation of shapes, materials, objects, and light sources.
  2. It needs to track the user’s head position and orientation at the very least; for some applications, it may also need to read the user’s gaze, body posture, gait, hand pose, etc.
  3. It must decide what to display and then draw that content to the HMD. This step is a combination of the graphics rendering work performed by 3D gaming engines or digital effects packages, and whatever fancy new projection technology is used by the HMD to deliver imagery to the eyes.

Crucially, all of this must be performed under very stringent latency and quality constraints, including maintaining high and uninterrupted refresh rates. While the VR community is still exploring the performance requirements needed to maintain a convincing sense of “presence” (see the Oculus Best Practices Guide for a snapshot of current thinking), a good rule of thumb is that changes in the physical world (a user’s head movement, a change in lighting, a physical object moving to occlude a virtual one, etc.) must be accurately reflected in the simulated content within a few tens of milliseconds. With any more lag, the human perceptual system notices the discrepancies, often at a subconscious level, and the sense of immersion and “magic” is lost. Fall behind, or make any number of subtler mistakes, and the experience will fail to convince.
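To make that budget concrete, here’s a rough sketch of the per-frame loop implied by the three problem areas above, with a simple check against a motion-to-photon target. Everything in it is illustrative: the function names, the 20 ms figure, and the dictionary-shaped data structures are my own placeholders, not anything Magic Leap has described.

```python
# Illustrative per-frame loop for an AR system of the kind described above.
# All component functions are placeholders; the point is the structure of the
# pipeline and the latency budget, not any particular implementation.
import time

FRAME_BUDGET_S = 0.020  # ~20 ms motion-to-photon target (a few tens of ms at most)

def read_sensors():
    """Placeholder: grab the latest camera/depth/IMU samples."""
    return {"rgb": None, "depth": None, "imu": None}

def update_scene_model(scene, sensors):
    """Placeholder for problem 1: mesh, materials, objects, lights."""
    return scene

def track_head(sensors):
    """Placeholder for problem 2: head pose (position + orientation)."""
    return {"position": (0.0, 0.0, 0.0), "orientation": (1.0, 0.0, 0.0, 0.0)}

def render(scene, head_pose):
    """Placeholder for problem 3: draw virtual content to the HMD."""
    pass

scene = {}
for frame in range(5):  # a real system would loop indefinitely
    start = time.monotonic()
    sensors = read_sensors()
    scene = update_scene_model(scene, sensors)
    head_pose = track_head(sensors)
    render(scene, head_pose)
    elapsed = time.monotonic() - start
    if elapsed > FRAME_BUDGET_S:
        print(f"frame {frame}: missed budget ({elapsed * 1e3:.1f} ms)")
```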

I don’t know much about the hardware needed for problem 3 (projection), so I won’t comment here except to note that the size of the recent investment round and the astonished commentary by those who have tried a demo suggest good progress on this front. For problem 2 (tracking), Magic Leap will face many of the same challenges as other VR systems like Oculus. These difficulties are considerable, but it’s entirely plausible that continued R&D will result in cheap, rock solid, performant tracking systems in the near future. The tracking on the Oculus DK2 is already pretty good, for example, though of course it does require that pesky external camera.
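For a flavor of what orientation tracking involves, here’s a minimal sketch of one standard ingredient: a complementary filter that fuses gyroscope and accelerometer readings into a pitch estimate. A real HMD tracker estimates full six-degree-of-freedom pose and fuses optical data as well; the numbers and axis conventions below are made up for illustration and aren’t specific to Magic Leap or Oculus.

```python
import math

def complementary_pitch(prev_pitch, gyro_rate_y, accel, dt, alpha=0.98):
    """
    Fuse a gyro angular rate (rad/s about the y axis) with an accelerometer
    reading (ax, ay, az, in g) into a pitch-angle estimate (radians).
    The gyro is accurate over short intervals but drifts; the accelerometer
    is noisy but drift-free, so it slowly corrects the integrated gyro.
    """
    ax, ay, az = accel
    accel_pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))
    gyro_pitch = prev_pitch + gyro_rate_y * dt
    return alpha * gyro_pitch + (1.0 - alpha) * accel_pitch

# Toy usage with made-up IMU readings: a slow gyro rotation plus a tilted
# gravity vector, sampled at 100 Hz.
pitch = 0.0
for _ in range(100):
    pitch = complementary_pitch(pitch, gyro_rate_y=0.01, accel=(0.17, 0.0, 0.98), dt=0.01)
print(f"estimated pitch: {math.degrees(pitch):.1f} degrees")
```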

That leaves problem 1, which is a real doozy. To deliver on the full promise of “cinematic reality”, the system needs to understand the environment, in real time, well enough to inject new elements that appear to be natural, legitimate participants of the scene. I think this means, essentially, solving all of computer vision. To see why, let’s break problem 1 down a bit more.

First, remember that the system, via the HMD and possibly other as-yet-unspecified sensor platforms, will be collecting sensory signals from the environment. There will certainly be cameras, whether traditional, infrared, or depth, as well as microphones, accelerometers, and so on. These streams form the raw data from which the system must interpret its surroundings, possibly in conjunction with information from maps and other pre-existing databases.

Image: a 3D mesh reconstruction example (http://web.mit.edu/manoli/crust/www/slides/piggy.jpg)

3D mesh reconstruction. Crudely speaking, this means translating the raw sensor streams into a “wireframe” version of the scene that describes the shapes and geometries, as well as the boundaries between objects. This information is fundamental for determining where virtual characters and objects can exist in the scene, how they can move while respecting the other occupants of the environment, and so on. Without this knowledge, virtual elements will need to be well separated from real objects, or risk appearing in impossible or unnatural locations.
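As a rough illustration of where this process starts, the sketch below back-projects a depth image into a 3D point cloud using the pinhole camera model; meshing algorithms such as TSDF fusion or Poisson reconstruction would then turn many such clouds, from many viewpoints, into surfaces. The camera intrinsics and the synthetic depth image are made up.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """
    Back-project a depth image (meters, shape HxW) into an (N, 3) array of
    3D points in the camera frame, using the pinhole model:
        X = (u - cx) * Z / fx,   Y = (v - cy) * Z / fy,   Z = depth[v, u]
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth reading

# Toy usage: a synthetic 480x640 depth image of a flat wall 2 m away, with
# plausible (made-up) intrinsics for a VGA-resolution depth sensor.
depth = np.full((480, 640), 2.0)
cloud = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (307200, 3)
```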

Image: surface texture examples from Columbia’s CAVE lab (http://www1.cs.columbia.edu/CAVE/projects/btf/)

Surface and texture. It’s one thing to understand the shape of things, but it’s also important to understand their textures and material composition — fabric, wood, metal, glass, rubber, and plastic all behave differently, especially with respect to how they interact with light sources. Without a good sense of these surface properties, the interactions between real and virtual elements will be constrained and unrealistic.
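A toy example of why this matters for rendering: under the classic Lambert-plus-Phong shading model, a matte surface and a glossy one respond to the same light very differently. The material parameters below (diffuse and specular weights, shininess) are exactly the kind of per-surface properties the system would need to infer; the vectors and values are made up.

```python
import numpy as np

def shade(normal, light_dir, view_dir, diffuse=0.8, specular=0.0, shininess=32):
    """
    Classic Lambert + Phong shading at a single surface point:
    diffuse term ~ max(N.L, 0), specular term ~ max(R.V, 0)^shininess.
    """
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    ndotl = max(np.dot(n, l), 0.0)
    r = 2.0 * ndotl * n - l                       # reflection of L about N
    spec = max(np.dot(r, v), 0.0) ** shininess if specular > 0 else 0.0
    return diffuse * ndotl + specular * spec

# Viewer placed near the mirror direction of the light, so a glossy material
# shows a strong highlight while a matte one does not.
n, l, v = np.array([0, 0, 1.0]), np.array([0.3, 0.3, 1.0]), np.array([-0.3, -0.3, 1.0])
print("matte fabric:", round(shade(n, l, v, diffuse=0.8, specular=0.0), 3))
print("glossy metal:", round(shade(n, l, v, diffuse=0.2, specular=0.8), 3))
```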

Object recognition. This is a matter of attaching labels and other metadata to different pieces of the wireframe and surface scene description. Conceptually, this is similar to the recognition performed by Google’s image search — though in a much messier, faster-moving, and less constrained setting. Object recognition, including identification of key subparts like handles, knobs, and hinges, will determine how virtual content can interact with and respond to the real world.
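For a sense of what the raw output of this step looks like, here’s a minimal sketch using an off-the-shelf COCO-pretrained detector from torchvision. This is not Magic Leap’s approach, and a detector like this is far too slow for the latency budget above; it only shows the shape of the information (labels, locations, confidences) that needs to be attached to the scene description.

```python
# Off-the-shelf 2D object detection with torchvision's COCO-pretrained
# Faster R-CNN. Purely illustrative of the output format; a random tensor
# stands in for a camera frame.
import torch
import torchvision

# (older torchvision versions use pretrained=True instead of weights=)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)  # stand-in for one RGB camera frame, values in [0, 1]
with torch.no_grad():
    (detections,) = model([frame])

for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.5:
        print(f"COCO class {int(label)} at {box.tolist()} (confidence {score.item():.2f})")
```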

Pose. For some uses, it will be important to understand where the real people are in the scene, what pose their body is in, and even their gait and trajectory. This problem is conceptually similar to the one solved by a system like Microsoft’s Kinect, but it will need to function in much more varied and cluttered environments — and with potentially many more people at a time. There’s a huge difference between grokking one or two people in a living room performing stereotyped movements and tracking a diverse, shifting crowd in a street or on a beach.
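Full skeletal pose estimation is well beyond a short sketch, but even the first step, finding the people in a frame, shows the gap between lab conditions and the wild. The sketch below uses OpenCV’s classic HOG pedestrian detector (Dalal and Triggs, 2005); it returns rough bounding boxes only, nothing like Kinect-style skeletons, and it degrades badly in exactly the crowded, cluttered scenes described above.

```python
import cv2
import numpy as np

# Classic HOG + linear-SVM pedestrian detector as shipped with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for one camera frame
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
for (x, y, w, h) in boxes:
    print(f"person at x={x}, y={y}, size {w}x{h}")
```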

Image: a lighting example (http://astheticsaloni.blogspot.com/2011/09/lighting.html)

Lighting inference. This is one of the most interesting and difficult problems: in order to produce realistic virtual content that “fits” into the real world, the system needs to figure out how to illuminate the objects. In other words, it needs to understand something about the light sources, and how those sources interact with the rest of the environment — including absorption, transparency and translucency, reflection, etc. Get this wrong, and the virtual content will be lit in a manner inconsistent with the rest of the scene, and things will feel subtly (or drastically) wrong.
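As a toy version of this inference, assume a single directional light and Lambertian surfaces: observed brightness is then roughly proportional to the dot product of the surface normal and the light direction, so the light can be recovered by least squares from normals (available from the mesh step) and pixel intensities. Real scenes have multiple lights, shadows, and decidedly non-Lambertian materials; the data below is synthetic.

```python
import numpy as np

def estimate_light_direction(normals, intensities):
    """
    Least-squares estimate of a single directional light under Lambertian
    shading: intensity ~= n . s, where s packs light direction and strength
    (with surface albedo folded into the strength).
    normals: (N, 3) unit vectors; intensities: (N,) observed brightnesses.
    """
    s, *_ = np.linalg.lstsq(normals, intensities, rcond=None)
    strength = np.linalg.norm(s)
    return s / strength, strength

# Toy usage: synthesize observations from a known light, then recover it.
rng = np.random.default_rng(0)
true_light = np.array([0.3, 0.5, 0.81])
true_light /= np.linalg.norm(true_light)

normals = rng.normal(size=(2000, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
lit = normals @ true_light > 0.1            # keep surface points that face the light
normals = normals[lit]
intensities = normals @ true_light + rng.normal(scale=0.01, size=normals.shape[0])

direction, strength = estimate_light_direction(normals, intensities)
print("estimated light direction:", np.round(direction, 2))
```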

And that’s just the beginning. In order for scripted virtual characters to interact well with real people, they’ll need to understand all kinds of unspoken social signals. Even a behavior that seems effortless, such as walking with a flow of pedestrians down a sidewalk, in fact requires the system to correctly interpret a complex set of implicit and culturally-defined cues.

Image: a breaching whale (https://www.napali.com/tours/whales/whale-behavior/)

Solving these problems is the difference between a baby elephant floating in the empty space above two cupped hands and a tarantula sitting right on your hand. Between a submarine floating well above a busy sunlit street, and a virtual character moving naturally and convincingly in the crowd of pedestrians and the flow of traffic on the street below. Or between a whale flying high over a beach full of people, and that same whale finding an appropriate place in the water to breach without “crushing” some unlucky humans. In short, these scene understanding capabilities will be crucial in enabling Magic Leap content to descend from the metaphorical and literal heights and instead become intimately entwined with the inherent drama of the physical world. Will these be truly blended realities, or will the real world serve as just a prettier stage for segregated virtual players?

So, is this possible? On the one hand, all of these problems have been tackled by academic and industry research communities for decades (see here for some examples of current performance in various subtasks). None of them is currently solved well enough to support the needs of Magic Leap. The company is clearly hoping to make significant progress by bringing together a large, excellent, well-funded team with clear goals; the $500 million question is whether this approach will pay off. Not only will the state of the art in these individual areas need to be significantly improved, but the various elements will need to be integrated into a single, affordable consumer-grade system that Just Works in a wide range of conditions.

The castAR glasses and backdrop

Of course, there are any number of ways to reduce the level of difficulty. Use blank backdrops to drastically simplify the scene (e.g., castAR). Significantly limit the types of content that can be injected into the scene (such as the informational displays and abstract elements of traditional AR systems). Restrict to scenes with known and/or simple light sources and matte surfaces. Require tagging or other instrumentation of real objects and scene elements to simplify recognition. Limit locations entirely to pre-mapped (probably indoor) spaces. Don’t allow real humans into the scene at all, or at least not unscripted ones.

Indeed, a close look at the collection of teaser images on the Magic Leap website suggests that they feel most comfortable with a clean separation between the virtual and real components of the scene: most of the simulated objects are either floating in space (submarine, whale, elephant, seahorses), or confined to relatively featureless regions of lawn or bed (monster, ballerina) — in other words, stages.

What, then, is the optimistic case for Magic Leap? In addition to the massive capital raise, the company has already assembled an impressive technical team that includes Gary Bradski, who wrote the book on OpenCV, and Jean-Yves Bouguet, the technical lead for Google Street View. Their job listings page is a good indicator of their ambition, with well-articulated postings for just the kind of scientists and engineers who could help address all of the problems listed above. Investors like Google are clearly hoping to replicate the success of projects like their own self-driving car and IBM’s Watson — that is, large-scale efforts that combined cutting-edge machine learning with massive compute power and systems integration to produce breakthrough capabilities that few thought were possible at the time.

But both the self-driving car and Watson are narrower successes than they might first appear. The cars, for example, are restricted to roads that have been meticulously mapped in advance; indeed, Google has turned all of Mountain View and surrounding areas into a “track” to enable its cars to drive there. While there’s no mention of similar restrictions in Magic Leap’s current messaging, I’ll be keeping an eye out for caveats as more details emerge.
