Going Deep

William L. Weaver
Published in TL;DR Innovation
4 min read · Feb 24, 2018

A New Algorithm for Robotic Vision

Cruising around in my Nissan 350Z with the ground-effect neon thumping to the soundtrack, and careening around curves at over 170 MPH in my V12 Aston Martin DB9, are high on the short list of things that assuage my daily commute amid snarled traffic in the real world. In a relatively short period of time, the hefty cathode ray tube in our family room has been transformed from the keyhole view of lackluster situation comedies and reality shows into a rich interface between my family’s senses and the minds of video game designers. The graphics displayed by our latest holiday acquisitions are visually stunning. The industry has moved beyond the stage of “almost-real” into “hyper-real”: a world wherein the most impossible camera angles are commonplace, the lighting is always perfect, and the frame acquisition rate is just right. There is plenty of science buried in this experimental setup, and researchers are developing the tools to uncover and leverage it.

Photo by David Travis on Unsplash

One such scientific mystery is how the binocular vision we use to navigate through our three-dimensional (3-D) environment has no difficulty extracting volumetric information from a two-dimensional (2-D) pattern emitted from the glowing phosphors coated on a piece of glass. I see two identical flattened images when using both eyes to look at the screen, yet I can schuss around obstacles at high velocity with ease. Even this simple test reveals that 3-D depth processing is more than optics; it must include some high-level image processing.

Professor Andrew Y. Ng and his research group at Stanford University are asking similar questions of robotic vision. Autonomous vehicles equipped with a phalanx of cameras, sensors, lasers, and radar are making their way through cluttered environments; however, Professor Ng is investigating lightweight, agile solutions formed around a solitary color video camera. It appears the key to extracting depth from a single 2-D monocular image involves the same techniques artists use to inject depth into their pieces, namely, texture, perspective, and focus. The Renaissance masters skillfully detailed the stitching and folds on the clothing of near subjects while purposefully reducing the scale, focus, and detail of objects in the distant background to produce life-like vistas on flat canvas. Short of developing a thinking machine that recognizes objects and their common size in perspective, Ng’s method extracts generic features from the digital image and transforms them into depth information.

The algorithm is based on a popular 2-D version of the Naïve Bayesian Classifier (NBC), known as a Markov Random Field (MRF), whose goal is to classify combinations of image pixel attributes into a range of depth values. Bayesian analysis permits an observed outcome to be related statistically to a collection of input observables. For example, atmospheric visibility can be related to input observables of temperature, humidity, time of day, and atmospheric pressure. Even though an exact deterministic equation connecting the inputs to the observed outcome may not be known, the NBC can generate the probability of a specific outcome given the known input values. The statistics are generated by “training” the NBC with a set of input/output pairs and are validated by comparing the output to a test set of additional input/output pairs. If there is no true relationship, the NBC performs very poorly on the test set; however, high-quality results can be obtained if a relationship does exist, even if it is not explicitly known.
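To make the train-and-validate idea concrete, here is a minimal sketch built on the atmospheric-visibility example above. It uses scikit-learn's Gaussian Naive Bayes classifier and entirely synthetic data; the feature names, the hidden relationship, and the library choice are illustrative assumptions, not details from Ng's work.

```python
# Minimal sketch: train a Naive Bayes classifier on input/output pairs and
# validate it on a held-out test set. All data here are synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Input observables: temperature, humidity, time of day, atmospheric pressure.
X = rng.normal(size=(1000, 4))

# A hidden relationship (unknown to the classifier) mapping the inputs to a
# visibility class: 0 = poor, 1 = fair, 2 = good.
latent = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000)
y = np.digitize(latent, bins=[-1.0, 1.0])

# "Train" with one set of input/output pairs, then validate on a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
nbc = GaussianNB().fit(X_train, y_train)

print("Accuracy on test pairs:", round(nbc.score(X_test, y_test), 2))
print("P(poor, fair, good) for one new input:", nbc.predict_proba(X_test[:1]).round(2))
```

Because the synthetic inputs really do determine the label, the classifier should score well above chance on the test set; shuffling the labels to break the relationship would drive the score back toward chance, mirroring the failure case described above.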

Ng’s group collected image/depth pairs using a small 1704 x 2272 pixel color digital camera and a one-dimensional laser range finder mounted on a translation stage to measure the true depth of each image at a resolution of 86 x 107. The MRF was trained using 75 percent of the pairs and validated using the remaining 25 percent. The digital images were segmented into small pixel cells and correlated with filter patterns designed to classify texture variations, texture gradients, haze, and edge orientation, resulting in 34 unique local input observables for each cell. The cells were also compared to their nearest neighbors at multiple resolutions to capture global information, combining each cell’s 34 features with those of 18 related cells for a set of 646 input observables per cell. The trained MRF was used to predict depth in test images of both indoor and outdoor locations and was determined to have an average error of 35 percent, meaning an obstacle 10 meters away would appear to the algorithm to lie roughly between six and 14 meters away. At a 10-Hz frame rate, an autonomous robot would have adequate time to avoid the obstacle even with this uncertainty.
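That paragraph compresses a lot of machinery, so the toy sketch below shows only the overall shape of such a cell-based pipeline. The cell size, the three-filter bank, and the plain linear regressor standing in for the trained MRF are all assumptions made for illustration, and the image/depth data are synthetic rather than the Stanford dataset.

```python
# Toy sketch of a cell-based depth pipeline: filter the image, summarize filter
# responses per cell, and fit a regressor from cell features to measured depth.
import numpy as np
from scipy.signal import convolve2d
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for an image/depth pair: a grayscale image and a coarse
# "laser" depth map holding one true depth value per cell.
CELL = 16
H, W = 128, 176
image = rng.normal(size=(H, W))
depth = rng.uniform(1.0, 80.0, size=(H // CELL, W // CELL))

# A tiny filter bank standing in for the texture, gradient, and haze filters.
filters = [
    np.array([[1, 0, -1]]),        # horizontal gradient
    np.array([[1], [0], [-1]]),    # vertical gradient
    np.ones((3, 3)) / 9.0,         # local average (haze-like smoothing)
]

# Filter the whole image once, then summarize response energy inside each cell.
responses = [convolve2d(image, f, mode="same") for f in filters]

features, targets = [], []
for i in range(H // CELL):
    for j in range(W // CELL):
        cell_feats = []
        for r in responses:
            patch = r[i * CELL:(i + 1) * CELL, j * CELL:(j + 1) * CELL]
            cell_feats.append(np.sum(np.abs(patch)))   # absolute-energy feature
            cell_feats.append(np.sum(patch ** 2))      # squared-energy feature
        features.append(cell_feats)
        targets.append(depth[i, j])

X, y = np.array(features), np.array(targets)

# Train on 75 percent of the cell/depth pairs, validate on the remaining 25 percent.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rel_error = np.mean(np.abs(pred - y_test) / y_test)
print(f"Mean relative depth error on held-out cells: {rel_error:.0%}")
```

On real image/depth pairs, the multi-resolution neighbor comparisons would append the related cells' features to each row of X before training, which is what brings the per-cell vector up to 646 observables.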

The one-camera system has dramatically reduced the amount of hardware required to provide depth information and can also determine distances five to 10 times beyond the dynamic range of many triangulating two-camera systems. The algorithm has been used by a small radio-controlled car to navigate through a cluttered, wooded area autonomously for several minutes before crashing. Further enhancements may one day enable the development of autopilot systems for automobiles. But that would drastically reduce my enjoyment of video games.

This material originally appeared as a Contributed Editorial in Scientific Computing 23:4 March 2006, pg. 14.

William L. Weaver is an Associate Professor in the Department of Integrated Science, Business, and Technology at La Salle University in Philadelphia, PA USA. He holds a B.S. Degree with Double Majors in Chemistry and Physics and earned his Ph.D. in Analytical Chemistry with expertise in Ultrafast LASER Spectroscopy. He teaches, writes, and speaks on the application of Systems Thinking to the development of New Products and Innovation.
