Dense depth: Using stereo vision to see complex city streets

By Team Five

Five Blog · Mar 21, 2019 · 5 min read

London. Paris. Berlin. Rome. Warsaw. Europe’s cities are both complex and diverse. It’s these differences that make them fascinating, drawing travellers and tourists from all around the world. Many of these cities grew up from medieval villages, hence the winding narrow streets, cobbles and all sorts of idiosyncratic ‘street furniture’. Buildings, gardens and public spaces are often positioned right up close to our roads. It’s a very different scene from the one you’ll find in US cities, where the infamous grid structure leads to more simplicity and uniformity.

Five is helping autonomy programs solve the industry’s greatest and most complex challenges. Our cities and streets demand a different, and far more rigorous, approach than US urban environments do. Our entire sensor suite is geared to meeting this challenge. Five’s cameras, lidar scanners, radar and GPS are all optimised to ensure we can capture a detailed 3D model of the scene around our car, and all the ‘objects’ it comprises. Stereo vision plays a crucial part, giving us ‘dense depth’ of the scene at hand.

Stereo vision unlocks dense depth

Stereo vision is well-studied outside the autonomous vehicles world. Researchers are drawn to it, not least because it’s the method our very own eyes use to construct 3D models of our environment. Within the autonomous vehicles industry, Five is unique in focussing so closely on stereo vision. The majority of developers are using lidar, which offers an accurate yet sparse picture.

So what exactly is this all-important dense depth that stereo vision makes possible? When we talk about dense depth, we mean we can take the video we capture from our car’s cameras and tell you exactly how far away everything (every pixel) is. This contrasts with the ‘sparse depth’ you get from lidar, where highly accurate depth calculations are achieved but only for a small fraction of the visible points in a video.
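To make that contrast concrete, here’s a minimal sketch of what the two kinds of output look like as data. The resolution and point count below are illustrative assumptions rather than Five’s actual sensor specifications: a dense depth map carries one distance per pixel, while a lidar sweep returns a comparatively small cloud of points.

```python
import numpy as np

# Illustrative numbers only: not Five's actual sensor specifications.
H, W = 1080, 1920                 # assumed camera resolution
N_LIDAR_RETURNS = 100_000         # assumed returns per lidar sweep

# Dense depth from stereo: one metric distance for every pixel in the frame.
dense_depth_m = np.zeros((H, W), dtype=np.float32)

# Sparse depth from lidar: a cloud of (x, y, z) returns, far fewer than pixels.
lidar_points_m = np.zeros((N_LIDAR_RETURNS, 3), dtype=np.float32)

print(dense_depth_m.size)         # ~2 million depth estimates per frame
print(lidar_points_m.shape[0])    # ~100 thousand returns per sweep
```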

These images show the sparse depth provided by lidar (left), compared to the dense depth stereo vision gives us (right).

Dense depth matters, because safety matters

As Five develops technology for complex city environments, dense depth is a must.

Why? First up, seeing in 3D is essential for accurately classifying what we see in a scene. Is it a car? A pedestrian? A cyclist? Low-hanging foliage? Or an advert on the back of a bus? We must know, not assume.

Secondly, ‘seeing’ with stereo vision’s dense depth allows us to better classify the entirety of a complex scene and its unpredictable road layouts.

In these environments, the sparse quality of other methods can lead to missing or indeterminate detail. There can be areas in the scene where it’s hard, or impossible, to determine the depth due to something blocking or partially blocking the sensors’ view of that area. For example, one car may be partially obscured behind another. These partial ‘occlusions’ make it harder to understand the scene semantically. What’s really going on? What’s the context? Dense depth gives us answers.

And third? Dense depth also helps us combat wet, grey European weather. It may seem a mundane topic, but it’s an existential one in the autonomous vehicles world. Rain and snow produce noise effects that obscure the scene. Dense depth allows us to better overcome this interference.

In unlocking accurate classification, making sense of partial occlusions, and coping with dodgy weather conditions, dense depth helps our cars stay safe on the roads.

How does it work?

The cameras on our car are set up in pairs, left and right, with both cameras in the pair pointing in the same direction. The difference in position of the cameras gives them slightly different views of the world. We use this difference to estimate the distance to objects in the scene.

Let’s do an experiment

Find a scene around you that’s going to stay fairly stationary for the next couple of minutes — one with some close objects, and others further away. Look at one of the objects that’s further away and, as you do so, cover one eye and then the other. Try to notice how the closer objects move from left to right as you switch eyes.

We calculate how far away an object is by measuring those changes in the horizontal position of close objects in relation to far objects. The larger the change in horizontal position, the closer the object must be. If an object hardly moves at all, it must be very far away.
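The relationship is a simple inverse one: depth is the focal length of the camera pair multiplied by the distance between the two cameras (the ‘baseline’), divided by the horizontal shift (the ‘disparity’). Here’s a worked sketch with made-up calibration values; Five’s actual camera parameters aren’t given in this post.

```python
# Made-up calibration values for illustration; not Five's actual camera parameters.
FOCAL_LENGTH_PX = 1200.0   # focal length of the rectified cameras, in pixels
BASELINE_M = 0.30          # horizontal separation between the two cameras, in metres

def depth_from_disparity(disparity_px: float) -> float:
    """Distance to a point whose image shifts by disparity_px pixels between cameras."""
    return FOCAL_LENGTH_PX * BASELINE_M / disparity_px

print(depth_from_disparity(36.0))  # 10.0 m: a large shift means a close object
print(depth_from_disparity(4.0))   # 90.0 m: a small shift means a distant object
```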

We calibrate our cameras so that, if an object is ‘infinitely far away’, it will appear in the same pixel in both the left and right cameras. By matching each pixel in the image from the left-hand camera with the pixel corresponding to the same object in the right-hand camera, we can measure the difference in the position of those two pixels in the two frames. We can then take this difference, known as the disparity, and calculate the distance to the object.
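As a rough illustration of that matching step, the sketch below uses OpenCV’s semi-global block matcher to estimate a disparity for every pixel of a rectified image pair and then converts it to metric depth. The file names, matcher settings and calibration values are placeholders, and this is not Five’s production pipeline.

```python
import cv2
import numpy as np

# Load a rectified stereo pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching: compares small windows along each row to find,
# for every pixel on the left, the best-matching pixel on the right.
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # search range in pixels; must be a multiple of 16
    blockSize=5,          # size of the window compared between the two images
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity_px = matcher.compute(left, right).astype(np.float32) / 16.0

# Convert every valid disparity into a metric depth (placeholder calibration).
FOCAL_LENGTH_PX, BASELINE_M = 1200.0, 0.30
valid = disparity_px > 0
depth_m = np.zeros_like(disparity_px)
depth_m[valid] = FOCAL_LENGTH_PX * BASELINE_M / disparity_px[valid]
```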

Calculating this dense depth is computationally intensive. In the same time it takes light to travel from a car we’re following to our camera sensor, our system performs around one hundred thousand calculations related to stereo depth. In a busy city, it’s this kind of intense, highly accurate sensor activity that will keep passengers, other drivers, pedestrians and property safe.
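For a rough sense of what that implies, assume a following distance of around ten metres (the post doesn’t specify one): light covers that gap in about 33 nanoseconds, so a hundred thousand calculations in that window works out to roughly three trillion operations per second.

```python
# Back-of-envelope only: the following distance is an assumption, not from the post.
SPEED_OF_LIGHT_M_S = 3.0e8
following_distance_m = 10.0
stereo_calculations = 100_000

light_travel_time_s = following_distance_m / SPEED_OF_LIGHT_M_S   # ~3.3e-8 s
ops_per_second = stereo_calculations / light_travel_time_s        # ~3e12 ops/s
print(f"{light_travel_time_s:.1e} s, {ops_per_second:.1e} ops/s")
```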

In the mix

As we’ve discussed in a previous post, our sensors are interconnected. We ensure that, at every opportunity, they work together to give us the greatest possible accuracy. With this in mind, it’s worth stressing that dense depth is crucial alongside, not instead of, the other methods that help us achieve depth.

We benefit hugely from the sparse depth we get from lidar and radar, as well as existing information we can access by accurately localising ourselves on an HD map. It’s all about a complementary mix. Teamwork is the name of the game, and stereo vision is an all-important player.
