Depth Estimation at KeepTruckin: Survey (part 1)

Ahmed Ali
Published in motive-eng
7 min read · Aug 30, 2019

KeepTruckin offers best-in-class, end-to-end fleet management solutions and Electronic Logging Devices for drivers, fleet managers, and fleets of all sizes. With its smart dashcam, it is building a culture of safety and trust. To make our smart dashcam even smarter, we are working on technology to estimate the distance to the vehicle in front. This will help drivers keep a safe following distance from vehicles ahead and thus reduce their chances of being involved in an accident.

An average accident in the US costs $200k and, even though passenger vehicles are at fault 80% of the time, commercial vehicle drivers are more likely to be blamed. Dash cameras can provide video evidence of what went wrong, and that evidence can be used to exonerate the driver. Another use case for dashcams is automatic monitoring of unsafe driving behaviors: detecting everything on the road and scoring the driver based on the decisions they took in a given situation. One such unsafe driving behavior is tailgating.

Tailgating:

Tailgating occurs when a driver does not keep a safe distance from the lead vehicle. According to US traffic safety guidelines, drivers must keep a headway of 3 s for cars and 5 s for trucks from the vehicle in front. Headway is a function of two parameters:

  1. The speed of the ego vehicle, and
  2. The distance from the lead vehicle.

Vehicle speed can be obtained from on-board sensors (such as On-Board Diagnostics (OBD) and/or GPS). The second parameter, however, i.e. the distance to the lead vehicle, remains a challenge to estimate because there is no standard on-board sensor for it. Beyond headway calculation, self-driving cars and advanced driver assistance systems (ADAS) also need distance estimation for forward collision warning (FCW) and for applying emergency brakes. Distance can be calculated, estimated, or predicted, and the most straightforward way is to use active sensors to calculate it.
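As a back-of-the-envelope illustration of the headway check described above, here is a minimal Python sketch. The function names, constants, and example numbers are ours for illustration only; the distance input is exactly the quantity the rest of this post is about estimating.

```python
# Minimal headway check: headway (s) = distance to lead vehicle (m) / ego speed (m/s).
# Thresholds follow the 3 s (car) / 5 s (truck) guideline mentioned above.

CAR_HEADWAY_S = 3.0
TRUCK_HEADWAY_S = 5.0

def headway_seconds(distance_m: float, speed_mps: float) -> float:
    """Time gap to the lead vehicle; effectively infinite when the ego vehicle is stationary."""
    if speed_mps <= 0:
        return float("inf")
    return distance_m / speed_mps

def is_tailgating(distance_m: float, speed_mps: float, is_truck: bool = True) -> bool:
    threshold = TRUCK_HEADWAY_S if is_truck else CAR_HEADWAY_S
    return headway_seconds(distance_m, speed_mps) < threshold

# Example: a truck at 25 m/s (~90 km/h) following 60 m behind the lead vehicle.
print(headway_seconds(60.0, 25.0))      # 2.4 s
print(is_tailgating(60.0, 25.0, True))  # True -> below the 5 s guideline
```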

Active Sensors:

Active sensors transmit radiation (usually near-infrared or infrared light) and then receive the reflected signal. The Time-of-Flight (ToF), i.e. the time it takes for the signal to return, carries the depth information for the Field-of-View (FoV), i.e. the distance of the objects in front.
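As a quick illustration of the ToF principle, the round-trip time maps to distance as d = c·t/2, since the signal travels to the object and back. The sketch below is ours and assumes the sensor reports the round-trip time directly:

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0  # propagation speed of the emitted light pulse

def tof_distance_m(round_trip_time_s: float) -> float:
    """Distance to the reflecting object: the pulse travels there and back, hence the divide by 2."""
    return SPEED_OF_LIGHT_MPS * round_trip_time_s / 2.0

# A pulse returning after ~0.5 microseconds corresponds to roughly 75 m.
print(tof_distance_m(0.5e-6))  # ~74.9 m
```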

Radars and lidars are the active sensors commonly used for distance calculation on road vehicles. Figure 1 shows a typical configuration of distance sensors on a self-driving car.

Figure 1: Typical vision and distance sensor configuration on a self-driving vehicle (Courtesy: Google)
Table 1: Comparison of active sensors

Lidars:

Lidars (Light Detection and Ranging) are popular in most self-driving systems as the primary range or distance sensor, alongside ultrasonic sensors, radars, and cameras. Lidars emit light pulses that bounce back from obstacles, and the sensor calculates the time each pulse takes to return. Lidars are known for their high resolution and accuracy and can sense depth accurately up to 100 meters. However, their performance is compromised in rain and fog, and they are expensive (sometimes costing more than the vehicle itself). Top-of-the-line lidars can cost up to $100k; the lidar used in the well-known KITTI dataset (Velodyne HDL-64E) costs around $100k and offers a range of 120 meters. With the advent of electric vehicles (EVs), lidar power consumption poses yet another challenge.

Radars:

Radars (Radio Detection and Ranging) use radio waves instead of light pulses. They are relatively cheap, and since radio waves are absorbed less than light, they can work at relatively longer distances (up to 250 meters). Radars are also robust to adverse weather conditions, but they suffer from low resolution and accuracy. Moreover, they perform sub-optimally in blind spots, which is generally compensated for by installing multiple sensors; that adds cost and complexity in sensor fusion. Radars also have high data bandwidth requirements and thus require object detection on the edge, which is a drawback because those chips cannot be easily updated.

Ultrasonic sensors or Sonars:

Just like radars and lidars, sonars are active sensors; however, they use ultrasonic sound waves to detect objects. Sound waves suffer from strong attenuation in thin media such as air and therefore cannot detect obstacles beyond a few meters. For this reason, sonars are used as parking sensors and are usually installed at the rear of a vehicle.

Passive sensors:

In contrast to active sensors, passive sensors, such as standard color and monochrome cameras, do not emit a signal of their own but receive ambient light. In this process, the appearance of the 3D world gets mapped onto a 2D canvas, sacrificing the depth of the scene. However, there are ways to recover the lost dimension with further processing. Camera-based solutions are far cheaper than active sensors but require varying levels of image processing.

Camera based solutions:

Since camera-based solutions cannot directly calculate the distance to obstacles, they can only estimate or predict it. Camera-based solutions also form a core component of ADAS systems. Estimating or predicting distance from color/monochrome images is generally known in the literature as depth prediction or scene depth estimation. Such systems can be grouped by the number of cameras and by the underlying method:

  1. Multiview or multi-camera systems
  2. Single-camera solutions
  3. Geometry-based solutions
  4. Learning-based solutions

Multiview systems employ multiple cameras to capture the surroundings and use camera geometry (usually stereo) to combine the information and estimate depth. The accuracy of these systems depends on the baseline of the stereo rig (the horizontal distance between the cameras, b in Figure 2). The wider the baseline, the better these systems can estimate depth accurately at longer distances.

Figure 2: Stereo Rig
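For a calibrated, rectified stereo rig like the one in Figure 2, depth follows from similar triangles as Z = f·b/d, where f is the focal length in pixels, b is the baseline, and d is the disparity of a matched point. A minimal sketch (the focal length, baseline, and disparity values below are illustrative, not from a real rig):

```python
def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from a rectified stereo pair: Z = f * b / d."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity -> point effectively at infinity
    return focal_px * baseline_m / disparity_px

# Illustrative values: 700 px focal length, 0.54 m baseline, 5 px disparity.
print(stereo_depth_m(700.0, 0.54, 5.0))  # 75.6 m
# One pixel of disparity error swings the estimate considerably at long range:
print(stereo_depth_m(700.0, 0.54, 4.0))  # 94.5 m
```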

Estimating distance with a single camera is significantly harder. A camera projects 3D world points into 2D image pixel coordinates. We can back-project a point by applying the inverse camera projection, but the result is a ray: a single point in the image maps to infinitely many points in the 3D world lying along that ray.
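To make the ambiguity concrete, here is a small sketch with a hypothetical pinhole intrinsic matrix: back-projecting a pixel yields only a ray direction, and every depth along that ray reprojects to the same pixel.

```python
import numpy as np

# Hypothetical pinhole intrinsics for a 1280x720 camera (focal length and principal point in pixels).
K = np.array([[700.0,   0.0, 640.0],
              [  0.0, 700.0, 360.0],
              [  0.0,   0.0,   1.0]])

def backproject_ray(u: float, v: float, K: np.ndarray) -> np.ndarray:
    """Inverse projection of a pixel: a direction in the camera frame, not a 3D point."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

ray = backproject_ray(800.0, 400.0, K)
# Any depth along the ray reprojects to the same pixel -- the ambiguity described above.
for z in (10.0, 50.0, 100.0):
    point = ray * (z / ray[2])  # scale the ray so the point sits at depth z
    print(z, point)
```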

Another challenge is that vehicle detection has to be extremely accurate to get a reliable distance, especially at longer ranges. Detecting a vehicle at a longer distance becomes harder because the vehicle's size (in pixels) becomes very small as it moves away, dropping below 10 pixels in a 720p image.

Figure 3: The pixel width of a vehicle (in a 720p image) drops below 10 px as distance increases

A single-pixel error in vehicle detection can cause a significant error in distance estimation because the relationship between pixel distance (the vertical distance of the vehicle from the bottom of the image) and the actual distance of the vehicle from the camera is not linear (Figure 3). The space in an image is limited and the real-world information is heavily quantized, which means that at larger distances, points several meters apart in the real world can map to adjacent pixels.
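To see why, under a flat-road assumption the distance to a road point is roughly Z ≈ f·H/(v − v0), where H is the camera height and v0 is the horizon row. The sketch below uses hypothetical camera parameters and shows how a one-pixel error that is negligible near the bottom of the image becomes several meters near the horizon:

```python
# Flat-road approximation: Z = f * H / (v - horizon_row)
# f: focal length in pixels, H: camera height above the road (both hypothetical here).
FOCAL_PX = 700.0
CAMERA_HEIGHT_M = 1.5
HORIZON_ROW = 360.0  # image row of the horizon in a 720p frame

def ground_distance_m(bottom_row: float) -> float:
    """Distance to a point on the road plane seen at the given image row."""
    dv = bottom_row - HORIZON_ROW
    if dv <= 0:
        return float("inf")  # at or above the horizon
    return FOCAL_PX * CAMERA_HEIGHT_M / dv

# One pixel near the bottom of the image vs one pixel near the horizon:
print(ground_distance_m(700) - ground_distance_m(701))  # ~0.01 m difference
print(ground_distance_m(371) - ground_distance_m(372))  # ~8 m difference
```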

Learning-based systems use data to train complex machine learning models that learn the mapping between image pixels and their corresponding depth. They can use lidar points projected onto an image to learn that relationship (such as DORN), or they may use stereo pairs for training and predict disparity maps from a single image. The disparity is then converted to distance using the camera focal length and the baseline of the stereo setup used in training (for example, Monodepth). These systems have their own challenges: they are computationally expensive and less robust to domain shift, and they only work reliably for the camera they were trained on. When the camera focal length or the position (pose or height) changes, depth prediction is compromised.

Figure 4
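For stereo-trained monocular models, the conversion from predicted disparity to metric depth is tied to the training rig's focal length and baseline, which is exactly why a change of camera breaks the prediction. A hedged sketch (the disparity map below is a random placeholder standing in for a network output, and the rig parameters are illustrative):

```python
import numpy as np

# Placeholder for a network-predicted disparity map (H x W, in pixels); a real model
# such as Monodepth would produce this from a single RGB frame.
pred_disparity_px = np.random.uniform(1.0, 50.0, size=(720, 1280))

TRAIN_FOCAL_PX = 700.0   # focal length of the camera used to collect the training stereo pairs
TRAIN_BASELINE_M = 0.54  # baseline of that training stereo rig

def disparity_to_depth(disparity_px: np.ndarray) -> np.ndarray:
    """Convert predicted disparity to metric depth using the *training* rig's geometry."""
    return TRAIN_FOCAL_PX * TRAIN_BASELINE_M / np.clip(disparity_px, 1e-6, None)

depth_m = disparity_to_depth(pred_disparity_px)
# The result is metric only insofar as the deployed camera matches the training camera.
print(depth_m.min(), depth_m.max())
```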

To address the challenges of both active and passive systems, hybrid systems have been proposed, and they are generally more reliable. Hybrid systems may combine various sensor modalities or algorithms. Most ADAS systems use multiple cameras and radars and fuse the information for a robust distance estimate. Single-view systems can also be improved with a hybrid framework of geometry- and learning-based solutions: for example, using a trained machine learning model to detect road constraints (e.g. lane lines), which is robust to occlusion, and then using the lanes as constraints (such as their known real-world width or prior knowledge of their geometry) to estimate camera pose and height. This gives a more reliable distance estimate for a point on the road plane in an image. Finally, localizing the vehicle on the road plane using object detection closes the loop, relating the estimated distances to obstacles on the road.
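As one concrete instance of such a geometric constraint: if a pair of detected lane markings spans w pixels at some image row and the real lane width W is known (roughly 3.7 m on US highways), the distance to that row is approximately Z ≈ f·W/w. A minimal sketch with hypothetical values:

```python
# Distance to a road row from the known real-world lane width and its pixel width at that row.
FOCAL_PX = 700.0
LANE_WIDTH_M = 3.7  # typical US highway lane width, used as a prior

def distance_from_lane_width_m(lane_pixel_width: float) -> float:
    """Z = f * W / w: similar triangles on the road plane."""
    if lane_pixel_width <= 0:
        return float("inf")
    return FOCAL_PX * LANE_WIDTH_M / lane_pixel_width

# The lane appears 260 px wide near the ego vehicle and ~35 px wide further up the image.
print(distance_from_lane_width_m(260.0))  # ~10 m
print(distance_from_lane_width_m(35.0))   # ~74 m
```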

Estimating and/or calculating distance continues to be a challenge, and it is pivotal to making driving safer. At KeepTruckin, we are actively working on this problem and have better than state-of-the-art results on the latest Lyft and other datasets. We will discuss our approach further in the following blog post, so be sure to watch this space. More importantly, we are hiring! You can join our team to work on similarly challenging and interesting problems and have your chance to create an impact.

