Estimating depth for YOLOv5 object detection bounding boxes using Intel® RealSense™ Depth Camera D435i

JITHIN MATHEW
6 min read · May 2, 2022


In this article, we’ll look at how to use the Intel® RealSense™ Depth Camera D435i to perform real-time depth estimation for detected objects. To accomplish this, we’ll need an Intel® RealSense™ Depth Camera D435i, as well as a PC or laptop running Windows, Linux, or macOS. The setup can also run on ARM-based computing systems such as the Jetson Nano and its variants, provided you build some of the packages from source. Although a GPU is ideal for object detection, this tutorial will run adequately on a CPU as well.

Link to the source code for this tutorial is provided at the end of this article.

Before getting into sensor handling and programming, let’s install the SDK: Intel® RealSense™ Software Development Kit 2.0. For a more in-depth understanding of the SDK, or to contribute to the source code, you can check https://github.com/IntelRealSense/librealsense. The code in the upcoming sections is implemented in Python 3 (3.9.6), so keep that in mind if you run into any problems. Additionally, the PyTorch version used for YOLOv5 is 1.9.1+cu111 (CUDA supported); higher versions tend to give errors, especially since I am using an older version of YOLOv5 (v3). The minimum system requirements to run Intel® RealSense™ Software Development Kit 2.0 are Ubuntu 16.xx or Windows 10.

I would recommend creating a virtual environment in Python for this project (let’s call it ‘realTimeDepth’).

python -m venv realTimeDepth

Activate the venv with the following command (Windows):

realTimeDepth\Scripts\activate
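
With the environment active, install the packages the rest of the tutorial relies on. The exact commands depend on your setup; in particular, the PyTorch line below matches the 1.9.1+cu111 build mentioned earlier, but the right command varies with your CUDA version, so treat these as a starting point rather than a fixed recipe:

pip install pyrealsense2 numpy opencv-python
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html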

Real-time depth or distance perception

Depth perception is an important part of robotics. It is defined as “the visual ability to perceive the environment in three dimensions” (3D). Whether you’re a robot, an animal, a human, or a cyborg, real-world perception is essential for navigating, responding to the environment, and accomplishing tasks in real time. While different entities use different methods to perceive depth, robots have to rely on artificial intelligence to perceive depth from 2-dimensional images, which can be an error-prone and difficult task. To address this problem, light detection and ranging (LiDAR) depth data is used in addition to visual perception to improve depth estimates.

Depth perception using Intel® RealSense™ Depth Camera D435i

The Intel® RealSense™ Depth Camera D435i (part of the Intel® RealSense™ D400 series of cameras) is an RGB-D camera, where RGB stands for the Red, Green and Blue color channels and D stands for Depth. At its maximum resolution, the camera outputs 1920 x 1080 RGB images and 1280 x 720 stereo depth images. Each pixel in a depth frame has a corresponding depth value that indicates the distance of that pixel (object/surface) from the camera. The camera also features a built-in IMU (Inertial Measurement Unit) that can measure movement in 6 degrees of freedom. The minimum -Z depth for the camera is 280 mm at 1280 x 720 resolution. At distances of up to 2 meters, over 80% of the field of view (region of interest, ROI), the absolute depth error for the D435i is ±2% and the RMS error is ≤2%.

Keep in mind that the depth sensor measures depth from its depth start point (depth origin), which is located 4.2 mm inside the D435i camera’s front cover glass; therefore, to get a precise depth measurement, 4.2 mm must be added to the depth value. Facing the camera lenses, the X, Y position of the depth origin is 17.5 mm to the right of the center of the camera. While working with this sensor, it is recommended to use USB 3, since lower versions can run into errors during data transfer or sometimes fail to work at all. This covers pretty much the basic information about the Intel® RealSense™ Depth Camera D435i sensor.

Now that we are familiar with the camera itself, let’s look at the code to access the sensor APIs and get depth values at the desired pixels. To begin, we will start streaming data with the following code:

import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
print("[INFO] Starting streaming...")
pipeline.start(config)
print("[INFO] Camera ready.")

The code above initializes the pipeline and configuration objects. We will stream both RGB and depth frames at a resolution of 1280 x 720 with a frame rate of 30 fps to keep things consistent. You can check additional configuration options on their GitHub or in the configuration documentation listed under Further research below. Finally, the configuration is passed to the pipeline.
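
Two practical details are worth noting, though they are not part of the snippet above: the depth and color streams come from physically separate sensors, so the SDK’s align helper can be used to map depth pixels onto the color image before you index them by bounding box coordinates, and the pipeline should be stopped when you are done. A minimal sketch (the variable names here are illustrative, not from the article’s code):

align_to_color = rs.align(rs.stream.color)  # map depth pixels into the color camera's frame

try:
    frames = pipeline.wait_for_frames()
    frames = align_to_color.process(frames)  # depth frame now lines up with the color frame
    # ... run detection and read depth values here ...
finally:
    pipeline.stop()  # release the camera cleanly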

Object Detection using YOLOv5

Instead of retraining from scratch, we will use pretrained checkpoints from Glenn Jocher’s Ultralytics GitHub repository for YOLOv5: https://github.com/ultralytics/yolov5. YOLOv5 is one of the most widely searched for and used object detection architectures; its current version at the time of writing is v6.1, the 8th release. For this tutorial, I will be using YOLOv5 v3 and, to simplify things further, we will use the COCO-trained weights from here (YOLOv5 small pretrained weights). If you are looking forward to training your model on custom data, here is a good start. For this tutorial, we will be using detect.py to perform object detection. To run this code, you will need Python ≥ 3.7 and PyTorch ≥ 1.7, in addition to the required packages mentioned in YOLOv5’s GitHub repo.

Figure: changes in Google search trends for object detection architectures over the past 3 years
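
As a quick sanity check that your environment can run the detector at all, the current Ultralytics repository also exposes the pretrained small model through torch.hub. This is not the v3 detect.py flow this article follows, just a convenient way to confirm that the weights load and inference works:

import torch

# Downloads the YOLOv5s COCO checkpoint on first use (requires internet access)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
results = model('https://ultralytics.com/images/zidane.jpg')  # any image path or URL works
results.print()  # prints a per-class detection summary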

Once you have the YOLOv5 repository on your local computer, all you need are the ‘models’ and ‘utils’ folders from the main project. To simultaneously handle object detection and the corresponding depth/distance of each detected object from the camera, we will make minor changes to the detect function in the detect.py script. To get color and depth frames from the camera (pythonically), we will use the following lines of code:

import numpy as np

frames = pipeline.wait_for_frames()  # block until a new frameset arrives
color_frame = frames.get_color_frame()  # RGB frame for detection
depth = frames.get_depth_frame()  # depth frame for distance lookups
color_image = np.asanyarray(color_frame.get_data())  # convert the color frame to a NumPy array

Once we obtain the color frames, we simply pass them to the object detection model, which runs the machine learning pipeline and returns the bounding box coordinates for each detected object along with its class. Once we have the coordinates for a detected object, say object 1, we can find the center of its bounding box (rectangle) using the following formula:

(x, y) = ((x1 + x2)/2, (y1 + y2)/2)
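
In code, assuming the detector yields each box as (x1, y1, x2, y2) pixel coordinates (as YOLOv5’s detect loop does after coordinate scaling), the center works out to something like:

x1, y1, x2, y2 = [int(v) for v in xyxy]  # bounding box corners from the detector (xyxy format assumed)
x, y = (x1 + x2) // 2, (y1 + y2) // 2    # pixel at the center of the bounding box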

From here on, it is pretty much straightforward. We will get the corresponding depth value for the pixel at the (x, y) coordinates of the image using:

zDepth = depth.get_distance(int(x),int(y))

SDK 2.0 returns the depth values in meters, which need to be converted to centimeters or inches, depending on which part of the world you live in.
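
The conversion is a simple scaling (the variable names below are just illustrative):

depth_cm = zDepth * 100.0      # meters to centimeters
depth_in = zDepth * 39.3701    # meters to inches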

Note that when performing object detection and accessing the depth of the pixel at the center of the bounding box, the value sometimes comes back as zero due to an ‘invalid depth pixel’, meaning the depth for that pixel is simply not available. To avoid this, you can average the depth values over a small patch of pixels around the center of the bounding box, as sketched below.
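
A minimal sketch of that idea, assuming the depth frame and the box center (x, y) from above; the patch size and the helper name are my own choices, not from the article’s code:

def patch_depth(depth_frame, x, y, half_size=3):
    """Average the non-zero depths in a (2*half_size+1)^2 patch centered at (x, y)."""
    w, h = depth_frame.get_width(), depth_frame.get_height()
    values = []
    for dy in range(-half_size, half_size + 1):
        for dx in range(-half_size, half_size + 1):
            px, py = int(x + dx), int(y + dy)
            if 0 <= px < w and 0 <= py < h:      # stay inside the frame
                d = depth_frame.get_distance(px, py)
                if d > 0:                        # skip invalid (zero) depth pixels
                    values.append(d)
    return sum(values) / len(values) if values else 0.0

zDepth = patch_depth(depth, x, y)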

As promised, my code for the current implementation is available on GitHub: https://github.com/jithin8mathew/Depth-estimation-using-Depth-Camera-D435i-for-YOLOv5-detected-objects.

Conclusion

Finding the depth of detected objects is essential for various applications, primarily in robotics. This tutorial is aimed at covering the first step for those working towards that goal. Good luck with your projects! If you find this project interesting, check out my GitHub repo for the source code, and while you are there, consider starring it.

Further research:

  1. https://github.com/IntelRealSense/librealsense/blob/master/wrappers/tensorflow/example1%20-%20object%20detection.py (for running inference with a TensorFlow model and getting the depth value)
  2. More depth camera configuration parameters: https://intelrealsense.github.io/librealsense/python_docs/_generated/pyrealsense2.config.html
  3. Datasheet for Intel® RealSense™ D400 series cameras: https://www.intelrealsense.com/wp-content/uploads/2022/03/Intel-RealSense-D400-Series-Datasheet-March-2022.pdf
