OpenCV Spatial AI Competition Progress Journal — Part I

Robotic platform setup + Getting started with OAK-D + Point Clouds from depth maps

Deepak Talwar
The Inverse Project

--

Introduction

OpenCV, the popular open-source computer vision library, announced its very first Spatial AI Competition sponsored by Intel. With the launch of the OAK-D (OpenCV AI Kit with Depth) module, OpenCV called for participants to solve real-world problems using it. The OAK-D module has built-in stereo cameras along with an RGB camera. It also ships with a powerful vision processing unit (Intel's Myriad X) to enable deep neural network inference on board.

Excited by the potential of this sensor, we submitted a project proposal for this competition back in July. Our group's proposal was one of the 32 selected out of 235 submissions, and we are officially joining the competition! We will post a series of blog posts to journal our progress and share our thoughts and challenges throughout development.

Our proposal: Using Spatial AI to Detect Indoor Landmarks and Enable Graph-Based Pose-Landmark SLAM

The proposal we submitted for this competition involves using the new OAK-D sensor to solve the indoor pose-graph SLAM (Simultaneous Localization and Mapping) problem, with the addition of pose-landmark constraints. OAK-D's ability to not only detect objects but also provide their relative 3D locations makes solving this problem uniquely possible. In general, pose-graph SLAM is an optimization problem that takes multiple sources of pose estimates of a mobile agent and aims at minimizing the error between these estimates by adjusting them. This optimization allows us to localize the agent in its environment, and at the same time lets us stitch the robot's sensor measurements together to create a map representation of the environment.

Figure 1: Flowchart of the proposed project.

Most implementations of indoor SLAM use the following two sources of pose-estimations:

  1. Wheel odometry
  2. Rotating laser scanners (aka 2D LiDAR)

In our implementation, we aim to use wheel odometry as the first source but replace the 2D LiDAR scanner with OAK-D. Figure 1 shows the flowchart of our proposed implementation. Here’s how OAK-D uniquely enables us to solve this problem:

  1. Source of pose-pose constraints: Typically, laser scans from 2D LiDARs are used to estimate pose-pose constraints in a pose-graph via alignment algorithms such as Iterative Closest Point (ICP). An informal way to think about ICP: given two laser scans taken from two different locations, how would you need to translate and rotate one scan so that the two align? That translation and rotation give us an estimate of the pose-pose constraint. We wish to use a similar approach with OAK-D. To do this, we will create point clouds from the OAK-D depth maps and then align them with ICP (see the short sketch right after this list). Because we can leverage full 3D information, this should give us richer and more accurate pose-pose constraints.
  2. Source of pose-landmark constraints: The main benefit of using OAK-D, however, is that it can do 3D object detection directly on the edge! This uniquely allows us to detect objects using a deep learning-based object detector and use them as landmarks in the pose-graph optimization problem. What counts as a landmark, you might wonder? Any stationary object in the environment that is observed in multiple OAK-D measurements can be used as a landmark, since its location in the world coordinate frame is unchanging while the relative location of the agent with respect to it keeps evolving. Repeated observations of the same object allow us to create pose-landmark constraints in which the locations of the landmarks are considered fixed. These landmarks can be used as “anchor” points for the entire indoor map. Stay tuned for more on this in future updates.
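We will cover ICP in detail in a future post, but to give a flavour of what “aligning two point clouds” means in practice, here is a minimal sketch using Open3D as the registration library. This is only an illustration, not our final implementation: the library choice, the function name, and the parameter values (voxel size, correspondence distance) are all assumptions, and the o3d.pipelines.registration module path depends on the Open3D version.

import numpy as np
import open3d as o3d

def pose_pose_constraint(source_pcd, target_pcd, voxel=0.05):
    """Estimate the rigid transform that aligns two point clouds (a pose-pose constraint)."""
    # Downsample both clouds so ICP stays tractable on Jetson-class hardware.
    src = source_pcd.voxel_down_sample(voxel)
    tgt = target_pcd.voxel_down_sample(voxel)
    # Point-to-point ICP; the identity initial guess could be replaced by wheel odometry.
    result = o3d.pipelines.registration.registration_icp(
        src, tgt,
        2 * voxel,        # max correspondence distance (metres)
        np.identity(4),   # initial guess for the relative transform
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 homogeneous transform between the two poses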

With these two sources, we will be able to create a graph that contains both pose-pose and pose-landmark constraints. We will then use graph-optimization techniques to optimize the poses. Once the optimized poses are available, we simply stitch together the OAK-D observations (RGB images, depth maps, or point clouds) to create the indoor map.
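To make the graph structure a little more concrete, below is a toy 2D example written with the GTSAM 4 Python bindings as the optimization backend. This is purely illustrative: we have not committed to a particular solver, all numbers are made up, and in 2D a pose-landmark constraint is expressed here as a bearing-range measurement, whereas in our 3D case it would come from the object's relative position reported by OAK-D.

import numpy as np
import gtsam
from gtsam.symbol_shorthand import X, L  # X(i): robot poses, L(j): landmarks

graph = gtsam.NonlinearFactorGraph()

# Noise models (standard deviations are made-up illustrative values).
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))
lm_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05, 0.2]))

# Anchor the first pose, then add a pose-pose constraint (e.g. from odometry or ICP).
graph.add(gtsam.PriorFactorPose2(X(1), gtsam.Pose2(0, 0, 0), prior_noise))
graph.add(gtsam.BetweenFactorPose2(X(1), X(2), gtsam.Pose2(1.0, 0.0, 0.0), odom_noise))

# Pose-landmark constraints: the same landmark observed from both poses.
graph.add(gtsam.BearingRangeFactor2D(X(1), L(1), gtsam.Rot2.fromDegrees(45), 1.5, lm_noise))
graph.add(gtsam.BearingRangeFactor2D(X(2), L(1), gtsam.Rot2.fromDegrees(90), 1.1, lm_noise))

# Initial guesses for every variable, then optimize poses and landmarks jointly.
initial = gtsam.Values()
initial.insert(X(1), gtsam.Pose2(0.0, 0.0, 0.0))
initial.insert(X(2), gtsam.Pose2(0.9, 0.1, 0.0))
initial.insert(L(1), gtsam.Point2(1.0, 1.1))
result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()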

Robotic Platform Setup — Jetson Nano + TurtleBot3 Waffle + OAK-D

Now that we have provided an overview of the proposed tasks, let's get into the implementation details and what we have achieved so far. The first thing we need is an indoor mobile robot on which we can mount the OAK-D. This robot must meet the following criteria:

  1. It should provide a source of wheel odometry: While we could add this ourselves, it is not the main task of this project, so it is better to use a platform that already has motors with encoders.
  2. It should support the Robot Operating System (ROS): We aim to use ROS as the backbone of this project, so a robot that supports ROS is a must.
  3. It should be able to power the OAK-D and a single-board computer (SBC).

We chose to modify a TurtleBot3 Waffle to fulfill these requirements. The reason we went with a TB3 Waffle is simple: it is what we had on hand! Plus, the TB3 Waffle checks pretty much all of the boxes here. Here is a summary of what it provides and what we needed to change:

  • TB3 has odometry: TB3 is equipped with DYNAMIXEL motors with built-in encoders that provide wheel odometry.
  • TB3 comes with the OpenCR embedded platform, which can provide enough power for our SBC of choice (NVIDIA Jetson Nano) as well as the OAK-D.
  • By default, TB3 comes with a Raspberry Pi 3 B+; however, we want to use a Jetson Nano as our SBC because we would like more performance headroom for this problem.
  • We removed the 2D laser scanner that comes with TB3.
  • We will install OAK-D on the TB3 platform.

Figure 2 shows the modified robotic platform with the Jetson Nano already installed and configured to run with the TB3. We haven't mounted or extrinsically calibrated the OAK-D just yet.

Figure 2: Our robot platform setup. We replaced the Raspberry Pi on the Robotis TurtleBot3 Waffle Pi platform with the NVIDIA Jetson Nano.

Using Jetson Nano as the single-board computer on TurtleBot3

As mentioned earlier, TurtleBot3 comes with a Raspberry Pi 3 B+ by default. Luckily, setting up the Jetson Nano as the SBC for the TB3 is not very challenging. The following steps were required to get the Jetson Nano and the OpenCR board talking over ROS.

Installing/Updating firmware on the OpenCR board

  1. The OpenCR embedded platform board is responsible for powering and communicating with the DYNAMIXEL motors. This means that it reads the encoder values from the motors and generates odometry messages, and it receives velocity commands and forwards them to the motors; both functions are crucial to this project. We followed the instructions provided here to update the firmware of the OpenCR board.

Installing TurtleBot ROS stack on Jetson Nano

  1. Since this Jetson Nano had been used in other robotics projects, it already had ROS Melodic installed. However, to get it working with the OpenCR board, we needed to set up the Nano as if it were the brains of a TurtleBot3. To do this, we set up the TB3 ROS stack on the Jetson Nano in a new ROS workspace. This can be done by simply running sudo apt install ros-melodic-turtlebot3. Now we can connect the Jetson Nano to the OpenCR board over micro-USB.
  2. Bring up the TB3 stack using roslaunch turtlebot3_bringup turtlebot3_robot.launch, and we are ready to receive odometry from, and send velocity commands to, the TB3 from the Jetson Nano.

Receiving Odometry and sending velocity commands

  1. We can launch the teleoperation node to control the motors and see updates to odometry. This can be done using roslaunch turtlebot3_teleop turtlebot3_teleop_key.launch
  2. To see streaming odometry values, we used rostopic echo /odom. A sample odometry message looks like the following:
---
header:
  seq: 987
  stamp:
    secs: 1597031283
    nsecs: 652199083
  frame_id: "odom"
child_frame_id: "base_footprint"
pose:
  pose:
    position:
      x: -0.147986590862
      y: -0.0244562737644
      z: 0.0
    orientation:
      x: 0.0
      y: 0.0
      z: 0.179800972342
      w: 0.983703017235
  covariance: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
twist:
  twist:
    linear:
      x: -0.11722844094
      y: 0.0
      z: 0.0
    angular:
      x: 0.0
      y: 0.0
      z: 0.217905953526
  covariance: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
---
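Echoing the topic is handy for a quick check, but downstream nodes in our pipeline will eventually subscribe to /odom programmatically. A minimal rospy listener might look like the sketch below (the node name and log format are just placeholders; on Melodic this runs under Python 2, but the code is the same in Python 3):

import rospy
from nav_msgs.msg import Odometry

def odom_callback(msg):
    # For a planar robot, the x/y position and the yaw quaternion components are the interesting parts.
    p = msg.pose.pose.position
    q = msg.pose.pose.orientation
    rospy.loginfo("x=%.3f y=%.3f qz=%.3f qw=%.3f", p.x, p.y, q.z, q.w)

if __name__ == "__main__":
    rospy.init_node("odom_listener")
    rospy.Subscriber("/odom", Odometry, odom_callback)
    rospy.spin()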

With this, the setup of our SBC and TB3 is done. The next step would be to add OAK-D to this setup. This will be discussed in a future post.

Getting depth maps from OAK-D

The OAK-D module consists of three cameras: two monochrome global-shutter cameras used for depth perception, and one RGB camera. OAK-D provides two different approaches to estimating depth. The first method computes a depth map from the left and right cameras using the Semi-Global Block Matching (SGBM) algorithm, while at the same time running neural inference on one camera (left, right, or RGB) to perform object detection. The depth map is then used to locate detected objects in the camera's frame of reference. The second method runs neural network inference on both the left and right cameras in parallel and estimates disparities between the two images; the detected features are then triangulated using the camera intrinsics to calculate their 3D positions. In this project, we are going to use the first method, as it simultaneously provides us with both a dense depth map (used to create point clouds) and 3D object detections (used as landmarks in our SLAM problem).

Figure 3: OAK-D module with two monochrome cameras used for depth, and one RGB camera in the middle.
Figure 4: Sample depth map from the OAK-D camera. As shown, it can detect objects (here, a person) and can also report their 3D locations in the camera coordinate frame.

Before we start streaming depth maps from the OAK-D, we first need to obtain the intrinsic camera calibration matrix, which is essential for converting the depth maps into point clouds (as explained in the next section). We were able to obtain the intrinsic parameters for our OAK-D by following the stereo calibration tutorial in the official DepthAI documentation for the OAK-D.

The DepthAI API used to stream data from the OAK-D is provided in Python 3 and C++. Since our ROS version (Melodic) does not support Python 3, we decided to use the C++ API. We followed the example provided here, and after some trial and error (and with help from the DepthAI developers), we were able to build a program capable of streaming depth maps from the OAK-D. These depth maps will be published over a ROS topic for downstream nodes to consume. In the next few weeks, we will be working on improving the quality of these depth maps.
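For readers who want to experiment outside of ROS, the idea of the depth stream looks roughly like the Python sketch below. Note that this is only an illustration: our actual node is written with the C++ API, and the class and method names here follow the newer pipeline-style DepthAI Python API, which may differ from the API version you have installed.

import depthai as dai

# Build a pipeline: two mono cameras feeding a StereoDepth node, depth sent out over XLink.
pipeline = dai.Pipeline()
left = pipeline.create(dai.node.MonoCamera)
right = pipeline.create(dai.node.MonoCamera)
left.setBoardSocket(dai.CameraBoardSocket.LEFT)
right.setBoardSocket(dai.CameraBoardSocket.RIGHT)

stereo = pipeline.create(dai.node.StereoDepth)
left.out.link(stereo.left)
right.out.link(stereo.right)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("depth")
stereo.depth.link(xout.input)

with dai.Device(pipeline) as device:
    q = device.getOutputQueue(name="depth", maxSize=4, blocking=False)
    depth_frame = q.get().getFrame()  # uint16 numpy array, depth in millimetres
    print(depth_frame.shape, depth_frame.dtype)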

Converting depth maps to point clouds

One of the benefits of using the OAK-D is that it can produce raw depth maps. Each pixel of a depth map contains the estimated depth at that pixel, i.e., the distance of the corresponding 3D point along the camera's Z-axis. Normally, when we know a camera's intrinsic calibration matrix (we call it K), we can project 3D points from the scene onto the 2D image plane. This process is called perspective projection.

The 3×3 intrinsic calibration matrix has the form

K = | fᵤ  0   oᵤ |
    | 0   fᵥ  oᵥ |
    | 0   0   1  |

and is composed of four different parameters:

  1. fᵤ and fᵥ are the horizontal and vertical focal lengths. They measure the distance from the camera's center (pinhole) to the image plane, expressed in pixels.
  2. oᵤ and oᵥ are the coordinates of the principal point, i.e., the optical center of the image plane.

Given a 3D point with coordinates (X, Y, Z) in the camera coordinate frame, we can retrieve its pixel location when projected onto the image plane with the following equations:

u = fᵤ · (X / Z) + oᵤ
v = fᵥ · (Y / Z) + oᵥ

The above equations tell us that we can also inverse-project a 2D pixel with coordinates (u, v) into 3D space if we know the intrinsic calibration matrix and the depth, i.e., the Z coordinate of the pixel. Solving for X and Y, we get the following equations:

X = (u − oᵤ) · Z / fᵤ
Y = (v − oᵥ) · Z / fᵥ

In order to construct a point cloud consisting of N points (where N is the number of pixels in the 2D depth map), you can simply visit every pixel in the depth map and compute the inverse projection (X, Y, Z) for that pixel. However, visiting every pixel using nested for-loops can be quite costly. Therefore, we can use linear algebra to solve the same problem in a more computationally efficient fashion.

Writing each pixel in homogeneous coordinates, the inverse projection can be expressed compactly as

[X, Y, Z]ᵀ = Z · K⁻¹ · [u, v, 1]ᵀ

With this equation, all we have to do is find the inverse of the intrinsic calibration matrix and perform a matrix multiplication, applied to all pixels at once. Based on the equations above, we can derive the elements of K⁻¹:

K⁻¹ = | 1/fᵤ  0     −oᵤ/fᵤ |
      | 0     1/fᵥ  −oᵥ/fᵥ |
      | 0     0     1      |
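As a concrete illustration, here is a minimal NumPy sketch of this vectorized inverse projection. The function name is ours, and it assumes the depth map has already been converted to metres (OAK-D reports depth as uint16 millimetres, so scale accordingly):

import numpy as np

def depth_to_point_cloud(depth, K):
    """Inverse-project an H x W depth map (in metres) into an N x 3 point cloud."""
    H, W = depth.shape
    # Build the pixel grid in homogeneous coordinates: one (u, v, 1) column per pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # shape 3 x N

    # rays = K^-1 [u, v, 1]^T gives ((u - ou)/fu, (v - ov)/fv, 1) for every pixel.
    rays = np.linalg.inv(K) @ pixels
    # Scaling each ray by its depth Z yields (X, Y, Z) for every pixel.
    points = rays * depth.ravel()
    return points.T  # N x 3 array of (X, Y, Z)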

Figure 5: Sample point cloud generated from a depth map from OAK-D

This process enables us to convert incoming depth maps from the OAK-D into point clouds. A sample point cloud generated using this process is shown above in Figure 5. It was generated from a raw depth map without any subsampling or cleaning, which is why the quality is not great. This will be improved upon in the future.

Our next step on point cloud registration using the Iterative Closest Point (ICP) algorithm will be discussed in a future blog post.

This concludes Part I of our journal. Be on the lookout for future updates in the coming weeks! Please feel free to comment on this post and reach out to us over on LinkedIn.

Reach out to us

  • Deepak Talwar (LinkedIn) is currently a final year Master’s student in Computer Engineering at San José State University focusing on Robotics and AI. He was previously an intern at FarmWise Labs and is currently interning at DeepMap Inc.
  • Seung Won Lee (LinkedIn) recently received his master’s degree in Computer Engineering from San José State University where he completed various projects related to computer vision, especially in autonomous driving. His master’s thesis, “Enhancing Point Cloud Density using Stereo Images”, adopts methods to combine regular point clouds with pseudo-LiDAR point clouds for improving 3D object detection.
  • Sachin Guruswamy (LinkedIn) recently received his master’s degree in Computer Engineering from San José State University with a focus on Robotics and Computer Vision. He currently works at Luxonis on their Depth.ai product.

Disclaimer: Our team member, Sachin Guruswamy, was hired by Luxonis after we submitted our proposal and were selected among the Phase I winners. For this reason, while we will still be submitting our work for the competition, we will not be in the running for the prize money.
