From depth map to point cloud
How to convert an RGBD image to points in 3D space
This tutorial introduces the intrinsic matrix and walks you through how you can use it to convert an RGBD (red, green, blue, depth) image to 3D space. RGBD images can be obtained in many ways, e.g. from a system like the Kinect, which uses infrared-based time-of-flight detection. The iPhone 12 is also rumored to integrate a LiDAR into its camera system. Most importantly for self-driving cars, LiDAR data from a unit mounted on the car can be combined with a standard RGB camera to obtain RGBD data. We do not go into the details of how to acquire the data in this article.
It is important to know your camera’s properties if you want to understand what each pixel corresponds to in a 3D environment. The most important parameter is the focal length. It tells us how to translate pixel coordinates into physical lengths. You have probably seen focal lengths like “28 mm”; this is the actual distance between the lens and the film/sensor.
From a simple geometric argument (“similar triangles”) we can easily derive the position x of each pixel from its image coordinate u and its depth d. The picture below only looks at x and u, but we can do exactly the same for y and v. For a pinhole camera model, the focal length is the same in the x and y directions. This is not always the case for a camera with a lens, and we will discuss this in a future article.
From the similar-triangles approach we immediately obtain:
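Written out (a reconstruction for reference, assuming u and v are measured relative to the image centre and d is the measured depth of the pixel):

$$x = \frac{u \, d}{f_x}, \qquad y = \frac{v \, d}{f_y}, \qquad z = d \qquad \text{(Eq. 1)}$$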
Usually fₓ and fᵧ are identical. They can differ, though, e.g. for non-square pixels on the image sensor, lens distortion, or post-processing of the image.
To sum up, we can write a very short piece of Python code using only geometric arguments to convert from the screen’s coordinate system to the Cartesian coordinate system.
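A minimal sketch of what such a snippet could look like (the function name uvd_to_xyz and the constant values are illustrative, and treating pxToMetre as a scale on the raw depth value is just one plausible convention):

```python
import numpy as np

# Illustrative camera constants -- replace with your camera's values.
fx, fy = 1000.0, 1000.0   # focal lengths in pixels
cx, cy = 320.0, 240.0     # centre of the camera sensor in pixels
pxToMetre = 1.0e-3        # scale relating the raw depth units to meters

def uvd_to_xyz(u, v, d):
    """Convert pixel (u, v) with raw depth value d into a 3D point (x, y, z)."""
    z = d * pxToMetre          # depth along the optical axis, in meters
    x = (u - cx) * z / fx      # similar triangles in the x-u plane
    y = (v - cy) * z / fy      # similar triangles in the y-v plane
    return x, y, z

# Example: convert every pixel of a (stand-in) depth map.
depth = np.ones((480, 640))
points = [uvd_to_xyz(u, v, depth[v, u])
          for v in range(depth.shape[0])
          for u in range(depth.shape[1])]
```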
In the code, (cₓ, cᵧ) is the centre of the camera sensor. Note the constant pxToMetre, a camera property, which you can determine if the focal length is known both in units of meters and in pixels. Even without it, the picture is accurately represented in 3D up to a scale factor.
Of course there is a more general way to do all this. Enter the intrinsic matrix: a single matrix which incorporates the previously discussed camera properties (the focal length and the centre of the camera sensor, as well as the skew). Read this excellent article on it for more information. Here, we want to discuss how to use it to do the above conversion for us. In the following we will use capital boldface for matrices, lower case boldface for vectors, and normal script for scalars.
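In its common form the intrinsic matrix collects these quantities as

$$\mathbf{K} = \begin{pmatrix} f_x & S & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

with focal lengths fₓ and fᵧ, skew S, and sensor centre (cₓ, cᵧ).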
Next, we introduce homogeneous coordinates, which will help us write transformations (translations, rotations, and skews) as matrices of the same dimensionality.
Think of it this way: in Fig. 2 we could move the image plane to any other distance, e.g. from fₓ → 2fₓ, and keep note of the factor h = 2 by which we shifted it. The shift introduces a simple scaling, and we can always go back to the original by dividing u and v by h.
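In other words, the pixel (u, v) becomes the homogeneous vector (u, v, 1)ᵀ, and any scaled version of it describes the same pixel:

$$h \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} hu \\ hv \\ h \end{pmatrix} \quad \longrightarrow \quad (u, v) = \left( \frac{hu}{h}, \frac{hv}{h} \right)$$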
Now we can perform any operation on the homogeneous coordinates; all operations are defined such that the last component stays unchanged. Good examples can be found in Chapter 2.5.1 of this book.
The rotation matrix R, translation vector t, and the intrinsic matrix K make up the camera projection matrix. It is defined to convert from world coordinates to screen coordinates:
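Concretely (a reconstruction of this relation, with z the depth of the point in camera coordinates and (x_w, y_w, z_w) its world coordinates):

$$z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \mathbf{K} \, [\mathbf{R} \,|\, \mathbf{t}] \begin{pmatrix} x_w \\ y_w \\ z_w \\ 1 \end{pmatrix}$$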
Note that [R|t] refers to block notation, meaning we concatenate R and the column vector t = (t₀, t₁, t₂)ᵀ, i.e. we append it as an additional column on the right-hand side of R.
If we want to do the conversion the other way around, we have a problem: we cannot invert the 3x4 matrix. In the literature you will find an extension to a square matrix, which allows us to invert it. To do this, we have to add the disparity 1/z as an extra component on the left-hand side to fulfill the equation. The resulting 4x4 matrices are called full-rank intrinsic/extrinsic matrices.
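One common way to write this square extension is (again a reconstruction; the extra row simply carries the disparity through):

$$z \begin{pmatrix} u \\ v \\ 1 \\ 1/z \end{pmatrix} = \begin{pmatrix} \mathbf{K} & \mathbf{0} \\ \mathbf{0}^T & 1 \end{pmatrix} \begin{pmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix} \begin{pmatrix} x_w \\ y_w \\ z_w \\ 1 \end{pmatrix}$$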
Let’s verify what we said above with the simplest case: the camera origin and the world origin are aligned, i.e. R is the identity and t is zero, the skew S is 0, and the image sensor is centered. Now the inverse of the camera matrix is simply:
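(Reconstructed here; we refer to it as Eq. 6 below.)

$$\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = z \begin{pmatrix} 1/f_x & 0 & 0 & 0 \\ 0 & 1/f_y & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \\ 1/z \end{pmatrix} \qquad \text{(Eq. 6)}$$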
Looking at just the first row leads to exactly the same conclusion we found in the beginning (Eq. 1). The same applies to y and z, using row two and row three of Eq. 6, respectively. For more complicated intrinsic matrices, you will need to calculate the inverse before making this conversion. Since it is an upper triangular matrix, there is an easy analytical solution:
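One way to write it out (you can check it by multiplying with K):

$$\mathbf{K}^{-1} = \begin{pmatrix} f_x & S & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} 1/f_x & -S/(f_x f_y) & (S c_y - c_x f_y)/(f_x f_y) \\ 0 & 1/f_y & -c_y/f_y \\ 0 & 0 & 1 \end{pmatrix}$$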
Now you have all the tools at hand to convert a depth map or RGBD image into a 3D scene where each pixel represents one point (Fig. 3); a vectorized sketch of the full conversion follows below. There are some assumptions that we made along the way. One of them is the simplified camera model: a pinhole camera. Real-world cameras, however, use lenses and can often only be approximated by the pinhole model. In the next article of this series we will explore the differences and the impact of the lens on this conversion.
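Here is a sketch of how the full conversion could look with NumPy, using the inverse intrinsic matrix (the function name depth_to_point_cloud and the camera values are illustrative, and we assume a metric depth map with R the identity and t zero):

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project a depth map (H x W, in meters) into an (H*W) x 3 point cloud.

    Assumes the camera and world frames coincide (R = I, t = 0).
    """
    H, W = depth.shape
    K_inv = np.linalg.inv(K)   # or use the analytical form shown above

    # Pixel grid in homogeneous coordinates: one column (u, v, 1) per pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    uv1 = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)

    # x = z * K^-1 (u, v, 1)^T, evaluated for every pixel at once.
    points = (K_inv @ uv1) * depth.reshape(1, -1)
    return points.T

# Illustrative intrinsics: fx = fy = 525 px, centre (319.5, 239.5), zero skew.
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])
depth = np.ones((480, 640))            # stand-in for a real depth map
cloud = depth_to_point_cloud(depth, K)
print(cloud.shape)                     # (307200, 3)
```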
This article was brought to you by yodayoda Inc., your expert in automotive and robot mapping systems.
If you want to join our virtual bar time on Wednesdays at 9pm PST/PDT, please send an email to talk_at_yodayoda.co and don’t forget to subscribe.