Geometric consistency as a learning signal

yodayoda · Published in Map for Robots · Sep 17, 2020

Self-supervised learning is all the rage these days in machine learning. Whether it's models like GPT-3 for natural language processing or data augmentation for computer vision, everyone is trying to get something for free from their data without paying for expensive things like labels.

In this article, we’ll go over one area of computer vision research focused on the very same goal. This area has gone relatively unnoticed by the broader machine learning community. By leveraging classical ideas in the geometry of three-dimensional shapes, this subfield of computer vision has been making incredible progress in training neural networks to do what was once thought to be impossible: creating accurate representations of depth in a scene from a single image.

Projecting from 3D to 2D views

Our digital cameras project 3D scenes onto two-dimensional pixel grids. This projection satisfies mathematical constraints described by projective geometry. The assumptions behind these constraints map exceptionally well to reality: light travels in straight lines, and real cameras behave much like an idealized pinhole camera. These simplifications make the analysis easier without introducing significant inaccuracies.

The main idea here is that the projection of a 3D scene onto a two-dimensional image depends on the position of the camera. To experience this dependence for yourself¹: hold one finger vertically in front of your face and compare what you see with only your left eye open to what you see with only your right eye open. Pay attention to the background behind your finger as you switch eyes. Notice a shift in the position of your finger relative to the background? This shift is a result of the difference in horizontal position between your left and right eye, and the difference between these two images is one of the cues our brains use to perceive depth.

Object viewed from two different angles. A and B can be two cameras in different positions or the same camera at a different time and position.

Most importantly for this article, this difference follows the constraints described by epipolar geometry. At a very high level, these constraints tell us that if we know where the cameras are relative to one another and which pixels come from the same point in the 3D scene (e.g. the tip of your finger), we can infer the depth of that pixel for each camera.

Let’s go over what these constraints are to get a better understanding of how we can use them to infer depth.

Camera Matrix

The camera matrix is like a spec sheet for a camera. It describes the intrinsic properties of the camera, like its horizontal and vertical focal lengths and also extrinsic properties like the camera’s position and orientation in the 3D scene.

Intrinsic Matrix

For now, let’s assume the horizontal and vertical focal lengths are the same value f_k.
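Under that assumption, a standard pinhole intrinsic matrix looks like the sketch below (in our own notation; (c_x, c_y) is the principal point, i.e. the pixel where the optical axis meets the image):

$$
K = \begin{pmatrix} f_k & 0 & c_x \\ 0 & f_k & c_y \\ 0 & 0 & 1 \end{pmatrix}
$$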

Extrinsic Matrix

The extrinsic matrix is a transformation that converts points in the 3D scene to their coordinates in the camera's frame of reference. It has two parts: a rotation matrix R and a translation vector t.
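In homogeneous coordinates, this can be sketched as a single 4×4 matrix that maps a world point x_world to its camera-frame coordinates x_cam:

$$
\begin{pmatrix} \mathbf{x}_{\text{cam}} \\ 1 \end{pmatrix}
= \begin{pmatrix} R & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{pmatrix}
\begin{pmatrix} \mathbf{x}_{\text{world}} \\ 1 \end{pmatrix}
$$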

Camera Matrix

Putting it all together, we get the camera matrix P. The camera matrix is often padded with an extra row to make it square and more convenient to manipulate, as we will see later.
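Combining the intrinsic and extrinsic parts above, one common way to write the square, padded form is the following sketch:

$$
P = \begin{pmatrix} K R & K\,\mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{pmatrix}
$$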

Mathematically, the camera matrix is a transformation based on these specifications, which converts a point in the 3D scene to its corresponding pixel coordinate on the 2D projection captured by the camera. Conveniently, in epipolar geometry the inverse of the camera matrix is the transformation from pixel positions back to points in the 3D scene. The critical thing to note here is that pixel coordinates in epipolar geometry also contain some depth information.
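Concretely, under the 4×4 convention sketched above, applying P to a 3D point yields the pixel location (u, v) scaled by the point's depth d in the camera's frame, so the "pixel coordinate" carries d along with it:

$$
P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
= \begin{pmatrix} d\,u \\ d\,v \\ d \\ 1 \end{pmatrix}
$$

Without that depth, the inverse mapping from a pixel back to a 3D point would be ambiguous.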

Transformations between 2D projections

If two camera matrices, let's call them Camera A and Camera B, are given with respect to the same 3D scene, then you can easily convert a pixel coordinate from Camera A to the corresponding pixel coordinate of Camera B. You apply the inverse of Camera A's camera matrix, which gives you the 3D scene coordinate, and then apply the camera matrix of Camera B.

Let's start by taking the inverse of the camera matrix for Camera A. For the sake of simplicity, let's assume cameras A and B have the same intrinsic specifications (the same intrinsic matrix K).

The inverse of the camera matrix A transforms pixel coordinates of Camera A to coordinates in the 3D scene.
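Written out in the notation above, this step is the following sketch; note that the depth d_A of each pixel is needed to undo the projection:

$$
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
= P_A^{-1} \begin{pmatrix} d_A u_A \\ d_A v_A \\ d_A \\ 1 \end{pmatrix}
$$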

Now that we have a way to convert from the pixel coordinates of Camera A to 3D scene coordinates, we can use the camera matrix of Camera B to go from 3D scene coordinates to pixel coordinates of Camera B.

This matrix transforms pixel coordinates from Camera A to pixel coordinates of Camera B. The matrix R* is the relative coordinate transformation which goes from camera A to camera B.
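Chaining the two steps together gives the full pixel-to-pixel mapping. With the shared intrinsic matrix K it works out to the sketch below, where R* = R_B R_A⁻¹ and t* = t_B − R* t_A are the relative rotation and translation from Camera A's frame to Camera B's frame:

$$
\begin{pmatrix} d_B u_B \\ d_B v_B \\ d_B \\ 1 \end{pmatrix}
= P_B P_A^{-1} \begin{pmatrix} d_A u_A \\ d_A v_A \\ d_A \\ 1 \end{pmatrix},
\qquad
P_B P_A^{-1} = \begin{pmatrix} K R^{*} K^{-1} & K\,\mathbf{t}^{*} \\ \mathbf{0}^{\top} & 1 \end{pmatrix}
$$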

This is the constraint that our scene must obey. If we know the relative coordinate transformation and the intrinsic parameters of the two cameras, we can map pixel coordinates from Camera A to pixel coordinates in Camera B. Roughly speaking, we can get an idea of the image that Camera A sees from an image taken by Camera B.

Consistency between different projections

If we have two images of the same 3D scene and we can transform the pixel coordinates of one image to the other, then we can generate what we would expect to see from the perspective of one image using the information in the other. More succinctly, we should be able to reconstruct one image from the other. If we assume our knowledge of the relative position/orientation and intrinsic properties of the camera(s) that took the images is correct, then we can ascribe any inconsistencies we see in this reconstruction to mistakes in our knowledge about the pixel coordinates, and more specifically, to incorrect knowledge about their depth. Most importantly, we can build models that learn from these mistakes.
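One common way to turn this into a training signal is a photometric reconstruction loss, sketched below in our notation: warp image B into Camera A's frame using the per-pixel depths and the pixel-to-pixel mapping from the previous section, and penalize the difference from what Camera A actually recorded.

$$
\mathcal{L}_{\text{photo}} = \frac{1}{N}\sum_{p} \left| I_A(p) - \hat{I}_A(p) \right|,
\qquad
\hat{I}_A(p) = I_B\big(\mathrm{warp}(p;\, d_A(p),\, R^{*},\, \mathbf{t}^{*},\, K)\big)
$$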

Depth from a single image

You may have noticed that everything we have described so far involves two cameras, or more specifically, two images or projections, to infer depth. To get depth from a single image, as promised at the beginning of this article, we have to employ models that can learn from their mistakes. With enough data, neural networks are pretty effective at this.

If our neural network predicts a depth for every pixel of one image, we can reconstruct that image from the other and compare the reconstruction with the actual contents to judge the quality of the prediction. And because the geometry equations we have discussed so far are differentiable, we can use backpropagation to minimize the errors. We now have a framework for predicting depth from a single image.
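To make this concrete, here is a minimal sketch of the view-synthesis loss, assuming PyTorch; the function and tensor names are our own, illustrative choices, not PackNet's or any particular library's API. It back-projects the pixels of image A with the predicted depth, projects them into Camera B, samples image B at those locations, and penalizes the photometric difference of the reconstruction.

```python
import torch
import torch.nn.functional as F

def view_synthesis_loss(img_a, img_b, depth_a, K, R, t):
    """img_a, img_b: (B, 3, H, W) images; depth_a: (B, 1, H, W) predicted depth for A;
    K: (B, 3, 3) intrinsics; R: (B, 3, 3), t: (B, 3, 1) pose of Camera B relative to A."""
    B, _, H, W = img_a.shape

    # Pixel grid of Camera A in homogeneous coordinates (u, v, 1).
    v, u = torch.meshgrid(
        torch.arange(H, device=img_a.device, dtype=img_a.dtype),
        torch.arange(W, device=img_a.device, dtype=img_a.dtype),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0)           # (3, H, W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)                  # (B, 3, H*W)

    # Back-project: 3D point in Camera A's frame = depth * K^-1 * pixel.
    points_a = torch.linalg.inv(K) @ pix * depth_a.reshape(B, 1, -1)

    # Transform into Camera B's frame and project with the intrinsics.
    points_b = R @ points_a + t
    proj = K @ points_b
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                 # (B, 2, H*W)

    # Normalize to [-1, 1] and sample image B at the projected locations.
    uv = uv.reshape(B, 2, H, W).permute(0, 2, 3, 1)                # (B, H, W, 2)
    grid = torch.stack(
        [2 * uv[..., 0] / (W - 1) - 1, 2 * uv[..., 1] / (H - 1) - 1], dim=-1
    )
    recon_a = F.grid_sample(img_b, grid, padding_mode="border", align_corners=True)

    # Photometric consistency: the reconstruction should match image A.
    return (recon_a - img_a).abs().mean()
```

In training, depth_a would come from a depth network applied to img_a, and (R, t) either from known camera poses or from a second pose network; because every step of the warp is differentiable, the gradients of this loss flow back into those predictions.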

You might ask, how does this framework do?

A single RGB image of New York City converted to a depth map with PackNet. Distant objects are shown in darker colors and close objects in lighter colors. The network performs well, and details such as the traffic light on the right are reproduced nicely. However, some parts, such as the blown-out background, are misinterpreted.

Assumptions that are not always true

Earlier, we mentioned that the geometric constraints map very well to reality. This is true if we assume the different projections were captured at the same time. What if two projections were captured at different points in time? This is where things become more complicated. For example, what if an object in the scene moves between the first projection and the second? In the example using the tip of your finger, what if your hand moved as you switched from one eye to the other? This movement violates an assumption we made about the 3D scene being static between different camera views.

Lacking a very accurate model of the moving objects, many methods simply ignore (mask out) the pixels of objects that are suspected to be moving.
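In practice this can be as simple as a per-pixel mask applied to the loss. One heuristic, similar in spirit to Monodepth2's auto-masking, keeps only the pixels where warping actually helps, i.e. the likely static parts of the scene. A rough sketch, assuming PyTorch and reusing the tensors from the earlier view_synthesis_loss example (recon_a is image B warped into A's frame):

```python
import torch

def masked_photometric_loss(img_a, img_b, recon_a):
    # Per-pixel photometric error of the warped reconstruction vs. no warp at all.
    error_warped = (recon_a - img_a).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    error_static = (img_b - img_a).abs().mean(dim=1, keepdim=True)
    # Keep pixels where the warp improves the error; suspected movers get weight 0.
    mask = (error_warped < error_static).float()
    return (mask * error_warped).sum() / mask.sum().clamp(min=1.0)
```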

We also know from geometry that there are ambiguities that cannot be ruled out when inferring depth from a single image. This leads to many illusions in our own depth perception when we look at 2D images on our computer screens. It is only when we incorporate other cues, like structure from motion, that we even consciously notice this ambiguity.

Final Remarks

Neural networks can derive depth information from just a single image. With the technology progressing, an inexpensive replacement for LiDAR is on our doorstep.

This article was brought to you by yodayoda Inc., your expert in automotive and robot mapping systems.
If you want to join our virtual bar time on Wednesdays at 9pm PST/PDT, please send an email to talk_at_yodayoda.co and don’t forget to subscribe.

References

[1] Szeliski, Richard. Computer vision: algorithms and applications. Springer Science & Business Media, 2010.

[2] http://ksimek.github.io/2012/08/13/introduction/
