Reconstruction: The Mighty Camera

DroneDeploy Technology Blog
8 min read · Oct 26, 2018


At the core of DroneDeploy is a product called Map Engine that is able to quickly and accurately reconstruct 3D scenes from a collection of photographs without any prior knowledge of where the photos were taken or what they contain. This is one of the core problems of photogrammetry, a field that draws extensively on techniques from computer vision and machine learning. Being able to do this is powerful, as a digital camera is a lot cheaper and easier to move around than a 3D scanner, but you do need some smart software. This is the first of three posts describing how this process takes place from beginning to end. We will also be implementing a simple and educational pipeline that reconstructs 3D scenes from 2D images.

First we are going to look at some of the fundamentals of cameras and code up some building blocks. Our story starts with the mighty digital camera. Cameras use light to capture a 2D representation of the 3D world by exposing a sensor to light that is focused through a lens. This creates a digital image. Photographers talk about cameras in terms of shutter speed, focal length and ISO. Here is a great interactive illustration of how these factors affect the final image.

In computer vision we approach cameras differently and are concerned instead with the external and internal geometry of the camera. The external geometry, called the extrinsics, represents how the world is transformed relative to the camera when we look through it. It’s usually more intuitive, though, to specify where the camera is in the world rather than how it transforms the world. This is called the pose and we specify it as a matrix [R | t], where R is a 3x3 rotation matrix and t is a 3x1 translation vector. The extrinsics are then E = [R' | -R'*t], where R' is the transpose of R. This inversion of the pose transforms the world so that we are effectively looking down the optical axis of the camera.
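As a concrete sketch of that conversion (assuming numpy and the pose convention above, not the post’s original code):

```python
import numpy as np

def extrinsics_from_pose(R, t):
    """Convert a camera pose (rotation R, position t, both in world
    coordinates) into the 3x4 extrinsic matrix E = [R' | -R' * t] that
    maps world points into the camera's coordinate frame."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float).reshape(3, 1)
    return np.hstack([R.T, -R.T @ t])
```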

Next we have the internal geometry of the camera, called the intrinsics, which represents a conversion to pixels. Two important values here are the camera center (also called the principal point) and the focal length. Together, as a 3x3 matrix K, these represent a scaling and translation that map points onto the image based on the field of view of the camera and where the pinhole of the camera sits relative to the sensor. We can now create a projection matrix as P = K * E. The camera projection matrix can be thought of as acting on a 3D point as a rotation and translation, followed by a scaling and translation that yields the final image coordinates, which may fall off the edges of the image depending on the resolution in pixels.
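For illustration, K and P might be built like this (fx, fy are focal lengths in pixels and (cx, cy) is the principal point; the names are ours, not from the original post):

```python
import numpy as np

def intrinsics(fx, fy, cx, cy):
    """Build the 3x3 intrinsic matrix K from pixel focal lengths and
    the principal point (the camera center in the image)."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

# The full 3x4 projection matrix is then the product of the two parts:
# P = intrinsics(fx, fy, cx, cy) @ extrinsics_from_pose(R, t)
```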

Projection

Let’s implement a simple pinhole camera. We can now use the projection matrix of a camera to transform 3D world points to 2D image coordinates. We represent our world coordinates as 4D homogeneous coordinates, multiply by the projection matrix, and normalize by the last coordinate to get image coordinates.

An implementation of a pinhole camera
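The embedded snippet isn’t reproduced here, but a minimal pinhole camera along these lines (a sketch, assuming numpy and the conventions above) could look like this:

```python
import numpy as np

class PinholeCamera:
    """A simple pinhole camera defined by intrinsics K and a pose (R, t)."""

    def __init__(self, K, R=np.eye(3), t=np.zeros(3)):
        self.K = np.asarray(K, dtype=float)
        self.R = np.asarray(R, dtype=float)
        self.t = np.asarray(t, dtype=float).reshape(3)
        # Extrinsics E = [R' | -R' * t] and projection P = K * E.
        E = np.hstack([self.R.T, (-self.R.T @ self.t).reshape(3, 1)])
        self.P = self.K @ E

    def project(self, points):
        """Project an Nx3 array of world points to Nx2 image coordinates."""
        points = np.atleast_2d(np.asarray(points, dtype=float))
        homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
        image = (self.P @ homogeneous.T).T
        return image[:, :2] / image[:, 2:3]  # normalize by the third coordinate
```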

Let’s test this out by creating some 3D world points. We’ll create the vertices of a cube and place the camera back along the z-axis. We’ll then project each of the vertices of the cube into the camera and we should get a 2D representation.

Viewing our 3D cube using a pinhole camera
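As a rough stand-in for the original test code, using the PinholeCamera and intrinsics helpers sketched above (the 500-pixel focal length and the principal point at (320, 240) are made-up values):

```python
import numpy as np

# Eight vertices of a 2x2x2 cube centered at the origin.
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                dtype=float)

# A camera 5 units back along the z-axis, looking towards the cube.
K = intrinsics(fx=500, fy=500, cx=320, cy=240)
camera = PinholeCamera(K, R=np.eye(3), t=np.array([0.0, 0.0, -5.0]))

pixels = camera.project(cube)
print(pixels)  # the back face (z = +1) projects closer to the image center
```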

The resulting image shows our cube and correctly captures the perspective introduced by the back face being further from the camera than the front.

Aside: Decomposition

We can also go backwards and decompose the projection matrix into its K, R and t components. Remembering that the structure is P = K * [R' | -R'*C], where C is the camera center, we can first extract the translation part and then decompose the remaining 3x3 block to get K and R. Unfortunately this decomposition is not unique: the RQ decomposition is only defined up to sign, because negating a column of the upper-triangular factor and the corresponding row of the orthogonal factor leaves the resulting projection matrix unchanged.
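Here is a sketch of that decomposition using scipy’s RQ factorization, resolving the sign ambiguity by forcing K to have a positive diagonal:

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(P):
    """Split a 3x4 projection matrix into intrinsics K, a world-to-camera
    rotation R and the camera center C."""
    K, R = rq(P[:, :3])
    # RQ is only unique up to sign: negating a column of K and the matching
    # row of R leaves their product unchanged, so pick the signs that give
    # K a positive diagonal.
    signs = np.diag(np.sign(np.diag(K)))
    K, R = K @ signs, signs @ R
    # Recover the camera center from the last column: P[:, 3] = -K * R * C.
    C = -R.T @ np.linalg.solve(K, P[:, 3])
    return K / K[2, 2], R, C
```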

Positioning the camera

Up until now we’ve just had our camera looking down the z-axis, which isn’t that exciting. We want to be able to fearlessly place our camera anywhere, pointing in any direction. Placing the camera anywhere is easy: we just specify the position t. Pointing it in any direction is a bit less intuitive because we have to specify the orientation as a rotation matrix. One way to do this is to construct the rotation matrix of the pose from the axes of the camera expressed in world coordinates, with one axis being the direction the camera is pointing and the other two being a vector through the top of the camera and a vector through the side of the camera. Here’s a function to position the camera somewhere, looking at a certain point:
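The original gist isn’t shown here, but a look-at helper in that spirit might look like this (assuming the world up direction is the y-axis and the usual computer-vision camera frame with x to the right, y down and z forward):

```python
import numpy as np

def look_at(position, target, up=(0.0, 1.0, 0.0)):
    """Build a camera pose (R, t) located at `position` with its optical
    axis pointing at `target`. The columns of R are the camera's axes
    expressed in world coordinates. Degenerate if the view direction is
    parallel to `up`."""
    position = np.asarray(position, dtype=float)
    forward = np.asarray(target, dtype=float) - position
    forward /= np.linalg.norm(forward)              # optical axis (camera z)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)                  # through the side of the camera (x)
    down = np.cross(forward, right)                 # through the bottom of the camera (y)
    R = np.column_stack([right, down, forward])
    return R, position
```

A pose built this way can be handed straight to the PinholeCamera sketch above, e.g. PinholeCamera(K, *look_at((4, 3, -6), (0, 0, 0))).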

With this in place we can position the camera anywhere and look in any direction. Let’s test it out by positioning the camera randomly around our cube and using our existing projection code to look at the cube from different angles.

Viewing our cube from 6 different positions

You’ll notice an optical illusion here in that the cube will look skewed in some images. As your eyes sort out the orientation you will see them square up. The reason is that we aren’t sorting the edges by their depth from the camera, meaning some lines in the 2D images are drawn over other lines that they should be behind. If we wanted to clear this up we would sort our geometry by depth from the camera and render in that order.

Some more visualization

Now that we can position and look through arbitrary cameras, let’s visualize the camera positions instead of just what they see. We can do this by placing another camera looking at the whole scene of cameras and rendering the position of each camera along with its axes and frustum.

The result is our whole scene of cameras looking at our cube.

Triangulation

Now we should be pretty comfortable with cameras and mapping the 3D world to 2D images. But can we go backwards? Let’s try to reconstruct the 3D vertices of the cube from a 2D picture. Unfortunately we can’t do this from a single view, because a pixel corresponds to a ray extending out into the real world and the 3D point could lie anywhere along that ray. But if we have a few different poses of the camera, the rays through the corresponding pixel in each image should intersect at a unique point in the 3D world. Take a look at this picture to convince yourself that you need at least two pixels to reconstruct a 3D point.

If we take two images of our cube and write out how the two 2D pixels were computed from the 3D point using the two projection matrices (P1 and P2), we end up with a linear system of equations. We could try to solve this directly, but a couple of things go wrong: the matrix is typically not square (unless you have exactly two views), so taking an inverse doesn’t work. A better and more numerically stable way is to instead treat it as a homogeneous system of equations. This has a trivial solution at zero, which we are not super excited about, but if we take the singular value decomposition (SVD) of the matrix and pick the right singular vector corresponding to the smallest singular value, we get a solution to the system with some useful additional properties. In particular the solution vector is constrained to have a magnitude of 1, which avoids the trivial solution; it’s also numerically stable and handles over-determined systems naturally. Let’s code this up and create a function that takes a list of cameras and a corresponding list of image coordinates and triangulates them.
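A sketch of such a linear (DLT-style) triangulation, reusing the PinholeCamera class from earlier:

```python
import numpy as np

def triangulate(cameras, image_points):
    """Triangulate a single 3D point from its pixel coordinates observed
    in two or more cameras, by solving the homogeneous system with SVD."""
    rows = []
    for camera, (u, v) in zip(cameras, image_points):
        P = camera.P
        # Each observation contributes two linear constraints on the
        # homogeneous 3D point X: (u * P[2] - P[0]) . X = 0 and
        # (v * P[2] - P[1]) . X = 0.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The solution is the right singular vector associated with the
    # smallest singular value, i.e. the last row of Vt.
    _, _, Vt = np.linalg.svd(np.array(rows))
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize back to 3D
```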

More triangulation

The SVD-based triangulation makes two assumptions: the first is about the error term we are minimizing, and the second is about how the camera operates on 3D points, namely as a linear operation. We can also triangulate points using a non-linear solver. We can frame this as a problem where we seek the X, Y, Z coordinates in the world such that they project onto the corresponding image coordinates in each camera we are considering. We can use something like Levenberg–Marquardt to solve this problem. Although this takes longer than our analytical solution, it allows us to be more flexible about some things. For example, instead of least squares we can use different norms, like the Huber norm, to handle outliers in our data. This is useful, as we will see in the next post, where we don’t know exactly where our 3D and 2D points are. Another advantage is that we can use more sophisticated camera models that can’t be represented as just a matrix multiplication.
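As one possible sketch (not the post’s original code), scipy’s least_squares can minimize the reprojection error with a Huber loss; its Levenberg–Marquardt mode doesn’t support robust losses, so the default trust-region solver is used here instead:

```python
import numpy as np
from scipy.optimize import least_squares

def triangulate_nonlinear(cameras, image_points, initial_guess=None):
    """Refine a 3D point by minimizing reprojection error over all cameras,
    using a robust Huber loss to reduce the influence of outliers."""
    observed = np.asarray(image_points, dtype=float)

    def residuals(X):
        # Reprojection error of the candidate point X in every camera.
        projected = np.vstack([camera.project(X) for camera in cameras])
        return (projected - observed).ravel()

    if initial_guess is None:
        # Seed the solver with the linear SVD solution from above.
        initial_guess = triangulate(cameras, image_points)

    result = least_squares(residuals, initial_guess, loss="huber")
    return result.x
```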

Distortion

We just mentioned “more sophisticated camera models”. Up until now we have dealt only with an ideal camera called a pinhole camera. Real-world cameras are not as perfect: due to imperfections and tolerances in the manufacturing process each camera is slightly different, and the lens can introduce various types of distortion. For example, perfectly straight lines in a scene may appear to bend in the image. This can cause problems if not handled correctly, as the reconstructions are typically used for measurement and planning. Here’s a simple implementation of a Brown–Conrady camera model:

Implementation of a Brown camera model with 8 distortion parameters
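The embedded implementation isn’t reproduced here; a sketch in the same spirit, using OpenCV-style rational radial terms (k1..k6) plus tangential terms (p1, p2) and building on the PinholeCamera class above, might look like this:

```python
import numpy as np

class BrownCamera(PinholeCamera):
    """Pinhole camera extended with an 8-parameter Brown-Conrady style
    distortion model, applied in normalized camera coordinates before
    the intrinsics."""

    def __init__(self, K, R=np.eye(3), t=np.zeros(3), distortion=None):
        super().__init__(K, R, t)
        distortion = np.zeros(8) if distortion is None else np.asarray(distortion, dtype=float)
        (self.k1, self.k2, self.k3, self.k4,
         self.k5, self.k6, self.p1, self.p2) = distortion

    def project(self, points):
        points = np.atleast_2d(np.asarray(points, dtype=float))
        homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
        # Transform into the camera frame and normalize before applying K.
        E = np.hstack([self.R.T, (-self.R.T @ self.t).reshape(3, 1)])
        cam = (E @ homogeneous.T).T
        x, y = cam[:, 0] / cam[:, 2], cam[:, 1] / cam[:, 2]
        r2 = x * x + y * y
        # Rational radial distortion factor.
        radial = (1 + self.k1 * r2 + self.k2 * r2**2 + self.k3 * r2**3) / \
                 (1 + self.k4 * r2 + self.k5 * r2**2 + self.k6 * r2**3)
        # Tangential distortion terms.
        x_d = x * radial + 2 * self.p1 * x * y + self.p2 * (r2 + 2 * x * x)
        y_d = y * radial + self.p1 * (r2 + 2 * y * y) + 2 * self.p2 * x * y
        # Map the distorted normalized coordinates into pixels.
        pixels = (self.K @ np.vstack([x_d, y_d, np.ones_like(x_d)])).T
        return pixels[:, :2] / pixels[:, 2:3]
```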

Now let’s vary some of the radial and tangential distortion parameters and see what they do to our cube projection.

The effect of varying radial distortion (k1, k2, k3, k4, k5, k6) on our camera
The effects of varying tangential distortion (p1) on our cube
The effects of varying tangential distortion (p2) on our cube

Unfortunately the distortion parameters are not printed on the box of the camera and have to be estimated from the correspondences between coordinates in different images. This is something we will be solving in the next post.

So that completes our brief walkthrough of camera fundamentals. We’ve talked about how to position cameras in the world and view a scene from different locations. We looked at constructing and decomposing the projection matrix. We’ve also looked at reconstructing the position of 3D world points from 2D image coordinates. In the following posts we will tackle the problem of reconstructing 3D scenes using these fundamentals but without any ground truth data like our cube.
