Converting camera poses from OpenCV to OpenGL can be easy

Thomas Rouch
Check & Visit — Computer Vision
12 min read · Mar 1, 2022

Importing a camera pose from another library can be quite confusing. It’s important to understand how the camera and world axes are oriented, but also how the matrices are stored in memory.


Why should I bother paying attention to camera coordinate systems?

Working in a closed environment

You may have never asked yourself precisely in which coordinate system your world or camera is defined. In some cases, it just works and there’s no need to go further.

Pinhole camera model — Diagram by the author

Let’s consider the example of Visual Odometry. Its goal is to estimate the motion of a camera from a stream of images or pairs of images.

The camera pose is a translation t and a rotation R, i.e. where it is located in the world and in which direction it is looking. Unfortunately, there’s no direct measurement of the 3D geometry of the surrounding scene with respect to which the motion is defined. Thus, we also need to estimate a 3D point cloud. Using a camera model, 3D points in the world can be mapped to 2D points in an image (See diagram above). The joint update of the 3D map and the camera pose is done by solving an optimization problem based on the consistency with the 2D points detected on the input images.

As you might have noticed, it doesn’t matter if the vertical axis of our world is Y or Z, or if it’s pointed upward or downward. Everything is fine as long as the projection of the 3D map onto the image plane is well aligned with the detected key points. Indeed, since the input is only in 2D we have total control over what happens internally in 3D.

Imports and exports between different camera conventions

It is however much more realistic to consider that a computer vision pipeline will reuse existing libraries and might accept 3D data as an input.

For this purpose, 3D points and camera poses must be imported or exported between environments that don’t necessarily use the same camera conventions:

The camera pose is always a 3x3 rotation matrix associated with a 3D vector. The meaning of these 12 coefficients might however vary from one library to another. There’s no way to guess the coordinate system from the 3x3 rotation matrix itself. The camera pose must be pre-processed to be correctly interpreted by the recipient.

Photo by Olav Ahrens Røtne on Unsplash

Recap: Geometric Transformation

Row or Column order?

Storing a 2D matrix in linear storage implies finding a way to map it to an equivalent contiguous array in memory. There are two options:

  • Row-major: Values are stored one row after the other. Used by DirectX, OpenCV…
  • Column-major: Values are stored one column after the other. Used by OpenGL, Unity, Pytorch3D, AliceVision…
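To make the distinction concrete, here is a minimal NumPy sketch (NumPy arrays are row-major by default, and `order="F"` gives the column-major reading of the same buffer):

```python
import numpy as np

# The same 9 coefficients in linear memory...
buf = np.arange(9.0)

# ...read one row after the other (row-major, e.g. OpenCV)...
row_major = buf.reshape(3, 3, order="C")

# ...or one column after the other (column-major, e.g. OpenGL):
col_major = buf.reshape(3, 3, order="F")

# One interpretation is the transpose of the other.
assert np.array_equal(row_major, col_major.T)
```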

Row or Column vector?

Once we have a 3x3 matrix, there are two ways to use it as a rotation of the 3D space:

  • Multiply a 3D row-vector (1x3) on the left of the rotation matrix. The i-th coefficient of the rotated vector is the dot-product between the input row-vector and the i-th column of the matrix.
  • Multiply a 3D column-vector (3x1) on the right of the rotation matrix. The i-th coefficient of the rotated vector is the dot-product between the input column-vector and the i-th row of the matrix.

As you can see, it’s more cache-friendly to perform the matrix multiplication with a row-vector when using column-major matrices and column-vector when using row-major matrices.

As a result, there are really only two configurations to consider: row vectors or column vectors. Once a library has made that choice, its memory ordering can usually be deduced from it.

The transposition allows switching from one system to the other. If a rotation matrix R is stored in memory using a row-major ordering and then passed to a function that assumes a column-major ordering (like OpenGL), it will be interpreted as the transposed matrix Rᵀ.
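Both conventions describe the same geometry. As a minimal sketch (using a 90° rotation around Z), the row-vector form applies the transpose of the column-vector matrix:

```python
import numpy as np

# 90° rotation around Z in the column-vector convention: p' = R @ p.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

p = np.array([1.0, 0.0, 0.0])

p_col = R @ p        # column-vector (3x1) on the right
p_row = p @ R.T      # row-vector (1x3) on the left of the TRANSPOSED matrix

assert np.allclose(p_col, p_row)
assert np.allclose(p_col, [0.0, 1.0, 0.0])  # X-axis mapped onto the Y-axis
```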

From now on, the column-vector (3x1) notation will be used.

Change of basis

The camera translation t is easy to understand. The 3D points are shifted away from the origin by this constant vector.

As for the camera orientation R, let's first recall why it has to be an orthogonal matrix. Orientation can be seen as a change of basis that maps between the global axes and the new local axes: it indicates how the camera now considers the new X, Y, and Z axes with respect to the old ones. For obvious reasons, the space around the camera neither contracts nor expands when looking around, so the transition matrix must preserve the Euclidean norm. This corresponds to the orthogonal group O₃, which contains rotations and symmetries. At first glance the symmetries should be pruned out as well; but, as we'll see later, moving from a left-handed camera space to a right-handed world space requires combining rotations and flips.

The transition matrix mapping from a basis 2 to a basis 1 is simply the transformation matrix of the Identity function defined on R³ with basis 2 and expressed in R³ with basis 1. As a consequence, each column is an axis of the new basis 2 expressed in the old basis 1. Seeing things in column like this will be very handy when it comes to pre-calculating the transition matrix from one coordinate system to another, e.g. OpenGL to OpenCV.

Transition matrix from new basis 2 to old basis 1
Negative rotation around the Y-axis — Figure by the author
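As a sketch of this column-by-column view, here is a hypothetical transition matrix built by stacking the axes of a new basis (a 90° rotation around Y in this example) as columns:

```python
import numpy as np

# Axes of the new basis 2, expressed in the old basis 1.
x2_in_1 = np.array([0.0, 0.0, -1.0])  # new X is the old -Z
y2_in_1 = np.array([0.0, 1.0,  0.0])  # new Y is the old Y
z2_in_1 = np.array([1.0, 0.0,  0.0])  # new Z is the old X

# Stacking them as COLUMNS gives the transition matrix from basis 2 to basis 1.
T_1_from_2 = np.column_stack([x2_in_1, y2_in_1, z2_in_1])

# The point (1, 0, 0) in basis 2 lies on the new X-axis, i.e. on the old -Z.
p_1 = T_1_from_2 @ np.array([1.0, 0.0, 0.0])
assert np.allclose(p_1, [0.0, 0.0, -1.0])
```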

Camera convention: Left/Right, Up/Down, Front/Back ?

Camera axes

A 3x1 translation vector represents a 3D position in the world space. There is no ambiguity: it merely means that the camera is located at this very position.

As for the 3x3 rotation matrix, the interpretation isn't so straightforward. It represents the camera orientation, but how do we extract the FRONT, LEFT, or DOWN direction from it?

I think it’s crucial to start thinking about this by considering the case of identity rotation. In this case, the camera is supposed to be aligned with the axes. As you can see, there are plenty of ways to define what aligned means. Since there’s no objective answer, one needs to choose a convention.

The image below illustrates some standards used to define the camera orientation. It’s common to refer to a system by listing the meaning of its X, Y, and Z axes:

  • Pytorch3D: LEFT, UP, FRONT
  • OpenCV: RIGHT, DOWN, FRONT
  • OpenGL: RIGHT, UP, BACK
  • Unity: RIGHT, UP, FRONT
Some camera conventions — Figure by the author

Given the same 9 rotation coefficients in memory, the FRONT vector is extracted differently:

  • Pytorch3D, Unity: Column-major and Z FRONT. Thus, the last row corresponds to the FRONT vector.
  • OpenCV: Row-major and Z FRONT. Thus, the last column corresponds to the FRONT vector.
  • OpenGL: Column-major and Z BACK. Thus, minus the last row corresponds to the FRONT vector.
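The indexing can be sketched with NumPy. The buffer below is arbitrary; only the storage-order bookkeeping matters, assuming a camera-to-world rotation in the column-vector convention, where the third column is the camera Z-axis expressed in the world:

```python
import numpy as np

# Hypothetical camera-to-world rotation received as 9 floats in memory.
buf = np.arange(9.0)

# OpenCV: row-major storage, Z = FRONT -> FRONT is the third column.
R_cv = buf.reshape(3, 3, order="C")
front_cv = R_cv[:, 2]

# Pytorch3D / Unity: column-major storage, Z = FRONT. Reading the same
# buffer row-major, their third column shows up as the last ROW.
R_p3d = buf.reshape(3, 3, order="F")
front_p3d = R_p3d[:, 2]
assert np.array_equal(front_p3d, buf.reshape(3, 3, order="C")[2, :])

# OpenGL: column-major storage, Z = BACK -> FRONT is MINUS the last row.
front_gl = -buf.reshape(3, 3, order="F")[:, 2]
```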

Left and right-handedness

As illustrated in the figure above the only difference between OpenGL and Unity axes is that the Z-axis is flipped, which results in a transition matrix with a negative determinant. Note that this is a symmetry and not a rotation.

A right-handed system, like OpenGL, has the same chirality as the world we’re living in, whereas a left-handed system, like Unity, lives in a mirror version of the world. Rotations maintain the handedness, while symmetries inverse it.

An easy way to determine the handedness of a system is to compute the cross-product between x and y using the right-hand rule. We get:

  • +z in a right-handed system
  • -z in a left-handed system.
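This check is easy to script: express each system's axes in a common right-handed reference frame and look at the sign of (x × y) · z, which is also the determinant of the transition matrix. A minimal sketch:

```python
import numpy as np

def handedness(x, y, z):
    """+1.0 for a right-handed axis triple, -1.0 for a left-handed one."""
    return float(np.sign(np.dot(np.cross(x, y), z)))

# OpenGL (RIGHT, UP, BACK), expressed in its own right-handed frame.
assert handedness([1, 0, 0], [0, 1, 0], [0, 0, 1]) == 1.0

# Unity (RIGHT, UP, FRONT): its FRONT is OpenGL's -Z, so in the common
# frame the triple is left-handed.
assert handedness([1, 0, 0], [0, 1, 0], [0, 0, -1]) == -1.0
```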

The fact that the Z-axis is pointing backward in right-handed OpenGL view space implies that objects in front of the camera have a negative depth, which might be a bit confusing when considering near and far depths. On the other hand, the projection matrix reverses the Z-axis to have it pointing frontward when mapping to the left-handed OpenGL normalized device coordinates space (NDC).

To make it clear just in case: handedness has nothing to do with row or column-ordering.

World and Camera spaces

There are four coordinate systems when working with 3D points and poses. See the very first figure on top of the article.

  • World system: the global point cloud
  • Camera system: the local point cloud seen by the camera
  • Normalized coordinate system: points in the image plane, before applying the camera matrix
  • Screen coordinate system: image points in pixels

Since images are stored as 2D arrays, the coordinates of a pixel represent the indices of the corresponding matrix element. That’s why the top-left pixel is at (0,0). The vertical axis is called Y and points downwards, while the horizontal axis is called X and points rightwards. This is the most widely used screen coordinate system. All the libraries mentioned here are using it.

Screen coordinates — Figure by the author

The camera orientation used by OpenCV and Colmap is pretty standard and allows easy conversion between the camera system and the screen coordinate system, because the X-RIGHT and Y-DOWN axes are already aligned.

There’s no reason why the world and camera systems can’t be different. For instance, we could decide to have both a world with a Z-UP because it makes more sense to refer to the altitude using Z and a camera with Y-DOWN because it’s easier to convert to pixels.

Be careful though. Each library has its precise expectations and they must all be satisfied to obtain valid results. For instance, we might think it’s OK to pass a point cloud and camera poses defined in a Y-DOWN world when using Colmap. The point cloud will look good since we can’t visually check the rotation around the vertical, but all the cameras will be flipped if by mistake we used X-LEFT instead of X-RIGHT.

Camera pose conversion

Let’s see how to convert a camera pose from a new coordinate system to a reference one. For that we need to introduce convenient notations. Letters w and c stand for world and camera, while the indices indicate the axes convention, e.g. 3 could mean LEFT, UP, FRONT.

N.B. The same index can be shared by several spaces because it’s only referring to an axes convention. It is by no means an enumeration like World 1, World 2, and World 3. We can have w1 and c1 if both world and camera agree on the same axes definition. But because we’re handling the general case where each space potentially has a different convention, we need to introduce a new index for each of them. I could have used words like w_opencv or c_unity but it makes the equations harder to read.

As seen previously, each 3D point can be expressed in the new coordinate system by applying the appropriate transition matrix.

By definition, the camera-to-world pose defined in the new coordinate system transforms vectors from c4 to w3.

Warning: do not confuse the world-to-camera and the camera-to-world poses. Each one is the inverse of the other.

It only remains to take into account the geometric transform that has been applied to the 3D points. We can then deduce the camera pose expressed in the reference system.
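The whole derivation can be condensed into a hedged sketch (names are hypothetical; T_world maps new-world coordinates w3 to reference-world coordinates w1, and T_cam maps new-camera coordinates c4 to reference-camera coordinates c2):

```python
import numpy as np

def convert_pose(R_new, t_new, T_world, T_cam):
    """Camera-to-world pose in the new conventions -> reference conventions.

    p_w1 = T_world @ (R_new @ p_c4 + t_new), with p_c4 = T_cam.T @ p_c2.
    Hence the rotation is sandwiched between the two transitions, while
    the translation only gets the world-side transition.
    """
    R_ref = T_world @ R_new @ T_cam.T  # T_cam is orthogonal: inv == transpose
    t_ref = T_world @ t_new
    return R_ref, t_ref
```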

A skilled reader might object that there must be something wrong since the transition between c4 and c2 is somehow missing in the new translation. However, there’s no shift between c4 and c2; the way we consider the camera axes doesn’t change their position inside the global world. No worries then!

N.B. Mapping a matrix from one coordinate system to another implies two transformations. It’s not just a simple multiplication like with a 3D vector. The matrix we’re looking for takes a point from c2 as an input, maps it to c4, applies the input matrix and then converts the result from w3 to w1. It makes no difference whether there is a change of handedness or not. If you still have doubts, imagine that you are observing a rotation in a mirror. First you understand that you’re viewing a flipped world, then you observe the rotation and finally you interpret it back in the real unflipped world.

When the world and camera spaces share the same convention, the conversion can be simplified, using a single transition matrix. For the sake of simplicity, we can use the subscript wc3 instead of w3c3 when both w and c share the same index.

Example 1

Let’s convert an OpenCV camera pose to OpenGL while keeping the same global world with Y-UP.

For instance this case occurs when OpenCV is used to infer a pose from OpenGL 3D points. Indeed, the resulting OpenCV pose must then be converted back to OpenGL. (See for example this old but very helpful answer from a friend and former colleague of mine on the OpenCV forum.)

  • World Reference 1: RIGHT, UP, BACK (OpenGL)
  • Camera Reference 1: Column-major / RIGHT, UP, BACK (OpenGL)
  • New world system 1: RIGHT, UP, BACK (OpenGL)
  • New camera system 2: Row-major / RIGHT, DOWN, FRONT (OpenCV)

N.B. The world space doesn’t really care about how the rotations are stored in memory, that’s why I only mentioned it for the camera system.

Since we decided to keep the point cloud unchanged, the transition will only be applied to the right part of the rotation matrix.

As seen previously, the transition matrix is obtained by expressing the OpenCV axes inside the OpenGL basis.

Transition matrix from OpenCV (2) to OpenGL (1)

We can reuse previous equations to retrieve the OpenGL camera pose. Note the transposition due to the change from column to row-major.
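As a sketch (assuming camera-to-world poses in the column-vector convention), this whole example boils down to a single right-multiplication by diag(1, -1, -1):

```python
import numpy as np

# OpenCV camera axes (RIGHT, DOWN, FRONT) expressed in the OpenGL camera
# basis (RIGHT, UP, BACK), stacked as columns. The matrix is its own inverse.
CV_TO_GL = np.diag([1.0, -1.0, -1.0])

def opencv_to_opengl_pose(R_cv, t_cv):
    """Camera-to-world OpenCV pose -> OpenGL pose, world axes unchanged."""
    # The world is kept as-is, so only the camera side of the rotation is
    # converted and the translation is untouched.
    return R_cv @ CV_TO_GL, t_cv

# An identity OpenCV pose becomes diag(1, -1, -1) in OpenGL. Remember to
# transpose (or flatten with order="F") before filling column-major OpenGL
# buffers.
R_gl, t_gl = opencv_to_opengl_pose(np.eye(3), np.zeros(3))
assert np.allclose(R_gl, np.diag([1.0, -1.0, -1.0]))
```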

Example 2

Let’s have a look at a more challenging conversion: from a sensor (new), with camera Y-FRONT and world X-UP, to Pytorch3D (reference).

N.B. Even though nobody uses a world with X-UP, this choice generates transition matrices that aren’t symmetric. That way, a misunderstanding that produces the transpose of the expected transition matrix is easy to spot.

  • World Reference 1: LEFT, UP, FRONT (Pytorch3D)
  • Camera Reference 2: Column-major / LEFT, UP, FRONT (Pytorch3D)
  • New world system 3: UP, RIGHT, FRONT (X-UP)
  • New camera system 4: Row-major / LEFT, FRONT, DOWN (Y-FRONT)

The two transition matrices are as follows:

Finally, we get:

N.B. By definition, the inverse of an orthogonal matrix is equal to its transpose. Thus, all the inverse operations used here can be replaced with transpositions.
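A sketch of this second example, building both transition matrices column by column (function and variable names are hypothetical):

```python
import numpy as np

# New world axes (UP, RIGHT, FRONT) expressed in the Pytorch3D world
# (LEFT, UP, FRONT): UP -> (0,1,0), RIGHT = -LEFT -> (-1,0,0), FRONT -> (0,0,1).
T_WORLD = np.column_stack([[0.0, 1.0, 0.0],
                           [-1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0]])

# New camera axes (LEFT, FRONT, DOWN) expressed in the Pytorch3D camera
# (LEFT, UP, FRONT): LEFT -> (1,0,0), FRONT -> (0,0,1), DOWN -> (0,-1,0).
T_CAM = np.column_stack([[1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0],
                         [0.0, -1.0, 0.0]])

def sensor_to_pytorch3d_pose(R_new, t_new):
    """Camera-to-world sensor pose -> Pytorch3D conventions."""
    return T_WORLD @ R_new @ T_CAM.T, T_WORLD @ t_new

# Sanity check: both transitions happen to be pure rotations (determinant +1).
assert np.isclose(np.linalg.det(T_WORLD), 1.0)
assert np.isclose(np.linalg.det(T_CAM), 1.0)
```

With an identity sensor rotation, the resulting Pytorch3D camera FRONT (third column) ends up on (-1, 0, 0), i.e. the sensor world's RIGHT axis, as expected since the sensor camera's FRONT is its Y-axis.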

Conclusion

I hope you enjoyed reading this article and that it helped you decipher the mysteries of camera poses!

https://github.com/ThomasParistech

