Understanding 360 images 🌎

Thomas Rouch
Check & Visit — Computer Vision
12 min read · Sep 5, 2022

Equations and code behind conversion between equirectangular images and cubemaps. It’s all about defining the correct transformation maps for OpenCV.

Photo by Mario Losereit on Unsplash

1) Spherical cameras

Introduction

On average we, as humans, can see about 180° around us, which means that we can only observe the parts of the scene that are in front of us. As a consequence, we have to turn around to see what is behind our backs.

As shown in the table below, cameras also usually do not exceed a field of view (FOV) of 180°.

Angles of View of Manfrotto lenses — From https://www.manfrottoimaginemore.com/2015/08/20/taking-a-closer-look-at-the-wide-angle-lenses/

On the other hand, 360 cameras are able to directly capture the light coming from all directions, covering a full sphere. This can be achieved for instance by using two back-to-back fisheye lenses with a FOV slightly larger than 180° (like the Ricoh Theta). That way the two views can be stitched together into a single spherical image.

The adjective 360 sounds like a misnomer, since we’re capturing light through a 3D sphere and not a 2D circle. The term probably extends the idea of a stitched panorama, which can already be made with a standard smartphone: the panorama connects the end with the beginning to achieve a 360° view. Thus, when referring to a 360 image it doesn’t really matter if the bottom or the top of the spherical image has been cropped for aesthetic reasons, e.g. when hiding the tripod.

From now on we will talk about spherical or equirectangular images, and not 360 images that are potentially cropped vertically.

Spherical camera model

In the standard pinhole camera model, the scene is projected onto a 2D image plane lying in front of the camera. The focal length, i.e. the distance between the camera and the image plane, allows us to adjust the FOV.

Unfortunately, this model isn’t suitable for seeing objects behind us: the FOV only approaches 180° as the focal length tends towards 0.

In the spherical camera model, the scene is projected on the unit sphere around the camera instead of an image plane. The sphere is then unwrapped into a flat rectangular image, which is similar to representing the earth with a flat 2D map.

Let’s define how to transform 3D world coordinates into 2D image coordinates.

We must first convert from cartesian to spherical coordinates to estimate the angle from which a 3D point is viewed. The figure below shows the radius r and the two angles θ and φ of a given point M.

Spherical coordinates with Theta around the Y axis and Phi around the X axis — Figure by the author
Conversion between cartesian and spherical coordinates

I’ve arbitrarily chosen to use the Right_Down_Front XYZ camera convention (See my previous article about camera poses) and to have θ=φ=0 in front of us. Feel free to use another convention. You get the same image at the end anyway. I just find it more convenient.
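Here’s a minimal NumPy sketch of this conversion under the convention above (the function names are mine; the sign choices are picked so that φ = π/2 points towards the floor, i.e. the +Y axis):

import numpy as np

def cartesian_to_spherical(x, y, z):
    """Right_Down_Front convention, with theta = phi = 0 straight ahead (+Z)."""
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arctan2(x, z)                    # yaw, around the Y (down) axis
    phi = np.arctan2(y, np.sqrt(x**2 + z**2))   # pitch, around the X (right) axis
    return r, theta, phi

def spherical_to_cartesian(r, theta, phi):
    x = r * np.cos(phi) * np.sin(theta)
    y = r * np.sin(phi)
    z = r * np.cos(phi) * np.cos(theta)
    return x, y, z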

Unwrapping the sphere onto a flat image is done by using the φ angle as the y-coordinate of our image and θ as the x-coordinate. The conversion from radians to image space is just an affine transform. Note that this mapping depends on how the θ and φ angles have been defined.

Mapping from spherical coordinates (θ,φ) to image coordinates (u,v)

Two points with the same θ and φ will be projected onto the same pixel. The radius is only used to handle occlusions. The resulting image has an aspect ratio of 2. The width of the image represents 360° while the height represents 180°, i.e. from top to bottom.
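For reference, here’s a small sketch of that affine mapping, assuming θ spans [-π, π) across the width and φ spans [-π/2, π/2] across the height. The sub-pixel subtleties (pixel centers, the 0.5 offset) are discussed further down:

import numpy as np

def spherical_to_image(theta, phi, width, height):
    """Affine mapping from angles to image coordinates: theta in [-pi, pi)
    spans the width and phi in [-pi/2, pi/2] spans the height."""
    u = (theta + np.pi) / (2.0 * np.pi) * width
    v = (phi + np.pi / 2.0) / np.pi * height
    return u, v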

Example

It’s time to have a look at a real equirectangular image! We can see the whole room at once. The center of the image corresponds to the point lying in front of the camera. A weird artifact is that the very bottom of the image is reduced to a single point on the floor (same thing for the top): whatever the value of θ, a value of φ equal to π/2 always corresponds to the south pole of the sphere.

Equirectangular image of a room — Image by the author

Notice how the left and right edges of the image smoothly continue into each other. Extracting the right half of the image and putting it on the left is equivalent to making the camera point backward. More broadly, rolling the image horizontally corresponds to adding a θ-offset. However, the image is spherical and not toroidal, so it can’t be rolled vertically to mimic a φ-offset.
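In code, such a 180° rotation boils down to a single np.roll call (sketch with a placeholder image):

import numpy as np

equi = np.zeros((1024, 2048, 3), dtype=np.uint8)   # placeholder equirectangular image

# Rolling horizontally by half the width turns the camera around by 180°,
# i.e. adds a theta-offset of pi. Rolling vertically would NOT give a valid phi-offset.
rotated = np.roll(equi, shift=equi.shape[1] // 2, axis=1)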

Equirectangular image of a room, rotated by 180° — Image by the author
Photo by Milad Fakurian on Unsplash

2) Cubemap

Introduction

Equirectangular images may not be the best format for video games or virtual reality applications, mostly because they suffer from strong distortion, especially around the poles, and thus store redundant data.

Cubemaps are usually the preferred choice. A cubemap is a textured cube representing the front, left, back, right, up and down views, which can easily be rendered in OpenGL. The scene is projected onto the unit cube around the origin instead of the unit sphere and can be split into a list of 6 images. The image below shows the six faces of a cubemap.

6 faces of a cubemap — Figure by the author

The image below is the equirectangular image corresponding to the previous cubemap. The letters help to get a better sense of the distortion involved, which is particularly striking for the up and down faces.

Equirectangular view of the cubemap with annotated faces — Figure by the author

Here’s what it looks like when we fold a real scene into a cubemap. As you can see, the distortion is gone and horizontal lines now remain parallel. The bottom of the door might look wrong at first glance, but it’s only because the cubemap has been unwrapped into a flat image. The observer must be located at the center of the textured cube to see the scene correctly.

Cubemap of the room presented previously - Image by the author
Photo by Alex Padurariu on Unsplash

3) Conversion between the two formats

Introduction

Converting an equirectangular image into a cubemap is equivalent to projecting the unit sphere onto the unit cube.

Each point on a cubeface defines a ray coming from the origin. Normalizing this ray results in a point on the unit sphere.

We need to solve an image sampling problem.

For each cubemap pixel, we can compute the corresponding 2D floating-point coordinates θ and φ on the equirectangular image and estimate its color by interpolation.

cv2.remap or scipy.ndimage.map_coordinates allow us to perform such sampling from maps of floating-point coordinates.
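To make the role of these maps concrete, here’s a toy cv2.remap example, unrelated to spherical images, that simply downsamples an image by a factor of 2:

import cv2
import numpy as np

src = np.random.randint(0, 255, (100, 200, 3), dtype=np.uint8)   # toy source image

# For each destination pixel (u, v), map_x[v, u] and map_y[v, u] give the
# floating-point location at which the source image is interpolated.
map_y, map_x = np.indices((50, 100), dtype=np.float32)
map_x *= 2.0   # here: simply read every other source column...
map_y *= 2.0   # ...and every other source row

dst = cv2.remap(src, map_x, map_y, interpolation=cv2.INTER_LINEAR)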

The reasoning remains the same when the conversion is done the other way around, i.e. from cubemap to equirectangular image.

Sub-pixel sampling

When computing the transformation maps required by cv2.remap, we need to sample the destination image plane to evaluate the corresponding floating-point coordinates in the source image. Let’s make sure we have a good basis for what’s to come.

As Alvy Ray Smith (Pixar Co-Founder) once said, “A Pixel Is Not a Little Square!”. Indeed, pixels of an image are just samples of the underlying continuous image function on a discrete grid of points.

It’s convenient to define pixels by the discrete integer coordinates of their top-left corner. However, when working with continuous floating-point coordinates, reference should be made to the center of the pixel, which is offset by 0.5.

When looking at the figure below it’s obvious that the blue sampling on the last row is the best option to correctly sample the x-coordinates of a row of 5 pixels. But the red and green sampling strategies can be very tempting! If we want to duplicate horizontally two samplings of [0,5] to simulate a sampling of [0,10], the red option is a really poor choice since the point along the border — at 5 — will be sampled twice.

  • Red: x = np.linspace(0, n, n)
  • Green: x = np.arange(n)
  • Blue: x = np.arange(n) + 0.5
Three different ways of sampling a 1D array of length 5 — Image by the author

The θ angle varies along the x-axis of an equirectangular image from -π to π, but the first pixel should have a θ slightly larger than -π, while the last pixel should have a θ slightly smaller than π. The script below helps evaluate the pixel samples of a continuous affine function like θ, without using the wrong np.linspace(-np.pi, np.pi, width) approach.
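Here’s a minimal sketch of that idea (the helper name sample_affine is mine):

import numpy as np

def sample_affine(start: float, end: float, n: int) -> np.ndarray:
    """Evaluate an affine ramp from `start` to `end` at the centers of n pixels.
    Unlike np.linspace(start, end, n), neither endpoint is sampled exactly."""
    step = (end - start) / n
    return start + (np.arange(n) + 0.5) * step

width = 2048   # hypothetical equirectangular width
theta = sample_affine(-np.pi, np.pi, width)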

OpenCV documentation doesn’t specify the convention for the continuous floating-point coordinates passed to cv2.remap, but it seems that it uses pixels centered around the integer coordinates (See this GitHub issue). As a consequence, we need to subtract 0.5 before passing the maps to cv2.remap.

Equirectangular image to cubemap

As explained previously, for each cubemap pixel, we can compute the corresponding 2D floating-point coordinates θ and φ on the equirectangular image and estimate its color by interpolation.

We could solve it in the general case with free θ and φ offsets between cubefaces. However, a cubemap offers very specific cases where the angular offsets are multiples of π/2. That’s why I suggest handling cubeface extraction on a case-by-case basis.

Let’s take a look at the RIGHT cubeface. The up, bot, left and right subscripts refer to the 2D coordinate system inside the cubeface image. For instance, the right cubeface image is on the plane x=1 and its up edge is at y=-1.

XYZ bounds of the RIGHT cubeface

The script below does the following steps:

  • List the x,y,z bounds of each cubeface (like what we’ve just done for the right cubeface)
  • Compute the corresponding θ and φ angle maps
  • Convert them to 2D floating-point coordinates in the equirectangular image
  • Extract the cubeface by sampling the equirectangular image
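Here’s a minimal sketch of those steps. The helper names and the exact orientation chosen for each face are mine; they follow the Right_Down_Front convention used throughout, with the camera at the center of the cube:

import cv2
import numpy as np

# In-face pixel grid (u, v) -> 3D direction, for each cubeface.
# These orientations are one consistent choice; flip u or v if your layout differs.
FACE_TO_XYZ = {
    "front": lambda u, v: (u, v, np.ones_like(u)),
    "right": lambda u, v: (np.ones_like(u), v, -u),
    "back":  lambda u, v: (-u, v, -np.ones_like(u)),
    "left":  lambda u, v: (-np.ones_like(u), v, u),
    "up":    lambda u, v: (u, -np.ones_like(u), v),
    "down":  lambda u, v: (u, np.ones_like(u), -v),
}

def equirect_to_cubeface(equi: np.ndarray, face: str, face_size: int) -> np.ndarray:
    """Extract one cubeface image from an equirectangular image."""
    # Pixel-centered samples in [-1, 1] along the cubeface axes
    ticks = (np.arange(face_size) + 0.5) / face_size * 2.0 - 1.0
    u, v = np.meshgrid(ticks, ticks)          # u: left to right, v: top to bottom
    x, y, z = FACE_TO_XYZ[face](u, v)

    # Spherical angles, then float32 coordinates in the equirectangular image
    theta = np.arctan2(x, z)
    phi = np.arctan2(y, np.sqrt(x**2 + z**2))
    h, w = equi.shape[:2]
    map_x = ((theta + np.pi) / (2.0 * np.pi) * w - 0.5).astype(np.float32)
    map_y = ((phi + np.pi / 2.0) / np.pi * h - 0.5).astype(np.float32)

    return cv2.remap(equi, map_x, map_y,
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)

# faces = {name: equirect_to_cubeface(equi, name, 256) for name in FACE_TO_XYZ}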

This script has above all a pedagogical purpose. We could leverage the symmetry between the cubefaces to generate the θ and φ maps in a faster way and/or stack the θ and φ maps horizontally to perform a single cv2.remap call.

Note the use of cv2.BORDER_WRAP as the border condition. Ideally, we would like to have different border conditions on x and y since the image isn’t toroidal, but the impact at the poles is negligible.

Cubemap to equirectangular image

This one is a bit trickier. Previously, 2D coordinates in the equirectangular image were projected onto a single cubeface image. But now we have to handle all the cubemap images at once, which means that a 2D point is mapped to a 3D index into the list of cubeface images, i.e. the face ID plus the uv coordinates inside that cubeface.

N.B.: The following part is largely inspired by the py360convert GitHub repository. Here are the main differences:
- A constant FOV of 90° is used rather than allowing it to vary, to stick to the canonical cubemap example.
- Since the sampling task doesn’t depend on the color channel, I don’t like the idea of calling the scipy.ndimage.map_coordinates function repeatedly for the R, G and B channels. It’s best to avoid doing the same thing three times.
- Symmetries are leveraged to speed up the redundant generation of the x and y maps.
- Padding around cubefaces makes the code difficult to read and is left as an exercise for the reader.
- Better subpixel handling

Since the equirectangular image will be generated from a single remapping call, we need to find a way to generate coordinate-maps that also store the cubeface ID.

With this in mind, the function scipy.ndimage.map_coordinates could be a great fit, since it can be fed with a map of face IDs, a map of x-coordinates and a map of y-coordinates. However, as explained previously, it doesn’t allow sharing the coordinate interpolation across the 3 color channels.

Using cv2.remap requires putting all the cubefaces inside a single image. An easy solution is to stack them horizontally like in the image below. The x-coordinate stores both the cubeface ID and the local x-coordinate.

Cubefaces stacked horizontally - Figure by the author
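In code, this boils down to a single np.hstack call. The stacking order assumed here and in the sketches below is front, right, back, left, up, down, which matches the UP and DOWN offsets mentioned later:

import numpy as np

face_size = 256
# Hypothetical dict of six face images, each of shape (face_size, face_size, 3)
faces = {name: np.zeros((face_size, face_size, 3), dtype=np.uint8)
         for name in ["front", "right", "back", "left", "up", "down"]}

# After stacking, the x-coordinate encodes both the face ID (x // face_size)
# and the local column inside that face (x % face_size).
stacked = np.hstack([faces[k] for k in ["front", "right", "back", "left", "up", "down"]])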

As you might have already noticed in the previous figure representing the equirectangular view of the cubemap with annotated faces, once you know how to project the FRONT and the UP faces you can project all the faces using translations and flips.

Setting the width of the output equirectangular image to a multiple of 8 ensures that translating the image by a half-cubeface can be done by rolling the image by an integer number of pixels. Remember that the BACK cubeface is split in half.

Let’s start by generating a boolean mask corresponding to the projection of the UP cubeface onto the equirectangular image. The intersection between the FRONT and the UP cubefaces corresponds to the 3D edge defined by x in [-1,1], y=-1 and z=1. By reusing the equations defined previously we can express φ as a function of θ on this edge.

Thus we get the equation of the edge, φ = -arctan(cos θ), for θ in [-π/4, π/4] (arctan is odd).

As illustrated in the image below, the following script generates the mask above this projected edge, repeats it 4 times and rolls the image by a half-cubeface.
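Here’s a minimal sketch of those three steps (the function name and the pixel-center sampling are mine):

import numpy as np

def up_face_mask(height: int, width: int) -> np.ndarray:
    """Boolean mask of the equirectangular pixels belonging to the UP cubeface.
    `width` is assumed to be a multiple of 8."""
    quarter = width // 4
    theta = -np.pi / 4 + (np.arange(quarter) + 0.5) * (np.pi / 2) / quarter
    phi = -np.pi / 2 + (np.arange(height) + 0.5) * np.pi / height

    # Mask above the FRONT/UP edge ("above" means smaller phi, since phi grows downwards)
    phi_edge = -np.arctan(np.cos(theta))
    mask = phi[:, None] < phi_edge[None, :]

    # Repeat for the 4 lateral faces and roll by a half-cubeface to align the
    # face boundaries with theta = ±pi/4, ±3pi/4
    return np.roll(np.tile(mask, (1, 4)), shift=width // 8, axis=1)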

Steps to get the projection of the UP cubeface on the equirectangular image (phi = -arctan(cos(theta))). Yellow for True and Purple for False. - Figure by the author

Now that the UP mask is known, we can generate the x and y maps for the FRONT, RIGHT, BACK and LEFT faces on the entire image and filter them afterwards. The script below computes the 2D coordinates of the FRONT cubeface (z=1) and translates them to get the other faces.
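Here’s a sketch of that computation. Instead of computing the FRONT map and translating it, it folds θ into [-π/4, π/4) so that the same FRONT-face formulas apply to all four lateral faces, which produces equivalent maps (the names are mine):

import numpy as np

def lateral_xy_maps(face_size: int, height: int, width: int):
    """Float32 coordinate maps into the stacked cubefaces for the FRONT (0),
    RIGHT (1), BACK (2) and LEFT (3) faces, evaluated on the whole grid."""
    theta = -np.pi + (np.arange(width) + 0.5) * 2.0 * np.pi / width
    phi = -np.pi / 2 + (np.arange(height) + 0.5) * np.pi / height
    theta, phi = np.meshgrid(theta, phi)

    # Lateral face addressed by each column: 0=front, 1=right, 2=back, 3=left
    face_id = np.mod(np.floor((theta + np.pi / 4) / (np.pi / 2)), 4)
    theta_local = np.mod(theta + np.pi / 4, np.pi / 2) - np.pi / 4

    # Intersection of the viewing ray with the face plane at depth 1
    u = np.tan(theta_local)                  # in [-1, 1]
    v = np.tan(phi) / np.cos(theta_local)    # invalid above/below the face, masked later

    x_map = face_id * face_size + (u + 1.0) / 2.0 * face_size - 0.5
    y_map = (v + 1.0) / 2.0 * face_size - 0.5
    return x_map.astype(np.float32), y_map.astype(np.float32)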

The script above generates the following y-map for input cubefaces of size 256x256. The pattern of the cubefaces is easy to spot. The y-value is also computed for invalid values of φ that fall outside the cubeface, and it blows up near the first and last rows because of the tangent. But it doesn’t matter since the UP mask will filter them out.

y-map for the FRONT, RIGHT, BACK and LEFT cubefaces — Figure by the author

As for the x-map there’s no way to notice the cubemap pattern. x=0 starts on the left edge of the FRONT cubeface and keeps increasing until the right edge of the LEFT cubeface at x = 4*256 - 1.

x-map for the FRONT, RIGHT, BACK and LEFT cubefaces — Figure by the author

Let’s do the same for the UP and DOWN faces. The script below computes the 2D coordinates of the UP cubeface (y=-1) and flips them to get the DOWN face. Note that the +4*face_size and +5*face_size offsets skip the four lateral cubefaces so that the x-coordinates land inside the UP and DOWN faces of the stacked image.
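Here’s a matching sketch. Instead of computing the UP maps and flipping them, it evaluates the UP formulas on the upper half-sphere and the DOWN formulas on the lower one; the face orientations follow the same assumptions as in the extraction sketch:

import numpy as np

def updown_xy_maps(face_size: int, height: int, width: int):
    """Float32 coordinate maps into the stacked cubefaces for the UP (index 4)
    and DOWN (index 5) faces. Values outside the polar regions are meaningless
    and get filtered out by the masks afterwards."""
    theta = -np.pi + (np.arange(width) + 0.5) * 2.0 * np.pi / width
    phi = -np.pi / 2 + (np.arange(height) + 0.5) * np.pi / height
    theta, phi = np.meshgrid(theta, phi)

    is_up = phi < 0.0
    # UP face (y=-1):   u = -sin(theta)/tan(phi), v = -cos(theta)/tan(phi)
    # DOWN face (y=+1): u = +sin(theta)/tan(phi), v = -cos(theta)/tan(phi)
    u = np.where(is_up, -1.0, 1.0) * np.sin(theta) / np.tan(phi)
    v = -np.cos(theta) / np.tan(phi)

    # Skip the 4 lateral faces in the stack (4 for UP, 5 for DOWN)
    offset = np.where(is_up, 4, 5) * face_size
    x_map = offset + (u + 1.0) / 2.0 * face_size - 0.5
    y_map = (v + 1.0) / 2.0 * face_size - 0.5
    return x_map.astype(np.float32), y_map.astype(np.float32)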

x-map for the UP and DOWN cubefaces — Figure by the author
y-map for the UP and DOWN cubefaces — Figure by the author

We can now merge the resulting xy-maps using the UP mask and then call cv2.remap to generate the equirectangular image.
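Putting the pieces together, reusing the up_face_mask, lateral_xy_maps and updown_xy_maps helpers sketched above (so this is only a sketch of the overall flow):

import cv2
import numpy as np

def cubemap_to_equirect(stacked_faces: np.ndarray, face_size: int,
                        height: int, width: int) -> np.ndarray:
    """Assemble the equirectangular image from the horizontally-stacked cubefaces."""
    up = up_face_mask(height, width)
    down = np.flipud(up)                     # DOWN mask is the vertical mirror of UP
    x_lat, y_lat = lateral_xy_maps(face_size, height, width)
    x_ud, y_ud = updown_xy_maps(face_size, height, width)

    polar = up | down
    map_x = np.where(polar, x_ud, x_lat).astype(np.float32)
    map_y = np.where(polar, y_ud, y_lat).astype(np.float32)
    return cv2.remap(stacked_faces, map_x, map_y, interpolation=cv2.INTER_LINEAR)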

N.B.: Artifacts might arise when cv2.remap interpolates along the edge of a cubeface, since the faces have been stacked horizontally even though neighbors in the stack do not necessarily share a physical edge. This can easily be fixed by padding each cubeface by 1 pixel in each direction. You can use a mean to estimate the padded value in the corners.

For instance, the BACK cubeface should be padded like the following:
- At the top: the first row of the UP cubeface
- At the bottom: the last row of the DOWN cubeface
- On the left: the first column of the RIGHT cubeface
- On the right: the last column of the LEFT cubeface

Conclusion

I hope you enjoyed reading this article and that it gives you more insights on how equirectangular images and cubemaps actually work!

If you really want to speed up your conversions, you can pre-compute the transformation maps once and reuse them to convert a batch of images of the same shape. It’s also possible to call cv2.convertMaps to convert the maps to a more compact and faster fixed-point representation.

https://github.com/ThomasParistech

