How we made the Audi AI:ME’s flying VR experience — Part 2

Optimizing photogrammetry data for VR using the realities.io Pipeline Editor

David Finsterwalder
Realities.io
13 min read · Jan 31, 2020


A textured mesh and its underlying decomposition of the individual parts

In Part 1, we described how we went about capturing the large amount of data for Audi’s AI:ME VR experience.

In this article, we will describe how we optimized the photogrammetry data processed with RealityCapture inside our RIO Pipeline Editor. We will not go into detail on the photogrammetry processing itself, as our capture workflow made alignment robust and processing straightforward, as described in Part 1.

If you want to learn photogrammetry basics, see Azad’s “Getting started with Photogrammetry — with a Smartphone camera” guide, or look at Getting started with RealityCapture if you want to learn RealityCapture. While the processes described here are done with our internal tool, which is not commercially available at this point and is offered only via our services, we hope that sharing our overall workflow can still be valuable for people working with large-scale photogrammetry data.


Background

When I began working with photogrammetry data in 2013, I often spent a lot of time in different tools (Blender, etc.) optimizing the data for use in real-time 3D engines. While such optimizations also help when rendering on a PC, it became obvious that VR experiences running on standalone devices (the Oculus Quest, or the Gear VR back then) in particular required a lot of work to find a good compromise between quality and performance.

By learning, iterating, and working on photogrammetry for award-winning VR films over the past 4 years, we found it necessary to build a highly modular Photogrammetry Pipeline editor to streamline the process of optimizing large-scale photogrammetry scans for VR.

The major benefit of our pipeline (apart from making our lives much easier) is that we can easily adjust the output to service a huge variety of use-cases and performance targets from a single, high-detail input mesh. With the continued growth of AR/VR devices running on mobile chipsets (Oculus Quest, Hololens, Magic Leap), we’re glad to see the work we started back in 2016 paying off.

Many of the problems we are solving apply to photogrammetry data (or even large meshes) in general, so I will start with those general problems first.

Optimizations of Scanned Data for VR

Splitting up large meshes

First of all, you usually don't want to have a photogrammetry scene as one large mesh. Some parts of the scene might be hidden, which means loading unnecessary data (especially in large indoor scans). A single large draw call can also cause stalls (especially on mobile devices, where this can drop multiple frames at once). For those reasons, splitting up the mesh is often mandatory for hitting performance targets.

We first started by splitting meshes into even spatial segments, and then with an octree based on the number of triangles in each node. However, the area of equal-triangle “chunks” can vary a lot throughout a large-scale scan. Since all chunks have the same texture resolution, the huge differences in chunk size made the resolution per area very inconsistent.

A worst-case scenario of segmentation using a k-d tree. Each color represents a different chunk and material

Because of this, we moved to a k-d tree that respected triangle area to split the scene into roughly even-area chunks. An issue with this was that the boundaries between chunks could be large. Since the boundaries are fixed for LODs (more on this later), this caused problems, as some chunks couldn't be reduced properly.

K-medoid Voronoi segmentation visualized. Image is blended between random material colors for each chunk on the left and the textured mesh (unlit shaded) on the right. Each chunk is fairly evenly sized in terms of surface area (some exceptions later) and has its own Level-of-Detail (LOD) meshes. Each chunk has its own texture unwrap and is seamlessly connected to its neighbor (also true for the different LOD levels, where the seams between chunks are preserved).

To also make sure that we have small boundaries between chunks, we ended up developing a k-medoid Voronoi mesh segmentation. A Voronoi decomposition gives you very few boundary edges between chunks, solving the LOD issue we faced. However, to still end up with fairly even areas for each chunk, the seeds of the Voronoi decomposition are optimized iteratively for even areas. While some variance in chunk area remains, it is usually small enough not to be noticeable. So far, we haven’t encountered noticeable differences in perceived texture resolution in our productions.
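To make the idea more concrete, here is a heavily simplified sketch of an area-balanced Voronoi segmentation over triangle centroids. It uses plain Euclidean distance and NumPy; our actual implementation works on mesh connectivity and differs in the details, so treat the function and parameter names as illustrative.

```python
import numpy as np

def voronoi_segment(centroids, areas, n_chunks=16, iterations=25, seed=0):
    """Assign each triangle (given by centroid + area) to one of n_chunks."""
    rng = np.random.default_rng(seed)
    seeds = centroids[rng.choice(len(centroids), n_chunks, replace=False)]
    bias = np.zeros(n_chunks)            # distance penalty for oversized chunks
    target = areas.sum() / n_chunks      # ideal surface area per chunk

    for _ in range(iterations):
        # Voronoi assignment: nearest (penalised) seed per triangle centroid.
        dist = np.linalg.norm(centroids[:, None, :] - seeds[None, :, :], axis=2)
        labels = np.argmin(dist + bias[None, :], axis=1)

        # Move each seed to the member triangle closest to the chunk's
        # area-weighted centroid (a cheap stand-in for a true medoid update).
        for k in range(n_chunks):
            members = centroids[labels == k]
            if len(members) == 0:
                continue
            w = areas[labels == k] / areas[labels == k].sum()
            mean = (members * w[:, None]).sum(axis=0)
            seeds[k] = members[np.argmin(np.linalg.norm(members - mean, axis=1))]

        # Increase the penalty of chunks that collected too much surface area,
        # so the next assignment shrinks them toward the target area.
        chunk_area = np.bincount(labels, weights=areas, minlength=n_chunks)
        spacing = np.median(dist.min(axis=1)) + 1e-9
        bias += 0.5 * spacing * (chunk_area - target) / target
    return labels
```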

Frustum and Occlusion Culling

One advantage of having the assets split into chunks is that you can make use of View Frustum Culling to reduce the amount of data that needs to be rendered. This means that only objects within the camera's view frustum get sent to the GPU. There are also more advanced techniques like Occlusion Culling, which additionally culls (hides) meshes that are inside the camera's view frustum but hidden behind another object. While in most outdoor scenarios View Frustum Culling is more than enough and the overhead of Occlusion Culling is not worth it, the latter can be a big boost for detailed indoor environments with several rooms. Both common game engines, Unreal and Unity, have their own implementations of Frustum and Occlusion Culling, and our assets work fine with both engines (Unreal documentation / Unity documentation).

Camera Frustum Culling: only the parts of the scene where the camera is looking are rendered (philosopher pun: solipsistic rendering)
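For readers curious what that test looks like under the hood, here is a minimal sketch of culling a chunk's bounding box against the six frustum planes (with normals pointing into the frustum). Both engines already do this for you; the snippet only illustrates why per-chunk bounding volumes make the culling effective.

```python
import numpy as np

def chunk_outside_frustum(aabb_min, aabb_max, planes):
    """planes: (6, 4) rows of [nx, ny, nz, d] with normals pointing inward."""
    center = (aabb_min + aabb_max) / 2.0
    extents = (aabb_max - aabb_min) / 2.0
    for nx, ny, nz, d in planes:
        normal = np.array([nx, ny, nz])
        radius = np.dot(extents, np.abs(normal))  # box extent along the normal
        if np.dot(normal, center) + d < -radius:
            return True   # the whole box lies behind this plane -> cull it
    return False           # the box touches or intersects the frustum
```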

Level of Detail (LOD) Sub-Meshes

Another advantage of these mesh chunks is that each chunk can have its own discrete Level-of-Detail (LOD) meshes. Using LODs is pretty common in games; however, if the scene is one large mesh instead of being split into chunks (or individual objects, as is usual in games), you lose the advantage gained from LODs. When the mesh is split up, it is much closer to a typical game scene, and we can use LODs with the existing systems in game engines (both Unity and Unreal Engine support LODs natively). Our pipeline can output FBX files that ensure LODs get imported correctly in each engine.

Wireframe overlay on the left shows how the more detailed LODs are loaded, while the mesh chunks stay seamlessly connected.
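As a rough illustration of what a discrete LOD chain looks like, the sketch below decimates one chunk with Open3D's quadric decimation and writes one file per LOD level. Our pipeline uses Simplygon/InstaLOD, preserves UVs and chunk boundary edges, and exports FBX with engine-specific LOD conventions; this simplified version does none of that, and the file names are made up.

```python
import open3d as o3d

def build_lods(path, ratios=(1.0, 0.5, 0.25, 0.1)):
    mesh = o3d.io.read_triangle_mesh(path)
    full_count = len(mesh.triangles)
    for level, ratio in enumerate(ratios):
        lod = mesh if ratio == 1.0 else mesh.simplify_quadric_decimation(
            target_number_of_triangles=max(1, int(full_count * ratio)))
        # Unity, for example, recognises the "_LOD<n>" suffix on objects
        # inside an imported model file and builds an LOD group from them.
        o3d.io.write_triangle_mesh(f"chunk_LOD{level}.obj", lod)

build_lods("chunk.obj")  # hypothetical input file
```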

MipMap Streaming and Texture editing

Another advantage of our mesh segmentation is that each chunk gets its own texture, which makes it easier to stream mipmaps from individual textures. Virtual Texturing and streaming from one big texture atlas can also help with loading only the required data; however, our assets don’t need any additional plugins or engine modifications and work with both Unreal’s and Unity’s texture streaming, or without it at all (which makes working with partners much easier). Even when outputting a single UDIM with our pipeline to use as a single Virtual Texture in Unreal, the spatially separated UVs per Voronoi chunk still help to avoid loading texels that aren’t required, since UV islands tend to be in spatial proximity.

Mipmap streaming visualized. Only the higher-resolution mipmaps close to the camera get streamed in. For the chunks further away, lower-resolution mip levels are used, saving memory. The effect here is extremely exaggerated with a mip bias and an artificially low memory budget.
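To get an intuition for why distant chunks can drop to lower mips, here is a back-of-the-envelope calculation that picks the mip level whose texel footprint is roughly one screen pixel. The numbers (texel density, resolution, field of view) are illustrative assumptions, not values from the project.

```python
import math

def desired_mip(distance_m, texels_per_m=2048.0, screen_height_px=2000,
                vertical_fov_deg=90.0, max_mip=10):
    # How many screen pixels one metre of surface covers at this distance.
    pixels_per_m = screen_height_px / (
        2.0 * distance_m * math.tan(math.radians(vertical_fov_deg) / 2.0))
    texels_per_pixel = texels_per_m / pixels_per_m
    # Each mip level halves the resolution, so take log2 of the ratio.
    return min(max_mip, max(0, int(math.log2(max(texels_per_pixel, 1.0)))))

print(desired_mip(1.0))   # near the camera -> mip 0 or 1
print(desired_mip(20.0))  # far away -> a much lower-resolution mip
```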

The other advantage of this one-texture-per-chunk segmentation is that the assets are easier to handle in editing. For example, going into “edit mode” for a single chunk in Blender is much faster than editing the full mesh. For texture editing, the Voronoi decomposition leads to a low number of seams between different textures, which often means you only need to work on a single texture when editing a region of a scene.

Adaptive Detail Decimation

High-resolution photogrammetry reconstructions can result in scenes with hundreds of millions (or billions) of triangles. One major challenge is to optimize the geometry decimation to retain the details of areas-of-interest while reducing geometry in areas that don’t need it (walls, floors, surrounding mountains, etc.). This is especially important for mobile VR/AR hardware, which can handle an order of magnitude less geometry than min-spec PC VR (roughly 100k vertices vs. 1 million, respectively).

We developed a workflow where we mark areas-of-interest, whose geometry and texture density should be retained, with boxes (typically in Blender, but RealityCapture’s reconstruction region boxes work as well) before decimating further. This adaptive detail decimation workflow, which is especially important for mobile VR experiences, proved vital for the Audi AI:ME project (more on this later).

Earlier, in the mesh segmentation section, we described how we retain an even texture resolution across all chunks. However, this isn’t ideal if the photogrammetry scene is huge and the VR player will only get up close in a smaller, specified area of the environment, as you’d be dedicating a ton of texture and file size to areas which cannot be seen up close in the final VR experience.

Comparison between two meshes, each with around 90,000 vertices (for use in mobile VR or AR). Without adaptive optimizations, the mesh is decimated evenly and has evenly distributed texture resolution. The statues that are close to where the user is in VR have improved details (blue boxes), while unnecessary geometry details towards the ceiling are reduced to a sufficient amount (red box). The texture resolution is gradually reduced from the area-of-interest to the ceiling.

For adaptive optimizations, we define a simple “area-of-interest”: the space where the user will roughly be. For triangles further away, weights are calculated and used in the decimation, segmentation, and unwrap steps of our pipeline. This way, an extremely high-fidelity experience can be assured even for the most restrictive performance targets. While the large difference in quality is already quite vivid in the example above, the difference is even greater in VR, where individual polygons are even more visible because of stereoscopic depth perception.

Wireframe view of the adaptively decimated scenes with weights visualized. White (1.0) means higher priority and black (0.0) lower priority. The orange box is the mesh that is used to define the “area-of-interest”.

To make use of weighted decimation, a processing step in our pipeline calculates weights based on the defined “play area”. Any closed mesh can be imported as an area-of-interest; however, we often use boxes to reduce calculation time. For room-scale VR, the area-of-interest is usually all the accessible space 0.5 m to 2.5 m above the ground, as in the example on the left. We include walls but exclude the floor, as you typically get closer to walls than to the floor and thus more resolution is required there.
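A heavily simplified version of that weighting step could look like the following sketch: weight 1.0 inside an axis-aligned area-of-interest box, with a linear falloff outside it. Our pipeline accepts arbitrary closed meshes and uses its own falloff, so the box, the falloff distance, and the names here are illustrative.

```python
import numpy as np

def area_of_interest_weights(vertices, box_min, box_max, falloff_m=10.0):
    # Per-axis distance from each vertex to the box (zero inside the box).
    below = np.maximum(box_min - vertices, 0.0)
    above = np.maximum(vertices - box_max, 0.0)
    dist = np.linalg.norm(below + above, axis=1)
    # 1.0 inside the play area, fading to 0.0 over `falloff_m` metres.
    return np.clip(1.0 - dist / falloff_m, 0.0, 1.0)

# Example: a room-scale play area, 0.5 m to 2.5 m above the ground.
verts = np.random.rand(1000, 3) * 30.0
weights = area_of_interest_weights(
    verts, np.array([0.0, 0.5, 0.0]), np.array([5.0, 2.5, 5.0]))
```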

Texel density of the scene visualized with a grid texture. On the left: in an orthogonal view from the side, the texel density (visualized through the size of the grid) decreases towards the ceiling. On the right: viewed from the player's perspective, the texel density in screen space (the perceived size of the grid) is fairly even.

If the calculated weights are used in the Voronoi segmentation module, the resolution per area is lowered for chunks further out by making those chunks larger. However, since the required texel density can vary a lot even within a single chunk, the weights are also used in the UV unwrap module to scale triangles in UV space. If triangles are larger in UV space, the texels per area increase for those triangles. This way, texture resolution is not only adaptive per chunk but also adaptive within a chunk, which makes this feature useful in scenarios where you want a low number of chunks (to reduce draw calls).
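Conceptually, the unwrap step uses the weights roughly like in this toy sketch: each UV triangle is scaled by the square root of its weight before packing, so its texture-space area (and therefore its texel density) follows the weight. The real unwrapper works per chart and keeps shared UV edges consistent, which this version ignores.

```python
import numpy as np

def scale_uv_triangles(uv_tris, weights, min_weight=0.05):
    """uv_tris: (N, 3, 2) per-triangle UVs; weights: (N,) in [0, 1]."""
    centers = uv_tris.mean(axis=1, keepdims=True)
    # Scaling edge lengths by sqrt(w) scales triangle area by w.
    scale = np.sqrt(np.clip(weights, min_weight, 1.0))[:, None, None]
    return centers + (uv_tris - centers) * scale
```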

Batching everything together with the RIO Pipeline Editor

A lot of the optimizations described here are not necessarily unique to scanning, and solutions exist on the market (after all, our assets work with the standards in 3D engines). For example, for decimation, Simplygon and InstaLOD are great solutions (both of which we support in the pipeline). For unwrapping, standalone solutions like Unfold3D work decently as well (the pipeline supports it too).

However, anyone familiar with Photogrammetry knows how time consuming processing and optimizing photogrammetry data can be. With each step sometimes requiring hours, small mistakes can lead to large amounts of waiting. Iterating slowly on unknown variables to get the best results can be soul-crushing.

This problem gets worse when introducing multiple programs into the workflow, as there’s typically no simple way of queuing up tasks across software.

A processing pipeline was an obvious necessity for our work. Instead of reinventing the wheel, we built our pipeline top-down, licensing powerful backends and only building the parts that didn’t have pre-existing, usable solutions (as mentioned in the sections above).

The modularity of the Pipeline Editor means we can add new SDKs and backends with minimal effort. It also means that we aren’t dependent on any single solution when alternatives exist on the market. If one solution becomes unavailable, we can replace it with another module.

So while we mostly use the RealityCapture CLI and the Simplygon SDK as our main backends, our pipeline can import meshes from pretty much any photogrammetry software (Meshroom, Metashape/PhotoScan, 3DF Zephyr, etc.) that can export FBX or OBJ.

Modular / Visual Scripting Pipeline Editor

A subsection of the node graph processing a mesh with a marked area-of-interest “playspace”

What we learned during the development of our pipeline is that different customers and partners have different requirements.

Especially for partnerships in which we only provide assets, while the scans and the VR experience are done by others, flexibility is key. We need to be able to provide assets for whichever engine the partner is using, without modification, and we need to adapt to the different photogrammetry software they might be using.

A few years ago, we had developed a pipeline with fixed steps that could either be enabled or disabled. However, it became tedious when we needed to rerun only some of the steps (for example, for an additional low-poly version of an asset).

This led to a re-engineering of the code base and the creation of our RIO Pipeline Editor: a node-based graph editor where new backends can be integrated with minimal work. Being able to shuffle modules around and add others allows for much more flexibility. The editor lets us create processing jobs from individual processing steps (by connecting them as nodes). Those processing jobs can then be sent to, and queued up on, other PCs on the network that are running the RIO Queue, leading to a more distributed way of processing our data using all available machines in our office.

Since developing our Pipeline Editor, we’ve also discovered new workflows by creating different arrangements of nodes (process modules).

Let’s say we have a scan from a client but no specific target platform to process the photogrammetry for. It can be helpful to provide assets for different performance targets on a single platform to establish what the final specs of the assets should be (to maximize quality and performance while minimizing file size). All it requires from our end is changing a few variables, sending the jobs to the Queue, and waiting for all the different iterations to finish processing.

To make things even more convenient, we also have a workflow for increasing/decreasing the detail of specific areas of a scan after processing is done. By reusing the segmentation of an input mesh, our clients/partners can swap in/out chunks for higher/lower resolution iterations to better fit their performance target.

Let’s say Simplygon introduces a new processing variable and we don’t know exactly how it affects the processing. By queuing up multiple versions of the variable and processing them, we can easily explore the parameter space of a new feature without having to do so manually.
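As a purely hypothetical illustration (the job structure, the step names, and the submit call below are not the actual RIO Pipeline/Queue API, and the swept variable name is made up), such a parameter sweep could be expressed as a handful of queued job variants:

```python
# A stand-in job description: one input scan and an ordered list of steps.
base_job = {
    "input": "scan_highpoly.fbx",
    "steps": [
        {"node": "segment_voronoi", "chunks": 64},
        {"node": "decimate", "backend": "simplygon", "target_tris": 500_000},
        {"node": "unwrap"},
        {"node": "bake_textures", "resolution": 4096},
    ],
}

def submit(job):
    # Stand-in for sending a job to a queue machine on the network.
    print("queued decimation step:", job["steps"][1])

# Sweep the unknown variable across several values and queue one job each.
for value in (0.0, 0.5, 1.0, 2.0):
    job = {**base_job, "steps": [dict(step) for step in base_job["steps"]]}
    job["steps"][1]["some_new_setting"] = value  # hypothetical variable name
    submit(job)
```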

One of the larger graphs from an actual production. After LODs are created, LOD levels 2 and 4 are reused to make completely new assets. They get unwrapped and baked again (reduction can cause issues that are more visible up close). In case the original assets are too performance-heavy, our partner can use the more lightweight assets. The advantage of reusing LODs is that the alternatives have the same boundaries, so chunks can be swapped on demand.

Requirements for Audi AI:ME experience

While our features have been mainly developed to optimize for the tight performance budgets of mobile devices, they proved to be very useful for PC VR as well. The Audi AI:ME project was one of those use cases.

We recreated the serene atmosphere of the environments by rebuilding the fog in-engine. The fog rendering required a depth pre-pass and thus doubled the draw calls (more details on the rendering in the upcoming Part 3). This led to a much tighter budget than we usually have. While we could have achieved this with LODs, the additional requirement of quick scene transitions (short level-loading times) meant we needed to get clever.

This required the texture resolution to be as low as possible to reduce the memory footprint. Since the fog had a depth blur (blurring out details in the distance), we could dramatically reduce texture resolution for the more distant parts of the scan.

However, to get the memory footprint even lower, we needed to create different levels/assets from very large scans to keep each level smaller and faster to load. Our modular pipeline made solving all those requirements straightforward. The texture and mesh detail in the distance could be reduced by simply changing some parameters, which dramatically reduced the amount of texture data per level. We also defined different areas-of-interest and created different assets from the same location scan, to be used in different scenes/levels. This essentially allowed us to split one high-quality photogrammetry scan into two smaller scenes that are each faster to load.

Upper part of the GIF: two different meshes for different Unity scenes/levels of the experience. The red box is the area-of-interest we defined, where the car will roughly be. In this use case, the texture details in the distance could be even lower, because the distant details were blurred by the fog post process (compare to the teaser). Lower part of the GIF: top-down view of the scene with random colors for each chunk. The reduction in mesh details is clearly visible.

Conclusion

After working on this pipeline for a couple of years, we were very proud of how it allowed us to tackle the challenges of the Audi AI:ME project.

While we did need to spend pre-production time establishing our capture workflow for large environments (as explained in Part 1), we already had most of the features and modules needed for the Audi project and were able to spend more time iterating on the processing and the final VR experience, rather than developing the underlying technology.

If I think back now to my humble beginnings in photogrammetry, when I did a lot of optimizations by hand, I think about the huge amount of manual work that would have been required for two differently detailed scenes like in the example above. It fills me with pride and joy to see how far we have come and what we’ve been able to make. Things that would have taken days or even weeks can now be queued up and done overnight by a machine. Our pipeline is still a bit rough around the edges, but it feels very powerful and feature-complete.

An example of adaptive geometry decimation (weights colorized for debugging)

That’s all for Part 2. Stay tuned for Part 3 coming next week!

Part 1: Planning and executing a large-scale aerial Photogrammetry project

Part 2: Optimizing photogrammetry data for VR using the realities.io Pipeline Editor (this article)

Part 3: Unity VR Development and Rendering the Fog

Download the free Realities app on Steam

Explore the results of our photogrammetry workflow in beautiful and detailed environments from around the world in VR.

Download from Steam for Vive/Valve Index/Oculus Rift/Quest Link.

Follow the Realities.io Team

The Realities.io team works on the cutting edge of VR + photogrammetry. Follow us on Twitter.

