Research Card: Exploring 3D-aware Latent Spaces for Efficiently Learning Numerous Scenes

Criteo R&D
Criteo Tech Blog
May 2, 2024

Paper: Exploring 3D-aware Latent Spaces for Efficiently Learning Numerous Scenes
Authors: Antoine Schnepf (Criteo AI Lab), Karim Kassab (Criteo AI Lab), Jean-Yves Franceschi (Criteo AI Lab), Laurent Caraffa, Flavian Vasile (Criteo AI Lab), Jeremie Mary (Criteo AI Lab), Andrew Comport, Valérie Gouet-Brunet
Category: Deep Learning, Computer vision, 3D
Venue: CVPR 2024 3DMV Workshop

Why did we work on this topic (the problem we want to solve)?

The inverse graphics problem consists of learning the geometry and appearance of a 3D scene given only its views (2D images). While learning a single scene has recently been widely explored, the scaled version of the problem — simultaneously learning many semantically similar scenes — remains unexplored. In this paper, we address this scaled problem. By avoiding learning redundant information across different scenes, we aim to increase quality while reducing resource costs such as compute and memory requirements.

How did we proceed?

First, we developed a compressed image space in which our 3D scenes can be learnt, instead of the usual RGB image space. To do so, we used an image compression model (an autoencoder), which we adapted to be compatible with scene learning. We also adapted existing methods for 3D scene learning to make them work in the compressed space. Second, we integrated global scene representations that store the information common across scenes. Each scene is then represented as a combination of global and local information. This enables the sharing of 3D knowledge and avoids learning redundant information across semantically similar scenes.
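The core efficiency argument above — supervising scenes against encoded views rather than full-resolution images — can be sketched as follows. This is a toy illustration, not the paper's implementation: the average-pool "encoder", nearest-neighbour "decoder", and all names and resolutions are assumptions standing in for a learned autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 64          # RGB view resolution
f = 8               # spatial compression factor (assumed)
h, w = H // f, W // f

def encode(img):
    # Toy "encoder": average-pool f x f patches (stand-in for a learned AE).
    return img.reshape(h, f, w, f, 3).mean(axis=(1, 3))

def decode(lat):
    # Toy "decoder": nearest-neighbour upsample back to image space.
    return lat.repeat(f, axis=0).repeat(f, axis=1)

view = rng.random((H, W, 3))
target_latent = encode(view)          # the training target lives in latent space

# A per-scene latent "render" is optimised against the latent target, so each
# training step touches h*w latent pixels instead of H*W RGB pixels.
latent_render = np.zeros_like(target_latent)
for _ in range(100):
    grad = 2.0 * (latent_render - target_latent)   # d/dx ||x - t||^2
    latent_render -= 0.1 * grad

rgb_preview = decode(latent_render)   # decode only when RGB views are needed
```

At test time, a scene is rendered in the latent space and decoded to RGB only once per view; during training, the decoder can be bypassed entirely.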

We learn a 3D-aware latent space by regularizing its training with 3D constraints. To this end, we jointly train an encoder, a decoder and N scenes lying in this latent space. For each scene s, we learn a Tri-Planes representation 𝞣ₛ, built from the concatenation of local Tri-Planes 𝞣ₛᵐⁱᶜ and global Tri-Planes 𝞣ₛᵐᵃᶜ. 𝞣ₛᵐⁱᶜ is retrieved via a one-hot vector eₛ from a set of scene-specific planes stored in memory. 𝞣ₛᵐᵃᶜ is computed as a summation of M globally shared Tri-Planes, weighted by Wₛ.
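The composition of 𝞣ₛ from local and global parts can be sketched as below. Array shapes, channel counts, and the concatenation axis are illustrative assumptions; only the structure (one-hot retrieval of 𝞣ₛᵐⁱᶜ, a Wₛ-weighted sum of M shared planes for 𝞣ₛᵐᵃᶜ, then concatenation) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 4, 3          # N scenes, M globally shared Tri-Planes
C, R = 8, 16         # feature channels and plane resolution (assumed)

# Each Tri-Planes entry holds 3 axis-aligned feature planes of shape (C, R, R).
local_planes  = rng.random((N, 3, C, R, R))   # scene-specific (micro) planes
global_planes = rng.random((M, 3, C, R, R))   # planes shared across scenes (macro)
W = rng.random((N, M))                        # per-scene mixing weights W_s

def triplanes(s):
    e_s = np.eye(N)[s]                                   # one-hot selector e_s
    T_mic = np.einsum("n,nkcrh->kcrh", e_s, local_planes)    # retrieve scene planes
    T_mac = np.einsum("m,mkcrh->kcrh", W[s], global_planes)  # weighted global sum
    # Scene s = concatenation of local and global parts (here: along channels).
    return np.concatenate([T_mic, T_mac], axis=1)

T_0 = triplanes(0)   # shape (3, 2*C, R, R): 3 planes of concatenated features
```

Only the N×M weight matrix and the N local plane sets grow with the number of scenes; the M global planes are amortised across all of them, which is where the memory savings come from.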

What did we find? What did we achieve?

We developed two techniques to improve training speed and memory usage while maintaining quality when learning many scenes. Combined, they make scene learning ten times faster and three times more memory-efficient in a large-scale setting.

The video illustrates the scenes learned in the compressed space and how they translate to RGB visualisations.

What is the originality here?

Training scenes in a 3D-aware compressed space is a novel and promising idea that has remained unexplored since the development of 3D scene learning. While recent research mostly focuses on improving scene representations or the sampling strategies used to render scenes, our approach is orthogonal to these improvements: rather than changing how scenes are rendered, we change the space in which they are learned.
