AI-Generated Graphics: One Step Closer

Overview of the paper “World-Consistent Video-to-Video Synthesis” by A. Mallya et al.

Chintan Trivedi
deepgamingai

--

Over the past few years, we have made tremendous progress in AI-generated graphics using image and video synthesis techniques based on generative deep learning. The pix2pix model, introduced in 2017 for image-to-image translation, set the direction for AI-generated graphics. Researchers at NVIDIA then improved the qualitative results with the vid2vid model in 2018, which extended the approach to video-to-video translation by enforcing consistency across consecutive frames of the video.

Visual inconsistencies in vid2vid when revisiting a scene. [source]

But one major challenge still remained with the vid2vid synthesis method: long-term visual consistency. The network would gradually forget what it had generated, so if you revisited the same scene it might be rendered differently, making the method impractical for use in games.

Fast forward to 2020, and the same research team at NVIDIA has already made advancements to address this problem. Compared to the previous vid2vid model, the newer world-consistent vid2vid model is able to maintain the appearance of objects when you revisit and re-render a scene you have been to previously. In this article, let’s take a closer look at this paper, titled “World-Consistent Video-to-Video Synthesis” by Mallya et al.

Inputs for world-consistent vid2vid model. Top Left: Semantic Labels. Top Right: 3D Point Cloud (Guidance Frames). Bottom Left: Depth Maps. Bottom Right: Synthesized Frames. [source]

This method uses additional information, such as depth maps, along with the standard semantic labels, and it introduces a new input called the guidance image. The guidance image is built by placing the textures of previously rendered frames onto a 3D point cloud of the scene and then rendering that textured point cloud from the current viewpoint. This way, when we revisit the same scene, the model can refer to the guidance image to re-render it consistently, which is why the model is called world-consistent video synthesis. The generator synthesizes each frame using multiple adaptive SPADE blocks, starting from a small 16-by-32 image and progressively growing it up to HD resolution.
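To make the modulation idea concrete, here is a minimal PyTorch sketch (not the authors’ released code) of how SPADE-style layers could condition the generator on both the semantic label map and the guidance image. The class names (SPADE, MultiSPADEBlock), channel counts, and resolutions are illustrative assumptions rather than the paper’s exact architecture.

```python
# Minimal sketch (illustrative, not the authors' code) of SPADE-style
# modulation conditioned on a semantic label map and a guidance image.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPADE(nn.Module):
    """Normalizes features, then re-scales them with gamma/beta maps
    predicted from a conditioning image (labels or guidance)."""

    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU()
        )
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, cond):
        # Resize the conditioning map to match the feature resolution.
        cond = F.interpolate(cond, size=feat.shape[2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)


class MultiSPADEBlock(nn.Module):
    """Applies two SPADE modulations in sequence: one driven by the semantic
    label map, one by the guidance image rendered from the 3D point cloud."""

    def __init__(self, channels, label_channels, guidance_channels):
        super().__init__()
        self.spade_label = SPADE(channels, label_channels)
        self.spade_guide = SPADE(channels, guidance_channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, labels, guidance):
        feat = F.leaky_relu(self.spade_label(feat, labels), 0.2)
        feat = F.leaky_relu(self.spade_guide(feat, guidance), 0.2)
        return self.conv(feat)


# Example: modulate a coarse 16x32 feature map with a 35-channel semantic map
# and a 3-channel (RGB) guidance image; random tensors stand in for real data.
feat = torch.randn(1, 256, 16, 32)
labels = torch.randn(1, 35, 256, 512)
guidance = torch.randn(1, 3, 256, 512)
out = MultiSPADEBlock(256, 35, 3)(feat, labels, guidance)
print(out.shape)  # torch.Size([1, 256, 16, 32])
```

In this sketch, the guidance-driven modulation is applied after the label-driven one, so previously generated textures can override the generic label conditioning wherever the guidance image has content.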

They even created a sample 3D world in a traditional game engine and used their world-consistent vid2vid model to render the textures, giving us a glimpse of what the graphics pipeline of future games might look like, as sketched below.
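To illustrate what such a pipeline could look like, here is a purely hypothetical per-frame loop. Every object and method name (engine.render_labels_and_depth, point_cloud.render_guidance, model.synthesize, point_cloud.update) is a made-up placeholder for illustration, not an actual API from the paper or from any game engine.

```python
# Hypothetical per-frame loop for a neural graphics pipeline in the spirit of
# world-consistent vid2vid. All objects and method names are placeholders.

def neural_graphics_loop(engine, model, point_cloud, num_frames):
    frames = []
    for t in range(num_frames):
        # The game engine renders cheap inputs: semantic labels, depth,
        # and the current camera pose.
        labels, depth, pose = engine.render_labels_and_depth(t)

        # Project the textured 3D point cloud into the current view to get
        # the guidance image; regions never seen before stay empty.
        guidance = point_cloud.render_guidance(pose)

        # The trained generator synthesizes the final frame from labels,
        # depth, and guidance.
        frame = model.synthesize(labels, depth, guidance)

        # Back-project the synthesized colors onto the point cloud so that
        # revisiting this location later reproduces the same textures.
        point_cloud.update(frame, depth, pose)

        frames.append(frame)
    return frames
```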

Left: 3D world game engine. Right: same scene from this 3D world rendered twice by world-consistent vid2vid. [source]

Notice that after you travel through this world in a closed loop and return to the same position, the rendered image stays consistent, making the method highly suitable for gaming applications. I can’t wait to see what this research group at NVIDIA comes up with next, and when they do, I’ll be sure to cover their progress.

Thank you for reading. If you liked this article, you can follow more of my work on Medium or GitHub, or subscribe to my YouTube channel.

--

Chintan Trivedi
deepgamingai

AI, ML for Digital Games Researcher. Founder at DG AI Research Lab, India. Visit our publication homepage medium.com/deepgamingai for weekly AI & Games content!