This AI could be used to generate GTA 6 graphics

DG AI Team · Published in deepgamingai · Jun 30, 2020 · 7 min read

Designing virtual worlds for video games is a time-consuming and resource-intensive task. Open-world games like Grand Theft Auto (GTA) feature huge environments that players can freely explore, and building these visuals can take four to five years even for large game studios. As a result, the turnaround time for releasing new games tends to be long. This is where deep learning algorithms can help: by taking over part of the work of designing and rendering game visuals from the artists, they can shorten development time. In this article, I am going to go over two state-of-the-art research papers that can help with designing and visualising virtual worlds in games like GTA.

Image Synthesis with Deep Learning

Left: A semantic label map depicting the objects appearing in this scene. Right: A fake image synthesised by a Deep Neural Network from this semantic label map. [source]

We can train a neural network to learn the appearance of the various objects or assets we want to include in our game's virtual world. Then, if we feed it a semantic label map (as shown above) describing the positioning of these objects, it can render realistic-looking visuals.

This task is known as image synthesis from semantic labels, and the field has seen a lot of new developments in the last few years. One research paper I want to cover today that excels at this task goes by the name of vid2vid.

Video-to-Video Synthesis (vid2vid)

Researchers at MIT and NVIDIA published a paper titled “Video-to-Video Synthesis” that synthesises high-resolution, temporally consistent videos from input semantic label maps. It uses a special GAN architecture to ensure that the synthesised frames in the video look realistic and that there is visual consistency across different frames.

The GAN architecture used in vid2vid.

The vid2vid generator uses not only the current semantic map that we want to convert, but also a few semantic maps from the previous frames. Additionally, it takes the frames it previously synthesised and combines these inputs to compute a flow map. The flow map captures how consecutive frames differ, which is what allows the network to synthesise temporally consistent images. On the discriminator side, an image discriminator controls the quality of each output frame, and in addition a video discriminator ensures that the frame-by-frame sequence of synthesised images is consistent with the flow maps. This keeps flickering between frames to a minimum. On top of this, the network employs a progressive-growth approach: it first perfects lower-resolution outputs and then uses that knowledge to progressively move up to higher resolutions. Check out the amazing results from this network in the figure below.

Short clip of a fake virtual city synthesised by the vid2vid network trained on the Cityscapes dataset. [source]
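To make the generator structure described above more concrete, here is a minimal PyTorch-style sketch of how a vid2vid-like generator could condition on past frames and blend a flow-warped previous frame with newly hallucinated pixels. This is my own illustration, not the authors' code; the layer sizes, channel counts, and the `warp` helper are assumptions.

```python
# Minimal sketch (not the authors' code) of a vid2vid-style generator that
# conditions on past frames. All module shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVid2VidGenerator(nn.Module):
    def __init__(self, label_channels=35, past_frames=2):
        super().__init__()
        # Input: current label map + past label maps + past generated RGB frames
        in_ch = label_channels * (1 + past_frames) + 3 * past_frames
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.to_rgb = nn.Conv2d(64, 3, 3, padding=1)   # newly hallucinated frame
        self.to_flow = nn.Conv2d(64, 2, 3, padding=1)  # flow-map estimate
        self.to_mask = nn.Conv2d(64, 1, 3, padding=1)  # blend weight

    def forward(self, cur_label, past_labels, past_rgb):
        x = torch.cat([cur_label, *past_labels, *past_rgb], dim=1)
        h = self.backbone(x)
        rgb = torch.tanh(self.to_rgb(h))
        flow = self.to_flow(h)
        mask = torch.sigmoid(self.to_mask(h))
        # Warp the previously synthesised frame with the predicted flow,
        # then blend it with the newly hallucinated pixels.
        warped = warp(past_rgb[-1], flow)
        return mask * warped + (1 - mask) * rgb

def warp(frame, flow):
    # Build a sampling grid from the flow field and resample the frame.
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # For simplicity, the flow is assumed to be in normalised grid coordinates.
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(frame, grid, align_corners=True)
```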

The issue with the GAN-based approach

While the visual quality of the vid2vid GAN is impressive, there is one practical issue if we want to actually use it in games. Games like GTA have a day-night cycle that changes the appearance of the virtual world, and weather effects like rain and fog alter its appearance even further. This means any neural network attempting to render the graphics of such a world must be able to do so in the different visual styles corresponding to lighting and weather. However, producing diverse-looking images from the same input is a known problem with GANs, due to a phenomenon called mode collapse.

Mode collapse in GANs results in a generator that cannot produce visually diverse outputs.

Mode Collapse with GAN in vid2vid

Imagine a bunch of training images as points in some high-dimensional space, which we have simply represented in two dimensions in the figure above. Some of these points are day-time samples and some are night-time samples. When we train an unconditional GAN, we first push a batch of random noise vectors through the generator to produce fake images. The training process then essentially pushes these fake images towards the training images so that they look real. The problem is that some training images may be left out and never matched, so as training goes on, the generator ends up producing only images of the same kind. This is why GANs suffer from mode collapse, and why images generated this way are not visually diverse.
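To make the direction of matching concrete, here is a toy sketch of a standard (non-saturating) GAN generator update. Nothing in this objective forces the generator to cover every training image, which is exactly the gap that leads to mode collapse. The names are my own and this is not vid2vid's training code.

```python
# Toy GAN generator step: random fakes are pushed towards whatever the
# discriminator currently accepts as "real". No term forces the generator
# to match every training image. Illustration only, not vid2vid's code.
import torch
import torch.nn.functional as F

def gan_generator_step(G, D, g_optimizer, batch_size=32, z_dim=128):
    z = torch.randn(batch_size, z_dim)
    fakes = G(z)
    # Generator loss: make D classify the fakes as real (label 1).
    logits = D(fakes)
    loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    g_optimizer.zero_grad()
    loss.backward()
    g_optimizer.step()
    return loss.item()
```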

This is how I came across the research paper that aims to solve this issue using Implicit Maximum Likelihood Estimation.

Image Synthesis with Conditional IMLE

Researchers at Berkeley published the paper “Diverse Image Synthesis from Semantic Layouts via Conditional IMLE”, which aims to solve the problem described above with the GAN-based training process of the vid2vid network. Rather than focusing on improving the quality of the output frame, it focuses on synthesising diverse images from the exact same semantic map. This means we can have the same scene rendered under any lighting or weather condition, unlike with GANs, where one semantic label map can only produce one output. The paper shows how to use Implicit Maximum Likelihood Estimation (IMLE) to achieve this. Let us try to understand why IMLE works better than GANs for this particular use case.

Unconditional case of the Implicit Maximum Likelihood Estimation (IMLE) training process.

IMLE first picks a training image and pulls the nearest randomly generated image close to it. Note that this is the opposite direction of how GAN training works. Next, it picks another training image and pulls another generated image towards it, and this process is repeated until every training image has been covered. Because the training process now covers all of the day-time and night-time images, the generator is forced to produce diverse styles of images. Note that this is the unconditional case of IMLE, where we start from a random noise image rather than a semantic label map, but the training process remains the same in both cases. The only thing that changes is the input encoding when we use semantic maps, so let's take a look at that.
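Here is a minimal sketch of one unconditional IMLE training step, written the way the matching procedure is described above. It is my own illustration under simplifying assumptions (a generic generator `G`, squared pixel distance as the metric), not the paper's implementation.

```python
# Minimal sketch of one unconditional IMLE training step (illustration only,
# not the paper's code). G is any generator mapping latent codes to images.
import torch

def imle_step(G, optimizer, real_batch, n_samples=32, z_dim=128):
    # 1. Draw a pool of random latent codes and generate candidate images.
    z = torch.randn(n_samples, z_dim)
    with torch.no_grad():
        candidates = G(z)

    # 2. For every training image, find its nearest generated candidate
    #    and pull that candidate towards the training image.
    losses = []
    for real in real_batch:
        dists = ((candidates - real.unsqueeze(0)) ** 2).flatten(1).sum(dim=1)
        nearest = dists.argmin().item()
        fake = G(z[nearest : nearest + 1])          # re-run with gradients
        losses.append(((fake - real.unsqueeze(0)) ** 2).mean())

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because every training image gets matched to some generated sample, no part of the data (for example, the night-time images) can be silently dropped the way it can be under a GAN objective.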

Conditional case of IMLE, where the input is a semantic label map rather than a random image. A random noise channel is added to the input encoding and is used to control the visual style of the network's output.

Instead of feeding the network an RGB semantic label map, we break the map down into multiple channels, one per object class. Now comes the part of this paper I personally found most interesting: an additional noise input channel controls what the output style looks like. For one random noise image in this channel, the output follows a fixed style, say a day-time look; change the channel to a different random noise image and the output follows another style, perhaps a night-time look. By interpolating between these two random images, we can actually control the time of day in the output image. This is really cool and fascinating!
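A rough sketch of this input encoding might look like the following. The function names, the single noise channel, and the use of `torch.nn.functional.one_hot` are my assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of the conditional IMLE input encoding: one-hot semantic channels
# plus an extra noise channel that selects the output style. Shapes and
# names are illustrative assumptions.
import torch
import torch.nn.functional as F

def encode_input(label_map, num_classes, noise=None):
    # label_map: (H, W) integer class ids -> (num_classes, H, W) one-hot planes
    one_hot = F.one_hot(label_map.long(), num_classes).permute(2, 0, 1).float()
    if noise is None:
        noise = torch.randn(1, *label_map.shape)  # one random noise channel
    return torch.cat([one_hot, noise], dim=0)

def interpolate_noise(noise_day, noise_night, alpha):
    # Moving alpha from 0 to 1 slides the output style between the two looks.
    return (1 - alpha) * noise_day + alpha * noise_night
```

Feeding `interpolate_noise(noise_day, noise_night, 0.5)` as the noise channel would then produce a rendering roughly halfway between the two styles.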

Day-and-night cycle of a virtual world imagined by a Deep Neural Network [source]

Trying out this AI to render GTA 5

I tried to recreate this effect on a short clip from the game GTA 5. I obtained the semantic labels of the game footage using an image segmentation network and then ran them through the IMLE-trained network. The results are fascinating, given that the exact same generator network produces both the day-time and night-time versions of the GTA footage!
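The exact segmentation model is not the important part. As a stand-in, a sketch of this labelling step using torchvision's pretrained DeepLabV3 might look like the following; the frame file name is hypothetical, and a real pipeline would need class labels matching whatever the IMLE network was trained on.

```python
# Sketch of extracting a per-pixel semantic label map from a game frame,
# using torchvision's pretrained DeepLabV3 as a stand-in segmentation network.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

frame = Image.open("gta_frame.png").convert("RGB")   # hypothetical frame dump
with torch.no_grad():
    logits = model(preprocess(frame).unsqueeze(0))["out"]  # (1, C, H, W)
label_map = logits.argmax(dim=1).squeeze(0)                # (H, W) class ids
```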

You can watch more of these results in video form on my YouTube channel; the video is embedded below.

Conclusion

Between the two papers we covered today, vid2vid and IMLE-based image synthesis, we can see how far AI-based graphics rendering has come. Only a few more hurdles remain before we can start experimenting with this new AI-based graphics technology in production. I predict that within about a decade, games like Grand Theft Auto will use some form of AI-based asset rendering to help reduce development time. The future of game development is exciting!

Thank you for reading. If you liked this article, you can follow more of my work on Medium and GitHub, or subscribe to my YouTube channel.

Note: This is a repost of the article originally published on towardsdatascience in 2018.
