GameNGen : Google’s AI Game Simulation Neural Network

An AI based Gaming Engine using Deep Learning architecture

Mehul Gupta
Data Science in your pocket

--

Photo by Ben Griffiths on Unsplash

Amid all the LLM and generative AI rush, Google has released a game changer in the gaming space: GameNGen, a deep-learning-based architecture that can generate game simulations in real time, i.e.

A Game Engine, without a Game Engine

Don’t mistake this for being similar to “an agent playing a game”. This is different, as GameNGen can generate

Video game simulations

i.e. potentially an entire video game in the future (and not just an AI agent that can play it). This includes the gaming environment and its different levels as well.

In the paper released, the results show GameNGen interactively simulating the classic game DOOM at over 20 frames per second on a single TPU. Next-frame prediction (i.e. generating the next gameplay image) achieves a PSNR of 29.4, comparable to lossy JPEG compression.

Note:

1. DOOM is a classic first-person shooter game where players take on the role of a space marine fighting against hordes of demons and monsters from Hell, known for its fast-paced action, intense combat, and dark atmosphere.

2. PSNR measures image quality, with higher values indicating less noise and distortion. A PSNR of 29.4 for the model is comparable to the typical range of 28–32 PSNR for JPEG compression, indicating a good balance between image quality and compression efficiency.
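PSNR is straightforward to compute from the mean squared error between a reference frame and a distorted copy. A quick sketch (the image and noise values below are made up for illustration):

```python
import numpy as np

# PSNR between a reference image and a distorted copy; higher means
# less distortion. max_value is the largest possible pixel value.
def psnr(reference, distorted, max_value=255.0):
    mse = np.mean((reference.astype(float) - distorted.astype(float)) ** 2)
    return 10 * np.log10(max_value ** 2 / mse)

# Made-up example: a random image plus mild Gaussian noise.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(240, 320, 3))
noisy = np.clip(ref + rng.normal(0, 9, ref.shape), 0, 255)
print(round(float(psnr(ref, noisy)), 1))  # around 29 dB for this noise level
```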

Surprisingly, human raters are only slightly better than random chance (50% accuracy) at telling short clips of real gameplay apart from clips of the simulation, i.e. at judging whether a clip is AI-generated or human-played!

Talking about the model, it is trained in multiple stages:

1. Data Generation

Generating gameplay videos in bulk using humans is just impossible. So,

A reinforcement learning based agent was first trained to play DOOM, producing an ample dataset of gameplay (frames plus the actions taken). Not many details on this are presented in the paper.
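The idea can be sketched as a standard rollout-recording loop. Everything below (the dummy environment, the random policy) is a made-up stand-in; the paper trains an actual RL agent on DOOM:

```python
import random

# Hypothetical stand-in environment; a real setup would render DOOM frames,
# but the data-collection loop has the same shape.
class DummyEnv:
    def reset(self):
        return "frame_start"                   # in practice: a rendered game frame
    def step(self, action):
        return f"frame_after_{action}", False  # next frame, episode-done flag

def collect_trajectory(env, policy, steps=100):
    """Roll out a policy and record (frame, action) training pairs."""
    dataset = []
    frame = env.reset()
    for _ in range(steps):
        action = policy(frame)
        dataset.append((frame, action))
        frame, done = env.step(action)
        if done:
            frame = env.reset()
    return dataset

random.seed(0)
policy = lambda frame: random.choice(["left", "right", "forward", "shoot"])
data = collect_trajectory(DummyEnv(), policy, steps=100)
print(len(data))  # 100 (frame, action) pairs
```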

2. Training the Generative Diffusion Model

A. Using a Pre-Trained Model

The team started with a smaller version of a model called Stable Diffusion v1.4 (a U-Net-based architecture), which is typically used for generating images from text.

They adapted this model to work with the game DOOM by conditioning it on sequences of actions and the frames (images) the game produces.
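Conceptually, the text-prompt conditioning is swapped for two game-specific inputs: the last few frames (encoded into latent space and stacked along the channel axis) and the corresponding actions (each mapped to an embedding vector, much like text tokens would be). A shape-level sketch with made-up toy dimensions:

```python
import numpy as np

# Toy dimensions (made up; not the paper's actual sizes).
N_CONTEXT = 4             # past frames fed to the model as context
LATENT_C, H, W = 4, 8, 8  # latent channels and spatial size per frame
ACTION_DIM = 16           # size of each action embedding

def build_conditioning(past_latents, past_actions, action_embedding):
    """Stack past latent frames channel-wise; embed actions like tokens."""
    latent_cond = np.concatenate(past_latents, axis=0)  # (N*C, H, W)
    action_cond = np.stack([action_embedding[a] for a in past_actions])
    return latent_cond, action_cond

rng = np.random.default_rng(0)
embedding = {a: rng.normal(size=ACTION_DIM) for a in range(3)}
latents = [rng.normal(size=(LATENT_C, H, W)) for _ in range(N_CONTEXT)]
actions = [0, 2, 1, 0]

latent_cond, action_cond = build_conditioning(latents, actions, embedding)
print(latent_cond.shape)  # (16, 8, 8): 4 frames x 4 latent channels
print(action_cond.shape)  # (4, 16): one embedding per past action
```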

B. Adding Noise for Stability

During training, they intentionally added random noise to the context frames to make them less clear. This might sound counterintuitive, but it helps the model learn how to “fix” these noisy images.

Because the model’s own generated frames are fed back as context during gameplay, small errors would otherwise accumulate over time. By learning to recover the original images from noisy ones, the model stays stable and keeps producing clear images over long auto-regressive rollouts.
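A minimal sketch of this noise augmentation, with made-up noise strengths: each training context gets Gaussian noise of a random magnitude, and that magnitude is returned so it can also be fed to the model as an extra input:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_context(frames, max_sigma=0.7):
    """Add Gaussian noise of a random strength to the context frames.

    The strength (sigma) is returned as well, so the model can be told
    how noisy its input is. max_sigma here is a made-up value.
    """
    sigma = float(rng.uniform(0.0, max_sigma))
    noisy = [f + rng.normal(0.0, sigma, f.shape) for f in frames]
    return noisy, sigma

frames = [np.zeros((4, 8, 8)) for _ in range(4)]  # toy latent context
noisy, sigma = corrupt_context(frames)
print(len(noisy), noisy[0].shape, 0.0 <= sigma <= 0.7)  # 4 (4, 8, 8) True
```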

3. Latent Decoder Fine-Tuning (ResNet)

The Stable Diffusion v1.4 model fine-tuned in the above steps includes an auto-encoder, which is designed to compress images into smaller, simpler versions (latent representations) while still keeping the important information.

In this case, it compresses each 8x8-pixel patch of the game image into 4 latent channels.
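The compression arithmetic works out as follows (the frame size here is illustrative):

```python
# Compression arithmetic for the auto-encoder: every 8x8 patch of RGB
# pixels becomes a single 4-channel latent "pixel".
H, W = 240, 320                        # illustrative frame size
pixels_in = H * W * 3                  # RGB values per frame
latents_out = (H // 8) * (W // 8) * 4  # latent values per frame
print(pixels_in / latents_out)         # 48.0x fewer numbers per frame
```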

When the model tries to create new game frames, it sometimes produces unwanted visual glitches or artifacts, especially in small details like the heads-up display (HUD) at the bottom of the screen. These artifacts can make the images look less realistic.

To fix these issues, the team focuses on training only the “decoder” part of the auto-encoder. The decoder is responsible for turning the compressed information back into full images.

They used MSE (mean squared error) against the target frame pixels as the loss function.
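A toy sketch of decoder-only fine-tuning with an MSE loss: the encoder output is held fixed, and only the decoder’s weights are updated by gradient descent. The linear “decoder” and all dimensions below are made up for illustration:

```python
import numpy as np

def mse_loss(reconstruction, target):
    """Mean squared error between the decoded frame and the real frame."""
    return float(np.mean((reconstruction - target) ** 2))

# Made-up toy setup: a frozen 4-dim latent (the encoder output) and a
# linear "decoder" W mapping it to an 8-value "frame".
rng = np.random.default_rng(1)
latent = rng.normal(size=4)          # frozen: the encoder is NOT updated
target = rng.normal(size=8)          # the original frame (flattened)
W = rng.normal(size=(8, 4)) * 0.1    # decoder weights: the only trainable part

lr = 0.05
initial = mse_loss(W @ latent, target)
for _ in range(500):                 # plain gradient descent on the MSE
    grad = 2 * np.outer(W @ latent - target, latent) / target.size
    W -= lr * grad
final = mse_loss(W @ latent, target)
print(final < initial)  # True: reconstructions move toward the real frame
```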

https://arxiv.org/pdf/2408.14837

The output gameplay for DOOM can be seen below (it’s almost realistic):

GameNGen looks just amazing given the output video, and the gaming industry is up for a revolution (this can be taken as a ChatGPT moment for the gaming industry). If this scales up well to other games, even game engines and game-dev software may become obsolete.

We are in for a treat soon!
