Will Game Developers Lose Their Jobs to AI?

Taks@skyfoliage.com
4 min read · Aug 23, 2024


(Figure from the Genie paper)

I have covered Vision Transformers and Video Transformers in several previous blog posts (e.g., TimeSformer, ViViT, and C-ViViT). The reason for covering those topics was to build a solid understanding of the following paper!

The Genie paper was published by J. Bruce and colleagues at Google DeepMind and accepted to ICML 2024.

Genie: Generative Interactive Environments

Jake Bruce · Michael Dennis · Ashley Edwards · Jack Parker-Holder · Yuge Shi · Edward Hughes · Matthew Lai · Aditi Mavalankar · Richie Steigerwald · Chris Apps · Yusuf Aytar · Sarah Bechtle · Feryal Behbahani · Stephanie Chan · Nicolas Heess · Lucy Gonzalez · Simon Osindero · Sherjil Ozair · Scott Reed · Jingwei Zhang · Konrad Zolna · Jeff Clune · Nando de Freitas · Satinder Singh · Tim Rocktäschel

This paper proposes a new architecture for generating games (i.e., controllable environments), called a generative interactive environment (Genie).

Strictly speaking, it does not generate a complete game; rather, it generates frames conditioned on latent actions learned by Genie's latent action model.

If you are not yet familiar with the Transformer, I recommend reading this article by J. Alammar, which explains it with many helpful illustrations.

Let’s dive into the world of Genie!

Summary

  • The proposed 11B-parameter model can generate continuous images (whether from photos or sketches) that can be manipulated through user action input, creating an environment akin to a video game
  • Remarkably, the model was trained with unsupervised learning on over 200,000 hours of unlabeled game footage obtained from the internet
  • The proposed architecture consists of a spatiotemporal Transformer for video tokenization, a latent action model whose codebook is learned from the actions implicit in videos, and an autoregressive dynamics model that generates the next frame from latent actions and tokenized frames
  • Experiments confirm that the model’s latent action space generalizes even to previously unseen video content
  • However, a significant challenge is that running the three Transformers, totaling approximately 11B parameters, requires high-performance TPUs, limiting Genie to generation at only about 1 FPS.

Transformer Component

To improve memory efficiency, the ST-transformer is used across all components. Each of its L layers alternates spatial and temporal attention, followed by a single feed-forward layer, as in a standard attention block. This design keeps the computational cost growing only linearly with the number of frames (see the sketch after the list below):

  • Spatial self-attention attends over the 1×H×W tokens within each time step
  • Temporal self-attention attends over the T×1×1 tokens across the T time steps, with causal masking
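
To make the alternation concrete, below is a minimal PyTorch-style sketch of a single ST-transformer layer. This is my own simplified illustration, assuming a (batch, time, space, channel) tensor layout and standard pre-norm attention; it is not the authors' implementation, which differs in details such as positional embeddings.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """One ST-transformer layer: spatial attention, temporal attention, one feed-forward layer."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, D) with S = H * W spatial tokens per frame
        B, T, S, D = x.shape

        # Spatial self-attention: each time step attends over its own 1xHxW tokens.
        xs = self.norm_s(x).reshape(B * T, S, D)
        x = x + self.spatial_attn(xs, xs, xs, need_weights=False)[0].reshape(B, T, S, D)

        # Temporal self-attention: each spatial position attends over the T time steps,
        # with a causal mask so a frame can only look at past frames.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(B * S, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out = self.temporal_attn(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        x = x + out.reshape(B, S, T, D).permute(0, 2, 1, 3)

        # A single feed-forward layer closes the block, as in a standard attention block.
        return x + self.ff(self.norm_f(x))
```

Because attention never runs over all T×H×W tokens at once, the dominant spatial term grows only linearly with the number of frames.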

Architecture of Genie

The figures below are from the paper, with additional explanations added by me. The diagram contains all the essential information; I have added some comments on top, so please take a look.
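
To connect the three components from the summary (video tokenizer, latent action model, dynamics model), here is a hedged pseudocode sketch of what a single interaction step could look like at inference time. The interface (`predict_next`, `decode`) and argument names are my own assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of one Genie interaction step at inference time.
# Class and method names (tokenizer.decode, dynamics.predict_next) are
# illustrative assumptions, not the paper's actual API.

def interactive_step(tokenizer, dynamics, frame_token_history, user_action, num_actions=8):
    """Predict and decode the next frame from past frame tokens and a discrete user action."""
    # Genie's latent actions form a small discrete codebook, so the "controller"
    # is just an integer index chosen by the player.
    assert 0 <= user_action < num_actions

    # The autoregressive dynamics model predicts the next frame's tokens,
    # conditioned on all previous frame tokens and the chosen latent action.
    next_frame_tokens = dynamics.predict_next(frame_token_history, action=user_action)

    # The video tokenizer's decoder maps the discrete tokens back to pixels for display.
    next_frame = tokenizer.decode(next_frame_tokens)

    return next_frame, frame_token_history + [next_frame_tokens]
```

Note that the latent action model is mainly needed at training time, to infer discrete actions from raw video; at inference the player simply picks one of the learned action indices, which is why only the tokenizer and dynamics model appear in the sketch.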

Experiments

They created a dataset using 2D gameplay videos sourced from the internet, though the specific platforms are not identified. The videos were filtered based on the following criteria (a minimal filter sketch follows the list):

  • The title includes keywords related to 2D games
  • The title or description contains action-oriented words (e.g., “speedrun” or “playthrough”)
  • The title does not include negating words (e.g., “movie” or “unboxing”)
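
As referenced above, below is a minimal sketch of what such a keyword filter could look like. The exact keyword lists are not published, so apart from the quoted examples (“speedrun”, “playthrough”, “movie”, “unboxing”) the sets below are my own illustrative guesses.

```python
# Minimal sketch of the title/description keyword filter described above.
# Keyword sets are illustrative; only the quoted examples come from the paper.
GAME_KEYWORDS = {"platformer", "2d game", "retro game"}   # assumed examples
ACTION_KEYWORDS = {"speedrun", "playthrough"}             # examples quoted in the paper
NEGATING_KEYWORDS = {"movie", "unboxing"}                 # examples quoted in the paper

def keep_video(title: str, description: str) -> bool:
    """Return True if a video passes the keyword-based filtering criteria."""
    title_l, desc_l = title.lower(), description.lower()
    has_game_keyword = any(k in title_l for k in GAME_KEYWORDS)
    has_action_word = any(k in title_l or k in desc_l for k in ACTION_KEYWORDS)
    has_negating_word = any(k in title_l for k in NEGATING_KEYWORDS)
    return has_game_keyword and has_action_word and not has_negating_word

# Example: keep_video("Celeste any% speedrun - 2D platformer", "") -> True
```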

After manual review, the final dataset comprises 6.8 million video clips, each 16 seconds long (totaling 30,000 hours), with a resolution of 160×90 and a frame rate of 10 FPS.

The results regarding scalability are omitted here; refer to Figure 9 in the paper for details. Instead, some of the more interesting results are highlighted below.

Firstly, the images generated with the proposed method convincingly capture parallax-like disparities between foreground and background, despite the training data being unlabeled.

Secondly, the consistency of latent actions, validated on a robotics dataset, is notable.

Additionally, the results from the CoinRun environment demonstrate that Genie is capable of generating highly versatile actions.

While there is no mention of conditions such as hitboxes or game-over scenarios, the model appears to have learned to handle interactions between actions and objects in 2D games.

My Summary

To summarize, while current AI-based environment generation technologies may not yet be advanced enough to replace human game development, the ability to create operable environments from input data such as sketches suggests potential applications in entertainment. For example, children’s drawings could be photographed and transformed into simple interactive games, similar to augmented reality experiences. Additionally, the ability to generate certain types of simulation environments indicates that this research could have versatile and appealing applications.

Thank you!

Written by Taks.skyfoliage.com

This post is republished from skyfoliage.com
