Using a Generative Adversarial Network to author playable Super Mario Bros. levels

Eric Whiteway
10 min read · Apr 29, 2021


A Generative Adversarial Network (GAN) is a machine learning architecture that can produce novel outputs which have the same properties as a set of training samples. This type of model is often used in image generation tasks and here we’re going to use it to generate Super Mario Bros. levels.

Playable Super Mario Bros. level authored by a GAN

In order to do that we’re going to first read the level data from the original game and convert it into a sequence of 16x16 images. We can then use this dataset to train our generative convolutional network to produce new scenes. Finally we’re going to convert these back into their original level format and write them to the Super Mario Bros. ROM.

The end goal is to produce Super Mario Bros. levels that are both interesting and playable and that reproduce the properties of the levels in the original game.

In game screenshots of scenes produced by the generator

A GAN is set up as a competition between two neural networks. The first network is called the generator, which has the task of producing new sample images. We feed the generator a set of random numbers (the latent vector), which it transforms into an image. The discriminator has the task of determining whether images are real (part of the training set) or fake (produced by the generator). By training both models together we aim to produce a generator that can output images which are indistinguishable from those that came from our training set. GANs can be used to produce images of handwritten digits, living rooms, celebrity faces or any other type of images with a big enough training set.

A tile-based game like Super Mario Bros. is well suited to level generation using this technique since, roughly speaking, we can think of a Super Mario level as being composed of a sequence of 16x16 images. This is not the first use of neural networks to generate Super Mario levels; a number of previous models have done the same using slightly different techniques. However, we’re going to do things a little differently and introduce a new idea for a conditional GAN capable of generating arbitrarily long levels that stitch together seamlessly.

Level format

Before we get into generating levels, a little background on the level format of the original game and the image encoding we’re going to use. Super Mario Bros. came out in 1985 and the entire game weighs in at 40 kB, which doesn’t leave a lot of room for level data. The level data is stored in a compressed format which gives the position and type of each object in the level. Generally each object is described in only two bytes: the first byte specifies the x-y coordinates in a 16x16 grid, and the second byte specifies the type of object, with one bit reserved for a ‘new page’ flag which tells the game to skip ahead 16 tiles.

To get an idea of the level format we can look at this familiar scene from the start of level 1–1. The level data is heavily compressed, with this scene being described in only 12 bytes (and another 2 bytes for the Goomba sprite offscreen).

Screenshot of SMB 1–1 converted into RGBY image representation used by the neural network

Encoded in hexadecimal the entire scene above is written as:

07 81 | 47 24 | 57 00 | 63 01 | 77 01 | C9 71

Big data this ain’t. In fact our challenge is to train the network from less than 2 kB of level data. The first step in training our network is to convert that string of level data into the RGBY image representation on the right. We’re going to encode our Mario levels as 16x16 images with 4 separate color channels. On the red channel we put all the bricks and question blocks, the green channel contains all the ground and block tiles. The blue channel is for enemy sprites and the fourth (yellow) channel is for pipes.

Encoding scheme
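To make the byte format concrete, here is a minimal decoder sketch for those two-byte records, assuming the layout described above: x in the high nibble and y in the low nibble of the first byte, with the ‘new page’ flag in the top bit of the type byte. The function name is ours, not from the game.

```python
def decode_objects(data):
    """Decode two-byte SMB-style object records into (x, y, type) triples.

    Assumes the first byte packs x (high nibble) and y (low nibble)
    within the current 16-tile page, and that the top bit of the
    second byte is the 'new page' flag that skips ahead 16 tiles.
    """
    page = 0
    objects = []
    for i in range(0, len(data), 2):
        pos, kind = data[i], data[i + 1]
        if kind & 0x80:              # 'new page' flag set
            page += 1
        x = page * 16 + (pos >> 4)   # absolute column in the level
        y = pos & 0x0F               # row, 0 = top of screen
        objects.append((x, y, kind & 0x7F))
    return objects

# The 12-byte scene from the start of 1-1 quoted above
scene = bytes.fromhex("07 81 47 24 57 00 63 01 77 01 C9 71")
print(decode_objects(scene))
```

Reassuringly, decoding the first record places an object at column 16 on row 7, and the fourth lands on row 3, which matches the rows-3-and-7 pattern of bricks and question blocks we see in game.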

There are three reasons for this particular choice of level encoding. One, we want to be able to capture enough of the variety of objects and sprites in the game to be able to generate interesting levels, but we don’t want to over-complicate the model.

Two, we want to group objects as a function of how they’re arranged in the level. For example the solid blocks are often arranged into large staircase structures, whereas bricks are most often arranged into horizontal rows like we see in our rendering of level 1–1. We want to give our network a chance to learn these arrangements and reproduce them.

Three, we want to organize all of our objects into a rational scheme where small misses by the generator network translate into small changes in the output level. The substitution of a coin block for a brick, for example, has very little impact on the quality of the generated level, whereas substituting a pipe or a Koopa sprite would noticeably degrade it.

Training the GAN

Now that we have the image representation of our Mario levels we can move on to building and training the network. We’re going to use a pretty standard GAN architecture composed of strided convolutional layers with Leaky ReLU activations for both generator and discriminator. We’re also going to use some common tricks like label swapping and exponential averaging to improve our results. All of this is implemented in Python using TensorFlow as the backend and Keras to build the model.
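As a rough sketch of what such an architecture might look like in Keras, here are a generator and discriminator built from strided convolutions with Leaky ReLU activations. The latent dimension, layer counts and filter sizes are illustrative assumptions, not the exact values used for the final model.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 32  # illustrative; the real latent size isn't specified here

def build_generator():
    # Upsample a latent vector to a 16x16 scene with 4 colour channels
    # using strided transposed convolutions (4x4 -> 8x8 -> 16x16).
    z = tf.keras.Input(shape=(LATENT_DIM,))
    x = layers.Dense(4 * 4 * 128)(z)
    x = layers.Reshape((4, 4, 128))(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2DTranspose(32, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    scene = layers.Conv2D(4, 3, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(z, scene)

def build_discriminator():
    # Downsample a 16x16x4 scene to a single real/fake score, using
    # strided convolutions instead of pooling.
    img = tf.keras.Input(shape=(16, 16, 4))
    x = layers.Conv2D(32, 4, strides=2, padding="same")(img)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(64, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    score = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(img, score)
```

The sigmoid output on the generator suits the RGBY encoding, where each channel holds tile occupancy values between 0 and 1.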

At the start of the training process the network doesn’t know anything about the dataset and it essentially outputs random noise. However, as the training progresses the network eventually starts to generate levels that are more “Mario like”.

This is a good time to stop and think about what makes a given arrangement of tiles a Super Mario Bros. level. This is actually pretty hard to define, but there are a few standout features of the original SMB game that we’re going to look for.

  1. Pipes that connect to the ground or else to a row of floating blocks.
  2. Blocks arranged into structures, often staircases of various sizes.
  3. Rows of bricks and question blocks, often on rows 3 and 7.
  4. Sprites that stand on the ground or other objects, sometimes grouped in twos and threes.

As we train the GAN we start to produce more and more of those features as shown below:

Network training progression from top to bottom.

Final model

After several thousand training epochs we settle on a stopping point that produces good results. Looking at some outputs from the final model we see that the generated levels reproduce a lot of the features identified in the original game:

Model outputs (note that these aren’t real in-game screenshots but actually rendered separately)

More importantly, they pass the eye test, although of course that’s somewhat subjective.

Level stitching

Given that we can generate high quality 16x16 scenes, the next problem becomes how to stitch together these scenes into coherent, playable levels. We could just string the generator outputs together, but that has the potential to generate impassable obstacles (for example if a new screen begins with a tall pipe or a tower of blocks). Instead we’re going to modify the GAN to generate scenes sequentially, with each one being dependent on the preceding scene.

To start we’re going to have the GAN produce wider screens with dimensions of 16x20 instead of 16x16. The discriminator will use the same architecture of convolution layers as before, except modified for the 16x20 input. However we’re going to add a wrinkle to the generator in order to make the output conditional on the previous scene in our level.

Network architecture (visualization made with Netron)

In order to do this we’re going to feed the generator two inputs. The first will be our latent vector of random noise and the second input is a 16x4 slice of level data, which we will take to be the last 4 columns of the previously generated scene.

This input will be run through some convolutional layers to generate a 1x4 vector, which we will prepend to our latent vector. The rest of the generator remains unchanged from the previous architecture and produces 16x16 images. The only difference is that 4 of the values in the input vector are no longer random noise; instead they carry some information about the previous screen.

The final thing the generator does is stitch the 16x4 input onto the 16x16 generated output to produce the final 16x20 level image. This combined 16x20 image is what we train the discriminator on. As a result, the generator must learn to produce levels that stitch properly with the input level data if it wants to fool the discriminator.
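One way this two-input generator could be realised with the Keras functional API is sketched below. The depth of the conditioning network, the filter sizes and the latent dimension are again illustrative assumptions; only the input/output shapes and the prepend-and-stitch structure follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 32  # illustrative latent size

def build_conditional_generator():
    z = tf.keras.Input(shape=(LATENT_DIM,), name="latent")
    prev = tf.keras.Input(shape=(16, 4, 4), name="prev_columns")  # last 4 columns, RGBY

    # Compress the 16x4 conditioning slice down to a 4-value vector.
    c = layers.Conv2D(8, 3, strides=2, padding="same")(prev)
    c = layers.LeakyReLU(0.2)(c)
    c = layers.Conv2D(16, 3, strides=2, padding="same")(c)
    c = layers.LeakyReLU(0.2)(c)
    c = layers.Flatten()(c)
    c = layers.Dense(4)(c)

    # Prepend the 4 conditioning values to the latent vector, then
    # upsample to a new 16x16 scene as in the unconditional generator.
    x = layers.Concatenate()([c, z])
    x = layers.Dense(4 * 4 * 128)(x)
    x = layers.Reshape((4, 4, 128))(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2DTranspose(32, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    new_scene = layers.Conv2D(4, 3, padding="same", activation="sigmoid")(x)

    # Stitch the conditioning columns onto the new scene so the
    # discriminator sees a full 16x20 image and penalises bad seams.
    full = layers.Concatenate(axis=2)([prev, new_scene])
    return tf.keras.Model([z, prev], full)
```

Because the 16x4 input is concatenated verbatim onto the output, any seam artifacts sit entirely inside the 16x20 image the discriminator judges.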

With this architecture we can generate levels iteratively, by feeding the last 4 columns of each scene back into the generator and combining them with a new latent vector to produce the successive screen. To kickstart our level generation we use a blank 16x4 level input (containing only 4 ground tiles) as the input to the first screen. Using this input is also good practice because it guarantees that Mario doesn’t start the level by falling in a hole or being embedded in a pile of blocks, etc.
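The generation loop itself can be sketched independently of any trained weights. Here a stand-in generator (returning noise) takes the place of the real model, and the seed slice puts ground tiles along the bottom row of the green channel (index 1 in the RGBY layout from earlier); the exact seed contents are an assumption.

```python
import numpy as np

LATENT_DIM = 32                     # illustrative latent size
rng = np.random.default_rng(0)

def stand_in_generator(latent, prev_columns):
    # Placeholder for the trained conditional generator: returns a
    # new 16x16x4 scene (here just noise) from a latent vector and
    # the last 4 columns of the previous scene.
    return rng.random((16, 16, 4))

def blank_start():
    # Seed slice: empty except for ground tiles (green channel,
    # index 1) along the bottom row, so Mario starts on solid footing.
    seed = np.zeros((16, 4, 4))
    seed[15, :, 1] = 1.0
    return seed

def generate_level(n_scenes, generator=stand_in_generator):
    prev = blank_start()
    scenes = []
    for _ in range(n_scenes):
        scene = generator(rng.normal(size=LATENT_DIM), prev)
        scenes.append(scene)
        prev = scene[:, -4:, :]            # feed the last 4 columns forward
    return np.concatenate(scenes, axis=1)  # 16 rows x 16*n_scenes columns

level = generate_level(40)
```

Swapping `stand_in_generator` for the trained model is the only change needed to produce real levels with this loop.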

The approach is based on the idea of a conditional GAN. In a conditional GAN both generator and discriminator are conditioned on some label data which we can then use as a switch to control the generator output. The difference here being that we don’t have pre-labeled data, but are instead going to learn it as we go.

To demonstrate the effect of the conditional generation we can look at what happens if we hold our latent vector fixed, but modify our conditional input to be either empty (only ground tiles), have a pipe in the 4th column, have a small 3x3 staircase, or a set of bricks and blocks.

Conditional generation with each row generated using a fixed latent vector and one of 4 level inputs.

Notice that some features, like the 3x3 staircase, strongly influence the generator to produce an extended staircase (although the exact structure is modified depending on the latent vector). Some features are persistent regardless of the conditional input, for example the enemy sprites near the right edge of the scene in rows 1, 2 and 4, whereas others, like the pair of Goombas in row 3, are much more fleeting. Generally we should expect the left edge of the scene to be more dependent on the level input than the right edge.

Level Generation

Now we can use this conditional generator to iteratively produce arbitrarily long levels. Here is a level stitched together from 40 iteratively generated 16x16 scenes.

We note that the level doesn’t contain any obvious stitching errors or impassable obstacles. The generated levels aren’t pixel perfect, and there are a few problematic tiles, like floating pipes. However, I think they pass the test of being both interesting and playable, as well as reproducing the observed features of our training levels and stitching properly. Some of these problems might be reduced by optimizing the GAN training, but many of them are also pretty trivial to fix with a couple of lines of code in our ROM patching script.
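As an example of the kind of trivial fix a patching script can apply, a hypothetical cleanup pass over the generated image (using the RGBY channel layout from earlier, green = ground at index 1, yellow = pipes at index 3) could ground any floating pipe before the level is written to the ROM. This is a sketch of one possible fix, not the actual script.

```python
import numpy as np

def ground_floating_pipes(level):
    # Hypothetical cleanup pass: for any column whose lowest pipe tile
    # (yellow channel, index 3) isn't resting on ground (green channel,
    # index 1), fill the column below the pipe with ground tiles.
    height, width, _ = level.shape
    for x in range(width):
        pipe_rows = np.flatnonzero(level[:, x, 3] > 0.5)
        if pipe_rows.size == 0:
            continue                      # no pipe in this column
        bottom = pipe_rows.max()
        if bottom + 1 < height and level[bottom + 1, x, 1] < 0.5:
            level[bottom + 1:, x, 1] = 1.0  # extend ground up to the pipe
    return level
```

A similar few-line pass could handle other one-off oddities, such as orphaned pipe halves or blocks overlapping sprites.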

Final Thoughts

There’s still lots of work that could go into fine-tuning this model to produce fewer mistakes, adding extra objects from the game that weren’t included in this version, or incorporating the different varieties of levels found in the original game (underground, sky, underwater, castle). There’s also lots of potential to tune the level generation process to produce themed levels that emphasize certain features.

Generated scenes range from very ordinary to a little out there

I think this demonstrates the potential of machine learning for video game content generation. This type of model could be used for unsupervised content generation or with the content edited by a human designer. It could even be used as a sort of ‘auto-complete’ for a human level designer.

The generated levels might not have the same intentional design as levels authored by a human game designer, but they manage to produce interesting gameplay scenarios regardless. Finally and most importantly, the generated levels are (in my opinion) fun to play!
