Building an Intelligent Mario Bot using Reinforcement Learning and Python

Emmanuel odor
7 min read · Mar 13, 2023


Video games have always been an excellent platform for testing and evaluating the capabilities of artificial intelligence (AI) models. Reinforcement learning, a subfield of machine learning, is a popular approach to develop AI models that can learn to make decisions based on feedback from their environment. In this blog, we will explore how to build an AI Mario model using reinforcement learning with Python.

The game mechanics of Super Mario Bros. involve movement, obstacles, enemies, power-ups, and level design. Mario moves left to right and can jump and perform various acrobatic maneuvers. Obstacles such as pits and gaps must be avoided, enemies have unique attack methods, and power-ups grant special abilities. The game is divided into several levels with distinct themes, challenges, and hidden secrets. Understanding these mechanics is essential for building an AI model that can play the game effectively.

To build our Mario agent, we’ll be using OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. We’ll also be using PyTorch and the Stable Baselines3 library to implement and train our deep learning model.

Step 1: Set up the Environment

The first step is to set up our environment. We’ll be using the gym-super-mario-bros package, which includes the Super Mario Bros. game environment. We can install it using pip:

!pip install gym_super_mario_bros==7.4.0 nes_py

Next, we’ll create our environment and reset it:

import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

# Create the Super Mario Bros. environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
# Restrict the controls to a small set of simple button combinations
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# Reset the environment to get the initial observation
state = env.reset()
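Before we start acting in the environment, it helps to see what the wrapped action space looks like. A quick, optional check (just printing attributes of the env created above):

print(env.action_space)   # a small Discrete space, one index per entry in SIMPLE_MOVEMENT
print(SIMPLE_MOVEMENT)    # the list of button combinations each action index maps to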

Let’s take some random actions to make sure the environment runs:

done = True
for step in range(5000):
    # Start a new episode whenever the previous one ends
    if done:
        state = env.reset()
    # Take a random action and render the result
    state, reward, done, info = env.step(env.action_space.sample())
    env.render()

env.close()

Step 2: Preprocess Environment

For the reinforcement learning side we’ll use PyTorch and Stable Baselines3, so install them first:

!pip3 install torch torchvision torchaudio
!pip install stable-baselines3[extra]

Next, we import the grayscale observation wrapper, the vectorization wrappers, and matplotlib for visualization:

from gym.wrappers import GrayScaleObservation
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv
from matplotlib import pyplot as plt

The GrayScaleObservation class is a wrapper from OpenAI Gym that converts RGB observations to grayscale. It reduces the dimensionality of the image data and simplifies the agent's observation space, which can speed up training and reduce memory requirements.

By applying GrayScaleObservation, we can reduce the observation space from 240x256x3 (a 240x256 RGB image with 3 color channels) to 240x256x1 (a 240x256 grayscale image with 1 color channel). This can make the observation data easier to work with and can help our agent to learn more quickly and effectively.
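As a quick sanity check (a minimal sketch; the variable names are just for illustration), you can print the observation space before and after applying the wrapper:

# Raw frames are 240x256 RGB images
raw_env = JoypadSpace(gym_super_mario_bros.make('SuperMarioBros-v0'), SIMPLE_MOVEMENT)
print(raw_env.observation_space.shape)   # (240, 256, 3)

# After GrayScaleObservation with keep_dim=True there is a single channel
gray_env = GrayScaleObservation(raw_env, keep_dim=True)
print(gray_env.observation_space.shape)  # (240, 256, 1)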

# 1. Create the base environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
# 2. Simplify the controls
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# 3. Convert observations to grayscale (keeping the channel dimension)
env = GrayScaleObservation(env, keep_dim=True)
# 4. Wrap the environment in a (single-environment) vectorized wrapper
env = DummyVecEnv([lambda: env])
# 5. Stack the last 4 frames so the agent can perceive motion
env = VecFrameStack(env, 4, channels_order='last')

# Reset, take one step, and visualize the 4 stacked frames
state = env.reset()
state, reward, done, info = env.step([5])
plt.figure(figsize=(20, 16))
for idx in range(state.shape[3]):
    plt.subplot(1, 4, idx + 1)
    plt.imshow(state[0][:, :, idx])
plt.show()

Step 3: Train the Model

Now we can train our model using reinforcement learning. We’ll use the Proximal Policy Optimization (PPO) algorithm, a popular policy-gradient method that works well in both discrete and continuous action spaces; Mario’s simplified control scheme gives us a discrete action space.

import os 

from stable_baselines3 import PPO

from stable_baselines3.common.callbacks import BaseCallback

The BaseCallback class is a callback in the Stable Baselines3 library that can be used to monitor the progress of an RL agent during training. It is a base class that other callback classes can inherit from to implement specific functionality.

A callback is a function that gets called at specific points during training to provide information or perform some action. For example, a callback might be used to log the agent’s performance after each episode, or to stop training early if the agent has reached a certain level of performance.

The BaseCallback class defines a number of methods that can be overridden in a subclass to implement specific behavior. In Stable Baselines3, the most commonly overridden methods include:

  • _init_callback(self): Called once when the callback is set up, before training starts.
  • _on_training_start(self): Called at the start of training.
  • _on_rollout_start(self): Called at the start of each rollout (experience-collection phase).
  • _on_step(self): Called after each environment step; returning False stops training early.
  • _on_rollout_end(self): Called at the end of each rollout.
  • _on_training_end(self): Called at the end of training.

By subclassing BaseCallback and implementing the desired behavior in these methods, we can create custom callbacks to monitor the agent's performance during training and perform actions as needed.
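For instance, here is a minimal, hypothetical callback (not used in the rest of this post) that ends training once a step budget is exhausted, illustrating the "return False to stop" behavior of _on_step:

class StopAfterStepsCallback(BaseCallback):
    def __init__(self, max_steps, verbose=0):
        super().__init__(verbose)
        self.max_steps = max_steps

    def _on_step(self):
        # Keep training only while we are under the step budget
        return self.n_calls < self.max_steps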

class TrainAndLoggingCallback(BaseCallback):

    def __init__(self, check_freq, save_path, verbose=1):
        super(TrainAndLoggingCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.save_path = save_path

    def _init_callback(self):
        # Create the checkpoint directory if it doesn't exist
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self):
        # Save a checkpoint of the model every check_freq steps
        if self.n_calls % self.check_freq == 0:
            model_path = os.path.join(self.save_path, 'best_model_{}'.format(self.n_calls))
            self.model.save(model_path)

        return True

Saving checkpoints and logging to TensorBoard are optional, but they make it much easier to monitor progress and resume from a good model later:

CHECKPOINT_DIR = './train/'
LOG_DIR = './logs/'
callback = TrainAndLoggingCallback(check_freq=10000, save_path=CHECKPOINT_DIR)

model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=LOG_DIR,
            learning_rate=0.000001, n_steps=512)

In reinforcement learning, a policy is a function that takes in an observation of the environment and returns an action to take. The Proximal Policy Optimization (PPO) algorithm is a popular method for training policies in RL. PPO can be used to learn both continuous and discrete policies.

In PPO, there are two main types of policies: the actor and the critic. The actor is responsible for selecting actions based on the current state of the environment, while the critic estimates the value of being in a certain state.

The PPO algorithm trains the actor and critic policies by iteratively collecting experiences from the environment and updating the policies based on those experiences. During each iteration, the actor policy is trained to maximize the expected reward, while the critic policy is trained to accurately estimate the value of states.
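To see this actor-critic structure concretely, you can print the policy of the model created above (an optional check; the exact layer names depend on the Stable Baselines3 version):

# Shows a shared CNN feature extractor followed by separate
# action (actor) and value (critic) heads
print(model.policy)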

In the Stable Baselines3 library, there are several pre-defined PPO policies that can be used for different types of RL problems. Some of the most commonly used PPO policies in Stable Baselines3 include:

  1. MlpPolicy: A multi-layer perceptron (MLP) policy that uses fully connected layers to process observations and make decisions. This policy is well-suited to problems where observations are flat feature vectors.
  2. CnnPolicy: A convolutional neural network (CNN) policy that uses convolutional layers to process image-based observations, like the Mario frames we use here.
  3. MultiInputPolicy: A policy that can handle multiple types of input (dictionary observations), such as images combined with other features. This policy is useful for problems with complex observations.
  4. MlpLstmPolicy / CnnLstmPolicy: Recurrent policies that use a long short-term memory (LSTM) network to process sequential observations; in Stable Baselines3 these are provided by the sb3-contrib package (RecurrentPPO) rather than core PPO.

Each of these policies can be used as a starting point for custom policies by modifying the network architecture or hyperparameters to better fit the specific problem at hand.
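As an illustration (a sketch only, not part of the training run in this post; the layer sizes are arbitrary), Stable Baselines3 lets you adjust a policy's architecture through the policy_kwargs argument:

# Hypothetical variant: add two extra fully connected layers of 64 units
# on top of the CNN feature extractor
custom_model = PPO('CnnPolicy', env, verbose=1,
                   policy_kwargs=dict(net_arch=[64, 64]))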

model.learn(total_timesteps=100000, callback=callback)

While learning, Stable Baselines3 prints a training log. The most useful fields are:

  1. fps: frames per second being processed.
  2. iterations: number of rollout/update cycles completed so far.
  3. time_elapsed: how long the model has been training.
  4. total_timesteps: how many environment frames the model has gone through.
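Because we passed tensorboard_log=LOG_DIR, the same metrics are also written to ./logs/. If you want richer plots (optional; TensorBoard ships with stable-baselines3[extra]), you can launch it from the notebook:

%load_ext tensorboard
%tensorboard --logdir ./logs/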

Test out the model

# Load the checkpoint that matches your training run (here, the final one from 100,000 steps)
model = PPO.load('./train/best_model_100000')

state = env.reset()
while True:
    # Ask the trained model for an action, then play it in the environment
    action, _ = model.predict(state)
    state, reward, done, info = env.step(action)
    env.render()
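If you would rather get a quantitative score than watch Mario play, Stable Baselines3 also provides an evaluation helper (a minimal sketch; the episode count is arbitrary):

from stable_baselines3.common.evaluation import evaluate_policy

# Average the reward over a few episodes for a rough measure of performance
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)
print(mean_reward, std_reward)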

Conclusion:

In this blog, we explored how to build an AI Mario model using reinforcement learning with Python. By following the steps outlined in this blog, you can create a working AI model that can learn to play the game and make decisions based on the feedback it receives. The possibilities for AI in gaming are endless, and we are just beginning to scratch the surface of what is possible.

If you’re interested in learning more about building AI models with reinforcement learning, here are some additional resources to check out:

  1. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto — a comprehensive textbook on reinforcement learning that covers the theory and practical applications of the field.
  2. Deep Reinforcement Learning with Python by Sudharsan Ravichandiran — a practical guide to building reinforcement learning models using Python, with a focus on deep learning techniques.
  3. OpenAI Gym — a toolkit for developing and comparing reinforcement learning algorithms. It includes a variety of environments for testing and evaluating AI models, including classic Atari games like Space Invaders and Ms. Pac-Man.
  4. TensorFlow — an open-source platform for building and training machine learning models, including reinforcement learning models.
  5. The Berkeley Deep RL Course — a free online course on deep reinforcement learning, taught by professors and researchers from UC Berkeley.
  6. Udacity Reinforcement Learning Nanodegree — a comprehensive online program that covers the theory and practice of reinforcement learning, including hands-on projects and mentorship from experts in the field.

By exploring these resources, you can deepen your understanding of reinforcement learning and explore the many ways it can be applied to solve complex problems in a variety of fields, including gaming.
