Reinforcement Learning 101: AI plays Pokémon!

Bas de Haan
Sopra Steria NL Data & AI
8 min read · May 27, 2024

Like many 90s kids I grew up playing Pokémon. I spent hundreds of hours on these games.
Now I'm a professional Data Scientist with an interest in Reinforcement Learning. What interests me about Reinforcement Learning is that it's one of the few Machine Learning techniques that stays fairly close to how a human would approach the problem, instead of just throwing math at it.

In this blog we will go through the basics of Reinforcement Learning and how each aspect can be applied to the Pokémon games.

What is Reinforcement Learning?

Reinforcement Learning is a type of Machine Learning where an agent learns to make decisions by performing certain actions in an environment.
In short, the key terms are:

- The agent uses a machine learning model that we aim to train.
- The input for this model is the current state of the environment.
- The output is the action it takes based on that input.
- The chosen action gets executed in the environment, resulting in a new state and a reward.
- The agent uses the state-action-reward combination as training data and updates the model using a predefined policy.

All this is executed in a loop, each iteration feeding the new state to the updated agent, resulting in slowly improving actions.
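As a rough sketch of that loop (the env follows the Gymnasium API used later in this post, while the agent object and its predict and update methods are placeholders, not a specific library API):

# conceptual training loop; 'agent' and its methods are placeholders
total_steps = 10_000                                    # illustrative number of iterations
state, info = env.reset()
for step in range(total_steps):
    action = agent.predict(state)                       # model maps the current state to an action
    state, reward, terminated, truncated, info = env.step(action)
    agent.update(state, action, reward)                 # the state-action-reward combination is training data
    if terminated or truncated:                         # start a new episode when the old one ends
        state, info = env.reset()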

Environment: Exploring the world

This is the Pokémon game itself, including the world, battles, menus and all interactions. The environment has a variety of tasks within the training loop.

- Managing the emulator running the game. This includes loading, saving and resetting the game as needed.
- Executing actions on that emulator.
- Producing the state in a format that is ready for the model to interpret.
- Calculating the reward for a given action. This is done by reading the memory of the emulator for important in-game values like levels, world exploration and any other values that indicate game progression.

Multiple of these environments can be run in parallel. This not only speeds up training significantly, but also helps avoid getting stuck in a local optimum.
Initially the agent moves randomly. With a single environment there is a decent chance that the agent does not find the next big reward before the iteration is over.
When running 24 environments in parallel, the chance that they all get stuck is a lot smaller. I chose 24 environments because that's the maximum that would fit in my RAM.
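As a hedged sketch using Stable-Baselines3 (the library used later for the agent), parallel environments can be created like this; the GoldGymEnv class and config are defined below, and the make_env helper is just illustrative:

from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    # each subprocess constructs its own emulator-backed environment
    return lambda: GoldGymEnv(config)

vec_env = SubprocVecEnv([make_env() for _ in range(24)])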

The essentials of the environment are as follows:

import numpy as np
from gymnasium import Env, spaces
from pyboy import PyBoy, WindowEvent
from skimage.transform import resize

class GoldGymEnv(Env):
    def __init__(self, config):
        # define the valid incoming actions
        self.valid_actions = [
            WindowEvent.PRESS_ARROW_DOWN,
            WindowEvent.PRESS_ARROW_LEFT,
            WindowEvent.PRESS_ARROW_RIGHT,
            WindowEvent.PRESS_ARROW_UP,
            WindowEvent.PRESS_BUTTON_A,
            WindowEvent.PRESS_BUTTON_B,
        ]
        self.action_space = spaces.Discrete(len(self.valid_actions))

        # define the shape of outgoing states
        self.output_shape = (36, 40, 3)
        self.observation_space = spaces.Box(low=0, high=255, shape=self.output_shape, dtype=np.uint8)

        # start an instance of the emulator
        self.pyboy = PyBoy(config['rom_path'])

        # save the screen for easy access to screenshots
        self.screen = self.pyboy.botsupport_manager().screen()
        self.max_steps = config['max_steps']
        self.step_count = 0

    def render(self):
        # grab the raw screen and downscale it to the observation shape
        game_pixels_render = self.screen.screen_ndarray()
        game_pixels_render = (255 * resize(game_pixels_render, self.output_shape)).astype(np.uint8)
        return game_pixels_render

    def step(self, action):
        self.run_action_on_emulator(action)
        screenshot = self.render()

        new_reward, new_prog = self.update_reward()

        self.step_count += 1
        step_limit_reached = self.step_count >= self.max_steps

        # gymnasium expects (observation, reward, terminated, truncated, info)
        return screenshot, new_reward, False, step_limit_reached, {}
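One thing the essentials above leave out is reset(), which Gymnasium (and PPO) calls at the start of every episode. A minimal, assumed version could reload a saved emulator state so every run starts from the same point; self.init_state_path is an assumed attribute, not shown in __init__ above:

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.step_count = 0
        # reload a saved emulator state so every episode starts from the same point;
        # self.init_state_path is assumed to be read from the config in __init__
        with open(self.init_state_path, 'rb') as f:
            self.pyboy.load_state(f)
        return self.render(), {}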

Now creating an environment is as simple as:

from gold_gym_env import GoldGymEnv   # assuming the class is saved in gold_gym_env.py

env = GoldGymEnv(config)
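Here config is a plain dictionary. Based on the keys the class reads (plus the assumed key from the reset() sketch above), a minimal version might look like this, with placeholder values:

config = {
    'rom_path': 'PokemonGold.gbc',           # placeholder path to the game ROM
    'max_steps': 2048 * 8,                   # steps per episode; this value is illustrative
    'init_state_path': 'start_state.state',  # assumed key used by the reset() sketch above
}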

Agent: The AI trainer

The agent is the player. The agent holds the model used to determine the action to take.
The model is initialized randomly, so it starts with no knowledge and learns by trial-and-error.
Early on the actions will be random, but with each iteration of learning it gets less and less random.

There are many different algorithms the agent can use to update the model. Some don't even require a model at all.
Explaining the difference between these could fill a blog on its own.

To choose one for this project, we’ll compare three popular algorithms: Deep Q-Networks (DQN), Proximal Policy Optimisation (PPO) and Advantage Actor-Critic (A2C).
I found that:

- DQN approximates an action-value (Q) function. This algorithm works best with discrete state spaces.
Tic-tac-toe has such a state space: {empty, X or O} per cell. A screenshot as used in this project is not discrete in that sense.
DQN is also off-policy, meaning it learns from a replay buffer of all the experience it has seen so far, rather than only from the data gathered by its current policy. That buffer comes with some overhead, which means more RAM usage and therefore fewer parallel agents.
- PPO can be used for a wide variety of tasks, both in discrete and continuous state spaces. However, it will require more samples to train to the same level as other algorithms.
- A2C trains multiple neural nets, the actor and the critic, which comes with overhead in the learning phase.

As DQN doesn't handle this kind of large, continuous state space well, it was easily disqualified.
PPO vs. A2C was close, but my tests showed that PPO learns a lot faster in practice: PPO needs more environment steps to reach the same point in the game, but A2C was slower in wall-clock time, spending more of its time on learning updates instead of on gathering data.

Creating an agent is as simple as:

from stable_baselines3 import PPO

agent = PPO('CnnPolicy', env)

Here ‘CnnPolicy’ tells Stable-Baselines3 to use a Convolutional Neural Network as the policy. Let’s dive more into that.

Policy: Improving strategies

The policy is the network that maps a state to an action, and it is this network that gets updated during training. Which policy you can or should use depends mostly on the complexity of your environment and the way you're going to feed the state to the model.

For example: For a model learning to play tic-tac-toe, using a neural net would be overkill. The input is 3 by 3 and there are a very limited number of possible states to learn the optimal action for.
For Pokémon however, there are hundreds of variables plus the screen output you could use as input.
I chose to use just a screenshot as input for the model, making a policy that uses a Convolutional Neural Net the logical choice.
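With the environment, agent and policy in place, training itself is a single Stable-Baselines3 call on the agent created above; the timestep budget here is only an illustrative number:

# start training; PPO alternates between gathering data from the
# environment and updating the CNN policy
agent.learn(total_timesteps=1_000_000)   # illustrative training budget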

State: Snapshot of the adventure

A screenshot of the current state of the game. This is fed to the model to determine the best action to take.

A state can also just be a list of variables. When using RL for something like stock trading, the input would be the stock prices, history, relevant news, etc.
But for this problem a screenshot is easy to obtain from the emulator and has all the info the model should need.
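Since the observation space defined earlier is an RGB array of shape (36, 40, 3), a quick sanity check on the state could look like this:

screenshot = env.render()
print(screenshot.shape, screenshot.dtype)   # expected: (36, 40, 3) uint8, matching the observation space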

Action: Decisive moves

The action the agent takes to influence the state of the environment in some way.
The action-space is the collection of actions the agent can do. In this case most of the buttons: ←↑→↓AB.
I chose to leave out the START and SELECT buttons, since you don't need them until about halfway through the game. They would therefore give no reward, and the model would just waste time unlearning them.
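For completeness, the run_action_on_emulator call used in step() was not shown earlier. A hedged sketch using PyBoy's send_input and tick API could look like this; release_actions would be a parallel list of WindowEvent.RELEASE_* constants and is an assumption here, not part of the class shown above:

    def run_action_on_emulator(self, action):
        # press the chosen button, advance the emulator a few frames, then release it
        self.pyboy.send_input(self.valid_actions[action])
        for _ in range(8):                                    # frame count is an illustrative choice
            self.pyboy.tick()
        self.pyboy.send_input(self.release_actions[action])  # release_actions is an assumed attribute
        self.pyboy.tick()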

Reward: The Heart of the Training Process

The reward is what it's all about. It is the only feedback the agent gets to update its model.
The agent will train the model to maximize the reward, without knowing what the reward is made up of.

The reward can include anything you can get from the emulator. The most important parts of the reward were:
- Levels: The total level of your Pokémon. This makes sure that the agent keeps defeating wild Pokémon. It does run the risk of getting stuck in the first grass it sees, training its Pokémon to level 100.
- Damage done: Damage dealt to opponent Pokémon. This helped the agent understand how to defeat wild Pokémon at all. It will get a reward for using damaging moves, and no reward for using the moves it should avoid.
- Exploration: A small reward per unique square visited in the world. This motivates the agent to explore the world as much as possible (a sketch of this reward follows after this list).
- Negative steps: A tiny negative reward per action executed. This ensures that the agent will keep trying to find the next reward, as it punishes actions that don't result in any reward.
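As a sketch of how such a component can be implemented, the exploration reward could look roughly like this; read_player_position and seen_coords are assumed helpers (seen_coords starting as an empty set in __init__), and the weight per square is illustrative:

    def get_exploration_reward(self):
        # read_player_position is an assumed helper returning a (map, x, y) tuple from emulator memory
        position = self.read_player_position()
        self.seen_coords.add(position)        # seen_coords is an assumed set of visited squares
        return len(self.seen_coords) * 0.01   # small reward per unique square; the weight is illustrative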

For example, in the screenshot below there are only two actions that don't result in a net negative reward: ←↓AB will keep you stuck in the corner, while ↑→ will get you out of there, towards unexplored squares.

Challenges along the way

One of the challenges with Pokémon is that battles look completely different from the overworld. The CNN needs to learn how to choose an action in both situations. In many of my experiments the model learned either to navigate the world with ease or to battle perfectly, but rarely both.

Balancing these two scenarios is done by tweaking the reward function. Anything that you can read from the emulator memory can be used in the reward function.

The full reward function is as follows:

def get_reward(self):
    state_scores = {
        'events': self.get_event_reward() * 0.01,
        'level': self.get_levels_reward(),
        'items': self.get_items_reward(),
        'healing': self.get_healing_reward(),
        'opponent_lvl': self.get_max_opponent_level(),
        'dmg_done': self.get_damage_reward(),
        'dead': self.get_dead_count() * -1,
        'seen_count': self.get_seen_count(),
        'caught_count': self.get_caught_count(),
        'exploration': self.get_exploration_reward(),
        'maps': self.get_maps_explored(),
        'neg_steps': self.step_count * -0.001
    }

    return state_scores
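Note that update_reward, as called in step(), still has to turn this dictionary into a single number for the agent. A minimal, assumed version could look like this; a real implementation would likely return only the increase since the previous step:

def update_reward(self):
    state_scores = self.get_reward()
    total_reward = sum(state_scores.values())   # collapse the component scores into one scalar
    return total_reward, state_scores           # matches the two values unpacked in step()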

Each of these functions retrieves the relevant data from the emulator memory at a hexadecimal address and calculates a score from it. Memory locations like these can be found in community-maintained RAM maps online. For example, the items reward is the following:

def get_items_reward(self):
    num_items = max(self.pyboy.get_memory_value(0xD5B7), 0)
    num_ball_items = max(self.pyboy.get_memory_value(0xD5FC), 0)
    num_key_items = max(self.pyboy.get_memory_value(0xD5E1), 0)
    return sum([num_items * 0.05, num_ball_items * 0.1, num_key_items * 2])

Another challenge is that the Pokémon games don't have just one goal. Of course there is the well-known 'Gotta catch 'em all', but that is a goal few people achieve.
Another goal is beating all the gyms, important fights that get progressively harder throughout the game.
To achieve any of these goals, there are smaller tasks to complete: pick your first Pokémon, make your way to the next town, win battles, etc.
The reward function would be slightly different for each of these goals. Crafting the perfect reward function is the real challenge in Reinforcement Learning.

Conclusion: Gotta train 'em All

Combining Reinforcement Learning and Pokémon presents a fascinating mix of nostalgia and modern technology. By applying the core concepts of Reinforcement Learning, we can get an agent to play Pokémon relatively easily.
It took quite some training, but the agent made it to the first town. With more time and training capacity I have no doubt it could finish the game.

What's next?

Compared to Atari, Pokémon is a relatively complex game. Steadily progressing in Pokémon is a big step, but there are way more complex games out there. I think it is only a matter of time before AI can play them all.
