An Introduction to Building Custom Reinforcement Learning Environments Using OpenAI Gym

Paul Swenson
10 min read · Aug 6, 2022


Introduction

Getting into reinforcement learning (RL) and making custom environments for your problems can be a daunting task. I know it was for me when I was getting started (and I am by no means an expert in RL). Most posts on this subject are either overly complicated with terminology or have examples so complex that they can be hard to understand. I wanted to give back to the community with what I have learned in the simplest terms possible, so hopefully this will be helpful to you!

This article will take you through the process of building a very simple custom environment from scratch using OpenAI Gym. If you want to skip all the background for RL and Gym and get right to the code, go to “The Game” section.

What You Need to Know

Need to Know:

  • Python Programming Basics

Nice to Know:

  • Reinforcement Learning Basics
  • Object Oriented Programming

What is OpenAI Gym and Why Use It?

OpenAI Gym is an open source Python module which allows developers, researchers, and data scientists to build reinforcement learning (RL) environments using a pre-defined framework. The primary motivation for using Gym instead of just base Python or some other programming language is that it is designed to interact with other RL Python modules. One such module is stable-baselines3, which allows you to quickly train RL models on these environments without having to write all the algorithms yourself.

Background for Reinforcement Learning

I won’t get too in the weeds here. This is meant to be an introduction, not a super technical article!

There are two main parts to reinforcement learning (RL): the environment and the agent. The environment is the space that the agent can interact with. In many cases, environments are just games, such as Atari games like Asteroids, but they can be configured to represent a whole lot of other problems too. The agent is what interacts with the environment. If you were playing Asteroids, for example, you would be the agent controlling the ship.

RL environments have a few defining characteristics: the state of the environment, the actions an agent can take, and the reward returned to the agent after any given state transition. The state represents all of the information within the environment at a given time. For Asteroids this would be like taking a snapshot of the game. In that snapshot (the state) you have the position of the ship, the number of points the agent has earned, the number of lives, where all of the asteroids are, how fast the asteroids are moving, and so on.

When an agent takes an action, the environment state will transition to the next step and return a reward to the agent. In Asteroids, an action would be something like shooting a laser. Once the “shoot laser” action is taken, the next environment state will update all of the asteroid positions, the ship position, and, most importantly, shoot a laser from the ship. Not all actions or steps in the environment have to return rewards, but in Asteroids, a reward might be returned once an asteroid is hit. Notably, rewards can also be bad! If the ship is destroyed, a negative reward could be returned.

The rewards returned by an environment are all defined within the environment. These rewards will define what an RL agent tries to do, so it is important to think hard about what you want to reward an agent for. Having a positive reward for a ship exploding on an asteroid might cause the agent to blow itself up as fast as possible!
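
To make the state/action/reward loop concrete, here is a minimal sketch of the standard agent-environment loop in Gym. The environment name and the random “agent” below are placeholders for illustration only, not part of this tutorial’s game, and the sketch assumes the classic Gym API (where step returns four values) used throughout this article.

# a minimal sketch of the Gym agent-environment loop
import gym

# "CartPole-v1" is just a stand-in environment for illustration
env = gym.make("CartPole-v1")
state = env.reset()
done = False
while not done:
    # a "random agent": sample any valid action instead of learning one
    action = env.action_space.sample()
    # the environment moves to its next state and returns a reward
    state, reward, done, info = env.step(action)
env.close()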

The Game

We’re going to implement a very simple game so that the focus remains on how to develop a reinforcement learning (RL) environment in Gym.

(Figure: “The Game” — a 6x6 grid containing the blue agent square, the green win square, and the red lose square.)

The game is as follows:

  • The agent is the blue square
  • The agent can move up, down, left, or right
  • The game is fully contained within a 6x6 grid
  • All of the colored square positions are randomized at the start of the game and cannot overlap
  • The agent cannot move outside of the grid, i.e. if it tries to move left two times from the above position, it will move left once, then not move on the second action.
  • If the agent gets to the green square, it wins the game
  • If the agent gets to the red square, it loses the game
  • The game will continue until the green or red squares are landed on

Code Representation of the Game

(Figure: “The Game as Numbers” — the same grid with every square replaced by its numeric value.)

Since Gym requires all of the environment states to be represented numerically, we will represent the environment as follows:

  • The agent is the number 1
  • The green square is the number 2
  • The red square is the number 3
  • The empty squares are the number 0

Gym also requires the environment state to be represented as a single row of values, not a grid of values. We can achieve this by taking the top row, appending the 2nd row to the end of it, appending the 3rd row to the end of those combined rows, and so on. In linear algebra, this is called flattening a matrix (the grid). The new state is now a single row of 36 values.
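
As a quick sketch, with made-up square positions, flattening the grid with numpy looks something like this:

import numpy as np

# a hypothetical 6x6 grid: 0 = empty, 1 = agent, 2 = green, 3 = red
grid = np.zeros((6, 6), dtype=np.int16)
grid[4, 1] = 1   # agent
grid[0, 5] = 2   # green (win) square
grid[2, 3] = 3   # red (lose) square

# flatten the grid row by row into a single row of 36 values
state = grid.flatten()
print(state.shape)  # prints (36,)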

The actions the agent can take must be represented as numbers too. I have defined them as follows:

  • 0 = Up
  • 1 = Down
  • 2 = Left
  • 3 = Right

We will define a few rewards to be returned to the agent after it takes certain actions.

  • If the agent wins the game, it will be rewarded with 1 point
  • If the agent loses the game, it will be rewarded with -1 point
  • Every action that does not result in a win or a loss will give a reward of -0.01 points. This is to incentivize the agent to take the fewest actions to win the game.

Awesome! Let’s get to coding.

Coding the Environment

You will need to download and install Python 3.5+.
After Python is installed, I usually install stable-baselines3, which will also install Gym.
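
At the time of writing, a single pip command should pull in both (exact versions may vary):

pip install stable-baselines3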

All of the following code is available publicly on my GitHub.

Gym environments have 4 functions that need to be defined within the environment class:

  • __init__(self)
  • step(self, action)
  • reset(self)
  • render(self)

A good starting point for any custom environment is to copy another existing environment, like this one, or one from the OpenAI repo.

Imports

# the Gym environment class
from gym import Env
# predefined spaces from Gym
from gym import spaces
# used to randomize starting positions
import random
# used for integer datatypes
import numpy as np
# used for clearing the display in Jupyter notebooks
from IPython.display import clear_output
# used for clearing the display in a terminal
import os

Global constants used in this environment for readability

#
# global constants
#
# game board values
NOTHING = 0
PLAYER = 1
WIN = 2
LOSE = 3
# action values
UP = 0
DOWN = 1
LEFT = 2
RIGHT = 3

The environment class

Below is a basic shell of a Gym environment. BasicEnv is the name of our custom environment. In the BasicEnv class definition, Env is passed as the parent class; this is the abstract Gym class that our custom environment will inherit from and implement. Inside of the class we have all of the functions mentioned above.

class BasicEnv(Env):
    def __init__(self):
        pass

    def step(self, action):
        pass

    def reset(self):
        pass

    def render(self):
        pass

__init__

The init function is the function which sets up all of the variables for the class and defines the action and state spaces. Every environment requires a few variables to be set up in order for agents to interact with it:

  • self.state
    This is the state of the game which we described above (remember that flattened grid we talked about?)
  • self.observation_space
    This is Gym’s way of describing what values are valid for a given position within the state (aka an observation). In our case, an observation within the state must be an empty square, the agent, the green square, or the red square. As numbers, like we mentioned earlier, these are 0, 1, 2, or 3.
  • self.action_space
    These are the valid actions that an agent can take. In our case: Up, Down, Left, or Right. Or as numbers: 0, 1, 2, or 3.

def __init__(self):
    # custom class variable used to display the reward earned
    self.cumulative_reward = 0

    #
    # set the initial state to a flattened 6x6 grid with a randomly
    # placed player, win square, and lose square
    #
    self.state = [NOTHING] * 36
    self.player_position = random.randrange(0, 36)
    self.win_position = random.randrange(0, 36)
    self.lose_position = random.randrange(0, 36)

    # make sure the player, win, and lose positions aren't
    # overlapping each other
    while self.win_position == self.player_position:
        self.win_position = random.randrange(0, 36)
    while self.lose_position == self.win_position or \
          self.lose_position == self.player_position:
        self.lose_position = random.randrange(0, 36)

    self.state[self.player_position] = PLAYER
    self.state[self.win_position] = WIN
    self.state[self.lose_position] = LOSE

    # convert the Python list into a numpy array
    # (this is needed since Gym expects the state to be this way)
    self.state = np.array(self.state, dtype=np.int16)

    # observation space (valid ranges for observations in the state)
    self.observation_space = spaces.Box(0, 3, [36,], dtype=np.int16)

    # valid actions:
    #   0 = up
    #   1 = down
    #   2 = left
    #   3 = right
    # spaces.Discrete(4) is a shortcut for defining the actions 0-3
    self.action_space = spaces.Discrete(4)

step

The step function is the most involved. It defines how an action updates the environment state, as well as what reward (if any) the agent earns for the action. For this game, this code will define what happens when the agent moves, whether the game is over, and how much reward was earned.

The step function takes an action as a parameter when it is called. The action must be within the action_space defined in the init function, or an exception will be thrown.

This function is required to return 4 values:

  • self.state
    The updated state after the action is taken.
  • reward
    The reward earned by the agent after taking an action.
  • done
    True/False depending on whether the game has ended after the action was taken.
  • info
    A Python dictionary which can be used to return information for debugging purposes.

def step(self, action):
    # placeholder for debugging information
    info = {}

    # set default values for done, reward, and the player position
    # before taking the action
    done = False
    reward = -0.01
    previous_position = self.player_position

    #
    # take the action by moving the player
    #
    # this section can be a bit confusing, but
    # just trust that it moves the agent and prevents
    # it from moving off of the grid
    #
    if action == UP:
        if (self.player_position - 6) >= 0:
            self.player_position -= 6
    elif action == DOWN:
        if (self.player_position + 6) < 36:
            self.player_position += 6
    elif action == LEFT:
        if (self.player_position % 6) != 0:
            self.player_position -= 1
    elif action == RIGHT:
        if (self.player_position % 6) != 5:
            self.player_position += 1
    else:
        # check for invalid actions
        raise Exception("invalid action")

    #
    # check for win/lose conditions and set the reward
    #
    if self.state[self.player_position] == WIN:
        reward = 1.0
        self.cumulative_reward += reward
        done = True

        # this section is for display purposes
        clear_screen()
        print(f'Cumulative Reward: {self.cumulative_reward}')
        print('YOU WIN!!!!')
    elif self.state[self.player_position] == LOSE:
        reward = -1.0
        self.cumulative_reward += reward
        done = True

        # this section is for display purposes
        clear_screen()
        print(f'Cumulative Reward: {self.cumulative_reward}')
        print('YOU LOSE')

    #
    # update the environment state
    #
    if not done:
        # update the player position
        self.state[previous_position] = NOTHING
        self.state[self.player_position] = PLAYER
        self.cumulative_reward += reward

    return self.state, reward, done, info

reset

This function will reset the environment to an initial state. Kinda like a reset button on a video game. The reset function is required to return self.state.

Most of the code in this section will look just like the init function, but we do not have to redefine the action space or the observation space.

def reset(self):
    self.cumulative_reward = 0

    #
    # set the initial state to a flattened 6x6 grid with a randomly
    # placed player, win square, and lose square
    #
    self.state = [NOTHING] * 36
    self.player_position = random.randrange(0, 36)
    self.win_position = random.randrange(0, 36)
    self.lose_position = random.randrange(0, 36)

    # make sure the player, win, and lose positions aren't
    # overlapping each other
    while self.win_position == self.player_position:
        self.win_position = random.randrange(0, 36)
    while self.lose_position == self.win_position or \
          self.lose_position == self.player_position:
        self.lose_position = random.randrange(0, 36)

    self.state[self.player_position] = PLAYER
    self.state[self.win_position] = WIN
    self.state[self.lose_position] = LOSE

    # convert the Python list into a numpy array
    # (needed since Gym expects the state to be this way)
    self.state = np.array(self.state, dtype=np.int16)

    return self.state

render

The render function defines how the game will be visualized. On Linux computers, you can create video-game-like visualizations of the environment, but this is not required. For this tutorial, we will use a text-based visualization.

I created two helper functions to assist in the visualization.

# clears the screen of any output
def clear_screen():
    clear_output()
    os.system("cls")

# prints out the environment state in a visually appealing way
def pretty_print(state_array, cumulative_reward):
    clear_screen()
    print(f'Cumulative Reward: {cumulative_reward}')
    print()
    for i in range(6):
        for j in range(6):
            print('{:4}'.format(state_array[i*6 + j]), end="")
        print()

With these functions set up, the render function looks like this:

def render(self):
    # visualization can be added here
    pretty_print(self.state, self.cumulative_reward)

Running the Environment

Your custom environment can be run by initializing it and taking actions against it. Here I made a separate Python script which takes user inputs to interact with the environment.

# Import our custom environment code
from BasicEnvironment import *

# create a new Basic Environment
env = BasicEnv()
# visualize the current state of the environment
env.render()
# ask for some user input for the action
action = int(input("Enter action:"))
# take the action provided by the user
#
# note that this function returns the environment state, a
# reward, whether the game is over, and some debugging information
state, reward, done, info = env.step(action)

# keep repeating those steps until the game is over
while not done:
    env.render()
    action = int(input("Enter action:"))
    state, reward, done, info = env.step(action)

Testing the Environment for StableBaselines3 Compatibility

You can test whether the environment is compatible with the stable-baselines3 module using its very handy check_env function:

from BasicEnvironment import *
from stable_baselines3.common.env_checker import check_env
env = BasicEnv()
check_env(env)

Any errors in implementation will be caught by this function with some good details on what went wrong.
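
Once check_env passes, the environment can be trained on directly. Here is a rough, untested sketch of what that might look like with stable-baselines3's PPO implementation; the algorithm choice and timestep count are arbitrary and not tuned for this game.

from BasicEnvironment import *
from stable_baselines3 import PPO

env = BasicEnv()
# "MlpPolicy" is a standard feed-forward policy network
model = PPO("MlpPolicy", env, verbose=1)
# train for an arbitrary number of timesteps
model.learn(total_timesteps=10000)

# play one game with the trained model
state = env.reset()
done = False
while not done:
    action, _ = model.predict(state)
    state, reward, done, info = env.step(action)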

Conclusion

I hope this was helpful to you. This code probably isn't optimized or perfect, but if you see any major errors, feel free to shoot me a message. Be sure to look at plenty of other examples of Gym environments; it will probably take more than just this tutorial to get a feel for how they work.

Good luck on your journey with reinforcement learning!

Additional Resources

Source code on my Github:
https://github.com/PaulSwenson2/ReinforcementLearningProjects


