Anatomy of a custom environment for RLlib

Paco Nathan
Aug 27, 2020

RLlib is an open-source library in Python, based on Ray, which is used for reinforcement learning (RL). This article presents a brief tutorial about how to build custom Gym environments to use with RLlib. You can use this as a starting point for representing your own use cases to solve with reinforcement learning (RL).

Note that this article is a follow-up to a previous introductory article about RLlib. If you haven’t read that article already, check it out.

Source code for this article is available at https://github.com/DerwenAI/gym_example

The material here comes from Anyscale Academy and complements the RLlib documentation. This is intended for those who have:

  • some Python programming experience
  • some familiarity with machine learning
  • an introduction to reinforcement learning and RLlib (see previous article)

Key takeaways for this article include how to:

  • represent a problem to solve with RL
  • build custom Gym environments that work well with RLlib
  • structure a Git repo to support custom Gym environments
  • register a custom Gym environment
  • train a policy in RLlib using PPO
  • run a rollout of the trained policy

The backstory

One of the first things that many people run when they’re learning about RL is the CartPole example environment. OpenAI Gym has an implementation called CartPole-v1 which has an animated rendering that many RL tutorials feature.

That has become a kind of “Hello World” for reinforcement learning, and the CartPole visualization of how RL trains a policy for an agent is super helpful.

However, when it comes time to represent your own use case as an environment for RL, what should you use as a base? While there are many examples of Gym environments, the documentation is sparse. Moreover, those examples use class inheritance in ways that often make their source code difficult to follow. Plus, the requirements for structuring a Python library (as a Git repo) so that it can be registered as a custom environment are neither intuitive, simple to troubleshoot, nor especially well-documented.

More importantly, we need to discuss the how and why of building a Gym environment to represent problems that are well suited for reinforcement learning. Effective problem representation is often a conceptual hurdle for Python developers who are just starting to use RL. Checking through the source code of popular Gym environments, many involve complex simulations and physics that are specific to one use case, yet they contribute very little toward understanding how one could build other environments in general.

This article attempts to show a “minimum viable environment” for RLlib, one which illustrates and exercises all of the features needed to be an exemplary RL problem. Meanwhile the code is kept simple enough to generalize and adapt for other problems. Hopefully this will provide a starting point for representing your own use cases to train and solve using RLlib.

Represent a problem for reinforcement learning

Let’s work with a relatively simple example, one that illustrates most of the elements of effective problem representation for reinforcement learning. Consider an environment which is simply an array with index values ranging from min to max and with a goal position set at the middle of the array.

At the beginning of each episode, the agent starts at a random index other than the goal. The agent does not know the goal’s position and must explore the environment — moving one index at a time — to discover it before the end of an episode.

Think of this as a robotics equivalent of the popular children’s guessing game “Hot and Cold” where an agent must decide whether to step left or right, then afterwards gets told “You’re getting warmer” or “You’re getting colder” as feedback. In robotics — or let’s say within the field of control systems in general — there is often a problem of working through real-world sensors and actuators which have error rates or other potential distortions. Those systems must work with imperfect information as feedback. This kind of “Hot and Cold” game is what robots often must learn to “play” during calibration, much like children do.

In terms of using RL to train a policy with this environment, at each step the agent will:

  • observe its position in the array
  • decide whether to move left or right
  • receive a reward

We want the agent to reach the goal as efficiently as possible, so we need to structure the rewards carefully. Each intermediate step returns a negative reward, while reaching the goal returns a large positive reward. The reward for moving away from the goal needs to be more negative than moving closer to it. That way fewer steps in the correct direction result in larger cumulative rewards.

Let’s repeat a subtle point about structuring the rewards: a good intermediate step returns a less negative reward, a bad intermediate step returns a more negative reward, and achieving the goal returns a large positive reward. That’s an important pattern to follow, otherwise the agent might get trapped in a kind of infinite loop and never learn an effective policy.
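To make that concrete, here’s a small worked example using the reward values that appear later in the environment class (-1 for a step toward the goal, -2 for a step away, +10 for landing on the goal). Suppose the agent starts three positions to the left of the goal:

# walking straight toward the goal takes three steps:
straight = (-1) + (-1) + 10               # => 8

# wandering one step away first, then correcting course, takes five steps:
detour = (-2) + (-1) + (-1) + (-1) + 10   # => 5

The shorter, more direct path earns the larger cumulative reward, which is exactly the behavior we want the policy to learn.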

We also need to limit the maximum number of steps in an episode, so that the agent is not merely correct in identifying the goal but also efficient. Too small of a limit will constrain the information that an agent obtains per episode. Too large of a limit will cause extra work with diminishing returns. To keep this simple, we’ll set the limit to the length of the array. Extra credit: if we call the “max steps” limit a hyperparameter, how might you determine an optimal setting for it? Looking ahead several tutorials, here’s a hint.

At this point, we’ve represented the problem in terms of a gradient for an agent to explore through some sequence of decisions. The agent only has partial information about the environment (observations and rewards) although that’s enough to explore and learn how to navigate effectively.

Define a custom environment in Gym

Source code for this custom environment is located on GitHub and the bulk of what the following code snippets explore is in the module at https://github.com/DerwenAI/gym_example/blob/master/gym-example/gym_example/envs/example_env.py

You can follow the full code listing as we describe each block of code in detail to show why and how each function is defined and used. First, we need to import the libraries needed for Gym:

import gym
from gym.utils import seeding

The import for seeding helps manage random seeds for the pseudorandom number generator used by this environment. We’ll revisit that point later in the tutorial. While it’s not required, this feature can become quite useful when troubleshooting problems.

Next, we need to define the Python class for our custom environment:

class Example_v0 (gym.Env):

Our custom environment is named Example_v0 and is defined as a subclass of gym.Env. Within this class we will define several constant values to describe the “array” in our environment. These aren’t required by Gym, but they help manage the array simulation…

Possible index values in the array range from a minimum (left-most position) to a maximum (right-most position). To keep this simple and easy to debug, let’s define the length of the array as 10, with index values starting at 1 (in other words, 1-based indexing). Let’s define constants for the bounds of the array:

LF_MIN = 1
RT_MAX = 10

At each step the agent can either move left or right, so let’s define constants to represent these actions:

MOVE_LF = 0
MOVE_RT = 1

To make the trained policies efficient, we need to place a limit on the maximum number of steps before an episode ends and structure the rewards:

MAX_STEPS = 10

REWARD_AWAY = -2
REWARD_STEP = -1
REWARD_GOAL = MAX_STEPS

A Gym environment class can define an optional metadata dictionary:

metadata = {
"render.modes": ["human"]
}

In this case, we’ll define one metadata parameter render.modes as a list with "human" as its only element. That means the environment will support the simple “human” mode of rendering text to the terminal, rather than something more complex such as image data to be converted into an animation.

Next, we need to define six methods for our environment class. Gym provides some documentation about these methods, although arguably it’s not complete. Not all of them are required by Gym or RLlib; even so, let’s discuss why and how to implement each one, in case you need them for representing your use case.

The __init__() method

Let’s define the required __init__() method which initializes the class:

def __init__ (self):
    self.action_space = gym.spaces.Discrete(2)
    self.observation_space = gym.spaces.Discrete(self.RT_MAX + 1)

    # possible positions to choose on `reset()`
    self.goal = int((self.LF_MIN + self.RT_MAX - 1) / 2)
    self.init_positions = list(range(self.LF_MIN, self.RT_MAX))
    self.init_positions.remove(self.goal)

    # change to guarantee the sequence of pseudorandom numbers
    # (e.g., for debugging)
    self.seed()

    self.reset()

This function initializes two required members as Gym spaces:

  • self.action_space – the action space of possible actions taken by the agent
  • self.observation_space – the observation space for what info the agent receives after taking an action

The action space is based on gym.spaces.Discrete and is a discrete space with two possible values: the constants MOVE_LF = 0 and MOVE_RT = 1 defined above. These determine how the agent communicates its actions back to the environment.

The observation space is also based on gym.spaces.Discrete and is a discrete space whose size is the length of the environment array plus one. Recall that for simplicity we chose above to use a 1-based array, so index 0 is never used.

The self.goal and self.init_positions members are specific to this environment and not required by Gym. These members place our goal in the middle of the array, so that it’s not randomly chosen. We decided to do that to help make this environment simpler for a reader to understand, troubleshoot, dissect, recombine parts, etc. While the agent’s initial position is randomized (anywhere in the array other than the goal) the goal stays in place. Later in this tutorial, when we train a policy with RLlib, we’ll show how this example converges quickly and the performance curves for the RL learning metrics have classic shapes — what you hope to see in an RL problem.
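Plugging in the constants defined above gives a quick sanity check of those two members. Note that Python’s range() excludes its upper bound, so the right-most position never appears among the initial positions chosen by reset():

# with LF_MIN = 1 and RT_MAX = 10 as defined above:
goal = int((1 + 10 - 1) / 2)            # => 5, the middle of the array
init_positions = list(range(1, 10))     # => [1, 2, ..., 9]
init_positions.remove(goal)             # => [1, 2, 3, 4, 6, 7, 8, 9]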

We call self.seed() to randomize the pseudorandom numbers (more about that later) and then call self.reset() to reset the environment to the start of an episode.

The reset() method

Now we’ll define the required reset() method. This resets the state of the environment for a new episode and also returns an initial observation:

def reset (self):
    self.position = self.np_random.choice(self.init_positions)
    self.count = 0

    self.state = self.position
    self.reward = 0
    self.done = False
    self.info = {}

    return self.state

The self.position and self.count members are specific to this environment and not required by Gym. These keep track of the agent’s position in the array and how many steps have occurred in the current episode.

By convention, Gym environments use four other members to describe the outcome of a step: self.state, self.reward, self.done, and self.info. For the value returned from a reset, we’ll provide self.state as the initial state, which is the agent’s randomized initial position.

The step() method

Now we’ll define the required step() method to handle how an agent takes an action during one step in an episode:

def step (self, action):
    if self.done:
        # should never reach this point
        print("EPISODE DONE!!!")
    elif self.count == self.MAX_STEPS:
        self.done = True
    else:
        assert self.action_space.contains(action)
        self.count += 1
        # insert simulation logic to handle an action ...

    try:
        assert self.observation_space.contains(self.state)
    except AssertionError:
        print("INVALID STATE", self.state)

    return [self.state, self.reward, self.done, self.info]

In other words, step() takes an action as its only parameter. If the self.done flag is set, then the episode has already finished; while that should never happen, let’s trap it as an edge case. If the agent reaches the maximum number of steps in this episode without reaching the goal, then we set the self.done flag and end the episode.

Otherwise, assert that the input action is valid, increment the count of steps, then run through the simulation logic to handle an action. At the end of the function we assert that the resulting state is valid, then return the expected list [self.state, self.reward, self.done, self.info] to complete this action.

The block of programming logic required for handling an action is a matter of updating the state of the environment and determining a reward. Let’s review this logic in terms of the two possible actions. When the action is to “move left”, then the resulting state and reward depend on the position of the agent compared with the position of the goal:

if action == self.MOVE_LF:
    if self.position == self.LF_MIN:
        # invalid
        self.reward = self.REWARD_AWAY
    else:
        self.position -= 1

        if self.position == self.goal:
            # on goal now
            self.reward = self.REWARD_GOAL
            self.done = 1
        elif self.position < self.goal:
            # moving away from goal
            self.reward = self.REWARD_AWAY
        else:
            # moving toward goal
            self.reward = self.REWARD_STEP

In other words, the agent cannot move further left than self.LF_MIN and any attempt to do so is a wasted move. Otherwise, the agent moves one position to the left. If that move lands the agent on the goal, then the episode is done and the resulting reward is the maximum positive value. If not, then the agent receives a less negative reward for moving toward the goal, and a more negative reward for moving away from the goal.

The logic for handling the action to “move right” is written in a similar way:

elif action == self.MOVE_RT:
    if self.position == self.RT_MAX:
        # invalid
        self.reward = self.REWARD_AWAY
    else:
        self.position += 1

        if self.position == self.goal:
            # on goal now
            self.reward = self.REWARD_GOAL
            self.done = 1
        elif self.position > self.goal:
            # moving away from goal
            self.reward = self.REWARD_AWAY
        else:
            # moving toward goal
            self.reward = self.REWARD_STEP

After handling that logic, we update the environment’s state and also populate the optional self.info member, a Python dictionary that provides diagnostic information which can be useful for troubleshooting:

self.state = self.position
self.info["dist"] = self.goal - self.position

The contents of self.info can be anything that fits in a Python dictionary. In this case, let’s keep track of the distance between the agent and goal, to measure whether we’re getting closer. Note: this additional info cannot be used by RLlib during training.
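As a quick illustration (a sketch, not part of the environment itself, and assuming the package layout described later in this article), you could inspect that diagnostic info by stepping the environment manually:

from gym_example.envs.example_env import Example_v0

env = Example_v0()
state = env.reset()

# take one step to the right, then check the diagnostic info
state, reward, done, info = env.step(env.MOVE_RT)
print(info["dist"])   # signed distance from the agent's position to the goal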

The render() method

Next, we’ll define the optional render() method, to visualize the state of the environment:

def render (self, mode="human"):
    s = "position: {:2d} reward: {:2d} info: {}"
    print(s.format(self.state, self.reward, self.info))

This is especially helpful for troubleshooting and you can make it as simple or as complex as needed. In this case, we’re merely printing out text to describe the current state, the most recent reward, and the debugging info defined above.

The seed() method

Now we’ll define the optional seed() method to set a seed for this environment’s pseudorandom number generator:

def seed (self, seed=None):
    self.np_random, seed = seeding.np_random(seed)
    return [seed]

This function returns the list of one or more seeds used, where the first value in the list should be the “main” seed, i.e., the value to be passed in to reproduce a sequence of random numbers. For example, each reset in this environment initializes the agent’s position in the array. For debugging purposes you may want to ensure that the sequence of initial positions for the agent stays the same while iterating through the episodes.
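For example, here’s a minimal reproducibility check (a sketch, again assuming the package layout described later in this article): seeding the environment twice with the same value should produce the same sequence of initial positions:

from gym_example.envs.example_env import Example_v0

env = Example_v0()

env.seed(39)
first_run = [env.reset() for _ in range(5)]

env.seed(39)
second_run = [env.reset() for _ in range(5)]

assert first_run == second_run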

The close() method

For the optional close() method, we’ll define how to handle closing an environment:

def close (self):
    pass

Gym environments automatically close during garbage collection or when a program exits. In this case we used pass as a no-op. Override this function to handle any special clean-up procedures that are required by your use case.

There, we did it! We’ve defined a custom environment. Even so, we cannot use it quite yet… First we need to add setup instructions for using its source code as a Python library.

Structure the Python library and Git repo

Now that we have an environment implemented, next we need to structure the subdirectories so that it can be imported and registered properly. This becomes a bit tricky, since multiple software libraries will make use of the environment. Python will treat it both as a library to be imported and as a class to be instantiated, which requires specific naming conventions. Then RLlib will need to have the custom environment registered prior to training. Then Gym will need to construct the environment separately so that we can deserialize a trained policy for a rollout. RLlib and Gym each reference the environment differently than the Python import does. Therefore we must structure the layout and naming of subdirectories rather carefully.

Given that the source code module example_env.py lives in a Git repo, here’s a subdirectory and file layout for the repo:
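gym_example/
    requirements.txt
    sample.py
    train.py
    gym-example/
        setup.py
        gym_example/
            __init__.py
            envs/
                __init__.py
                example_env.py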

We will consider each component of this layout in turn. In particular, pay attention to the differences between a dash - and an underscore _ in the subdirectory and file names, otherwise various software along the way will get rather persnickety about it.

Due to the way that Gym environments get installed and imported by Python, we need to define gym-example/setup.py to describe path names and library dependencies required for installation:

from setuptools import setup

setup(name="gym_example",
      version="1.0.0",
      install_requires=["gym"]
)

In other words, this environment implementation depends on the Gym library and its source code will be expected in the gym_example subdirectory.

Then we need to define the gym-example/gym_example/__init__.py script so that our custom environment can be registered before usage:

from gym.envs.registration import register

register(
    id="example-v0",
    entry_point="gym_example.envs:Example_v0",
)

In other words, there will be a Python class Example_v0 defined within the envs subdirectory. When we register the environment prior to training in RLlib we’ll use example-v0 as its key. Going into the envs subdirectory, we need to define the script gym-example/gym_example/envs/__init__.py as:

from gym_example.envs.example_env import Example_v0

We also need to add our source code module example_env.py into the envs subdirectory.

Finally, we have described the full path needed to reach the source code for the custom environment that we defined above.

Measure a random-action baseline

Now that we have a library defined, let’s use it. Before jumping into the RLlib usage, first we’ll create a simple Python script that runs an agent taking random actions. Source code is available at https://github.com/DerwenAI/gym_example/blob/master/sample.py

This script serves two purposes: First, it creates a “test harness” to exercise our environment implementation simply and quickly, before we move to train a policy. In other words, we can validate the environment’s behaviors separately. Second, it measures a baseline for how well the agent performs, statistically, by taking random actions without the benefit of reinforcement learning.

First we need to import both Gym and our custom environment:

import gym
import gym_example

Now we’ll define a function run_one_episode() which resets the environment initially then runs through all the steps in one episode, returning the cumulative rewards:

def run_one_episode (env):
    env.reset()
    sum_reward = 0

    for i in range(env.MAX_STEPS):
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        sum_reward += reward

        if done:
            break

    return sum_reward

At each step, an action is sampled using env.action_space.sample() then used in the env.step(action) call. This is another good reason to use env.seed() for troubleshooting — to force that sequence of “random” actions to be the same each time through. BTW, you may want to sprinkle some debugging breakpoints or print() statements throughout this loop, to see how an episode runs in detail.

To use this function, first, we’ll create the custom environment and run it for just one episode:

env = gym.make("example-v0")
sum_reward = run_one_episode(env)

Note that we called gym.make("example-v0") with the key defined in the previous section, not the name of the Python class or the library path. Given that this code runs as expected, next let’s calculate a statistical baseline of rewards based on random actions:

history = []

for _ in range(10000):
    sum_reward = run_one_episode(env)
    history.append(sum_reward)

avg_sum_reward = sum(history) / len(history)
print("\nbaseline cumulative reward: {:6.2}".format(avg_sum_reward))

This code block iterates through 10000 episodes to calculate a mean cumulative reward. In practice, the resulting value should be approximately -5.0, give or take a small fraction.

Train a policy with RLlib

At last, we’re ready to use our custom environment in RLlib. Let’s define another Python script to train a policy with reinforcement learning. Source code for this script is at https://github.com/DerwenAI/gym_example/blob/master/train.py

Let’s take care of a few preparations before we start training. We’ll initialize the directory in which to save checkpoints (i.e., serialize a policy to disk) as the subdirectory tmp/exa, and also clear out the directory where Ray writes its logs, which is ~/ray_results/ by default:

import os
import shutil
chkpt_root = "tmp/exa"
shutil.rmtree(chkpt_root, ignore_errors=True, onerror=None)
ray_results = "{}/ray_results/".format(os.getenv("HOME"))
shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

We’ll start Ray running locally, i.e., on this machine rather than on a remote cluster:

import ray

ray.init(ignore_reinit_error=True)

BTW, if you ever need to use a debugger to troubleshoot a custom environment, there’s a “local mode” for Ray that forces all tasks into a single process for simpler debugging. Just add another parameter local_mode=True in the ray.init() call.
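In other words, the debugging variant of that call would look like:

import ray

# run all Ray tasks in a single process, to simplify stepping
# through the code with a debugger
ray.init(ignore_reinit_error=True, local_mode=True)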

Next we need to register our custom environment:

from ray.tune.registry import register_env
from gym_example.envs.example_env import Example_v0
select_env = "example-v0"
register_env(select_env, lambda config: Example_v0())

Note how we needed to use both the "example-v0" key and the Example_v0() Python class name, and that the Python import requires a full path to the source module.

Next we’ll configure the environment to use proximal policy optimization (PPO) and create an agent to train using RLlib:

import ray.rllib.agents.ppo as ppo

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"
agent = ppo.PPOTrainer(config, env=select_env)

The preparations are all in place, and now we can train a policy using PPO. This loop runs through 5 training iterations. Given that this is a relatively simple environment, that should be enough to show substantial improvement in the policy by using RLlib:

status = "{:2d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:4.2f} saved {}"
n_iter = 5

for n in range(n_iter):
    result = agent.train()
    chkpt_file = agent.save(chkpt_root)

    print(status.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        chkpt_file
    ))

For each iteration, we call result = agent.train() to run the episodes, and then call chkpt_file = agent.save(chkpt_root) to save a checkpoint of the latest policy. Then we print metrics that show how well the learning has progressed. The resulting output should look close to the following:

 1 reward -21.00/ -6.96/ 10.00 len 7.83 saved tmp/exa/checkpoint_1/checkpoint-1
 2 reward -20.00/  1.24/ 10.00 len 5.51 saved tmp/exa/checkpoint_2/checkpoint-2
 3 reward -20.00/  5.89/ 10.00 len 3.90 saved tmp/exa/checkpoint_3/checkpoint-3
 4 reward -17.00/  7.19/ 10.00 len 3.30 saved tmp/exa/checkpoint_4/checkpoint-4
 5 reward -17.00/  7.83/ 10.00 len 2.92 saved tmp/exa/checkpoint_5/checkpoint-5

After the first iteration, the mean cumulative reward is -6.96 and the mean episode length is 7.83 … by the third iteration the mean cumulative reward has increased to 5.89 and the mean episode length has dropped to 3.90 … meanwhile, both metrics continue to improve in subsequent iterations.

If you run this loop longer, the training reaches a point of diminishing returns after about ten iterations. Then you can run Tensorboard from the command line to visualize the RL training metrics from the log files:

tensorboard --logdir=$HOME/ray_results

Recall that our baseline measure for mean cumulative reward was -5.0, so the policy trained by RLlib has improved substantially over an agent taking actions at random. The curves in the Tensorboard visualizations above — such as episode_len_mean and episode_reward_mean — have classic shapes for what you generally hope to see in an RL problem.
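If you’d rather not hand-tune the number of training iterations, one option is to keep training until the mean episode reward stops improving. Here’s a minimal sketch of that idea (not part of the original train.py script):

# train until episode_reward_mean fails to improve for `patience`
# consecutive iterations
best_reward = float("-inf")
patience = 3
stalled = 0

while stalled < patience:
    result = agent.train()
    chkpt_file = agent.save(chkpt_root)

    if result["episode_reward_mean"] > best_reward:
        best_reward = result["episode_reward_mean"]
        stalled = 0
    else:
        stalled += 1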

Apply a trained policy in a rollout

Continuing within the same train.py script, let’s make use of the trained policy through what’s known as a rollout. First, some preparations: we need to restore the latest saved checkpoint for the policy, then create our environment and reset its state:

import gym

agent.restore(chkpt_file)
env = gym.make(select_env)
state = env.reset()

Now let’s run the rollout through 20 steps, rendering the state of the environment after each step and resetting it at the end of each episode:

sum_reward = 0
n_step = 20

for step in range(n_step):
    action = agent.compute_action(state)
    state, reward, done, info = env.step(action)
    sum_reward += reward

    env.render()

    if done == 1:
        print("cumulative reward", sum_reward)
        state = env.reset()
        sum_reward = 0

The line action = agent.compute_action(state) represents most of the rollout magic here for using a policy instead of training one. The resulting output should look close to the following:

position:  5 reward: 10 info: {'dist': 0}
cumulative reward 10
position:  3 reward: -1 info: {'dist': 2}
position:  4 reward: -1 info: {'dist': 1}
position:  5 reward: 10 info: {'dist': 0}
cumulative reward 8
position:  7 reward: -1 info: {'dist': -2}
position:  6 reward: -1 info: {'dist': -1}
position:  5 reward: 10 info: {'dist': 0}
cumulative reward 8

Great, we have used RLlib to train a reasonably efficient policy for an agent in our Example_v0 custom environment. The rollout shows how such a trained policy could be integrated and deployed in a use case.
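For instance, a deployed application might wrap the trained policy behind a small helper function. Here’s a rough sketch (a hypothetical helper, not part of the original script):

# run the trained policy for one episode and return the sequence of
# (state, reward, info) tuples observed along the way
def run_policy (agent, env, max_steps=20):
    state = env.reset()
    trajectory = []

    for _ in range(max_steps):
        action = agent.compute_action(state)
        state, reward, done, info = env.step(action)
        trajectory.append((state, reward, info))

        if done:
            break

    return trajectory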

Summary

The full source code in Python for this tutorial is in the GitHub repo https://github.com/DerwenAI/gym_example

Use Git to clone that repo, change into its directory, then install this custom environment:

pip install -r requirements.txt
pip install -e gym-example

In summary, we’ve created a custom Gym environment to represent a problem to solve with RL. We showed a template for how to implement that environment and how to structure the subdirectories of its Git repo. We created a “test harness” script and analyzed the environment to get a baseline measure of the cumulative reward with an agent taking random actions. Then we used RLlib to train a policy using PPO, saved checkpoints, and evaluated the results by comparing with our random-action baseline metrics. Finally, we ran a rollout of the trained policy, showing how the resulting policy could be deployed in a use case.

Hopefully this will provide a starting point for representing your own use cases to train and solve using RLlib.

Also, check the Anyscale Academy for related tutorials, discussions, and events. In particular, you can learn much more about reinforcement learning (tools, use cases, latest research, etc.) at the Ray Summit conference which will be held online September 30 through October 1 (free!), with tutorials on September 29 (nominal fee).

Kudos to https://deepdreamgenerator.com/ for image processing with deep learning.
