Intro to RLlib: Example Environments

Paco Xander Nathan
Distributed Computing with Ray
Jul 9, 2020


RLlib is an open-source library in Python, based on Ray, which is used for reinforcement learning (RL). This article provides a hands-on introduction to RLlib and reinforcement learning by working step-by-step through sample code. The material in this article, which comes from Anyscale Academy, provides a complement to the RLlib documentation and is especially intended for those who have:

  • some Python programming experience
  • some familiarity with machine learning
  • no previous work in reinforcement learning
  • no previous hands-on experience with RLlib

Key takeaways: we will compare and contrast well-known RL examples running in RLlib to explore the essential concepts and terminology for reinforcement learning, and highlight typical coding patterns used in RLlib by examining end-to-end use cases in Python.

Note that the code is in Python, which you can copy/paste into a script and run. We also use a few Bash scripts that you must run separately from a command line. Those will be highlighted.

Also, check the Anyscale Academy for related tutorials, discussions, and events. In particular, you can learn much more about reinforcement learning (tools, use cases, latest research, etc.) at Ray Summit.


First question: what is reinforcement learning?

When people talk about machine learning, the discussion is typically about supervised learning. Train a model based on patterns that are associated with specific labels in data — generalizing from those patterns — then use the model to detect similar patterns within other data. Decisions get made based on predicted labels. Many use cases tend to be in predictive analytics, such as whether or not a credit card transaction should be investigated as potential fraud.

With reinforcement learning, one or more agents interact within an environment which may be either a simulation or a connection to real-world sensors and actuators.

At each step, the agent receives an observation (i.e., the state of the environment), takes an action, and receives a reward. Agents learn from repeated trials, each of which is called an episode: the sequence of actions from an initial observation up to either a “success” or a “failure” that causes the environment to reach its “done” state. The learning portion of an RL framework trains a policy about which actions (i.e., sequential decisions) cause agents to maximize their long-term, cumulative rewards. Many RL use cases involve control systems, where policies determine sequential decisions over time: competing in video games, managing a financial portfolio, robotics, self-driving cars, factory automation, and so on.
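To make the vocabulary concrete, here is a minimal sketch of the observation/action/reward loop in plain Python. The CorridorEnv class, its reward values, and the two policies are hypothetical, invented only for illustration; the class is not a Gym environment (Gym's step() also returns an info dictionary, for instance).

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos  # the initial observation

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.steps += 1
        success = self.pos == self.length - 1   # reached the goal cell
        failure = self.steps >= self.max_steps  # ran out of time
        reward = 10 if success else -1          # made-up reward values
        done = success or failure
        return self.pos, reward, done


def run_episode(env, policy):
    """Run one episode; return its cumulative reward and its length."""
    obs = env.reset()
    total_reward, done = 0, False
    while not done:
        action = policy(obs)                  # the policy maps observation -> action
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward, env.steps


random.seed(0)
env = CorridorEnv()
random_return, random_len = run_episode(env, lambda obs: random.choice([0, 1]))
greedy_return, greedy_len = run_episode(env, lambda obs: 1)
print("random policy:", random_return, "reward in", random_len, "steps")
print("always-right policy:", greedy_return, "reward in", greedy_len, "steps")
```

Training a policy amounts to moving from the first kind of behavior toward the second: shorter episodes with larger cumulative rewards.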

Next question: why use Ray and RLlib?

On the one hand, RLlib offers scalability. RL applications can be quite compute-intensive and often need to scale out onto a cluster. Ray provides foundations for parallelism and scalability which are simple to use and allow Python programs to scale anywhere from a laptop to a large cluster. On the other hand, RLlib provides a unified API which can be leveraged across a wide range of applications.

While the core ideas of reinforcement learning have been used in industry for decades, many of those implementations were isolated. Through the RLlib developer community, many different algorithms and frameworks are becoming integrated within a common library. It becomes simpler to evaluate the performance and trade-offs of different alternative approaches. While RLlib includes integrations for the popular TensorFlow and PyTorch frameworks, most of its internals are agnostic about specific frameworks. The unified API helps support a broad range of use cases — whether for integrating RL support into a consumer application at scale, or conducting research with a large volume of offline data.


To get started with the coding examples, we’ll use pip from the command line to install three required libraries. Alternatively, use conda if that’s preferred.

The first line installs Ray and RLlib. The second line installs the Gym toolkit from OpenAI, which provides many different environments that illustrate well-known RL problems. Use of environments helps to standardize RL approaches and compare results more objectively. We’ll be working with four Gym environments in particular: “Taxi-v3”, “FrozenLake-v0”, “CartPole-v1”, and “MountainCar-v0”.

Each of these environments has been studied extensively, so there are available tutorials, papers, example solutions, and so on for further study. These four start with simple text-based problems, then progress to more complex problems in control theory. We chose these environments because they are simple to install and run on laptops (GPUs aren’t required). Moreover, each environment implements a render() method to visualize its actions and state.

BTW, the third installation is needed to use TensorBoard later to visualize metrics for how well the RL policy training is running.


We’ll start with the “Taxi-v3” environment, and for details about it check the OpenAI site at

For more background about this RL problem in general see:

The agent in this problem is a taxi driver whose goal is to pick up passengers and drop them off at their desired destinations as fast as possible while navigating through available paths in a 5x5 grid. The __init__() method in the source code shows that this environment has a total of 500 possible states, numbered between 0 and 499. Rewards get pre-computed for each state to make the environment simulation simply an array lookup, where each state has been encoded based on a tuple:

(taxi_row, taxi_col, passenger_location, destination)
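As a sketch of how such a tuple can map onto a single state number, the following packs the four fields in mixed radix: 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states. The field sizes are an assumption based on Gym's convention (four landmarks plus "in the taxi" for the passenger, four landmarks for the destination), and the function names are illustrative.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four fields into one state number in the range 0..499."""
    state = taxi_row                        # 0..4
    state = state * 5 + taxi_col            # 0..4
    state = state * 5 + passenger_location  # 0..3 landmarks, 4 = in the taxi (assumed)
    state = state * 4 + destination         # 0..3 landmarks
    return state


def decode(state):
    """Invert encode(): recover the tuple from a state number."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


example_state = encode(3, 1, 2, 0)
print(example_state, decode(example_state))
```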

After training a policy with many iterations, we’ll save a checkpoint copy of the trained policy to a file. Then later we can use a rollout to run the taxi agent in an example use case. In other words, at a later point we can restore a trained policy from a checkpoint file, then use that policy to guide an agent through its environment.

Graph representation of the “Taxi” RL problem, [Dietterich 1999]

At each step in the rollout, the render() method prints a 2-D map of the taxi agent operating inside its environment: picking up a passenger, driving, turning, dropping off a passenger (“put-down”), and so on. Text symbols in the “Taxi-v3” map encode the environment’s observation space — in other words, the state used by the agent to decide which action to take. Understanding these maps requires some decoding of the text symbols:

  • R — r(ed) destination in the Northwest corner
  • G — g(reen) destination in the Northeast corner
  • Y — y(ellow) destination in the Southwest corner
  • B — b(lue) destination in the Southeast corner
  • : — cells where the taxi is allowed to drive
  • | — obstructions (“walls”) which the taxi must avoid
  • blue letter represents the current passenger’s location for pick-up
  • purple letter represents the drop-off destination
  • yellow rectangle — location of taxi/agent when empty
  • green rectangle — location of taxi/agent when full

For example, here’s an initial observation of the “Taxi-v3” environment:

That’s one possible starting point. The taxi is located in the first row, third column. A wall is on its west side. A passenger is waiting at the Northeast corner, with a destination at the Northwest corner.

The action space of possible actions for the taxi agent is defined as:

  • move the taxi one square North
  • move the taxi one square South
  • move the taxi one square East
  • move the taxi one square West
  • pick-up the passenger
  • put-down the passenger

The rewards are structured as -1 for each action plus:

  • +20 points when the taxi performs a correct drop-off for the passenger
  • -10 points when the taxi attempts illegal pick-up/drop-off actions

Recall that the taxi agent is attempting to pick-up, navigate, and drop-off as fast as possible without making mistakes. In other words, we’re training policies to have episode lengths that are as short as possible, and cumulative rewards that are as large as possible. That’s why each action encodes a -1 penalty. Based on the initial observation shown above, it would take a minimum of 12 steps in an episode for the taxi agent to accomplish its goal, with a maximum value of 8 for its reward. Early episodes will probably include many mistakes, and tend to have greater lengths and lower rewards. Over time we expect to see the trained policy improve with higher average rewards and shorter average episodes.
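The 12-step figure can be checked with a breadth-first search over the grid. This is a sketch under two assumptions: the wall layout below is reproduced from the Gym source for “Taxi-v3” (it is not shown in this article), and the reward is computed exactly as worded above, i.e. -1 for each action plus +20 for the correct drop-off.

```python
from collections import deque

# Assumed wall layout: ':' can be driven through, '|' is a wall.
TAXI_MAP = [
    "|R: | : :G|",
    "| : | : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
]
LANDMARKS = {"R": (0, 0), "G": (0, 4), "Y": (4, 0), "B": (4, 3)}


def neighbors(row, col):
    """Yield the cells reachable from (row, col) in one move."""
    if row > 0:
        yield row - 1, col                    # north
    if row < 4:
        yield row + 1, col                    # south
    if TAXI_MAP[row][2 * col + 2] == ":":
        yield row, col + 1                    # east
    if TAXI_MAP[row][2 * col] == ":":
        yield row, col - 1                    # west


def shortest_moves(start, goal):
    """Breadth-first search for the fewest moves between two cells."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        cell, dist = queue.popleft()
        if cell == goal:
            return dist
        for nxt in neighbors(*cell):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


taxi, pickup, dropoff = (0, 2), LANDMARKS["G"], LANDMARKS["R"]
to_passenger = shortest_moves(taxi, pickup)
to_destination = shortest_moves(pickup, dropoff)
steps = to_passenger + 1 + to_destination + 1   # moves plus pick-up and put-down
episode_reward = -1 * steps + 20                # reward structure as worded above
print(to_passenger, to_destination, steps, episode_reward)
```

The detour shows up in the second leg: the wall west of the third column forces the taxi down to the middle row before it can head for the Northwest corner.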

With all of those definitions in mind, let’s jump into some code. First we’ll start Ray running in the background. Note: this kind of initialization only runs Ray on a laptop — we’d use a different approach to launch Ray on a cluster.

Running a shutdown followed by an init should get things started. There are other command line tools being developed to help automate this step, but this is the programmatic way to start in Python. Note that the acronym “PPO” means Proximal Policy Optimization, which is the method we’ll use in RLlib for reinforcement learning. PPO allows for minibatch updates to optimize the training process. For more details see the RLlib documentation about PPO, as well as the original paper “Proximal Policy Optimization Algorithms” by Schulman, et al., which describes the benefits of PPO as “a favorable balance between sample complexity, simplicity, and wall-time.”
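For intuition about what PPO optimizes, here is a rough sketch of the clipped surrogate term from the Schulman, et al., paper, evaluated for a single sample. It is not RLlib's implementation; the function name is illustrative, and epsilon=0.2 is only a commonly quoted value.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Clipped objective term for one sample.

    ratio     -- new policy probability / old policy probability for the action
    advantage -- estimate of how much better the action was than expected
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio
    # far outside the interval [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))    # gain is capped once the ratio exceeds 1 + epsilon
print(clipped_surrogate(0.5, -1.0))   # the more pessimistic of the two terms is kept
```

Because the objective stops rewarding large policy changes, the same batch of experience can be reused for several minibatch updates.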

After Ray launches on a laptop it will have a dashboard running on a local port. Run the following line in Python to show the Ray dashboard port:

The dashboard is helpful for understanding metrics, charts, and other features that describe the operation of Ray, and potentially for any troubleshooting or configuration changes:

Next we’ll configure a file location for checkpoints, in this case in a tmp/ppo/taxi subdirectory, deleting any previous files there. Run the following code in Python:
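A minimal sketch of that step using only the standard library might look like the following. The helper name reset_checkpoint_dir is illustrative, and recreating the empty directory afterwards is a choice made here rather than something the text requires.

```python
import os
import shutil

CHECKPOINT_ROOT = "tmp/ppo/taxi"


def reset_checkpoint_dir(path=CHECKPOINT_ROOT):
    """Delete any previous checkpoints under path, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)  # no error if the directory is missing
    os.makedirs(path, exist_ok=True)
    return path


chkpt_root = reset_checkpoint_dir()
print("checkpoints will be saved under", chkpt_root)
```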

Now we’re ready to configure RLlib to train a policy using the “Taxi-v3” environment and a PPO optimizer:

Now let’s train a policy. The following code runs 30 iterations and that’s generally enough to begin to see improvements in the “Taxi-v3” problem:

Do the min/mean/max rewards increase after multiple iterations? Are the mean episode lengths decreasing? Those metrics will show whether a policy is improving with additional training:

Increase the value of N_ITER and rerun to see the effects of more training iterations.
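The configure-and-train steps described above might look roughly like the sketch below. It assumes the 2020-era RLlib API (the ray.rllib.agents.ppo module with PPOTrainer and DEFAULT_CONFIG), which newer Ray releases have replaced, so treat the calls as version-specific. The helper names format_status and train_ppo are illustrative. Ray is imported inside the function so that the helpers load even where Ray isn't installed.

```python
N_ITER = 30
STATUS = "{:2d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"


def format_status(n, result, chkpt_file):
    """One line per iteration: min/mean/max reward and mean episode length."""
    return STATUS.format(
        n,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        chkpt_file,
    )


def train_ppo(env_name="Taxi-v3", n_iter=N_ITER, chkpt_root="tmp/ppo/taxi"):
    """Train a PPO policy and save a checkpoint after every iteration."""
    # Assumption: Ray 0.8.x-style module layout; adjust for newer releases.
    import ray
    import ray.rllib.agents.ppo as ppo

    ray.shutdown()
    ray.init(ignore_reinit_error=True)

    config = ppo.DEFAULT_CONFIG.copy()
    config["log_level"] = "WARN"
    agent = ppo.PPOTrainer(config, env=env_name)

    for n in range(1, n_iter + 1):
        result = agent.train()               # one training iteration
        chkpt_file = agent.save(chkpt_root)  # path of the checkpoint just written
        print(format_status(n, result, chkpt_file))
    return agent
```

Calling train_ppo() with a larger n_iter is the equivalent of increasing N_ITER and rerunning.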

We can use TensorBoard to visualize these training metrics. To launch it from the command line:

In this case the charts show two training runs with RLlib, which have similar performance metrics. Based on these charts, we likely could have iterated further to obtain a better policy.

Let’s inspect the trained policy and model, to see the results of training in detail:

The output should be close to:

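Note how the InputLayer in the deep learning model has a shape with 500 inputs, encoded as one for each possible state. The value_out layer has a single output, which is the estimated value of the current state that PPO uses during training. The action itself is chosen from a separate output layer that has one output per possible action (six for “Taxi-v3”).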
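The "one input for each possible state" encoding is a one-hot vector. A small sketch, with an illustrative function name:

```python
def one_hot(state, n_states=500):
    """Encode a discrete state number as a vector with a single 1.0 entry."""
    vec = [0.0] * n_states
    vec[state] = 1.0
    return vec


inputs = one_hot(328)
print(len(inputs), inputs.index(1.0))
```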

Next we’ll use the RLlib rollout script to restore from a checkpoint and evaluate the trained policy in a use case. If you were deploying a model into production — say, if there were a video game with a taxi running inside it — a policy rollout would need to run continuously, connected to the inputs and outputs of the use case. From a command line run:

Similar to the training command, we’re telling the rollout script to use one of the last checkpoints with the “Taxi-v3” environment and a PPO optimizer, then evaluate it through 2000 steps. The episodes will be visualized as 5x5 grids, such as:

Note: the Gym implementation of “Taxi-v3” does not show a state number with each state visualization. Also, one must wait until the end of an episode to see its cumulative reward.

That covers the “Taxi-v3” example. To run this code in a Jupyter notebook, see the Anyscale Academy repo at:


Similar to the “Taxi-v3” environment, the “FrozenLake-v0” environment is another one of the “toy text” examples provided in OpenAI Gym, albeit perhaps somewhat less well-known:

In this environment, a “character” agent has been playing frisbee with friends at a park during winter, next to a frozen lake. A wild throw landed the frisbee in the midst of the lake. The surface of the lake is mostly frozen over, although there are a few holes where the ice has melted. Also, an international shortage in frisbee supplies means that the agent absolutely must retrieve the frisbee, but of course not fall through a hole in the ice while doing so. Unfortunately the ice is slippery, so the agent doesn’t always move in an intended direction. In other words, the agent must try to find a walkable path to a goal tile, amongst probabilistic hazards that make the RL problem even more challenging. Those hazards perform a function similar to the per-action penalties in “Taxi-v3” above — in other words, the probabilistic slipping on the ice makes longer episodes less successful, encouraging the agent to find efficient solutions.

The observation space in “FrozenLake-v0” is defined as a 4x4 grid:

  • S — starting point, safe
  • F — frozen surface, safe
  • H — hole, fall to your doom
  • G — goal, where the frisbee is located
  • orange rectangle shows where the agent is currently located

Note that the output for “FrozenLake-v0” is transposed compared with output from the “Taxi-v3” environment. In this observation, the agent has moved “right” from the starting point and is now located in the first row, second column. A hole is immediately “down” from the agent. The frisbee is located in the fourth row, fourth column. Good luck!

The action space is defined by four possible movements across the grid on the frozen lake:

  • move up
  • move down
  • move left
  • move right

The rewards are given at the end of each episode (when the agent either reaches the goal or falls through a hole in the ice) and are structured as:

  • 1 if the agent reaches the goal (i.e., retrieves the frisbee)
  • 0 otherwise (i.e., utter doom)

In the observation given above, the agent could reach the goal with a maximum reward of 1 within 6 actions.
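The slippery ice can be sketched as follows. This assumes the behavior of the Gym implementation, where the agent moves in the intended direction or in one of the two perpendicular directions with equal probability, along with Gym's default 4x4 layout and action numbering; none of these are shown in this article.

```python
import random

# Assumed default 4x4 layout: S start, F frozen, H hole, G goal.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3   # assumed Gym action numbering
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def slippery_step(row, col, action, rng):
    """Take one step on slippery ice; return (row, col, reward, done)."""
    # Intended direction or one of its two perpendicular neighbors, 1/3 each.
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    d_row, d_col = MOVES[actual]
    row = min(max(row + d_row, 0), 3)   # stay on the grid
    col = min(max(col + d_col, 0), 3)
    tile = LAKE[row][col]
    reward = 1 if tile == "G" else 0
    done = tile in "GH"
    return row, col, reward, done


rng = random.Random(0)
# From the first row, second column, repeatedly try to move right.
outcomes = {slippery_step(0, 1, RIGHT, rng)[:2] for _ in range(1000)}
print(sorted(outcomes))
```

Trying to move right from that cell can leave the agent where it was (slipping up, into the edge of the grid), move it right as intended, or drop it into the hole below.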

We can reuse much of the same code from the “Taxi-v3” example, and in this description we’ll skip over the redundant parts. First, set up the checkpoint subdirectory:

Similarly, we’ll configure to use the PPO optimizer again:

Then train a policy, this time using 10 iterations:

Output for the training metrics shows steady, linear improvement through the 10 iterations:

For the sake of comparison, let’s examine the policy and model:

The output should be close to:

Note how the input layer has 16 inputs, one for each cell of the 4x4 grid (~3% the input length of “Taxi-v3”), and how the resulting model has considerably fewer parameters (~36% the size of the “Taxi-v3” model).
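The parameter ratio can be sanity-checked with some arithmetic. The sketch below assumes an architecture that is not shown in this article: separate policy and value branches, each with two fully connected hidden layers of 256 units, fed by one-hot inputs sized to the number of discrete states (500 for “Taxi-v3”, 16 for the 4x4 “FrozenLake-v0” grid). The function names are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def ppo_fc_params(n_inputs, n_actions, hidden=(256, 256)):
    """Total parameters for separate policy and value branches (assumed layout)."""
    total = 0
    for n_head_outputs in (n_actions, 1):   # policy head, then value head
        width = n_inputs
        for h in hidden:
            total += dense_params(width, h)
            width = h
        total += dense_params(width, n_head_outputs)
    return total


taxi = ppo_fc_params(n_inputs=500, n_actions=6)
lake = ppo_fc_params(n_inputs=16, n_actions=4)
print(taxi, lake, round(lake / taxi, 2))
```

Under those assumptions the ratio comes out at roughly 36%. Most of the difference comes from the first hidden layer, whose size scales with the number of inputs.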

Now let’s try a rollout from the command line:

Notice how the episode rewards in this example have less information: either 0 or 1 at the end of each episode. Even so, the training results should show that the episode_reward_mean metric increases steadily after the first few iterations, although not quite as dramatically as in the “Taxi-v3” training.

The point of this example is to illustrate how the “Taxi-v3” and “FrozenLake-v0” environments have much in common. The process of training either environment with RLlib then running the resulting policy in a rollout uses the same code with only a few parameters changed. This underscores the point about RLlib providing a unified API, as a generalized Python library for evaluating different kinds of reinforcement learning use cases and approaches for optimizing them.

To run this code in a Jupyter notebook, see the Anyscale Academy repo at:


Next we’ll run the “CartPole-v1” environment. This is one of the “classic control” examples in OpenAI Gym, and arguably one of the most well-known RL problems. One might even call this problem the “Hello World” of reinforcement learning.

This is the same Sutton and Barto who wrote Reinforcement Learning: An Introduction. Their original paper is highly recommended; while it’s published behind a membership paywall on the IEEE site, since this paper has become part of the canon for RL research you can find open access copies, such as

The problem at the heart of “CartPole-v1” was originally described in a much earlier paper about machine learning: “Boxes: an experiment in Adaptive Control” (1968) by D Michie and RA Chambers. The problem consists of balancing a pole that’s hinged to a cart which moves along a frictionless track. The agent can either push the cart to the left or to the right at each timestep. The amount of velocity resulting from these pushes depends on the angle of the pole at the time, since the amount of energy required to move the cart changes as the pole’s center of gravity changes. Interestingly, the original problem and proposed solution by Barto, et al., were used to explore an early kind of neural network (related to Hebbian learning), which comes full circle now given that RLlib uses neural networks (deep learning) to learn RL policies.

CartPole problem illustration from [Barto 1983]

The observation space, i.e., the state of the system, is defined by four variables:

  • x: position of the cart on the track [-4.8, 4.8]
  • 𝜃: pole angle from vertical [-0.418 rad, 0.418 rad]
  • x’: cart velocity
  • 𝜃’: pole angular velocity

The pole (pendulum) starts upright, and the goal of the agent is to prevent the pole from falling over. The agent fails whenever the pole is more than 12 degrees from vertical in the Gym implementation, or the cart moves more than 2.4 units from the center; either condition will end the episode.

The action space is defined by two possible movements:

  • push left
  • push right

A reward of +1 is given for every timestep that the pole remains upright.

Note that an episode of “CartPole-v1” can continue for a maximum of 500 timesteps; the problem is considered “solved” when the average reward is 475.0 or greater over 100 consecutive trials. There’s an earlier version of this environment called “CartPole-v0” and the only difference is that its max episode length and max reward threshold are lower.
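The “solved” criterion can be written down directly. This is a small sketch with an illustrative function name; it takes a list of per-episode rewards in the order the episodes were run.

```python
def is_solved(episode_rewards, threshold=475.0, window=100):
    """True if any run of `window` consecutive episodes averages >= threshold."""
    if len(episode_rewards) < window:
        return False
    for start in range(len(episode_rewards) - window + 1):
        chunk = episode_rewards[start:start + window]
        if sum(chunk) / window >= threshold:
            return True
    return False


print(is_solved([500] * 100))   # a perfect run of 100 episodes
print(is_solved([500] * 99))    # not enough episodes yet
```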

Also note that this kind of control problem can be implemented in robots. Here’s a video of a real-life CartPole environment in action:

“Cart-Pole Swing-up” from PilcoLearner

To run the code, first we’ll set up the directories for logging results and saving checkpoint files:

Then we’ll configure to use the PPO optimizer again:

Next, let’s train a policy using 40 iterations:

Output from the training shows a jump in improvement after ~20 iterations:

We’ll use TensorBoard again to visualize the training metrics. To launch from the command line:

Corresponding closely to that point, note the abrupt knee in the curve for episode_reward_min (bottom/right chart) after about 90K timesteps, where the agent begins performing much more reliably.

Let’s compare the policy and model again:

The output should be close to:

Note how the model for “CartPole-v1” is not large: it has fewer parameters than the “FrozenLake-v0” model.

Now let’s try a rollout, from the command line:

One fun aspect of “CartPole-v1” is that its render() method creates an animation of the pole hinged on the moving cart:

Overall, the “CartPole-v1” problem has a more complex simulation and state (observation space) than the earlier environments. As shown in the video above, it’s also close to real-world problems in robotics. Even so, its trained policy and model are smaller than with the simple “toy text” examples. After 50–100 training iterations, a policy can be trained on a laptop with RLlib to provide reasonably good solutions.

To run this code in a Jupyter notebook, see the Anyscale Academy repo at:


Next we’ll run the “MountainCar-v0” environment. Similar to “CartPole-v1” this is another one of the “classic control” examples in OpenAI Gym. Arguably it’s much more computationally expensive than the previous examples.

“MountainCar-v0” illustrates a classic RL problem where the agent — as a car driving on a road — must learn to climb a steep hill to reach a goal marked by a flag. Unfortunately, the car’s engine isn’t powerful enough to climb the hill without a head start. Therefore the agent must learn to drive the car up the hill just enough to roll back and gain momentum — rocking back and forth between the two sides of the valley below. The control is based purely on the agent choosing among three actions: accelerate to the left, accelerate to the right, or apply no acceleration. Timing is crucial.

This problem was contrived such that the car’s motion and the agent’s control are both one-dimensional. Even so, it turns out to be a relatively complex problem. The simulation of “MountainCar-v0” involves non-linear relations among some variables, such as the speed of the car depending on the mountain gradient, altitude of the car, gravity, and the acceleration action. So it presents an interesting problem in control theory. Note that the paper builds upon the papers cited for the “CartPole-v1” environment. It’s also interesting to see calculations illustrated for machine learning approaches from 30 years ago, long before cloud computing and contemporary hardware were available.

TensorBoard’s great-great-grandparent, from [Moore 1990]

The observation space for the Gym implementation of this environment is defined by:

  • car position [-1.2, 0.6]
  • car velocity [-0.07, 0.07]

The action space is one of three possible actions:

  • accelerate to the left
  • don’t accelerate
  • accelerate to the right

The reward is structured as:

  • 0 if the car reaches the flag on top of the mountain (position = 0.5)
  • -1 if the car is somewhere down the mountain (position < 0.5)

Each episode starts with a car randomly positioned between [-0.6, -0.4] at 0 velocity. An episode terminates when the car reaches position ≥ 0.5 or the episode length is greater than 200 timesteps.
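To see why the car needs to rock back and forth, here is a sketch of the simulation step. The force (0.001) and gravity (0.0025) constants and the update order are assumptions taken from the Gym source for “MountainCar-v0”, not from this article. The two policies are hand-written heuristics for illustration, not trained policies, and the start position is fixed at -0.5 instead of being randomized.

```python
import math


def step(position, velocity, action):
    """One simulation step; action 0 = left, 1 = no acceleration, 2 = right."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position += velocity
    position = min(max(position, -1.2), 0.6)
    if position == -1.2 and velocity < 0:
        velocity = 0.0          # the car stops at the left boundary
    return position, velocity


def run(policy, position=-0.5, max_steps=200):
    """Return the timestep at which the flag is reached, or None on timeout."""
    velocity = 0.0
    for t in range(1, max_steps + 1):
        action = policy(position, velocity)
        position, velocity = step(position, velocity, action)
        if position >= 0.5:
            return t
    return None


always_right = run(lambda pos, vel: 2)
rocking = run(lambda pos, vel: 2 if vel >= 0 else 0)   # push the way the car is moving
print("always accelerate right:", always_right)
print("rock back and forth:", rocking)
```

Accelerating right on every step never gets the car over the hill within 200 timesteps, while pushing in whichever direction the car is already moving builds up enough momentum to reach the flag. A trained policy has to discover this kind of timing from sparse rewards, which is part of why training is slow.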

To run the code, first we’ll set up the directories for logging results and saving checkpoint files:

Then we’ll configure to use the PPO optimizer again; however, this time we’ll change some of the configuration parameters to attempt to adjust RLlib for more efficient training on a laptop:

Next, we’ll train a policy using 40 iterations:

Output from training will probably not show much improvement, and we’ll come back to that point:

In this case, TensorBoard won’t tell us much other than flat lines. Let’s examine the policy and model:

The output should be close to:

Similar to “CartPole-v1”, the resulting model for “MountainCar-v0” is not large.

Now let’s try a rollout, from the command line:

This render() method also creates an animation, and an example is shown in:

A key takeaway here is that “MountainCar-v0” requires lots of iterations before training an effective policy. One alternative is to start from a previously trained checkpoint. One of those just happens to live in the Anyscale Academy repo:

To run this code in a Jupyter notebook, see the Anyscale Academy repo at:

That covers these four example Gym environments getting trained with RLlib. In this article we’ve shown how to:

  • install Ray, RLlib, and related libraries for reinforcement learning
  • configure an environment, train a policy, checkpoint results
  • examine a resulting policy and model
  • use the Ray and TensorBoard dashboards to monitor resource use and the training performance
  • rollout from a saved checkpoint to run a trained policy within a use case

Hopefully the compare/contrast of four different RL problems — plus the use of these Gym environments with RLlib and evaluations of their trained policies — helps illustrate coding patterns in Python using RLlib. Definitely read up on the primary sources listed above, to understand more about the history of how reinforcement learning has developed over time.

We’ll follow up this article with posts that explore in more detail some of the coding related to RLlib, such as how to build a custom environment:

If you have any questions, feedback, or suggestions, please let us know through our community Discourse or Slack. If you would like to know how RLlib is being used in industry, consider attending Ray Summit.

Kudos to for image processing with deep learning.