A brief introduction to Gymnasium
A reinforcement learning API standard with a wide range of reference environments
Keywords: reinforcement learning, environment, simulation
About Gymnasium
Gymnasium is a project that provides an API for all single-agent reinforcement learning settings. It includes implementations of common environments such as Cart Pole, Pendulum, Mountain Car, MuJoCo, Atari, and others.
The API has four core functions:
make: initializes environments.
step: updates the environment with an action, returning the next agent observation and the reward for taking that action.
reset: resets the environment to an initial state.
render: renders the environment to help visualise what the agent sees; example modes are "human", "rgb_array", and "ansi" for text.
Now, we will demonstrate how to perform RL using Gymnasium.
Code Implementation
For demonstration, we will use the "Frozen Lake" game. The game involves crossing a frozen lake from start to goal without falling into any holes. The player may not always move in the intended direction due to the slippery nature of the ice.
Description
The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far corner of the world, e.g. [3,3] for the 4x4 environment.
Holes in the ice are distributed in set locations when using a pre-determined map or in random locations when a random map is generated.
The player makes moves until they reach the goal or fall into a hole.
The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see the is_slippery option).
Randomly generated worlds will always have a path to the goal.
Gymnasium
First, we install and import the required libraries.
!pip install gymnasium pygame
import gymnasium as gym
An environment is created using make with the additional keyword "render_mode", which specifies how the environment should be visualised. See render for details on the different render modes. In this example, we use the FrozenLake-v1 environment, in which the agent must cross the grid from the start tile to the goal without falling into a hole.
env = gym.make('FrozenLake-v1', render_mode='human')
observation, info = env.reset()
print(f"The environment's observation space: {env.observation_space}")
print(f"The environment's action space: {env.action_space}")
"""
The environment's observation space: Discrete(16)
The environment's action space: Discrete(4)
"""
After initializing the environment, we reset it to obtain the first observation. To initialize the environment with a particular random seed or options (see the environment documentation for possible values), use the seed or options parameters of reset.
observation, info = env.reset()
episode = 0
actions,rewards = [], []
for _ in range(1000):
    action = env.action_space.sample()  # random action; a learned policy would use the observation and info
    observation, reward, terminated, truncated, info = env.step(action)
    actions.append(action)
    rewards.append(reward)
    if terminated or truncated:
        observation, info = env.reset()
        episode += 1
env.close()
print(episode)
"""
113: indicates that for the loop of 1000 rounds, the game suffers 113 episodes.
"""
Next, the agent takes an action in the environment via step, which can be imagined as moving a robot or pressing a button on a game controller to produce a change in the environment. As a result, the agent receives a new observation from the modified environment, along with a reward for taking the action. The reward could be positive, e.g. for destroying an enemy, or negative, e.g. for stepping into lava. One such action-observation exchange is referred to as a timestep.
However, the environment may end after some timestep; the state in which it ends is referred to as the terminal state. For example, if the robot crashes or the agent completes its task, the environment must stop because the agent cannot continue. In Gymnasium, this is signalled by the terminated flag returned by step. Similarly, we may want the environment to end after a fixed number of timesteps; in that case, the environment issues a truncated signal. If either terminated or truncated is True, call reset to start a new episode.