In-depth look into PRL — the new reinforcement learning framework in Python

Written by Piotr Tempczyk

Acta Schola Automata Polonica
Feb 7, 2020


People’s Reinforcement Learning logo

After introducing the concept behind the People’s Reinforcement Learning library in my previous blog post, I want to describe each part of the library in detail and show how these parts work together. I am going to go through each submodule of PRL.

The code fragments in gray under the following section headings give the submodule name of the library fragment being discussed.

Environments

prl.environments

The Environment class is the base class for PRL environments. It preserves the OpenAI Gym API (documentation here) with some minor exceptions. An Environment wraps an OpenAI Gym environment and has placeholders for state, action and reward transformer objects. It also tracks the history of the currently played episode. Example classes inheriting from Environment are TimeShiftEnvironment and FrameSkipEnvironment.

Transformers can use the history of an episode to transform the current observation from the gym environment into the state representation required by the agent. Transformer objects can be implemented as stateless functions or stateful objects, depending on what is more convenient for the user. States, actions and rewards stored in the episode history are kept in raw form (before any transformations).

Only environments with discrete action spaces and observations of type numpy.ndarray are supported in the current version of PRL.
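A minimal sketch of wrapping a Gym environment is shown below. The Environment class and the expected_episode_length parameter are named in this post; the exact constructor signature is an assumption, so treat this as an illustration rather than the definitive API.

```python
import gym

from prl.environments import Environment

# Wrap a Gym environment; transformers are optional placeholders, and
# expected_episode_length is used to pre-allocate the episode history
# (the keyword placement here is an assumption).
gym_env = gym.make("CartPole-v0")
env = Environment(gym_env, expected_episode_length=256)

# The Gym API is preserved with minor exceptions.
state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())
```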

State transformers

prl.transformers.state_transformers

State transformers are used by the environment to convert observations into the state representation expected by the function approximators used by the agent. To build your own transformer, all you need to do is create a class inheriting from the prl.core.StateTransformer class and implement the transform and reset methods. The transform method takes two arguments: a state in the form of a numpy.ndarray and the episode history. It returns a state, also represented as a numpy.ndarray. The reset method resets the transformer’s state (if any) between episodes. You can make your transformation fittable to the data by implementing a fit method in the same manner as in the scikit-learn library.
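As a concrete illustration, here is a hedged sketch of a custom state transformer following the interface described above (transform(state, history), reset(), optional fit()). The running-mean logic itself is just an example and not part of the library.

```python
import numpy as np

from prl.core import StateTransformer


class RunningMeanStateTransformer(StateTransformer):
    """Subtracts a running mean of the observations seen in the current episode."""

    def __init__(self):
        self._sum = None
        self._count = 0

    def transform(self, state: np.ndarray, history) -> np.ndarray:
        # Uses only NumPy operations, as recommended below for performance.
        if self._sum is None:
            self._sum = np.zeros_like(state, dtype=np.float64)
        self._sum += state
        self._count += 1
        return state - self._sum / self._count

    def reset(self):
        # Clear the transformer state between episodes.
        self._sum = None
        self._count = 0
```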

It is important to use only NumPy functions while implementing transformers, for the sake of performance. More advanced Python programmers can use the Numba library to speed up their transformers even further. A very good and deep NumPy tutorial can be found here.

You can check the performance of your transformations by importing the prl.utils.time_logger object and printing it at the end of your program.

Reward transformers

prl.transformers.reward_transformers

If you want to write your own reward shaping transformer, you need to inherit from prl.core.RewardTransformer and perform steps analogous to the state transformer case.
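As with the state transformer, here is a hedged reward-shaping sketch. The clipping rule is only an example, and the transform(reward, history) signature is assumed by analogy with the state transformer.

```python
import numpy as np

from prl.core import RewardTransformer


class ClipRewardTransformer(RewardTransformer):
    """Clips rewards to [-1, 1], a common reward-shaping choice."""

    def transform(self, reward, history):
        # Signature assumed by analogy with StateTransformer.transform.
        return float(np.clip(reward, -1.0, 1.0))

    def reset(self):
        pass  # stateless, nothing to reset between episodes
```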

Action transformers

prl.transformers.action_transformers

While implementing an action transformer, you need to inherit from prl.core.ActionTransformer and implement the reset, transform and fit methods. After that, you have to assign a gym.Space object to the action_space attribute, because it cannot be inferred automatically from the class implementation alone.
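A hedged sketch of an action transformer following the steps above. The two-action flip is purely illustrative, and any base-class details beyond reset, transform, fit and action_space are assumptions.

```python
import gym

from prl.core import ActionTransformer


class FlipActionTransformer(ActionTransformer):
    """Maps agent action a to environment action 1 - a in a Discrete(2) space."""

    def __init__(self):
        # The transformed action space cannot be inferred automatically,
        # so it has to be assigned explicitly.
        self.action_space = gym.spaces.Discrete(2)

    def transform(self, action, history):
        return 1 - action

    def reset(self):
        pass  # stateless

    def fit(self, history):
        pass  # nothing to fit for this transformer
```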

Illustration by Daniel Mróz from Stanisław Lem’s “The Cyberiad”. The image is not directly connected to the topic, but the book is certainly worth reading.

Storage

prl.storage

The storage classes are created for easy management of the training history.

History

prl.storage.History

The History class is used to keep the episodes’ history. You can get actions, states, rewards and done flags from it. It also gives the user methods to prepare an array of returns, count total rewards or sample a batch for neural network training. You can concatenate two History objects using the in-place add operator +=.

Because appending to a numpy.ndarray (used to store the data) is a very expensive operation, the History object allocates somewhat larger arrays in advance and doubles their size when they fill up. You can set the initial length of a History object during initialization (e.g. Environment does it based on the expected_episode_length parameter).
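The sketch below shows how two History objects might be combined and used. The += operator and the capabilities (returns, total rewards, batch sampling) come from the description above, while the constructor keyword and the method names are assumed, illustrative names.

```python
from prl.storage import History

# Pre-allocated histories; the underlying arrays double in size when full.
# The initial_length keyword is an assumed name, not the verified API.
history = History(initial_length=128)
other = History(initial_length=128)

# ... both histories get filled while episodes are played ...

history += other  # in-place concatenation of two histories

# Assumed method names for the capabilities described above.
returns = history.get_returns(discount_factor=0.99)  # array of returns
total_rewards = history.get_total_rewards()          # per-episode total rewards
batch = history.sample_batch(batch_size=64)          # minibatch for NN training
```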

Memory

prl.storage.Memory

A class similar to History, created to be used as a replay buffer. It does not have to keep complete episodes, so its API has fewer methods than History. Its length is constant and is set during object initialization. You can’t concatenate two Memory objects or calculate total rewards. This object is used by the DQN agent as a replay buffer for experience replay.
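Only the fixed length set at initialization is described above, so the sketch below is limited to construction; the keyword name is an assumption.

```python
from prl.storage import Memory

# A fixed-length replay buffer for experience replay; it never grows and
# cannot be concatenated with another Memory. The keyword name is assumed.
replay_buffer = Memory(maxlen=100_000)
```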

Function approximators

prl.function_approximators

Function approximators (FA) deliver a unified API for any kind of function approximator used by RL algorithms. An FA has two methods, train and predict, and is implemented in PyTorch for now.

PytorchFA

prl.function_approximators.PytorchFA

The PyTorch implementation of a function approximator. It needs three arguments to initialize: a PytorchNet object, a loss and an optimizer. Losses and optimizers can be imported directly from PyTorch. The PytorchNet class is similar to torch.nn.Module but has an additional predict method.
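Below is a hedged sketch of assembling a PyTorch function approximator. PytorchFA taking a PytorchNet, a loss and an optimizer follows the description above; the argument order, the PolicyNet architecture and the predict semantics (a no-grad forward pass) are assumptions.

```python
import torch
import torch.nn as nn

from prl.function_approximators import PytorchFA
from prl.function_approximators.pytorch_nn import PytorchNet


class PolicyNet(PytorchNet):
    """A small policy network sized for CartPole observations (illustrative)."""

    def __init__(self, observation_size=4, n_actions=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(observation_size, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.layers(x)

    def predict(self, x):
        # PytorchNet adds predict on top of the usual torch.nn.Module API;
        # a no-grad forward pass is an assumed interpretation.
        with torch.no_grad():
            return self.forward(x)


net = PolicyNet()
function_approximator = PytorchFA(
    net,                                              # PytorchNet object
    nn.CrossEntropyLoss(),                            # placeholder loss from PyTorch
    torch.optim.Adam(net.parameters(), lr=1e-3),      # optimizer from PyTorch
)
```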

PytorchNet

prl.function_approximators.pytorch_nn

Some neural network and loss implementations used for RL problems are kept in this module. These are standard torch.nn.Modules, and you can learn more about them in this great tutorial.

Callbacks

prl.callbacks

You can pass callbacks to the agent’s train method to control and supervise training. Some of the implemented callbacks are: TensorboardLogger (more about this logger can be found here) for logging training statistics to tensorboardX, EarlyStopping, PyTorchAgentCheckpoint, TrainingLogger and ValidationLogger.
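A hedged sketch of passing callbacks to an agent’s train method. The callback class names come from the list above; their constructor arguments and the train() keyword names are assumptions.

```python
from prl.callbacks import EarlyStopping, TensorboardLogger

callbacks = [
    TensorboardLogger(),   # logs training statistics to tensorboardX
    EarlyStopping(),       # stops training when progress stalls
]

# `agent` stands for any PRL agent instance (see the Agents section below);
# the n_iterations keyword is an assumed name.
agent.train(n_iterations=1000, callbacks=callbacks)
```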

Loggers and profiling

prl.utils

Users and agents have access to five loggers. Most of them are used automatically by agents, environments, transformers or function approximators. These loggers are:

  • time_logger - this logger monitors the execution time of many functions and methods. You can print this object to generate a report of execution times. If you want to profile your own function, decorate it with the prl.utils.timeit decorator; from then on, its execution time will be logged (see the sketch after this list).
  • memory_logger - this logger monitors RAM usage (currently unused).
  • agent_logger - this logger monitors agent training statistics.
  • nn_logger - all statistics from neural network training are stored in this logger. When training an agent with several networks, it is important to pass a distinct id argument to each network during initialization; this id is used as the key in the logger.
  • misc_logger - a logger for user statistics. These are captured by the TensorboardLogger and plotted in the browser. You can log only numbers (ints or floats) under a string key, using the add method.
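Here is a minimal sketch of profiling a function with the timeit decorator and logging a custom number with misc_logger. timeit, time_logger, misc_logger and the add method are named above; the exact add signature and the example function are illustrative assumptions.

```python
import numpy as np

from prl.utils import misc_logger, time_logger, timeit


@timeit
def normalize(observation: np.ndarray) -> np.ndarray:
    """Execution time of this function will now be logged by time_logger."""
    return (observation - observation.mean()) / (observation.std() + 1e-8)


normalize(np.random.rand(84, 84))

# Only numbers (ints or floats) under a string key; exact signature assumed.
misc_logger.add("observation_norm", 1.0)

# Print a report of execution times at the end of the program.
print(time_logger)
```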

Agents

prl.agents

And finally, the agents! Thanks to the classes above, agent implementations in PRL are simple and compact. To implement an agent, all you need to do is implement the act, train_iteration and __init__ methods. train_iteration is the basic step of agent training (e.g. one step in the environment for DQN, or some number of complete episodes for the REINFORCE agent). You can also implement the pre_train_setup and post_train_cleanup methods if needed; they are called before and after the main training loop.

The act method is called by the agent when making one step in the environment. Agents also have methods inherited from the base Agent class, such as play_episodes, play_steps and test, which can be used within the train_iteration method. The train method should only be used to start training from outside the agent.

Example agent code looks like this:
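The sketch below follows the interface described above: __init__, act and train_iteration. The base Agent import path, its constructor and the play_episodes signature are assumptions based on this post, not a verbatim listing from the repository.

```python
from prl.agents import Agent


class MyRandomAgent(Agent):
    """Picks a random action at every step; useful as a baseline (illustrative)."""

    def __init__(self, environment):
        super().__init__()  # base-class constructor arguments are an assumption
        self.environment = environment

    def act(self, state):
        # Called when making one step in the environment.
        return self.environment.action_space.sample()

    def train_iteration(self):
        # One basic training step: play a single episode and ignore it,
        # since a random agent has nothing to learn. Signature assumed.
        self.play_episodes(self.environment, n=1)
```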

As you can see, it is very simple and self-explanatory. We have implemented some of the most popular RL agents. This is the list of them:

  • Random Agent
  • Cross-Entropy Agent
  • REINFORCE Agent
  • Actor-Critic Agent
  • A2C Agent with many advantage functions
  • DQN Agent

Examples

There are many examples of how to use every element of the library in the examples/ folder of the repository, and we encourage you to look at them to get a better understanding of the PRL framework. Let’s look at one more complicated example. We hope it will be self-explanatory after this blog post.

cart_pole_a2c_gpu.py
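Below is a hedged sketch of what a script like cart_pole_a2c_gpu.py plausibly ties together, based only on the descriptions in this post: a wrapped Gym environment, PyTorch function approximators for the policy and the value function, an A2C agent and training callbacks. The class names are taken from this post; the A2CAgent class name, constructor keyword names and the train signature are assumptions, so consult examples/ in the repository for the real code.

```python
import gym
import torch
import torch.nn as nn

from prl.agents import A2CAgent  # assumed class name for the A2C agent
from prl.callbacks import EarlyStopping, TensorboardLogger
from prl.environments import Environment
from prl.function_approximators import PytorchFA
from prl.function_approximators.pytorch_nn import PytorchNet


class MLP(PytorchNet):
    """A small fully connected network used for both the policy and the value head."""

    def __init__(self, in_size, out_size):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_size, 64), nn.ReLU(), nn.Linear(64, out_size))

    def forward(self, x):
        return self.net(x)

    def predict(self, x):
        with torch.no_grad():
            return self.forward(x)


env = Environment(gym.make("CartPole-v0"))

# GPU placement, as suggested by the file name.
policy_net = MLP(4, 2).to("cuda")
value_net = MLP(4, 1).to("cuda")

# Placeholder losses; prl.function_approximators.pytorch_nn also ships
# RL-specific losses that the real example presumably uses instead.
policy_fa = PytorchFA(policy_net, nn.CrossEntropyLoss(),
                      torch.optim.Adam(policy_net.parameters(), lr=1e-3))
value_fa = PytorchFA(value_net, nn.MSELoss(),
                     torch.optim.Adam(value_net.parameters(), lr=1e-3))

agent = A2CAgent(policy_fa, value_fa)  # assumed constructor order
agent.train(env, callbacks=[TensorboardLogger(), EarlyStopping()])  # assumed signature
```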

Final remarks

If you encounter any problems with the library, the documentation or this tutorial, or you want to contribute to the project, please write us an email at piotr.tempczyk [at] opium.sh. For now, we have suspended development of the library and moved our resources to new projects, but feel free to use this library, develop your own framework using parts of ours, or join us and contribute to the library yourself. If you use our code or ideas in your tools, please cite our repository as:

Tempczyk, P., Sliwowski, M., Kozakowski, P., Smuda, P., Topolski, B., Nabrdalik, F., & Malisz, T. (2020). opium-sh/prl: First release of Peoples’s Reinforcement Learning (PRL). Zenodo. https://doi.org/10.5281/ZENODO.3662113

Contributors

Many people were involved in this project. These are the most important of them:

Project Lead: Piotr Tempczyk

Developers: Piotr Tempczyk, Maciej Śliwowski, Piotr Kozakowski, Filip Nabrdalik, Piotr Smuda, Bartosz Topolski, Tomasz Malisz

If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.
