In-depth look into PRL — the new reinforcement learning framework in Python
Written by Piotr Tempczyk
After an introduction (in my previous blog post) to the concept behind the People’s Reinforcement Learning library, I want to describe each part of the library in detail and show how these parts work together. I am going to go through each submodule of PRL.
The code fragments in gray under the chapter names are the submodule names of the discussed library fragments.
Environments
prl.environments
The `Environment` class is the base class for PRL environments. It preserves the OpenAI Gym API (documentation here) with some minor exceptions. `Environment` wraps an OpenAI Gym environment and has placeholders for state, action and reward transformer objects. It also tracks the history of the currently played episode. Example classes inheriting from `Environment` are `TimeShiftEnvironment` and `FrameSkipEnvironment`.
Transformers can use the history of an episode to transform the current observation from the gym environment into the state representation required by the agent. Transformer objects can be implemented as stateless functions or stateful objects, depending on what is more convenient for the user. The states, actions and rewards stored in the episode history are in raw form (before any transformations).
Only environments with discrete action spaces and observations of type `numpy.ndarray` are supported in the current version of PRL.
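To illustrate what a wrapper like `FrameSkipEnvironment` does conceptually, here is a minimal stand-alone sketch of the frame-skip idea (this is not PRL's code; the environment interface is reduced to `reset`/`step` so the snippet runs on its own):

```python
# Sketch of the frame-skip idea behind FrameSkipEnvironment (a stand-alone
# illustration, not PRL's implementation): each agent action is repeated
# for several underlying environment steps and the rewards are summed.
class FrameSkipWrapper:
    def __init__(self, env, skip=4):
        self._env = env
        self._skip = skip

    def reset(self):
        return self._env.reset()

    def step(self, action):
        total_reward, done, obs, info = 0.0, False, None, {}
        for _ in range(self._skip):
            obs, reward, done, info = self._env.step(action)
            total_reward += reward
            if done:
                # Stop repeating the action once the episode ends.
                break
        return obs, total_reward, done, info
```

The agent then sees one transition per `skip` raw environment steps, which speeds up training on environments with slowly changing observations.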
State transformers
prl.transformers.state_transformers
State transformers are used by the environment to change observations into the state representation for the function approximators used by the agent. To build your own transformer, all you need to do is create a class inheriting from the `prl.core.StateTransformer` class and implement the `transform` and `reset` methods. The `transform` method takes two arguments: a state in the form of a `numpy.ndarray` and the episode history. It returns a state, also represented by a `numpy.ndarray`. The `reset` method resets the transformer's state (if any) between episodes. You can make your transformation fittable to the data by implementing a `fit` method in the same manner as in the scikit-learn library.
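A minimal sketch of such a transformer, following the interface described above, might look like this. The base class is stubbed here so the snippet is self-contained; in real code you would inherit from `prl.core.StateTransformer` instead:

```python
import numpy as np

# Stub standing in for prl.core.StateTransformer, so this example runs
# without PRL installed. It mirrors the interface described in the text.
class StateTransformer:
    def transform(self, state, history):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError


class RunningMeanTransformer(StateTransformer):
    """Centers each observation on the running mean of the current episode."""

    def __init__(self):
        self._sum = None
        self._count = 0

    def transform(self, state, history):
        # Pure NumPy operations, as recommended for performance.
        if self._sum is None:
            self._sum = np.zeros_like(state, dtype=np.float64)
        self._sum += state
        self._count += 1
        return state - self._sum / self._count

    def reset(self):
        # Called between episodes to clear the transformer's internal state.
        self._sum = None
        self._count = 0
```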
It is important to use only NumPy functions while implementing transformers, for the sake of performance. More advanced Python programmers can use the Numba library to speed up their transformers even more. A very good and deep NumPy tutorial can be found here.
You can check the performance of your transformations by importing the `prl.utils.time_logger` object and calling `print` on it at the end of your program.
Reward transformers
prl.transformers.reward_transformers
If you want to make your own reward-shaping transformer, you need to inherit from `prl.core.RewardTransformer` and perform steps analogous to the state transformer case.
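As a sketch, here is a simple reward-clipping transformer (a common shaping used for DQN on Atari). The base class is again a stub standing in for `prl.core.RewardTransformer` so the snippet runs on its own:

```python
import numpy as np

# Stub standing in for prl.core.RewardTransformer.
class RewardTransformer:
    def transform(self, reward, history):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError


class ClipRewardTransformer(RewardTransformer):
    """Clips every reward to [-1, 1], as is common for DQN on Atari."""

    def transform(self, reward, history):
        return float(np.clip(reward, -1.0, 1.0))

    def reset(self):
        pass  # stateless, nothing to reset between episodes
```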
Action transformers
prl.transformers.action_transformers
While implementing an action transformer, you need to inherit your class from `prl.core.ActionTransformer` and implement the `reset`, `transform` and `fit` methods. After that, you have to assign a `gym.Space` object to the `action_space` attribute, because it cannot be inferred automatically from the class implementation alone.
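A sketch of the pattern, with stubs standing in for `prl.core.ActionTransformer` and `gym.spaces.Discrete` so the snippet runs without gym or PRL installed (the mapping itself is purely illustrative):

```python
# Stub standing in for gym.spaces.Discrete.
class Discrete:
    def __init__(self, n):
        self.n = n


# Stub standing in for prl.core.ActionTransformer.
class ActionTransformer:
    action_space = None

    def transform(self, action, history):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError


class CoarseActionTransformer(ActionTransformer):
    """Exposes 2 coarse actions and maps them onto a 4-action environment."""

    def __init__(self):
        # The transformed action space cannot be inferred automatically,
        # so it must be assigned explicitly.
        self.action_space = Discrete(2)
        self._mapping = {0: 0, 1: 3}  # illustrative mapping only

    def transform(self, action, history):
        return self._mapping[action]

    def reset(self):
        pass  # stateless
```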
Storage
prl.storage
The storage classes make it easy to manage the training history.
History
prl.storage.History
The `History` class is used to keep the episodes’ history. You can get actions, states, rewards and done flags from it. It also gives the user methods to prepare an array of returns, count total rewards, or sample a batch for neural network training. You can concatenate two history objects using the in-place add operator `+=`.

Because appending to a `numpy.ndarray` (used to store the data) is a very expensive operation, the `History` object allocates larger arrays in advance and doubles their size when the arrays are full. You can set the initial length of a `History` object during initialization (e.g. `Environment` does it based on the `expected_episode_length` parameter).
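The doubling strategy can be illustrated with plain NumPy. This is not PRL's actual code, just the amortized-growth idea it describes:

```python
import numpy as np

# Illustration of the pre-allocation strategy described above: the buffer
# starts at a fixed capacity and doubles when full, so appends are
# amortized O(1) instead of reallocating the array on every step.
class GrowingBuffer:
    def __init__(self, initial_length=8, width=1):
        self._data = np.zeros((initial_length, width))
        self._size = 0

    def append(self, row):
        if self._size == len(self._data):
            # Double the capacity and copy the old contents over.
            bigger = np.zeros((2 * len(self._data), self._data.shape[1]))
            bigger[: self._size] = self._data
            self._data = bigger
        self._data[self._size] = row
        self._size += 1

    @property
    def data(self):
        # Only the filled portion of the pre-allocated array is exposed.
        return self._data[: self._size]
```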
Memory
prl.storage.Memory
A class similar to `History`, created to be used as a replay buffer. It does not have to keep complete episodes, so its API has fewer methods than `History`. Its length is constant and is set during object initialization. You can't concatenate two `Memory` objects or calculate total rewards. This object is used by the DQN agent as a replay buffer for experience replay.
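A fixed-size replay buffer of this kind can be sketched as a ring buffer (a hypothetical stand-in, not PRL's `Memory` implementation): once capacity is reached, the oldest transitions are overwritten, and batches are sampled uniformly at random.

```python
import numpy as np

# Sketch of a fixed-capacity replay buffer: a ring buffer that overwrites
# the oldest transitions once full, with uniform random sampling.
class ReplayBuffer:
    def __init__(self, capacity):
        self._capacity = capacity
        self._transitions = [None] * capacity
        self._next = 0
        self._size = 0

    def add(self, transition):
        self._transitions[self._next] = transition
        self._next = (self._next + 1) % self._capacity
        self._size = min(self._size + 1, self._capacity)

    def sample(self, batch_size):
        # Uniform sampling with replacement over the filled slots.
        indices = np.random.randint(0, self._size, size=batch_size)
        return [self._transitions[i] for i in indices]

    def __len__(self):
        return self._size
```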
Function approximators
prl.function_approximators
Function approximators (FAs) deliver a unified API for any kind of function approximator used by RL algorithms. FAs have two methods, `train` and `predict`, and are implemented in PyTorch for now.
PytorchFA
prl.function_approximators.PytorchFA
A PyTorch implementation of a function approximator. It needs three arguments to initialize: a `PytorchNet` object, a loss and an optimizer. Losses and optimizers can be imported directly from PyTorch. The `PytorchNet` class is similar to `torch.nn.Module` but has an additional `predict` method.
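The two-method FA interface can be sketched independently of PyTorch. The snippet below is an illustration of the interface only, not the `PytorchFA` implementation: it uses a closed-form least-squares fit in place of a neural network, loss and optimizer.

```python
import numpy as np

# Sketch of the unified FA interface (train + predict) using a linear
# model fitted by least squares. Any approximator exposing these two
# methods can be plugged into an agent the same way.
class LinearFA:
    def __init__(self, n_features):
        self._weights = np.zeros(n_features)

    def train(self, x, y):
        # One "training call" here is a closed-form least-squares fit;
        # PytorchFA would instead run optimizer steps on a loss.
        self._weights, *_ = np.linalg.lstsq(x, y, rcond=None)

    def predict(self, x):
        return x @ self._weights
```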
PytorchNet
prl.function_approximators.pytorch_nn
Some neural network and loss implementations used for RL problems are kept in this module. These are standard `torch.nn.Module`s, and you can learn more about them in this great tutorial.
Callbacks
prl.callbacks
You can pass callbacks to the agent’s `train` method to control and supervise the training. Some of the implemented callbacks are: `TensorboardLogger` (more about this logger can be found here), which logs training statistics to tensorboardX, `EarlyStopping`, `PyTorchAgentCheckpoint`, `TrainingLogger` and `ValidationLogger`.
Loggers and profiling
prl.utils
Users and agents have access to five loggers. Most of them are used automatically by agents, environments, transformers or function approximators. These loggers are:
- `time_logger` - monitors the execution time of many functions and methods. You can print this object to generate a report of execution times. If you want to profile your own function, decorate it with the `prl.utils.timeit` decorator; from then on, its execution time will be logged.
- `memory_logger` - monitors RAM usage (currently unused).
- `agent_logger` - monitors agent training statistics.
- `nn_logger` - stores all the statistics from neural network training. It is important to pass a distinct `id` argument to each network during initialization when training an agent with many networks. This id is used as a key in the logger.
- `misc_logger` - a logger for user statistics. They are captured by the `TensorboardLogger` and plotted in the browser. You can log only numbers (ints or floats) with a string key, using the `add` method.
Agents
prl.agents
And finally the agents! Thanks to the above classes, the agent implementations in PRL are simple and compact. While implementing an agent, all you need to do is implement the `act`, `train_iteration` and `__init__` methods. `train_iteration` is a base step in agent training (e.g. one step in the environment for DQN, or some number of complete episodes for the REINFORCE agent). You can also implement the `pre_train_setup` and `post_train_cleanup` methods if needed; they are called before and after the main training loop.

The `act` method is called by the agent while making one step in the environment. The agent also has methods inherited from the base `Agent` class, such as `play_episodes`, `play_steps` and `test`, which can be used within the `train_iteration` method. The `train` method should be used only to initialize training from outside the agent.
Example agent code looks like this:
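(The snippet embedded in the original post did not survive here; the following is a minimal reconstruction in its spirit, with a stub standing in for PRL's base `Agent` class and only the interface described above assumed.)

```python
import random

# Stub standing in for PRL's Agent base class.
class Agent:
    def act(self, state):
        raise NotImplementedError

    def train_iteration(self, env):
        raise NotImplementedError


class RandomAgent(Agent):
    """Picks a uniformly random action, ignoring the state."""

    def __init__(self, n_actions):
        self._n_actions = n_actions

    def act(self, state):
        return random.randrange(self._n_actions)

    def train_iteration(self, env):
        # A random agent does not learn; one training iteration is a no-op.
        pass
```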
As you can see, it is very simple and self-explanatory. We have implemented some of the most popular RL agents:
- Random Agent
- Cross-Entropy Agent
- REINFORCE Agent
- Actor-Critic Agent
- A2C Agent with many advantage functions
- DQN Agent
Examples
There are many examples of the use of every element of the library in the `examples/` folder of the repository, and we encourage you to look at them to get a better understanding of the PRL framework. Let’s look at one more complicated example; we hope it will be self-explanatory after this blog post.
Final remarks
If you encounter any problems with the library, documentation or this tutorial, or you want to contribute to the project, please write an email to us at piotr.tempczyk [at] opium.sh. For now we have suspended development of the library and moved our resources to new projects, but feel free to use this library, develop your own framework using parts of ours, or join us and contribute to the library yourself. If you use our code or ideas in your tools, please cite our repository as:
Tempczyk, P., Sliwowski, M., Kozakowski, P., Smuda, P., Topolski, B., Nabrdalik, F., & Malisz, T. (2020). opium-sh/prl: First release of People’s Reinforcement Learning (PRL). Zenodo. https://doi.org/10.5281/ZENODO.3662113
Contributors
There were many people involved in this project. Here are the most important of them:
Project Lead: Piotr Tempczyk
Developers: Piotr Tempczyk, Maciej Śliwowski, Piotr Kozakowski, Filip Nabrdalik, Piotr Smuda, Bartosz Topolski, Tomasz Malisz
If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.