Deep Reinforcement Learning with SLM Lab

Wah Loon Keng
Sep 5, 2018 · 4 min read

SLM-Lab v2.0.0 was recently released. This version marks a major milestone: the lab has grown to over a dozen algorithms while retaining a lightweight codebase thanks to modular component reuse. All components are also distributable, A3C-style, and every algorithm can run in both OpenAI Gym and Unity environments. See the full list here:

The most important change is the API, which now defaults to the single-agent, single-environment case while still supporting the multi-agent, multi-environment case as a natural extension. This change vastly improves usability, a common request from users.

All the algorithms are preconfigured in spec files, which share a standard format and use the modular components implemented in the lab. Users can choose from multiple spec files (sarsa, dqn, ddqn, reinforce, a2c, ppo, sil, etc.) and initialize them:

# imports assuming SLM Lab's module layout
from slm_lab.experiment.monitor import InfoSpace
from slm_lab.spec import spec_util

spec_file = 'ppo.json'
spec_name = 'ppo_mlp_shared_cartpole'
spec = spec_util.get(spec_file, spec_name)
info_space = InfoSpace()

Environment and agent instantiation has also been vastly simplified. Pass the spec and info space from above to the init methods, which call the proper RL modules to construct the agent and its components as specified. From this point on, the logic is cleanly encapsulated within a Session class.

Next, a control loop defining an episode runs the agent-environment interactions. All the necessary updates (training, inference, etc.) are properly structured within the lab’s API methods:

The top-level run loop then runs these episodes as many times as necessary:
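The two loops above can be sketched as follows. This is an illustrative simplification with a toy agent and environment, not the lab's actual Session code; names like run_episode are ours, not SLM Lab's API:

```python
import random

class ToyEnv:
    """Stand-in environment: every step gives reward 1, episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return 0.0  # initial state
    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= 10  # next_state, reward, done

class ToyAgent:
    """Stand-in agent with random actions."""
    def act(self, state):
        return random.choice([0, 1])
    def update(self, state, action, reward, next_state, done):
        pass  # a real agent would store the experience and train here

def run_episode(agent, env):
    """Control loop for one episode of agent-env interaction."""
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    return total_reward

def run(agent, env, max_episode):
    """Top-level run loop: repeat episodes as many times as needed."""
    return [run_episode(agent, env) for _ in range(max_episode)]
```

In the lab itself, the same pattern is wrapped inside the Session class, with the agent's training and inference hooked into the update step.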

We will now give a quick tour of using Deep RL agents in the Lab.



To see the lab in action, let’s run a quick demo of a DQN agent to solve the CartPole environment.

  1. We will use a prepared spec file at slm_lab/spec/demo.json to specify the DQN agent’s parameters. The lab reads it to initialize an agent using many of its implementations. You can also change the values to experiment with DQN:
"dqn_cartpole": {
"agent": [{
"name": "DQN",
"algorithm": {
"name": "DQN",
"action_pdtype": "Argmax",
"action_policy": "epsilon_greedy",
"action_policy_update": "linear_decay",
"explore_var_start": 1.0,
"explore_var_end": 0.1,
"explore_anneal_epi": 20,
"gamma": 0.99,
"training_batch_epoch": 3,

2. To tell the lab to use this spec file to run an experiment, go to config/experiments.json. It looks like this:

"demo.json": {
"dqn_cartpole": "dev"

To run faster, change the lab mode from “dev” to “train” above; rendering will then be disabled.

3. Now, open a terminal in the repo directory and run the lab:

conda activate lab

You should then see the following. The DQN agent starts out poor at balancing the pole, but as training goes on it learns, gets better, and balances the pole for longer before the environment resets.

4. This demo experiment runs a single trial with 4 repeated sessions for reproducibility. When the demo finishes, it will save the analyzed data and graphs of the experiment, along with the DQN model, at data/dqn_cartpole_2018_06_16_214527 (your timestamp will differ). You should see some healthy graphs:

Trial graph showing average envelope of repeated sessions.
Session graph showing total rewards, exploration variable and loss for the episodes.

5. Now you can enjoy your trained model with enjoy mode! The graph above is trial 1, session 0, and you can see a PyTorch model saved at data/dqn_cartpole_2018_06_16_214527/. Use the prepath (the path above without the model suffix) in config/experiments.json to tell the lab to run enjoy mode with this model:

"demo.json": {
"dqn_cartpole": "enjoy@data/dqn_cartpole_2018_06_16_214527/dqn_cartpole_t1_s0"

Enjoy mode automatically disables learning and exploration. Graphs will still be saved.

Hyperparameter search

Finally, let’s do a full experiment with hyperparameter search for DQN. Change the lab mode from "train" to "search" in config/experiments.json,

"demo.json": {
"dqn_cartpole": "search"

and run


The lab will now run experiments with multiple trials for hyperparameter search. For efficiency, environments will not be rendered, and everything will run at full speed.

When it ends, refer to {prepath}_experiment_graph.png and {prepath}_experiment_df.csv to find the best trials. The experiment graph will look like:
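The experiment dataframe can also be inspected programmatically, for example with pandas. A hedged sketch below uses made-up column names and values standing in for {prepath}_experiment_df.csv; the real file's columns depend on the spec's search space and fitness metric:

```python
import pandas as pd

# Illustrative data standing in for {prepath}_experiment_df.csv;
# the column names here are hypothetical.
experiment_df = pd.DataFrame({
    'trial': [0, 1, 2],
    'lr': [0.1, 0.01, 0.001],
    'fitness': [0.8, 1.6, 1.1],
})

# Rank trials by fitness to find the best hyperparameters
best = experiment_df.sort_values('fitness', ascending=False).iloc[0]
print(int(best['trial']), best['lr'])
```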

Experiment graph summarizing the trials in hyperparameter search.

This experiment graph tells us how sensitive DQN is to some hyperparameters. For example, a single-hidden-layer, 64-unit neural network works better than a two-layer one; the tanh activation allows faster learning on this problem; and a learning rate around 0.01 works better.

That’s all for the quick demo. To see more algorithms, RL components and usage details, check out the documentation page:

If you wish to use this for research, teaching, or industrial applications, feel free to reach out! We would be delighted to hear any suggestions and provide support.

We will soon conduct a full benchmark by running all algorithms against all environments, although this will translate to thousands of experiments (including parameter search) and require some serious compute power. If you are interested in helping, please let us know.
