PyIDF: Diversity of experiences in Reinforcement Learning

Kostya Kanishev
Imandra Inc.
Sep 13, 2019 · 11 min read

In Reinforcement Learning, an agent learns from interaction with its environment. The agent’s goal is to arrive at a behavior policy that maximizes the expected reward. Intuitively, an agent should learn from a diverse set of experiences, but how can one quantify and ensure this “diversity”? At Imandra, we’ve developed a novel technique for exploring algorithm state-spaces called Region Decomposition. Our previous post focused on the details of Region Decompositions. Here, we will describe our PyIDF framework for decomposing state transition algorithms written in Python and demonstrate, with a simple example, how to use the resulting regions to improve the convergence and stability of Reinforcement Learning.

The Intuition

When training a Machine Learning classifier, one must be careful to avoid class-imbalanced training data. Let’s illustrate this with a classic MNIST example (you can follow it in this Google Colab notebook). We’ll train two neural network classifiers with the same layer architecture. The first classifier will be trained on the original, class-balanced MNIST dataset (left plot below), while for the second classifier we’ll use a training dataset in which one class — the “9” — is overrepresented (right plot).
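A rough sketch of this comparison might look like the following (this is not the notebook itself; the 128-unit hidden layer and the five-fold oversampling of the “9” class are illustrative choices):

```python
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def make_model():
    # The layer architecture here is illustrative; the point is only that
    # both classifiers share the same architecture.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Build the imbalanced dataset by repeating the "9" samples several times.
nines = np.where(y_train == 9)[0]
imbalanced = np.concatenate([np.arange(len(y_train))] + [nines] * 5)
datasets = {"balanced": (x_train, y_train),
            "imbalanced": (x_train[imbalanced], y_train[imbalanced])}

for name, (x, y) in datasets.items():
    model = make_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Both classifiers are evaluated on the same balanced test set.
    model.fit(x, y, epochs=5, validation_data=(x_test, y_test), verbose=0)
    print(name, model.evaluate(x_test, y_test, verbose=0))
```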

At each training epoch, both classifiers are evaluated on the same balanced test set. As one can see, the test performance of the second classifier degrades significantly — the neural network gets overtrained on the overrepresented class and does not generalize well to other digits.

Turning to Reinforcement Learning (RL), one might ask a similar question: how can we make sure that the experiences an agent undergoes are “diverse” enough? What does it mean for a set of experiences to be “balanced” in the context of RL? How can we quantify and measure the “diversity” of experiences?

Iterative Decomposition Framework (IDF)

As discussed in the previous article, we’ve developed a novel technique in Imandra for exploring algorithm state-spaces that we call Region Decomposition — a mathematically inspired way to subdivide the space of algorithm inputs into symbolically different regions based upon the unique behaviors of the algorithm.

To make the integration seamless, we’ve created a Python interface to Imandra for encoding RL models and decomposing them. Now, given a specially structured Python class that encodes some state transition logic, our system generates Python code that can be used to produce samples of logically different paths through the algorithm.

We believe that these regions provide a natural “diversity” measure for an RL agent’s experiences. By visiting different regions in the input space of the environment, the agent retrieves more representative information about the environment behavior. As a result, the agent will learn more accurate and better generalizing value functions and demonstrate better learning performance and more meaningful exploration behavior.

To illustrate this, we’ll apply Region Decomposition to Deep Q Reinforcement Learning. The idea is to ensure a more “diverse” sampling of states and actions by enforcing a region-uniform choice of actions at the exploration stage: during the exploration play-out, instead of choosing a completely random action, we pick actions so that the symbolic regions are visited more uniformly. We expect this to result in a more representative sampling of the algorithm’s behavior, leading to better convergence of the learning.

Setup and the first example

The PyIDF decomposition framework takes as input a specially constructed state transition Python class. The framework converts the incoming Python code to Imandra Modelling Language (IML) and then runs the Imandra Iterative Decomposition on the IML model. The resulting region descriptions are then used to generate Python code that represents the constraints for each region.

We’ve created an imandra Python module that simplifies the interaction with the PyIDF API. You can install it from PyPI.org via pip install imandra. To illustrate how to use the module, let’s create a simple “Hello, World” example of a state transition class.

We’ll make a “counter” state machine: the state holds a counter, which gets increased by incoming Add actions; if the counter exceeds 100, we reset it back to zero. Each Add action carries an amount x, which is added to the counter; we also want to constrain the values of x to always be positive. The Python code for the State class of such a state machine might look like the sketch below (you can follow this “Hello World” example via the Google Colab notebook).
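(This is an illustration rather than the notebook’s exact code; the precise PyIDF conventions, in particular how the decomposition scenario is encoded, may differ in detail.)

```python
class State:
    def __init__(self):
        # State variables with their types and initial values.
        self.counter: int = 0

    def validate_Add(self, x: int):
        # Only positive increments are valid.
        return x > 0

    def receive_Add(self, x: int):
        new_counter: int = self.counter + x
        if new_counter > 100:
            # Reset the counter once it exceeds 100.
            self.counter = 0
        else:
            self.counter = new_counter

# Decomposition scenario: three incoming Add actions.
scenario = ["Add", "Add", "Add"]
```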

Together with the State class we’ve created a decomposition scenario — a list with the sequence of actions one wants for the decomposition. Here we’ve asked to see all possible ways our model behaves with 3 incoming Add actions.

The program above is valid Python (note that it uses type hints introduced in Python 3.5). There are a few restrictions that one must follow to create a valid State class decomposable by PyIDF. Here are some of the restrictions we’ve followed in our example:

  • The State class must have an __init__ method that contains only declarations of state variables, their types and their initial values.
  • Actions are defined with receive_ActionName methods; action validators can be defined with the corresponding validate_ActionName methods.
  • The arguments of the action-related methods must have type information supplied via Python type hints. Type hints are also required in all the assignment statements inside the methods.

There are further restrictions on the Python allowed in a valid PyIDF model. For example, you can define and call Python functions, but only if they are type-safe and have no side effects. Some user-defined enums and NamedTuple types are allowed. Also, defaultdict dictionaries are supported. Check our PyIDF documentation for more details on the rules for writing PyIDF models.

To facilitate the creation of such models, we’ve created a PyIDF VS Code plugin that checks your code in real time. The plugin catches violations of the structural requirements for PyIDF models, identifying and highlighting the problems via IDE warnings.

Let’s now run the Region Decomposition on our “Hello World” example. The code we’ve prepared above should be passed as a string to the imandra.idf.decompose function. The decompose function call starts a decomposition job in the cloud and returns the job handle. One can use the handle to query the job status:
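(A sketch of this interaction; model_code below holds the source of the State class and scenario above as a single string, and the status() accessor shown here is illustrative rather than the definitive API.)

```python
import imandra

# Submit the model source for decomposition in the cloud; the call returns
# a job handle.
decomposition = imandra.idf.decompose(model_code)

# Poll the handle until the job is finished. The status() accessor is an
# assumption here; consult the PyIDF documentation for the actual method.
print(decomposition.status())   # e.g. {"status": "done"}
```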

When the job reports {"status": "done"}, we are ready to retrieve the generated Python code. We can get a Python string with the code by calling decomposition.dumps(). In the generated code, you will see the main entry point into the computed region decomposition — the calculate_regions function.
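One way to retrieve the generated source and load the calculate_regions function into the current session (a sketch; writing the code to a module file and importing it would work equally well):

```python
# The generated source is an automatically produced nested decision tree.
generated_code = decomposition.dumps()

namespace = {}
exec(generated_code, namespace)
calculate_regions = namespace["calculate_regions"]
```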

The calculate_regions function is structured as a nested if-then-else decision tree. It takes in a list of actions and outputs a region “code” for the given inputs. Here is how one can construct actions and pass them into the function to retrieve the corresponding region.
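Something along these lines works, though the concrete encoding of the actions (tuples and dicts here) expected by the generated function may differ:

```python
# Hypothetical action encoding for the "counter" model.
no_reset = [("Add", {"x": 10}), ("Add", {"x": 20}), ("Add", {"x": 30})]
one_reset = [("Add", {"x": 60}), ("Add", {"x": 70}), ("Add", {"x": 5})]

print(calculate_regions(no_reset))   # region where the counter never resets
print(calculate_regions(one_reset))  # region where the counter resets once
```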

One can see that some sequences of incoming actions fall into the same region. It turns out that PyIDF classifies the inputs according to the number of times the counter was reset — this illustrates how the regions isolate distinct behaviors of our state-transition model. We are going to exploit this ability to carve out logically different regions in the model’s input space in Reinforcement Learning to ensure the aforementioned “diversity” of the agent’s experiences.

Infinite State Machines

The state-transition systems that we are dealing with can formally be called “Infinite State Machines” — a generalization of Finite State Machines in which the number of possible states and transitions is not necessarily finite. In the context of Reinforcement Learning, we want to use such Infinite State Machines to model the RL environment. Such environment models should be able to receive agent actions, transition to new states and produce rewards accordingly.

PyIDF accepts state machine models that receive actions that change their internal state. To model the rewards, we add a reward state variable that tracks the cumulative (and, if necessary, discounted) reward.

We want to use the Region Decomposition of the environment during the playouts, meaning that we want to be able to obtain the region information starting from the current state of the game — not the “global” initial one. For this, we usually introduce a special SetState action — this action is always the first one in the decomposition scenario and carries the “initialization” information for the state machine, setting the state to the current state of the RL problem.
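Applied to the “counter” model from before, these two conventions might look like the following sketch (the reward assignments here are purely illustrative):

```python
class State:
    def __init__(self):
        self.counter: int = 0
        self.reward: float = 0.0

    def receive_SetState(self, counter: int):
        # Always the first action in the scenario: re-initialize the machine
        # to the agent's current situation.
        self.counter = counter
        self.reward = 0.0

    def validate_Add(self, x: int):
        return x > 0

    def receive_Add(self, x: int):
        new_counter: int = self.counter + x
        if new_counter > 100:
            self.counter = 0
            self.reward = self.reward + 1.0   # illustrative reward on reset
        else:
            self.counter = new_counter

# The scenario now starts with SetState, followed by the agent's actions.
scenario = ["SetState", "Add", "Add", "Add"]
```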

Helicopters Example

As an example Reinforcement Learning problem, we’ve chosen a simple “helicopter-islands” game, where an agent has to cross a map by visiting several islands. In the simplest instance of such a game, the map is 3x3 and the agent is able to move to a neighboring cell vertically, horizontally or diagonally:

The agent must eventually arrive from the bottom-left corner to the top-right corner, always moving over the islands on a randomly generated map. The previously visited islands are removed from the map. We encode the rules of this game in the following PyIDF model:
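(A sketch conveying the shape of such a model; the action name Move, the reward values and the exact map encoding are illustrative choices rather than necessarily those of the original model.)

```python
from enum import Enum
from typing import Dict, NamedTuple

class Status(Enum):
    Playing = 0
    Won = 1
    Lost = 2

class Position(NamedTuple):
    x: int
    y: int

class State:
    def __init__(self):
        self.position: Position = Position(0, 0)      # bottom-left corner
        self.islands: Dict[Position, bool] = {}       # current map
        self.status: Status = Status.Playing
        self.reward: float = 0.0

    def receive_SetState(self, position: Position, islands: Dict[Position, bool]):
        # Initialize the machine to the current game situation.
        self.position = position
        self.islands = islands
        self.status = Status.Playing
        self.reward = 0.0

    def validate_Move(self, dx: int, dy: int):
        # One step vertically, horizontally or diagonally, staying on the map.
        return (self.status == Status.Playing
                and -1 <= dx <= 1 and -1 <= dy <= 1 and (dx != 0 or dy != 0)
                and 0 <= self.position.x + dx <= 2
                and 0 <= self.position.y + dy <= 2)

    def receive_Move(self, dx: int, dy: int):
        target: Position = Position(self.position.x + dx, self.position.y + dy)
        if target == Position(2, 2):
            # Reached the top-right corner: the game is won.
            self.position = target
            self.status = Status.Won
            self.reward = self.reward + 1.0
        elif self.islands.get(target, False):
            # Landed on an island; visited islands are removed from the map.
            self.islands[target] = False
            self.position = target
        else:
            # Landed on open water: the game is lost.
            self.status = Status.Lost
            self.reward = self.reward - 1.0
```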

Here we use some more advanced features of the PyIDF framework: we’ve declared a custom Status enum type and a Position vector type, and we’re using a dictionary to store the current map information. This dictionary is passed as an initialization parameter to SetState and updated as the game progresses.

Executing the Region Decomposition on the PyIDF model above, we’ve retrieved a calculate_regions function that breaks the input space into 34 regions. Here are a couple of inputs that belong to the region with the code 0.0.0.0.0.0.0.1.0:

All the inputs in the region above describe the motion from the bottom-right corner of the map, jumping over the existing island and reaching the goal. Notice how the rest of the map is irrelevant for the inputs to be classified into that region. Let us take a look at some other regions:

The first region above seems to describe moving to an island and then to another island, while the second describes moving to an island and then to an empty square. This illustrates how different regions correspond to logically different experiences the agent can have. Naturally, we can use the distribution of experiences over the regions as a measure of their “diversity” — and improve this “diversity” by sampling experiences more uniformly from different regions.

We’ve trained a Deep Q-Learning agent with such region-based exploration. Our agents approximate the Q value function with a neural network of two fully connected layers of 32 neurons each, with ReLU activations. The (s, a, r, s’) training samples are retrieved from a circular replay memory buffer of 1000 entries. The replay memory is filled by following a “region-based ε-greedy” policy: with probability ε, the agent takes an “exploration” action — a random action sampled from a region-uniform distribution.
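A sketch of what such a region-based exploration step could look like; the region_of helper is hypothetical and would wrap a call to the generated calculate_regions function on a scenario starting with SetState for the current state:

```python
import random
from collections import defaultdict

# Count how many times each symbolic region has been visited while exploring.
region_visits = defaultdict(int)

def choose_action(state, legal_actions, q_value, region_of, epsilon):
    if random.random() < epsilon:
        # Exploration: group candidate actions by their symbolic region and
        # prefer the region visited least often so far, so that experiences
        # are sampled more uniformly across regions.
        actions_by_region = defaultdict(list)
        for action in legal_actions:
            actions_by_region[region_of(state, action)].append(action)
        region = min(actions_by_region, key=lambda r: region_visits[r])
        region_visits[region] += 1
        return random.choice(actions_by_region[region])
    # Exploitation: act greedily with respect to the current Q estimates.
    return max(legal_actions, key=lambda action: q_value(state, action))
```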

We train the agent by changing ε linearly from 0.95 to 0.05 over 50 steps. At every step in ε, the performance of the agent is evaluated on all 512 maps. We’ve run the training from a randomly initialized state 100 times using the standard randomized epsilon-greedy exploration (black points) and another 100 times using region-based exploration (red points).

The plot shows the median performance of the agents, evaluated on all 512 possible maps at every step in epsilon. Among the 512 maps, only 368 are solvable, so the performance saturates at that value. The error bars show the 68% (one-sigma) confidence interval on the obtained value.

Notice how the region-based agents reach almost perfect performance at the end of the training. On the left, we show the full distribution over the agents at epsilon = 0.05. From that, we can extract another metric that illustrates the advantage of region-based exploration: the number of agents reaching the perfect score of 368 maps. Region-based agents reach the perfect score 55 times out of 100, while the agents without regions reach it in only 14 cases out of 100.

All this data clearly illustrates the performance increase when region-based exploration is used. The agent that explores by following the regions obtained via Region Decomposition shows faster learning, better final performance and a significantly more stable learning process.

Both agents in our comparisons had the same hyperparameter values, such as learning rate, batch size and discount factor. We’ve studied how sensitive our results are to hyperparameter changes. Performing a grid search, we’ve seen the improvement due to region-based exploration across all the hyperparameter values, meaning that the region-based improvement is, in a sense, “orthogonal” to hyperparameter tuning.

What’s next

More complex examples. We’ve illustrated the advantages of region-based exploration on a very simple “helicopter” game with a 3-by-3 map. We are running more RL experiments with different and more complex problems — we will publish them soon. Our ultimate goal is the application of Region Decomposition RL to complex games like chess and logical puzzles like Sokoban. We’re also working on applications of Region Decomposition to more complex RL algorithms, such as modern policy gradient and actor-critic approaches.

Efficient generation of sample points for regions. We’ve shown how to obtain the calculate_regions function using our PyIDF framework. This function returns the region given an input sequence of state transition actions. Yet, for our machine learning applications, we need an “inverse” sampling: given a (previously unvisited) region, we want to sample a random input sequence from it. Currently, we are performing such sampling via naive rejection Monte Carlo search in input space. We are working on more efficient sampling approaches.
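A sketch of that naive rejection step, where random_actions is a hypothetical generator of random input sequences for the model:

```python
def sample_from_region(target_region, random_actions, calculate_regions,
                       max_tries=100000):
    # Keep drawing random action sequences until one lands in the requested
    # region, or the trial budget is exhausted.
    for _ in range(max_tries):
        actions = random_actions()
        if calculate_regions(actions) == target_region:
            return actions
    return None   # the region was not hit within the trial budget
```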

Synthesis of the state-machine model. In all of the above, we’ve assumed that the model of the environment is perfectly known and even encoded it as a Python state-transition program. In RL, however, the behaviour of the environment is usually not known and must be explored by the agent. We are working on agents that are able to synthesize a formal model of the environment, which can then be decomposed and used for training.

If you have any questions or requests, don’t hesitate to drop us a line (or two) in our Discord server.
