POET & Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions

Jaroslav Urban
knowledge-engineering-seminar
10 min read · May 22, 2020

In this post, I’m going to outline some of the ideas behind the somewhat recently released paper on POET (Paired Open-Ended Trailblazer) [1] and the very recently released Enhanced POET [2]. I came across the first paper thanks to a wonderful ICML talk by Jeff Clune, Kenneth Stanley, and others, in which they outline the use of population-based search in machine learning; I encourage you to go check it out. These two papers encapsulate most of the ideas from the three-hour talk, including optimization, open-endedness, novelty search, and much more, which is why I chose to present them.

Introduction

First, I’ll introduce a bit of intuition behind where the authors are coming from and what problems they are ultimately trying to solve. Then I’ll quickly outline some of their previous efforts in this area and some other work that precedes POET. Then I’ll describe the original POET paper and subsequently its improved version. These papers are more of a proof of concept than a push toward state of the art, so there won’t be as many technical details as in your usual paper.

Optimization deception

In scenarios where you have ambitious goals, trying to optimize towards a certain objective might be deceiving in itself. Let’s look at an example: imagine that you’re an ancient scientist who just invented a basic abacus, and you’re very happy about your invention because it helps you calculate more efficiently. If you now set yourself the ambitious objective of inventing something much more computationally capable (like a modern computer), you might put a tremendous amount of effort into engineering how to scale up the abacus or make it from more durable materials, and still completely fail to produce a machine with computer-like capabilities. What you should have been doing instead is researching lightning, then discovering electricity, then building up logic gates and so on, arriving at your first computer a couple of hundred years later, rather than optimizing the thing you already built.

These so-called stepping stones that you need to pass through to create a computer aren’t obvious at all from the objective itself, which is why this is called optimization deception.

Novelty search

The idea of stepping stones that might not resemble the final solution (like researching lightning on the way to a computer) coincides with a similar area of work from these authors, called novelty search.

Novelty search asks the question: If ambitious goals can’t be directly optimized for, what’s left to do? The authors argue that novelty should be encouraged, mainly because novel (and better) solutions usually stem from other novel solutions.

For example, when a robot is being taught to walk, it might first just learn to crawl in a circle or step on the spot. This behavior would be discouraged by direct optimization because the robot won’t get too far by doing this, but those behaviors might be the key stepping stones that need to be explored before learning how to walk properly. Novelty search will encourage them as long as they are novel, and when they stop being novel, the robot will have to move forward in order to continue ending up in novel positions.

A video from [5], where they optimize one bipedal walker to walk as far as possible and another one to end up in novel states. The novelty search walker learns to walk much farther than its counterpart.
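To make this concrete, here is a minimal sketch of how a novelty score is typically computed in novelty search: the behavior of each run (e.g., the walker’s final position) is compared against an archive of behaviors seen so far, and novelty is the average distance to the k nearest of them. The behavior characterization and the value of k here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def novelty(behavior, archive, k=15):
    """Novelty = mean distance to the k nearest behaviors seen so far."""
    if len(archive) == 0:
        return float("inf")                         # everything is novel at the start
    dists = np.linalg.norm(np.asarray(archive) - behavior, axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())

# Toy usage: behaviors are the final (x, y) positions of a walker.
archive = [np.array([0.0, 0.0]), np.array([0.1, 0.0])]
print(novelty(np.array([5.0, 1.0]), archive, k=2))  # far from everything seen -> high novelty
```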

Quality diversity & Goal switching

A direction that utilizes stepping stones more explicitly and is closely related to novelty search is called quality diversity (QD). In QD algorithms, instead of working with one agent and trying to produce novel behaviors through it, you keep a diverse set of niche problems along with agents that do well on each of them. You then periodically try to optimize these agents and check whether they perform better at their own niche or at some other niche. If an agent performs better at someone else’s niche, you transfer it to that niche and optimize it for that objective instead. This process is referred to as goal switching. The technique was first used in Innovation Engines [4], where the authors aimed to generate images that AlexNet would recognize as one of the 1,000 ImageNet classes. They kept a set of the best generated images for each class, added small perturbations to them, and checked whether the perturbed images scored well on any other class. If they did, those images started being optimized for the other class instead. They found that 18% of the new best images at each step of the algorithm came not from refining the current champions but from evolving an image from a different class.
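Here is a minimal sketch of goal switching in a QD loop, assuming hypothetical `perturb` and `score` helpers (in the Innovation Engines setting these would be image mutation and per-class classifier confidence):

```python
import random

def goal_switch_step(champions, niches, perturb, score):
    """One QD step: perturb a random niche champion and let the child compete in
    every niche; wherever it beats the incumbent, it takes over (a goal switch).
    `champions` maps niche -> (candidate, fitness)."""
    source = random.choice(list(champions))
    child = perturb(champions[source][0])
    for niche in niches:
        fitness = score(child, niche)
        if fitness > champions.get(niche, (None, float("-inf")))[1]:
            champions[niche] = (child, fitness)
    return champions

# Toy usage: "images" are single floats and each niche i wants a value close to i.
niches = [0, 1, 2]
champions = {0: (0.0, 0.0)}
for _ in range(200):
    champions = goal_switch_step(champions, niches,
                                 perturb=lambda x: x + random.gauss(0, 0.5),
                                 score=lambda x, n: -abs(x - n))
print({n: round(c[0], 2) for n, c in champions.items()})  # champions drift toward each niche's target
```

In the toy run, the champions for niches 1 and 2 arise almost entirely by perturbing champions of other niches, which is exactly the goal-switching effect described above.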

Open-Endedness

The established approach to improving SOTA is to pick or sometimes create a problem and then try to solve it. In doing so, algorithms get improved, which then becomes beneficial for solving future tasks. Some of the previous and now iconic problems include solving chess & Go, mastering image recognition on ImageNet, or beating Atari games.

An alternative approach that would perhaps be suggested by the open-endedness community would be to strive to create algorithms that not only solve problems but also create them. An open-ended algorithm would be an algorithm that is capable of generating endless amounts of new problems and their solutions. The inspiration and the obvious example is natural evolution, an open-ended process that is constantly generating new problems and at the same time optimizing to solve them.

Minimal Criterion Coevolution

Although novelty search and QD algorithms are a step towards open-endedness, they require a manually designed behavioral characterization or an archive of past states in order to generate novel behaviors or a variety of diverse problems and solutions.

Minimal Criterion Coevolution (MCC) tries to rectify some of those problems and is a direct predecessor to POET. The idea is to have a population of problems and a population of solutions, where each individual has to satisfy a minimal criterion with respect to the other population in order to be considered for future generations. In the paper, for example, they have a set of mazes and a set of neural networks (NNs) that solve these mazes. The minimal criterion is that each new NN has to solve at least one maze and each new maze has to be solved by at least one NN.

MCC evolved a population of primitive mazes and agents that solve them by maintaining a minimal criterion for reproduction. [3]
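As a rough illustration, here is a minimal sketch of one MCC-style generation, assuming hypothetical `mutate_maze`, `mutate_agent`, and `solves` helpers; the actual algorithm also manages population queues and resource limits, which are omitted here.

```python
def mcc_generation(mazes, agents, mutate_maze, mutate_agent, solves):
    """One simplified MCC generation: children are only admitted if they meet
    the minimal criterion with respect to the *other* population."""
    child_mazes = [mutate_maze(m) for m in mazes]
    child_agents = [mutate_agent(a) for a in agents]

    # Minimal criterion for a maze: at least one existing agent solves it.
    kept_mazes = [m for m in child_mazes if any(solves(a, m) for a in agents)]
    # Minimal criterion for an agent: it solves at least one existing maze.
    kept_agents = [a for a in child_agents if any(solves(a, m) for m in mazes)]

    return mazes + kept_mazes, agents + kept_agents
```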

Original POET

Just as in MCC, POET maintains two coevolving populations. In the paper, they use a population of bipedal walkers and a population of obstacle courses, but the goal is to be as general as possible so that POET is applicable to other problems as well. Each run begins with one environment-agent pair containing a trivial environment (just a flat surface) and a randomly initialized agent. After that, POET performs three tasks in its main loop (sketched in code after the list):

  1. At each step, optimize paired agents for their respective environments.
    This is actually something that the MCC paper didn’t do. They use Evolution Strategies (ES) to optimize the agents, but in principle, any optimization algorithm would do, and they show that in the Enhanced version.
  2. Every N steps, mutate eligible environments.
    Eligible environments are those whose paired agent has already crossed a score threshold. If an environment is eligible to reproduce, it is first mutated several times to create child environments. Each child is then checked against a minimal criterion, consisting of two score thresholds for the original paired agent (a minimum and a maximum), which ensures that the new environment is neither too easy nor too hard. The most novel of the remaining children are then tested with agents that are paired with other environments, and the best-performing agents become their new paired agents.
  3. Every M steps, perform transfer attempts.
    In this step of the algorithm, all of the agents are tested in all of the environments, and if any of them performs better than an environment’s current paired agent, it replaces that agent.
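Putting the three steps together, here is a minimal sketch of the outer loop. The ES update, environment mutation, evaluation, minimal-criterion check, and novelty ranking are passed in as placeholder functions, and the paper’s cap on the number of active pairs and its ranking details are omitted; the thresholds are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Pair:
    env: Any    # an obstacle course
    agent: Any  # the walker paired with it

def poet_loop(pairs: List[Pair], steps: int, n: int, m: int,
              es_step: Callable, mutate_env: Callable, evaluate: Callable,
              mc_satisfied: Callable, novelty: Callable,
              reproduce_score: float = 200.0,
              num_children: int = 8, max_new: int = 2) -> List[Pair]:
    """Sketch of POET's outer loop; all callables and thresholds are placeholders."""
    for t in range(1, steps + 1):
        # 1. One ES optimization step for every agent in its own environment.
        for p in pairs:
            p.agent = es_step(p.agent, p.env)

        # 2. Every n steps: mutate eligible environments, keep novel children that
        #    satisfy the minimal criterion, and pair the best available agent with each.
        if t % n == 0:
            eligible = [p for p in pairs if evaluate(p.agent, p.env) > reproduce_score]
            for p in eligible:
                children = [mutate_env(p.env) for _ in range(num_children)]
                children = [c for c in children if mc_satisfied(p.agent, c)]  # not too easy, not too hard
                for child in sorted(children, key=novelty, reverse=True)[:max_new]:
                    best = max((q.agent for q in pairs), key=lambda a: evaluate(a, child))
                    pairs.append(Pair(env=child, agent=best))

        # 3. Every m steps: test every agent in every environment and transfer
        #    whenever some agent beats the current paired agent.
        if t % m == 0:
            for p in pairs:
                best = max((q.agent for q in pairs), key=lambda a: evaluate(a, p.env))
                if evaluate(best, p.env) > evaluate(p.agent, p.env):
                    p.agent = best
    return pairs
```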

Experiments

One of the goals of POET is to showcase that in challenging environments, the path to a solution cannot be found by direct optimization, but rather by a divergent search that discovers the stepping stones that need to be walked through to successfully solve these environments. As their first experiment, they restricted POET to generating environments with only one type of obstacle and picked out some of the solved environments from the later stages of the run. After that, they tried to directly optimize an agent on them with ES. For every type of obstacle, the ES agent converged to a local optimum and failed to solve the environment after 16,000 steps.

Direct optimization fails to solve environments that are generated & solved by POET. 230 is the score that POET uses to determine whether an environment was solved. [1]

Their explanation is that POET is implicitly an automatic curriculum builder: by modifying active environments and accepting only those that are neither too hard nor too easy, POET is effectively building many overlapping curricula for its agents.

As a reaction to this interpretation of the result, they decided to set up an explicit curriculum builder for agents trying to solve these difficult environments by direct optimization. The curriculum gradually increases the difficulty of the environment up to the original environment that POET generated and solved.

A curriculum that gradually increased the difficulty was not able to solve the environments that were solved by POET. [1]

Environments in this experiment were split into three categories based on criteria such as minimum gap width, stump height, etc. Each red pentagon signifies an environment that POET created and solved, and the five blue pentagons inside each plot signify the hardest environments that the direct-optimization curriculum was able to solve. The curriculum starts at a flat environment, and once the agent solves the current environment (by the same criterion as in POET), a new environment is generated that is a bit closer to the target environment. The direct curriculum used the same amount of environmental change when generating new environments as POET and also used the same optimization algorithm, but it was not able to solve most of the POET environments.
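For contrast with POET, here is a minimal sketch of what such a direct-path curriculum looks like: interpolate the environment parameters from a flat course toward the POET-generated target, advancing a stage only once the agent clears the solve threshold. The helper functions, stage count, and budget are assumptions for illustration; only the 230 solve threshold comes from the paper.

```python
import numpy as np

def direct_curriculum(target_params, init_agent, optimize, evaluate,
                      solve_score=230, n_stages=50, budget=10_000):
    """Control-experiment sketch: walk the environment parameters from a flat
    course (all zeros) toward `target_params` in small steps. `optimize`
    (one ES update) and `evaluate` (episode return) are placeholders."""
    target = np.asarray(target_params, dtype=float)
    agent = init_agent
    for stage in range(1, n_stages + 1):
        env = target * stage / n_stages          # slightly harder than the last stage
        while evaluate(agent, env) < solve_score:
            agent = optimize(agent, env)
            budget -= 1
            if budget == 0:
                return agent, env                # report the hardest stage reached
    return agent, target                         # reached the POET-generated target
```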

Enhanced POET

Changing the environmental encoding made new types of environments possible. [2]

To make POET more general, they split up the environmental encoding (EE) and environmental characterization (EC), both of which were previously represented by the parameters of the distributions over obstacle types.

For EE, they decided to use a compositional pattern-producing network (CPPN), which is a neural network that takes in coordinates and produces a geometric pattern.
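To give a feel for what a CPPN does here, below is a toy, hand-rolled example that maps each x coordinate of the course to a terrain height. The architecture, activation set, and the fact that the weights are random rather than evolved are my own simplifications; in the paper the CPPNs themselves are evolved, which is what lets the terrains grow arbitrarily complex.

```python
import numpy as np

def make_cppn(hidden=8, seed=0):
    """Toy CPPN: a tiny one-input network with mixed activations (sine, tanh,
    Gaussian) queried at each x coordinate to output a terrain height."""
    rng = np.random.default_rng(seed)
    w1, b1 = rng.normal(size=hidden), rng.normal(size=hidden)
    w2 = rng.normal(size=hidden)
    acts = [np.sin, np.tanh, lambda z: np.exp(-z * z)]

    def height(x):
        pre = x * w1 + b1                                            # the single input is the x coordinate
        post = np.array([acts[i % 3](pre[i]) for i in range(hidden)])
        return float(post @ w2)                                      # terrain height at x

    return height

cppn = make_cppn()
terrain = [cppn(x / 10.0) for x in range(100)]  # a height profile along the course
```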

For EC, which should determine the novelty of an environment, they created a metric called the Performance of All Transferred Agents Environmental Characterization (PATA-EC). It’s based on the observation that novel environments should make novel distinctions about how agents perform in them. When a new environment is created, all agents are tested in it and ordered by their performance. The novelty of the environment is then determined by how different this ordering of the agents is compared to the orderings in other environments.
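A rough sketch of this idea: clip each agent’s score in the candidate environment, replace the scores with normalized ranks, and measure how far that rank vector is from the rank vectors of existing environments. The clipping bounds and the k-nearest-neighbor novelty computation below are illustrative choices, not the paper’s exact procedure.

```python
import numpy as np

def pata_ec(scores, low=-100, high=300):
    """PATA-EC-style characterization: clip every agent's score in this
    environment and replace it with its normalized rank (0 = worst).
    The clipping bounds here are placeholders."""
    clipped = np.clip(scores, low, high)
    ranks = np.argsort(np.argsort(clipped))
    return ranks / max(len(scores) - 1, 1)

def env_novelty(new_scores, archive_scores, k=5):
    """Novelty of a candidate environment = mean distance between its PATA-EC
    and the k nearest PATA-ECs of environments seen so far. All score lists
    must cover the same set of agents, in the same order."""
    new_ec = pata_ec(new_scores)
    dists = [np.linalg.norm(new_ec - pata_ec(s)) for s in archive_scores]
    return float(np.mean(sorted(dists)[:k])) if dists else float("inf")

# Toy usage: scores of 4 agents in a candidate environment vs. two known ones.
print(env_novelty([50, 240, -10, 180], [[60, 230, 0, 170], [250, 40, 20, 90]], k=2))
```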

Experiments

To measure the progress of open-ended systems and compare Enhanced POET’s performance against the original POET, they also proposed a new metric called ANNECS, which stands for the accumulated number of novel environments created and solved. It (unsurprisingly) expresses how many new environments were created and solved by the algorithm. They don’t actually share any further details, but I’m guessing they used PATA-EC for both POET and Enhanced POET to make the comparison fair.

Thanks to the new environmental encoding, Enhanced POET was able to generate more novel environments. [2]

They also again compared Enhanced POET with direct optimization, this time using PPO.

Direct optimization with a curriculum couldn’t solve the environments generated by Enhanced POET either. [2]

Conclusions

POET & the algorithms that precede it bring a different point of view on optimization, which is quite exciting. The two senior co-authors of most of these papers, Jeff Clune & Kenneth O. Stanley, both recently joined OpenAI to lead new research groups on AI-generating algorithms (AI-GAs) and open-endedness. This means that we can probably expect more papers like this with even bigger computational budgets.

References

[1] Rui Wang, Joel Lehman, Jeff Clune, Kenneth O. Stanley. Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions. arXiv preprint arXiv:1901.01753, 2019.

[2] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, Kenneth O. Stanley. Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions. arXiv preprint arXiv:2003.08536, 2020.

[3] Jonathan C. Brant, Kenneth O. Stanley. Minimal Criterion Coevolution: A New Approach to Open-Ended Search. Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO), 2017.

[4] Anh Nguyen, Jason Yosinski, Jeff Clune. Understanding Innovation Engines: Automated Creativity and Improved Stochastic Optimization via Deep Learning. Evolutionary Computation, 24(3):545–572, 2016.

[5] Joel Lehman, Kenneth O. Stanley. Abandoning Objectives: Evolution through the Search for Novelty Alone. Evolutionary Computation, 19(2):189–223, 2011.
