Reinforcement Learning, Meta Learning and Self Play

By Ilya Sutskever, Co-Founder and Research Director of OpenAI

Published in BuzzRobot, May 30, 2018

Where should we start? Let’s start with the reinforcement learning (RL) problem. Here is a slightly higher-level introduction to the problem.

In the reinforcement learning framework, you have an agent in some environment, and you need to find a policy for this agent that maximizes its reward. In the standard formulation, the observations and the rewards are provided to the agent by the environment. But things are slightly different in the real world, where the agent determines its own rewards from its observations. The observations come in, and you have a little neural network, or maybe even a big neural network, that does some processing and produces an action.

Here is a simplified explanation of how most reinforcement learning algorithms work. You try something random, and if the outcome exceeds your expectations, you try it again. Add a bit of mathematical formalism, and that’s the core of how reinforcement learning algorithms work. Everything else is just a way to amplify and make better use of this randomness.
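To make the “try something random and reinforce whatever beat your expectations” idea concrete, here is a minimal REINFORCE-style sketch on a toy bandit problem. It is only an illustration; the environment, constants, and names are made up for this example and are not from the talk.

```python
import numpy as np

# Minimal REINFORCE-style sketch on a toy 3-armed bandit.
# "Try something random; if the reward beats your running baseline,
# make that action more likely."  The environment is illustrative.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden reward of each arm
logits = np.zeros(3)                      # policy parameters
baseline, lr = 0.0, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)                   # sample (the "random try")
    reward = true_means[action] + 0.1 * rng.normal()  # noisy outcome
    advantage = reward - baseline                     # did it exceed expectations?
    # Policy-gradient update: push probability toward actions that did better.
    grad_logp = -probs
    grad_logp[action] += 1.0
    logits += lr * advantage * grad_logp
    baseline += 0.05 * (reward - baseline)            # running expectation

print(softmax(logits))  # probability mass should concentrate on the best arm
```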

The Potential for Reinforcement Learning

While existing reinforcement learning algorithms can solve some problems, they aren’t equipped to solve many others. Ideally, an algorithm should combine the full spectrum of ideas in machine learning: supervised learning, unsupervised learning, representation learning, reasoning and inference, and learning at test time.

Synergizing all those ideas in the right way will result in a system capable of figuring out how the world works and achieving its goals, and doing so rather quickly. But the algorithms we have today are still nowhere near the level they can, and eventually will, reach.

Hindsight Experience Replay

There are ways in which reinforcement learning algorithms can be improved. As mentioned earlier, in reinforcement learning we try something random, and if it works out better than expected, we try it again. But what if it does not work? What if trying many random things produces neither the outcome you expected nor anything useful at all? Perhaps the answer lies in finding a way to learn from failure.

You want to achieve something, but chances are you will fail unless you have specialized skills. Rather than letting failure bring you down, why not use it as a learning tool for achieving something else? You need some kind of parameterization of goals to ensure that if you try to achieve something and fail, there is always the opportunity to have achieved a different goal.

So you try to reach a goal and fail, but you achieve something else, and you can treat that outcome as the one you intended all along. This means you are now learning from the experience.
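A minimal sketch of the goal-relabeling step at the heart of this idea (Hindsight Experience Replay), assuming a sparse goal-reaching task. The class and function names below are hypothetical placeholders, not any particular library’s API.

```python
# Sketch of the hindsight relabeling step at the heart of Hindsight
# Experience Replay.  Names (Transition, replay_buffer, etc.) are
# illustrative placeholders.

class Transition:
    def __init__(self, state, action, next_state, goal, reward):
        self.state, self.action = state, action
        self.next_state, self.goal, self.reward = next_state, goal, reward

def achieved_goal(state):
    # In a real task this maps a state to the goal it represents,
    # e.g. the object's position for a pick-and-place robot.
    return state

def compute_reward(achieved, goal):
    return 0.0 if achieved == goal else -1.0   # sparse reward

def relabel_with_hindsight(episode, replay_buffer):
    # Store the episode as-is (with the goal we failed to reach) ...
    replay_buffer.extend(episode)
    # ... and store it again, pretending the outcome we actually got
    # was the goal we intended all along.
    final = achieved_goal(episode[-1].next_state)
    for t in episode:
        replay_buffer.append(Transition(
            t.state, t.action, t.next_state,
            goal=final,
            reward=compute_reward(achieved_goal(t.next_state), final)))
```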

Dynamics Randomization for Sim2Real

An area that has been the subject of a lot of research recently is robotics. The reason is that robotics provides us with all these cool problems while keeping us in touch with reality. The constraints imposed by robotics are not really constraints; they give you an idea of what you’d like your algorithm to achieve.

When it comes to dealing with robots, one thing that would be very useful is being able to train a simulated system and then transfer what it has learned from the simulated environment to the real robot. Here is how this idea works:

You are aware that there will be some difference between the physics of the simulation and the physics of the real robot, and you know that identifying this difference precisely would be difficult. One thing you can do to help the policy is to randomize several dimensions of the simulation: you can randomize gravity, friction, torques and many other things.

Now, when you deploy the policy in the real world, it does not know the values of the dimensions you randomized, and it needs to determine them on the fly. It’s a recurrent neural network, and as it interacts with the environment it forms an idea of how things are in the real world.

If you train with these different randomizations, then the recurrent neural network can, in effect, perform system identification, inferring all the little unknown coefficients it has learned to identify purely from its observations. So you can see that it’s quite clearly a closed loop. It’s a very simple way of doing Sim2Real, and it might even scale to more difficult setups.
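A rough sketch of what such a training loop might look like, assuming a hypothetical simulator and recurrent-policy interface; the point is the dynamics-randomization structure, not the specific names, which are invented for illustration.

```python
import numpy as np

# Sketch of dynamics randomization: every training episode runs in a
# simulator whose physical coefficients are resampled, so a recurrent
# policy must infer them from its own observations.  The simulator and
# policy interfaces here are hypothetical placeholders.

def sample_dynamics(rng):
    return {
        "gravity":  rng.uniform(8.8, 10.8),
        "friction": rng.uniform(0.5, 1.5),
        "torque_scale": rng.uniform(0.8, 1.2),
    }

def train_epoch(simulator, recurrent_policy, rng, episodes=100):
    for _ in range(episodes):
        simulator.reset(**sample_dynamics(rng))    # new physics each episode
        hidden = recurrent_policy.initial_state()  # memory for system identification
        obs, done = simulator.observe(), False
        while not done:
            action, hidden = recurrent_policy.step(obs, hidden)
            obs, reward, done = simulator.step(action)
            recurrent_policy.record(obs, action, reward)
        recurrent_policy.update()                  # any on-policy RL update
```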

Learning a Hierarchy of Actions with Meta Learning

Learning with a hierarchy of some kind is one of the things that ought to work well in reinforcement learning. However, it has yet to be truly successful. If you have a distribution over tasks, one thing that works in your favor is training low-level controllers in a way that helps you solve tasks from that distribution quickly.
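One way this could look in code, sketched loosely in the spirit of meta-learning shared sub-policies (an MLSH-style setup): a per-task master policy chooses among a small set of low-level controllers that are trained across the whole task distribution. All interfaces below are hypothetical.

```python
# Sketch of meta-learning a hierarchy over a task distribution:
# shared low-level controllers, plus a fresh master policy per task
# that picks which controller to run for the next K steps.
# Task, master, and sub-policy interfaces are hypothetical.

NUM_SUBPOLICIES, K = 4, 10

def train_on_task_distribution(task_sampler, subpolicies, make_master,
                               meta_iterations=1000):
    for _ in range(meta_iterations):
        task = task_sampler()                  # draw a task from the distribution
        master = make_master(NUM_SUBPOLICIES)  # fresh high-level policy per task
        for episode in range(20):
            obs, done = task.reset(), False
            while not done:
                choice = master.select(obs)    # which low-level skill to use
                for _ in range(K):             # run it for K steps
                    action = subpolicies[choice].act(obs)
                    obs, reward, done = task.step(action)
                    master.record(reward)
                    subpolicies[choice].record(obs, action, reward)
                    if done:
                        break
            master.update()                    # learns quickly: tiny action space
        for sub in subpolicies:
            sub.update()                       # slowly distills reusable skills
```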

Evolved Policy Gradients

Here is another situation to consider. Imagine saying to yourself: wouldn’t it be brilliant if we could evolve a cost function that makes it possible to solve RL problems quickly?

As is generally the case in these kinds of setups, you have a distribution over tasks. You evolve the cost function, and its fitness is how quickly it allows you to solve problems drawn from that distribution. The point is that once the cost function has been learned, learning is extremely fast. There is also a demonstration of continual learning.

In the demonstration, the agent is constantly updating its parameters as it tries to reach a green half-sphere. It’s a little jittery, but after a while it succeeds. The learned cost function allows for extremely rapid learning, but it also contains a lot of information about the distribution of tasks. In that sense the result is not magic: you need your training task distribution to match the test task distribution.
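A loose sketch of the outer loop of such an approach, assuming evolution strategies over the parameters of the learned cost function, with a hypothetical inner_train routine standing in for the fast inner-loop learner.

```python
import numpy as np

# Sketch of the outer loop of an evolved-loss approach: loss-function
# parameters are improved with evolution strategies, and the fitness of
# each candidate is how well an inner agent learns, within a short
# budget, on tasks drawn from the training distribution.
# task_sampler and inner_train are hypothetical.

def evolve_loss(task_sampler, inner_train, loss_dim,
                generations=200, population=32, sigma=0.05, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.zeros(loss_dim)                  # parameters of the learned loss
    for _ in range(generations):
        noise = rng.normal(size=(population, loss_dim))
        fitness = np.empty(population)
        for i in range(population):
            candidate = phi + sigma * noise[i]
            task = task_sampler()
            # inner_train runs RL on the task using the candidate loss and
            # returns the return achieved after a short training budget.
            fitness[i] = inner_train(task, candidate)
        advantage = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        phi += lr / (population * sigma) * noise.T @ advantage   # ES update
    return phi
```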

Self-Play

An extremely interesting and mysterious concept, self-play is not a new idea; it has existed since the 1960s. The first interesting result came in 1992, when Gerald Tesauro used a cluster of 386 computers to train a neural network to play backgammon with Q-learning and self-play.

The neural network learned to defeat the world champion, and it discovered strategies that backgammon experts weren’t aware of; the experts acknowledged that those strategies were superior. Then came the AlphaGo Zero result, which defeated all humans at Go from pure self-play. There were also our results on the Dota 2 1v1 bot, which was likewise trained with self-play and was able to beat the world champion in 1v1.

From the above it is evident that self-play works, but what is most notable about it? Self-play can work in very simple environments, and yet if you run self-play in a very simple environment, you can potentially get behaviors of unbounded complexity.

Self-play gives you a way of converting compute into data, which is great, because data is really hard to get while compute is easier to get. Another good thing about self-play is that it provides a very natural curriculum: however good you are, your opponent is exactly as good, so the game is always difficult.

You always win about 50% of the time; if you and your opponent are equally good, it’s still hard. It doesn’t matter how good or bad you are, or how good the bot or the system is, the game is always at the right level of challenge.

That means you have a very smooth path from agents that don’t do much to agents that potentially do a great deal. The idea of self-play is therefore exciting, appealing, and worth thinking about.
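A minimal sketch of the basic self-play loop being described, with hypothetical game and agent interfaces: the learner plays against snapshots of its past selves, so the opposition improves exactly as fast as it does and the compute spent on games becomes training data.

```python
import random

# Sketch of a basic self-play loop: the learning agent plays against
# snapshots of its past selves, so the curriculum keeps pace with the
# agent.  Game and agent interfaces are hypothetical.

def self_play(game, agent, iterations=1000, games_per_iter=50):
    opponents = [agent.snapshot()]                # frozen past versions
    for _ in range(iterations):
        for _ in range(games_per_iter):
            opponent = random.choice(opponents)   # mix old and recent selves
            trajectory, result = game.play(agent, opponent)
            agent.record(trajectory, result)      # compute turned into data
        agent.update()
        opponents.append(agent.snapshot())        # opposition improves with the agent
        if len(opponents) > 20:
            opponents.pop(0)
    return agent
```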

Recently, self-play was tried in a simulated physical environment, with the purpose of learning a martial art. The cool thing about this is that no supervised learning was used; it is basically creativity. You simply say: here is an environment, please figure out what to do in it. And what was discovered can be regarded as a legitimate martial art.

This one is interesting: it’s transfer learning. You take one of the sumo wrestlers and take its opponent away, so now it’s standing in the ring alone. You apply big random forces to it, and as you can see, it does a decent job of maintaining its balance. The reason is that it’s already used to someone pushing on it; that’s why it’s stable.

Imagine putting these robots in self-play so that they learn all these physical skills, and then fine-tuning them to solve some useful task. This is something that hasn’t been done: closing the loop by taking an agent out of the self-play environment and fine-tuning it on some task which we otherwise cannot solve at all.

The final word on self-play, and this is more speculative: can such a self-play system lead us all the way from where we are right now to AGI?

The high-level argument is that a self-play environment gives you a perfect curriculum and the ability to convert compute into data. If you set it up just right, there is basically no bound on how far you can go, or how complex you can become, inside a self-play environment.

AI Alignment: Learning from Human Feedback

The question that we’re trying to address is simple. As we train progressively more powerful AI systems, it will be important to communicate to them goals of greater subtlety and intricacy. How can we do that?

Here, we investigate one approach: having humans judge the behavior of an algorithm, and doing so efficiently. Imagine a scenario where human judges provide one bit of information at a time, telling the agent which of two behaviors is more desirable. They do this for a while, and after about 500 such interactions you see something amazing happen: the agent learns to do a backflip.

How does this happen? In a sense, this is model-based RL, except that you’re modeling the reward rather than the environment.

The way it really works is that the human judges provide feedback to the system, and all those bits of feedback are combined into a model of the reward using a triplet loss. You try to come up with a single reward function that respects all the human feedback given to it, and then you run your RL against this learned reward function.
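A minimal sketch of fitting a reward model to one-bit human comparisons. The published preference-learning work (Christiano et al., 2017) fits the reward model to pairwise comparisons with a Bradley-Terry style cross-entropy loss, and that is what this sketch implements, using a deliberately simple linear reward model and made-up data; the subsequent RL step against the learned reward is not shown.

```python
import numpy as np

# Sketch of learning a reward model from pairwise human preferences.
# The linear reward model, the random "segments", and the random
# "human" answers are illustrative placeholders.

rng = np.random.default_rng(0)
dim = 8
w = np.zeros(dim)                        # reward-model parameters

def segment_return(features, w):
    # Sum of per-step rewards over a trajectory segment of shape (T, dim).
    return (features @ w).sum()

def preference_update(seg_a, seg_b, human_prefers_a, w, lr=0.05):
    # Bradley-Terry style: P(a preferred) = sigmoid(R(a) - R(b)).
    ra, rb = segment_return(seg_a, w), segment_return(seg_b, w)
    p_a = 1.0 / (1.0 + np.exp(rb - ra))
    label = 1.0 if human_prefers_a else 0.0
    # Gradient of the cross-entropy loss with respect to w.
    grad = (p_a - label) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
    return w - lr * grad

# Each round: show the judge two behavior segments, get one bit back,
# and fold it into the reward model; RL then optimizes the learned reward.
for _ in range(500):
    seg_a, seg_b = rng.normal(size=(20, dim)), rng.normal(size=(20, dim))
    prefers_a = bool(rng.integers(2))    # stand-in for the human's answer
    w = preference_update(seg_a, seg_b, prefers_a, w)
```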

You can also communicate non-standard goals. This is a racing game, and we asked the human judges to communicate the goal that the white car, the one being controlled, should stay behind the red car rather than overtake it. That’s what it learned to do.

What’s going to happen in the future, most likely, is that as these systems get more powerful, they will hopefully be able to solve the technical problem of pursuing whatever goals are communicated to them. It is the choice of the goals that will be hard; that is a political problem, one we will all, I guess, either enjoy facing or simply loathe.
