You might have heard of Google DeepMind beating several Atari games, defeating a professional Go player, or even teaching a simulated humanoid to walk. All of this was achieved with the help of Reinforcement Learning.
Reinforcement Learning is a subcategory of Machine Learning where an agent learns to behave in an environment. Other popular subcategories of Machine Learning are Supervised Learning and Unsupervised Learning. Reinforcement Learning differs from supervised and unsupervised learning in that the sequence of actions you take to achieve your goal matters to the problem at hand.
Supervised learning is done using ground truth. Thus, the goal of supervised learning is to learn a function or mapping between sample data and the ground truth outcomes. Popular uses of Supervised Learning include Regression and Classification.
Unsupervised learning, on the other hand, does not have labeled ground truth. Thus, the goal of unsupervised learning is to infer the inherent structure of the sample data without using explicitly provided ground truth labels. A popular use of Unsupervised Learning is Clustering.
Reinforcement Learning does not require preexisting ground truth mappings. It learns a behavior (a set of actions) by experiencing an environment through trial and error. This can be a real-world environment or a simulated world. The agent tries different actions in the environment until it starts learning the optimal actions to take. As such, the goal of reinforcement learning is to learn an optimal behavior to achieve a goal.
Examples of successful use cases for Reinforcement Learning include: Robotic Arm Manipulation, Google DeepMind's AlphaGo beating a professional Go player, Google DeepMind beating several Atari games, Google DeepMind training a humanoid to walk on its own, Unity teaching a puppy how to walk and play fetch, and the OpenAI team beating a professional Dota 2 player.
Agent and Environment
The first concept to try to understand when you are doing Reinforcement Learning is that a lot of it takes place as a conversation between an agent and an environment. You can imagine that you are an agent in a video game environment. You are the agent. The video game environment is the environment. The conversation is what is going back and forth between the agent and the environment.
An agent is merely an abstraction that can take actions, exist in a current state, transition between states, and receive rewards. You can think of an agent as a player of a game. On the other hand, an environment is just an abstraction that is represented via a set of states and rewards. The environment also has a goal that it wants accomplished. You can think of an environment as a board game or a sports game — really any activity with a goal.
The environment reveals itself in the form of states. The agent can then influence the environment by taking actions. After taking an action, you (the agent) may receive some kind of reward before moving on to the next state. This reward is based on the most recent state-action combination. This interaction lets the agent learn more and more about the environment by experiencing states and rewards as consequences of its actions. Really, all of the computation happens in the head of the agent, and information about the environment is only available through the course of this interaction. The environment is not fully known to the agent.
Keep in mind that the agent does not know all of the environment; it only experiences the environment through its interactions with it. If you interact with the environment often enough, then you can build some kind of model of the environment in your head. But note that the model of the environment that you build and the actual environment are not necessarily the same thing.
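This back-and-forth conversation can be sketched as a simple loop. The environment class below is a made-up toy (a four-cell corridor with a goal at one end); its states, actions, and reward values are assumptions for illustration only.

```python
import random

class Environment:
    """A tiny corridor: states 0..3, with the goal at state 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action 0 = move left, action 1 = move right
        if action == 1:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
        reward = 1.0 if self.state == 3 else 0.0  # reward only at the goal
        done = self.state == 3
        return self.state, reward, done

env = Environment()
state, done = env.state, False
while not done:
    action = random.choice([0, 1])          # the agent picks an action
    state, reward, done = env.step(action)  # the environment responds
```

Notice that the agent never looks inside `Environment`; it only sees the states and rewards that come back from `step`, which is exactly the limited view described above.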
If all of the environment were already known, then the agent would not need to do any learning; the problem would just be to plan what to do based on the known environment. This brings us to the difference between planning and learning. Learning happens when you do not know the model/environment. Planning happens when you do know the model/environment. Planning might happen in your own home or on your local streets, since you know them. Learning might happen in a garden maze or a new theme park that you have not visited before.
Learning Through Experience
When we put an agent in an environment, this environment will have states that the agent can exist within and the environment will also have actions that the agent can take. This environment can be a game or the real world. Taking an action can transition you to another state. Taking an action can also result in a reward. However, as opposed to a regular game, you do not know the rules. You do not know how things work. You do not even know what you’re supposed to do. But what you can do is to start playing and use that experience to start discovering how things work.
If you choose an action, then the environment will tell you if you gain a reward and also if you move to another state. Some actions might give you a reward and some might not (depending on your state). Some actions might change your state and some might not (depending on your state). For example, "forward action" might move you forward; however, "forward action" might not move you forward if you are in front of a wall. Another example is that "pickup action" might pick up an object underneath you; however, if there is no object underneath you, then the "pickup action" might not do anything.
These rules of the environment are unknown to you when you begin the game. So, as an agent, you just have to try different actions and see what happens. Also, keep in mind that while "forward action" or "pickup action" might mean something to you (as a human with life experience), the agent has no idea what "forward action" or "pickup action" does, and it might as well be called "Action 1" or "Action 2". You can think of an agent as starting out as a baby with no life experience. Another example is that an agent does not initially know what a "wall" is, so it has to learn what happens when it bumps into a wall.
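The wall example can be made concrete with a few lines of code. The grid layout and wall position below are assumptions for this sketch; the point is only that the same action has different effects in different states.

```python
# A hypothetical 1-D grid where a wall sits between cells 1 and 2.
WALL = 2

def step(state, action):
    """The effect of "forward" depends on the state the agent is in."""
    if action == "forward":
        if state + 1 == WALL:  # blocked by the wall: state does not change
            return state
        return state + 1
    return state  # other actions do nothing in this toy example

print(step(0, "forward"))  # moves from cell 0 to cell 1
print(step(1, "forward"))  # wall ahead: stays in cell 1
```

From the agent's point of view, "forward" is just "Action 1"; it has to discover the wall by trying the action in state 1 and observing that nothing changed.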
As you can see, an agent has an even harder time learning what a state-action pair might do than a human. Eventually, the agent might understand how the environment works. This understanding might be correct, it might be incorrect, or more likely somewhere in between. All you know is that you want to maximize the reward that you’ve been collecting along the way.
For each action you take in a state, you receive a reward. You can think of rewards as positive rewards, negative rewards or neutral rewards. The higher the reward the better. If you win the game, you might receive an extremely high positive reward. If you fall into a pit, you might lose the game and receive an extremely negative reward. Along the way, you can receive smaller positive and negative rewards. For example, if you get closer to the goal you might get a small positive reward every time you get closer. On the other hand, if you lose health in your video game or you lose a pawn in your chess match, then you might receive a small negative reward. There might be actions that do not affect your goal at all, so you might receive a neutral reward of zero.
During your experience in the environment, you gain a sum of rewards, which is called the return. The goal of interacting with the environment is to maximize the total return. So you might have one run of a video game where you lose because you walk into a pit; this will likely lead to a low total return. Another run of the same video game might try another action (such as a jump), which might avoid the pit and yield a higher total return. It is by trial and error that the agent learns which actions are good and which are bad in certain states.
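In its simplest (undiscounted) form, the return is just the sum of the rewards collected over one run. The two reward sequences below are made-up illustrations of the pit example.

```python
def total_return(rewards):
    """The undiscounted return: the sum of rewards collected in one run."""
    return sum(rewards)

fell_in_pit = [0, 0, -10]       # walked into the pit: large penalty
jumped_over = [0, 0, 0, 1, 10]  # jumped the pit, then reached the goal

print(total_return(fell_in_pit))   # -10
print(total_return(jumped_over))   # 11
```

Comparing returns across runs is exactly how the agent tells, after the fact, that jumping was the better choice at the pit.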
So far we have been considering a deterministic environment, where the same thing always happens for a given state-action pair. But what if different things sometimes happen for the same state-action pair? This is what we call a Stochastic Environment. Most real-life environments are stochastic. If you cross an intersection while there is oncoming traffic, then you will not move ahead; instead, your car will collide with another car. So in a given state (e.g., an intersection), the same thing doesn't always happen for one action. To account for this, we create a probabilistic model, which might say that there is oncoming traffic 20% of the time and no oncoming traffic 80% of the time. This part of the stochastic environment is what you call a probabilistic transition function.
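A probabilistic transition function can be sketched by sampling. The 20%/80% split comes from the intersection example above; the outcome labels are assumptions for illustration.

```python
import random

def cross_intersection(rng):
    """Sample the outcome of the "cross" action at the intersection state.

    20% of the time there is oncoming traffic (a crash);
    80% of the time the way is clear and we move ahead.
    """
    return "crash" if rng.random() < 0.2 else "cross"

rng = random.Random(0)  # seeded for reproducibility
outcomes = [cross_intersection(rng) for _ in range(10_000)]
print(outcomes.count("cross") / len(outcomes))  # roughly 0.8
```

In a deterministic environment this function would always return the same outcome; stochasticity just means the environment draws the next state from a distribution.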
A good way to model the environment is through a Markov Decision Process (MDP). We go further into MDPs in the next article.
With reinforcement learning algorithms, we are trying to learn behavior: a way to interact with the environment that collects high reward. The behavior you learn might be represented by a plan or a conditional plan, but reinforcement learning algorithms prefer to represent behavior as a universal plan (aka a stationary policy).
A plan is a fixed sequence of actions. Assuming that there is no traffic and all of the stop lights are green, then all you have to do is follow a plan to get from work to home (e.g., "left", "left", "up", "left", "left", "down"). However, these assumptions will not always hold in real life situations. There might be traffic at one intersection, so it is better to have a conditional plan. A conditional plan allows you to pick different actions at a particular state. Most of the plan is fixed, but there might be one or two conditional states that provide different actions based on a condition.
A universal plan, on the other hand, is similar to a conditional plan except that it has a condition for every single state. This is different than a plan or conditional plan, which only needs to worry about the states that go from point A to point B. A universal plan will have a conditional set of actions at every single state in your environment. If your environment is your city, and your goal is to go from your work to your home, then you will have a conditional set of actions at every single intersection in your city. You will know what to do at every single state in your environment, instead of just a list of states which achieve your goal.
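A stationary policy is naturally represented as a mapping from every state to an action. The intersections and actions below are hypothetical stand-ins for the city example.

```python
# A stationary policy (universal plan): an action for every single state.
# These state names and actions are made up for illustration.
policy = {
    "work":        "left",
    "main_street": "up",
    "park_corner": "left",
    "home_street": "down",
    "home":        "stop",  # goal state
}

def act(state):
    # No matter where we find ourselves, the policy tells us what to do.
    return policy[state]

print(act("park_corner"))  # left
```

Contrast this with a plain plan, which is just an ordered list of actions and says nothing if you ever end up off the planned route.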
Reinforcement Learning focuses on stationary policies because they allow us to express an optimal behavior. This is very powerful, but a stationary policy must specify an action for every state, which makes it difficult to represent for large state spaces. Just imagine a continuous state space, which is infinitely large. Deep Reinforcement Learning is one solution to this largeness problem. We will go further into Deep Reinforcement Learning in another article.
Exploitation vs Exploration
A general theme that comes up in Reinforcement Learning algorithms is exploitation vs exploration. If your algorithm always exploits its current knowledge of the model, then it might only find a local optimum, since it disregards other optima that could yield higher returns. This is where exploration comes in: if you want to discover whether there are better optima than the one you have already found, then you need to go exploring. This doesn't mean that you should explore all the time or exploit all the time; you should find a balance between the two.
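One common way to strike this balance is epsilon-greedy action selection: with probability epsilon the agent explores by picking a random action, and otherwise it exploits the action it currently believes is best. The value estimates below are made up for illustration.

```python
import random

def epsilon_greedy(values, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))                       # explore
    return max(range(len(values)), key=values.__getitem__)      # exploit

rng = random.Random(42)
values = [0.1, 0.5, 0.3]  # the agent's current estimate of each action's value
choices = [epsilon_greedy(values, 0.1, rng) for _ in range(1000)]
print(choices.count(1) / len(choices))  # mostly action 1, the current best
```

Even this simple scheme keeps occasionally trying actions 0 and 2, so if the agent's value estimates turn out to be wrong, it still has a chance to discover a better action.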
Reinforcement Learning may not be the buzzword you have been hearing most often; there are plenty of other popular buzzwords when it comes to Machine Learning. But it is the technique behind some of the most interesting examples of Machine Learning that you’ve heard about. If you’d like an overview of the trendier Machine Learning concepts, then read my other article on Machine Learning Buzz.