Intro to Applied Reinforcement Learning

Mallory Hightower

Part of Dialexa's Back to the Napkin series

While reinforcement learning (RL) is a hot topic in the data science community, there is a surprising lack of knowledge on how to run a successful reinforcement learning project. The machine learning team at Dialexa saw this gap as an opportunity to partner with Dallas AI to spread awareness in the DFW community. At Dialexa, we've learned how best to implement and integrate RL through our experience designing and deploying RL-based solutions.

Reinforcement learning is like training a dog by giving it treats.

RL is a specialized and still-developing form of machine learning. Instead of learning from examples like traditional supervised learning, RL learns by acting in an environment and receiving feedback in the form of rewards. Sort of like training a dog with treats!

This article is the first in a series of posts on reinforcement learning. In this post, we present a brief history of RL and provide an example of a car learning to park utilizing RL. Our second article will discuss methods to frame problems with RL and how to apply RL to your business. Our entire presentation can be found here. The YouTube recording of the presentation is also linked at the end of this post.

History of RL

RL is founded on the well-known psychology concept of trial-and-error learning. Edward Thorndike was a key researcher in this area and studied trial-and-error learning in animals (yes, RL has its roots in animal psychology!). It was Thorndike who coined the Law of Effect in 1898 and introduced the idea of reinforcement.

Edward Thorndike

RL has leaped in popularity within the last decade due to advances in deep learning (neural networks with several layers). There is even a movie on RL that documents the story of the surprising and incredible defeat of the world champion Go player, Lee Sedol. Sedol lost to a computer that used RL to master the game of Go. Sedol has since retired, saying that "even if I become the number one, there is an entity that cannot be defeated" [1]. RL has been dubbed the gateway to general artificial intelligence thanks to feats like the triumph at Go.

Lee Sedol plays AlphaGo.

But let's take a step back. You may be wondering why RL is significant when computers were already beating the best chess players in the 1990s. In 1997, IBM's Deep Blue beat the world chess champion.

Deep Blue plays the world chess champion.

The success of Deep Blue was considered the most significant event in the overall AI field at the time, but AlphaGo's victory is now considered even more monumental. While Go, frequently dubbed the hardest board game in the world, is a much more complex game than chess, the complexity is not the key differentiator between the two events. Deep Blue used a human-defined heuristic function in which concepts like piece value, opening theory, and king safety were preprogrammed. It then used its massive computing power to iterate through different move combinations in order to choose and execute the best move. Deep Blue wasn't truly artificial intelligence; it was merely executing expert-defined rules.

AlphaGo, on the other hand, didn’t require a human to tweak the heuristic rules — that task would have taken an infinite number of years! AlphaGo used reinforcement learning to teach itself to play Go. AlphaGo figured out the game on its own, and did so extremely well. You can now imagine why some see RL as the first step towards general artificial intelligence.

What is RL and where does it fit?

Now that you have a little context on the history and significance of reinforcement learning, let's dive deeper into how RL algorithms actually work. RL is a pseudo-supervised machine learning method. Think of it as something in between clustering (completely unsupervised: there are no labels, and the algorithm learns the relationships in the data on its own) and classification (completely supervised: the algorithm is given both data and labels). At the other end of the spectrum are rules-based engines, like Deep Blue, that don't involve any learning.

The goal of the RL algorithm is to maximize future rewards. The algorithm doesn't know what actions to take initially, but in the process of trying to maximize rewards, it learns which actions lead to rewards. This process can still be distilled to the simple concept of trial-and-error learning. When the RL algorithm does something good, it is rewarded; when it does something bad, it is penalized. In this way, the RL algorithm is guided in its learning process. It is important to note that the RL algorithm takes actions on its own, initially moving about the environment more or less at random as it learns to maximize rewards. A helpful analogy for the different types of machine learning is training a dog.

The different types of machine learning.

Rules engines are traditional computer science, similar to hard-coding instructions into a robot dog. The robot dog is not truly learning any tricks; it is simply executing instructions. Deep Blue is a great example of a rules engine.

Supervised learning is like teaching a dog how to do something by example. Teaching by example is similar to providing an algorithm with labeled data to learn from.

Unsupervised learning is like a dog learning from other dogs. The dog has no direct supervision, but it learns from the other dogs (i.e., data points) around it.

Reinforcement learning is similar to teaching a dog how to behave by rewarding it with treats. The dog may not know how to behave initially, but it learns to associate certain actions with rewards in the form of treats.

The terminology and process of RL are quite different from those of other types of machine learning. Instead of features and labels, you have an agent and an environment. Sounds complicated? Below are the key vocabulary terms for RL.

Key Vocabulary

  1. Agent: Think of the agent as the algorithm. It takes actions in the environment.
  2. Environment: This is where the agent exists, operates, and takes actions.
  3. State: The situation the agent is in at a given moment in time. It changes based on the actions the agent takes.
  4. Actions: The agent can take various actions to interact with the environment.
  5. Rewards: Rewards are feedback that the agent receives from the environment after taking an action. Rewards can be positive or negative.
RL diagram

In summary, an agent takes an action that changes the state of the environment and receives a reward in return. The agent then learns from those rewards which actions to take. The short sketch below shows what this loop looks like in code.
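To make the loop concrete, here is a minimal sketch using the classic OpenAI Gym API with the well-known CartPole environment as a stand-in (CartPole is not part of the parking example later; it is just a convenient default). The "agent" below simply samples random actions; a real RL algorithm would use the rewards it collects to improve its action choices over time.

```python
import gym  # OpenAI Gym provides ready-made environments

# The agent-environment loop: observe a state, take an action,
# receive a reward and the next state from the environment.
env = gym.make("CartPole-v1")
state = env.reset()          # initial state of the environment
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()            # the agent picks an action (random here)
    state, reward, done, info = env.step(action)  # the environment returns the next state and a reward
    total_reward += reward                        # feedback the agent would learn from

print(f"Episode finished with total reward {total_reward}")
```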

RL model variations

As with all machine learning algorithms, there are various forms of reinforcement learning algorithms. The two main versions of RL models are value-based models and policy-based models. They differ in how the agent interprets and estimates rewards and each has its own unique strengths.

  • Value-based models quantify the value of each state in the environment by estimating the potential future rewards from that state. They are well suited to environments that have many possible states but relatively simple actions. Think of Pong, a Rubik's cube, or Atari games.
  • Policy-based models estimate the correct action in the environment using previous reward patterns. The agent studies past rewards to decide which action to take next. These models are well suited to environments with limited states but very complex actions. Examples include lane assist and other control tasks.

There are additional combinations of policy- and value-based models, such as hybrid models, but we won't go into that here. Just know that there are multiple ways for the agent to calculate its next action; a minimal value-based update is sketched below.
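To make the value-based idea concrete, here is a minimal sketch of a tabular Q-learning update on a hypothetical toy problem (the state and action counts below are made up for illustration). A table stores the estimated value of each state-action pair and is nudged toward the observed reward plus the discounted value of the best next action.

```python
import numpy as np

n_states, n_actions = 16, 4                  # hypothetical toy problem
q_table = np.zeros((n_states, n_actions))    # estimated value of each state-action pair
alpha, gamma = 0.1, 0.99                     # learning rate and discount factor

def q_update(state, action, reward, next_state):
    """Nudge Q(state, action) toward the reward plus the discounted best future value."""
    best_next = np.max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state, action] += alpha * (td_target - q_table[state, action])

# Example: taking action 2 in state 5 landed the agent in state 6 and earned a reward of 1.0
q_update(state=5, action=2, reward=1.0, next_state=6)
```

Policy-based methods skip the value table entirely and instead directly adjust the probability of taking each action based on the rewards that followed it.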

RL environment example

In order to solidify these concepts with some hands-on RL code, here is a fun and simple example. The code for this example can be found on our GitHub. The goal in this RL problem is for the car to learn how to park by reversing into the parking spot. The code runs in the OpenAI Gym highway environment. Thanks to resources like OpenAI Gym, it is easier than ever to implement and experiment with your own RL environments.

Agent in the parking lot environment.

To put the problem in the context of the five RL vocabulary terms:

  1. Agent: The agent is the car. The car is interacting with its environment and trying to maximize its rewards.
  2. Environment: The environment is the parking lot. This is a very simple environment that is completely known.
  3. States: The state consists of the location of the car and the location of the green marker. The car's location changes as the car acts on feedback and rewards from its environment, while the green marker's location is initialized at random.
  4. Actions: There are four actions the car (the agent) can take: it can move forward, in reverse, to the left, and to the right in the parking lot.
  5. Rewards: There are a few defined rewards for the agent: parking quickly, parking within the lines, and reverse parking. The data scientist defines what the rewards are for the agent.

In summary, the car can take various movement actions to navigate the parking lot, and it learns to park by discovering which combinations of actions maximize the three potential rewards. A rough code sketch of this setup follows below.
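Here is a rough sketch of how this environment could be set up in code. It assumes the highway-env package, which registers a parking-v0 task with OpenAI Gym; the environment id, action space, and reward details may differ slightly from the code in our repository.

```python
import gym
import highway_env  # assumption: pip install highway-env, which registers "parking-v0" with Gym

env = gym.make("parking-v0")   # environment: the parking lot
obs = env.reset()              # state: the car's position and heading plus the goal marker

done = False
while not done:
    action = env.action_space.sample()          # an (untrained) action from the environment's action space
    obs, reward, done, info = env.step(action)  # reward: feedback on how well the car is doing
    env.render()                                # watch the car wander before any training
```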

Unsupervised learning would likely fail drastically in this scenario because the car would have no way to determine which actions it is supposed to take. Conversely, the reward function in RL provides guidance to the car. A rules-based approach could possibly work, but programming all the potential states and actions in the environment would be difficult and time-consuming.

As this scenario demonstrates, RL is at its core an optimization problem and is the ideal solution for this example. With RL, the car learns by trial and error that some actions, or combinations of actions, lead to rewards while others don't. Watching the car learn to park is pretty entertaining. Here are clips of the car at various training stages. The longer the algorithm trains, the more opportunity it has to navigate the environment and learn how to maximize its rewards.
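For reference, below is a hedged sketch of how such an agent could be trained with an off-the-shelf library. It assumes the stable-baselines3 package and its SAC algorithm purely for illustration; our repository may use a different algorithm, and the timestep count is arbitrary. The point is simply that more training steps give the agent more chances to discover reward-maximizing behavior.

```python
import gym
import highway_env                   # assumption: provides the "parking-v0" environment
from stable_baselines3 import SAC    # assumption: stable-baselines3 is installed

env = gym.make("parking-v0")
model = SAC("MultiInputPolicy", env, verbose=1)  # MultiInputPolicy handles parking-v0's dictionary observations
model.learn(total_timesteps=50_000)              # longer training generally means better parking

# Watch the trained agent attempt to park
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
```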

Car attempting to park after 1 minute of training: the agent is struggling! The car tries different actions to learn reward patterns and the rules of the environment.

Car after 1 minute of training.

Car attempting to park after 5 minutes of training: the agent is gaining some understanding about the environment and rewards. The car now knows that it needs to navigate to the green marker as fast as possible. However, it has a tendency to arrive at the marker haphazardly.

Car after 5 minutes of training.

Car attempting to park after 1 hour of training: the agent is beginning to park pretty well! The car learns to park within the lines and to reverse park.

Car after 1 hour of training.

We hope you find this brief history of and tutorial on reinforcement learning useful! Our next post will cover applying RL to your business problems. You can find our full applied RL presentation here.
