Introduction to Reinforcement Learning

Free RL Course: Part 1

Nathan Weatherly
The Startup
4 min read · Jul 26, 2020


Artificial Intelligence has been a hot topic throughout the last decade, and for good reason. It drives some of the most complicated systems in our society, from self-driving cars to classifying PET scans, and it has the potential to help in every field imaginable. For this reason, I have decided to publish this free series on Reinforcement Learning as sequential articles that build on each other. The series will go into detail on different Reinforcement Learning algorithms and teach you how to implement them in Python. No prior knowledge of Reinforcement Learning is necessary, but some familiarity with matrices, calculus, and basic Python is recommended so that you can grasp certain concepts. I will be sharing all of the code and data used in this series on GitHub. With that being said, let's get started!

Reinforcement Learning is an approach to training AI through the use of three main components:

  • An Environment
  • An Agent
  • A Reward Function

Each of these components is crucial to making sure that the AI can effectively learn to complete a task. For this article, I will be using the popular Atari game “Breakout” as an example to show certain concepts.

Atari Breakout within the OpenAI Gym library

The first part of Reinforcement Learning is the environment. The environment is the world within which the Reinforcement Learning is taking place. Most of the time, the Reinforcement Learning environment is identical to the environment that you want the trained AI to perform in. For us, the environment would simply be the game of Atari Breakout, because the goal is to train an AI that can play Atari Breakout.
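To make the idea of an environment concrete, here is a minimal sketch using the OpenAI Gym library pictured above. The environment id "Breakout-v0" and the classic Gym API (where reset() returns the initial observation) are assumptions based on Gym versions available around the time of writing; newer releases may differ.

```python
import gym

# Create the Breakout environment ("Breakout-v0" is an assumed id;
# it may differ depending on your Gym/Atari installation).
env = gym.make("Breakout-v0")

# Reset the environment to get the initial state S_0 (classic Gym API).
state = env.reset()

print(env.action_space)       # the actions the agent can take
print(env.observation_space)  # the shape of the states the agent will see
```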

The next part of Reinforcement Learning is known as the agent. The agent is the entity that interacts with the environment and (hopefully) learns to interact with it correctly. In most Reinforcement Learning algorithms, the agent is simply the AI that will be trained. This holds true for our example of Atari Breakout.

The last, but most vital, part of Reinforcement Learning is the reward function. The reward function is an algorithm that returns a reward based on some change in the environment. In Atari Breakout, we want our agent to simply achieve as high a score as possible. We could represent this by giving the agent a reward whenever its score goes up. Reward functions are often abstract and can be defined in many ways depending on what you want the agent to do and not do within an environment. The reward for some step t is represented with the notation Rₜ.
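To make this concrete, here is a hypothetical reward function for Breakout, sketched in Python. The function and its signature are my own illustration, not part of any library: it simply rewards the agent whenever the game score increases.

```python
def breakout_reward(previous_score, current_score):
    """Hypothetical reward function: return 1 whenever the score goes up,
    and 0 otherwise. Many other definitions (e.g. the raw score difference,
    or a penalty for losing a life) are equally valid."""
    return 1.0 if current_score > previous_score else 0.0
```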

There are several other terms that you should be familiar with:

  • Step — The unit of “time” within an environment. The current step is represented using the variable t, while the previous step is t - 1 and the next step is t + 1. When using specific integers to represent steps, counting starts at step 0 and increments from there. In Atari Breakout, a step could be represented as one frame.
  • Action — A way in which the agent interacts with its environment. It is represented with the notation Aₜ, where t is the step at which the action was taken. In Atari Breakout, an action would be some movement of the platform.
  • State — The state of an environment at some specific time. It is represented with the notation Sₜ, where t is the step at which the environment is in this state. In Atari Breakout, the state would be the current display of the game.
  • Policy — The policy is some algorithm or function that takes in the current state and outputs an action. The agent uses this policy to decide what action it will take during a certain step. The policy is represented with the notation 𝜋(Sₜ). A minimal example is sketched just after this list.
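Since the policy is just a function from a state to an action, the simplest possible example is one that ignores the state entirely. The sketch below is only a stand-in for 𝜋(Sₜ); the assumption that Breakout has four discrete actions (NOOP, FIRE, RIGHT, LEFT) matches the Gym version of the game.

```python
import random

def random_policy(state):
    """A trivial policy pi(S_t): ignore the state and pick one of Breakout's
    four actions (NOOP, FIRE, RIGHT, LEFT) uniformly at random."""
    return random.randrange(4)
```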

Based on these terms, the current process of our Reinforcement Learning algorithm would look something like this:

Agent-environment interaction loop (graphic from Sutton and Barto's “Reinforcement Learning: An Introduction”)
  1. At step t, the agent is given the state, Sₜ.
  2. The agent plugs the state, Sₜ, into its policy and arrives at a selected action, Aₜ. This is computed by the equation 𝜋(Sₜ) = Aₜ.
  3. The agent takes the computed action, Aₜ, within the environment.
  4. The environment’s state, Sₜ, is updated to Sₜ₊₁ as a consequence of the action, Aₜ, and the reward, Rₜ₊₁, is calculated using the user-defined reward function.
  5. Repeat.
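Putting the pieces together, this loop can be sketched in a few lines of Python. The sketch assumes the classic Gym API, where env.step(Aₜ) returns the next state Sₜ₊₁, the reward Rₜ₊₁, a done flag, and an info dictionary, and it uses a random stand-in for the policy.

```python
import gym

env = gym.make("Breakout-v0")    # assumed environment id
state = env.reset()              # step t = 0: the agent receives S_0
done = False

while not done:
    # The policy pi(S_t) = A_t; here a random stand-in that ignores the state.
    action = env.action_space.sample()

    # The environment applies A_t, producing S_{t+1} and R_{t+1}
    # (classic Gym API: step() returns obs, reward, done, info).
    next_state, reward, done, info = env.step(action)

    state = next_state           # move on to step t + 1

env.close()
```

Notice that nothing in this loop ever changes the policy or uses the reward, which is exactly the problem discussed next.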

You can probably already figure out the problem with this approach. The agent isn’t learning. The policy stays the same for all of the steps, so the agent will be stuck with its initial policy forever. Because of this, the agent also never makes use of the reward. In the next article, we will cover how the agent can learn and be optimized through value functions and the Bellman Equation.
