A brief introduction to Reinforcement Learning - through a practical, basic RL algorithm

Shaamyl Anwar
Mar 16, 2021

What is Reinforcement Learning?

Reinforcement Learning is an exciting field of AI. It is behind technologies such as AlphaZero, which mastered Chess, Go, and Shogi from scratch, defeating a world-champion program in each game. In this article, we will take a quick but deep dive into the basics of RL and showcase a project that implements an RL algorithm in practice, with the goal of giving readers from all backgrounds real intuition for this powerful field.

Reinforcement Learning models the learning process in terms of an “agent” and “rewards”. Essentially, a specified agent (a player in a video game, for example) takes actions in its environment to maximize expected reward in the long run. If the word “expected” makes you suspect that probability and statistics are involved, your suspicion is correct. The actions taken and rewards received need not be deterministic. In fact, they normally aren’t. So, the agent needs to take into account the probabilities of receiving different rewards, taking different actions, and observing different states when deciding how to maximize reward.

However, for simplicity, we will focus on a deterministic example where there is no uncertainty about what action is taken and what rewards are received at each time step.

I snuck in an extra piece of terminology there. A time step is the term RL practitioners use for a particular point in time in the environment. Right now, it seems pedantic to have specific terminology for such a fundamental concept, but it will help us be precise later on.

There are two more pieces of terminology we need to be familiar with in order to understand our practical example of RL. A state is a set of features, at a particular time step, that represent the important characteristics of our environment. An example is a video game player’s physical position on a map. A policy is a set of rules an agent follows for any given state in the environment. For example, given that the agent is at the edge of a map (its state), we might want it to follow the policy of walking towards the center of the map (a future state). It is worth spending a few moments coming up with your own examples of states and policies to get familiar with these concepts.

Traditionally, when an agent starts out in an environment, it has no idea what policy to follow in order to maximize reward. Think of it as a newborn experiencing the world for the first time. Now, what information does the agent need in order to learn a policy that maximizes reward? It needs to know the state it is in (called s), the set of actions it can take (called a), and what rewards it will receive (called r). It also needs to know the future set of states it will reach corresponding to each action (called s’ for each action), so that it can account for future rewards, not just immediate ones.
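As a concrete, purely illustrative sketch, here is one way this information could be bundled up in Java for a deterministic environment (none of these names come from the project we will look at later):

```java
// Purely illustrative: one deterministic "experience" the agent can learn from.
// The names here are hypothetical, not taken from the project shown later.
public class Transition {
    final String state;      // s: the state the agent is in
    final String action;     // a: the action it takes
    final double reward;     // r: the immediate reward it receives
    final String nextState;  // s': the state it ends up in

    public Transition(String state, String action, double reward, String nextState) {
        this.state = state;
        this.action = action;
        this.reward = reward;
        this.nextState = nextState;
    }
}
```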

With this information readily available, we can create an algorithm that:
1. Visits each state and takes each action.
2. Estimates each state’s value by measuring the rewards received and the values of its neighbor states.
3. Loops through all states multiple times to get good estimates of each state’s values.
(Step 2 is the hardest to understand and is explained in more detail below.)

This algorithm will help us get a good estimate of the real value of each state. Once we have a good estimate of value for each state, our reward-maximizing policy will be simple: for each state, take the action that moves you into the highest-value next state. Such a policy is called an optimal policy and finding it is the goal for most RL problems. We call this learning algorithm DP Value Iteration (DP stands for Dynamic Programming) because it, through many iterations, finds the real value of each state in our environment. These values then help us find the optimal policy.
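To make that policy concrete, here is a minimal, hypothetical sketch of greedy action selection once state values have been estimated (again, these names are illustrative and not taken from the project below):

```java
import java.util.Map;

// Illustrative only: pick the action whose next state has the highest
// estimated value. Once the value estimates are correct, acting this way
// in every state is an optimal policy.
public class GreedyPolicy {
    // transitions: for the current state, maps each available action to the
    //              next state it deterministically leads to.
    // values:      our current estimate of each state's value.
    public static String chooseAction(Map<String, String> transitions,
                                      Map<String, Double> values) {
        String bestAction = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, String> entry : transitions.entrySet()) {
            double nextStateValue = values.get(entry.getValue());
            if (nextStateValue > bestValue) {
                bestValue = nextStateValue;
                bestAction = entry.getKey();
            }
        }
        return bestAction;
    }
}
```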

Let’s understand step 2 of the algorithm in a bit more detail. Intuitively, we know that each state’s value should account not just for immediate reward, but for the reward it expects in the future as well. This is because an action with a low immediate reward may still be worth taking because of large future rewards. To take future reward into account, step 2 looks at both the immediate reward and the values of a state’s neighbors. But how can we calculate all future reward from a state by looking only at that state’s neighbors’ values?

We do this using a relation called the Bellman Equation. The convenient thing about the Bellman Equation is that it gives us a formula for our current state’s value only in terms of immediate reward and next states’ values. That is, we don’t have to consider all possible future states that come after immediate neighbor states. This is a recursive equation which forms the bedrock of all algorithmic approaches to RL. For our purposes, basic intuition about the Bellman Equation will suffice but you’ll see a lot more of it in your future studies! For extra-credit understanding, try to interpret the Bellman Equation:

Bellman Equation (Deterministic) [1]
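In standard notation (which may differ slightly from the figure cited above), the deterministic Bellman optimality equation reads as follows, where V(s) is the value of state s, r(s, a) is the immediate reward for taking action a in s, s' is the next state that action leads to, and γ is a discount factor we will meet again in the code:

```latex
V(s) = \max_{a} \left[ r(s, a) + \gamma \, V(s') \right]
```

In words: a state’s value is the best you can do by taking one action, collecting its immediate reward, and then behaving optimally from the state you land in.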

With the proper notation, you can imagine how we could represent the learning algorithm above mathematically in terms of sets, functions, max operators, etc. Such mathematical representations become vital as you transition from basic to intermediate RL. We will not use them here, in order to keep this article as accessible and conceptual as possible. However, I highly recommend getting comfortable with the notation when you study RL in the future.

Implementing DP Value Iteration

Throughout this section, we will implement DP Value Iteration in the following agent-environment scenario:

We have an agent, called Shadow the Dog, who moves through a maze. Every time Shadow the Dog makes a move, it receives a small negative reward. When it enters the green state, it receives a large, positive reward and the game ends. So, naturally, to maximize reward, Shadow must get to the green state as quickly as possible. The red state is “lava”, i.e., entering it ends the game with a large, negative reward. And the black states are “walls”, states that Shadow cannot move into. Can you figure out what is happening in the animation above? What do the numbers represent?

The graphics are created using Java’s Graphics and Swing libraries. We will focus on the core RL part of the program: the valueWithIteration() method. The text below will be code-heavy and will reference variables and classes that are not defined here, but their purpose will be clear from context, so don’t worry if it isn’t easy to follow completely. The goal here is to get the central intuition behind basic DP Value Iteration (and hence, basic RL).

Java code for training Shadow
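As a hedged sketch (not the project’s actual code), reconstructed purely from the walkthrough that follows, valueWithIteration() might look roughly like this. Only gridWorld, optimalValues, gamma, getAccessibleStates(), and maxActionValue are named in the text; the State class and its isWall, isTerminal, row, and col fields are assumptions made to keep the sketch self-contained:

```java
// A sketch of DP Value Iteration on the grid, reconstructed from the
// walkthrough below; this is not the project's exact code. gridWorld,
// optimalValues, and gamma are assumed to be fields of the enclosing class.
void valueWithIteration(int iteration) {
    // Initialize the state-value estimates arbitrarily (here: all zeros).
    // Value Iteration converges to the true values regardless of the start.
    optimalValues = new double[gridWorld.length][gridWorld[0].length];

    // Step 3: sweep over every state many times so the estimates converge.
    while (iteration > 0) {
        // Step 1: visit each state in the grid.
        for (int i = 0; i < gridWorld.length; i++) {
            for (int j = 0; j < gridWorld[i].length; j++) {
                State state = gridWorld[i][j];
                if (state.isWall || state.isTerminal) {
                    continue; // walls and end states keep fixed values
                }

                // Find the highest-valued neighbor reachable from this state
                // (the best next state and, implicitly, its action).
                double maxActionValue = Double.NEGATIVE_INFINITY;
                for (State neighbor : getAccessibleStates(state)) {
                    double neighborValue = optimalValues[neighbor.row][neighbor.col];
                    if (neighborValue > maxActionValue) {
                        maxActionValue = neighborValue;
                    }
                }

                // Step 2: the Bellman update, i.e. immediate reward plus the
                // discounted value of the best next state.
                optimalValues[i][j] = gridWorld[i][j].reward + gamma * maxActionValue;
            }
        }
        // (In the real project, Shadow also moves greedily after each sweep,
        // which is what the animation shows.)
        iteration--;
    }
}
```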

In the code above, we first initialize our state-value function (represented by the 2-d array optimalValues) to arbitrary values. We can do this because, no matter what starting values we choose, DP Value Iteration guarantees convergence to true state values.

We then have a while loop that runs iteration number of times. This corresponds to step 3 of our algorithm, i.e., repeating our estimation procedure many times to get good estimates of each state’s value.

Inside each iteration of the while loop, we loop through the whole gridWorld array (each element is a state). This is equivalent to visiting each state in step 1 of our algorithm. The intuition is that we visit each state (via the for loops) and simulate taking each action in order to measure the rewards we get. Inside the for loops (i.e., inside each state), we get not only the immediate reward but also the neighbor states, using getAccessibleStates(). We then loop through all the neighbor states and find the maximum-valued neighbor state (and its corresponding action). Now we have everything we need to calculate our state’s value using the Bellman Equation.

In the second-to-last statement, we estimate a solution to the Bellman Equation directly by adding the immediate reward (represented by gridWorld[i][j].reward) to the value of the highest-valued neighbor state (called maxActionValue in the code). This is an almost direct translation of the Bellman Equation into code! By repeating this whole procedure many times, we get good estimates of our state-value function.

For completeness, after each iteration, the real code prompts Shadow to follow an optimal policy based on the current value estimates. That’s why, in the animation above, you see Shadow move after each iteration. We also multiply the max state-value in the Bellman Equation line by gamma, which is called the discount factor and will be covered in more detail later on. Concisely, it controls how far-sighted Shadow will be.
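As a quick aside (this is standard RL, not specific to this project), the quantity Shadow is really trying to maximize is the discounted return:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots, \qquad 0 \le \gamma \le 1
```

A gamma close to 0 makes Shadow care almost only about the very next reward, while a gamma close to 1 makes distant rewards count almost as much as immediate ones.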

That’s it! That’s how you can implement a basic RL algorithm from scratch. Don’t worry if you didn’t understand everything; this article was deceptively dense. The aim was to cover a lot of material at a reasonable depth in order to provide basic intuition about RL (this article only touches upon concepts explained in great detail across multiple chapters of traditional RL textbooks). For further reading, I recommend the many great online tutorials available and the Sutton and Barto book, which is also available online. Feel free to reread the parts that were confusing, comment your thoughts below, and suggest improvements and future directions for the article. Thank you for reading!

References:

[1] from https://medium.com/analytics-vidhya/bellman-equation-and-dynamic-programming-773ce67fc6a7
