A Beginner’s Guide to Reinforcement Learning and its Basic Implementation from Scratch
Basics of Reinforcement Learning with Real-World Analogies, and a Tutorial to Train a Self-Driving Cab to Pick Up and Drop Off Passengers at the Right Destinations Using Python from Scratch.
Most of you have probably heard of AI learning to play computer games on its own. A very popular example is DeepMind, which hit the news and took the world by storm when its AlphaGo program defeated the South Korean Go world champion Lee Sedol in 2016.
So what’s the secret behind this major breakthrough? Hold your horses! You’ll understand this in a couple of minutes.
Let’s consider the analogy of teaching a dog new tricks. In this scenario, we emulate a situation and the dog tries to respond in different ways. If the dog’s response is the desired one, we reward it with dog food. Otherwise, we convey in one way or another that its response isn’t the desired one.
Now, every time the dog is exposed to the same situation, it executes a similar action with even more enthusiasm, in expectation of more food. It is essentially learning what to do from positive experiences. Similarly, it tends to learn what not to do when it comes face to face with negative experiences.
That’s exactly how Reinforcement Learning works in a broader sense!
- The dog is an agent that is exposed to an environment. The environment could be thought of as your house, with you in it.
- The situations encountered are analogous to a state. An example of a state could be your dog standing and you using a specific word in a certain tone in your living room.
- Our agent reacts by performing an action to transition from one state to another. For example, your dog goes from standing to running to catch a stick.
- After the transition, it may receive a reward or penalty in return. You give a treat as a reward or simply say “No” as a penalty.
- The policy is the strategy of choosing an action, given a state, in expectation of better outcomes.
Now, putting all this together…
A Reinforcement Learning (RL) task is about training an agent that interacts with its environment. The agent transitions between different scenarios of the environment, referred to as states, by performing actions. Actions, in return, yield rewards, which can be positive, negative, or zero. The agent’s sole purpose is to maximize the cumulative reward over an episode, which is everything that happens between an initial state and a terminal state. We decide the rewards so that they align with the task we want the agent to accomplish.
Hence, we reinforce the agent to perform certain actions by providing it with positive rewards, and to stray away from others by providing negative rewards. This is how an agent learns to develop a strategy or policy.
Reinforcement Learning is one of the three basic Machine Learning paradigms, alongside Supervised and Unsupervised Learning. At its core is the trade-off between exploration and exploitation.
A Few Important Things to Note…
- Being greedy doesn’t always work
There are things that are easy to do for instant gratification, and there are things that provide long-term rewards. The goal is not to be greedy by merely chasing quick immediate rewards, but to optimize for the maximum cumulative reward over the course of training.
- Sequence matters in Reinforcement Learning
The reward the agent receives does not depend just on the current state, but on the entire history of states. Unlike in supervised and unsupervised learning, time is important here.
In a way, Reinforcement Learning is the science of making optimal decisions using experiences. Breaking it down, the process involves these simple steps:
- Observation of the environment
- Deciding how to act using some strategy
- Acting accordingly
- Receiving a reward or penalty
- Learning from the experiences and refining our strategy
- Iterating until an optimal strategy is found
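Jumping ahead a little to the OpenAI Gym interface we’ll use later in this tutorial, one pass through this loop looks roughly like the sketch below, with a purely random policy standing in for a real strategy (we’ll replace it with Q-learning later):

import gym

env = gym.make("Taxi-v2").env      # any Gym environment works here
state = env.reset()                # 1. observe the environment
done = False
while not done:
    action = env.action_space.sample()                  # 2-3. decide and act (random policy for now)
    next_state, reward, done, info = env.step(action)   # 4. receive a reward or penalty
    # 5-6. a learning agent would refine its strategy here, then iterate over many episodes
    state = next_state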
Coming back to AlphaGo…
AlphaGo is a classic example of an RL agent, where the agent has learned how to play the game of Go to maximize its reward. In this tutorial, let’s understand Reinforcement Learning by actually developing an agent to learn to play a game automatically on its own.
Reinforcement Learning is not just limited to games!
It is used for managing stock portfolios and finances, for making humanoid robots, for manufacturing and inventory management, to develop general AI agents, etc…
Designing a Self-Driving Cab from Scratch
Let’s design a simulation of a Self-Driving Smart Cab. The major goal is to demonstrate, in a simplified environment, how you can use RL techniques to develop an efficient and safe approach for tackling this problem.
The Smart Cab’s job is to pick up the passenger at one location and drop them off at another. Here are the things that we want our Smart Cab to take care of:
- Drop off the passenger at the right location.
- Save the passenger’s time by taking the minimum time possible for the drop-off.
- Take care of the passenger’s safety and obey traffic rules.
There are different aspects that need to be considered here while modelling an RL solution to this problem: rewards, states, and actions.
1. Rewards
Since the agent (the imaginary driver) is reward-motivated and is going to learn how to control the cab through trial-and-error experience in the environment, we need to decide the rewards and/or penalties and their magnitudes accordingly. Here are a few points to consider:
- The agent should receive a high positive reward for a successful dropoff because this behaviour is highly desired.
- The agent should be penalized if it tries to drop off a passenger at the wrong location.
- The agent should get a slightly negative reward for every time-step it takes without reaching the destination.
- “Slightly” negative because we would prefer our agent to arrive late rather than make wrong moves trying to reach the destination as fast as possible.
2. State Space
In RL, the agent encounters a state and then takes an action according to the state it’s in.
The State Space is the set of all possible situations our taxi could inhabit. The state should contain useful information the agent needs to make the right action.
Let’s say we have a training area for our Smart Cab where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B):
Let’s assume that the Smart Cab is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. The current location state of our taxi is (3, 1).
There are 4 locations where we can pick up and drop off a passenger: R, G, Y, B, or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our passenger is at location Y and they wish to go to location R.
If we account for one additional passenger state, being inside the taxi, we can take all combinations of passenger locations and destination locations to arrive at the total number of states for our taxi environment: there are 4 destinations and five (4 + 1) passenger locations.
So, our taxi environment has 5×5×5×4=500 total possible states. (Taxi’s location — 5×5, the passenger’s location — 5, and destination location — 4)
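A quick sanity check of that arithmetic (plain Python, no environment needed):

taxi_positions = 5 * 5        # 5x5 grid of taxi locations
passenger_locations = 4 + 1   # R, G, Y, B, or inside the taxi
destinations = 4              # R, G, Y, B
print(taxi_positions * passenger_locations * destinations)   # 500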
3. Action Space
The action space is the set of all the actions that our agent can take in a given state. The agent encounters one of the 500 states and takes an action. The action in our case can be to move in a direction or to pick up/drop off a passenger.
In other words, we have six possible actions:
- south
- north
- east
- west
- pickup
- dropoff
In the illustration above, the taxi cannot perform certain actions in certain states due to the walls. In the environment’s code, a wall hit simply costs the usual -1 time-step penalty and leaves the taxi where it is, so hitting walls just racks up penalties and causes the taxi to consider going around the wall.
Implementation with Python
Fortunately, OpenAI Gym has this exact environment already built for us.
OpenAI Gym provides different game environments which we can plug into our code to test an agent. The library takes care of the API for providing all the information that our agent would require, like possible actions, score, and current state. We just need to focus on the algorithm part for our agent.
We’ll be using the Gym environment called Taxi-v2, which is where all of the details explained above are pulled from.
Gym’s Interface
We first need to install gym. Executing the following command should work:
!pip install cmake 'gym[atari]' scipy
Once installed, we can load the game environment and render how it looks:
import gym
env = gym.make("Taxi-v2").env
env.render()
The core gym interface is env, which is the unified environment interface.
The following are the env methods that we would encounter in the code.
env.reset: Resets the environment and returns a random initial state.
env.step(action): Moves through the environment by one timestep. Returns:
- observation: Observations of the environment
- reward: The reward received, indicating whether your action was beneficial or not
- done: Indicates whether we have successfully picked up and dropped off a passenger, which marks the end of one episode
- info: Additional info such as performance and latency for debugging purposes
env.render: Renders one frame of the environment (helpful in visualizing the environment).
Here’s our restructured problem statement (from Gym docs):
There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.
Let’s dive deeper into the environment:
env.reset() # reset environment to a new, random state
env.render()
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))
Action Space Discrete(6)
State Space Discrete(500)
- The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
- The pipe (‘|’) represents a wall which the taxi cannot cross.
- R, G, Y, B are the possible pickup/destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the destination.
We have an Action Space of size 6 and a State Space of size 500. The RL agent learns to choose an action number from the possible actions 0–5, where:
- 0 = south
- 1 = north
- 2 = east
- 3 = west
- 4 = pickup
- 5 = dropoff
The agent will learn the mapping from states to the optimal action to perform in each state by exploration, i.e. it explores the environment and takes actions based on the rewards defined in the environment.
The optimal action for each state is the action that has the highest cumulative long-term reward.
Back to our illustration…
Let’s take our illustration, encode its state, and give it to the environment to render in Gym. Recall that we have our taxi at row 3, column 1, our passenger at location 2 (Y), and the destination at location 0 (R). Using the Taxi-v2 state encoding method, we can do the following:
state = env.encode(3, 1, 2, 0)
# (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()
State: 328
We are using our illustration’s coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be 328 for our illustration’s state.
Then we can set the environment’s state manually with env.s using that encoded number. You can play around with the numbers and you’ll see the taxi, passenger, and destination move around.
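For example (re-using the same env object), this moves the taxi to the top-left corner while keeping the passenger at Y and the destination at R:

env.s = env.encode(0, 0, 2, 0)   # taxi at row 0, col 0; passenger index 2 (Y); destination index 0 (R)
env.render()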
Reward Table
When the Taxi environment is created, an initial Reward table is also created, called P. We can think of it as a matrix that has the number of states as rows and the number of actions as columns, i.e. a states × actions matrix.
Since every state is in this matrix, we can see the default reward values assigned to our illustration’s state:
env.P[328]
Output:
{0: [(1.0, 428, -1, False)],
1: [(1.0, 228, -1, False)],
2: [(1.0, 348, -1, False)],
3: [(1.0, 328, -1, False)],
4: [(1.0, 328, -10, False)],
5: [(1.0, 328, -10, False)]}
This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.
A few things to note:
- The keys 0–5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in the current state.
- In this env, probability is always 1.0.
- The nextstate is the state we would be in if we take the action at this key of the dictionary.
- All the movement actions have a -1 reward and the pickup/dropoff actions have a -10 reward in this particular state. If we were in a state where the taxi has a passenger and is on top of the right destination, we would see a reward of 20 for the dropoff action (5).
- done is used to tell us when we have successfully dropped off a passenger at the right location. Each successful dropoff is the end of an episode.
Note that if our agent chose to explore action three (3) in this state, it would be going West into a wall, which is why the nextstate for that action stays 328. The source code makes it impossible to actually move the taxi across a wall, so if the taxi keeps choosing that action, it will just keep accruing -1 penalties, which affects the long-term reward.
Entering Reinforcement Learning
We are going to use a simple RL algorithm called Q-learning which will give our agent some memory.
For more on Q-Learning, refer to my blog:
Implementing Q-Learning in Python
Training the Agent
First, we’ll initialize the Q-table to a 500×6 matrix of zeros:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])
We can now create the training algorithm that will update this Q-table as the agent explores the environment over thousands of episodes.
In the first part of while not done, we decide whether to pick a random action or to exploit the already computed Q-values. This is done simply by using the epsilon value and comparing it to the random.uniform(0, 1) function, which returns an arbitrary number between 0 and 1.

We execute the chosen action in the environment to obtain the next_state and the reward from performing the action. After that, we calculate the maximum Q-value for the actions corresponding to next_state, and with that, we can easily update our Q-value to the new_q_value:
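The training loop itself isn’t embedded here, so below is a minimal sketch of it based on the description above. The values of alpha, gamma, and epsilon are illustrative assumptions (they are the hyperparameters discussed later), and the episode count matches the run shown in the output:

import random
import numpy as np

# Hyperparameters (illustrative values; see the tuning section below)
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate

for i in range(1, 100001):
    state = env.reset()
    done = False

    while not done:
        # Explore vs. exploit: with probability epsilon, pick a random action;
        # otherwise pick the best action known so far for this state
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        # Act and observe the outcome
        next_state, reward, done, info = env.step(action)

        # Q-learning update: blend the old value with the reward
        # plus the discounted best value of the next state
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_q_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_q_value

        state = next_state

    if i % 10000 == 0:
        print(f"Episode: {i}")

print("Training finished.")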
Output:
Episode: 100000
Training finished.
Wall time: 30.6 s
Now that the Q-table has been established over 100,000 episodes, let’s see what the Q-values are at our illustration’s state:
q_table[328]
Output:
array([ -2.30108105, -1.97092096, -2.30357004, -2.20591839,
-10.3607344 , -8.5583017 ])
The max Q-value is “north” (-1.971), so it looks like Q-learning has effectively learned the best action to take in our illustration’s state!
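Equivalently, the greedy action can be read off directly with np.argmax; index 1 corresponds to “north” in the action mapping above:

print(np.argmax(q_table[328]))   # 1, i.e. "north"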
Evaluating the Agent
It’s time to evaluate the performance of our agent…
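The evaluation code isn’t embedded here either; below is a minimal sketch, assuming we run 100 episodes and always take the greedy action from the learned Q-table:

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties = 0, 0
    done = False

    while not done:
        # No exploration during evaluation: always exploit the learned Q-values
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:      # illegal pickup/dropoff
            penalties += 1
        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")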
Output:
Results after 100 episodes:
Average timesteps per episode: 12.3
Average penalties per episode: 0.0
We can see that the agent’s performance improved significantly and it incurred no penalties, which means it performed the correct pickup/dropoff actions with 100 different passengers.
Comparing our Q-Learning Agent to No Reinforcement Learning
Let’s see how much better our Q-learning solution is when compared to the agent just making random moves.
For an implementation without Reinforcement Learning, refer to my blog:
Solving the Self-Driving Cab Problem without Reinforcement Learning
With Q-learning, the agent commits errors initially during exploration, but once it has explored enough, it can act wisely, maximizing rewards by making smart moves.
We evaluate our agents according to the following metrics:
- Average number of penalties per episode (the lower the better)
- Average number of timesteps per trip (the lower the better)
- Average rewards per move (the higher the better)
Looks like our Q-Learning Agent has nailed it!
Tuning the hyperparameters
The values of alpha, gamma, and epsilon were mostly based on intuition and trial and error, but there are better ways to come up with good values.
Ideally, all three should decrease over time: as the agent continues to learn, it builds up more resilient priors and needs to rely less on exploration and large updates. Hence, it is crucial to play around and experiment with the hyperparameters, as sketched below.
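As a concrete (hypothetical) example of such a schedule, epsilon can be decayed a little after every episode so the agent explores heavily at first and relies on its Q-table more and more as training progresses; alpha can be treated the same way:

epsilon = 1.0          # start fully exploratory
min_epsilon = 0.01     # never stop exploring entirely
decay = 0.0001         # illustrative decay step

for episode in range(1, 100001):
    # ...run one training episode exactly as in the loop above...
    epsilon = max(min_epsilon, epsilon - decay)   # decay towards the minimum after each episode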
Conclusion and What’s Ahead
Alright! We began by understanding Reinforcement Learning with the help of real-world analogies. We then dived into the basics of Reinforcement Learning and framed the self-driving cab problem as a Reinforcement Learning problem, using OpenAI’s Gym in Python to provide us with a suitable environment in which we can develop and evaluate our agent.
Then we observed how terrible our agent is when it plays without any learning algorithm, so we went ahead and implemented the Q-Learning algorithm from scratch. The agent’s performance improved significantly after Q-Learning.
Q-Learning is one of the easiest Reinforcement Learning algorithms. The problem with Q-Learning, however, is that once the number of states in the environment becomes very high, it is difficult to implement with a Q-table, as its size would become very, very large.
State-of-the-art techniques use deep neural networks instead of a Q-table (Deep Reinforcement Learning). The neural network takes the state information as input and learns to output the value of each possible action, improving over time.
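To make that idea concrete, here is a toy sketch (not a working Deep RL implementation) of what “computing Q-values with a function instead of storing them in a table” means; a single linear layer over a one-hot state encoding stands in for a deep network:

import numpy as np

n_states, n_actions = 500, 6

# A single weight matrix stands in for a deep network:
# it maps a state representation to one predicted Q-value per action.
weights = np.random.randn(n_states, n_actions) * 0.01

def q_values(state):
    one_hot = np.zeros(n_states)   # toy state encoding; a real agent would use richer features
    one_hot[state] = 1.0
    return one_hot @ weights       # predicted Q-values for all 6 actions

print(q_values(328))               # plays the role of q_table[328], but is computed, not stored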
If you’d like to continue with this project to make it better, here are a few things you can add —
- Turn this code into a module of functions that can use multiple environments
- Tune alpha, gamma, and/or epsilon using a decay over episodes
- Implement a grid search to discover the best hyperparameters
“It turns out that Reinforcement Learning is a type of Machine Learning whose hunger for data is even greater than Supervised Learning. It is really difficult to get enough data for Reinforcement Learning Algorithms.”
— Andrew Ng
Further References to my Other Blogs:
Solving the Self-Driving Cab Problem without Reinforcement Learning