A Beginner’s Guide to Reinforcement Learning and its Basic Implementation from Scratch

Tanvi Penumudy
Nov 14, 2020 · 13 min read

Basics of Reinforcement Learning with Real-World Analogies and a Tutorial to Train a Self-Driving Cab to pick up and drop off passengers at right destinations using Python from Scratch.

Most of you must have probably heard of AI learning to play computer games on its own, a very popular example being Deepmind, which hit the news and took the world by storm when their AlphaGo program defeated the South Korean Go world champion in 2016.

So what’s the secret behind this major breakthrough? Hold your horses! You’ll understand this in a couple of minutes.

An Analogy of Reinforcement Learning

Let’s consider the analogy of teaching a dog new dog tricks. In this scenario, we emulate a situation and the dog tries to respond in different ways. If the dog’s response is a desired one, we reward it with dog food. Else, we convey in one way or the other that its response isn’t the desired one.

Now, every time the dog is exposed to the same situation, the dog executes a similar action with even more enthusiasm in expectation of more food. It is basically learning what to do from positive experiences. Similarly, it will tend to learn what not to do when coming face to face with negative experiences.

That’s exactly how Reinforcement Learning works in a broader sense!

  • The dog is an agent that is exposed to an environment. The environment could be thought of your house, with you.
  • The situations encountered are analogous to a state. An example of a state could be your dog standing and you using a specific word in a certain tone in your living room.
  • Our agent reacts by performing an action to transition from one state to another. For example, your dog goes from standing to running to catch a stick.
  • After the transition, it may receive a reward or penalty in return. You give a treat as a reward or simply say “No” as a penalty.
  • The policy is the strategy of choosing an action, given a state, in expectation of better outcomes.

Now, putting all this together…

A Reinforcement Learning (RL) task is about training an agent that interacts with its environment. The agent transitions between different scenarios of the environment, referred to as states, by performing actions. Actions, in return, yield rewards, which could be positive, negative or zero. The agent’s sole purpose is to maximize the notion of cumulative reward over an episode, which is everything that happens between an initial state and a terminal state, where we decide the rewards which align with the tasks that we want to accomplish.

Hence, we reinforce the agent to perform certain actions by providing it with positive rewards, and to stray away from others by providing negative rewards. This is how an agent learns to develop a strategy or policy.

Illustration of the Process of Reinforcement Learning

Reinforcement learning is one of three basic Machine Learning paradigms, alongside Supervised and Unsupervised Learning. It deals with exploitation or exploration.

A Few Important Things to Note…

  1. Being greedy doesn’t always work
    There are things that are easy to do for instant gratification, and there are things that provide long term rewards. The goal is to not be greedy by merely looking for quick immediate rewards, but instead to optimize for maximum rewards over the whole training.
  2. Sequence matters in Reinforcement Learning
    The reward agent does not just depend on the current state, but the entire history of states. Unlike supervised and unsupervised learning, time is important here.

In a way, Reinforcement Learning is the science of making optimal decisions using experiences. Breaking it down, the process involves these simple steps:

  1. Observation of the environment
  2. Deciding how to act using some strategy
  3. Acting accordingly
  4. Receiving a reward or penalty
  5. Learning from the experiences and refining our strategy
  6. Iterate until an optimal strategy is found

Coming back to AlphaGo

AlphaGo is a classic example of an RL agent, where the agent has learned how to play the game of Go to maximize its reward. In this tutorial, let’s understand Reinforcement Learning by actually developing an agent to learn to play a game automatically on its own.

Reinforcement Learning is not just limited to games!

It is used for managing stock portfolios and finances, for making humanoid robots, for manufacturing and inventory management, to develop general AI agents, etc…

Designing a Self-Driving Cab from Scratch

Let’s design a simulation of a Self-Driving Smart Cab. The major goal is to demonstrate, in a simplified environment, how you can use RL techniques to develop an efficient and safe approach for tackling this problem.

The Smart Cab’s job is to pick up the passenger at one location and drop them off in another. The things that we want our Smart Cab to take care:

  • Drop off the passenger at the right location.
  • Save passenger’s time by taking minimum time possible to drop off.
  • Take care of the passenger’s safety and traffic rules.

There are different aspects that need to be considered here while modelling an RL solution to this problem: rewards, states, and actions.

1. Rewards

Since the agent (imaginary driver) is reward-motivated and is going to learn how to control the cab by trial and error experiences in the environment, we need to decide the rewards and/or penalties and their magnitude accordingly. Here a few points to consider:

  • The agent should receive a high positive reward for a successful dropoff because this behaviour is highly desired.
  • The agent should be penalized if it tries to drop off a passenger in wrong locations.
  • The agent should get a slightly negative reward for not making it to the destination after every time-step.
  • “Slightly” negative because we would prefer our agent to reach late instead of making wrong moves trying to reach to the destination as fast as possible.

2. State Space

In RL, the agent encounters a state and then takes action according to the state it’s in.

The State Space is the set of all possible situations our taxi could inhabit. The state should contain useful information the agent needs to make the right action.

Let’s say we have a training area for our Smart Cab where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B):

Representation of a Smart Cab in a Parking Lot

Let’s assume that the Smart Cab is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. The current location state of our taxi is (3, 1).

There are 4 locations that we can pick up and drop off a passenger: R, G, Y, B or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our passenger is in location Y and they wish to go to location R.

If we have 1 additional passenger state of being inside the taxi, we can take all combinations of passenger locations and destination locations to come to a total number of states for our taxi environment; there are 4 destinations and five (4+1) passenger locations.

So, our taxi environment has 5×5×5×4=500 total possible states. (Taxi’s location — 5×5, the passenger’s location — 5, and destination location — 4)

3. Action Space

The action space is the set of all the actions that our agent can take in a given state. The agent encounters one of the 500 states and takes an action. The action in our case can be to move in a direction or decide to pickup/dropoff a passenger.

In other words, we have six possible actions:

  1. south
  2. north
  3. east
  4. west
  5. pickup
  6. dropoff

In the illustration above, the taxi cannot perform certain actions in certain states due to the walls. In the environment’s code, we will provide a -1 penalty for every wall hit, which will just rack up penalties causing the taxi to consider going around the wall.

Implementation with Python

Fortunately, OpenAI Gym has this exact environment already built for us.

OpenAI Gym provides different game environments which we can plug into our code and test an agent. The library takes care of API for providing all the information that our agent would require, like possible actions, score, and current state. We just need to focus just on the algorithm part for our agent.

We’ll be using the Gym environment called Taxi-V2 where all of the details explained above are pulled from.

Gym’s Interface

We need first install gym. Executing the following code should work:

!pip install cmake 'gym[atari]' scipy

Once installed, we can load the game environment and render how it looks:

import gym  
env = gym.make("Taxi-v2").env

The core gym interface is env, which is the unified environment interface.

The following are the env methods that we would encounter in the code.

env.reset: Resets the environment and returns a random initial state.

env.step(action): Moves through the environment by one timestep. Returns

  • observation: Observations of the environment
  • reward: If your action was beneficial or not
  • done: Indicates if we have successfully picked up and dropped off a passenger, also called one episode
  • info: Additional info such as performance and latency for debugging purposes
  • env.render: Renders one frame of the environment (helpful in visualizing the environment)

Here’s our restructured problem statement (from Gym docs):

There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

Let’s dive more into the environment -

env.reset() # reset environment to a new, random state 
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))
Action Space Discrete(6)
State Space Discrete(500)
  • The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
  • The pipe (‘|’) represents a wall which the taxi cannot cross.
  • R, G, Y, B are the possible pickup/destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the destination.

We have an Action Space of size 6 and a State Space of size 500. RL learns to choose an action number from the possible actions 0–5 where:

  • 0-south
  • 1-north
  • 2-east
  • 3-west
  • 4-pickup
  • 5-dropoff

RL will learn the mappings of states to the optimal action to perform in that state by exploration, i.e. the agent explores the environment and takes actions based on rewards defined in the environment.

The optimal action for each state is the action that has the highest cumulative long-term reward.

Back to our illustration…

Let’s take our illustration, encode its state, and give it to the environment to render in gym. Recall that we have our taxi at row 3, column 1, our passenger is at location 2, and the destination is location 0. Using the Taxi-V2 state encoding method, we can do the following:

state = env.encode(3, 1, 2, 0) 
#(taxi row, taxi column, passenger index, destination index)
print("State:", state)
env.s = state
State: 328

We are using our illustration’s coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be 328 for our illustration’s state.

Then we can set the environment’s state manually with env.env.s using that encoded number. You can play around with the numbers and you'll see the taxi, passenger, and destination move around.

Reward Table

When the Taxi environment is created, there is an initial Reward table that has been created, called P. We can think of it as a matrix that has the number of states as rows and number of actions as columns, i.e. states × actions matrix.

Since every state is in this matrix, we can see the default reward values assigned to our illustration’s state:



{0: [(1.0, 428, -1, False)],
1: [(1.0, 228, -1, False)],
2: [(1.0, 348, -1, False)],
3: [(1.0, 328, -1, False)],
4: [(1.0, 328, -10, False)],
5: [(1.0, 328, -10, False)]}

This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.

A few things to note:

  • The 0–5 corresponds to the actions (south, north, east, west, pickup, dropoff) the taxi can perform at the current state.
  • In this env, probability is always 1.0.
  • The nextstate is the state we would be in if we take the action at this index of the dictionary.
  • All the movement actions have a -1 reward and the pickup/dropoff actions have -10 reward in this particular state. If we are in a state where the taxi has a passenger and it is on top of the right destination, we would see a reward of 20 at the dropoff action (5)
  • done is used to tell us when we have successfully dropped off a passenger in the right location. Each successful dropoff is the end of an episode.

Note that if our agent chose to explore action two (2) in this state it would be going East into a wall. The source code has made it impossible to actually move the taxi across a wall, so if the taxi chooses that action, it will just keep accruing -1 penalties, which affects the long-term reward.

Entering Reinforcement Learning

We are going to use a simple RL algorithm called Q-learning which will give our agent some memory.

For More on Q-Learning refer my blog —

Introduction to Q-Learning for the Self-Driving Cab Problem

Implementing Q-Learning in Python

Training the Agent

First, we’ll initialize the Q-table to a 500×6500×6 matrix of zeros:

import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

We can now create the training algorithm that will update this Q-table as the agent explores the environment over thousands of episodes.

In the first part of while not done, we decide whether to pick a random action or to exploit the already computed Q-values. This is done simply by using the epsilon value and comparing it to the random.uniform(0, 1) function, which returns an arbitrary number between 0 and 1.

We execute the chosen action in the environment to obtain the next_state and the reward from performing the action. After that, we calculate the maximum Q-value for the actions corresponding to next_state, and with that, we can easily update our Q-value to the new_q_value:


Episode: 100000
Training finished.
Wall time: 30.6 s

Now that the Q-table has been established over 100,000 episodes, let’s see what the Q-values are at our illustration’s state:



array([ -2.30108105,  -1.97092096,  -2.30357004,  -2.20591839,
-10.3607344 , -8.5583017 ])

The max Q-value is “north” (-1.971), so it looks like Q-learning has effectively learned the best action to take in our illustration’s state!

Evaluating the Agent

It’s time to evaluate the performance of our agent…


Results after 100 episodes:
Average timesteps per episode: 12.3
Average penalties per episode: 0.0

We can see that the agent’s performance improved significantly and it incurred no penalties, which means it performed the correct pickup/dropoff actions with 100 different passengers.

Comparing our Q-Learning Agent to No Reinforcement Learning

Let’s see how much better our Q-learning solution is when compared to the agent just making random moves.

For Implementation without Reinforcement Learning refer my blog —

Solving the Self-Driving Cab Problem without Reinforcement Learning

With Q-learning, the agent commits errors initially during exploration but once it has explored enough, it can act wisely maximizing the rewards making smart moves.

We evaluate our agents according to the following metrics,

  • Average number of penalties per episode (the lower the better)
  • Average number of timesteps per trip (the lower the better)
  • Average rewards per move (the higher the better)
Comparison of both our Agents (with and without RL)

Looks like our Q-Learning Agent has nailed it!

Tuning the hyperparameters

The values of alpha, gamma, and epsilon were mostly based on intuition and hit and trial, but there are better ways to come up with good values.

Ideally, all three should decrease over time because as the agent continues to learn, it actually builds up more resilient priors. Hence, it is extremely crucial to play around and experiment with the hyperparameters.

Conclusion and What’s Ahead

Alright! We began with understanding Reinforcement Learning with the help of real-world analogies. We then dived into the basics of Reinforcement Learning and framed a Self-Driving Cab as a Reinforcement Learning Problem using OpenAI’s Gym in python to provide us with a related environment, where we can develop our agent and evaluate it.

Then we observed how terrible our agent was without using any algorithm to play the game, so we went ahead to implement the Q-Learning Algorithm from Scratch. The agent’s performance improved significantly after Q-Learning.

Q-Learning is one of the easiest Reinforcement Learning algorithms. The problem with Q-Learning however is, once the number of states in the environment are very high, it becomes difficult to implement them with Q-table as the size would become very, very large.

State of the art techniques uses Deep neural networks instead of the Q-table (Deep Reinforcement Learning). The neural network takes in state information and actions to the input layer and learns to output the right action over time.

If you’d like to continue with this project to make it better, here are a few things you can add —

  • Turn this code into a module of functions that can use multiple environments
  • Tune alpha, gamma, and/or epsilon using a decay over episodes
  • Implement a grid search to discover the best hyperparameters

“It turns out that Reinforcement Learning is a type of Machine Learning whose hunger for data is even greater than Supervised Learning. It is really difficult to get enough data for Reinforcement Learning Algorithms.”

— Andrew Ng

Further References to my Other Blogs:

Solving the Self-Driving Cab Problem without Reinforcement Learning

Introduction to Q-Learning for the Self-Driving Cab Problem

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…

Sign up for Analytics Vidhya News Bytes

By Analytics Vidhya

Latest news from Analytics Vidhya on our Hackathons and some of our best articles! Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Tanvi Penumudy

Written by

CS Undergrad at Bennett University

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store