Q-LEARNING with OpenAI TAXI-V3

Hilal Müleyke YÜKSEL
Dec 6, 2022


Before moving on to the coding part of the project, what is Q-Learning?

Q-LEARNING

Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to take.

In this project, I want to build a driverless taxi that can pick up passengers from a set of fixed points and drop them off at another location, avoiding obstacles and getting there as quickly as possible.

Let’s create our environment with the help of OpenAI Gym.
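Here is a minimal sketch of creating it, assuming the classic OpenAI Gym API (the gym 0.21-era interface, where reset() returns a plain state number and render() prints the grid as text):

import gym

streets = gym.make("Taxi-v3").env   # .env strips the step-limit wrapper
streets.reset()                     # put the world into a random starting state
streets.render()                    # print the grid described below

In the rendered output: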

  • BLUE letter: where we need to pick the passenger up.
  • MAGENTA letter: where that passenger wants to go.
  • SOLID lines: walls that the taxi cannot cross.
  • FILLED RECTANGLE: the taxi itself.

My little world here, which I have called “streets”, is a 5x5 grid. The state of this world at any time can be defined by:

  • Where the taxi is (5x5 = 25 locations)
  • What the current destination is (4 possibilities)
  • Where the passenger is (5 possibilities: at one of the four destinations, or inside the taxi)

So there are a total of 25 x 4 x 5 = 500 possible states that describe my world.

For each state, there are six possible actions (both counts are verified in code right after this list):

  • Move South, East, North, or West
  • Pick up the passenger
  • Drop off the passenger
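
Both counts can be checked directly on the environment object, a quick sanity check assuming the same gym API as above:

print(streets.observation_space.n)   # 500 possible states
print(streets.action_space.n)        # 6 possible actions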

Q-Learning will take place using the following rewards and penalties at each state:

  • A successful drop-off yields +20 points
  • Every time step taken while driving a passenger yields a -1 point penalty
  • Picking up or dropping off at an illegal location yields a -10 point penalty

Moving across a wall is simply not allowed.

Let’s define an initial state, with the taxi at location (2, 3), the passenger at pick-up location 2, and the destination at location 0:
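A sketch of doing that in code, assuming gym’s Taxi-v3 encode() method, whose arguments are taxi row, taxi column, passenger location index, and destination index:

initial_state = streets.encode(2, 3, 2, 0)   # taxi at (2, 3), passenger at location 2, destination 0
streets.s = initial_state                    # force the environment into this exact state
streets.render()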

Let’s examine the rewards table for the initial state:
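In gym’s Taxi-v3 this table lives in the P attribute; the entry for our state has one row per action:

print(streets.P[initial_state])   # {action: [(probability, next_state, reward, done)], ...}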

Each row corresponds to a potential action at this state:

  • move South, East, North or West
  • pick up or drop off the passenger

The four values in each row are: the probability assigned to that action, the next state that results from that action, the reward for that action, and whether that action indicates a successful drop-off took place.

So, for example, moving North from this state would put us into state number 368, incur a penalty of -1 for taking up time, and would not result in a successful drop-off.

Now I will create a Q-Table to train my model. So what is a Q-Table?

Q-TABLE: A Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state.
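A minimal sketch of building it with NumPy: one row per state, one column per action, everything starting at zero:

import numpy as np

q_table = np.zeros([streets.observation_space.n, streets.action_space.n])   # 500 x 6, all zeros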

Now we need to train our model. At a high level, we will train over 10,000 simulated taxi runs. For each run, we will step through time, with a 10% chance at each step of making a random, exploratory move instead of using the learned Q-values to guide our actions.
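Here is a sketch of that training loop; the learning rate (0.1) and discount factor (0.6) are my own illustrative choices, while the 10% exploration rate and 10,000 runs match the description above:

import random

learning_rate = 0.1
discount_factor = 0.6
exploration = 0.1    # 10% chance of a random, exploratory step
epochs = 10000       # 10,000 simulated taxi runs

for taxi_run in range(epochs):
    state = streets.reset()
    done = False
    while not done:
        if random.uniform(0, 1) < exploration:
            action = streets.action_space.sample()   # explore: pick a random action
        else:
            action = np.argmax(q_table[state])       # exploit: pick the best known action
        next_state, reward, done, info = streets.step(action)
        # The Q-learning update rule:
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        q_table[state, action] = prev_q + learning_rate * (reward + discount_factor * next_max_q - prev_q)
        state = next_state

Decaying the exploration rate over time is a common refinement, but a fixed 10% is enough for this small world.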

So now we have a table of Q-values that can be quickly used to determine the optimal next step for any given state. Let’s check the table for our initial state above:
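Using the names from the sketches above:

print(q_table[initial_state])   # six Q-values, one per action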

Actions in Taxi-v3 are indexed 0 = move South, 1 = move North, 2 = move East, 3 = move West, 4 = pick up, 5 = drop off; the highest Q-value in the row marks the move the trained agent now prefers.
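
To watch the result, a short sketch that replays a few trips greedily from the Q-table (the trip count is arbitrary):

for tripnum in range(1, 4):
    state = streets.reset()
    done = False
    while not done:
        action = np.argmax(q_table[state])   # always take the best known action
        state, reward, done, info = streets.step(action)
        streets.render()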

Also check out the full code on GitHub:

https://github.com/muleykeyy/Q-Learning-Taxi-Problem

And to learn more about Reinforcement Learning:

https://medium.com/@hmuleykey/introduction-to-reinforcement-learning-dc3c77b53c5c
