Maze game with Reinforcement Learning

Mehul Gupta
Data Science in your pocket
5 min read · Jun 8, 2019


Reinforcement Learning is becoming one of the most popular techniques in Machine Learning today. Gaming has often been associated with it, so here is an idea (with a code link as well) for creating your first game!

Maze Runner is basically a maze game with obstacles defined. You want the Hero to reach the other end (as shown in the image) on its own, and yes, Reinforcement Learning will do that! I hope you know the basic terminology associated with Reinforcement Learning. If not, check here

Game Environment

Here, the red stars mark our beginning & end points, and the 4 black dots represent blocks that can't be crossed. The aim is to reach from the lower star to the upper one. Simple!

Let's point out our requirements (in my opinion) —

1. Hero should not go off the board, hence the penalty for moving out of bounds should be heavy.

2. Hero should not collide with blocks (marked as 1 in the code), again with a heavy penalty.

3. Hero should not go back to an already visited point.

4. When Hero reaches the destination, the game should stop.

5. Whenever the Hero gets trapped in a loop or in a position it can't escape, terminate the episode with a penalty on that state.
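To make these requirements concrete, one way to encode them is a small reward scheme like the sketch below; the exact values are illustrative placeholders, not the numbers used in my repo.

```python
# Hypothetical reward scheme matching the requirements above (values are illustrative).
REWARDS = {
    'off_board':  -10,   # 1. heavy penalty for leaving the board
    'block_hit':  -10,   # 2. heavy penalty for colliding with a block (marked 1 in code)
    'revisit':     -2,   # 3. discourage returning to an already visited point
    'goal':        10,   # 4. reaching the destination ends the game
    'stuck':       -5,   # 5. penalty when trapped in a loop / inescapable position
    'valid_move':   1,   # ordinary step
}
```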

LOTS OF CODE🤓

git repo link

Now that we are familiar with the problem statement, it's time to code. But you might be wondering where to start. What would the code flow be? How would the Hero move? Let's divide the task into smaller steps and work on them one by one.

1. Setting up the Environment

First of all, we need an environment where all this happens. For example, in the game of chess, the chessboard is our environment. You can go for a grid image or a blank image for this purpose, and if you draw grid lines over it (matplotlib.pyplot.grid(True)), you will get your environment. For computation purposes, we will keep an N x N array, and use this array to place markers over the image and visualize moves.
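As a rough sketch of this setup (assuming NumPy and matplotlib; the `blocks`, `start`, and `goal` positions below are illustrative placeholders rather than the exact values from my repo), the board can be a 6x6 array with 1 marking a block:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative sketch: a 6x6 environment where 1 marks a block the Hero cannot enter.
N = 6
env = np.zeros((N, N), dtype=int)

# Hypothetical obstacle, start & goal positions; adjust these to match your own board.
blocks = [(1, 2), (2, 4), (3, 1), (4, 3)]
for r, c in blocks:
    env[r, c] = 1
start, goal = (5, 0), (0, 5)

def visualize(env, hero):
    """Plot the grid, the blocks, the start/goal stars & the Hero's current cell."""
    plt.figure(figsize=(4, 4))
    plt.xlim(-0.5, N - 0.5)
    plt.ylim(-0.5, N - 0.5)
    plt.grid(True)
    rows, cols = np.where(env == 1)
    plt.scatter(cols, rows, c='black', s=200)                    # blocks
    plt.scatter([start[1], goal[1]], [start[0], goal[0]],
                c='red', s=250, marker='*')                      # start & goal
    plt.scatter(hero[1], hero[0], c='blue', s=150, marker='s')   # the Hero
    plt.gca().invert_yaxis()
    plt.show()

visualize(env, hero=start)
```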

2. What is Q-Learning?

Q-learning is off-policy: the policy used to pick the current action is not the one used to value future states, i.e. we might use e-greedy to choose the action actually taken, but a purely greedy (max) estimate for future rewards. More can be explored here.

For simplicity, we can think of it as a combination of 2 things, a Q-Table & a Q-Function. These two components are discussed below.

Q-Table: The Q-Table is of utmost importance. Let me guide you through what a Q-Table is & how to use it.

What I would be considering is that any cell in the environment is a state, hence in our environment (an N x N array) there would be N x N states. For example, (2,3), (3,5), etc. would be states in a 6x6 array.

Q-Table: We can perform 4 actions per state, i.e. left, up, right, down. Hence our Q-Table would be a mapping of each state with each action & its value, i.e. an Action x State array. For instance, with 4 possible actions and 36 states present, the Q-Table is a 4x36 array. After each action, we update the value corresponding to the current action:state pair using the formula:

Q(Action:A, State:S) += alpha * (reward + beta * Max(Q(Action:A1, Next_State:S1)) - Q(Action:A, State:S))

where,

  • Q() → returns value mapped for an action on a state
  • alpha & beta are constants: alpha is the learning rate and beta the discount factor

Max(Q(Action:A1, Next_State:S1)) represents the maximum value obtainable at state S1 over all actions A1, where Next_State S1 is reached by performing Action A on State S. Say, for example, if action ‘left’ is performed on state 23 & the Hero reaches state 25 after this, then the equation becomes

Q(‘left’, 23) += alpha * (reward_for_performing_’left’_on_23 + beta * Max(Q(any_action A1, 25)) - Q(‘left’, 23))

However, we still need to define our game’s end. It is simply this: when the Hero reaches the topmost corner, we win.
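Here is a minimal sketch of the Q-Table and the update rule above, assuming a 6x6 board (36 states, indexed as row * 6 + column) and 4 actions; alpha, beta, the state encoding & the reward value are illustrative assumptions, not taken from the repo:

```python
import numpy as np

N = 6
actions = ['left', 'up', 'right', 'down']        # 4 possible moves per state
q_table = np.zeros((len(actions), N * N))        # Action x State array, here 4x36

alpha, beta = 0.1, 0.9                           # constants (illustrative values)

def encode(row, col):
    """Flatten a (row, col) cell into a single state index, e.g. (3, 5) -> 23."""
    return row * N + col

def update_q(action, state, reward, next_state):
    """Q(A, S) += alpha * (reward + beta * Max(Q(A1, S1)) - Q(A, S))."""
    a = actions.index(action)
    best_next = np.max(q_table[:, next_state])   # max over all actions A1 at state S1
    q_table[a, state] += alpha * (reward + beta * best_next - q_table[a, state])

# The worked example above: performing 'left' on state 23 leads to state 25
# (the reward of -1 here is just a placeholder).
update_q('left', 23, reward=-1, next_state=25)
```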

3. Define Required Functions

Now, a major question: how do we use all this information to automate the game? Before that, we need to define certain functions:

A) check() → for checking invalid conditions & returning rewards accordingly.

B) visualize() → for using N x N array to visualize our game on a blank image using matplotlib.

C) reset() → Whenever an episode ends (whether we reach the goal or a loop forms and termination is needed), resets all variables (the environment array, but not the q_table) & the initial starting state.

D) state_updater() → to update the state & position whenever the reward is > 0; otherwise the Hero remains in the same state, i.e. at the same point.
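Below is one possible shape for check() and state_updater(), hedged as a sketch: the reward values, the visited set & the board setup are illustrative assumptions layered on the earlier snippets, not the repo’s exact code.

```python
import numpy as np

# Minimal setup repeated so this sketch runs on its own (values are hypothetical).
N = 6
env = np.zeros((N, N), dtype=int)
env[1, 2] = env[2, 4] = env[3, 1] = env[4, 3] = 1    # blocks marked as 1
goal = (0, 5)

def encode(row, col):
    return row * N + col

visited = set()          # states already stepped on in the current episode

def check(row, col):
    """Return the reward for attempting to move onto (row, col)."""
    if not (0 <= row < N and 0 <= col < N):
        return -10       # heavy penalty: the Hero went off the board
    if env[row, col] == 1:
        return -10       # heavy penalty: collided with a block
    if encode(row, col) in visited:
        return -2        # discourage going back to an already visited point
    if (row, col) == goal:
        return 10        # reached the destination, episode should stop
    return 1             # ordinary valid move

def state_updater(state, new_row, new_col, reward):
    """Move to the new cell only when the reward is > 0, else stay at the same point."""
    if reward > 0:
        new_state = encode(new_row, new_col)
        visited.add(new_state)
        return new_state, (new_row, new_col)
    return state, divmod(state, N)
```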

Let's have a look at the code flow

  1. Create an environment & q_table, and load an image (blank or grid). Some prior knowledge of the image coordinates (needed to place the grid) is necessary.
  2. Mark the obstacles in the environment, and create reset(), check(), state_updater(), and visualize().
  3. Loop over 200 episodes (an episode is one instance of the game; an instance is completed when the game is won or terminated), calling reset() at the start of each & doing the following in the inner loop:

3-a) Use the e-greedy method to generate action_to_be_taken (with some probability, either take the max-value action for the given state or a random action); a sketch of this loop appears after these steps.

3-b) Calculate the new coordinates, get the reward using check(), & update the q_table.

3-c) If the reward is > 0, update the state; else remain in the same state, using state_updater().

3-d) Store the previous state & present state in variables, so that if the state doesn’t change (say it stays at S1) for about 20–30 iterations, we penalize the action:state pair that led to S1. For example, if performing ‘up’ on 23 gives me 25 and the Hero gets stuck there, I should penalize both states (a bit tricky, I am still working on it).

4. Call visualize() once an episode is over.
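Putting the flow together, here is a hedged, self-contained sketch of the 200-episode loop with e-greedy action selection; the epsilon value, step cap, rewards & board layout are illustrative assumptions, so the repo’s loop may differ in the details.

```python
import numpy as np

N = 6
env = np.zeros((N, N), dtype=int)
env[1, 2] = env[2, 4] = env[3, 1] = env[4, 3] = 1          # hypothetical blocks
start, goal = (5, 0), (0, 5)
actions = ['left', 'up', 'right', 'down']
moves = {'left': (0, -1), 'up': (-1, 0), 'right': (0, 1), 'down': (1, 0)}
q_table = np.zeros((len(actions), N * N))
alpha, beta, epsilon = 0.1, 0.9, 0.2                        # illustrative constants

def encode(row, col):
    return row * N + col

def check(row, col, visited):
    """Reward for attempting to move onto (row, col)."""
    if not (0 <= row < N and 0 <= col < N) or env[row, col] == 1:
        return -10                                          # off board / block hit
    if (row, col) == goal:
        return 10                                           # destination reached
    return -2 if encode(row, col) in visited else 1

for episode in range(200):
    pos, visited = start, {encode(*start)}                  # reset() equivalent
    for step in range(100):                                 # cap steps to break loops
        state = encode(*pos)
        # e-greedy: mostly exploit the best known action, sometimes explore randomly
        if np.random.rand() < epsilon:
            a = np.random.randint(len(actions))
        else:
            a = int(np.argmax(q_table[:, state]))
        dr, dc = moves[actions[a]]
        new_row, new_col = pos[0] + dr, pos[1] + dc
        reward = check(new_row, new_col, visited)
        next_state = encode(new_row, new_col) if reward > 0 else state
        # Q(A, S) += alpha * (reward + beta * Max(Q(A1, S1)) - Q(A, S))
        q_table[a, state] += alpha * (reward + beta * np.max(q_table[:, next_state])
                                      - q_table[a, state])
        if reward > 0:                                      # state_updater() equivalent
            pos = (new_row, new_col)
            visited.add(next_state)
        if pos == goal:                                     # game completed
            break
```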

And you should expect a result like this after 50 episodes (below I have attached all 50 observations as well), i.e. for the 51st episode.

And your first game is ready!! A lot of cool stuff can be added to this, which I am working on & will be updating soon…

Observations for 50 episodes are embedded in the video below

The entire code can be explored here.
