Using Reinforcement Learning techniques to build an AI bot for the game Flappy Bird
Authors: Sneha Bhakare, Videsh Suman
Note: This work was submitted in a group of two, to fulfil the requirements of the course Introduction to Machine Learning (EE 769) at IIT Bombay.
Overview:
Flappy Bird is a very popular and addictive game where the principal challenge is to consistently steer the bird through the gaps between the pipes.
You can try your skills here: http://flappybird.io/
Learning to play games is one of the most actively researched topics in AI today. Solving such problems with game theory, search algorithms, etc. requires careful domain-specific feature definitions, which makes those approaches hard to scale. The goal here is to develop a more general framework that learns game-specific features and solves the problem.
“Logic will get you from A to B. Imagination will take you everywhere.” ~ Albert Einstein
We chose this game because of its difficulty for an average human to play.
The goal of this project is to explore the simple Q-learning technique and the Deep Q-Network (DQN) to build a learned agent that can play the game on its own for as long as possible.
Let’s understand some theoretical concepts before the implementation.
Reinforcement Learning:
An aspect of Artificial Intelligence where an agent learns how to behave in an environment by performing actions and learning from the corresponding rewards
Reinforcement learning differs from standard supervised learning in the sense that correct input/output pairs need not be presented and sub-optimal actions need not be explicitly corrected. Instead, the focus is on sequential decision making: given the current input, you make a decision, and the next input depends on that decision. In supervised learning, the decisions you make, whether in a batch setting or an online setting, do not affect what you see in the future.
You might want to read this Wikipedia link about Markov Decision Processes (MDP) before moving ahead.
Q-Learning:
It is a technique that decides which action to take based on an action-value function, which estimates the value of being in a certain state and taking a certain action in that state.
As Wikipedia describes it,
Before learning begins, Q is initialised to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t the agent selects an action a_t, observes a reward r_t, enters a new state s_t+1 (that may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a simple value iteration update, using the weighted average of the old value and the new information.
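In symbols, this update rule is the standard Q-learning formulation (consistent with the update code later in this post, where α is the learning rate and γ the discount factor):

Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)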
In our context, each state can be defined using 3 independent parameters:
- horizontal distance of the bird from the pipe
- vertical distance of the bird from the pipe
- vertical speed of the bird
There are only two legal actions: flap, or do nothing.
Intuitive Rewards:
- Crashing — Negative (harsh)
- Surviving — Minimally positive
- Crossing a pipe — Positive
Goal: To maximise the long term reward
Deep Q-Network (DQN):
This learning algorithm was first developed by Google DeepMind to play Atari 2600 video games (2015). They demonstrated how an AI based model learned to play those games by observing just the screen pixels and receiving a reward when the game score incremented. The results were quite remarkable and demonstrated that such an algorithm is generic enough to play various games.
Here, the Q-function can be represented with a neural network (since neural networks are exceptionally good at learning features from highly structured data) that takes the state (the last four game screens stacked together) as input and returns the Q-value of each possible action. The advantage of this approach is that to perform a Q-value update or pick the action with the highest Q-value, we only need one forward pass through the network, which gives us the Q-values of all actions immediately.
Experience Replay: This is arguably the most important trick for getting the network to converge with more stable learning. Due to the stochastic nature of gameplay, approximating Q-values with non-linear functions (neural networks) is not very stable. Experience replay can be understood as a fixed-size memory (its size being a hyperparameter) from which random mini-batches are sampled as input to the neural network. This breaks the similarity between consecutive training samples, ensuring a degree of diversity that helps drive the network towards a good minimum. Besides, the replay memory also makes the training process more similar to supervised learning, which simplifies debugging and testing of the algorithm.
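As a rough illustration, the replay memory can be kept as a fixed-size buffer from which random mini-batches are drawn. This is a minimal, self-contained sketch; the buffer size, batch size, and placeholder arrays are illustrative, not the exact values we used.

import random
from collections import deque

import numpy as np

REPLAY_MEMORY = 50000   # maximum number of stored transitions (hyperparameter)
BATCH_SIZE = 32         # mini-batch size sampled for each update

# a fixed-size buffer: the oldest transitions are discarded automatically
replay = deque(maxlen=REPLAY_MEMORY)

# each transition is stored as (state, action, reward, next_state, alive);
# the arrays below are placeholders for the stacked game frames
state = np.zeros((1, 80, 80, 4))
next_state = np.zeros((1, 80, 80, 4))
replay.append((state, 0, 0.1, next_state, True))

# once enough transitions are stored, train on a random mini-batch
if len(replay) >= BATCH_SIZE:
    minibatch = random.sample(replay, BATCH_SIZE)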
Exploration vs. Exploitation:
This is a strategy used in RL to decide how many iterations the agent spends exploiting its currently learned model versus exploring new possibilities, so as to encourage convergence to a global optimum rather than a local one.
Initially, the agent performs a lot of exploration, trying various actions and observing the rewards they yield. As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. Q-learning incorporates exploration as part of the algorithm. However, this exploration is greedy: it settles on the first effective strategy it finds. A simple and effective fix is to introduce a hyperparameter ε that determines the probability of choosing exploration over exploitation (the ε-greedy policy).
Implementation:
Let’s now dive into the implementation of these models.
Q-Learning:
We discretized the state space, quantizing the horizontal and vertical distances from the pipe into 5x5 grid cells for faster learning, as sketched below.
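A minimal sketch of how such a discretization might look (the function name and the interpretation of 5x5 as 5-pixel-wide bins are our assumptions here, not the exact implementation):

def discretize(xdist_raw, ydist_raw, vely):
    # snap the raw pixel distances to 5x5 grid cells (assumed bin size)
    xdist = int(xdist_raw) - (int(xdist_raw) % 5)
    ydist = int(ydist_raw) - (int(ydist_raw) % 5)
    return [xdist, ydist, int(vely)]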
After every timestep, store the transition into moves (the list of all transitions of the current game).
# store the transition from previous state to current state
state = [xdist,ydist,vely]
self.moves.append([self.previous_state,self.previous_action,state,0])
self.previous_state = state
Rewarded States:
- Crash
- Penultimate state
- Last Flap
- Alive
- Crossing
This corresponds to the main training function to update the Q-values. Take a look at the comments.
# reverse the list of moves (last to first)
history = list(reversed(self.moves))
# flag corresponding to the last state
first = True
# flag corresponding to the penultimate state
second = True
# flag corresponding to the last jump before crash
jump = True
if history[0][1] < 69:
    jump = False
for move in history:
    [x, y, v] = move[0]
    action = move[1]
    [x1, y1, z1] = move[2]
    reward = move[3]
    # penalize the last 2 states before the crash
    if first or second:
        reward = -1000000
        if first:
            first = False
        else:
            second = False
    # penalize the last jump before the crash
    if jump and action:
        reward = -1000000
        jump = False
    # update the Q-value
    self.qvalues[x, y, v, action] = (1 - self.learning_rate) * self.qvalues[x, y, v, action] + \
        self.learning_rate * (reward + self.discount_factor * max(self.qvalues[x1, y1, z1, 0], self.qvalues[x1, y1, z1, 1]))
self.moves = []
We also implemented the ε-greedy policy in the Q-Learning model; a related snippet is shown in the DQN section below.
Deep Q Network:
Game Setup (changed for faster convergence):
- Removed (blackened) the dynamic colourful background
- Removed the score readout, keeping each frame simple and clear
- Removed the sound effects
- Removed the base of the pipes (same for every frame, thus redundant)
- Fixed the bird colour (as opposed to randomly assigned colour previously)
Image Preprocessing:
import skimage.color
import skimage.exposure
import skimage.transform

def process(input):
    # convert the input frame from RGB to greyscale
    image = skimage.color.rgb2gray(input)
    # resize the image from 288x404 to 80x80
    image = skimage.transform.resize(image, (80, 80), mode='constant')
    # stretch/shrink the intensity levels to the full (0, 255) range
    image = skimage.exposure.rescale_intensity(image, out_range=(0, 255))
    # scale the pixel values down to (0, 1)
    image = image / 255.0
    return image
State: Stack of 4 preprocessed images
# preprocess the image and stack to 80x80x4 pixels
image1 = process(image1)
image1 = image1.reshape(1, image1.shape[0], image1.shape[1], 1) #1x80x80x1
input_image1 = np.append(image1, input_image[:, :, :, :3], axis=3)
Network Architecture:
We used a convolutional neural network for higher-level feature learning during training.
Take a look at this Wikipedia link in case you're not familiar with CNNs.
Input: State (stack of 4 images (80x80 size))
from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten

model = Sequential()
model.add(Conv2D(32, (8, 8), padding='same', strides=(4, 4), input_shape=(80, 80, 4)))
model.add(Activation('relu'))
model.add(Conv2D(64, (4, 4), padding='same', strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), padding='same', strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(num_actions))
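Since train_on_batch (used below) requires a compiled model, the network also needs a loss and an optimizer. A minimal sketch, assuming the Adam optimizer and a mean-squared-error loss (a common choice for DQNs, not necessarily our exact configuration):

from keras.optimizers import Adam

# compile the Q-network: MSE between predicted and target Q-values
adam = Adam(lr=1e-4)
model.compile(loss='mse', optimizer=adam)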
The very first state is initialized by stacking 4 copies of the same frame, since four distinct frames are not yet available.
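A minimal sketch of this initialization, assuming image0 is the first raw frame returned by the game and np is NumPy as in the other snippets:

# preprocess the first frame and stack four copies of it to form the initial state
image0 = process(image0)                                           # 80x80
input_image = np.stack((image0, image0, image0, image0), axis=2)   # 80x80x4
input_image = input_image.reshape(1, 80, 80, 4)                    # 1x80x80x4, as the network expects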
The bird takes action according to the ε-greedy policy as shown below.
# get an action (epsilon-greedy policy)
if random.random() <= epsilon:
    action = random.randrange(num_actions)
else:
    q = model.predict(input_image)
    action = np.argmax(q)
The ε value is annealed linearly from 0.1 to 0.001 over 3,000,000 timesteps.
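A minimal sketch of this annealing schedule (the constant names are illustrative):

INITIAL_EPSILON = 0.1
FINAL_EPSILON = 0.001
EXPLORE = 3000000   # number of timesteps over which epsilon is annealed

# after every timestep, shrink epsilon linearly until it reaches its final value
if epsilon > FINAL_EPSILON:
    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE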
Given the chosen action, game.next(action) returns the next frame of the game along with the score, the reward, and a flag indicating whether the bird is still alive.
# take selected action and get resultant state
image1, score, reward, alive = game.next(action)
Preprocess this image as mentioned above to get the next state.
Note: Add the first few thousand transitions (3,200 in our case) to the replay buffer without updating the model; this ensures a minimal number of transitions are available in the buffer before training actually starts.
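A minimal sketch of this warm-up logic, reusing the replay buffer from the Experience Replay sketch above (OBSERVE and timestep are illustrative names):

OBSERVE = 3200   # number of transitions collected before any model update

# always store the latest transition in the replay buffer
replay.append((input_image, action, reward, input_image1, alive))

# only start sampling minibatches and updating the network after the warm-up
train_now = timestep > OBSERVE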
Add the above transition to the replay buffer and update the model as shown below.
# sample a minibatch of size 32 from replay memory
minibatch = random.sample(replay, 32)
s, a, r, s1, alive = zip(*minibatch)
s = np.concatenate(s)
s1 = np.concatenate(s1)
targets = model.predict(s)
targets[range(32), a] = r + discount*np.max(model.predict(s1), axis=1)*alive
loss += model.train_on_batch(s, targets)
Results and Inferences:
We plotted some graphs to examine the robustness of our models.
Simple Q-Learning:
Left Plot:
- Reward: +2, 1, -1000
- Training time: approx 33 hrs
- Average Score: 185 (last 100 iterations)
Right Plot:
- Reward: +5, 1, -100000 (harsher)
- Training time: approx 16 hrs
- Average Score: 760 (last 100 iterations)
From the above plots, we can infer that the harsher reward policy gives better results with much less training.
Q-Learning (ε-greedy policy):
- Reward: +2, 1, -1000
- Training time: approx 12 hrs
- Average Score: 106 (last 100 iterations)
Compared with the earlier plot that used the same reward policy, convergence here is considerably faster (~20,000 iterations versus ~32,000). However, this model converges to a lower average score than the previous one, in direct contradiction to the motivation for the ε-greedy strategy stated above for the DQN model. We don't have a conclusive explanation for this observation.
Deep Q-Network:
- Reward: +1, 0, -1
- Training: approx 13 hrs on GPU (for ~0.15M timesteps)
- Average Score (trained model): 26
Some references suggest that training DQN models for ~3M timesteps can give great results. However, due to limited time and computational resources, we couldn't train this model long enough to reproduce those results.
GPU Access:
The experiments with DQN were run on a p2.xlarge AWS EC2 Instance for the Mumbai region using the Deep Learning AMI (Ubuntu).
I have written another blog post on how to get access to AWS with $150 of free credits if you're an enrolled university student with a university-linked email address and ID card.
Acknowledgement:
This work wouldn’t have been possible without the contribution and immense support of the other groupmate, Sneha Bhakare.
I would also like to thank Prof. Amit Sethi (Course Instructor: EE 769) and the Course TAs, Khushhall Chandra and Deepak Anand, for their invaluable guidance.
Github Repository:
To reproduce the models you can refer to our project repository.
References:
- Raw Environment: https://github.com/sourabhv/FlapPyBird
- Mnih et al., Human-level Control through Deep Reinforcement Learning. Nature 518, 529–533, 2015.
- Blog: Using Keras and Deep Q-Network to Play FlappyBird
- Github Repos: DeepLearningFlappyBird
- Deep Reinforcement Learning: Lecture by Stanford University
- Deep Q-Networks: Lecture by Volodymyr Mnih
- Wikipedia
- The invaluable Internet