Using Reinforcement Learning techniques to build an AI bot for the game Flappy Bird
Authors: Sneha Bhakare, Videsh Suman
Note: This work was submitted in a group of two, to fulfil the requirements of the course Introduction to Machine Learning (EE 769) at IIT Bombay.
Overview:
Flappy Bird is a very popular and addictive game where the principal challenge is to consistently steer the bird through the gaps between the pipes.
You can try your skills here: http://flappybird.io/
Learning to play games is one of the most actively researched topics in AI today. Solving such problems with game theory, search algorithms, etc. requires careful domain-specific feature definitions, which makes those approaches hard to scale. The goal here is to develop a more general framework that learns game-specific features and solves the problem.
“Logic will get you from A to B. Imagination will take you everywhere.” ~ Albert Einstein
We chose this game because of its difficulty for an average human to play.
The goal of this project is to explore the simple Q-learning technique and the Deep Q-Network (DQN) to build a learned agent that can play the game on its own for as long as possible.
Let’s understand some theoretical concepts before the implementation.
Reinforcement Learning:
An aspect of Artificial Intelligence where an agent learns how to behave in an environment by performing actions and learning from the corresponding rewards
Reinforcement learning differs from standard supervised learning in the sense that correct input/output pairs need not be presented and sub-optimal actions need not be explicitly corrected. Instead, the focus is on sequential decision making: given the current input, you make a decision, and the next input depends on that decision. In supervised learning, the decisions you make, whether in a batch setting or an online setting, do not affect what you see in the future.
You might want to read this Wikipedia link about Markov Decision Processes (MDP) before moving ahead.
Q-Learning:
It is a technique that decides which action to take based on an action-value function, which estimates the value of being in a certain state and taking a certain action in that state.
As Wikipedia describes it,
Before learning begins, Q is initialised to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t the agent selects an action a_t, observes a reward r_t, enters a new state s_t+1 (that may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a simple value iteration update, using the weighted average of the old value and the new information.
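In symbols, this update rule is the standard Q-learning formulation (consistent with the update code later in this post, where α is the learning rate and γ the discount factor):

Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)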
In our context, each state can be defined using 3 independent parameters:
- horizontal distance of the bird from the pipe
- vertical distance of the bird from the pipe
- vertical speed of the bird
There are only two legal actions: flap, or do nothing.
Intuitive Rewards:
- Crashing — Negative (harsh)
- Surviving — Minimally positive
- Crossing a pipe — Positive
Goal: To maximise the long term reward
Deep Q-Network (DQN):
This learning algorithm was first developed by Google DeepMind to play Atari 2600 video games (2015). They demonstrated how an AI based model learned to play those games by observing just the screen pixels and receiving a reward when the game score incremented. The results were quite remarkable and demonstrated that such an algorithm is generic enough to play various games.
Here, the Q-function can be represented with a neural network (since neural networks are exceptionally good at learning features from highly structured data) that takes the state (the last four game screens stacked together) as input and returns the Q-value of each possible action. The advantage of this approach is that to perform a Q-value update or pick the action with the highest Q-value, we only need one forward pass through the network, which gives us the Q-values of all actions immediately.
Experience Replay: This is arguably the most important trick for getting the network to converge with more stable learning. Due to the stochastic nature of gameplay, approximating Q-values with non-linear functions (neural networks) is not very stable. Experience replay can be understood as a fixed-size memory (its size being a hyperparameter) from which random mini-batches are sampled as input to the neural network. This breaks the similarity between consecutive training samples, ensuring a degree of diversity that helps drive the network towards a good minimum. Besides, the replay memory also makes the training process more similar to supervised learning, which simplifies debugging and testing of the algorithm.
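As a rough illustration, the replay memory can be kept as a fixed-size buffer from which random mini-batches are drawn. This is a minimal, self-contained sketch; the buffer size, batch size, and placeholder arrays are illustrative, not the exact values we used.

import random
from collections import deque

import numpy as np

REPLAY_MEMORY = 50000   # maximum number of stored transitions (hyperparameter)
BATCH_SIZE = 32         # mini-batch size sampled for each update

# a fixed-size buffer: the oldest transitions are discarded automatically
replay = deque(maxlen=REPLAY_MEMORY)

# each transition is stored as (state, action, reward, next_state, alive);
# the arrays below are placeholders for the stacked game frames
state = np.zeros((1, 80, 80, 4))
next_state = np.zeros((1, 80, 80, 4))
replay.append((state, 0, 0.1, next_state, True))

# once enough transitions are stored, train on a random mini-batch
if len(replay) >= BATCH_SIZE:
    minibatch = random.sample(replay, BATCH_SIZE)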
Exploration vs. Exploitation:
This is a strategy used in RL to decide how many iterations the agent spends exploiting its currently learned model versus exploring new possibilities, so as to encourage convergence to a global optimum rather than a local one.
Initially, the agent performs a lot of exploration, trying various actions and observing the rewards they yield. As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. Q-learning incorporates exploration as part of the algorithm. However, this exploration is greedy: it settles on the first effective strategy it finds. A simple and effective fix is to introduce a hyperparameter ε that determines the probability of choosing exploration over exploitation (the ε-greedy policy).
Implementation:
Let’s now dive into the implementation of these models.
Q-Learning:
We discretized the state space, quantizing the horizontal and vertical distances from the pipe into 5x5 grid cells for faster learning, as sketched below.
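A minimal sketch of how such a discretization might look (the function name and the interpretation of 5x5 as 5-pixel-wide bins are our assumptions here, not the exact implementation):

def discretize(xdist_raw, ydist_raw, vely):
    # snap the raw pixel distances to 5x5 grid cells (assumed bin size)
    xdist = int(xdist_raw) - (int(xdist_raw) % 5)
    ydist = int(ydist_raw) - (int(ydist_raw) % 5)
    return [xdist, ydist, int(vely)]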
After every timestep, store the transition into moves (the list of all transitions of the current game).
# store the transition from previous state to current state
state = [xdist,ydist,vely]
self.moves.append([self.previous_state,self.previous_action,state,0])
self.previous_state = state
Rewarded States:
- Crash
- Penultimate state
- Last Flap
- Alive
- Crossing
This corresponds to the main training function to update the Q-values. Take a look at the comments.
# reverse the list of moves (last to first)
history = list(reversed(self.moves))
# flag corresponding to the last state
first = True
# flag corresponding to the penultimate state
second = True
# flag corresponding to the last jump before crash
jump = True
if history[0][1] < 69:
    jump = False
for move in history:
    [x, y, v] = move[0]
    action = move[1]
    [x1, y1, z1] = move[2]
    reward = move[3]
    # penalize the last 2 states before the crash
    if first or second:
        reward = -1000000
        if first:
            first = False
        else:
            second = False
    # penalize the last jump before the crash
    if jump and action:
        reward = -1000000
        jump = False
    # update the Q-value
    self.qvalues[x, y, v, action] = (1 - self.learning_rate) * self.qvalues[x, y, v, action] + \
        self.learning_rate * (reward + self.discount_factor * max(self.qvalues[x1, y1, z1, 0], self.qvalues[x1, y1, z1, 1]))
self.moves = []
We also implemented the ε-greedy policy in the Q-Learning model; a related snippet is shown in the DQN section below.
Deep Q Network:
Game Setup (changed for faster convergence):
- Removed (blackened) the dynamic colourful background
- Removed the score readout, keeping each frame simple and clear
- Removed the sound effects
- Removed the base of the pipes (same for every frame, thus redundant)
- Fixed the bird colour (as opposed to randomly assigned colour previously)
Image Preprocessing:
import skimage.color
import skimage.exposure
import skimage.transform

def process(input):
    # convert the input frame from RGB to greyscale
    image = skimage.color.rgb2gray(input)
    # resize the image from 288x404 to 80x80
    image = skimage.transform.resize(image, (80, 80), mode='constant')
    # stretch/shrink the intensity levels to the full (0, 255) range
    image = skimage.exposure.rescale_intensity(image, out_range=(0, 255))
    # scale the pixel values down to (0, 1)
    image = image / 255.0
    return image
State: Stack of 4 preprocessed images
# preprocess the image and stack to 80x80x4 pixels
image1 = process(image1)
image1 = image1.reshape(1, image1.shape[0], image1.shape[1], 1) #1x80x80x1
input_image1 = np.append(image1, input_image[:, :, :, :3], axis=3)
Network Architecture:
We used a convolutional neural network for higher-level feature learning during training.
Take a look at this Wikipedia link in case you're not familiar with CNNs.
Input: State (stack of 4 images (80x80 size))
from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten

model = Sequential()
model.add(Conv2D(32, (8, 8), padding='same', strides=(4, 4), input_shape=(80, 80, 4)))
model.add(Activation('relu'))
model.add(Conv2D(64, (4, 4), padding='same', strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), padding='same', strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(num_actions))
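Since train_on_batch (used below) requires a compiled model, the network also needs a loss and an optimizer. A minimal sketch, assuming the Adam optimizer and a mean-squared-error loss (a common choice for DQNs, not necessarily our exact configuration):

from keras.optimizers import Adam

# compile the Q-network: MSE between predicted and target Q-values
adam = Adam(lr=1e-4)
model.compile(loss='mse', optimizer=adam)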
The very first state is initialized by stacking 4 copies of the same frame, since four distinct frames are not yet available.
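A minimal sketch of this initialization, assuming image0 is the first raw frame returned by the game and np is NumPy as in the other snippets:

# preprocess the first frame and stack four copies of it to form the initial state
image0 = process(image0)                                           # 80x80
input_image = np.stack((image0, image0, image0, image0), axis=2)   # 80x80x4
input_image = input_image.reshape(1, 80, 80, 4)                    # 1x80x80x4, as the network expects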
The bird takes action according to the ε-greedy policy as shown below.
# get an action (epsilon-greedy policy)
if random.random() <= epsilon:
    action = random.randrange(num_actions)
else:
    q = model.predict(input_image)
    action = np.argmax(q)
The ε value is annealed linearly from 0.1 to 0.001 over 3,000,000 timesteps.
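A minimal sketch of this annealing schedule (the constant names are illustrative):

INITIAL_EPSILON = 0.1
FINAL_EPSILON = 0.001
EXPLORE = 3000000   # number of timesteps over which epsilon is annealed

# after every timestep, shrink epsilon linearly until it reaches its final value
if epsilon > FINAL_EPSILON:
    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE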
Given the chosen action, game.next(action) returns the next frame of the game along with the score, the reward, and a flag indicating whether the bird is still alive.
# take selected action and get resultant state
image1, score, reward, alive = game.next(action)
Preprocess this image as mentioned above to get the next state.
Note: Add the first few thousand transitions (3,200 in our case) to the replay buffer without updating the model; this ensures a minimal number of transitions are available in the buffer before training actually starts.
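A minimal sketch of this warm-up logic, reusing the replay buffer from the Experience Replay sketch above (OBSERVE and timestep are illustrative names):

OBSERVE = 3200   # number of transitions collected before any model update

# always store the latest transition in the replay buffer
replay.append((input_image, action, reward, input_image1, alive))

# only start sampling minibatches and updating the network after the warm-up
train_now = timestep > OBSERVE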
Add the above transition to the replay buffer and update the model as shown below.
# sample a minibatch of size 32 from replay memory
minibatch = random.sample(replay, 32)
s, a, r, s1, alive = zip(*minibatch)
s = np.concatenate(s)
s1 = np.concatenate(s1)
targets = model.predict(s)
targets[range(32), a] = r + discount*np.max(model.predict(s1), axis=1)*alive
loss += model.train_on_batch(s, targets)
Results and Inferences:
We plotted some graphs to examine the robustness of our models.
Simple Q-Learning:
Left Plot:
- Reward: +2, 1, -1000
- Training time: approx 33 hrs
- Average Score: 185 (last 100 iterations)
Right Plot:
- Reward: +5, 1, -100000 (harsher)
- Training time: approx 16 hrs
- Average Score: 760 (last 100 iterations)
From the above plots, we can infer that the harsher reward policy gives better results with much less training.
Q-Learning (ε-greedy policy):
- Reward: +2, 1, -1000
- Training time: approx 12 hrs
- Average Score: 106 (last 100 iterations)
Compared with the earlier plot that used the same reward policy, convergence here is considerably faster (~20,000 iterations versus ~32,000). However, this model converges to a lower average score than the previous one, in direct contradiction to the motivation for the ε-greedy strategy stated above for the DQN model. We don't have a conclusive explanation for this observation.
Deep Q-Network:
- Reward: +1, 0, -1
- Training: approx 13 hrs on GPU (for ~0.15M timesteps)
- Average Score (trained model): 26
Some references suggest that training DQN models for ~3M timesteps can give great results. However, due to limited time and computational resources, we couldn't train this model long enough to reproduce those results.
GPU Access:
The experiments with DQN were run on a p2.xlarge AWS EC2 Instance for the Mumbai region using the Deep Learning AMI (Ubuntu).
I have written another blog post on how to get access to AWS with $150 of free credits if you're an enrolled university student with a university-linked email address and ID card.
Acknowledgement:
This work wouldn’t have been possible without the contribution and immense support of the other groupmate, Sneha Bhakare.
I would also like to thank Prof. Amit Sethi (Course Instructor: EE 769) and the Course TAs, Khushhall Chandra and Deepak Anand, for their invaluable guidance.
Github Repository:
To reproduce the models you can refer to our project repository.
References:
- Raw Environment: https://github.com/sourabhv/FlapPyBird
- Mnih et al., Human-level Control through Deep Reinforcement Learning. Nature 518, 529–533, 2015.
- Blog: Using Keras and Deep Q-Network to Play FlappyBird
- Github Repos: DeepLearningFlappyBird
- Deep Reinforcement Learning: Lecture by Stanford University
- Deep Q-Networks: Lecture by Volodymyr Mnih
- Wikipedia
- The invaluable Internet