How I built an AI to play Dino Run
Artificial Intelligence faces more problems than it can currently solve, and one of them is learning to act in an environment for which no data set exists.
Update: After some modifications and a GPU-backed VM, I was able to improve the score to 4000. Please refer to this article for details.
A 2013 publication by DeepMind titled ‘Playing Atari with Deep Reinforcement Learning’ introduced a new deep learning model for reinforcement learning and demonstrated its ability to master difficult control policies for Atari 2600 games using only raw pixels as input. My project was inspired by a few implementations of this paper. I will try to explain the basics of Reinforcement Learning and dive deep into the code snippets for a hands-on understanding.
Before we begin, as a prerequisite, I’m assuming you have basic knowledge of supervised deep learning and Convolutional Neural Networks, which are essential for understanding the project. Feel free to skip to the code section if you’re already familiar with Reinforcement Learning and Q-learning.
A child learning to walk
Reinforcement Learning (RL) might be a new term for many, but each and every one of us has learned to walk using its core concept, and this is how our brain still works. A reward system is the basis for any RL algorithm. Going back to the analogy of a child learning to walk, a positive reward would be a clap from the parents or the ability to reach a candy, and a negative reward would be, say, no candy. The child first learns to stand up before starting to walk. In terms of Artificial Intelligence, the main aim of an agent, in our case the Dino, is to maximize a certain numeric reward by performing a particular sequence of actions in the environment. The biggest challenge in RL is the absence of supervision (labelled data) to guide the agent; it must explore and learn on its own. The agent starts by performing random actions, observing the reward each action brings, and learns to predict the best possible action when faced with a similar state of the environment.
We use Q-learning, a technique of RL, where we try to approximate a special function which drives the action-selection policy for any sequence of environment states.
Q-learning is a model-free implementation of Reinforcement Learning where a table of Q-values is maintained for each state, the action taken and the resulting reward. A sample Q-table should give us an idea of how the data is structured. In our case, the states are game screenshots and the two actions are do nothing and jump, indexed [0, 1].
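To make the structure concrete, here is a toy Q-table as a plain Python dictionary. The state names and values are made up purely for illustration; in the real project a state is a stack of processed game frames, which is exactly why a literal table becomes infeasible and we approximate Q with a network instead.

```python
# Toy Q-table: one row per state, one Q-value per action.
# States and values are invented for illustration only.
q_table = {
    "cactus_far":  {"do_nothing": 0.8,  "jump": 0.2},
    "cactus_near": {"do_nothing": -1.0, "jump": 0.9},
    "clear_road":  {"do_nothing": 1.0,  "jump": 0.4},
}

def best_action(state):
    """Pick the action with the highest Q-value for a given state."""
    return max(q_table[state], key=q_table[state].get)

print(best_action("cactus_near"))  # -> jump
```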
We take advantage of Deep Neural Networks to solve this problem through regression, and choose the action with the highest predicted Q-value. To know more about Q-learning, please refer to the Reading section at the end.
A vanilla Reinforcement Learning implementation has a few problems, for which we introduce additional mechanisms that help the model learn better.
The absence of labelled data makes training with RL very unstable. To create our own data, we let the model play the game randomly for a few thousand steps and record each state, action and reward. We then train our model on batches randomly chosen from these experience replays.
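A replay memory can be sketched in a few lines with a bounded deque. The capacity and batch size below are assumptions for illustration, not the article's exact settings; the key idea is that sampling random mini-batches breaks the correlation between consecutive frames.

```python
import random
from collections import deque

REPLAY_MEMORY = 50000   # capacity (assumed value; tune to available RAM)
BATCH = 32              # mini-batch size (assumed value)

# a bounded deque drops the oldest experience automatically once full
memory = deque(maxlen=REPLAY_MEMORY)

def remember(state, action, reward, next_state, terminal):
    """Store one (s, a, r, s', done) transition."""
    memory.append((state, action, reward, next_state, terminal))

def sample_batch():
    """Random mini-batch; decorrelates consecutive frames for stable training."""
    return random.sample(list(memory), min(BATCH, len(memory)))
```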
The Exploration vs Exploitation problem arises when our model tends to stick to the same actions while learning; in our case, the model might learn that jumping gives a better reward than doing nothing and settle into an always-jump policy. However, we would like our model to try out random actions while learning, which may uncover better rewards. We introduce ɛ, which decides the randomness of actions. We gradually decay its value to reduce the randomness as training progresses, and then exploit the rewarding actions.
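An ɛ-greedy selector with linear decay can be sketched as below. The three constants are assumed values for illustration, not the article's exact hyperparameters.

```python
import random

# Assumed values for illustration, not the article's exact settings.
INITIAL_EPSILON = 0.1     # starting probability of a random action
FINAL_EPSILON = 0.0001    # floor reached after annealing
EXPLORE = 100000          # number of steps over which epsilon decays

epsilon = INITIAL_EPSILON

def choose_action(q_values):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    global epsilon
    if random.random() <= epsilon:
        action = random.randrange(len(q_values))                       # explore
    else:
        action = max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
    if epsilon > FINAL_EPSILON:  # anneal epsilon linearly towards its floor
        epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
    return action
```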
The Credit Assignment problem can confuse the model about which past action was responsible for the current reward. The Dino cannot jump again while mid-air and might crash into a cactus even though our model predicted a jump; the negative reward was in fact the result of a previously taken stray jump, not the current action. We introduce the Discount Factor γ, which decides how far into the future our model looks while taking an action, and thus solves the credit assignment problem indirectly. In our case, with γ = 0.99, the model learned that stray jumps inhibit its ability to jump in the future.
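The discounted target used during training is the standard Bellman target, which is what lets the crash penalty propagate back to the jump that caused it. A minimal sketch:

```python
import numpy as np

GAMMA = 0.99  # discount factor, the value used in the project

def q_target(reward, next_q_values, terminal):
    """Bellman target: r if the episode ended, else r + gamma * max Q(s').

    Because the target at each step folds in the best value of the next
    state, a crash's -1 reward leaks backwards into the stray jump that
    made the crash unavoidable.
    """
    if terminal:
        return float(reward)
    return float(reward) + GAMMA * float(np.max(next_q_values))
```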
A few additional parameters that we will be using later.
- Python 3.6
- Chromium driver for Selenium
GAME PLAY FRAMEWORK
Now that we have a basic understanding of what we are doing, let’s implement things with python.
Dino Run, or T-Rex Run, is an endless-runner game in the Chrome browser that is available to play when you’re offline, aka ‘the game you don’t usually like to see’.
You can, however, launch the game by pointing your browser to chrome://dino or just by pulling the network plug. An alternative approach is to extract the game from the open-source Chromium repository, since we intend to modify it for faster learning.
Selenium, a popular browser automation tool, was used to send actions to the browser and get different game parameters like current score.
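The interface can be sketched as a thin wrapper class. The `Runner.instance_` JavaScript names below are taken from the game's internals in the Chromium source and are an assumption that may change between versions; the driver is injected, so any object exposing `execute_script` (a real Selenium `webdriver.Chrome`, or a stub) will work.

```python
class Game:
    """Thin wrapper around a Selenium WebDriver for the Dino game.

    The JavaScript snippets poke the game's internal `Runner` object;
    those names come from the Chromium source and may change between
    versions, so treat them as assumptions.
    """

    def __init__(self, driver):
        # any object with an execute_script(script) method works here
        self._driver = driver

    def press_up(self):
        # trigger a jump directly instead of synthesizing a key press
        self._driver.execute_script("Runner.instance_.tRex.startJump(0)")

    def get_crashed(self):
        # True once the Dino has hit an obstacle
        return self._driver.execute_script("return Runner.instance_.crashed")

    def restart(self):
        self._driver.execute_script("Runner.instance_.restart()")
```

With a real browser you would pass `selenium.webdriver.Chrome()` as the driver after loading the game page.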
Now that we have an interface to send actions to the game, we need a mechanism to capture the game screen. It turns out Selenium is capable of capturing screenshots, but it is very slow: a single frame took around 1 second to capture and process.
PIL and OpenCV gave the best performance for screen capture and image pre-processing respectively, achieving a decent frame rate of 5 fps.
Now you may think 5 fps is too low, but trust me, it’s enough for playing this game. We are actually going to use 4 frames per time-step, just enough to teach the model to infer the speed of the Dino.
Dino Agent Module
This module controls our Dino with the help of the game module. It has a few additional methods to check the agent’s state.
Game State Module
This module is used directly by the Network for performing actions and getting new states.
The original game has many features: the Dino, variable game speed, multiple obstacle types, clouds, stars, ground textures, etc. Learning all of them at once would consume a lot of time and might even introduce unwanted noise during training. I modified the game’s source code to get rid of a few visual elements, including the high score, the game-over panel and the clouds. I limited the obstacles to a single type of cactus and kept the speed of the runner constant.
The raw image captured has a resolution of around 1200 x 300 with 3 channels. We intend to use 4 consecutive screenshots as a single input to the model, which makes our single input 1200x300x3x4 in size. I only have an i7 CPU with no GPU, which cannot handle inputs of this size while playing the game at the same time. So I used the OpenCV library to resize, crop and process the image. The final processed input was just 40x20 pixels, single channel, with only the edges highlighted using Canny edge detection.
Then we can stack 4 images to create a single input. The final input dimensions are 40x20x4. Note that we have cropped the agent out because we don’t need to learn the agent’s features, only the obstacles and the distance from the edge.
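Stacking and sliding the 4-frame window is a couple of numpy calls; the helper names here are my own for illustration.

```python
import numpy as np

def stack_frames(frames):
    """Stack 4 consecutive processed frames into one (20, 40, 4) input."""
    assert len(frames) == 4
    return np.stack(frames, axis=-1)

def push_frame(stacked, new_frame):
    """Slide the window: drop the oldest frame, append the newest."""
    return np.append(stacked[..., 1:], new_frame[..., np.newaxis], axis=-1)
```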
So we have the input and a way to use the model’s output to play the game; now let’s look at the model architecture.
We use a series of three Convolutional layers, flattened into a Dense layer of 512 neurons. Don’t be surprised by the missing pooling layers. Pooling is really useful in image-classification problems like ImageNet, where the network should be insensitive to the location of an object. In our case, however, we care about the position of the obstacles.
Our output has a shape equal to the number of possible actions. The model predicts a Q-value, also known as the discounted future reward, for each action, and we choose the one with the highest value. The method below returns a model built using Keras with TensorFlow as the back-end.
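A minimal sketch of such a builder is below. The filter sizes and strides follow the DeepMind DQN paper; the learning rate and exact input shape are assumptions, not necessarily the article's values.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

def build_model(input_shape=(20, 40, 4), n_actions=2, lr=1e-4):
    """Three conv layers (no pooling) flattened into Dense(512).

    Filter sizes/strides follow the DQN paper; the learning rate and
    input shape are illustrative assumptions.
    """
    model = Sequential([
        Conv2D(32, (8, 8), strides=4, padding="same", activation="relu",
               input_shape=input_shape),
        Conv2D(64, (4, 4), strides=2, padding="same", activation="relu"),
        Conv2D(64, (3, 3), strides=1, padding="same", activation="relu"),
        Flatten(),
        Dense(512, activation="relu"),
        Dense(n_actions),  # one linear Q-value per action (regression)
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=lr))
    return model
```

The final layer has no activation because Q-values are unbounded real numbers, which is also why the loss is mean squared error rather than cross-entropy.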
LET THE TRAINING BEGIN
The real magic happens here
This module is the main training loop. The code is self-explanatory with comments. Here are a few things that happen during the game:
- Start with no action and get the initial state (s_t)
- Observe game-play for OBSERVATION number of steps
- Predict and perform an action
- Store experience in Replay Memory
- Choose a batch randomly from Replay Memory and train model on it during training phase
- Restart if game over
We train the model on batches randomly chosen from the Replay Memory.
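The steps above can be sketched as one compact loop. This is a sketch, not the project's actual module: `env` and `model` are assumed interfaces (`env.reset() -> state`, `env.step(action) -> (next_state, reward, done)`, `model.predict(batch) -> Q-values`, `model.train_on_batch(x, y)`), and the constants are illustrative.

```python
import random
from collections import deque

import numpy as np

def train_loop(env, model, steps=1000, observe=100, batch_size=32,
               gamma=0.99, epsilon=0.1):
    """Sketch of the main DQN loop under assumed env/model interfaces."""
    memory = deque(maxlen=50000)
    state = env.reset()
    for t in range(steps):
        # 1. pick an action: random during observation or with prob. epsilon,
        #    otherwise greedy over the predicted Q-values (2 actions assumed)
        if t < observe or random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = int(np.argmax(model.predict(state[None])[0]))
        # 2. act and store the experience in Replay Memory
        next_state, reward, done = env.step(action)
        memory.append((state, action, reward, next_state, done))
        # 3. after the observation phase, train on a random mini-batch
        if t >= observe and len(memory) >= batch_size:
            batch = random.sample(list(memory), batch_size)
            xs = np.stack([s for s, *_ in batch])
            targets = model.predict(xs)
            next_qs = model.predict(np.stack([ns for _, _, _, ns, _ in batch]))
            for i, (_, a, r, _, d) in enumerate(batch):
                # Bellman target: r for terminal states, else r + gamma*max Q(s')
                targets[i][a] = r if d else r + gamma * np.max(next_qs[i])
            model.train_on_batch(xs, targets)
        # 4. restart if game over
        state = env.reset() if done else next_state
    return memory
```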
We can launch the entire training process by calling the method below.
I trained my model for around 2 million frames over a week. The first million steps were used for fine-tuning the game parameters and fixing bugs. The last million training frames showed improvement in game scores, reaching a maximum score of 265 so far. We can observe that the loss has stabilized over the last million steps and stays low with minute fluctuations.
CURRENT LIMITATIONS & NOISE
Even though the model starts to perform well compared to the initial steps, a few limitations hamper its ability to learn faster and score better. The Dino does not score high consistently because of random frame drops that occur while learning on a CPU-only system. Moreover, the small image size (40x20), coupled with the current model architecture, might lead to some loss of features and slower learning.
The current model was trained only on a CPU, for which some game features were stripped. GPU training would give a consistent frame rate while accommodating more features. This model, however, sets a base for future work, where the learning can be transferred to other models, minimizing the initial learning noise in a new environment.
This project’s source code is available on GitHub under the MIT License. Feel free to use, modify or contribute.
A CPU-based implementation repository: ravi72munde/Chrome-Dino-Reinforcement-Learning, an RL implementation in Keras.
A GPU-based implementation repository: DinoRunTutorial, the accompanying code for the Paperspace tutorial "Build an AI to play Dino Run".
- Introduction to Reinforcement Learning
- Reinforcement learning using CNN
- Demystifying Deep Reinforcement Learning
- Q-learning Introduction
- Q-learning and exploration
The sole motivation of this article is to learn Artificial Intelligence and its application in the real world. Using this as a proof of concept, we intend to explore and tackle more problems with a similar approach.
I aim to make this a living document, so any updates and suggested changes can always be included. Feedback is welcome. If you enjoyed reading this article, please share your views in the comments section below to show your support.
Disclaimer: All the code in this, as well as any other document discussing this work, is released under the MIT License (https://opensource.org/licenses/MIT). The content in this, as well as any other document linked with this project, is released under the Creative Commons Attribution 3.0 License (https://creativecommons.org/licenses/by/3.0/us/).
Message from Vimarsh: I get a lot of questions about building an AI portfolio, and this article is a suitable start for one. Ravi brings his interesting perspective to help our readers. Ravi is an Information Systems graduate from Northeastern University, Boston, and a budding Data Scientist with a keen interest in Machine Learning and Artificial Intelligence. His perspective is invaluable for helping folks build an AI portfolio.
Subscribe to our Acing AI newsletter; I promise not to spam, and it’s FREE!
Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.