How to play Google Chrome Dino game using reinforcement learning

Iustina Ivanova
Deelvin Machine Learning
Aug 24, 2021 · 8 min read

This article discusses the main principles of reinforcement learning (RL) and demonstrates how to automate the well-known Google Chrome Dino game using the Python-based Stable-Baselines framework.

We approach the RL problem using the Markov decision process [1]. RL is a type of machine learning involving an agent, the states this agent can ‘visit’ at given points in time, and a numerical reward gained through the agent’s actions in those states. Each action yields a reward (a number received by the agent), but the agent does not know the rewards beforehand, so it learns them through search (a trial-and-error process). An action can also lead to a next state whose subsequent rewards must be learned in turn. Thus, to achieve the maximum reward from the current state, the method must identify all possible actions and the path that guarantees this maximum reward, and, once the path is learned, perform the actions that maximize the reward. Search and the learned reward are the main features of RL.
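This interaction can be written as a loop in which the agent observes a state, picks an action, and receives a reward together with the next state; the gym interface used later in this article follows exactly this pattern. A minimal illustration with a standard gym environment (CartPole) and a random policy standing in for a learned agent, using the classic gym API as in the rest of this article:

import gym

# Illustrative agent-environment interaction loop: observe a state,
# pick an action, receive a reward and the next state.
env = gym.make("CartPole-v1")
obs = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()          # a real agent would choose here
    obs, reward, done, info = env.step(action)  # environment returns reward, next state
    total_reward += reward
print("episode reward:", total_reward)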

One of the main RL challenges is achieving a trade-off between exploration and exploitation. To maximize the reward, the agent must pick actions that it has already tried in the past and found to be effective (exploitation), but this also requires the agent to try actions that have not been tried yet (exploration). Different methodologies exist for balancing these two phases (for example, exploring a random action with a small probability, or applying a threshold on the number of steps spent in the search phase).
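For instance, an epsilon-greedy rule explores a random action with a small probability and otherwise exploits the best action found so far; a minimal sketch (the reward estimates and epsilon value are illustrative):

import random

def epsilon_greedy_action(estimated_rewards, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest estimated reward (exploitation)."""
    actions = list(estimated_rewards.keys())
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=estimated_rewards.get)  # exploit

# Illustrative reward estimates for three actions
print(epsilon_greedy_action({"do_nothing": 0.2, "jump": 0.7, "down": 0.1}))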

One of the most notable examples of RL is the game of tic-tac-toe. There are two agents in the game: the player who starts the game and the player who moves second. The reward reflects the likelihood of winning the game by placing a nought or a cross in a particular cell, and it can be learned by searching over all possible moves.
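In the classic treatment of this example in [1], each board position is assigned a value estimating the probability of winning from it, and after every move the value of the previous position is nudged toward the value of the position that followed. A minimal sketch of such an update (the board encoding and step size are illustrative):

def update_value(values, prev_state, next_state, alpha=0.1):
    """Temporal-difference update for the tic-tac-toe example:
    move the value of the previous board state a fraction alpha
    toward the value of the state reached after the move."""
    v_prev = values.get(prev_state, 0.5)  # 0.5 = unknown position
    v_next = values.get(next_state, 0.5)
    values[prev_state] = v_prev + alpha * (v_next - v_prev)
    return values

# Board states encoded as strings of X/O/. cells (illustrative)
values = {"X........": 0.5, "X...O...X": 0.6}
print(update_value(values, "X........", "X...O...X"))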

RL is often modeled as a Markov decision process, which includes the agent, the states, and the reward for each action taken from the possible states. The Markov decision process variables are:

S, a finite set of states; A, a finite set of actions; P, the transition function; R, the reward function; and gamma, the discount factor [2].
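The discount factor gamma controls how strongly future rewards contribute to the value of the current state: the return is the sum of rewards weighted by increasing powers of gamma. A small illustration (the reward sequence is made up for the example):

def discounted_return(rewards, gamma=0.99):
    """Sum of rewards r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative: the agent survives four steps (+0.1 each) and then crashes (-1)
print(discounted_return([0.1, 0.1, 0.1, 0.1, -1.0], gamma=0.99))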

In this article, we apply reinforcement learning to the Google Chrome game known as T-Rex (the Dino game). We model the game using the following schema:

Reinforcement learning variables for modeling the agent’s behavior: the agent is the dino, which can perform actions (for example, jump), and the Google Chrome screen is the environment, which allows rewards to be learned through observations.

In this game, the agent (the dinosaur) can perform three actions: jump, duck, or do nothing. The environment is the Google Chrome screen, and the reward is a number given for each step during which the game is still running, where a step is defined as several frames from the set of frames captured from the browser game window.

We will use a Deep Q-Network (DQN), which researchers have previously applied to playing Atari games [3]. The DQN in [3] is a convolutional network whose input is four consecutive images of 84x84 pixels. It has three hidden layers, and the output layer has a single output for each valid action. The number of valid actions depends on the game; in our case there are three.
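As an illustration of this architecture (not the exact network built internally by the framework we use below), a sketch of the Atari DQN from [3] written with tf.keras could look like this:

import tensorflow as tf

def build_dqn(num_actions, input_shape=(84, 84, 4)):
    """Sketch of the network from [3]: two convolutional layers,
    one fully connected hidden layer, and a linear output layer
    with one unit per valid action (the Q-value of that action)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_actions),  # one Q-value per action
    ])

model = build_dqn(num_actions=3)  # do nothing, jump, down
model.summary()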

We chose the Stable-Baselines framework (available at https://stable-baselines.readthedocs.io) to run the RL training process in Python. This library has been used to train agents for similar games, such as the Atari games, Lunar Lander, and CartPole. First, one needs to define the environment for training. Our environment essentially consists of a sequence of images generated from screenshots of the Google Chrome window. For this purpose, we use the Selenium library for Python:

from selenium import webdriver

To control the Chrome browser, Selenium requires the chromedriver executable to be inside the project folder: download the version matching your operating system at the following link, and unzip the file in the project home folder from which you will run the Python code to train the model.

The game either starts automatically when there is no internet connection or can be opened via the address chrome://dino in the Google Chrome browser.

The state, in this case, consists of four consecutive images cropped in front of the dinosaur (the initial images are 480x300 px):

The schema of generating the input for the DQN network. Input is given as four consecutive frames from the game (here, we show frames №511, №512, №513, №514)

Next, to define our own environment, we follow the instructions given on the library’s readme page. We name the environment EnvironmentChromeTRex and inherit from the default environment class:

class EnvironmentChromeTRex(gym.Env):

    def __init__(self, ...):
        # action_space: valid actions
        # set of actions: do nothing, jump, down
        self.action_space = spaces.Discrete(3)
        # observation_space: valid observations
        self.observation_space = spaces.Box()
        # reward_range: min and max possible rewards,
        # default (-inf; +inf)
Inside the __init__ method of this environment, we create additional variables to connect to the browser:

import numpy as np
import gym
from gym import spaces
from collections import deque
from selenium import webdriver

class EnvironmentChromeTRex(gym.Env):

    def __init__(self,
                 screen_width,   # width of the compressed image
                 screen_height,  # height of the compressed image
                 chromedriver_path: str = "chromedriver"
                 ):
        self.screen_width = screen_width
        self.screen_height = screen_height
        self.chromedriver_path = chromedriver_path
        self.num_observation = 0

        # set of actions: do nothing, jump, down
        self.action_space = spaces.Discrete(3)
        # observations are stacks of 4 grayscale frames with pixel values 0-255
        self.observation_space = spaces.Box(
            low=0,
            high=255,
            shape=(self.screen_width, self.screen_height, 4),
            dtype=np.uint8
        )
        # connection to Chrome
        _chrome_options = webdriver.ChromeOptions()
        _chrome_options.add_argument("--mute-audio")
        _chrome_options.add_argument("disable-infobars")

        self._driver = webdriver.Chrome(
            executable_path=self.chromedriver_path,
            options=_chrome_options
        )
        self.current_key = None
        # current state represented by 4 images
        self.state_queue = deque(maxlen=4)

To capture the image from the browser, we use the ‘canvas’ class, where the game is rendered by default:

<canvas class="runner-canvas" width="1200" height="300" style="width: 600px; height: 150px;"></canvas>

with the following function:

import base64
from io import BytesIO
from PIL import Image

def _get_image(self):
    LEADING_TEXT = "data:image/png;base64,"
    # grab the game canvas as a base64-encoded PNG via JavaScript
    _img = self._driver.execute_script(
        "return document.querySelector('canvas.runner-canvas').toDataURL()"
    )
    _img = _img[len(LEADING_TEXT):]
    return np.array(
        Image.open(BytesIO(base64.b64decode(_img)))
    )

The function that crops the image captured by the code above is:

import cv2

def _next_observation(self):
    # convert to grayscale and crop the region in front of the dinosaur
    image = cv2.cvtColor(self._get_image(), cv2.COLOR_BGR2GRAY)
    image = image[:500, :480]  # cropping
    image = cv2.resize(image, (self.screen_width, self.screen_height))
    self.state_queue.append(image)

    if len(self.state_queue) < 4:
        # at the start, we repeat the image to make a sequence of 4
        return np.stack([image] * 4, axis=-1)
    else:
        return np.stack(self.state_queue, axis=-1)

The reward for each step equals 0.1 while the game is still running and -1 otherwise. To check whether the game is still running, we use the Runner object's state Runner.instance_.crashed, where Runner is the object the game renders into the browser window.

def _get_done(self):
    return self._driver.execute_script("return Runner.instance_.crashed")

The observation and learning logic is defined in the step function, where the environment sends one of the actions to the browser's ‘body’ element and computes the reward according to the action taken:

import time

def step(self, action: int):
    # send the key corresponding to the chosen action to the page body
    self._driver.find_element_by_tag_name("body") \
        .send_keys(self.actions_map[action])

    obs = self._next_observation()

    done = self._get_done()
    reward = .1 if not done else -1

    time.sleep(.015)

    return obs, reward, done, {"score": self._get_score()}

The actions are defined in the environment as follows:

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

self.actions_map = [
    Keys.ARROW_RIGHT,  # do nothing
    Keys.ARROW_UP,     # jump
    Keys.ARROW_DOWN    # down
]
action_chains = ActionChains(self._driver)
self.keydown_actions = [action_chains.key_down(item) for item in self.actions_map]
self.keyup_actions = [action_chains.key_up(item) for item in self.actions_map]
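The step method also reports the current score through a _get_score helper, which is not shown in the listings above. A minimal sketch, assuming the game exposes its score as the digit array Runner.instance_.distanceMeter.digits (an assumption about the game's internals, not code from the original article):

def _get_score(self):
    # Assumption: the dino game keeps the displayed score as a list of
    # digits in Runner.instance_.distanceMeter.digits.
    digits = self._driver.execute_script(
        "return Runner.instance_.distanceMeter.digits"
    )
    return int("".join(str(d) for d in digits)) if digits else 0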

Each time the game finishes, we reset it with the following method:

from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def reset(self):
    # Resets the environment to an initial state and returns an initial observation.
    try:
        self._driver.get('chrome://dino')
    except WebDriverException:
        # navigating to chrome://dino may raise an exception; ignore it
        print("page down")
    # wait until the game canvas is rendered
    WebDriverWait(self._driver, 10).until(
        EC.presence_of_element_located((
            By.CLASS_NAME,
            "runner-canvas"
        ))
    )
    # trigger game start
    self._driver.find_element_by_tag_name("body").send_keys(Keys.SPACE)

    return self._next_observation()
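Before moving on to training, it can be useful to verify that the custom environment respects the gym interface. Stable-Baselines provides a checker for this; a minimal sketch (the screen size and chromedriver path are illustrative values):

from stable_baselines.common.env_checker import check_env

# Illustrative parameters; adjust to your setup
env = EnvironmentChromeTRex(
    screen_width=96,
    screen_height=96,
    chromedriver_path="chromedriver",
)
check_env(env)  # warns or raises if the gym interface is violated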

To learn the actions of the agent, we will use the following schema:

Schema of training the model via RL. The first step is to define the model, the second step is to learn the action for every possible sequence of 4 images by computing the reward, and the last step is to run prediction on test data.

The model definition step is as follows. Note that we train with the PPO2 algorithm and Stable-Baselines’ CnnPolicy, whose convolutional feature extractor follows the same Atari-style CNN design as the DQN described above:

from stable_baselines.common.policies import CnnPolicy
from stable_baselines.common.callbacks import CheckpointCallback
from stable_baselines import PPO2

save_path = "chrome_dino_ppo2"  # illustrative name for the saved weights

# save intermediate weights every 5000 steps
checkpoint_callback = CheckpointCallback(
    save_freq=5000,
    save_path='./.checkpoints/',
    name_prefix=save_path,
)
model = PPO2(
    CnnPolicy,
    env,  # the EnvironmentChromeTRex instance created above
    verbose=1,
    tensorboard_log="./.tb_chromedino_env/",
)

The model is then trained via observations:

model.learn(
    total_timesteps=2000000,
    callback=[checkpoint_callback]
)
model.save(save_path)

Video made during the training:

Video of the training process over several steps (6 min 27 sec).

Prediction on test data:

from tqdm import tqdm

images = []

obs = env.reset()
img = model.env.render(mode='rgb_array')

for i in tqdm(range(500)):
    images.append(img)
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = env.step(action)

    img = env.render(mode='rgb_array')

Model prediction on the test set after 1,980,000 steps of training (maximum score ~258).

The learned weights for testing can be found here.
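A model saved with model.save (or the downloaded weights) can be restored for evaluation without retraining; a minimal sketch, assuming the file name matches the illustrative save_path used earlier:

from stable_baselines import PPO2

# load the saved weights and attach the environment used for prediction
model = PPO2.load("chrome_dino_ppo2", env=env)
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)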

The entropy loss for 1,980,000 steps is plotted via the tensorboard command:

tensorboard --logdir ./.tb_chromedino_env/

The entropy loss reached a minimum around step 1,600,000; the model could be trained further to reduce the error.

I attach the script to run the model here, in case someone wants to run the code on their machine. To train, one should set the variable do_train to True:

do_train = True

The training ran for 24 hours on a virtual machine with 6 parallel CPU processes.

In conclusion, we would like to note that the chosen method has some limitations. After 1,980,000 steps and 24 hours of training, we tested the trained model on a real game and obtained a score of 258. Compared to what an average human player can achieve in this game, this is a relatively low score; for example, I scored 479 playing the game by hand for the first time. Therefore, we suggest that the result could be improved by training for a larger number of steps or by using other RL frameworks [5].

The score achieved by a human player (myself).

This project was conducted by Deelvin. Check out our Deelvin Machine Learning blog for more articles on machine learning.

References:

  1. Richard S. Sutton and Andrew G. Barto. “Reinforcement Learning: An Introduction.”, 2015.
  2. Çağlar Gülçehre. Lecture notes. “Deep Reinforcement Learning in the Real World: Offline RL.”, Deep Learning Summer School 2021.
  3. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller. “Playing Atari with Deep Reinforcement Learning”, 19 Dec. 2013.
  4. Ngoc Nguyen. “Tutorial: Build AI to play Google Chrome Dino game with Reinforcement Learning in 30 minutes”, published online 15 June 2020: https://luungoc2005.github.io/blog/2020-06-15-chrome-dino-game-reinforcement-learning/.
  5. Mauricio Fadel Argerich, “5 frameworks for reinforcement learning on python”, published online 4 June 2020: https://towardsdatascience.com/5-frameworks-for-reinforcement-learning-on-python-1447fede2f18.
