My Journey Into Deep Q-Learning with Keras and Gym

Gaetan Juvin
Jun 29, 2017 · 11 min read


CartPole Game

This post will show you how to implement Deep Reinforcement Learning (Deep Q-Learning) to play an old game: CartPole.

I’ve used two tools to facilitate my task:

  • OpenAI Gym: provides a simple interface for interacting with the environments of a lot of old video games (it has a good collection of Atari games).
  • Keras: a high-level neural networks API capable of running on top of a Deep Learning library such as TensorFlow, CNTK or Theano.

In the end, we will create an “AI” that learns by itself in less than 100 lines of code.

I will explain everything without requiring the reader to have any prior background in Deep Reinforcement Learning.

The code used for this article is on GitHub.

What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning. It allows you to create an AI agent which will learn from the environment (input / output) by interacting with it. It will learn by trial and error. After a lot of tries, it will have enough experience to succeed in the environment.

This type of machine learning is very close to the way we learn things as humans. For example, it is like when we learn to walk: we try several times to put one foot in front of the other, and it is only after a lot of failures and observations of our environment that we succeed in walking.

In the picture, the Agent represents our AI agent, which acts on the environment. After each action, the agent receives a reward (positive or negative) and the new state of the environment, which it uses to choose its next action.

What is Deep Reinforcement Learning?

Google’s DeepMind published its famous paper Playing Atari with Deep Reinforcement Learning.

At the end of 2013, DeepMind introduced a new algorithm called Deep Q Network (DQN). It demonstrated how an AI agent can learn to play games just by observing the screen, without receiving any prior information about those games.

It was pretty impressive and this paper opened a new era of what is called ‘Deep Reinforcement Learning’, which is a mix of Deep Learning and Reinforcement Learning.

Video: DeepMind’s Atari Player

In the Deep Q Network algorithm, a neural network is used to choose the best action based on the current state of the environment (usually called the “State”).

We have a function called the Q-function, which is used to estimate the potential reward of taking an action in a given State. We write it Q(State, action): the expected future value of performing that action in that State.
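A useful way to read it: the value of an action is the immediate reward plus the (discounted) value of the best action available in the next State, where gamma is a discount factor between 0 and 1 that we will configure later:

Q(State, action) = reward + gamma * max_a' Q(next State, a')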

An old game from our childhood: CartPole

For this post, I’ve picked a “simple” game because training an agent to play a complex game may take a while (from a few hours to a whole day).

CartPole’s goal is to balance a pole connected with one joint on the top of a moving cart.

I am using a tool called OpenAI Gym, which is a game simulator. It provides us with ready-to-use variables as the State (the angle of the pole, the position of the cart, …) instead of raw pixel information.

In order to interact with the game, our agent moves the cart by sending it one of two actions, 0 or 1, pushing it to the left or to the right.

Gym lets us focus on the “brain” of our AI Agent by making all the interactions with the game environment really simple:

# INPUT
# action can be either 0 or 1
# OUTPUT
# next_state and reward are what we need for training
# done is a boolean telling whether the game ended or not
# info contains extra diagnostic information that we will not use
next_state, reward, done, info = env.step(action)
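As a quick sanity check, here is a minimal sketch that plays one episode of CartPole with random actions (it assumes the 2017-era Gym API used throughout this post, where reset() returns the state and step() returns four values):

import gym

env = gym.make('CartPole-v1')

state = env.reset()   # 4 numbers describing the cart and the pole
done = False
score = 0

while not done:
    action = env.action_space.sample()                 # random action: 0 (left) or 1 (right)
    next_state, reward, done, info = env.step(action)  # reward is +1 for every step survived
    score += reward
    state = next_state

print("Random agent survived {} steps".format(int(score)))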

Using Keras To Implement a Simple Neural Network


This post is not about Deep Learning or Neural Networks. Therefore, we will treat the Neural Network as a black box algorithm that approximately maps inputs to outputs.

A Neural Network is basically an algorithm that learns from pairs of examples (input and output data), detects some kind of pattern, and predicts the output for unseen input data.

3 inputs, 1 hidden layer and 2 outputs

The neural network we are going to use in this post is similar to the diagram above. It will have one input layer that receives 4 pieces of information, two hidden layers with 24 nodes each, and 2 nodes in the output layer, since there are two buttons (0 and 1) in the game.

Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow, CNTK or Theano. “It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.”

Keras makes it really simple to implement a basic neural network.

0# Initialization

# Neural Network for Deep Q Learning
# Sequential() creates the foundation of the layers.
model = Sequential()
# 'Dense' is the basic form of a neural network layer
# Input Layer of state size (4) and Hidden Layer with 24 nodes
model.add(Dense(24, input_dim=self.state_size, activation='relu'))
# Hidden layer with 24 nodes
model.add(Dense(24, activation='relu'))
# Output Layer with # of actions: 2 nodes (left, right)
model.add(Dense(self.action_size, activation='linear'))
# Create the model based on the information above
model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))

1# Training of our neural network

In order for our neural network to learn and make predictions, we have to feed it with inputs.

To do so, Keras provides the method fit(), which feeds input and output pairs to the model. The model then trains on that data to estimate the output from the input.

This training process makes the neural network able to predict the reward value from a state.

model.fit(state, reward_value, epochs=1, verbose=0)

2# Prediction

After training, the model can predict the output for unseen input. When you call the predict() function on the model, it will predict the reward of the current State based on the data it was trained on.

prediction = model.predict(state)

Deep Q Network Implementation

In games, the reward is related to performance. It is often related to a number: the score.

For CartPole, there is no score. The reward is based on how long the player survives: keep the pole up AND the cart inside the screen. Survival is not as “exact” as a number, so intuition plays an important role: imagine a situation where the pole is tilting to the right. The player has two choices, push the right button or the left one. In order to survive longer, it should push the right button. In DQN, the direct translation of this is that the expected reward of pushing the right button will be higher than that of pushing the left button.
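To make that concrete, here is a tiny illustration with made-up numbers (these Q-values are purely hypothetical):

import numpy as np

# Hypothetical Q-values the network might predict for one state,
# ordered as [push left, push right]. The numbers are invented for illustration.
q_values = np.array([0.30, 0.85])

action = np.argmax(q_values)  # 1 -> push right, the action with the higher expected reward
print(action)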

In the DQN algorithm, there are also two very important parts: the remember and replay methods. Both are pretty simple concepts and are best explained by analogy with how we live through a situation as humans: you remember what you did after performing each action, and once you have gathered enough elements, you replay the situation in your mind. It always ends with “I should have done it that way”.

0# Global Parameters

  • learning_rate - This controls how much the neural network learns from the loss between the target and the prediction in each iteration.
  • gamma - This is the discount factor used to compute the future discounted reward.
  • exploration_rate - At the beginning, our agent lacks experience, so we pick its actions randomly; as it gains experience, we let it decide which action to take.
  • exploration_decay - We want the amount of exploration to decrease as the agent gets better and better at playing the game (the sketch after this list shows how fast this happens).
  • episodes - This indicates how many games we want the agent to play in order to train itself.
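To get a feel for how quickly the agent switches from exploring to exploiting, here is a small sketch using the values from the Agent class defined later in this post (and assuming, as in the main loop below, that the exploration rate is decayed once per episode):

# Values taken from the Agent class defined further down.
exploration_rate = 1.0
exploration_min = 0.01
exploration_decay = 0.995

episode = 0
while exploration_rate > exploration_min:
    exploration_rate *= exploration_decay
    episode += 1

# Prints 919: the exploration rate bottoms out after roughly 900 episodes,
# after which the agent almost always follows its own predictions.
print(episode, exploration_rate)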

1# How do we logically represent this intuition to survive longer?

Mathematical representation of Q-learning
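In plain notation, the idea is to compare a target value with the network’s current prediction of Q(State, action) and to square the difference:

target = reward + gamma * max_a' Q(next State, a')
loss = ( target − prediction )²

The next paragraphs unpack each piece of this expression.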

The loss is a value that indicates how far our prediction is from the actual target. For example, the prediction of the model could indicate that it sees more value in pushing the left button when in fact it can gain more reward by pushing the right button.

Our goal is to decrease the loss, which is the gap between the prediction and the target.

We first pick an action at random and observe the reward. The action also results in a new State.

Keras takes care of the most difficult tasks for us. In this formula, we only have to calculate the target.

import numpy as np

# np.amax returns the maximum of an array or the maximum along an axis
target = reward + gamma * np.amax(model.predict(next_state))

In the function fit(), Keras subtracts the target from the neural network output and squares it. Then it also applies the learning rate we defined when we initialized the neural network.

Each call to fit() reduces the gap between our prediction and the target by an amount controlled by the learning rate. As we repeat this updating process, the approximation of the Q-value converges towards the true Q-value: the loss decreases and the score grows higher.
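Here is a worked single-step example with made-up numbers, just to show the mechanics (the values are invented; target_f mirrors what the replay() method below does):

import numpy as np

gamma = 0.95
prediction = np.array([[0.5, 0.6]])        # current Q-value estimates for [left, right]
action = 1                                 # the action that was actually taken
reward = 1.0                               # CartPole gives +1 for every step survived
next_q = np.array([[0.8, 0.9]])            # Q-values predicted for the next state

target = reward + gamma * np.amax(next_q)  # 1.0 + 0.95 * 0.9 = 1.855
target_f = prediction.copy()
target_f[0][action] = target               # only the taken action is updated: [[0.5, 1.855]]

# model.fit(state, target_f, epochs=1, verbose=0) would now pull the prediction
# for action 1 towards 1.855 and leave action 0 untouched.
print(target_f)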

2# Remember

One of the most important steps in the learning process is to remember what we did in the past and which reward followed each action. Therefore, we need a list of previous experiences and observations so that we can re-train the model on them.

We will store our experiences in a list called memory and we will create a remember() function to append the state, action, reward, and next state to memory.

memory.append((state, action, reward, next_state, done))

And the remember() function will simply store states, actions and resulting rewards into the memory:

def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))
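In the full agent below, memory is actually a bounded deque with a maximum length of 2,000, so the oldest experiences are silently dropped once the buffer is full. A minimal sketch of that behaviour (with the maximum length shrunk to 3 purely for illustration):

from collections import deque

memory = deque(maxlen=3)  # the real agent uses maxlen=2000

for i in range(5):
    # Dummy experience tuples standing in for (state, action, reward, next_state, done)
    memory.append(("state_{}".format(i), 0, 1.0, "next_state_{}".format(i), False))

print(len(memory))    # 3 -- only the most recent experiences are kept
print(memory[0][0])   # 'state_2' -- the two oldest entries were dropped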

3# Replay

Now that we have our past experiences in an array, we can train our neural network. Let’s create a function replay(). We cannot afford to replay our entire memory every time: it would take too many resources. Therefore, we will only take a few samples (sample_batch_size, here set to 32) and pick them randomly.

sample_batch = random.sample(self.memory, sample_batch_size)

To make the agent perform well in the mid and long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get.

In order to implement that, we will use gamma. This way, our DQN agent will learn to maximize the discounted future reward for the given State.

def replay(self, sample_batch_size):
    sample_batch = random.sample(self.memory, sample_batch_size)
    for state, action, reward, next_state, done in sample_batch:
        target = reward
        if not done:
            target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
        target_f = self.brain.predict(state)
        target_f[0][action] = target
        self.brain.fit(state, target_f, epochs=1, verbose=0)
    if self.exploration_rate > self.exploration_min:
        self.exploration_rate *= self.exploration_decay

4# Act

At first, our agent selects its action randomly a certain percentage of the time, called the ‘exploration rate’ (or ‘epsilon’). In the beginning, it is better for the DQN agent to try different things before it starts to search for a pattern.

When our DQN agent has enough experience, the agent will predict the reward value based on the current State. It will pick the action that will give the highest reward.

np.argmax() is the function that returns the index of the highest value in act_values[0]. For example, act_values[0] may look like this: [0.21, 0.42], each number representing the predicted reward of picking action 0 or 1. In this situation, it will return 1.

def act(self, state):
    if np.random.rand() <= self.exploration_rate:
        return random.randrange(self.action_size)
    act_values = self.brain.predict(state)
    return np.argmax(act_values[0])
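One detail worth noting: act() expects the state to arrive as a batch of one sample, which is why the main loop below reshapes it before every call. A short sketch of that shape handling (it assumes the Agent class defined in the next section):

import gym
import numpy as np

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]      # 4
agent = Agent(state_size, env.action_space.n)    # Agent comes from the full script below

state = env.reset()                              # shape (4,)
state = np.reshape(state, [1, state_size])       # shape (1, 4): a batch containing one sample
action = agent.act(state)                        # returns 0 or 1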

Let’s code!

0# DQL Agent

import os
import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class Agent():
    def __init__(self, state_size, action_size):
        self.weight_backup = "cartpole_weight.h5"
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.learning_rate = 0.001
        self.gamma = 0.95
        self.exploration_rate = 1.0
        self.exploration_min = 0.01
        self.exploration_decay = 0.995
        self.brain = self._build_model()

    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        if os.path.isfile(self.weight_backup):
            model.load_weights(self.weight_backup)
            self.exploration_rate = self.exploration_min
        return model

    def save_model(self):
        self.brain.save(self.weight_backup)

    def act(self, state):
        if np.random.rand() <= self.exploration_rate:
            return random.randrange(self.action_size)
        act_values = self.brain.predict(state)
        return np.argmax(act_values[0])

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, sample_batch_size):
        if len(self.memory) < sample_batch_size:
            return
        sample_batch = random.sample(self.memory, sample_batch_size)
        for state, action, reward, next_state, done in sample_batch:
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
            target_f = self.brain.predict(state)
            target_f[0][action] = target
            self.brain.fit(state, target_f, epochs=1, verbose=0)
        if self.exploration_rate > self.exploration_min:
            self.exploration_rate *= self.exploration_decay

1# main() function

import gym


class CartPole:
    def __init__(self):
        self.sample_batch_size = 32
        self.episodes = 10000
        self.env = gym.make('CartPole-v1')
        self.state_size = self.env.observation_space.shape[0]
        self.action_size = self.env.action_space.n
        self.agent = Agent(self.state_size, self.action_size)

    def run(self):
        try:
            for index_episode in range(self.episodes):
                state = self.env.reset()
                state = np.reshape(state, [1, self.state_size])
                done = False
                index = 0
                while not done:
                    # self.env.render()
                    action = self.agent.act(state)
                    next_state, reward, done, _ = self.env.step(action)
                    next_state = np.reshape(next_state, [1, self.state_size])
                    self.agent.remember(state, action, reward, next_state, done)
                    state = next_state
                    index += 1
                print("Episode {}# Score: {}".format(index_episode, index + 1))
                self.agent.replay(self.sample_batch_size)
        finally:
            self.agent.save_model()


if __name__ == "__main__":
    cartpole = CartPole()
    cartpole.run()

2# Training time!

We can start our script. OpenAI Gym caps CartPole-v1 episodes at 500 steps, so the score printed by our script tops out at 501.

And remember that at the beginning, our DQL Agent will explore by acting randomly. You will be able to see its progression through the displayed score.

During the learning phase, it will go through multiple steps:

  1. Mastering the balance of the pole.
  2. Staying in the bounds.
  3. Not dropping the pole while trying to move away from the bounds.

After several hundred episodes (which took about 5 minutes), it starts to learn how to balance the pole and stay within the bounds in order to maximize the score.

That’s it, we’ve created a skillful CartPole player!

A trained agent playing CartPole

You can find the code used for this post on GitHub. If you want to turn on the game renderer, uncomment the line below. :-)

#                    self.env.render()

May the code be with You!

Gaëtan JUVIN
