Reinforcement Learning w/ Keras+OpenAI: The Basics

Yash Patel
6 min read · Jul 26, 2017

Reinforcement learning has been heralded by many as one of the gateway technologies/concepts to have emerged from the theoretical studies of machine learning. We'll go through a very quick overview of reinforcement learning before diving into the code.

Quick Background

Reinforcement learning (RL) is a general umbrella term for any algorithm that does not require explicit pairs of data and their corresponding desired labels, as in traditional supervised learning, but instead requires some numeric indication of "how good a sample is." This measure of a sample's "goodness" has no meaning in an absolute sense. You can imagine it generically as your score in a video game. If the screen displays a score of "218," that presumably carries no meaning to you, the gamer, unless you are aware of how difficult or easy it is to earn a point and what score you start with. And that's basically the extent of background we'll be delving into: there will be more in-depth discussions of RL in the future, but the code we go through in this post is a very basic example of RL, and so does not require any further digging into the theory.

Keras Notes

For anyone just getting started in AI/ML programming, welcome! The field has grown so much in the past few years that it can feel quite overwhelming to jump in just now. But there's still plenty of time to get involved and learn in this massive field! In line with that, Keras is the library I'll primarily be using for my tutorials to come, including this one. Keras is essentially a wrapper library for Tensorflow and Theano. Its interface is quite similar to that exposed by tflearn, but it is slightly more generic in that it can also use Theano as the backend. Take note though! The dimensions in Theano are slightly different from those in Tensorflow. I would recommend, therefore, adjusting your Keras to use TF as its backend to avoid any frustrations with dimensions going forward (this should be the default when you install Keras if you already have TF installed).
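If you want to sanity-check which backend your installation is using, a minimal sketch like the following works (note that the KERAS_BACKEND environment variable only takes effect if it is set before Keras is imported for the first time):

import os

# Optionally force the TensorFlow backend; this must be set before
# Keras is first imported.
os.environ["KERAS_BACKEND"] = "tensorflow"

from keras import backend as K
print(K.backend())  # should print "tensorflow"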

You could also just as easily do this with TF, but Keras gives us the nice flexibility when getting started of not having to keep track of dimensions through convolutions and all that crap. Anyway, enough words: time to move to the code!

Code

We’ll be exploring the most basic OpenAI environment here: the CartPole! As a final quick note, you can find the instructions to install OpenAI’s gym package here: https://gym.openai.com/docs. Just running “sudo pip install gym” should work on most platforms.

The CartPole environment has a very simple premise: balance the pole on the cart.

Data Collection

The first part of any machine learning problem is gathering the data, and this one is no different. Luckily, OpenAI’s gym environment provides a very straightforward way of gathering data: we can essentially just run through the simulation many times and take random steps every time. OpenAI Gym environments are structured around two main parts: an observation space and an action space. We observe the former from the environment and use that to determine how best to update it through an action, i.e. based on the current state of the pole (observation), determine whether to move the cart left or right (action).
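To make that concrete, here is a quick sketch for poking at the two spaces (for CartPole-v0, the observation is a length-4 array and there are two discrete actions):

import gym

env = gym.make("CartPole-v0")
print(env.observation_space)  # Box(4,): cart position/velocity, pole angle/velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

observation = env.reset()     # the initial length-4 observation array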

As a result, we need to take an action that fits in the scope of the allowable actions of the action space, which is of size 2 in this case (left or right). We take the output space to be one-hot encoded, the reason being we want the neural net to eventually predict the probability of moving left vs. right given the current state of the environment. In this case, we could get away with just having the output be a single 1x1 float matrix (i.e. a scalar) and round it for our final result, but the one-hot encoding practice can be more widely applied.

So, to accumulate the actions and corresponding observations, a first thought may simply be:

trainingX, trainingY = [], []
sim_steps = 500
for _ in range(10000):
    observation = env.reset()
    training_sampleX, training_sampleY = [], []
    for step in range(sim_steps):
        # Take a random action and record the (observation, action) pair
        action = np.random.randint(0, 2)
        one_hot_action = np.zeros(2)
        one_hot_action[action] = 1
        training_sampleX.append(observation)
        training_sampleY.append(one_hot_action)

        observation, reward, done, _ = env.step(action)
        if done:
            break
    trainingX += training_sampleX
    trainingY += training_sampleY

However, if we were to train on this, the final predictor would likely do no better than random chance. After all, "garbage in, garbage out": we would be doing no more than feeding the neural net a collection of both good and bad samples and expecting it to learn solely from the good ones. If we take a step back, however, that is completely implausible, since any single sample is indistinguishable from any other, even when comparing samples from good trials against those from poor trials.

So, instead, we'll only keep the samples from trials that achieved high scores. That is, we want to filter the samples to only those that eventually result in high scores in their trials. In this case, we arbitrarily chose a score of 50 as the "minimum cutoff" for a trial to be considered "good," and only keep the samples from those:

import numpy as np

def gather_data(env):
    min_score = 50
    sim_steps = 500
    trainingX, trainingY = [], []
    scores = []
    for _ in range(10000):
        observation = env.reset()
        score = 0
        training_sampleX, training_sampleY = [], []
        for step in range(sim_steps):
            # Random policy: sample an action and record it one-hot encoded
            action = np.random.randint(0, 2)
            one_hot_action = np.zeros(2)
            one_hot_action[action] = 1
            training_sampleX.append(observation)
            training_sampleY.append(one_hot_action)

            observation, reward, done, _ = env.step(action)
            score += reward
            if done:
                break
        # Keep only the samples from sufficiently good trials
        if score > min_score:
            scores.append(score)
            trainingX += training_sampleX
            trainingY += training_sampleY

    trainingX, trainingY = np.array(trainingX), np.array(trainingY)
    print("Average: {}".format(np.mean(scores)))
    print("Median: {}".format(np.median(scores)))
    return trainingX, trainingY

Model Definition

Now that we have the data, we need to go about defining the model. Before tackling any machine learning problem, it is always worthwhile to step back and consider what it is we're modelling, specifically what the expected inputs and desired outputs are. In our case, we'll be receiving the current state of the environment (i.e. the "observations" from before) and wish to predict the probabilities of moving in each of the two directions. From these, we can easily figure out which of the two actions to take by taking the argmax.
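Concretely, if the network's (hypothetical) output for some state were the probabilities below, we would pick the action with the higher probability:

import numpy as np

# Hypothetical network output: probabilities for [left, right]
probs = np.array([0.23, 0.77])
action = np.argmax(probs)  # -> 1, i.e. push the cart right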

The model we use here is a very simple one: several fully-connected layers (a.k.a. Dense layers in Keras). These are often the final layers used in deep CNNs (Convolutional Neural Networks), since they are the ones that combine all the feature maps or input layers into the final scalar values. Fully-connected layers essentially make up the backbone of neural networks and are what allow them to effectively map high-dimensional functions, setting aside all the modern enhancements such as convolutions, LSTMs, Dropout, etc.

The only one of these enhancements that is relevant here is Dropout, since it helps ensure we do not overfit the training data. So, we sandwich a Dropout layer between each pair of fully-connected mappings to make sure that no layer becomes reliant on some small subset of connections that happens to be prominent only in the training data.

Finally, we need to determine the loss function that we'll train against. Since we encoded the output space as a one-hot 2D vector, the natural choice is categorical cross entropy, given that we wish to identify the result as either left ([1,0]) or right ([0,1]). I won't go in-depth about what cross entropy entails, but it's a very worthwhile function to understand, given its prevalence in these sorts of problems. At a high level, the cross entropy between two distributions (a true underlying distribution and our model of it) measures how much information we need, on average, to encode samples drawn from the true distribution when we use the model distribution to do the encoding.
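As a quick illustration with made-up numbers, here is categorical cross entropy computed by hand for a one-hot target:

import numpy as np

y_true = np.array([1.0, 0.0])  # true label: "left," one-hot encoded
y_pred = np.array([0.9, 0.1])  # the model's predicted probabilities

# Categorical cross entropy: -sum(y_true * log(y_pred))
loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # ~0.105: a confident, correct prediction yields a low loss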

Therefore, we define the model as:

from keras.models import Sequential
from keras.layers import Dense, Dropout

def create_model():
    model = Sequential()
    # Input: the 4-dimensional CartPole observation
    model.add(Dense(128, input_shape=(4,), activation="relu"))
    model.add(Dropout(0.6))

    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.6))

    model.add(Dense(512, activation="relu"))
    model.add(Dropout(0.6))

    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.6))

    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.6))
    # Output: softmax probabilities over the two actions (left/right)
    model.add(Dense(2, activation="softmax"))

    model.compile(
        loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"])
    return model

A few more subtle technical points about the model: each layer has a ReLU activation, which allows the model to train more rapidly than it would with saturating activation functions such as tanh and sigmoid. The model would likely still train in those cases, but would take far longer to converge than with ReLU activations.
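For reference, ReLU is simply max(0, x). Unlike sigmoid, it does not flatten out for large positive inputs, so its gradient does not vanish there; a minimal numpy sketch:

import numpy as np

def relu(x):
    # Identity for positive inputs, zero otherwise
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 5.0])
print(relu(x))     # [0. 0. 5.]
print(sigmoid(x))  # saturates near 1 for large x, so its gradient vanishes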

Prediction

From there, we can simply get our training data, train the model, and iterate through several trials to see how well our model performs!

import gym
import numpy as np
from data import gather_data
from model import create_model

def predict():
    env = gym.make("CartPole-v0")
    trainingX, trainingY = gather_data(env)
    model = create_model()
    model.fit(trainingX, trainingY, epochs=5)

    scores = []
    num_trials = 50
    sim_steps = 500
    for trial in range(num_trials):
        observation = env.reset()
        score = 0
        for step in range(sim_steps):
            # Pick the action the model considers most likely to succeed
            action = np.argmax(model.predict(
                observation.reshape(1, 4)))
            observation, reward, done, _ = env.step(action)
            score += reward
            if done:
                break
        scores.append(score)

    print(np.mean(scores))
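Assuming the data-gathering and model code above live in data.py and model.py (as the imports suggest), running the whole pipeline is just a matter of calling the function:

if __name__ == "__main__":
    predict()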

Full Code

With that step-by-step walkthrough done, here is the complete source code of the OpenAI CartPole Keras implementation!

Keep an eye out for the next Keras+OpenAI tutorial!

Comment and click that ❤️ below to show support!

