Practical Reinforcement Learning pt. 4

A Simple Implementation

Gene Foxwell · Published in Coinmonks · Oct 7, 2018

Introduction

This article continues from the previous article in this series. In this article we will concentrate on building a simple Python-based implementation of the ideas that have been introduced so far.

The Problem

Living Room

Let’s set the stage. We want to teach our robot (located in the upper left-hand corner) to make its way to its owner (the little stick figure) without hitting any of the tables. To keep things simple, our robot’s world is divided into grid squares. If the robot occupies the same square as the owner, we will consider it successful. If it ever occupies the same square as a table, we will consider it a failure.

Now, if we had full access to the map above, the obvious solution to this problem would be to use a path-finding algorithm (as mentioned in previous articles). However, we will assume the agent has no prior knowledge of the world it finds itself in, and we will restrict ourselves to the RL techniques introduced in articles 1, 2, and 3.

Design

A good place to start is to break down the problem into its respective components. There are four major components that we’ll split our implementation into:

  • Environment: Represents the world that the Agent will interact with. We’ll model this as a 5 by 5 grid where “blank” spaces are traversable, “T” spaces are tables, and the “G” space will represent the person.
  • Agent: Represents the Agent itself. This will take in the current state (produced by the environment) and output the next action to be taken.
  • Model: Represents the Q-Value function for the agent. In our case this will be represented as a simple table of state action pairs holding the expected rewards for each. Other, more complicated representations are possible and will be explored in later articles.
  • Q-Learning: A container to represent the Q-Learning algorithm itself. This manages the learning process and the interface between the Agent and the Environment.
Q-Learning Design

OK, now that we have a high-level approach to how we can build a simple RL system, let’s start diving into some Python code.

Environment

As the environment is central to any RL problem, we will start by building the code for it. Typically our environment needs to handle a few behaviours:

  • It needs to track the current state of the simulation (if we were using a physical robot, we would only need to track the states required to interface with the robot’s sensors).
  • It needs to provide an interface between the agent and the environment. The agent needs to be able to perform actions and observe its current state (insofar as the state can be observed).
  • We need a way to reset the environment so that we can run another training episode.

For our purposes we will represent the environment as a five by five matrix of tokens as mentioned in the previous section. We provide the agent an interface to the environment via the next method.

The next method takes in an action and outputs a tuple populated with the following information:

  • state: This will be represented as the current grid co-ordinates of the agent in the world.
  • action: This just returns the same action that was provided. It could later be useful if we want to model actions probabilistically, e.g. the robot is asked to go up, but actually moves left. In most use cases (including probabilistic ones), we will ignore this.
  • reward: The reward received from the environment. For this specific environment the rewards are -1 for hitting a table, -0.1 to move through a blank space, and a nice juicy reward of 100 for reaching the owner.
  • done: This marks whether the episode has reached a terminal state; for example, if the agent runs into a table we want to reset the episode.

Python code to implement this environment is included below. A helper function “draw_env” has also been included in this class in order to facilitate visualization of the learning process.

# Environment
# Observation:
#   state
#   action
#   reward
#   done
class Environment():

    # Encoding:
    # "*": agent position
    # " ": empty square
    # "T": Table
    # "G": Goal
    def __init__(self):
        self.agent_position = (0, 0)
        self.map = [
            [" ", " ", " ", " ", " "],
            [" ", " ", "T", " ", " "],
            [" ", "T", " ", " ", " "],
            [" ", " ", " ", " ", "G"],
            [" ", " ", " ", " ", " "]
        ]

    # draw the map with the agent's position marked by "*"
    def draw_env(self):

        x = self.agent_position[1]
        y = self.agent_position[0]

        last_token = self.map[y][x]

        self.map[y][x] = "*"
        print('----------------------')
        for l in self.map:
            print(l)

        self.map[y][x] = last_token

    # get the token from the current position
    def get_token(self):
        x = self.agent_position[1]
        y = self.agent_position[0]

        return self.map[y][x]

    # reward mapping:
    # " " -> -0.1
    # "T" -> -1
    # "G" -> +100
    def reward(self):
        token = self.get_token()

        if token == " ":
            return -0.1

        if token == "T":
            return -1

        if token == "G":
            return 100

        return 0

    # clamp a value between 0 and 4
    def clamp_to_map(self, value):
        if value < 0:
            return 0

        if value > 4:
            return 4

        return value

    # action:
    #   UP, DOWN, LEFT, RIGHT
    # returns (state_position, action, reward, done)
    def next(self, action):

        start_position = self.agent_position

        x = self.agent_position[1]
        y = self.agent_position[0]

        # move the agent
        if action == "U":
            y = y - 1

        if action == "D":
            y = y + 1

        if action == "L":
            x = x - 1

        if action == "R":
            x = x + 1

        # clamp it to the environment
        x = self.clamp_to_map(x)
        y = self.clamp_to_map(y)

        self.agent_position = (y, x)

        # determine the reward
        reward = self.reward()

        # is the episode complete?
        token = self.get_token()
        done = (token == "G" or token == "T")

        return (start_position, action, reward, done)

    # sets the agent position back to (0,0)
    def reset(self):
        self.agent_position = (0, 0)
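
As a quick sanity check, here is a short (hypothetical) interaction with the environment above. Moving right from the start square lands on a blank space, so the observation carries the small step penalty:

# Quick sanity check of the environment interface (not part of the class above)
env = Environment()
state, action, reward, done = env.next("R")   # move right from (0, 0)
print(state, action, reward, done)            # (0, 0) R -0.1 False
env.draw_env()                                # shows the agent "*" at (0, 1)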

Agent

Alright, it’s time to build our Agent! In the design being followed in this article, our Agent is responsible for taking in the current state of the environment and choosing the next action that will be taken.

Choosing the next action brings in the so-called exploration factor. As explained in the previous articles, this is a value between 0 and 1 that represents the likelihood that the agent will take a random action. The approach used for this implementation will have this value start at 1 and reduce slightly after each learning iteration. The reader is encouraged to play with this value to see if they can get better results!

In order to choose an action, we will need to model the rewards that the agent expects to get from the environment. As it is helpful to have the flexibility to change out models, the model is taken in as a parameter. It is expected that the model will expose a predict method that takes in the current state and outputs the next action.

Code for the Agent object is shown below:

import numpy as np

# Agent:
#   model as Model
#   state as State
#   exploration as Float
class Agent():

    # needs a model to represent the rewards
    def __init__(self, model, start_state, exploration):
        self.model = model
        self.state = start_state
        self.exploration = exploration

    # encoding
    # 0 <- UP
    # 1 <- RIGHT
    # 2 <- DOWN
    # 3 <- LEFT
    def get_action(self, action_id):
        if action_id == 0:
            return "U"

        if action_id == 1:
            return "R"

        if action_id == 2:
            return "D"

        return "L"

    def next_action(self, env):
        # test against the current exploration constant
        prob = np.random.random()

        if prob < self.exploration:
            # explore: pick one of the four actions at random
            action = self.get_action(np.random.choice(4))
        else:
            # exploit: the model's predict method already returns an action token
            action = self.model.predict(self.state)

        observation = env.next(action)

        # env.next reports the square the agent started from, so read the
        # agent's new position directly from the environment
        self.state = env.agent_position

        # return the observation
        return observation

    def reduce_exploration(self):
        # shrink the exploration rate slightly after each episode
        self.exploration = self.exploration * 0.99
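
To get a sense of how quickly the agent shifts from exploring to exploiting, here is a quick back-of-the-envelope check of the multiplicative 0.99 decay used above (a sketch, not part of the Agent class):

# Exploration rate after 100 episodes with a multiplicative 0.99 decay
eps = 1.0
for _ in range(100):
    eps *= 0.99
print(round(eps, 3))   # roughly 0.366, so the agent still explores about a third of the time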

Model

For this problem we are going to keep the model as simple as possible. We will represent the Q-Values as a table of state / action pairs. For example, if the Agent is in position (0,0) we could have the corresponding Q values in the table:

  • UP: 0.0
  • LEFT: 0.0
  • RIGHT: 0.4
  • DOWN: 0.3

In addition to representing the Q values, we need our model to be able to predict the best action based on the known Q values and the current state. This is done (as suggested by the previous articles) by choosing the action that maximizes the value for the current state. For example, based on the above table the model would predict an action of RIGHT.

Lastly, we need the model to be able to update itself. To this end we will use the equation that was introduced in the previous article:

Q(s,a) = Q(s,a) + step_size * (R + gamma * Qmax(s1, a1) - Q(s,a))

In the code below, step_size maps onto alpha and gamma maps onto the discount_factor.
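
As a worked example (with made-up numbers): suppose the agent steps onto a blank square, so R = -0.1, the current estimate Q(s,a) is 0, the best value reachable from the next state is 0.4, gamma = 0.98, and step_size = 0.1. The update nudges Q(s,a) from 0 up to roughly 0.029:

# Worked example of a single Q-value update (the values here are made up for illustration)
alpha = 0.1        # step_size
gamma = 0.98       # discount_factor
q_current = 0.0    # Q(s, a) before the update
reward = -0.1      # penalty for stepping onto a blank square
q_next = 0.4       # best Q value available from the next state

q_updated = q_current + alpha * (reward + gamma * q_next - q_current)
print(round(q_updated, 4))   # 0.0292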

Also included in this code is a policy method which is used to help visualize the final policy produced by the model after training.

# Model
class Model():

    def __init__(self, discount_factor, alpha):
        self.discount_factor = discount_factor
        self.actions_options = ("U", "R", "D", "L")
        self.alpha = alpha
        self.Q = {}

        # initialize the actions for all states to zero
        for y in range(5):
            for x in range(5):
                state = (y, x)

                self.Q[state] = {}

                for a in self.actions_options:
                    self.Q[state][a] = 0

    # return the action with the highest Q value for the given state
    def predict(self, state):

        actions = self.Q[state]

        max_key = None
        max_val = float('-inf')
        for k, v in actions.items():
            if v > max_val:
                max_val = v
                max_key = k

        return max_key

    # apply one Q-value update and return the size of the change
    def update(self, state, action, reward, state2, action2):
        lastQ = self.Q[state][action]
        self.Q[state][action] = self.Q[state][action] + self.alpha * (reward + self.discount_factor * self.Q[state2][action2] - self.Q[state][action])

        return np.abs(lastQ - self.Q[state][action])

    # build a grid of greedy actions for visualization, with tables and the
    # goal shown using their map tokens
    def policy(self, map):
        policy = []

        for y in range(5):
            l = []

            for x in range(5):
                action = self.predict((y, x))

                if map[y][x] != " ":
                    action = map[y][x]

                l.append(action)

            policy.append(l)

        return policy
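
Here is a quick (hypothetical) example of the model in action: one update for moving right out of the start square onto a blank space, followed by a prediction for that square:

# Hypothetical single update: the agent moved right from (0, 0) to (0, 1),
# collected the -0.1 step penalty, and then chose "R" again from (0, 1)
model = Model(discount_factor=0.98, alpha=0.1)
delta = model.update((0, 0), "R", -0.1, (0, 1), "R")
print(round(float(delta), 3))             # 0.01, the size of the change
print(round(model.Q[(0, 0)]["R"], 3))     # -0.01
print(model.predict((0, 0)))              # "U", the first action still at zero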

Q-Learning Algorithm

Q-Learning

In the next code snippet we have the Q-Learning algorithm itself. For simplicity, it’s been split into two parts. The episode method executes a single episode in the environment, running the agent until it hits a terminal state and updating the model after each action.

train_agent iterates over episodes until the change in Q-values is sufficiently low, or the simulation reaches 1000 episodes (this hard cap was put in place to guarantee that the system would halt). After training is complete, the resulting policy is output to the terminal for the user’s examination.

def episode(agent, env):

    done = False

    # take the first action from the start state
    state = (0, 0)
    observation = agent.next_action(env)
    action = observation[1]

    highest_delta = 0

    while not done:
        # state_position, action, reward, done
        observation = agent.next_action(env)

        state2 = observation[0]
        action2 = observation[1]
        reward = observation[2]
        done = observation[3]

        delta = agent.model.update(state, action, reward, state2, action2)
        highest_delta = max(delta, highest_delta)

        state = state2
        action = action2

        if done:
            agent.model.Q[state][action] = reward

    return highest_delta


def train_agent(agent, env):
    done = False
    max_iterations = 1000
    i = 0

    while not done:
        change = episode(agent, env)

        # reset the environment (and the agent's tracked state) for the next episode
        env.reset()
        agent.state = (0, 0)

        done = (change < 0.005)

        i = i + 1
        if i == max_iterations:
            done = True

        agent.reduce_exploration()

    # print the learned policy once training is complete
    policy = agent.model.policy(env.map)

    for l in policy:
        print(l)


if __name__ == '__main__':
    grid_world = Environment()
    agent_model = Model(discount_factor=0.98, alpha=0.1)
    agent = Agent(agent_model, (0, 0), 1.0)
    train_agent(agent, grid_world)
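
To make the printed policy a little more concrete, here is a minimal sketch (not part of the original script; the helper name follow_policy is just for illustration) that follows the learned greedy policy from the start square until the episode ends:

# Follow the greedy policy after training and print each step taken
def follow_policy(agent, env, max_steps=25):
    env.reset()
    position = env.agent_position

    for _ in range(max_steps):
        action = agent.model.predict(position)   # greedy action for this square
        _, _, reward, done = env.next(action)
        position = env.agent_position
        print(position, action, reward)

        if done:
            break

# e.g. call right after train_agent(agent, grid_world)
follow_policy(agent, grid_world)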

Discussion and Results

Output

Above is the output of our work so far. We can see that if the agent starts at position (0,0) our policy is to travel all the way to the right, and then go down to the goal. Since our agent always starts at (0,0) in our current scenario, this is the only policy that is relevant.

Next Steps

This article introduced a very basic RL solution for a simple problem. Thus far, despite the title of the series, it’s hard to say that anything so far has been practical. In the next few articles I’ll start introducing ways that we can extend the ideas here to ones that could potentially work on a real robot.

Until then,

Share and Enjoy!
