Train your dog using TF-Agents

Kshitiz Rimal
Deep Learning Journal
16 min read · Jun 1, 2020
Actions taken by agent/dog during 1 episode of gameplay and corresponding rewards and penalties received on each action step.

Deep Reinforcement Learning is a branch of Reinforcement Learning that uses Deep Neural Networks to help agents make decisions. TF-Agents is a framework, part of the TensorFlow ecosystem, that lets us build such Deep Reinforcement Learning algorithms easily. In this blog I will explore the basics of Deep Reinforcement Learning using TensorFlow Agents, or TF-Agents for short, by creating a simple custom game to illustrate the concepts and the steps that go into building it with the framework.

Overview of the game

Reinforcement Learning is used in scenarios where a system has to make decisions based on the input it receives. To better understand this concept, let's take the example of a custom game that we will create, called 'Dog in a park'. The game is very simple: we have a dog, the main player of this game, and his name is Kiko. There are a few challenges ahead for Kiko. He is in a section of a park where he isn't supposed to be. But Kiko is a very greedy dog, and he got there by following the smell of the bones lying on the ground. Every second he spends there is risky, and he needs to get out of that section as soon as possible. To make matters worse, there are already some robots in that section securing the area, and if they see him, he is definitely in trouble. So the challenge for Kiko is to eat as many bones as he can find, as quickly as possible, before reaching the exit, all without being noticed by any of the robots.

State of the game, where the player/dog is in his initial position, the dotted squares are empty places the player can move to, and the position marked 'X' is the exit the player needs to reach to complete the game successfully.

Agent, Action, State and the Environment

Kiko, the main player of this game and the one we are trying to train, is what is called an agent. The section of the park Kiko is in is called an environment. Our agent interacts with the environment by exploring it, which basically means moving left, right, up or down from his current position, eventually collecting all the available bones and reaching the exit sign. The moves he makes, i.e. Left, Right, Up or Down, are called actions. An agent takes one action at each step, moving from one position to another, making a transition and changing the state of the environment.

The agent takes an action with value '1' to go from his initial position to the right, which changes the environment from state 0 to state 1.

Observation

Every time the agent takes an action, it gets to know where it now is in the environment; this is called an observation. An observation is the subset of the state that the agent receives, and it is given to the agent by the environment. If the environment is really simple, the state is the observation; when the environment is very complex, it is hard to map out and send every detail of the state to the agent, so in that case the agent might only know a few details about the new state of the environment. Here, in our case, the observation is the layout of the park: the position of the agent, the positions of the robots, the positions of the remaining bones and the location of the exit sign.

Rewards & Penalties

Reinforcement Learning is driven by rewards and penalties. If you are familiar with Deep Learning or supervised learning, you might know that to train a neural network, each example comes with an associated label, allowing the network to measure the difference between its predictions and the real targets. In reinforcement learning we don't have any labels; the agent needs to learn by exploring the environment on its own and figure out which actions it should perform to satisfy the rules of the environment. To make that possible, rewards and penalties are given to the agent. Usually rewards and penalties are scalar numbers given to the agent based on the actions it takes.

In our case, the first criterion is that we want Kiko to finish the game as quickly as possible, so for every action he takes (left, right, up or down) he receives a penalty of -0.3, forcing him to move quickly by choosing the optimal actions and reach the exit sign. As soon as he reaches the exit sign, he receives a reward of 10 and the game is complete. We also know that Kiko should collect as many bones as he can, so to give him that incentive he receives a reward of 1.0 every time he collects a bone. We also want Kiko to avoid being detected by the robots, so if he moves onto a position already occupied by a robot, he receives a penalty of -0.1 and is immediately out of the game. The same rule applies when Kiko makes a move outside the boundary of the environment, that is, to a position below the initial position of the agent or beyond the exit sign.

Action taken and reward received by the agent during the gameplay at each step.

In Reinforcement Learning there is a concept of discounted rewards, which motivates the agent to aim for the larger long-term reward it will receive if it takes the optimal actions and completes the game properly, instead of falling into the trap of chasing an immediate reward that seems bigger at the moment but in the long run hurts the total reward it will receive. In our case, we want Kiko to finish the game by reaching the exit sign while collecting as many bones as he can, and to avoid steps that may seem very promising at the time, like going for a nearby bone, if taking them would decrease the overall total reward he receives at the end of the game.

Deep Reinforcement Learning and Policies

In Reinforcement Learning we train our agent to maximize the rewards it will get by teaching it to choose the optimal action at each step. In Deep Reinforcement Learning, the task of choosing or predicting the optimal action at each step is done using Deep Neural Networks. When an agent learns to choose optimal actions at each step, we say it has developed a policy that lets it take those actions. So a policy is basically a rule that an agent develops, using learning algorithms, that allows it to choose optimal actions. In Deep Reinforcement Learning that policy is learned using Deep Neural Networks.

The Deep Neural Network used by the agent is like a brain that helps the dog decide which action to take.

In our case, the dog is the agent, and the deep neural network he uses acts like a brain that decides which action to take next.

Step, Episode and Iterations

Generally when training a Reinforcement Learning agent we come across these terms, so let me very briefly explain what each of them means. An agent takes one action at each step, and in general after taking one action the state can change to the next one. An episode contains several of these steps and ends either when the game is successfully completed or when, on some step, the agent violates a certain rule and the game ends. In our case, if Kiko moves onto the position of one of the robots while moving through the park, the game ends, which ends that episode. An iteration can contain multiple episodes.

Creating the game engine

Let's start by creating the game engine for this game using Python. It's good practice to always separate the pure game engine from the TF-Agents environment. The game engine will consist of all the game logic and nothing more, while the TF-Agents environment will only consist of the environment-related setup for Reinforcement Learning and will interact with the game engine directly for the game logic.

First of all, let's set up the criteria that define the legal and illegal moves the player can make. For that, let's create an Enum class with the proper variables.
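A minimal sketch of such an Enum, with member names chosen to match the outcomes described below (the exact names are assumptions), could look like this:

```python
from enum import Enum

class ActionResult(Enum):
    # One value per possible outcome of a single move.
    VALID_MOVE = 1      # the player moved onto an empty spot
    FOUND_BONE = 2      # the player landed on a bone
    FOUND_ROBOT = 3     # the player landed on a robot
    ILLEGAL_MOVE = 4    # the player tried to move outside the board
    GAME_COMPLETE = 5   # the player reached the exit sign
```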

Here, as you can see, we have 5 possible results for every move the player wishes to make at each step. We will define when to expect each of these results in our game engine.

Our game engine is a Python class which consists of these main methods:

  1. Init
  2. Reset
  3. Is Spot Last
  4. Move Dog
  5. Game Ended
  6. Game State

Init

Here all the initialization code is placed: code for setting up the state of the game, the initial location of the player, the locations of the bones and robots, and a flag variable that indicates whether the game has ended.

For the sake of simplicity, the state of the game is a one-dimensional array of length 36, where each position is either an empty spot the player can move to, a spot containing a bone, or a spot occupied by a robot or by the player itself. To keep the representation simple, the player's position holds the value 1, robot positions hold the value 2, bone positions hold the value 3 and empty positions hold the value 0. The exit sign is at index 35, the last position in the array. This representation is easy to construct, and we can also easily reshape it into a 6x6 array to visualize it as above when required, with the player, empty positions, robots and bones shown with their proper icons.
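A minimal sketch of this init code under the representation just described (the concrete starting, robot and bone positions here are only illustrative):

```python
import numpy as np

class GameEngine:
    def __init__(self):
        # 1-D board of length 36: 0 = empty, 1 = player, 2 = robot, 3 = bone.
        # Index 35 (the last cell) is the exit position.
        self._state = np.zeros((36,), dtype=np.int32)
        self._state[0] = 1                    # assumed starting spot of the dog
        for robot_position in (9, 14, 22):    # illustrative robot positions
            self._state[robot_position] = 2
        for bone_position in (3, 16, 28):     # illustrative bone positions
            self._state[bone_position] = 3
        self._game_ended = False
```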

Reset

It consists of all the reset code which allows the game engine to go back to the initial state if the game has ended.

Is Spot Last

This method checks if the selected position is the last position in the game state and returns true if it is so.
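Continuing the same class sketch, these two methods might look like this (reset simply rebuilds the initial layout by re-running the init code):

```python
    def reset(self):
        # Rebuild the initial board layout and clear the game-over flag.
        self.__init__()

    def is_spot_last(self, position):
        # The exit sign sits at the last index (35) of the state array.
        return position == len(self._state) - 1
```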

Move Dog

This method handles the movement of the player dog and is responsible for returning the proper ActionResult response for each move made.

First it checks whether the selected position is the last position of the game state; if so, it moves the player to that last position, returns the game-complete ActionResult and sets the _game_ended flag to true. Then it checks whether the selected position is invalid, i.e. the position number is less than zero or greater than 35; if so, the _game_ended flag is set to true and the illegal-move ActionResult is returned. Similarly, it checks whether the given position contains a robot or a bone, or whether it is just a normal move to an empty spot, and returns the proper response while setting the _game_ended flag accordingly.
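Under the assumptions above, a sketch of move_dog could look like this:

```python
    def move_dog(self, new_position):
        # Current index of the player on the board.
        current_position = int(np.where(self._state == 1)[0][0])

        # 1. Exit sign reached: move the player there and finish the game.
        if self.is_spot_last(new_position):
            self._state[current_position] = 0
            self._state[new_position] = 1
            self._game_ended = True
            return ActionResult.GAME_COMPLETE

        # 2. Outside the board: the game ends immediately.
        if new_position < 0 or new_position > len(self._state) - 1:
            self._game_ended = True
            return ActionResult.ILLEGAL_MOVE

        # 3. Robot, bone, or an ordinary empty spot.
        result = ActionResult.VALID_MOVE
        if self._state[new_position] == 2:
            self._game_ended = True
            result = ActionResult.FOUND_ROBOT
        elif self._state[new_position] == 3:
            result = ActionResult.FOUND_BONE

        self._state[current_position] = 0
        self._state[new_position] = 1
        return result
```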

Game Ended and Game State methods

These methods return the current _game_ended flag value and _state value when called.
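They can be simple accessors, for example:

```python
    def game_ended(self):
        return self._game_ended

    def game_state(self):
        return self._state
```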

With that we have successfully created our simple game engine. Now, let’s create a TF-Agents custom environment that will use this game engine.

TF-Agents Custom Environment

The TF-Agents custom environment is a Python class that extends the PyEnvironment class from tf_agents.environments.py_environment.

Above we discussed observations, rewards and actions. In addition to these concepts, TF-Agents uses what it calls a TimeStep. A TimeStep contains four things that represent the current state of the environment. The first two are the observation and the reward, the same as described above. In addition, it returns a step_type that indicates where in the episode the game currently is: 0 for the first step of the episode, 1 for intermediate steps and 2 for the final step. This lets us check whether the episode has ended. The TimeStep also contains an attribute called discount, which holds the discount factor to use at the returned time step; a discount factor of 1 means there is no discounting at all.

We need to provide the following 5 methods in this class to make it a TF-Agents environment.

  1. Init
  2. Reset
  3. Action Spec
  4. Observation Spec
  5. Step

Init and Reset

Init consists of the initialization code that sets the action spec, the observation spec, the game state and anything else the custom environment needs. The action spec and observation spec are specifications for the action space and observation space of the environment. In our case the observation spec describes a one-dimensional array of length 36 in which each value is an integer between 0 (for empty positions) and 3 (for bone positions), i.e. the values 0, 1, 2 and 3. In TF-Agents this can be specified using array_spec.BoundedArraySpec, which can be imported from tf_agents.specs. Similarly, the action spec is also a BoundedArraySpec, allowing integer values between 0 and 3 that signify the direction the player wishes to move. There is also a _game variable that holds the game engine we just created and gives access to the current game state.

In addition to these, there is one custom variable called _action_values, a dict that maps the action taken by the agent to the proper position on the board, so that the agent can be moved to that new position.
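Putting these pieces together, a sketch of the environment's init code might look like the following (the class name and the action-to-offset mapping, up/right/down/left on the flattened 6x6 grid, are assumptions):

```python
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class DogParkEnvironment(py_environment.PyEnvironment):
    def __init__(self):
        super().__init__()
        # Four discrete actions: the direction the player wants to move in.
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=3, name='action')
        # The observation is the flattened 6x6 board with values 0..3.
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(36,), dtype=np.int32, minimum=0, maximum=3, name='observation')
        # The pure-Python game engine defined earlier.
        self._game = GameEngine()
        # Assumed mapping from action id to offset on the flattened 6x6 grid:
        # 0 = up, 1 = right, 2 = down, 3 = left.
        self._action_values = {0: -6, 1: 1, 2: 6, 3: -1}
```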

In the reset method we reset the game by calling the game engine's reset method, and we call TF-Agents' TimeStep restart method to restart the TimeStep with the new game state.
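A minimal version of that, continuing the class sketch above:

```python
    def _reset(self):
        # Put the game engine back into its initial state and start
        # a fresh episode with the new observation.
        self._game.reset()
        return ts.restart(np.array(self._game.game_state(), dtype=np.int32))
```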

Action Spec & Observation Spec

These methods return the environment's action and observation specs.
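In code, they are simple getters for the specs defined in init:

```python
    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec
```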

Step Method

The step method is responsible for the main function of the environment: it handles how the environment should respond at each step. First it checks whether the game has ended; if so, it resets the environment, which in turn calls the game engine's reset method. Then it maps the action provided by the agent to the proper board position and passes that to the move_dog method of the game engine to place the player there. Finally, based on the response from that method, it either transitions to the next step or terminates the episode with the proper rewards and penalties.
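A sketch of the step method, wiring up the reward values described earlier (-0.3 per ordinary move, +1.0 for a bone, +10 for reaching the exit, -0.1 for hitting a robot or making an illegal move); how exactly the original combines these values is an assumption:

```python
    def _step(self, action):
        if self._game.game_ended():
            # The previous episode is over: start a new one.
            return self.reset()

        # Translate the chosen direction into a new board position.
        state = self._game.game_state()
        current_position = int(np.where(state == 1)[0][0])
        new_position = current_position + self._action_values[int(action)]

        result = self._game.move_dog(new_position)
        observation = np.array(self._game.game_state(), dtype=np.int32)

        if result == ActionResult.GAME_COMPLETE:
            return ts.termination(observation, reward=10.0)
        if result in (ActionResult.FOUND_ROBOT, ActionResult.ILLEGAL_MOVE):
            return ts.termination(observation, reward=-0.1)
        if result == ActionResult.FOUND_BONE:
            return ts.transition(observation, reward=1.0, discount=1.0)
        # An ordinary move onto an empty spot.
        return ts.transition(observation, reward=-0.3, discount=1.0)
```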

After defining the game logic and the TF-Agents environment specification, let's create the environment. We also need to validate the newly created environment to check that everything we have written is okay and does not throw any errors while running the agent for some number of episodes. To do that we can use the validate_py_environment method from tf_agents.environments.utils.
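For example:

```python
from tf_agents.environments import utils

environment = DogParkEnvironment()
# Runs a few episodes with random actions and raises an error if the
# returned time steps do not match the declared specs.
utils.validate_py_environment(environment, episodes=5)
```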

After validating the environment, we need to convert it into a TensorFlow environment. While doing so we generally create two environments: one used for training the agent and one for evaluating it after training.
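For instance:

```python
from tf_agents.environments import tf_py_environment

train_env = tf_py_environment.TFPyEnvironment(DogParkEnvironment())
eval_env = tf_py_environment.TFPyEnvironment(DogParkEnvironment())
```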

TF-Agents Architecture

TF-Agents Training Architecture (Source: https://www.youtube.com/watch?v=U7g7-Jzj9qo)

We now have a general idea of how an agent learns in a Reinforcement Learning environment. TF-Agents provides its own architecture that lets us quickly train such an agent using the environment and the rewards and penalties we set up. Let's break down the TF-Agents architecture above and go through it piece by piece.

Policy model & Agent

The policy model we are going to use is called a Deep Q-Network, or DQN for short. This neural network takes the observation, or in our case trajectories (a concise representation of the transition from one step to another), and outputs the optimal action the agent can take to maximize the reward. The DQN we will use has 3 hidden layers with 32, 64 and 128 output neurons respectively. Creating a DQN in TF-Agents is a very straightforward process: we use QNetwork to create such a network, passing the sizes of the hidden layers as an argument along with our environment's observation spec and action spec (q_network can be imported from tf_agents.networks).
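A sketch of that, assuming the TF environments created above:

```python
from tf_agents.networks import q_network

q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(32, 64, 128))  # the three hidden layers described above
```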

To train this QNetwork, we now need to set up our DQN agent. The agent will optimize the QNetwork using the optimizer of our choice and the specified loss function. The agent is also responsible for balancing the player's exploration of new actions and movements with the exploitation of known optimal actions that have worked before.

In Reinforcement Learning there is a concept of exploration vs. exploitation, which basically means we need to find a good balance between having the player explore new avenues, so it can learn new actions and policies, and having it exploit the optimal actions it has already learned, so that the rewards keep increasing.

In our case, we use an epsilon-greedy policy to do that: with probability ε (epsilon) the agent takes a random action, and with probability 1-ε it follows the policy learned by the neural network. Initially ε is set to 1, which means the agent ignores everything learned by the neural network and takes purely random actions, leading to rapid exploration. As training progresses, ε is decreased and eventually set to 0.01, which means that in the later part of training the policy learned by the neural network is used almost exclusively, with only a small chance of a random exploratory action. By the end of the training process we can assume the agent has explored many random actions and is now using the neural network policy to maximize the rewards.
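A sketch of the agent setup with an epsilon schedule that decays from 1.0 to 0.01 as described (the optimizer, learning rate and decay horizon are assumptions):

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.utils import common

train_step = tf.Variable(0)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # assumed learning rate

# Decay epsilon from 1.0 (pure exploration) down to 0.01 over training.
epsilon_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1.0,   # initial epsilon
    decay_steps=100000,          # assumed decay horizon
    end_learning_rate=0.01)      # final epsilon

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step,
    epsilon_greedy=lambda: epsilon_fn(train_step))
agent.initialize()
```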

Replay Buffer

To train our agent, the agent gets trajectories (the input for the network) from the replay buffer. The replay buffer acts as a repository of trajectories: when required, the agent retrieves batches of trajectories from previous gameplay, which the DQN agent uses to train its policy model.

It's always a good idea to make the replay buffer as large as possible so that many past trajectories can be stored, providing the agent with diverse past experiences to learn from.

To get these trajectories into the replay buffer, we use another component called an observer, which calls the replay buffer's add_batch method to add them to the buffer.
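A sketch, assuming the agent created above (the buffer capacity is an assumption):

```python
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=100000)  # assumed capacity; larger stores more past experience

# The observer that writes collected trajectories into the buffer.
replay_buffer_observer = replay_buffer.add_batch
```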

Driver and Observer

The driver is responsible for going through the environment and collecting trajectories based on the agent's learned policy. Basically, the driver takes the agent's collect policy, steps through the environment and collects trajectories with it. After collecting the trajectories, it passes them to the observers, and from that point on it is the observers' responsibility to do whatever needs to be done with them. In our case, the first observer takes the trajectories and adds them to the replay buffer.

We also need some metrics to know how our training is progressing, so we add training metrics as additional observers that receive the same trajectories. The metrics used here are available under tf_agents.metrics: the first returns the average return and the second the average number of steps taken per episode.
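A sketch of the driver with the replay-buffer observer and the two metrics attached (the number of steps collected per call is an assumption):

```python
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics

train_metrics = [
    tf_metrics.AverageReturnMetric(),
    tf_metrics.AverageEpisodeLengthMetric(),
]

collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer_observer] + train_metrics,
    num_steps=1)  # assumed: collect one environment step per call
```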

This driver is used while training the agent during our training loop.

Now, let's first collect some samples from the environment using a random policy to fill up the replay buffer, so that when we start the training process the agent has sufficient examples to draw from.
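One way to do this warm-up collection, assuming the components above (the number of initial steps is an assumption):

```python
from tf_agents.policies import random_tf_policy

initial_collect_policy = random_tf_policy.RandomTFPolicy(
    train_env.time_step_spec(), train_env.action_spec())

init_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    initial_collect_policy,
    observers=[replay_buffer_observer],
    num_steps=1000)  # assumed number of warm-up steps
init_driver.run()
```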

Now let's create a dataset out of the replay buffer we just filled with trajectories. To do that we can simply call the as_dataset method, which turns it into a tf.data dataset.
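For example (the batch size and the number of parallel calls are assumptions):

```python
dataset = replay_buffer.as_dataset(
    sample_batch_size=64,  # assumed batch size
    num_steps=2,           # pairs of adjacent steps, i.e. full transitions
    num_parallel_calls=3).prefetch(3)
```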

Training the agent

Now, let's train our agent using the trajectories collected in the replay buffer.

In our case we train the agent for 150000 iterations.
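A sketch of such a training loop, building on the driver, dataset and metrics created above (the logging interval is an assumption):

```python
from tf_agents.utils import common

# Wrapping train() in a tf.function speeds up the loop considerably.
agent.train = common.function(agent.train)

iterator = iter(dataset)
time_step = None
policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)

for iteration in range(150000):
    # Collect a step with the current (epsilon-greedy) collect policy.
    time_step, policy_state = collect_driver.run(time_step, policy_state)
    # Sample a batch of transitions and take one gradient step on the DQN.
    trajectories, _ = next(iterator)
    train_loss = agent.train(trajectories)
    if iteration % 1000 == 0:
        print(iteration, train_loss.loss.numpy(),
              [m.result().numpy() for m in train_metrics])
```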

Now, let's look at the curves obtained from the above training process and see how well our agent performed during training.

Average Return vs Average Episode Length

As we can see, the training fluctuates a lot and is definitely not ideal. But we can see the general trend and understand that the agent is in fact gradually learning: on average the Average Return is increasing, and the Average Episode Length stays, most of the time, between values that are neither too low nor too high.

Evaluating and Visualizing

We can evaluate our model on the evaluation environment we created earlier, and we can also visualize an episode to see how our player takes each step and completes the game.

To visualize the gameplay, we can replace the zeros with the empty-space icon, the 1 with the dog icon, the 2s with the robot icon and the 3s with the bone icon, reshape the 1-D array into a 6x6 array and display it using a pandas DataFrame.
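A sketch of such an evaluation-and-rendering loop, assuming the trained agent and eval_env from above (the icon choices are arbitrary):

```python
import numpy as np
import pandas as pd

ICONS = {0: '⬜', 1: '🐶', 2: '🤖', 3: '🦴'}  # assumed icon choices

def render_board(observation):
    # Map the flat 36-value observation to icons and reshape it to 6x6.
    board = np.array([ICONS[int(v)] for v in observation]).reshape(6, 6)
    return pd.DataFrame(board)

# Play one evaluation episode with the trained (greedy) policy.
time_step = eval_env.reset()
total_reward = 0.0
steps = 0
while not time_step.is_last():
    action_step = agent.policy.action(time_step)
    time_step = eval_env.step(action_step.action)
    total_reward += float(time_step.reward)
    steps += 1
    print(render_board(time_step.observation.numpy()[0]))

print('Steps taken:', steps, 'Total reward:', total_reward)
```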

The result is a step-by-step visualization of each action taken by our player, the reward he received at each step, and how long it took him to finish the game along with the final reward.

Step by Step visualization of the evaluation process.

It took our player a total of 14 steps to complete the game, and the total reward he got was 12.6.

As you can see, it's very straightforward to train Deep Reinforcement Learning agents with TF-Agents; you can easily evaluate the results with many built-in metrics and, as shown in this blog, easily visualize the evaluation results as well.

I hope you enjoyed this post, and if you have any comments or feedback please leave them so that I can improve my understanding further. Stay safe and have fun using TF-Agents!

Final code can be found here:

Here is part 2 of this post:

References

  1. https://www.mikulskibartosz.name/categories#Reinforcement-learning
  2. https://github.com/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb
  3. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, Aurelien Geron (Chapter 18: Reinforcement Learning)
  4. http://tensorflow.org/agents/
  5. https://www.youtube.com/watch?v=U7g7-Jzj9qo
