Ultimate Guide to Reinforcement Learning Part 2 — Training

Daniel Brummerloh
Towards Data Science
12 min read · Nov 12, 2020


In this article series, we build our own environment. Later, we train a neural network using reinforcement learning. Finally, we create a video showing the AI playing the environment.

The complete code of the environment, the training and the rollout can be found on GitHub: https://github.com/danuo/rocket-meister/

What we will cover:

Part 1 — Creation of a playable environment with Pygame

Link: https://medium.com/@d.brummerloh/ultimate-guide-for-reinforced-learning-part-1-creating-a-game-956f1f2b0a91

  • Creating an environment as a gym.Env subclass.
  • Implementing the environment logic through the step() function.
  • Acquiring user input with Pygame to make the environment playable for humans.
  • Implementing a render() function with Pygame to visualize the environment state.
  • Implementing interactive level design with Matplotlib.

Part 2 — Starting the Training

  • Defining suitable observations while understanding the possibilities and challenges.
  • Defining suitable rewards.
  • Training neural networks with the gym environment.
  • Discussing the results.

This is the second part of the series, covering the training of the neural network. Before we can start the training though, we will have to further specify the API between the environment and the AI.

Requirements

As the model we are going to train is relatively small, the training can be performed on a consumer-level desktop CPU in a reasonable amount of time (less than a day). You do not need a powerful GPU or access to a cloud computing network. The Python packages used in this guide are listed below:

Python 3.8.x
ray 1.0
tensorflow 2.3.1
tensorflow-probability 0.11
gym 0.17.3
pygame 2.0.0

Observation

The observation is the feedback given by the environment back to the agent or the neural network. It is really the only thing the agent can see to derive its next action. More importantly, the agent does not have a memory: its decision is based solely on the observation of the current state.

Defining suitable observations is essential to achieve good training results. In our present example, defining the observation might be a trivial task, and yet there are several options we are going to explore. This might not be the case in other machine learning projects, where the development of suitable observations can be a challenging and crucial task.

Before we discuss the requirements of suitable observations, let's work with the most intuitive approach first: since the rocket shall not crash into the boundary, it makes sense to use the clearance as an observation. Therefore, we calculate the distance between the rocket and the boundary at various angles (-90°, -60°, …, +90°), as seen in the following picture.
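As a rough sketch, the measurement could look like this in Python. The helper is_inside() and the fixed step size are assumptions made for this sketch; the actual environment uses its own collision geometry:

import numpy as np

def distance_observations(x, y, rotation, is_inside,
                          angles_deg=(-90, -60, -30, 0, 30, 60, 90),
                          max_dist=1000, step=2):
    # march along each ray until we leave the track or reach max_dist
    distances = []
    for offset in angles_deg:
        direction = np.radians(rotation + offset)
        d = 0
        while d < max_dist and is_inside(x + d * np.cos(direction),
                                         y + d * np.sin(direction)):
            d += step
        distances.append(d)
    return np.array(distances)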

Normalization

Now, we need to make sure that the value range of each observation is [-1, 1]. This procedure is called normalization and is not mandatory. However, most neural networks benefit from normalized values, because they typically use squashing activation functions such as the hyperbolic tangent (tanh), for which a normalized value range is numerically more suitable.

Normalization of the observation.

One way to achieve this normalization is linear interpolation, which is easy to implement with the following NumPy function:

import numpy as np
obs_norm = np.interp(obs, [0, 1000], [-1, 1])

Uniqueness of Observations

In mathematical terms, think of the model as a deterministic function f that calculates the actions [a] based on the observations [o]. In this example, there are n observations and 2 actions:
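a = f(o),   with   o = [o1, …, on]   and   a = [a1, a2]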

There is a lot of theory to explore here, but the important implication is as follows: if two different situations or states require two different actions for success, then their respective observations have to be different too. Only when the observations differ can the agent produce two different actions. So what does that mean exactly? Let's have a look at an example:

The two scenarios displayed in the picture below show the rocket at the very same position. Therefore, the distances [l1,…,l7] (the observations) are identical. However, the rocket in the left scenario has a much higher velocity. Since the velocity is not part of the observations, the agent does not know that the rocket is too fast.

Different states with equal observations (observation set 7)

The appropriate actions to perform in each scenario would be:

  • Left scenario: Decrease speed, turn right.
  • Right scenario: Increase / maintain speed, turn right.

Since the observation is the same for both scenarios, the neural network will inevitably perform the same action in both. Therefore, it cannot choose the adequate action for both scenarios; at most one of the two can be handled correctly. Consequently, using the distances alone as the entire observation o=[l1,…,l7] is not a good idea. For testing purposes, we will still make this our first iteration of the observations.

First iteration of observations o7:
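o7 = [l1, l2, …, l7]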

Next, we will extend this set. From the previous consideration, we already concluded that the rocket's speed needs to be known to the neural network. Therefore, we add the speed magnitude to the observations:

Second iteration of observations o8:
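o8 = [l1, …, l7, v], with v being the magnitude of the velocity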

Yet again, we can easily come up with two scenarios that require different actions despite yielding equal observations. The direction in which the velocity points cannot be deduced from this set of observations. Thus, the moving direction is unknown and another observation is required.

Different states with equal observations (observation set 8)

Obviously, we need to make the velocity within the observation directional. As a small remark, simply passing the velocity in x- and y-direction would not work: the absolute orientation of the rocket is not known either, so the problem would only be shifted to another one. Therefore, the relative angle between the rocket's orientation and its velocity is proposed as an additional observation:

Working with angles is a little tricky. First of all, we need to decide whether we want to work with degrees or radians. Second of all, if the angle does not lie within the range of -180° < α < +180°, we need to shift the value by 360° to get back into that range.
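A minimal sketch of such a wrap in Python, assuming we work with degrees:

def wrap_angle(angle_deg):
    # shift any angle into the range [-180, 180)
    return (angle_deg + 180.0) % 360.0 - 180.0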

Third iteration of observations o9:
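o9 = [l1, …, l7, v, α], with α being the relative angle between the rocket's orientation and its velocity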

Last but not least, we will provide some kind of navigation aid. For the reward function, we will later define goals along the track that yield reward when reached. The direction of the next goal is given by the vector perpendicular to the next goal line, indicating the direct route to said goal. The angular difference between the rocket's orientation and this goal vector is the 10th and final observation:

Fourth iteration of observations o10:
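o10 = [l1, …, l7, v, α, β], with β being the angular difference between the rocket's orientation and the direction of the next goal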

Before we evaluate the different sets of observations, keep in mind that the neural network does not know the meaning or context of the defined observations. However, it doesn't have to: the goal of machine learning is to find numerical correlations between observations and successful actions. For this, the context of the data is irrelevant.

Evaluation of observations

All four variants have been tested by conducting training with the SAC agent for 2 million steps in total. Don't worry, we will go through the steps of starting the training later. For now, let's have a look at the consequences induced by the choice of observations. Looking at the graphs, we can see that o9 and o10 perform much better than the other two. The set o9 yields the best results, which comes as a small surprise. However, we should not jump to conclusions and keep the following in mind: first of all, 2 million steps is not much. While the graphs of o7 and o8 appear to converge, the final performance of o9 and o10 cannot be determined by this test. It is very much possible that o10 outperforms o9 after longer training. Even the stagnation of o7 and o8 is uncertain. If this weren't just a fun project, longer training should be considered.

Performance comparison between the four different sets of observations.

Reward function

As stated earlier, the reward function is part of the environment and is the quantity the agent tries to maximize in reinforcement learning. Coming up with a good reward function is harder than you might think. A problem occurs when maximizing the reward function does not perfectly align with what you actually want the AI to do. Designing a reward function that only values behavior matching your intention is surprisingly difficult.

There is a really good video by Rob Miles [YouTube] about an AI designed to collect stamps that might start a world war in the process. The video is entertaining and insightful at the same time.

Back to the definition of our reward function: let's assume we give more reward at higher velocity. It doesn't take a lot of creativity to predict that the agent will most likely launch into the wall at the highest speed possible. Sure, a slower trajectory that actually passes through the course would yield a higher total reward. It is always tedious, though, to abandon the currently most successful strategy in order to explore a totally different approach. In mathematical terms, the high-speed crash approach is a local maximum in the space of all possible approaches.

It is a fundamental capability of agents to break away from local maxima and to further optimize their strategy. However, it is not guaranteed that an agent actually manages to do so. Therefore, it would be wise not to create this kind of obstacle in the first place, by not including the velocity in the reward function or by not giving it a lot of weight. Instead, we choose to set up checkpoints that give reward when passed.

Setting up checkpoints

To calculate the rewards, I have implemented a total of 40 checkpoints along the track. They are an easy way to measure progress along the track and also allow completed laps to be counted.

Furthermore, we propose three variants to derive reward from the checkpoints passed by the rocket. The variants are listed below and compared afterwards:

Variant 1: Static reward for each checkpoint

The static variant will reward each checkpoint reached with a constant amount.

# for each goal 1 point
reward_total += 1

Variant 2: Dynamic reward for each checkpoint

The dynamic variant rewards each checkpoint reached with up to one point. The longer it takes to reach the next checkpoint, the more is deducted from that point. Depending on the time taken, the reward varies between 1 and 0.9 points. The variable steps holds the number of steps executed since reaching the last goal. This variant therefore indirectly depends on the rocket's velocity.

# for each goal, give between 0.9 and 1 points,
# depending on the number of steps since the last goal
reward_total += max(450, 500 - steps) / 500

Variant 3: Continuous reward for each goal

Similar to variant 1, this continuous variant will reward each goal with exactly one point. However, the reward is allocated continuously in every step performed. If the rocket is located exactly between goals 3 and 4, the total reward gained so far is 3.5 points. When going backwards, the reward is reduced accordingly. The calculation of this reward variant is somewhat tedious, but you can have a look at the code in the GitHub repository.
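The repository's implementation handles more edge cases, but a simplified sketch of the idea could look like this (the checkpoint array, its indexing and the function name are assumptions of the sketch):

import numpy as np

def continuous_progress(pos, checkpoints, last_idx):
    # total continuous reward collected so far: the number of goals already
    # passed plus the covered fraction of the segment towards the next goal
    a = checkpoints[last_idx]                           # last goal passed
    b = checkpoints[(last_idx + 1) % len(checkpoints)]  # next goal
    ab = b - a
    # project the rocket position onto the segment between the two goals
    t = np.clip(np.dot(pos - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return last_idx + t   # e.g. 3.5 exactly between goals 3 and 4

The reward given in a single step would then be the difference of this value between two consecutive steps, which becomes negative when the rocket moves backwards.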

Comparison

As we can see, after long training all reward functions give similarly good results. All variants result in a neural network that can finish the course reasonably well. This does not come as a big surprise, as the reward functions are very similar. We can see, though, that training with the dynamic reward had a higher mean reward for a significant part of the training.

Training

To train a neural network on our environment, we will be using Reinforcement Learning (RL), one of the domains of Machine Learning. While there are many ways to deploy RL, we will be using a framework called Ray.

The core functionality of Ray is to provide a multiprocessing framework, allowing code to run in parallel on multiple CPU cores or even on multiple machines. This is very helpful, as it enables us to train our neural network with multiple agents / environments at the same time. Ray also happens to include a Reinforcement Learning library, so that we can conduct training with little to no programming involved. We just need to know the API, which admittedly is barely documented and complex at times.

Installing Ray on Windows 10, OSX and Linux

Ray was just released in version 1.0, finally adding the long-awaited support for Windows 10. Now, you can install Ray on any major platform by running a simple pip install:

pip install --upgrade ray

Getting Started with Ray

The main reason for us to use Ray is the included library RLlib, which is dedicated to RL. It has a great number of state-of-the-art machine learning agents implemented. As seen in the overview below, most agents support both TensorFlow 2 and PyTorch. It is entirely up to you which framework you want to use. If you are unsure, simply go with TensorFlow.

Machine Learning agents provided by RLlib.

Something great about Ray is that a lot of training agents are implemented already, and all of them can be used in the same way. This allows us to train the network with 14 different agents by changing only a single string.

The documentation of Ray can be a bit overwhelming and confusing, with a ubiquitous lack of examples. Still, the library is very powerful and totally worth learning, as tedious tasks can be performed in a small number of steps. If we were to start training on the Gym environment CartPole-v0 using the algorithm PPO, all we have to do is execute these two lines of code:

from ray import tune
tune.run('PPO', config={"env": "CartPole-v0"})

If you run into an error, you are most likely missing the tensorflow-probability package. To install it, run:

pip install --upgrade tensorflow-probability

To train the network on a custom environment (i.e. an environment that is not part of the gym package), we need to modify the env keyword inside the config dictionary. Instead of the environment's name string, we can pass the environment class. For reference, see the code in start_ray_training.py:

https://github.com/danuo/rocket-meister/blob/master/start_ray_training.py
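As a rough sketch of what this looks like (the class name and config values below are placeholders; see start_ray_training.py for the actual ones):

from ray import tune
from rocket_gym import RocketMeister10  # placeholder import, see the repository

tune.run(
    'PPO',
    config={
        "env": RocketMeister10,   # pass the class itself instead of a name string
        "num_workers": 4,         # number of parallel environment workers
        # "framework": "torch",   # optionally switch from TensorFlow to PyTorch
    },
)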

Results

Eventually, 9 different agents were used to train different neural networks. Not all agents from the previously shown list could be employed, as some agents kept crashing and others were only suited for discrete actions (remember, we are using continuous actions). Here are the results of the training:

We see that some agents perform quite well and manage to pass through the course better than a human could. Also keep in mind that all agents are untuned, i.e. they are run with default parameters. Their performance might improve considerably when other parameters are chosen.

Conclusion

There are other factors that make a direct comparison between the agents difficult. In this testing scenario, each agent is trained for 15 million steps in total. Note that not all of the agents compute equally fast. If I had run each training for the same amount of time, the results could be different. Also, the training probably did not converge for most of the agents, so longer training could improve the policies further. Fast-learning agents are great, but one could argue that the final performance after very long training is more important.

Also be advised that SAC is not as superior as it might seem from the shown tests. The SAC agent does score significantly higher than the other agents. However, these arguably great results are most likely subject to overfitting: the agent memorizes the track instead of actually learning how to control the rocket. If the environment is modified, the agent will probably fail, as it has no general knowledge about maneuvering in unknown surroundings. To prevent overfitting, training should be conducted in a dynamic environment that changes in each iteration. The rocket-meister environment features a level generator, so maybe you want to experiment with that! For further information, have a look at the readme:

Thanks for reading!
