My first experience with deep reinforcement learning

Image from

Note: This article assumes previous knowledge on the basics behind neural networks and Q-learning

About six months ago I saw myself in the need of deciding a topic for my undergraduate thesis project. Since there wasn’t much of AI in my major’s curriculum I chose to do research in that field to gain some knowledge. Now, I had to decide which AI subtopic I wanted to work on and it clearly became clear to me which one it should be.

I have always been fascinated by neural networks and their ability to learn to approximate any function at all. I have always thought that this is an absolutely remarkable feature since many (if not all) problems can be modeled as a function (i.e. something that takes some input, does some processing, and produces some output). It seems to me that, while we are still far from getting there, neural networks could play a very important role in the path toward the ultimate goal of reaching a general AI.

On the other side, in the last years a small company called DeepMind — now owned by Google — had shown great advances in reinforcement learning, and specifically what it is calling deep reinforcement learning (i.e. combining neural networks with reinforcement learning). In the case of Q-learning the principle behind this is that since neural networks are very good function approximators then why not use them to approximate the Q-function? Deep learning with Q-learning is a very cool concept since other techniques that were used before to approximate the Q-function quickly became unfeasible once the state representation grew in dimensionality. Using the described technique enabled DeepMind to make an algorithm capable of playing many Atari games better than professional human players while not explicitly coding the logic and rules of each game [1]. In other words, the algorithm learned by itself what it was best to do by just looking at the pixels of the game, the score, and given the ability to choose an action (i.e. manipulate the controls of the game) like any human player would be able to.

But more than this, reinforcement learning is another of the fascinating sides of machine learning since it resembles the way we humans learn. Everything we do in life gets us a reward in return, be it positive or negative. Doing a good job will get us the approval of our colleagues, our boss, money, or even a smile from who we benefited with the job. Those things feel good, our brains release dopamine so we want to do them again, a positive reward. But getting into a car crash doesn’t feel good, so the next time we will try to be more careful since we don’t want that to happen again. We want to maximize our rewards, we learn, and we do it by reinforcement. With experience we get better at doing something, just as reinforcement learning algorithms do.

That said, with both deep learning and reinforcement learning we can model a huge variety of the problems we as humans face every day, and this is what makes them very interesting. These techniques are what power systems like autonomous cars for example. Could they be the answer to achieving a general AI? Only time will tell, but they are certainly getting us to interesting things.

Now, for the project…

From the results that DeepMind published in one of its papers, one of the graphs looked like this:

Comparison of DeepMind's DQN with the best reinforcement learning algorithms in the literature [1]

The above graph shows how DeepMind’s algorithm performed with respect to “the best reinforcement learning methods in the literature”. However, the interesting part is the line in the middle which shows how the algorithm performed in comparison to professional human players. The performance of the algorithms were normalized with respect to the performance of the human players (100% level). As you can see the performance of the algorithm with games like Ms. PacMan was really low. They don’t specifically mention the reason behind this but it seems to be related to the relatively long-term planning that the game requires, combined with the fact that Q-learning as it is commonly implemented is known to have these kinds of temporal limitations.

After reading the publication some questions came to me from the approach that DeepMind was having, specifically with the fact that they were were using the very pixels of the game as the state representation. This is remarkable since it is the same information that our brains receive as input, and it is also very good in the sense that it generalizes very well for other games. However, I had the doubt about what would happen if we gave the agent more “calculated” information, i.e. a state representation composed of information other than the pixels. What kind of impact did the state representation have in the learning process and result? This is when I decided to work with a game (PacMan), write a deep Q-learning agent in Python and look for answers.

To clarify, I wasn’t pretending to improve the performance that DeepMind achieved in the games below the human line. What I’m trying to show is how the questions that turned into my senior thesis came to be, and it was that “human-level” line that sparked those questions in me.

Now, finding what was the impact of changing the state representation was the main objective of the research at first, however, one more question arose during the process that I thought was worth investigating, namely: how did (or if) both, varying the topology of the neural network and having a persisted experience replay file before beginning training affected the learning process and the result of the algorithm.

To expand a little on the second part, experience replay is one of the tricks that has been discovered to be one of the most important optimizations to make that will enable the neural network to learn in a reasonable time (or even converge). This is because this technique breaks the concept of continuity between any two transitions while giving the network a chance to also reinforce its knowledge of previous experiences more efficiently. What I wanted to know was, given that experience replay is helpful, could it be also helpful to have a large pre-populated (persisted) experience replay memory right from the start? Could this help the algorithm to get to convergence faster than having to populate the replay memory each time from zero?

At this point I would be telling you much of what is written in the report [2], so I would encourage you to read the paper if you find the research interesting. In a nutshell though, I discovered that all three aspects affect the learning process considerably. Firstly, I could see that the state representation should be as simple as possible (but complete) since simpler state representations are considerably easier to train on. Secondly, I found that having a pre-calculated, large, persisted replay memory has the potential to improve learning speed notably but one should take some precautions so having one does not bias what the agent learns (e.g. when past experiences greatly outnumber new experiences). Lastly, I could also see that changing the topology of the neural network does have an important impact in the learning result. The hypothesis that I could extract is that larger networks take considerably longer to train and were not able to train on time, so one should choose an appropriate topology (i.e. one that is complex enough to be able to approximate the Q-function for the particular problem, but not more complex than that).

The experience

I had a lot of fun with this project, but what’s more, I learned a lot. Now, from my experience, I think reinforcement learning has the potential to be very powerful, especially in combination with neural networks. However, this combination is what can make the process a little frustrating if you are expecting your model to learn in a matter of a couple of hours and then win every game you play against it. The reality is that neural networks, while very powerful, can require very fine tuning to actually learn something. But more than this, you have to be very careful, for example, with the parameters you choose, the topology, and the activation functions as some of these aspects can represent the difference between a neural network that does a very good job in a reasonable time and a neural network that doesn’t learn anything. In summary, getting a good model can require many optimizations and dedication, however when you achieve one the results can be very surprising (as groups like DeepMind have shown us).

Another aspect that can get difficult to handle is the computational complexity of training a model of this kind. If you have a good GPU sitting on your desk you can improve training time by quite a lot, since neural networks are really benefitted by massive parallelism. However, not many have a GPU to spare, so testing can become a little tedious as each training session can take several hours or even days depending on the problem.

To sum up, you will learn a lot from doing different experiments. Nonetheless, if you plan to get immersed in deep reinforcement learning (puns not initially intended) I would recommend to first have a good understanding of neural networks, some patience, and a good machine / GPU could also be very helpful depending on the problem.

Finally, why PacMan?

Since it’s a game I really like — I mean, who doesn’t?