Video games. Something I’ve spent hours and hours on. I play them almost every day. But sometimes it’s just impossible to beat that next level. As a kid interested in AI and a fellow gamer, I decided to create a machine learning model to beat all the games I couldn’t.👍
Before we start, try this cool trick: type “atari breakout” into Google and go to Images!
To create an algorithm to play Atari games at an expert level I got some help from Deep Q learning. But before I even get into that we need to cover some basics to understand what’s going on. If you’re new to machine learning check out this article first!
Quick Run Through On Reinforcement Learning
Reinforcement learning is a type of AI that uses a reward and punishment system where the goal is to maximize the reward received. There are 4 main components to this type of learning: The agent, environment, action, and reward.
Let’s try to understand this better first.
Imagine you’re a 10-year-old kid who needs to study for an upcoming math test. As a 10-year-old you don’t want to study, but if you do, your mom will give you candy for every worksheet you complete.
In this example, you, the kid, are the agent; the upcoming math test represents the environment; each question or worksheet represents a state; deciding whether to work or not is the action; and the reward is getting candy every time you finish a worksheet.
Q-learning is a way for the agent to decide what action to take. At first, the agent has no idea what to do. Since this isn’t supervised learning and the agent has no past experience, it won’t know which actions maximize rewards (think back to the kid example: he can’t earn the candy if he doesn’t know what he’s getting it for). This is where Q-tables come in!
Since the agent isn’t handed a policy up front in Q-learning, the goal of a Q-table is to find the best policy, or course of action, at each state. The Q in Q-table stands for “quality” (the quality of the action). Essentially, the goal of a Q-table is to map out which actions to take to get the highest reward.
Imagine a mouse stuck in a maze.
It can take 4 actions: 1 step left, right, up, or down. This is where a Q-table comes in handy. The Q-table tells us which step to take at each state to get the cheese most efficiently. At the start of a new game, the table is filled with zeros, because the mouse hasn’t learned anything yet.
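To make the empty table concrete, here is a minimal sketch in Python. The 3x3 maze size and the names `ACTIONS`, `N_STATES`, and `q_table` are assumptions made for illustration; the article doesn’t specify the maze layout.

```python
# Hypothetical 3x3 maze: 9 states (one per cell), 4 actions per state.
ACTIONS = ["left", "right", "up", "down"]
N_STATES = 9

# Every Q-value starts at zero because the mouse has not learned anything yet.
q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

# Print one row per state, showing the Q-value of each action.
for state, row in enumerate(q_table):
    print(state, dict(zip(ACTIONS, row)))
```

Each row is a state (a cell in the maze) and each column is an action. As the mouse plays, these zeros get replaced by estimates of how good each move is.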
Calculating The Values For Each Action
The Q-table ultimately needs to tell us what action to take to achieve the highest possible reward. This is done through the Q-learning algorithm which uses the Bellman equation.
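The standard Q-learning update based on the Bellman equation is usually written as Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)]. Here is a rough sketch of that rule in Python. The function name `q_update`, the learning rate `alpha`, the discount factor `gamma`, and the toy numbers are all assumptions chosen for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to q_table in place and return the new value."""
    best_next = max(q_table[next_state])      # value of the best action in the next state
    td_target = reward + gamma * best_next    # Bellman estimate of the return
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error  # move the old value a step toward the target
    return q_table[state][action]

# Two states, four actions each, all zeros at the start.
q_table = [[0.0] * 4 for _ in range(2)]
new_value = q_update(q_table, state=0, action=1, reward=1.0, next_state=1)
print(new_value)
```

A small `alpha` means the table changes slowly, and a `gamma` close to 1 means future rewards count almost as much as immediate ones.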
But…what action can we take in the beginning, if every Q-value equals zero? And what if there’s a better route the mouse hasn’t explored yet?
That’s where the exploration/exploitation trade-off comes in handy.
As the name indicates, this helps the agent decide whether to explore and try out new routes, or to stick to one and exploit its benefits/rewards.
Striking a balance between the two is critically important. Firstly, the variant space needs to be sufficiently explored such that the strongest variant is identified. By first identifying then continuing to exploit the optimal action you are maximizing the total reward that is available to you from the environment. However, you also want to continue to explore other feasible variants in case they provide better returns in the future.
This is where we use the epsilon-greedy strategy (epsilon is usually written with the Greek letter ε).
“Greedy” in the epsilon-greedy strategy stands for what you probably think it does. Let’s go back to the mouse example. After an initial set of trials, say 20 or so different routes, the mouse can gradually get greedy and start using only the best of the routes it found for most of its attempts.
In other words:
If we set ε = 0.05, the algorithm will exploit the best-known option 95% of the time and explore random alternatives 5% of the time. This is quite effective in practice.
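Here is a minimal sketch of epsilon-greedy action selection. The function name `epsilon_greedy`, the `rng` parameter, and the made-up Q-values are assumptions for illustration only.

```python
import random

def epsilon_greedy(q_values, epsilon=0.05, rng=random):
    """Pick a random action with probability epsilon, else the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q_values = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for left, right, up, down
choices = [epsilon_greedy(q_values, epsilon=0.05, rng=rng) for _ in range(10_000)]
greedy_share = choices.count(1) / len(choices)
print(f"best action chosen {greedy_share:.1%} of the time")
```

The best action ends up being picked slightly more than 95% of the time, because a random exploration step can also land on it by chance.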
For now, that’s all you need to know to understand Deep Q Networks!
Diving Deeper Into Deep Q Learning
This is where things get interesting.😎
This is how I created an algorithm to play Breakout at an expert level!
Q-learning + reinforcement learning + neural network = Deep Q-Networks
Quick note: Atari games are much bigger than a simple mouse maze, so a Q-table with an entry for every possible screen (every combination of pixels) and action isn’t practical. It’s simply inefficient and not scalable. For more complex games, we use neural networks to learn more efficiently and get better at the game. With these neural networks, we significantly reduce the number of values we have to keep track of, from billions to only millions. This is done by the neural network approximating the Q-value of each action, given a state.
The first step is preprocessing: we reduce the complexity of each frame and then stack frames. First, let’s reduce the complexity.
Look at the image below. What do you think we can remove?
A few things we can do are greyscale the image, reduce its size, and crop the frame to remove unnecessary parts (like the number at the top). After that, we can stack 4 frames. Why stack 4 frames? Well, imagine a picture of 2 cars facing each other on the road. From a single image, you can’t tell whether the cars are parked or about to crash, but if you’re given a few frames from a video you can easily tell whether the cars are moving. Similarly, stacked frames make it easier for our DQN to infer motion, such as which way the ball is heading and how fast.
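The preprocessing and stacking steps can be sketched roughly as below. This is plain Python on nested lists so it runs without extra libraries; a real project would more likely use NumPy or an image library. The names `preprocess`, `FrameStack`, and `fake_frame`, the crop size, and the tiny 10x8 frames are all assumptions for illustration.

```python
from collections import deque

def preprocess(frame, top_crop=2, scale=2):
    """Greyscale, crop the top rows, and downsample an RGB frame (nested lists)."""
    grey = [[(r + g + b) // 3 for (r, g, b) in row] for row in frame]
    cropped = grey[top_crop:]                          # drop the score area at the top
    return [row[::scale] for row in cropped[::scale]]  # keep every scale-th pixel

class FrameStack:
    """Keep the most recent k preprocessed frames as one state."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        processed = preprocess(frame)
        if not self.frames:
            # At the start of an episode, fill the stack with copies of the first frame.
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)  # the oldest frame falls off automatically
        return list(self.frames)

def fake_frame(value):
    """Stand-in for a real game screen: a 10x8 RGB image with one brightness."""
    return [[(value, value, value) for _ in range(8)] for _ in range(10)]

stack = FrameStack(k=4)
for t in range(6):
    state = stack.push(fake_frame(t))

# Number of stacked frames, then height and width of each frame.
print(len(state), len(state[0]), len(state[0][0]))
```

The state handed to the network is always the 4 most recent frames, so differences between them carry the motion information a single frame lacks.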
The 4 frames received are then processed by convolutional networks. These convolutional layers allow you to exploit some spatial properties across those frames.
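To get a feel for what the convolutional layers do to the stacked frames, here is a small sketch that only tracks shapes. The 84x84 input and the three layer sizes are taken from DeepMind’s 2015 DQN paper and are an assumption here, since this article doesn’t state its own layer sizes. The helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a convolution with no padding."""
    return (size - kernel) // stride + 1

# (filters, kernel, stride) for three conv layers, as in the 2015 DQN paper.
# Assumption: the article's own network may use different sizes.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 84x84 input with 4 stacked frames as channels
for filters, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    channels = filters
    print(f"{size}x{size}x{channels}")

# Number of features passed on to the fully connected layers.
flat_features = size * size * channels
print(flat_features)
```

Each layer shrinks the image while adding more feature maps, so the network ends up with a compact summary of where things are and how they are moving.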
The next part of our network is experience replay. It helps us with 2 important things:
- Reducing correlated data
- Making better use of previous experiences
If we train the network on experiences in sequential order, we risk our agent being influenced by the effect of correlation. Since not much changes from frame to frame, consecutive samples are nearly identical, so the network’s weights and biases get pulled toward whatever just happened, which in turn skews the agent’s actions.
By sampling from the replay buffer at random, we can break this correlation. This prevents action values from oscillating or diverging catastrophically.
Essentially, each transition the agent experiences (state, action, reward, next state) is stored in a “transition pool”, and random samples from that pool are then used to update the network.
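A replay buffer along these lines can be sketched in a few lines of Python. The class name `ReplayBuffer`, its capacity, and the dummy transitions are assumptions for illustration, not the exact code used for this project.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size pool of transitions with uniform random sampling."""
    def __init__(self, capacity=10_000, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the frame-to-frame ordering.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(150):
    # Dummy transitions: the state is just the time step.
    buffer.add(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(32)
print(len(buffer), len(batch))
```

Because the batch mixes old and new transitions from all over the game, each training step sees a varied set of situations, and every stored experience can be reused many times.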
Making the Network Better With the Loss Function
Remember how I talked above about updating our Q-tables with the Bellman equation? This time we want to update our neural net’s weights to reduce the error.
The error (or TD error) is calculated by taking the difference between our Q_target (the reward plus the discounted maximum Q-value of the next state) and Q_value (our current prediction of the Q-value).
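As a rough sketch, the target and the error for a single transition can be computed like this. The function names `td_target` and `squared_td_error`, the discount factor `gamma`, and the example numbers are assumptions. Squared error is one common choice of loss; the original DQN work also clips the error and uses a separate target network, which are left out here.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Q_target: the reward plus the discounted best Q-value of the next state."""
    if done:
        return reward  # no future reward after the game ends
    return reward + gamma * max(next_q_values)

def squared_td_error(q_value, target):
    """Loss for one transition: squared difference between target and prediction."""
    return (target - q_value) ** 2

q_value = 1.0                        # network's current prediction for the action taken
next_q_values = [0.5, 2.0, 1.5, 0.0]  # made-up predictions for the next state
target = td_target(reward=1.0, next_q_values=next_q_values, done=False, gamma=0.99)
loss = squared_td_error(q_value, target)
print(target, loss)
```

Training then nudges the network’s weights so that its prediction moves closer to the target, averaged over a batch of sampled transitions.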
And that’s about it. We have our very own DQN that can play Atari games from Doom to Pong! 😎
Limitations of DQNs
Deep reinforcement learning is surrounded by mountains and mountains of hype. And for good reasons! Reinforcement learning is an incredibly general paradigm, and in principle, a robust and performant RL system should be great at everything. Merging this paradigm with the power of deep learning is a perfect blend. Deep RL is one of the closest things we have to something that looks like AGI, and that’s the kind of dream that fuels billions of dollars of funding.
Unfortunately, it doesn’t work too well yet.
Now, it is 100% possible to overcome these limits. If I didn’t believe in reinforcement learning, I wouldn’t be working on it. But there are a lot of problems in the way. Here are just a few of them:
Reinforcement Learning Usually Requires a Reward Function
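- RL assumes a reward function exists, and it has an annoying tendency to overfit to your reward, leading to things you didn’t expect. This is why Atari is such a nice benchmark for RL: the game score is a ready-made reward. However, for tasks without such a natural reward, using RL may prove inefficient.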
Deep Reinforcement Learning Can Be Sample Inefficient
- A DeepMind paper studies several incremental advances made to the original DQN architecture, demonstrating that a combination of all the advances (Rainbow DQN) gives the best performance. It exceeds human-level performance on over 40 of the 57 Atari games attempted.
- The graph from that paper shows how long it takes to train on Atari games. Rainbow DQN passes the 100% threshold at about 18 million frames. At 60 frames per second, this corresponds to about 83 hours of play experience.
Key Takeaways
- DQNs are a powerful tool that can learn and pick up strategies faster than humans.
- Deep Q-networks are essentially reinforcement learning + neural networks.
- Preprocessing is an important step in limiting the unnecessary information exposed to our network.
- Experience replay makes more efficient use of observed experience.
- RL is extremely useful; however, we still have a few obstacles in the way before perfecting DQNs and RL.