I’m bad at playing games. Like really bad at playing games.
It sucks even more when your friends make fun of how terrible you are. I mean, I guess it makes sense. People tend to lose respect for you if you are consistently roasted over the internet by 12-year-old Fortnite gamers. It’s okay, I just tell myself, “You are better than this.”
Those days are now in the past, and since then I have pledged to myself that one day I will CRUSH the dreams of all 12-year-old gamers alike by beating them at their own game. Unbeknownst to their tiny brains is the true power of Deep Q Learning. My crusade against premature teens and their victory royale is on.
Okay, I’m not serious about destroying Fortnite gamers (even though it is preferred), but what I am serious about is leveraging Deep Q Learning to do some impressive stuff, like learning to play games.
It’s simple. In reinforcement learning, there are states, actions, and rewards based on the environment the agent is in.
- States represent the agent’s current situation in the environment (e.g., its location)
- Actions are the moves the agent can make in the environment
- Rewards are how the agent is penalized or encouraged
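To make those three ideas concrete, here’s a tiny, hypothetical grid world sketched in Python (the grid size, goal cell, and reward values are all made up for illustration) — the state is the agent’s cell, the actions move it around, and the reward nudges it toward the goal:

```python
# A minimal, hypothetical grid world illustrating states, actions, and rewards.
GOAL = (2, 2)  # reaching this cell earns a reward

def step(state, action):
    """Apply an action to a state; return (next_state, reward)."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    # Clamp to a 3x3 grid so the agent can't walk off the board.
    next_state = (min(max(state[0] + dr, 0), 2), min(max(state[1] + dc, 0), 2))
    reward = 1.0 if next_state == GOAL else -0.1  # small penalty per step
    return next_state, reward

state, reward = step((2, 1), "right")
print(state, reward)  # reaching the goal gives the +1 reward
```

The agent’s whole job is to learn which action to pick in each state so the rewards add up.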
The idea behind Q-learning is to maximize the expected reward in the face of a stochastic environment. In other words, the agent learns to take the actions with the best expected outcome, even though the environment changes unpredictably (that’s what stochastic means).
An analogy commonly given is training a dog. To train a dog, you might give treats (rewards) for doing what you wanted it to do (actions). Let’s say you want the dog to learn to fetch. You would reward the dog after it fetches the stick, and at each location (state) you’d check whether the dog has fetched the stick yet. For the dog, the goal is to maximize its expected number of treats (rewards). So it performs the action with the maximum predicted number of treats. It’s using a one-step look-ahead: predict the outcomes of each possible action and pick the one that maximizes the reward.
That’s effectively the intuition of Q-Learning.
To make the agent even smarter, we add a discounted reward. Okay, what does that mean? Well, go back to the dog analogy: what the dog does now matters more than what it does in the future. Actions performed closer to the present are more important than actions performed far in the future. So mathematically, we multiply future rewards by a decaying value.
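That decaying value is usually called gamma. A quick sketch of how a discounted return is computed (the 0.9 here is just an example discount factor, not a value from the original code):

```python
# Sketch: discounting future rewards with a decay factor gamma (0.9 is an assumption).
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ... — later rewards count less."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 ≈ 2.71
```

With gamma close to 1 the agent plans far ahead; with gamma close to 0 it only cares about the next treat.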
The intuition of Q-Learning can be represented in a simple equation: Q(s, a) = Rt+1 + γ · max over a' of Q(s', a'). Q(s, a) is the “quality” of taking action a in state s. It’s the expected reward, based on the next reward Rt+1 plus the best discounted value among the possible next states. That makes sense, since you’d want to take the best foreseeable action.
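That equation turns into a one-line update rule in tabular Q-learning. A minimal sketch (the learning rate alpha and discount gamma are arbitrary illustrative values):

```python
# Tabular Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated quality

def update(s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # best foreseeable action
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])  # move estimate toward the target

actions = ["left", "right"]
update("s0", "right", 1.0, "s1", actions)
print(Q[("s0", "right")])  # 0.5 — halfway (alpha=0.5) toward the target of 1.0
```

Each update nudges the stored quality estimate toward the observed reward plus the best predicted future value.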
Deep Q Learning
The problem with games is that the agent has to take in raw images as its perception of the world. To handle that, we use a convolutional neural network. Frames of the game are fed into the network, and it outputs a value for each of the 6 possible actions. Cool, that works!
However, there’s a problem with computing the MAX Q value used in the loss: if the same network both picks and evaluates actions, training becomes unstable. The solution is to create a SECOND neural network, called the target network. It calculates the maximum future Q value, and the loss is the difference between that target and the Q value the main network predicted for the action actually taken: loss = (Rt+1 + γ · max over a' of Q_target(s', a') − Q(s, a))².
After that, all we need to do is run gradient descent on the loss function. Easy? Let’s do this in code.
My Code To Play Space Invaders
First, create the CNN. If you need a refresher on how I coded the CNN, check out my article on it. It covers the basics required to understand DQNs.
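Since the code itself isn’t shown inline here, below is a minimal sketch of the kind of network a DQN for an Atari game typically uses. The 4 stacked 84x84 grayscale frames and the layer sizes are common conventions from the DQN literature, not necessarily the author’s exact code; the 6 outputs match the 6 Space Invaders actions mentioned above:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of a DQN network: convolutions read the game frames,
    a linear head outputs one Q-value per action (6 for Space Invaders)."""
    def __init__(self, n_actions=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),  # 4 stacked 84x84 frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),  # 7x7 is the conv output for 84x84 input
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.head(self.conv(x))

net = DQN()
q_values = net(torch.zeros(1, 4, 84, 84))  # one batch of stacked frames
print(q_values.shape)  # torch.Size([1, 6])
```

The network maps pixels straight to one Q-value per action; the agent then just has to decide which of those values to trust.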
Secondly, you’d want to create the agent. The important aspect of the agent is the choice between exploitation and exploration. An agent MUST explore its environment before it can exploit it. So we have an exploration rate, and as time goes on, the exploration rate decays. This gradually shifts the agent from exploring (trying random actions) to exploiting (picking the best-known action). Neat! This is how we coded that idea.
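A sketch of that explore/exploit switch, commonly called epsilon-greedy action selection (the starting rate, decay factor, and floor value are assumptions, not the original hyperparameters):

```python
import random

class EpsilonGreedy:
    """Pick a random action with probability epsilon (explore),
    otherwise the best-known action (exploit); epsilon decays over time."""
    def __init__(self, epsilon=1.0, decay=0.995, epsilon_min=0.05):
        self.epsilon = epsilon
        self.decay = decay
        self.epsilon_min = epsilon_min

    def choose(self, q_values):
        if random.random() < self.epsilon:
            action = random.randrange(len(q_values))  # explore: random action
        else:
            action = max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
        # Decay toward exploitation (keep a floor so exploration never fully stops).
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)
        return action

policy = EpsilonGreedy(epsilon=0.0, epsilon_min=0.0)  # always exploit, for illustration
print(policy.choose([0.1, 0.9, 0.3]))  # 1 — the index of the highest Q-value
```

Early in training epsilon is near 1 and the agent flails around gathering experience; by the end it mostly trusts its Q-values.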
The training loop is a little more complex. You update the target network every 3 time steps; doing so keeps the loss function stable. You also calculate the loss between the target Q value and the predicted Q value and minimize the difference between the two. You loop through the training process for 50 games, and within each game, the agent keeps taking actions until it loses a life.
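The core of that loop can be sketched as a single training step. This assumes `policy_net` and `target_net` are two copies of the same Q-network (like the CNN above) and that experience comes in batches of (state, action, reward, next state) tensors; the optimizer, gamma, and MSE loss are typical choices rather than the author’s exact setup:

```python
import torch
import torch.nn as nn

def train_step(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update: move Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states = batch
    # Q(s, a) from the online network, for the actions actually taken.
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # max_a' Q_target(s', a') from the frozen target network (no gradients).
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    q_target = rewards + gamma * q_next
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every few steps, sync the target network with the online one:
# target_net.load_state_dict(policy_net.state_dict())
```

The periodic sync at the bottom is what implements the “update the target network every 3 time steps” idea: between syncs, the target stays frozen so the loss has a stable goal to chase.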
DQNs are a new and exciting way to use CNNs, and they generally produce more interesting results. This method is similar in spirit to how AlphaStar (developed by DeepMind) was able to defeat top players in the world. Although DeepMind used some more advanced techniques, the intuition is the same. Perhaps one day we can train computers with general intelligence, and maybe one day I will destroy everyone at playing games.
Key Takeaways
- Humans aren’t great at everything
- DQNs can be trained to perform complex tasks
- DQN = CNN + Q-learning
- The future is yet to come
Before You Go
Connect with me on LinkedIn
Check out my website: http://peterma.ca/
Feel free to reach out to me by e-mail with any questions: firstname.lastname@example.org
And clap the article if you enjoyed 😊