In today’s article, I’m going to introduce you to the hot topic of Deep Q Networks (DQNs) and how they work. I’ll go over my model and explain the key concepts behind building a reinforcement learning algorithm.
Creating The Game
For this game, we’ll be playing as a jet that can take 4 actions (up, down, forward, backward) and needs to dodge as many missiles as possible. The goal is to achieve the highest possible reward by surviving as long as possible.
To create this game, I used pygame, a cross-platform set of Python modules designed for writing video games.
For this specific game, you won’t need many modules or libraries, just pygame and built-in modules like time and random.
We set up 3 classes using sprites: the enemy, the player, and the clouds (optional). For the full code, check out my GitHub repository, and to learn more about pygame, check out this tutorial: https://realpython.com/pygame-a-primer/
How Does it Work?
Reinforcement learning is about how an agent ought to take actions in an environment to maximize its reward (score). A Markov decision process (MDP) is a framework used to model an agent’s decision making, and MDPs are a core concept of reinforcement learning. To understand the basics of what RL and DQNs are, read this first: How I Built An Algorithm to Takedown Atari games!
Markov decision processes
Markov decision processes are used in almost every reinforcement learning problem. To grasp the concept of MDPs we need to look into Markov properties and Markov processes.
The Markov property states: “The future is independent of the past given the present”
The formal definition is: P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
Essentially this tells us the previous states and events are not required to know future states — the present state captures all information necessary. Knowing this we can figure out values and make decisions.
Markov Reward Process
An MRP is a tuple (S, P, R, 𝛾), where S is a finite state space, P is the state transition probability, and R is a reward function defined as
R_s = 𝔼[R_{t+1} | S_t = s],
which says how much immediate reward we expect to get from the current state s. Finally, 𝛾 (gamma) is the discount factor that tells our agent how much it should care about future rewards. For example, if gamma is 0 or close to zero, our agent will only care about immediate rewards and become short-sighted. If gamma is closer to 1, our agent will care about future rewards and try to maximize how long it survives, even if that means sacrificing short-term rewards.
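To make the effect of gamma concrete, here is a minimal sketch that sums a stream of +1 rewards under two discount factors. The function name discounted_return and the 10-step reward sequence are illustrative assumptions, not code from the game.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Ten time steps of +1 reward (an illustrative sequence).
rewards = [1.0] * 10

# With gamma = 0 only the immediate reward counts.
short_sighted = discounted_return(rewards, gamma=0.0)

# With gamma close to 1 later rewards count almost as much as the first one.
far_sighted = discounted_return(rewards, gamma=0.99)
```

With gamma at 0 the agent sees no benefit in surviving past the next step, while with gamma near 1 a longer survival streak is worth noticeably more.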
Markov Decision Process
A Markov Decision Process extends a Markov Reward Process by adding the decisions (actions) that an agent must make. All states in the environment are Markov.
A Markov Decision Process consists of a tuple (S, A, P, R, 𝛾):
- S is a set of states (finite)
- A is a set of actions (finite)
- P is a state transition probability (how likely it is that a new state will occur)
- R is a reward function (positive or negative rewards based on the action and state)
- 𝛾 is a discount factor (short-sighted rewards vs future rewards)
Now that we have our MDP, we can use the Bellman equation to determine the value of each state and action. The Bellman equation is a key concept when it comes to taking the right action in a given state.
Value function: V(s) = max_a (R(s, a) + 𝛾V(s’))
The equation above tells us that the value of a given state s is found by picking the action a that maximizes the immediate reward R(s, a) plus the discounted value of the next state s’, where s’ is the state we end up in if we take action a.
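The value function above can be applied repeatedly until the values settle. The sketch below does this on a tiny deterministic MDP; the three states, their transitions, and the rewards are made up for illustration and are not taken from the game.

```python
def value_iteration(transitions, rewards, gamma=0.9, sweeps=100):
    """Repeatedly apply V(s) = max_a (R(s, a) + gamma * V(s')).

    transitions[s][a] is the next state s', rewards[s][a] is R(s, a).
    A state with no actions is terminal and keeps a value of 0.
    """
    values = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            if not actions:
                continue
            values[state] = max(
                rewards[state][action] + gamma * values[next_state]
                for action, next_state in actions.items()
            )
    return values


# Hypothetical toy MDP: going "up" from "start" buys one extra step of survival.
transitions = {
    "start": {"up": "safe", "down": "crash"},
    "safe": {"up": "crash"},
    "crash": {},
}
rewards = {
    "start": {"up": 1.0, "down": 1.0},
    "safe": {"up": 1.0},
    "crash": {},
}

values = value_iteration(transitions, rewards, gamma=0.9)
```

Here "start" ends up more valuable than "safe" because the best action from "start" leads to a state where one more reward can still be collected.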
The corresponding equation for a state-action pair, Q(s, a) = R(s, a) + 𝛾 max_a’ Q(s’, a’), gives the Q value of taking action a in state s. Both equations only hold for an environment without uncertainty. In a stochastic environment, they are no longer true as written. To account for the randomness, we change them slightly by weighting each possible next state by its transition probability and using the expected reward.
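One common way to write the stochastic versions is sketched below, using the MDP symbols defined earlier. Here P(s' | s, a) is the transition probability and R(s, a) is read as the expected immediate reward; the symbol a' for the next action is notation introduced for this sketch.

```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Big)

Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
```

When one next state has probability 1, the sums collapse to a single term and both equations reduce to their deterministic forms.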
Note: For many reinforcement learning problems, including our game, computing the value of every state does not scale. There are too many possible states, and working through all of them would take a lot of computational power. Therefore, we use a neural network to approximate Q values and state values. The neural network is updated using the TD (temporal difference) error.
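As a rough illustration of the TD error, the sketch below uses the standard one-step Q-learning form: the target is the reward plus the discounted best Q value of the next state, and the error is the gap between that target and the current estimate. The function name, arguments, and numbers are assumptions for this example, not the exact update from my model.

```python
def td_error(q_current, reward, next_q_values, gamma=0.99, done=False):
    """One-step TD error: (reward + gamma * max Q(s', a')) - Q(s, a).

    If the episode ended on this step there is no next state,
    so the target is just the reward.
    """
    if done:
        target = reward
    else:
        target = reward + gamma * max(next_q_values)
    return target - q_current


# The network predicted Q(s, a) = 2.0, the agent got +1 and the best
# next-state Q value is 3.0, so the estimate was too low.
error = td_error(q_current=2.0, reward=1.0, next_q_values=[0.5, 3.0], gamma=0.9)

# On a terminal step only the reward counts.
terminal_error = td_error(q_current=2.0, reward=1.0, next_q_values=[0.5, 3.0], done=True)
```

Training nudges the network so that this error shrinks, typically by minimizing its square.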
Reinforcement Learning and Policies
In reinforcement learning, we have two main components: the environment (our game) and the agent (the jet). Every time the agent performs an action, the environment returns a reward, as defined by the reward function R, which can be positive or negative depending on how good the action was from that specific state. The goal of the agent is to learn which actions maximize the reward in every possible state. For this specific game, we don’t give the agent any negative reward. Instead, the agent receives a +1 reward for every time step it survives, and the episode ends when the jet collides with a missile. Along the way, the agent picks up certain strategies and a certain way of behaving. This is known as the agent’s policy.
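The agent and environment loop described above can be sketched without pygame. ToyDodgeEnv below is a hypothetical stand-in for the real game: it only models the reward rule, and it replaces missile collisions with a random crash chance. The zero reward on the crash step is also an assumption.

```python
import random

ACTIONS = ("up", "down", "forward", "backward")


class ToyDodgeEnv:
    """Hypothetical stand-in for the game that models only the reward rule."""

    def __init__(self, crash_probability=0.05, seed=0):
        self.crash_probability = crash_probability
        self.rng = random.Random(seed)
        self.steps_survived = 0

    def reset(self):
        self.steps_survived = 0
        return self.steps_survived

    def step(self, action):
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        # A random crash replaces real missile collisions in this sketch.
        crashed = self.rng.random() < self.crash_probability
        if crashed:
            return self.steps_survived, 0.0, True  # no penalty, episode ends
        self.steps_survived += 1
        return self.steps_survived, 1.0, False  # +1 for surviving this step


def random_policy(state):
    """A policy maps a state to an action; this one ignores the state."""
    return random.choice(ACTIONS)


def run_episode(env, policy, max_steps=1000):
    """Play one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total_reward += reward
        if done:
            break
    return total_reward
```

A learned policy would replace random_policy with a function that picks the action with the highest predicted Q value for the current state.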
The neural network for my model consists of 3 fully connected layers with 256 neurons. This model doesn’t require a CNN or any preprocessing because we can read the states and positions of everything directly from the game, without the need for image detection. Moreover, we apply a ReLU activation function after each layer, which sets every value below 0 to 0 and stays linear for all values above 0. Because ReLU is so cheap to compute, it helps keep training time down.
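To show what the forward pass of such a network looks like, here is a minimal pure-Python sketch. It assumes the three layers have 256 neurons each and adds a linear output layer with one Q value per action. The input size of 8, the weight initialization, and all function names are illustrative assumptions, and a real model would use a deep learning library instead.

```python
import random


def relu(values):
    """Set negative values to 0 and leave positive values unchanged."""
    return [max(0.0, value) for value in values]


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


def build_q_network(state_size, n_actions, hidden=256, n_hidden_layers=3, seed=0):
    """Hidden layers of `hidden` neurons followed by a linear output layer."""
    rng = random.Random(seed)
    sizes = [state_size] + [hidden] * n_hidden_layers + [n_actions]
    return [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def q_values(network, state):
    """Forward pass: ReLU after every hidden layer, none on the output."""
    activations = list(state)
    for layer in network[:-1]:
        activations = relu(dense(activations, layer))
    return dense(activations, network[-1])


network = build_q_network(state_size=8, n_actions=4)
predicted = q_values(network, [0.1] * 8)  # one Q value for each of the 4 actions
```

The agent would then pick the action whose Q value is largest.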
We also stack 4 frames so the model can detect the change in motion. Without stacking frames, the model won’t be able to predict future events accurately. For example, imagine a picture of 2 cars facing each other. With only one frame you can’t tell if the cars are moving or parked. Thus, you can’t predict whether the cars will crash or not. But if you’re given 4 frames you can easily identify movement and predict what's about to happen.