Deep Q-Networks

A road to Deep Reinforcement Learning

Sidhved Warik
GDSCVITBhopal
5 min read · Dec 12, 2021

Reinforcement Learning is the popular kid in the machine learning family. Thanks to recent breakthroughs from labs like OpenAI and DeepMind (the team behind AlphaGo), Reinforcement Learning has caught the eye of many in the gaming industry.

Just the other day, I took a trip down memory lane and played a classic game, Hill Climb Racing: a confused driver riding different vehicles aimlessly across different terrains. This got me thinking: could I make the driver smart enough to travel on its own without crashing?

I am sure you did not expect me to talk about self-driving cars when you read the title. We will tackle the entire concept of Deep Q-Networks with examples inspired by the game.

If you have ever tried your hand at such games, you might have noticed that the gameplay boils down to two primary aims: do not crash, and keep moving forward. Let us break them into two independent problems:

  1. Cartpole Balancing
  2. Mountain Climbing

Before we dive into tackling the problems, it is important for us to understand the basics of Reinforcement Learning and Q-Networks.

Rather than going with the textbook definition, let us quickly understand what reinforcement learning is. It is a type of learning where the actions an agent performs on its environment are rewarded based on their effects, and those rewards influence the agent's future actions.

Favorable action, more reward (positive).
Unfavorable action, less reward (in some cases negative).

Reinforcement Learning Diagram.

The Deep Q-Learning algorithm is one of the core concepts of Reinforcement Learning: a neural network maps each input state to (action, Q-value) pairs.

That was a lot of new terms! Let’s break it down a bit.

  • Action: The activity performed by the agent that produces a change in the environment.
  • Environment: The entire state space the agent operates on.
  • Rewards: The feedback provided to the agent for every action.
  • Q-value: The estimated total future reward for taking a given action in a given state (see the update rule just after this list).
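
For reference, these Q-values obey the classic Q-learning update, where α is the learning rate and γ is the discount factor that weighs future rewards against immediate ones:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

A Deep Q-Network replaces the Q-table of classic Q-learning with a neural network that approximates Q(s, a).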

Q-Learning and Deep Q-Networks are model-free algorithms because they do not learn a model of the environment's transition dynamics.

Since the DQN is model-free, we can build a single agent that works with both of the environments mentioned above.

DQN Agent Class
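A minimal sketch of such an agent in Python with Keras is shown below. The replay buffer, epsilon-greedy policy, layer sizes, and hyperparameters here are illustrative assumptions, not necessarily the exact values used in the code linked in the Resources section.

```python
import random
from collections import deque

import numpy as np
from tensorflow import keras


class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size        # number of state variables
        self.action_size = action_size      # number of discrete actions
        self.memory = deque(maxlen=2000)    # experience replay buffer
        self.gamma = 0.95                   # discount factor
        self.epsilon = 1.0                  # exploration rate (decays over time)
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = self._build_model()

    def _build_model(self):
        # Small network with two hidden layers: states in, one Q-value per action out
        model = keras.Sequential([
            keras.layers.Input(shape=(self.state_size,)),
            keras.layers.Dense(24, activation="relu"),
            keras.layers.Dense(24, activation="relu"),
            keras.layers.Dense(self.action_size, activation="linear"),
        ])
        model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=0.001))
        return model

    def remember(self, state, action, reward, next_state, done):
        # Store a transition for experience replay
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection: explore with probability epsilon
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.model.predict(state, verbose=0)
        return int(np.argmax(q_values[0]))

    def replay(self, batch_size):
        # Fit the network on a random minibatch of stored transitions
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target += self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0])
            target_q = self.model.predict(state, verbose=0)
            target_q[0][action] = target
            self.model.fit(state, target_q, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```

The epsilon-greedy policy lets the agent explore randomly early on and exploit its learned Q-values more and more as epsilon decays.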

As we can see, the network is small (two hidden layers). The input size matches the number of state variables, and the output size matches the number of possible actions.

Cartpole Balancing

CartPole, commonly known as the inverted pendulum, is a simple environment where the objective is to move the cart left or right to keep an upright pole balanced. In this blog, we will use the simulation environments provided by OpenAI's gym library.

CartPole-v1

In this environment, we have a discrete action space and a continuous state space, as shown below:

Action Space (discrete)

  1. 0 - 1 unit of force in the left direction.
  2. 1 - 1 unit of force in the right direction.

State Space (continuous)

  1. 0 - Cart Position.
  2. 1 - Cart Velocity.
  3. 2 - Pole Angle.
  4. 3 - Pole Tip Velocity.
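
You can confirm these spaces directly with gym (the exact printed representation varies slightly between gym versions):

```python
import gym

# Create the CartPole environment and inspect its spaces
env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2): push left (0) or push right (1)
print(env.observation_space)  # Box with 4 continuous values: position, velocity, angle, tip velocity
```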

The agent has to balance the pole for as long as possible to maximize the reward. For each successful time step, the model gains a +1 reward.

Each episode (iteration) ends if the cart moves more than 2.4 units away from the center or the pole deviates more than 15 degrees from vertical.

We train the agent in this environment for 1000 episodes, updating and saving the best weights every 50 episodes.
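
A sketch of that training loop, assuming the classic gym API (gym < 0.26, where step() returns four values) and the DQNAgent class from above; the batch size and the weights file name are illustrative:

```python
import gym

env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)   # defined earlier in this post

episodes, batch_size = 1000, 32
for episode in range(episodes):
    state = env.reset().reshape(1, state_size)
    score, done = 0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = next_state.reshape(1, state_size)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        score += reward
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)
    if episode % 50 == 0:
        # Periodically save the current weights (file name is illustrative)
        agent.model.save_weights("cartpole_dqn.h5")
    print(f"Episode {episode}: score {score}")
```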

Results

The agent solved the problem within the first 250 episodes and achieved a stable high score of 199 as the episodes progressed.

Great! We have solved the balancing issues. Now it is time for our driver to learn how to move forward and climb up the hills.

Mountain Climbing

An underpowered car aims to reach the top of a steep mountain. It is supposed to use another hill on the opposite side, driving back and forth to build the required momentum.

MountainCar-v0

Similar to CartPole, this environment also has a discrete action space and a continuous state space.

Action Space (discrete)

  1. 0 - Apply 1 unit of force in the left direction.
  2. 1 - Do nothing.
  3. 2 - Apply 1 unit of force in the right direction.

State Space (continuous)

  1. 0 - Car Position.
  2. 1 - Car Velocity.

Generally, the car gains a reward of +100 on reaching the goal (marked by the flag). Until the agent reaches that success state, it receives no reward, so there is only a very slim chance that the car will reach the goal by acting randomly.

Hence, we will motivate the car with an additional reward condition. Taking the bottom of the valley as coordinate -0.4 (the standard provided by OpenAI) and the goal as +0.5, we will reward the car in proportion to the vertical height it gains.

The greater the height, the greater the reward. (Note: the height can be gained on either of the two mountains.)

Let’s glance at the reward mechanism and the training function that we will use in the environment.
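
Below is a sketch of how that shaped reward can be wired into the training loop, again assuming the classic gym API (gym < 0.26) and the DQNAgent class from earlier; the height-bonus scaling, the +100 goal bonus, and the episode count are illustrative assumptions based on the description above:

```python
import gym

env = gym.make("MountainCar-v0")
state_size = env.observation_space.shape[0]   # 2: car position, car velocity
action_size = env.action_space.n              # 3: push left, do nothing, push right
agent = DQNAgent(state_size, action_size)     # defined earlier in this post

VALLEY_BOTTOM = -0.4   # reference coordinate for the bottom of the valley
GOAL_POSITION = 0.5    # position of the flag

episodes, batch_size = 100, 32
for episode in range(episodes):
    state = env.reset().reshape(1, state_size)
    score, done = 0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        position = next_state[0]
        # Shaped reward: the further the car climbs away from the valley bottom
        # (on either hill), the larger the bonus
        reward += abs(position - VALLEY_BOTTOM)
        if position >= GOAL_POSITION:
            reward += 100   # bonus for reaching the flag
        next_state = next_state.reshape(1, state_size)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        score += reward
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)
    print(f"Episode {episode}: score {score:.1f}")
```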

Results

The car was able to reach the goal multiple times! In our case, episodes 60, 71, 74, and 79 reached the success state.

Note: These numbers can differ with each run.

We can also verify this from the peaks in the score plot.

Score Plot

Amazing! We have solved both of the problems our driver was facing. Integrating this into the actual game is beyond the scope of this blog, but you can try it out and let me know how it went for you!

Classic control problems like these form the basis of the Reinforcement Learning techniques that power many top-rated games today. OpenAI provides many more environments:

  1. Acrobot-v1
  2. MountainCarContinuous-v0
  3. Pendulum-v0

I would highly recommend exploring all of these environments to kick-start your journey with reinforcement learning.

Resources:
Find out more about the environments at Gym Documentation
Complete code for CartPole-v1 (contributed under Hacktoberfest-2021)
Complete code for MountainCar-v0 (contributed under Hacktoberfest-2021)

Connect with me here:
LinkedIn | GitHub | Instagram | Twitter | Medium
