How Reinforcement Learning Defied the Laws of Gravity


Have you ever played a game that you were just so bad at that every time you played it you would get frustrated? Well, that was me every time I played balance games like Cartpole. So, I decided to create a game bot that plays Cartpole for me! I learned how to code in Python last year, and this is my first game bot. I used Reinforcement Learning (RL) and a Deep Neural Network (DNN) to make the game play itself!


The Cartpole Debacle

This is what happens when the pole falls.

Cartpole is a pendulum with its center of gravity above its pivot point. It’s not stable and will fall on its own, but it can easily be controlled by moving the black cart to keep the pivot point underneath the pole. The goal is to keep the Cartpole balanced by applying appropriate forces to the cart, which carries the pivot point.

  • The Cartpole’s pivot point is the purple square.
  • The cart moves left and right to keep the pole balanced upright.
  • The black rectangle is the “cart” that moves to balance the pole.

Reinforcement Learning

In order to get the Cartpole to “balance itself”, we needed to implement Reinforcement Learning. Reinforcement Learning helps the agent learn from experience and improve. You’ve heard the phrase “Practice makes perfect!”; well, that’s exactly what the game is doing, practicing and practicing until it can completely balance the pole on the cart without any problems.

RL is a setup in which an agent takes actions in an environment in order to maximize its total reward. The main concept is very lifelike: much like humans in real life, agents in RL algorithms are incentivized with punishments for bad actions and rewards for good ones.

Reinforcement Learning is considered one of the three machine learning paradigms; the other two are Supervised Learning and Unsupervised Learning. In supervised learning you learn from labelled examples, and the decisions you make do not affect what you see in the future; in RL they do. It also differs from unsupervised learning, which looks for structure in unlabelled data, because RL’s focus is on performance, which means finding a balance between exploration (trying new actions) and exploitation (sticking with actions that are already known to work).
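
As a rough sketch of that balance, here is the usual epsilon-greedy rule a DQN-style agent follows. The model (anything with a Keras-style predict method), state and epsilon names are just placeholders here, not code from this project:

import random
import numpy as np

def choose_action(model, state, epsilon):
    # Exploration: with probability epsilon, try a random move (0 = push left, 1 = push right)
    if random.random() < epsilon:
        return random.randrange(2)
    # Exploitation: otherwise pick the move the model currently rates highest
    return np.argmax(model.predict(state)[0])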

A simple diagram of Reinforcement Learning would show this as a loop: the agent acts on the environment, and the environment sends back a new state and a reward.

OpenAI’s Gym Environment

In April of 2016, OpenAI released Gym, an environment that gives you access to very popular games like Cartpole and other games like their Taxi game. Using Gym, I was able to gain access to the game and build a game bot to play Cartpole.

Basically, Gym is a collection of environments for developing and testing RL algorithms. Cartpole is one of the available environments; you can check the full list on OpenAI’s website. Gym environments are built on a Markov chain model (a Markov Decision Process), which we will get into later in the article.
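
Getting hold of the Cartpole environment only takes a couple of lines using Gym’s standard API:

import gym

env = gym.make("CartPole-v1")   # build the Cartpole environment
state = env.reset()             # start a new game and get the first state
print(env.observation_space)    # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)         # Discrete(2): push the cart left (0) or right (1)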


Markov Chain

A Markov chain is a sequence of states in which the next state depends only on the current one. Here, the agent observes the environment’s output, consisting of a reward and the next state, and then acts upon it. This whole loop of states, actions and rewards is a Markov Decision Process (MDP).


It starts with an initial environment. It doesn’t have any associated reward yet, but it has a state (S_t).

Then, for each try (iteration), the agent takes the current state (S_t), picks the best action (A_t) (chosen based on the model’s prediction) and applies it to the environment. The environment then returns a reward (R_t+1) for that action, a new state (S_t+1) and a flag saying whether the new state is terminal. The process repeats until the episode terminates.
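
In code, that loop maps onto Gym’s interface roughly like this (agent here is just a stand-in for whatever picks the actions):

state = env.reset()                                     # initial state S_t, no reward yet
done = False
while not done:
    action = agent.act(state)                           # pick the action A_t
    next_state, reward, done, info = env.step(action)   # get R_t+1, S_t+1 and the terminal flag
    state = next_state                                  # the new state becomes the current state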

Markov Chain Code

Lines 1–5

I was just setting up the environment. [env = gym.make("CartPole-v1")] is OpenAI’s Gym command for creating the Cartpole environment. The code also creates an agent (DQNSolver), an observation space (the possible state values) and an action space (the possible actions that can be performed).
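
The original snippet isn’t embedded here, but based on that description the setup probably looked something like this (DQNSolver is the agent class from the full script):

import gym
import numpy as np

env = gym.make("CartPole-v1")                              # create the Cartpole environment
observation_space = env.observation_space.shape[0]        # the state is made of 4 numbers
action_space = env.action_space.n                         # 2 possible actions
dqn_solver = DQNSolver(observation_space, action_space)   # the agent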

Lines 6–8

Every time the game is played, a new episode is initialized. [state = env.reset()] is the reset command, and [state = np.reshape(state, [1, observation_space])] reshapes the returned state into the form the network expects.
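
In other words, something along these lines:

state = env.reset()                                 # start a new game
state = np.reshape(state, [1, observation_space])   # reshape the state for the network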

Lines 9–19

For each step, until the game is finished, we get an action from the agent based on the current state. Then we apply it to the environment and get a new state, along with a reward. Afterward, the SARS’ tuple (state, action, reward, state_next, terminal) is stored, and the terminal flag marks the end of the game/simulation before we can replay it.
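
Putting that together, the step loop likely looked roughly like this (terminal becomes True once the pole falls or the time limit runs out):

while True:
    env.render()
    action = dqn_solver.act(state)                           # the agent picks an action
    state_next, reward, terminal, info = env.step(action)    # the environment responds
    state_next = np.reshape(state_next, [1, observation_space])
    dqn_solver.remember(state, action, reward, state_next, terminal)   # store the SARS' tuple
    state = state_next
    if terminal:
        break                                                # the game is over
    dqn_solver.experience_replay()                           # learn from stored experience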

Deep Q-Learning

DQN is a Reinforcement Learning technique that is aimed at choosing the best action for any circumstance (observation). Each possible action for each observation has its Q value, where ‘Q’ stands for the quality of the move. In order to end up with accurate Q values, we need to dive into the world of deep neural networks and some linear algebra.

We were able to remember each state by equipping the Deep Q-Learning algorithm with this:
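
The snippet itself isn’t reproduced above, but in DQN implementations like this one it is usually a fixed-size memory buffer created inside the agent. A sketch of that part of the DQNSolver class (MEMORY_SIZE is a constant assumed to be defined with the rest of the script’s constants):

from collections import deque

MEMORY_SIZE = 1000000   # assumed capacity; the real constant lives in the full script

class DQNSolver:
    def __init__(self, observation_space, action_space):
        self.observation_space = observation_space
        self.action_space = action_space
        # a fixed-size buffer of past experiences the agent can learn from
        self.memory = deque(maxlen=MEMORY_SIZE)
        # self.model (the neural network that predicts Q values) is also built here in the full script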

Then, it remembers its experience for the next game using this function:
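
Again, the original function isn’t shown here, but it is typically a one-liner that appends the experience to that buffer. Continuing the DQNSolver sketch:

    def remember(self, state, action, reward, next_state, done):
        # store one SARS' experience so it can be replayed during training
        self.memory.append((state, action, reward, next_state, done))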

Every time there is a new, better Q value, the Cartpole agent needs to replace its original value with the new one. It does this by implementing this update:

q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))

Here, we calculate the new Q value by taking the maximum predicted Q value over the actions in the next state (the best-predicted outcome), multiplying it by the discount factor (GAMMA) and adding the current step’s reward.
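
For context, that update usually sits inside an experience-replay method that samples stored experiences and refits the model on the corrected Q values. Continuing the DQNSolver sketch (random and numpy as np are imported as before; BATCH_SIZE and GAMMA are constants from the full script, and self.model is the neural network described in the next section):

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return                                        # not enough experience to learn from yet
        batch = random.sample(self.memory, BATCH_SIZE)    # pick random past experiences
        for state, action, reward, state_next, terminal in batch:
            q_update = reward                             # terminal states keep the raw reward
            if not terminal:
                q_update = reward + GAMMA * np.amax(self.model.predict(state_next)[0])
            q_values = self.model.predict(state)          # current predictions for this state
            q_values[0][action] = q_update                # swap in the new, better Q value
            self.model.fit(state, q_values, verbose=0)    # retrain the network on the corrected target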

You may be wondering whether this can converge, because it’s like a model predicting itself, but we are able to give it some help by using a Deep Neural Network.


Deep Neural Network

import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression

LR = 1e-3  # learning rate (assumed value; defined with the other constants in the full script)

def neural_network_model(input_size):
    # Input layer: one observation of length input_size
    network = input_data(shape=[None, input_size, 1], name='input')

    # Fully connected hidden layers with ReLU activations and dropout (keep probability 0.8)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    # Output layer: one score per action (left or right)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR,
                         loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model

This Deep Neural Network takes all the data fed to it, filters it through its layers and trains on it accordingly. In this case, it took all the Cartpole data we collected from the environment (env = gym.make("CartPole-v1")) and successfully trained the bot to balance the pole on its own.

Finally, I wrote training scripts to train the model:

import numpy as np

def train_model(training_data, model=False):
    # Observations become the inputs, the actions taken become the targets
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)
    y = [i[1] for i in training_data]
    if not model:
        model = neural_network_model(input_size=len(X[0]))
    model.fit({'input': X}, {'targets': y}, n_epoch=3, snapshot_step=500,
              show_metric=True, run_id='openaistuff')
    return model

# initial_population() (defined elsewhere in the script) gathers the initial training data
training_data = initial_population()
model = train_model(training_data)

End Matter

import random
import numpy as np

# env, model and goal_steps (the maximum number of steps per game) are defined earlier in the script
scores = []
choices = []
for each_game in range(20):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()
        if len(prev_obs) == 0:
            # first step: no previous observation yet, so pick a random move
            action = random.randrange(0, 2)
        else:
            # otherwise let the trained model choose the move
            action = np.argmax(model.predict(prev_obs.reshape(-1, len(prev_obs), 1))[0])
        choices.append(action)
        print("Action: ", action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score += reward
        if done:
            print("Done with game number = ", each_game, " Score = ", score)
            break
    scores.append(score)

print("Average Score: ", sum(scores) / len(scores))

Now we are at the end of our article and the code for this amazing Cartpole game bot! This last block is just the end matter: it resets the game, lets the trained model play a few rounds and displays the score.


Key Takeaways

  • RL is a setup in which an agent takes actions in an environment in order to maximize its total reward.
  • There are three ML paradigms: RL, Unsupervised Learning and Supervised Learning.
  • Gym is a collection of environments to develop and test RL algorithms.
  • The loop of the agent observing the environment’s output (a reward and the next state) and then acting on it is a Markov Decision Process.
  • DQN is a Reinforcement Learning technique that is aimed at choosing the best action for any circumstance (observation).
  • A Deep Neural Network filters all the data fed to it through its layers and trains on it accordingly.

I found this video on YouTube showing a real-life application of Cartpole and RL.

The fun starts at 1:38

Hope you enjoyed this article and now know about RL and Deep Q-Learning! Please leave some comments and give it some claps!