Solving OpenAI's CartPole using Reinforcement Learning Part-2

Maciej Balawejder
Apr 20 · 5 min read

In the first tutorial, I introduced the most basic reinforcement learning method, Q-learning, to solve the CartPole problem. Because of its computational limitations, it only works in simple environments where the number of states and possible actions is relatively small. Calculating, storing, and updating Q-values for every state-action pair in a more complex environment is either impossible or highly inefficient. This is where the Deep Q-Network (DQN) comes into play.

Background Information

Deep Q-Learning was introduced in 2013 in the paper Playing Atari with Deep Reinforcement Learning by the DeepMind team [1]. A similar approach had already been tried in 1992 with TD-Gammon, which achieved a superhuman level of backgammon play, but the method did not carry over to games like chess, go, or checkers. DeepMind's agent was able to surpass human performance in 3 out of 7 Atari games, using raw images as input and the same hyperparameters for all games. This was a breakthrough toward more general learning.

The basic idea of DQN is to combine Q-learning with deep learning. We get rid of the Q-table and use a neural network instead to approximate the action-value function Q(s, a). The state is passed to the network, and as an output we receive the estimated Q-value for each action.

DQN Architecture

In order to train the network, we need a target value, also known as the ground truth. The question is: how do we evaluate the loss function without actually having a labeled dataset?

Well, we create the target values on the fly using the Bellman equation.
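For a transition (s, a, r, s'), the target is the immediate reward plus the discounted value of the best action in the next state:

```latex
y = r + \gamma \max_{a'} Q(s', a'), \qquad L = \bigl(y - Q(s, a)\bigr)^{2}
```

where γ is the discount factor and L is the loss the network minimizes; for a terminal state the target reduces to y = r.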

This method is called bootstrapping: we estimate something based on another estimate. Essentially, we estimate the current action value Q(s, a) using an estimate of the future value Q(s', a'). The problem arises when one network is used to predict both values. It is similar to a dog chasing its own tail: the weights are updated to move the predictions closer to the target Q-values, but the targets also move, because they come from the same network.

The solution was presented in the DeepMind paper Human-level control through deep reinforcement learning [3]. The idea is to use a separate network to predict the target values. Every C time steps, the weights from the policy network are copied to the target network. This makes the algorithm more stable, since the policy network is no longer chasing a nonstationary target.
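As a rough sketch (PyTorch is used here purely for illustration; the network sizes and the value of C are placeholders, not the hyperparameters from the repository), the periodic copy looks like this:

```python
import torch.nn as nn

# Placeholder networks: 4 state variables in, 2 Q-values out (CartPole).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(policy_net.state_dict())  # start with identical weights

C = 100  # copy the weights every C training steps (placeholder value)

for step in range(10_000):
    # ... one training step of the policy network goes here ...
    if step % C == 0:
        target_net.load_state_dict(policy_net.state_dict())  # refresh the target network
```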

In order to train the network we need four values: the state (S), the action (A), the reward (R), and the next state (S'). These values are stored in a replay memory and then randomly sampled for training. This process is called experience replay and was also introduced by DeepMind. As the authors describe it [3]:

"First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. To perform experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1})."
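A minimal replay memory can be sketched like this (the capacity and batch size are illustrative defaults, not the values from the repository):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) tuples and samples them at random."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the batch into tuples of states, actions, rewards, next_states, dones.
        return list(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```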

The results of using experience replay and a target network [3]

The Deep Q-Learning training process with experience replay and a target network [3]

Implementation details

1. Environment

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.[4]
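Interacting with the environment takes only a few lines of the Gym API used at the time (newer Gym/Gymnasium releases changed the return values of `reset` and `step`):

```python
import gym

env = gym.make("CartPole-v0")
state = env.reset()  # 4 values: cart position, cart velocity, pole angle, pole angular velocity
done, total_reward = False, 0

while not done:
    action = env.action_space.sample()            # random action: 0 (push left) or 1 (push right)
    state, reward, done, info = env.step(action)  # +1 reward for every step the pole stays up
    total_reward += reward

print(f"Episode finished with a score of {total_reward}")
```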

2. Network

The architecture is based on fully connected dense layers with the ReLU activation function. The output layer is a fully connected linear layer with two outputs, one for each action. As with many papers in reinforcement learning, I used the RMSprop optimizer.
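A sketch of such a network in PyTorch (the hidden-layer sizes and learning rate are placeholders; the actual values are in the repository):

```python
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    """Fully connected network: 4 state variables in, one Q-value per action out."""

    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # linear output: raw Q-values, no activation
        )

    def forward(self, x):
        return self.layers(x)

policy_net = DQN()
optimizer = optim.RMSprop(policy_net.parameters(), lr=1e-3)
```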

3. Hyperparameters

4. Code

The version with plots is available on my GitHub (https://github.com/maciejbalawejder). Simplified sketches of a few of these pieces follow the list below.

  • defining models
  • experience replay
  • epsilon schedule: one version with a decay coefficient and one with a, b, c parameters to control the shape of the function
  • choosing an action
  • training function
  • training loop
  • testing loop
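The exact epsilon parameterization is in the repository; the logistic form with a, b, c below is only one possible way to get the step-like shape discussed in the results, and the training function assumes networks and minibatches shaped like the earlier sketches:

```python
import math
import random
import torch
import torch.nn.functional as F

def epsilon(step, a=1.0, b=0.05, c=200, eps_min=0.01):
    """Step-shaped exploration schedule (illustrative): a sets the starting value,
    b the steepness of the drop, and c the step at which the drop happens."""
    x = b * (step - c)
    if x > 50:  # avoid math.exp overflow for very large steps
        return eps_min
    return max(eps_min, a / (1.0 + math.exp(x)))

def choose_action(net, state, eps, n_actions=2):
    """Epsilon-greedy action selection."""
    if random.random() < eps:
        return random.randrange(n_actions)  # explore: random action
    with torch.no_grad():
        return net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()  # exploit

def train_step(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a sampled minibatch using the Bellman target."""
    states, actions, rewards, next_states, dones = batch
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    q_values = policy_net(states).gather(1, actions).squeeze(1)  # Q(s, a) from the policy network
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]               # max_a' Q(s', a') from the target network
    targets = rewards + gamma * next_q * (1 - dones)             # terminal states keep only the reward

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the training loop, every environment step pushes a transition into the replay memory; once the memory is large enough, a minibatch is sampled, `train_step` is called, and the weights are copied to the target network every C steps.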

Results

The first plot, on the left, shows the epsilon value decayed at each iteration during an episode. The right plot shows the epsilon function defined by three parameters to achieve a step-function shape.
Achieving the maximum score in an episode is tightly related to the epsilon value: once the randomness of the actions is reduced, the neural network starts to learn an effective policy. A small minimum value of epsilon is kept to prevent the network from memorizing stochastic state transitions, i.e. overfitting.

Testing on 100 episodes using the model from the right plot above

The training process took roughly 4 hours on an Intel Core i5-10210U CPU, and the model appears to solve the environment.

Improvements

The problem described here uses a low-dimensional input, unlike most breakthrough models, which take raw images as input and extract the features themselves. Nevertheless, it is a good playground for understanding how beautiful and powerful the idea of Deep Q-Learning is. To reduce the training time, I would go further and try different shapes of the epsilon function. Another important hyperparameter is the target-network update frequency. The hard update can be replaced by a soft update, where instead of copying the weights all at once, we update the target network frequently and by a small amount [5]. Also, prioritizing experiences from the replay memory can improve the effectiveness of the training process [6].
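For reference, a soft (Polyak) update replaces the periodic hard copy with a small interpolation after every training step; tau below is a placeholder value:

```python
import torch

def soft_update(policy_net, target_net, tau=0.005):
    """Move the target network a small step towards the policy network."""
    with torch.no_grad():
        for target_param, policy_param in zip(target_net.parameters(), policy_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * policy_param)
```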

References

[1] https://arxiv.org/pdf/1312.5602.pdf

[2] https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/

[3] https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf

[4] https://gym.openai.com/envs/CartPole-v0/

[5] https://arxiv.org/abs/2008.10861

[6] https://arxiv.org/abs/1511.05952
