Solving the Curious Case of the MountainCar Reward Problem Using OpenAI Gym, Keras, and TensorFlow in Python

Ashok Tankala
Coinmonks
6 min read · Oct 19, 2018



This post will help you write a gaming bot for less-rewarding games like MountainCar using OpenAI Gym and TensorFlow.

After I built a model for playing the CartPole game, I felt confident and thought, let's write code for one more game. I found the MountainCar game interesting, so I thought, why not write one for it?

Once I started writing it, I realized it's not an easy task. The biggest problem is that the game always gives a negative reward: whatever random actions I took, it didn't matter, I ended up with a total score of -200 and finally lost the game. I checked different articles and tried different approaches, but didn't find a proper answer.

After reading in so many places, I realized that instead of relying on the reward given by the game, I could create one myself based on a specific condition, and this solved my problem. I want to share it with everyone so that nobody else has to go through the pain I went through.

Without wasting much time, let's start coding. If you are trying OpenAI Gym for the first time, please read my previous article here.

First, let's import the packages we need to implement this:
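A minimal sketch of the imports, assuming the 2018-era stack this post was written against (standalone Keras and the classic gym API, where step() returns four values):

```python
import gym
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
```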

Let's create the environment and initialize the variables:
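A sketch of the setup; the values follow the walkthrough below, and intial_games keeps the article's original spelling since the data-population loop later refers to it:

```python
env = gym.make('MountainCar-v0')
env.reset()

goal_steps = 200          # an episode ends after at most 200 steps
score_requirement = -198  # keep only games that scored -198 or better
intial_games = 10000      # how many random games to collect data from
```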

Before we start writing the code, let's first understand what we are getting into:
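One way to do that is to print the environment's action and observation spaces, using gym's standard attributes:

```python
# What can we do, and what do we see?
print(env.action_space)            # the set of allowed actions
print(env.observation_space)       # the shape of each observation
print(env.observation_space.high)  # max position and velocity
print(env.observation_space.low)   # min position and velocity
```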

You will get output like this if you execute this code:
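The exact formatting depends on your gym and NumPy versions, but for MountainCar-v0 the action space is Discrete(3) and the observation is a 2-dimensional Box (position in [-1.2, 0.6], velocity in [-0.07, 0.07]), so it looks roughly like:

```
Discrete(3)
Box(2,)
[0.6  0.07]
[-1.2 -0.07]
```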

According to the documentation: “-1 for each time step, until the goal position of 0.5 is reached. As with MountainCarContinuous v0, there is no penalty for climbing the left hill, which upon reached acts as a wall.”

The episode ends when you reach the 0.5 (top) position, or when 200 iterations are reached. I ran batches of 10,000 random games several times but never reached the top position. So at the data-population step, I changed the logic slightly, which finally gave me the solution.

The code for data population is:
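Here is a sketch of that function, reconstructed from the step-by-step explanation that follows (the name model_data_preparation is my own label, not necessarily the original's):

```python
def model_data_preparation():
    training_data = []
    accepted_scores = []
    for game_index in range(intial_games):
        score = 0
        game_memory = []
        previous_observation = []
        for step_index in range(goal_steps):
            # pick one of the 3 allowed actions at random:
            # 0 = push left, 1 = no push, 2 = push right
            action = random.randrange(0, 3)
            observation, reward, done, info = env.step(action)

            # skip the very first step: there is no previous observation yet
            if len(previous_observation) > 0:
                game_memory.append([previous_observation, action])

            previous_observation = observation

            # the tweak: if the car climbed past -0.2, hand out our own
            # reward of 1 instead of the environment's constant -1
            if observation[0] > -0.2:
                reward = 1

            score += reward
            if done:
                break

        # keep only games that beat our minimum requirement of -198
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # one-hot encode the action, since it is categorical data
                if data[1] == 0:
                    output = [1, 0, 0]
                elif data[1] == 1:
                    output = [0, 1, 0]
                else:
                    output = [0, 0, 1]
                training_data.append([data[0], output])

        env.reset()

    print(accepted_scores)
    return training_data

training_data = model_data_preparation()
```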

The key part lies in the code above. Let's understand it line by line, and along the way I will explain the tweak that helped me solve this problem.

  1. We initialize the training_data and accepted_scores arrays.
  2. We need to play multiple times so that we can collect data to use further. So we play 10,000 games to get a decent amount of data; the line “for game_index in range(intial_games):” is for that.
  3. We initialize the score, game_memory, and previous_observation variables, where we store the current game's total score, the previous step's observation (the position of the car and its velocity), and the action we took for it.
  4. for step_index in range(goal_steps): — this plays the game for up to 200 steps, because an episode ends when you reach the 0.5 (top) position or after 200 iterations.
  5. We take random actions so that we can play the game, which may lead to successfully completing a step or losing the game. Only 3 actions are allowed: push left (0), no push (1), and push right (2). The code random.randrange(0, 3) picks one of these actions at random.
  6. We take that action/step. Then, if it's not the first action/step, we store the previous observation and the action we took for it.
  7. Then we check whether the position of the car (observation[0]) is greater than -0.2. If it is, instead of taking the reward given by the game environment, I set it to 1, because the car starts in the valley around -0.5, so a position of -0.2 means it has climbed well up the hill; in other words, our random actions are giving somewhat fruitful results.
  8. We add the reward to the score and check whether the game is finished; if it is, we stop playing it.
  9. We check whether this game fulfills our minimum requirement, that is, whether we got a score greater than or equal to -198.
  10. If the score is greater than or equal to -198, we add it to accepted_scores, which we later print to see how many games' data, and with what scores, we are feeding to our model.
  11. Then we one-hot encode the action, because its values 0 (push left), 1 (no push), and 2 (push right) represent categorical data.
  12. Then we add that to our training_data.
  13. We reset the environment to make sure everything is clear before starting the next game.
  14. print(accepted_scores) — this tells us how many games' data, and with what scores, we are feeding to our model. Then we return the training data.

We will get some reasonable game scores like the ones below:

So our data is ready. It's time to build our neural network.

Here we are going to use the Sequential model:
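A minimal sketch of such a network; the hidden-layer sizes and the mse loss with a linear output are illustrative choices, not necessarily the article's exact architecture:

```python
def build_model(input_size, output_size):
    model = Sequential()
    model.add(Dense(128, input_dim=input_size, activation='relu'))
    model.add(Dense(52, activation='relu'))
    model.add(Dense(output_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam())
    return model
```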

We have the training data, so from that we will create features and labels:
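A sketch of that split, with X and y as my own names for the feature and label arrays:

```python
# features: the saved observations (position, velocity)
X = np.array([data[0] for data in training_data]).reshape(-1, 2)
# labels: the one-hot encoded actions
y = np.array([data[1] for data in training_data]).reshape(-1, 3)
```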

Then we will start the training:
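A sketch of the training call; the epoch count is an illustrative choice:

```python
model = build_model(input_size=len(X[0]), output_size=len(y[0]))
model.fit(X, y, epochs=10)
```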

You will get output like this:

It’s time for our gaming bot to play the game for us.
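A sketch of the play loop: the first step is random, because there is no previous observation to predict from yet; after that the bot follows the action the model scores highest:

```python
scores = []
choices = []
for each_game in range(100):
    score = 0
    prev_obs = []
    env.reset()
    for step_index in range(goal_steps):
        # env.render()  # uncomment to watch the bot play
        if len(prev_obs) == 0:
            action = random.randrange(0, 3)
        else:
            action = np.argmax(
                model.predict(prev_obs.reshape(-1, len(prev_obs)))[0])

        choices.append(action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        score += reward  # the raw environment reward, untouched
        if done:
            break
    scores.append(score)

print('Average Score:', sum(scores) / len(scores))
```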

Here you can see I didn't touch the reward part at all. But our model has learned which actions take it to the top of the hill, so it automatically performs well. After executing this code, you will get scores like this:

Great job! Your bot did very well.

Congrats!!! You now understand the reward mechanism well, and you also understand how to design a solution when your game is not friendly with its rewards.

You will find the Jupyter notebook for this implementation here.

If you enjoyed this article, show me your love by giving it some claps 👏.
Peace. Happy Coding.
See my original article here.


