Solving the Curious Case of the MountainCar Reward Problem Using OpenAI Gym, Keras, and TensorFlow in Python

Ashok Tankala
Coinmonks
6 min read · Oct 19, 2018



This post will help you write a gaming bot for less-rewarding games like MountainCar using OpenAI Gym and TensorFlow.

After I built a model for playing the CartPole game, I felt confident and thought, let's write code for one more game. I found the MountainCar game interesting, so I thought, why not write one for it?

Once I started writing it, I realized it's not an easy task. The biggest problem is that the game always gives a negative reward: whatever random actions I took, it didn't matter, I ended up with a total score of -200 and finally lost the game. I checked different articles and tried different approaches, but didn't find a proper answer.

After reading in so many places, I realized that instead of relying on the reward given by the game, I could create one myself based on a specific condition, and this solved my problem. I want to share it with everyone so that nobody else has to go through the pain I went through.

Without wasting much time, let's start coding. If you are trying OpenAI Gym for the first time, please read my previous article here.

First, let's import the packages we need to implement this:
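A minimal sketch of the imports, assuming the 2018-era stack this post was written against (standalone Keras and the classic gym API, where step() returns four values):

```python
import gym
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
```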

Let's create the environment and initialize the variables:
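A sketch of the setup; the values follow the walkthrough below, and intial_games keeps the article's original spelling since the data-population loop later refers to it:

```python
env = gym.make('MountainCar-v0')
env.reset()

goal_steps = 200          # an episode ends after at most 200 steps
score_requirement = -198  # keep only games that scored -198 or better
intial_games = 10000      # how many random games to collect data from
```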

Before we start writing the code, let's first understand what we are getting into:
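One way to do that is to print the environment's action and observation spaces, using gym's standard attributes:

```python
# What can we do, and what do we see?
print(env.action_space)            # the set of allowed actions
print(env.observation_space)       # the shape of each observation
print(env.observation_space.high)  # max position and velocity
print(env.observation_space.low)   # min position and velocity
```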

You will get output like this if you execute this code:
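The exact formatting depends on your gym and NumPy versions, but for MountainCar-v0 the action space is Discrete(3) and the observation is a 2-dimensional Box (position in [-1.2, 0.6], velocity in [-0.07, 0.07]), so it looks roughly like:

```
Discrete(3)
Box(2,)
[0.6  0.07]
[-1.2 -0.07]
```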

According to the documentation: “-1 for each time step, until the goal position of 0.5 is reached. As with MountainCarContinuous v0, there is no penalty for climbing the left hill, which upon reached acts as a wall.”

The episode ends when you reach the 0.5 (top) position, or when 200 iterations are reached. I ran batches of 10,000 random games several times but never reached the top position. So at the data-population step, I changed the logic slightly, which finally gave me the solution.

The code for data population is:
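Here is a sketch of that function, reconstructed from the step-by-step explanation that follows (the name model_data_preparation is my own label, not necessarily the original's):

```python
def model_data_preparation():
    training_data = []
    accepted_scores = []
    for game_index in range(intial_games):
        score = 0
        game_memory = []
        previous_observation = []
        for step_index in range(goal_steps):
            # pick one of the 3 allowed actions at random:
            # 0 = push left, 1 = no push, 2 = push right
            action = random.randrange(0, 3)
            observation, reward, done, info = env.step(action)

            # skip the very first step: there is no previous observation yet
            if len(previous_observation) > 0:
                game_memory.append([previous_observation, action])

            previous_observation = observation

            # the tweak: if the car climbed past -0.2, hand out our own
            # reward of 1 instead of the environment's constant -1
            if observation[0] > -0.2:
                reward = 1

            score += reward
            if done:
                break

        # keep only games that beat our minimum requirement of -198
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # one-hot encode the action, since it is categorical data
                if data[1] == 0:
                    output = [1, 0, 0]
                elif data[1] == 1:
                    output = [0, 1, 0]
                else:
                    output = [0, 0, 1]
                training_data.append([data[0], output])

        env.reset()

    print(accepted_scores)
    return training_data

training_data = model_data_preparation()
```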

The key part lies in the code above. Let's understand it line by line, and along the way I will explain the tweak that helped me solve this problem.

  1. We initialize the training_data and accepted_scores arrays.
  2. We need to play multiple times so that we can collect data to use further. So we play 10,000 games to get a decent amount of data; the line “for game_index in range(intial_games):” is for that.
  3. We initialize the score, game_memory, and previous_observation variables, where we store the current game's total score, the previous step's observation (the position of the car and its velocity), and the action we took for it.
  4. for step_index in range(goal_steps): — this plays the game for up to 200 steps, because an episode ends when you reach the 0.5 (top) position or after 200 iterations.
  5. We take random actions so that we can play the game, which may lead to successfully completing a step or losing the game. Only 3 actions are allowed: push left (0), no push (1), and push right (2). The code random.randrange(0, 3) picks one of these actions at random.
  6. We take that action/step. Then, if it's not the first action/step, we store the previous observation and the action we took for it.
  7. Then we check whether the position of the car (observation[0]) is greater than -0.2. If it is, instead of taking the reward given by the game environment, I set it to 1, because the car starts in the valley around -0.5, so a position of -0.2 means it has climbed well up the hill; in other words, our random actions are giving somewhat fruitful results.
  8. We add the reward to the score and check whether the game is finished; if it is, we stop playing it.
  9. We check whether this game fulfills our minimum requirement, that is, whether we got a score greater than or equal to -198.
  10. If the score is greater than or equal to -198, we add it to accepted_scores, which we later print to see how many games' data, and with what scores, we are feeding to our model.
  11. Then we one-hot encode the action, because its values 0 (push left), 1 (no push), and 2 (push right) represent categorical data.
  12. Then we add that to our training_data.
  13. We reset the environment to make sure everything is clear before starting the next game.
  14. print(accepted_scores) — this tells us how many games' data, and with what scores, we are feeding to our model. Then we return the training data.

We will get some reasonable game scores like the ones below:

So our data is ready. It's time to build our neural network.

Here we are going to use the Sequential model:
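A minimal sketch of such a network; the hidden-layer sizes and the mse loss with a linear output are illustrative choices, not necessarily the article's exact architecture:

```python
def build_model(input_size, output_size):
    model = Sequential()
    model.add(Dense(128, input_dim=input_size, activation='relu'))
    model.add(Dense(52, activation='relu'))
    model.add(Dense(output_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam())
    return model
```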

We have the training data, so from that we will create features and labels:
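A sketch of that split, with X and y as my own names for the feature and label arrays:

```python
# features: the saved observations (position, velocity)
X = np.array([data[0] for data in training_data]).reshape(-1, 2)
# labels: the one-hot encoded actions
y = np.array([data[1] for data in training_data]).reshape(-1, 3)
```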

Then we will start the training:
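A sketch of the training call; the epoch count is an illustrative choice:

```python
model = build_model(input_size=len(X[0]), output_size=len(y[0]))
model.fit(X, y, epochs=10)
```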

You will get output like this:

It’s time for our gaming bot to play the game for us.
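A sketch of the play loop: the first step is random, because there is no previous observation to predict from yet; after that the bot follows the action the model scores highest:

```python
scores = []
choices = []
for each_game in range(100):
    score = 0
    prev_obs = []
    env.reset()
    for step_index in range(goal_steps):
        # env.render()  # uncomment to watch the bot play
        if len(prev_obs) == 0:
            action = random.randrange(0, 3)
        else:
            action = np.argmax(
                model.predict(prev_obs.reshape(-1, len(prev_obs)))[0])

        choices.append(action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        score += reward  # the raw environment reward, untouched
        if done:
            break
    scores.append(score)

print('Average Score:', sum(scores) / len(scores))
```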

Here you can see I didn't touch the reward part at all. But our model has learned which actions take it to the top of the hill, so it automatically performs well. After executing this code, you will get scores like this:

Great job! Your bot did very well.

Congrats!!! You now understand the reward mechanism well, and you also understand how to design a solution when your game is not friendly with its rewards.

You will find the Jupyter notebook for this implementation here.

If you enjoyed this article, show me your love by giving it some claps 👏.
Peace. Happy Coding.
See my original article here.


