Learning Machine Learning: Roulette with Monte Carlo Policy

Every week or so I push myself with a new deep learning or machine learning problem. I’ve been coding ML daily for 124 days now. This week’s challenge was to practice what I’ve been learning in Move 37, the deep reinforcement learning course offered for free by Siraj Raval. We’ve been covering Monte Carlo methods and have seen an example using OpenAI gym’s blackjack environment. So let’s get right down to it and code roulette with the Monte Carlo technique.

Imports: gym for roulette, numpy for math, and matplotlib to graph the results. Make our roulette environment. Set epsilon to 0.05, so 5% of the time we choose a random action. Set gamma to 1.0; we won’t discount future rewards, since there are no guarantees in roulette anyway. Initialize everything else to 0 or empty, sized according to the OpenAI gym roulette documentation.

import gym
import numpy as np
import matplotlib.pyplot as plt
env = gym.make('Roulette-v0')
EPS = 0.05
GAMMA = 1.0
Q = {}
agentSumSpace = [i for i in range(0, 37)]   # states 0-36
actionSpace = [i for i in range(0, 38)]     # bets 0-36, plus action 37 to get up from the table
stateSpace = []
returns = {}
pairsVisited = {}
for total in agentSumSpace:
    for action in actionSpace:
        Q[(total, action)] = 0
        returns[(total, action)] = 0
        pairsVisited[(total, action)] = 0
    stateSpace.append(total)
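Just to double check the sizes, we can ask the environment itself what its spaces look like. The comments below are what I expect Roulette-v0 to report, based on my reading of the docs, rather than anything the rest of the code depends on.

print(env.action_space)        # expecting Discrete(38): bets on 0-36 plus one action to walk away
print(env.observation_space)   # expecting Discrete(1): the wheel has no memory, so there is hardly any state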

Randomly initialize our policy.

policy = {}
for state in stateSpace:
    policy[state] = np.random.choice(actionSpace)

A million episodes of training should be plenty to get a good policy. Initialize a few variables for each episode, print our progress every hundred thousand episodes, and reset the environment.

numEpisodes = 1000000
for i in range(numEpisodes):
    statesActionsReturns = []
    memory = []
    if i % 100000 == 0:
        print('starting episode', i)
    observation = env.reset()
    done = False

Until the game is done, take an action based on our Monte Carlo policy and record the results. An action is either placing a bet or getting up from the table.

    while not done:
        action = policy[observation]
        observation_, reward, done, info = env.step(action)
        memory.append((observation, action, reward))
        observation = observation_
    memory.append((observation, action, reward))   # also record the terminal state

Step backwards through the memory to compute the return G that followed each state/action pair.

    G = 0
    last = True
    for observed, action, reward in reversed(memory):
        if last:
            last = False   # skip the duplicated terminal record
        else:
            statesActionsReturns.append((observed, action, G))
        G = GAMMA*G + reward
    statesActionsReturns.reverse()
    statesActionsVisited = []
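As an aside, here’s a tiny made-up episode pushed through that same backward loop, just to make it concrete: we bet on 5 and lost, then bet on 12 and lost, and the episode ended there. The numbers are invented purely for illustration, and the last entry is the duplicated terminal record the rollout code above appends. Each state/action pair ends up paired with the total reward from that point on.

toyMemory = [(0, 5, -1), (0, 12, -1), (0, 12, -1)]   # (state, action, reward), invented numbers
G = 0
toyReturns = []
last = True
for observed, action, reward in reversed(toyMemory):
    if last:
        last = False
    else:
        toyReturns.append((observed, action, G))
    G = GAMMA * G + reward
toyReturns.reverse()
print(toyReturns)   # prints [(0, 5, -2.0), (0, 12, -1.0)]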

This next part is where the Monte Carlo policy update happens. It can look intimidating, but I’ll try to explain. We go through every state/action pair and the return that followed taking that action in that state. The first time a pair appears in an episode, we update its running average return, then most of the time we make the policy greedy for that state: compare the values of every action from that state and pick the best one, breaking ties randomly. With probability epsilon we choose a random action instead. In the beginning we are more likely to try something new, but epsilon diminishes over time (it decays by 1e-7 per episode, reaching zero about halfway through the million training episodes), so eventually we take only the best known actions.

    for observed, action, G in statesActionsReturns:
        sa = (observed, action)
        if sa not in statesActionsVisited:
            pairsVisited[sa] += 1
            # running average of the returns seen for this state/action pair
            returns[sa] += (1 / pairsVisited[sa]) * (G - returns[sa])
            Q[sa] = returns[sa]
            rand = np.random.random()
            if rand < 1 - EPS:
                # greedy: pick the highest-valued action, breaking ties at random
                state = observed
                values = np.array([Q[(state, a)] for a in actionSpace])
                best = np.random.choice(np.where(values == values.max())[0])
                policy[state] = actionSpace[best]
            else:
                policy[state] = np.random.choice(actionSpace)
            statesActionsVisited.append(sa)
    if EPS - 1e-7 > 0:
        EPS -= 1e-7
    else:
        EPS = 0
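Before testing, it’s worth peeking at what the policy actually settled on. As far as I can tell, Roulette-v0 only ever reports a single observation (0), so the whole learned strategy boils down to one table entry, and we can compare the value of walking away against the best available bet.

print('learned action for the single roulette state:', policy[0])
print('value of walking away:', Q[(0, 37)])
print('best betting value:', max(Q[(0, a)] for a in range(37)))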

Test our trained Monte Carlo policy.

numEpisodes = 1000
rewards = np.zeros(numEpisodes)
totalReward = 0
wins = 0
losses = 0
print('getting ready to test policy')
for i in range(numEpisodes):
    observation = env.reset()
    done = False
    while not done:
        action = policy[observation]
        observation_, reward, done, info = env.step(action)
        observation = observation_
    totalReward += reward
    rewards[i] = totalReward
    if reward >= 1:
        wins += 1
    elif reward == -1:
        losses += 1
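An optional extra number: since rewards holds the running total across the test episodes, dividing its last entry by the number of episodes gives the average reward per test episode.

print('average reward per test episode:', rewards[-1] / numEpisodes)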

Print the results.

wins /= numEpisodes
losses /= numEpisodes
print('win rate', wins, 'loss rate', losses)
plt.plot(rewards)
plt.show()

The results are clear: zero wins and zero losses. The best way to win at roulette is not to play at all. The cheeky Monte Carlo policy decided to leave the table every game, and it is correct. The house always wins; it’s a losing game. Just for fun, let’s see what happens if we force our Monte Carlo policy to play by changing one line of code.

actionSpace = [i for i in range(0, 38)]

Remove the option of getting up from the table.

actionSpace = [i for i in range(0, 37)]

Now let’s see what happens.

It wins 2.5% of the time and loses 97.5% of the time. The first run was clearly on to something. This week I’ve learned not to play roulette, but more importantly I’ve learned how to use a Monte Carlo policy to solve problems with unknown variables.

Thanks to Siraj Raval for the free deep RL course, and thanks to YouTuber Machine Learning with Phil for helping me understand Monte Carlo policy in OpenAI gym.