# Learning Machine Learning: Roulette with Monte Carlo Policy

Every week or so I push myself with a new deep learning or machine learning problem. I’ve been coding ML daily for 124 days now. This week’s challenge was to practice what I’ve been learning in Move 37, the deep reinforcement learning course offered for free by Siraj Raval. We’ve been covering Monte Carlo methods and have seen an example using OpenAI Gym’s blackjack environment. So let’s get right down to it and code roulette with the Monte Carlo technique.

Imports: gym for roulette, numpy for math, and matplotlib to graph the results. Make our roulette environment. Set epsilon to 0.05, so 5% of the time we choose a random action. Set gamma to 1.0, so future rewards are not discounted. Initialize everything else to 0 or empty, sized according to the OpenAI Gym roulette documentation.

```python
import gym
import numpy as np
import matplotlib.pyplot as plt

env = gym.make('Roulette-v0')
EPS = 0.05
GAMMA = 1.0

Q = {}
agentSumSpace = [i for i in range(0, 37)]
actionSpace = [i for i in range(0, 38)]
stateSpace = []
returns = {}
pairsVisited = {}
for total in agentSumSpace:
    for action in actionSpace:
        Q[(total, action)] = 0
        returns[(total, action)] = 0
        pairsVisited[(total, action)] = 0
    stateSpace.append(total)
```

Randomly initialize our policy.

```python
policy = {}
for state in stateSpace:
    policy[state] = np.random.choice(actionSpace)
```

A million training episodes should be plenty to learn a good policy. Initialize some variables to use each episode, print our progress every hundred thousand episodes, and reset the environment.

```python
numEpisodes = 1000000
for i in range(numEpisodes):
    statesActionsReturns = []
    memory = []
    if i % 100000 == 0:
        print('starting episode', i)
    observation = env.reset()
    done = False
```

Until the game is done, take an action based on our Monte Carlo policy and record the results. An action is either placing a bet or getting up from the table.

```python
    # still inside the episode loop
    while not done:
        action = policy[observation]
        observation_, reward, done, info = env.step(action)
        memory.append((observation, action, reward))
        observation = observation_
    memory.append((observation, action, reward))  # record the terminal step
```

Step backwards through the memory to record the rewards based on the previous state/action pairs.

```python
    # still inside the episode loop
    G = 0
    last = True
    for observed, action, reward in reversed(memory):
        if last:
            last = False  # skip the terminal entry
        else:
            statesActionsReturns.append((observed, action, G))
        G = GAMMA * G + reward
    statesActionsReturns.reverse()
    statesActionsVisited = []
```
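To make the backward pass concrete, here is a tiny standalone sketch with made-up rewards (the `(state, action, reward)` triples are hypothetical, not from the roulette run), showing how `G` accumulates as we walk the memory in reverse:

```python
GAMMA = 1.0
# hypothetical episode memory: (state, action, reward) triples
memory = [(0, 5, -1), (0, 12, -1), (0, 3, 1)]

statesActionsReturns = []
G = 0
last = True
for observed, action, reward in reversed(memory):
    if last:
        last = False  # the last entry has no return following it
    else:
        statesActionsReturns.append((observed, action, G))
    G = GAMMA * G + reward  # accumulate the return

statesActionsReturns.reverse()
print(statesActionsReturns)  # → [(0, 5, 0.0), (0, 12, 1.0)]
```

Each pair ends up tagged with the (undiscounted, since gamma is 1) sum of the rewards that came after it.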

This next part is where the Monte Carlo update happens. It can look intimidating, but I’ll try to explain. We go through every state/action pair along with the return earned after taking that action in that state. This is first-visit Monte Carlo: each state/action pair is only updated the first time it appears in the episode. After updating Q, with probability 1 − epsilon we make the policy greedy, picking the action with the highest estimated value and choosing randomly in case of a tie; otherwise we pick a random action. In the beginning we will often try new actions, but over time epsilon will diminish and we will take only the already-known best actions.

```python
    # still inside the episode loop
    for observed, action, G in statesActionsReturns:
        sa = (observed, action)
        if sa not in statesActionsVisited:  # first-visit update only
            pairsVisited[sa] += 1
            # incremental average of the returns seen for this pair
            returns[sa] += (1 / pairsVisited[sa]) * (G - returns[sa])
            Q[sa] = returns[sa]
            state = observed
            if np.random.random() < 1 - EPS:
                # greedy: pick the highest-value action, breaking ties at random
                values = np.array([Q[(state, a)] for a in actionSpace])
                best = np.random.choice(np.where(values == values.max())[0])
                policy[state] = actionSpace[best]
            else:
                policy[state] = np.random.choice(actionSpace)
            statesActionsVisited.append(sa)
    # decay epsilon a little each episode
    if EPS - 1e-7 > 0:
        EPS -= 1e-7
    else:
        EPS = 0
```
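The random tie-break is worth seeing in isolation. With made-up Q-values (the array below is hypothetical), `np.where` collects every index that achieves the maximum, and `np.random.choice` picks one of them:

```python
import numpy as np

values = np.array([0.0, 2.0, 2.0, -1.0])  # made-up Q-values for four actions
best_indices = np.where(values == values.max())[0]  # indices of every tied maximum
print(best_indices)  # → [1 2]
best = np.random.choice(best_indices)  # break the tie at random
```

Without the random choice, `np.argmax` would always favor the lowest index, biasing the policy toward earlier actions whenever values are tied.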

Test our trained Monte Carlo policy.

```python
numEpisodes = 1000
rewards = np.zeros(numEpisodes)
totalReward = 0
wins = 0
losses = 0
print('getting ready to test policy')
for i in range(numEpisodes):
    observation = env.reset()
    done = False
    while not done:
        action = policy[observation]
        observation_, reward, done, info = env.step(action)
        observation = observation_
    totalReward += reward
    rewards[i] = totalReward
    if reward >= 1:
        wins += 1
    elif reward == -1:
        losses += 1
```

Print the results.

```python
wins /= numEpisodes
losses /= numEpisodes
print('win rate', wins, 'loss rate', losses)
plt.plot(rewards)
plt.show()
```

The results are clear: zero wins and zero losses. The best way to win at roulette is to not play at all. The cheeky Monte Carlo policy decided to leave the table every game. It is correct. The house always wins; it’s a losing game. Just for fun, let’s see what happens if we force our Monte Carlo policy to play by changing one line of code.

```python
actionSpace = [i for i in range(0, 38)]
```

Remove the option of getting up from the table.

```python
actionSpace = [i for i in range(0, 37)]
```

Now let’s see what happens.

It wins 2.5% of the time and loses 97.5% of the time. The first run was clearly on to something. This week I’ve learned not to play roulette, but more importantly I’ve learned how to use a Monte Carlo policy to solve problems with unknown variables.
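A back-of-the-envelope expected-value calculation backs this up. This is generic casino roulette math for a straight-up bet on a standard European wheel (37 pockets, 35-to-1 payout), not the gym environment’s exact reward scheme:

```python
pockets = 37   # European wheel: numbers 0 through 36
payout = 35    # a straight-up bet pays 35 to 1
p_win = 1 / pockets
ev = p_win * payout - (1 - p_win)  # expected value per unit bet
print(round(ev, 4))  # → -0.027
```

Every unit wagered loses about 2.7 cents on average, so no policy that keeps betting can come out ahead in the long run.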

Thanks to Siraj Raval for the free deep RL course, and thanks to YouTuber Machine Learning with Phil for helping me understand Monte Carlo policies in OpenAI Gym.