Learn to Win Games with Monte Carlo Reinforcement Learning AI

Artem Arutyunov · Published in The Power of AI · 11 min read · Dec 19, 2022

Deep Blue, a chess computer, beat world chess champion Garry Kasparov in 1997. The game of Go, played on a 19-by-19 board, is much harder for a computer, with about 10 to the power of 360 possible combinations, and it was thought it would take decades before a computer could beat a Go champion. But now, thanks to reinforcement learning, computers can beat Go champions, beat chess grandmasters, and outperform humans in many other games. In this blog, we will use a Monte Carlo reinforcement learning algorithm to play the simple game Frozen Lake.

CLICK ON: Monte Carlo Reinforcement Learning for Simple Games

There you will find detailed explanations of all the concepts mentioned here, and you can analyze and experiment with the code for this blog. You can also explore many FREE courses and projects about data science and other technology topics on Cognitive Class.

Let’s start then.

What’s Reinforcement Learning?

In general, reinforcement learning is just another machine learning method, based on rewarding desired actions/outputs and punishing undesired ones. Reinforcement learning models, just like people, choose which action to take based on the expected return of each action. You give your model some input that includes information about the current situation and the possible actions, and then you reward it based on the decision it makes. Reinforcement learning models learn to perform a task through repeated trial-and-error interactions with an environment, and they do so without any human intervention.

Basic Terminology:

  • Agent: your reinforcement learning model; it is the decision maker and learner.
  • Environment: the world around your agent; the agent learns and acts inside of it. The environment takes the action provided by the agent and returns the next state and the reward.
  • State: a complete description of the state of the environment.
  • Action: the way the agent interacts with the environment. The Action Space is the set of all possible actions.
  • Reward: the feedback from the environment; it can be negative or positive. It impacts the agent and serves as an indication of what the agent should achieve. Rewards are generally unknown in advance, and the agent learns how to estimate them correctly.
  • Policy: the rule an agent uses to decide which action to take, given a specific state. It works as a map from states to actions and can sometimes be defined as a set of probabilities for each action in the action space.
  • Value Function: the function that returns the expected total reward your agent can get by following a specific policy. The agent uses this value function to make decisions and learns by updating the function's expected reward values. In this blog, we will be using the state-action value function, so our function Q(s,a) will take a state-action pair and return an estimated reward for taking action a from state s.

Reinforcement Learning Process:

1. The agent plays a number of games.
2. In every game, the agent chooses an Action from the action space by using its Policy and Value Function.
3. The Action impacts the environment, and the Reward and the new State are returned to the agent.
4. The agent keeps track of which reward it received after choosing a certain action from a certain state.
5. After completing the game, the agent updates the estimated reward for each state and action by using the actual reward values received while playing the game.
6. The whole process repeats again.

Summary of RL process.

Let's check out our environment:

Frozen Lake Environment:

Frozen Lake Environment

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you’ll fall into the freezing water. You have to navigate across the lake and retrieve the disc.

In our case, Frozen Lake will be a grid-like environment where each tile can be:

1. S: starting point, safe
2. F: frozen surface, safe
3. H: hole, fall to your doom
4. G: goal, where the disc is located

Important things to know are:

  • The game ends if you step into a hole or get to your goal
  • You receive a reward of 1 if you reach the goal, and zero otherwise.

Let's initialize our environment and explore it a bit using the OpenAI Gym library:

import gym

env = gym.make("FrozenLake-v1", is_slippery=False)

Let’s see what our environment looks like, in our case, it should be a 4 by 4 grid:

env.reset()  # reset the environment to its initial state before rendering
env.render()

Which returns a simple grid:

SFFF
FHFH
FFFH
HFFG

Now let’s check the action space of the environment.

env.action_space

It returns Discrete(4), so we have 4 possible actions; the action space is {0, 1, 2, 3}:

  • Left: 0
  • Down: 1
  • Right: 2
  • Up: 3

Now we can make a random action and see what our environment returns:

# Choose a random action from the action space
action = env.action_space.sample()

# Execute this action
new_observation, reward, done, prob = env.step(action)
new_observation, reward, done, prob
#Returns:
(0, 0.0, False, {'prob': 1.0})

It returns the first observation. The returned observation of 0 tells us that we are on tile 0; the reward is 0 since we didn't get to the goal; the boolean parameter indicates whether the game is finished; and prob shows the probability that the chosen action is executed. We have a deterministic environment, so, for example, if we choose to go down, we'll go down.

Let's introduce a bit more terminology. An Episode is an agent-environment interaction from the initial to the final state, so it's one game that the agent plays. In addition, our agent operates in a discrete-time game: each time-advancing decision is a step (e.g. taking some action from some state). It's easy to see that each episode consists of a series of steps.

Example of an episode with returns from the environment. Note that, as mentioned, only when we reach the final goal, the G tile (index 15), do we get the desired reward of 1.

Now let's see which policy we are going to use:

Epsilon-Greedy Policy

If you remember, as mentioned before, a policy is just a function that defines which action our agent should take based on the current state. In our environment, a simple deterministic policy π for the first 4 states 0, 1, 2, 3 may look something like the example below.
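For illustration only (this particular mapping is made up rather than taken from a trained agent), such a policy is just a lookup table from state to action:

# A hypothetical deterministic policy for the first four states,
# using the action encoding introduced above: 0 = Left, 1 = Down, 2 = Right, 3 = Up
pi = {
    0: 1,  # in state 0, go Down
    1: 2,  # in state 1, go Right
    2: 1,  # in state 2, go Down
    3: 0,  # in state 3, go Left
}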

Now let's clarify a few things about the title. Epsilon is just some constant ε with 0 ≤ ε ≤ 1, and it will define a probability. Greedy refers to the concept in computer science of a greedy algorithm, one that makes a locally optimal choice at each stage. In our case, a greedy policy is one that chooses the action with the biggest estimated return.

For now, assume that Q(s,a) is our value function: it returns an estimated reward for the given state and action. Let A be the action space; then our policy can be defined quite simply.
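In its standard textbook form (the lab notebook may write it slightly differently), the epsilon-greedy policy is:

π(a|s) = 1 − ε + ε/|A|   if a = argmax_{a' ∈ A} Q(s, a')
π(a|s) = ε/|A|           otherwise

In other words, with probability 1 − ε the agent picks the action with the highest estimated value, and with probability ε it picks an action uniformly at random.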

You may ask: if we want to maximize our returns, why can't we always take the best action, the one with the best estimated reward? What's the point of epsilon? To answer this question we will have to learn about 2 more concepts:

  • Exploitation happens when the agent makes the best decision given its current information: it uses the best-estimated action to maximize the reward.
  • Exploration happens when the agent takes a random action to explore more opportunities and gather more information about the possible actions and the environment.

Epsilon defines the trade-off between exploration and exploitation. We need it because the best long-term strategy may involve short-term sacrifices, and in most cases the agent must explore the environment and gather enough information before it can make the best overall decisions. It can save our agent from settling for decisions that merely work instead of finding the best actions.
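As a quick sketch of how this selection rule can look in code (this helper and its name are mine, not from the lab; Q is assumed to be a table where Q[state] holds one estimated return per action):

import numpy as np

def epsilon_greedy_action(Q, state, action_space, epsilon):
    """Pick a random action with probability epsilon, otherwise the best-estimated one."""
    if np.random.rand() < epsilon:
        return action_space.sample()      # exploration: try something random
    return int(np.argmax(Q[state]))       # exploitation: greedy w.r.t. current estimates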

Monte Carlo Method:

Let's talk about the heart of our algorithm: the value function that we will be using, and how it estimates the reward for each action given the state.

The Monte Carlo method was invented by Stanislaw Ulam in the 1940s while he was trying to calculate the probability of winning a game of Canfield solitaire (he was sick and had nothing better to do). Ulam repeatedly laid the cards out at random and simply counted the number of successful plays. We will apply the same approach to create our value function. The basic principle of the Monte Carlo method can be summarized in 4 steps (a toy code illustration follows the list):

  1. Define the domain of possible inputs
  2. Generate inputs randomly from a probability distribution over the domain
  3. Perform a deterministic computation on the inputs
  4. Average the results
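As a toy illustration of these four steps in our setting (this snippet is not part of the lab code), we can estimate the success probability of a completely random policy on Frozen Lake by playing many random games and averaging the returns, assuming the older Gym API used throughout this blog:

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)

returns = []
for _ in range(5000):                       # generate random inputs: 5000 random episodes
    env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # random action from the action space
        _, reward, done, _ = env.step(action)
        total_reward += reward              # deterministic computation: the episode's return
    returns.append(total_reward)

print("Estimated success rate:", np.mean(returns))   # average the results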

Before we can see it in action, let's define a few things. Recall that an episode is a set of agent-environment interactions from the initial to the final state, consisting of steps in a discrete-time game.

Monte Carlo reinforcement learning learns from episodes of experience: it works by setting the value function equal to the empirical mean return.
Let's assume that we have some initialized policy π that our agent follows. Then let's play one game and record the resulting episode.
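In general notation, such an episode is a sequence of states, actions and rewards

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, … , S_{T-1}, A_{T-1}, R_T, S_T

where T is the final time step, at which the agent either falls into a hole or reaches the goal.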

Now let's look at the total expected reward of taking an action A_t in the state S_t, where t is some time step.

At time step t = 0 (the first time step), the environment (including the agent) is in some state S_t = S_0 (the initial state); the agent takes an action A_t = A_0 (the first action in the game), receives a reward R_{t+1} = R_1, and the environment (including the agent) moves to the next state S_{t+1} = S_1. Let's define a function G, which will give us the total discounted reward (the return) at each time step.
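In standard form, the discounted return is

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … + γ^{T−t−1}·R_T

where T is the final time step of the episode.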

The discount factor gamma, with 0 ≤ gamma ≤ 1, is an important constant. We add the first reward R_{t+1} as it is, without modifying its value; then, to get the total reward, we add R_{t+2}, but note that its value is multiplied by gamma, so R_{t+2} is only partially counted; R_{t+3} is multiplied by gamma², R_{t+4} by gamma³, and so on. Gamma determines how much the reinforcement learning agent cares about rewards in the distant future relative to those in the immediate future. Note that if gamma = 0 the total return is defined solely by the immediate reward, so the agent will only learn and care about actions that produce an immediate reward.
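For a concrete hand-worked example: on our 4x4 lake the shortest safe path from S to G takes 6 moves, and only the last move yields a reward of 1. With gamma = 0.8, the return from the start state along that path is

G_0 = 0 + γ·0 + γ²·0 + γ³·0 + γ⁴·0 + γ⁵·1 = 0.8⁵ ≈ 0.33

so the goal reward still influences the value of the start state, just with a smaller weight.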

Now we can define our action-value function Q_π(S, A) for some state S and action A as follows.
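In standard notation,

Q_π(S, A) = E_π[ G_t | S_t = S, A_t = A ]

i.e. the expected return when starting from state S, taking action A, and then following policy π.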

So the value function returns the expected value of the total discounted reward G_t for a time step t at which S_t = S and A_t = A.

Now, after completing a series of episodes of the game, how can we adjust the expected values? Or, a bigger question: how does the learning process itself happen in the Monte Carlo method? For that, we will use the concept of the incremental mean.

The incremental mean is just the average of a set of values, computed incrementally. Let x_1, x_2, …, x_n be the values and let μ_1, μ_2, …, μ_n be the corresponding sequence of means, where μ_1 = x_1, μ_2 = (x_1 + x_2)/2, μ_3 = (x_1 + x_2 + x_3)/3, and so on. Let's see how the mean can be computed incrementally.
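Expanding the definition of the n-th mean and rearranging gives the standard incremental form:

μ_n = (x_1 + x_2 + … + x_n)/n = ( x_n + (n−1)·μ_{n−1} )/n = μ_{n−1} + (1/n)·( x_n − μ_{n−1} )

so each new mean is just the previous mean nudged towards the newest value by a step of size 1/n.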

Now we can put everything together to describe the Monte Carlo learning process. Suppose the agent has just finished an episode like the one written out above.

For each (state, action) pair we will keep track of the number of times this pair has been visited; let's call this counter N(s_t, a_t). Then, every time we visit a (state, action) pair, we update the visit counter and adjust the running mean:
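Concretely, for the return G_t observed after visiting (s_t, a_t):

N(s_t, a_t) ← N(s_t, a_t) + 1
Q(s_t, a_t) ← Q(s_t, a_t) + (1/N(s_t, a_t))·( G_t − Q(s_t, a_t) )

which is exactly the incremental mean of all returns observed so far for that pair.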

As an example, after some training, our action-value estimates for the start state S = 0 would hold one estimated return for each of the four possible actions.

So now we know how to update the action-value function and how to use it in combination with our policy to maximize the rewards. The whole procedure can be summarized as follows.
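The lab's monte_carlo function is not reproduced in this blog, but a minimal sketch of an every-visit Monte Carlo control loop, under the assumption of the older Gym API used earlier (reset() returning a state and step() returning four values), could look like this:

import numpy as np
from collections import defaultdict

def monte_carlo(env, N_episodes=1000, epsilon=0.2, discount_factor=0.8):
    """Every-visit Monte Carlo control with an epsilon-greedy policy (illustrative sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value estimates Q(s, a)
    N = defaultdict(lambda: np.zeros(n_actions))   # visit counts N(s, a)

    for _ in range(N_episodes):
        # Generate one episode by following the epsilon-greedy policy w.r.t. the current Q
        episode = []
        state = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = env.action_space.sample()     # explore
            else:
                action = int(np.argmax(Q[state]))      # exploit
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the discounted return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + discount_factor * G
            # Incremental-mean update of Q(s, a) towards the observed return
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]

    # Greedy policy derived from the learned Q values
    policy = {s: int(np.argmax(q)) for s, q in Q.items()}
    return policy, Q

The sketch returns the greedy policy derived from the learned Q values together with the Q table itself, matching the way the function is called below.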

Note that the Monte Carlo algorithm is a type of model-free reinforcement learning: the agent does not use predictions of the environment's response, i.e. it is not trying to build a statistical model of the environment.

Let's see the results of using the Monte Carlo method. We will train it for 1000 episodes with an epsilon of 0.01 and a discount factor of 0.8. The training process should be rather fast.

policy, Q = monte_carlo(env, N_episodes=1000, epsilon=0.01, discount_factor=0.8)

Plotting the policy, we can see that our agent has indeed found an optimal, well-defined path.

We can also try applying the same method to a bigger and more complicated environment.
Let's try working with a bigger Frozen Lake environment. Before, we were working with a 4x4 grid; let's change it to 8x8. Note that while before we had 4x4x4 = 64 possible state-action pairs (counting holes and goal tiles), now we will have 8x8x4 = 256 possible state-action pairs.

Let’s try training our model in this bigger environment. First import the environment itself:

# import the bigger frozen lake gym environment
env = gym.make('FrozenLake8x8-v1', is_slippery=False)
# warning: setting is_slippery=True results in very complex environment dynamics

Let’s check how it looks, to visually understand the complexity of this environment.

env.reset()  # reset the environment before rendering
env.render()
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG

Let's train our agent with the same algorithm as last time, this time with more episodes since the environment is much bigger. It will help us understand both the efficiency and the scalability of a simple Monte Carlo algorithm.

policy, Q = monte_carlo(env, N_episodes=10000, epsilon=0.2, discount_factor=0.9)

Let’s once again visualize the path that we will take using the resulting policy:

Policy path for the bigger environment

Great results once again, but there are some limitations.

Note that a simple Monte Carlo algorithm may not be efficient enough: with only 1000 episodes, as we used on the smaller grid, our agent does not reach the single reward-giving tile a sufficient number of times. In this environment, the G tile is also the tile furthest from the initial S tile. This clearly demonstrates the sampling inefficiency of a simple Monte Carlo algorithm. Think about how you could improve the algorithm to avoid this inefficiency.

Learn More:

If you want to know the answer to that question and learn some simple tricks that may help you improve your algorithm, or look at the implementation of the algorithm itself, then click on Monte Carlo Reinforcement Learning for Simple Games or explore other FREE courses and projects about data science and machine learning on Cognitive Class.


Artem Arutyunov
The Power of AI

Hey, Artem here, I love helping people to learn, and learn myself. IBM Data Science Intern + Studying Math and Stats at University of Toronto.