Reinforcement Learning: Part 2: Markov Decision Process

Mehul Jain
4 min read · Jun 7, 2023


In the previous blog, we learned how the Agent takes actions that are independent of any situation.

Now we will discuss how an agent can change its action depending on the situation, or state. The agent’s interaction with the environment creates a sequence, or trajectory, that begins like this: S0, A0, R1, S1, A1, R2, S2, A2, R3, … (State, Action, Reward)

The agent starts in state S0 at t = 0, then performs action A0; according to the environment’s dynamics, it receives reward R1 and reaches state S1.
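To make this loop concrete, here is a minimal sketch of how such a trajectory is generated. The `env` and `agent` objects and their `reset`/`act`/`step` methods are illustrative assumptions, not the API of any particular library.

```python
# Minimal sketch of the agent-environment loop that produces the
# trajectory S0, A0, R1, S1, A1, R2, ...
# `env` and `agent` are hypothetical objects with illustrative method names.

def rollout(env, agent, max_steps=100):
    trajectory = []
    state = env.reset()                              # S0
    for t in range(max_steps):
        action = agent.act(state)                    # A_t, chosen from the current state
        next_state, reward, done = env.step(action)  # environment returns R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward))   # record (S_t, A_t, R_{t+1})
        state = next_state
        if done:                                     # stop if a terminal state is reached
            break
    return trajectory
```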

A Markov Decision Process (MDP) provides a formal way to represent an agent’s interactions with an environment, where the agent takes actions in different states to maximize cumulative rewards.

MDPs formalize sequential decision-making, where actions influence not just immediate rewards but also subsequent states and, through them, future rewards.

The dynamics of an MDP can be represented by a transition model, which specifies the transition probabilities for each state-action pair, and a reward model, which assigns rewards to state-action pairs.
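As a rough sketch, both models are often stored in a single table keyed by state–action pairs. The tiny two-state MDP below is made up purely for illustration.

```python
# A toy MDP with two states and two actions. Each (state, action) pair maps
# to a list of (probability, next_state, reward) tuples, capturing both the
# transition model and the reward model in one table.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],                    # deterministic transition
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # stochastic transition
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go":   [(1.0, "s0", 0.0)],
    },
}

def expected_reward(state, action):
    """Expected immediate reward r(s, a) under the model above."""
    return sum(prob * reward for prob, _, reward in P[state][action])

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```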

Let’s consider a classic example known as the “Frozen Lake” problem. In this problem, the agent navigates a frozen lake represented by a gridworld. The agent’s goal is to reach a specific location (the goal state) while avoiding the holes in the ice (which are also terminal states).

A Markov Decision Process models the problem with five main components:

· States: Each state represents the agent’s location on the gridworld. For example, in a 4x4 grid, there are 16 possible states, where each cell represents a state.

· Actions: At each state, the agent can take four actions: move up, down, left, or right. These actions determine the agent’s movement on the gridworld.

· Transitions: The transitions define the dynamics of the environment, i.e., how likely the agent is to end up in each next state given its current state and action. Transitions can be either stochastic or deterministic. We will discuss this in more detail in the next blog.

· Rewards: The agent receives immediate rewards based on its actions and the resulting state. For example, reaching the goal state gives a positive reward (+1), falling into a hole gives a negative reward (-1), and all other actions receive a small negative reward (e.g., -0.01) to encourage the agent to reach the goal quickly.

· Policy: The policy determines the agent’s decision-making strategy. In this example, the policy can be a mapping of states to actions, indicating the agent’s preferred action at each state. The goal is to find the optimal policy that maximizes the cumulative rewards.

Through interactions with the environment, the agent learns from the outcomes and adjusts its policy to improve its decision-making over time. The objective is to find a policy that maximizes the cumulative reward, guiding the agent to successfully navigate the frozen lake and reach the goal state while avoiding the holes.
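Putting the five components together, here is a rough sketch of a Frozen Lake-style gridworld with deterministic transitions. The hole positions follow a common 4x4 layout, and the reward values match the numbers above; none of this is the specification of any official environment.

```python
# Illustrative 4x4 Frozen Lake-style MDP: states, actions, transitions,
# rewards, and a (fixed) policy. Layout and values are assumptions.

GRID_SIZE = 4
states = list(range(GRID_SIZE * GRID_SIZE))   # 16 states, one per cell
actions = ["up", "down", "left", "right"]     # 4 actions in every state
goal_state = 15                               # bottom-right corner
holes = {5, 7, 11, 12}                        # hole cells (assumed layout)

def step(state, action):
    """Deterministic transition: move one cell, staying inside the grid."""
    row, col = divmod(state, GRID_SIZE)
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, GRID_SIZE - 1)
    elif action == "left":
        col = max(col - 1, 0)
    elif action == "right":
        col = min(col + 1, GRID_SIZE - 1)
    next_state = row * GRID_SIZE + col

    # Rewards as described above: +1 for the goal, -1 for a hole,
    # and a small -0.01 penalty for every other move.
    if next_state == goal_state:
        return next_state, 1.0, True
    if next_state in holes:
        return next_state, -1.0, True
    return next_state, -0.01, False

# A policy is simply a mapping from states to actions, e.g. "always go right".
policy = {s: "right" for s in states}
```

Learning then amounts to replacing this fixed mapping with one that maximizes the cumulative reward.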

Let’s consider another example. The agent has to reach a destination that gives a reward of +100. There are two paths to the destination. If the agent takes path A, it receives an immediate reward of +5 and then reaches the destination. If the agent takes path B, it receives a reward of +50 later along the way. The total reward for path A is therefore 105, whereas it is 150 for path B.

If the agent is short-sighted, meaning it only considers the reward in the next state, then it will take path A.

But if the agent becomes far-sighted, it will take path B and collect the maximum total reward. This can be achieved with the return function, shown below.
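Following the standard formulation in Sutton and Barto, the discounted return from time step t is:

```
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```

As a quick numeric sketch of how γ controls short- versus far-sightedness, the reward timings below are assumptions: path A pays +5 immediately and +100 at the destination, while path B takes longer but pays +50 on the way and +100 at the end.

```python
# Discounted return G = sum_k gamma^k * R_{k+1} for two hypothetical paths.
# Reward timings are assumptions made for illustration only.

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

path_a = [5, 100]         # short path: small reward arrives immediately
path_b = [0, 0, 50, 100]  # longer path: larger rewards arrive later

for gamma in (0.5, 0.99):
    print(f"gamma={gamma}: "
          f"A={discounted_return(path_a, gamma):.2f}, "
          f"B={discounted_return(path_b, gamma):.2f}")

# With gamma=0.5 (short-sighted) path A scores higher;
# with gamma=0.99 (far-sighted) path B yields the larger return.
```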

Gamma (γ) is a parameter between 0 and 1 called the discount rate. The discounted return is a geometric series, which reduces to the simple expression below when every reward R equals 1.
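Using the geometric-series identity, with every reward equal to 1 the return becomes:

```
G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}
```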

If the task is episodic, meaning the agent–environment interaction naturally breaks down into a sequence of separate episodes that end in a terminal state at some final time step T, then the return is the finite sum below.
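In that case the sum simply runs up to the final time step T at which the episode terminates:

```
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T
```

(Discounting can still be applied inside an episode, but with a finite horizon it is not needed to keep the sum finite.)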

In the next blog, we will discuss how we can use the return function to evaluate which state or action is best in the given environment.

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.

Reference:

Reinforcement Learning: An Introduction — Richard S. Sutton and Andrew G. Barto
