Mastering Decision-Making in AI: An Introduction to Markov Decision Processes (MDPs)

Chimauche Njoku
4 min read · Apr 21, 2024


Introduction

Markov Decision Processes (MDPs) play a crucial role in the field of utility-based agent AI. MDPs provide a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of a decision-maker. This model is particularly effective in scenarios where an AI agent must make a sequence of decisions toward a goal, with the outcome of each decision influenced both by the agent’s actions and by external, random factors.

Meaning and Fundamentals of MDPs

An MDP is defined by:

  1. States (S): A set of all possible states in which the agent can find itself.
  2. Actions (A): A set of all possible actions the agent can take.
  3. Transition function (T): A function T(s, a, s′) giving the probability of moving from state s to state s′ when taking action a.
  4. Reward function (R): A function that gives the immediate reward received after transitioning from one state to another via an action.
  5. Discount factor (γ): A factor used to decrease the value of future rewards, reflecting the principle of time preference.

The goal in an MDP is to find a policy (a function from states to actions) that maximizes the expected utility of the agent, which often means maximizing the sum of discounted rewards over time.
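
To make these components concrete, here is a minimal sketch of how a tiny MDP could be written down in Python. The two-state "maintenance" problem, its action names, probabilities, and rewards are invented purely for illustration, and the reward is simplified to R(s, a) rather than the more general R(s, a, s′).

import_note = None  # plain dicts only; no imports needed for this sketch

# A hypothetical two-state maintenance MDP, written out as plain Python dicts.
states = ["healthy", "broken"]
actions = ["maintain", "ignore"]

# T[s][a][s_next] = P(s_next | s, a); each inner dict sums to 1.
T = {
    "healthy": {
        "maintain": {"healthy": 0.95, "broken": 0.05},
        "ignore":   {"healthy": 0.70, "broken": 0.30},
    },
    "broken": {
        "maintain": {"healthy": 0.60, "broken": 0.40},
        "ignore":   {"healthy": 0.00, "broken": 1.00},
    },
}

# R[s][a] = expected immediate reward for taking action a in state s
# (a simplification of the more general R(s, a, s')).
R = {
    "healthy": {"maintain": 8.0, "ignore": 10.0},
    "broken":  {"maintain": -5.0, "ignore": -10.0},
}

gamma = 0.9  # rewards k steps in the future are weighted by gamma ** k

A policy is then just a mapping from states to actions, for example {"healthy": "ignore", "broken": "maintain"}, and the methods below search for the policy whose expected discounted reward is highest.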

Methods and Models in Utility-Based AI Using MDPs

1. Value Iteration

This method iteratively updates the value of each state under the assumption that the agent always chooses the action that maximizes expected utility. The algorithm continues until the values converge, and an optimal policy can then be derived from the converged values.

Here’s an example of how you could implement the value iteration algorithm for a simple MDP environment. The code below works with a small, abstract set of states, but the same structure applies to a grid world where an agent has to find the shortest path to a goal while avoiding obstacles.

import numpy as np

def value_iteration(states, actions, transition, rewards, gamma, theta=0.01):
    V = np.zeros(len(states))                  # value estimate for each state
    policy = np.zeros(len(states), dtype=int)  # greedy action for each state
    while True:
        delta = 0
        for s in range(len(states)):
            v = V[s]
            # Bellman optimality update: take the best expected value over actions.
            V[s] = max([sum([transition[s, a, s_prime] * (rewards[s, a, s_prime] + gamma * V[s_prime])
                             for s_prime in range(len(states))]) for a in range(len(actions))])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # stop once the largest update falls below the threshold
            break
    # Extract the greedy policy from the converged value function.
    for s in range(len(states)):
        policy[s] = np.argmax([sum([transition[s, a, s_prime] * (rewards[s, a, s_prime] + gamma * V[s_prime])
                                    for s_prime in range(len(states))]) for a in range(len(actions))])
    return policy, V

# Example usage
states = np.arange(4) # Assume 4 states for simplicity
actions = np.arange(2) # Assume 2 possible actions
transition = np.zeros((4, 2, 4)) # Transition probabilities
rewards = np.zeros((4, 2, 4)) # Rewards
gamma = 0.99 # Discount factor

# Define your transition and rewards here as needed
# Example values for demonstration purposes
transition[0, 0, 1] = 1.0
rewards[0, 0, 1] = 1.0

policy, V = value_iteration(states, actions, transition, rewards, gamma)
print("Policy:", policy)
print("Value Function:", V)

2. Policy Iteration

Policy iteration starts with a random policy and iteratively improves it. For each policy, it calculates the utility of each state by solving a system of linear equations. Then, it updates the policy by choosing actions that maximize the utility. This process repeats until the policy is stable.
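
As a rough sketch of how this could be implemented (not the only way to do it), the following reuses the transition and rewards arrays from the value iteration example above; the evaluation step solves the linear system V = R_π + γ·P_π·V exactly with NumPy.

import numpy as np

def policy_iteration(states, actions, transition, rewards, gamma):
    n_s, n_a = len(states), len(actions)
    policy = np.zeros(n_s, dtype=int)  # start from an arbitrary policy (all action 0)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = np.array([transition[s, policy[s]] for s in range(n_s)])
        R_pi = np.array([np.sum(transition[s, policy[s]] * rewards[s, policy[s]])
                         for s in range(n_s)])
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated V.
        new_policy = np.array([
            np.argmax([np.sum(transition[s, a] * (rewards[s, a] + gamma * V))
                       for a in range(n_a)])
            for s in range(n_s)
        ])
        if np.array_equal(new_policy, policy):  # policy is stable -> optimal
            return policy, V
        policy = new_policy

# Example usage with the arrays defined in the value iteration example
policy_pi, V_pi = policy_iteration(states, actions, transition, rewards, gamma)

Because each evaluation step is exact, policy iteration typically converges in fewer iterations than value iteration, at the cost of solving a linear system on every pass.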

3. Q-Learning

Q-learning is a model-free reinforcement learning technique. It learns the value of taking an action in a given state directly, without requiring a model of the environment, and updates the Q-values (state–action values) using the rule:

Q(s, a) ← Q(s, a) + α [ r + γ maxₐ′ Q(s′, a′) − Q(s, a) ]

where r is the reward received after taking action a in state s and moving to state s′, and α is the learning rate.

Here’s how you could implement a basic Q-learning algorithm for a hypothetical learning scenario:

import numpy as np
import random

def q_learning(episodes, alpha, gamma, states, actions, transition_rewards):
    Q = np.zeros((len(states), len(actions)))
    terminal_state = len(states) - 1  # assume the last state is terminal
    for episode in range(episodes):
        s = random.choice(states)
        while s != terminal_state:
            a = random.choice(actions)  # Simplified policy: random action
            s_prime, reward = transition_rewards(s, a)
            # Move Q(s, a) toward the reward plus the best estimated value of the next state.
            Q[s, a] += alpha * (reward + gamma * np.max(Q[s_prime]) - Q[s, a])
            s = s_prime
    return Q

# Define transition_rewards function based on your environment
def transition_rewards(s, a):
    # Assume some transitions and rewards
    if a == 0 and s < len(states) - 1:
        return s + 1, 1  # Move to the next state with a reward of 1
    else:
        return s, 0  # Stay in the same state with no reward

# Example usage
states = np.arange(10) # 10 states
actions = np.arange(2) # 2 actions
alpha = 0.1 # Learning rate
gamma = 0.99 # Discount factor
episodes = 1000

Q = q_learning(episodes, alpha, gamma, states, actions, transition_rewards)
print("Q-Table:", Q)

4. Monte Carlo Methods

These methods estimate state values by averaging the returns (total discounted rewards) observed over many complete episodes. Unlike dynamic programming methods, Monte Carlo methods do not require a complete model of the environment and can learn from sampled episodes alone.
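
As a minimal sketch, first-visit Monte Carlo prediction for a fixed policy could look like the following. Here run_episode is a hypothetical function you would supply for your environment: it plays one full episode under the policy being evaluated and returns a list of (state, reward) pairs, where each reward is the one received after leaving that state.

import numpy as np
from collections import defaultdict

def mc_first_visit_prediction(run_episode, gamma, num_episodes):
    returns = defaultdict(list)  # all first-visit returns observed per state
    V = defaultdict(float)       # running value estimates
    for _ in range(num_episodes):
        episode = run_episode()  # list of (state, reward) pairs for one episode
        # Compute the discounted return G_t at every step, walking backwards.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()
        # Record the return only at the first visit to each state in the episode.
        seen = set()
        for state, G in step_returns:
            if state not in seen:
                seen.add(state)
                returns[state].append(G)
                V[state] = float(np.mean(returns[state]))
    return dict(V)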

5. Temporal Difference (TD) Learning

Temporal difference methods, like SARSA and TD(λ), learn directly from raw experience without a model of the environment’s dynamics. TD learning combines Monte Carlo ideas (learning from sampled experience) with dynamic programming ideas (bootstrapping from current value estimates), updating estimates after every step rather than waiting for an episode to finish.
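
For example, SARSA (an on-policy TD method) updates Q(s, a) toward r + γ Q(s′, a′), where a′ is the action the current policy actually takes next. Here is a minimal sketch, assuming the same states, actions, and transition_rewards helper as the Q-learning example above, with an ε-greedy behavior policy added here purely for illustration:

import numpy as np
import random

def epsilon_greedy(Q, s, actions, epsilon):
    # Explore with probability epsilon, otherwise exploit the current Q-values.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return int(np.argmax(Q[s]))

def sarsa(episodes, alpha, gamma, epsilon, states, actions, transition_rewards):
    Q = np.zeros((len(states), len(actions)))
    terminal_state = len(states) - 1  # assume the last state is terminal, as above
    for _ in range(episodes):
        s = random.choice(list(states))
        a = epsilon_greedy(Q, s, actions, epsilon)
        while s != terminal_state:
            s_prime, reward = transition_rewards(s, a)
            a_prime = epsilon_greedy(Q, s_prime, actions, epsilon)
            # On-policy TD update: bootstrap from the action actually taken next.
            Q[s, a] += alpha * (reward + gamma * Q[s_prime, a_prime] - Q[s, a])
            s, a = s_prime, a_prime
    return Q

# Example usage with the earlier 10-state setup
Q_sarsa = sarsa(1000, 0.1, 0.99, 0.1, states, actions, transition_rewards)

The only change from Q-learning is the bootstrap target: SARSA uses the value of the action the policy actually selects, rather than the maximum over all actions.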

Models of Utility in MDPs

Utility-based AI often considers models of utility that can differ significantly depending on the specific goals and constraints of the agent. These include:

  • Linear models: Straightforward and simple, where the utility is just the cumulative (discounted) reward.
  • Non-linear models: These might include considerations such as diminishing returns on accumulated rewards, or different weights given to certain types of rewards or penalties, as in the sketch below.
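
To illustrate the difference, here is a small sketch comparing a linear utility (the raw discounted return) with one hypothetical non-linear alternative that applies diminishing returns through a logarithm; the specific functional form is only an example, not a standard model.

import numpy as np

def discounted_return(rewards, gamma):
    # Linear utility: the plain sum of discounted rewards.
    return sum(r * gamma ** t for t, r in enumerate(rewards))

def diminishing_returns_utility(rewards, gamma):
    # Hypothetical non-linear utility: log(1 + x) grows more slowly as the
    # accumulated return increases, so each extra unit of reward counts less.
    return np.log1p(max(discounted_return(rewards, gamma), 0.0))

rewards = [1.0, 1.0, 1.0, 10.0]
print(discounted_return(rewards, 0.9))            # linear: 10.0
print(diminishing_returns_utility(rewards, 0.9))  # non-linear: ~2.4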

MDPs provide a robust framework for handling decision-making under uncertainty, which makes them widely applicable across AI domains such as robotics, automated control systems, and economic modeling.

Chimauche Njoku

Senior Fullstack Engineer | Machine Learning | Data Analyst