Reinforcement Learning Chapter 3: Finite Markov Decision Processes
In the previous chapter, we learned about the Multi-Armed Bandits problem, which served as a simplified version of the full RL problem. In this chapter, we will learn about the full RL problem and how it is formalized as a finite Markov decision process.
Markov Decision Processes
Markov Decision Processes (MDPs) are just a mathematical way to formalize a sequential decision-making problem.
MDPs have actions and rewards that we are already familiar with from the Multi-Armed Bandits problem, and the associative aspect discussed in the last article. There is also one more important addition: the concept of states.
- In MDPs, actions that the agent takes do not just determine reward but also influence the next state the agent enters.
- In the bandit problem, the goal was to learn q*(a), the value of each action. In an MDP, the goal is to learn either q*(s, a), the value of taking each action from a given state, or v*(s), the value of each state given optimal action selection thereafter.
Basically, an MDP is just an abstraction of the problem of goal-directed learning from interaction. It states that any problem of learning goal-directed behavior can be reduced to three signals that get passed between the agent and the environment:
- One to represent the choices made by the agent (actions)
- One to represent the basis on which the choices are made (states)
- One to define the agent’s goal (rewards)
The reason that MDPs are used to represent problems in RL is that they are so flexible. For example,
- The timesteps don’t have to correspond to real-time; they can be arbitrary
- Actions can be low-level (e.g. controlling the velocity of a robot's motors) or high-level (e.g. choosing the general direction of movement for a robot arm), so MDPs can be applied to a wide variety of tasks
- States can be anything that might be useful in helping the agent make decisions
When representing a problem as an MDP, how the states, actions, and rewards are chosen can have a big impact on performance. Choosing them wisely is more of an art than a science.
Agent-Environment Interface
An MDP can be visually represented through the agent-environment interface; a minimal version of this interaction loop is sketched in code after the list below.
- Agent takes an action at timestep t in the environment
- It then receives a reward and new state at the next timestep
- The set of all states, actions, and rewards can be represented as S, A, and R
- A trajectory is a sequential list of states, actions, and rewards (e.g. S₀, A₀, R₁, S₁, A₁, R₂, …)
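To make the interface concrete, here is a minimal sketch of the interaction loop in Python. The ToyEnv class, its two states and actions, and the 0.8 transition probability are all invented for illustration; they are not part of the chapter.

```python
import random

# A tiny invented environment with states {0, 1} and actions {0, 1}.
# step(action) returns the reward and the next state, mirroring how
# (S_t, A_t) leads to (R_{t+1}, S_{t+1}).
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Stochastic transition: the action usually, but not always,
        # determines the next state.
        next_state = action if random.random() < 0.8 else 1 - action
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return reward, next_state

def random_policy(state):
    # Pick one of the two actions uniformly at random.
    return random.choice([0, 1])

env = ToyEnv()
state = env.state
trajectory = [state]  # will hold S0, A0, R1, S1, A1, R2, ...
for t in range(3):
    action = random_policy(state)
    reward, state = env.step(action)
    trajectory += [action, reward, state]

print(trajectory)  # e.g. [0, 1, 1.0, 1, 0, 0.0, 0, 1, 1.0, 1]
```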
Environment Dynamics
The dynamics of an environment (also known as the transition probabilities) determine the likelihood of ending up in some new state and receiving some reward if an agent were to take some action from the current state (a small worked example follows this list).
- The dynamics are written p(s′, r | s, a) = Pr{Sₜ = s′, Rₜ = r | Sₜ₋₁ = s, Aₜ₋₁ = a}. This can be read as: p is the probability of ending up in some new state s′ and receiving reward r, given that the current state is s and the action taken is a.
- In deterministic environments, taking the same action from the same state will always put you in the same new state with the corresponding reward.
- In stochastic environments, you could take the same action from the same state multiple times and end up in a different subsequent state each time.
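As a concrete (and entirely made-up) example of the dynamics function, the table below assigns a probability to every (next state, reward) pair for each (state, action) pair of a tiny two-state MDP. A deterministic environment puts probability 1 on a single outcome; a stochastic one spreads probability across several.

```python
import random

# p[(s, a)] maps each (next_state, reward) outcome to its probability,
# i.e. p(s', r | s, a). The probabilities for a fixed (s, a) sum to 1.
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},                    # deterministic
    ("s0", "right"): {("s1", 1.0): 0.7, ("s0", 0.0): 0.3},  # stochastic
    ("s1", "left"):  {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s1", "right"): {("s1", 1.0): 1.0},
}

def sample_transition(state, action):
    # Sample one (next_state, reward) outcome according to p(s', r | s, a).
    outcomes = p[(state, action)]
    pairs, probs = zip(*outcomes.items())
    return random.choices(pairs, weights=probs)[0]

print(sample_transition("s0", "right"))  # usually ('s1', 1.0), sometimes ('s0', 0.0)
```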
Rewards and the Reward Hypothesis
The reward hypothesis states that all “goals” can simply be thought of as the maximization of the expected value of the cumulative sum of reward.
- This use of a reward signal is a distinctive feature of RL as opposed to other learning methods.
When choosing what the reward should be, we have to be careful not to encode knowledge about how we want the agent to achieve the goal, but only what we want it to achieve.
- For example, in chess you could give rewards to an agent for taking pieces as a means of winning the game.
- The problem with this is that an agent could learn to simply take pieces at the expense of winning the game.
- Instead, a reward should be given for the actual goal which is winning.
Returns and Episodes
We now understand that the agent’s goal in an MDP is simply to maximize the expected return, i.e. the cumulative reward.
The simplest definition of the return is just the sum of all the rewards that follow: Gₜ = Rₜ₊₁ + Rₜ₊₂ + Rₜ₊₃ + … + R_T.
- This formula works for tasks that are episodic — meaning they have a clear end represented by the final timestep T after which the environment resets.
The problem with this formula is that it does not work for tasks that are nonepisodic or continuing, i.e. tasks with no clear end.
- In these tasks the final timestep T = ∞ so what we are trying to maximize could be infinite.
For continuing tasks, we instead define the return using discounting, so the agent maximizes the sum of discounted rewards: Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …, i.e. the sum of γᵏRₜ₊ₖ₊₁ over all k ≥ 0 (see the short sketch after this list).
- γ is a parameter called the discount rate whose value is between 0 and 1
- γ causes the previously infinite sum to have a finite value (as long as γ < 1 and the rewards are bounded)
- The discount rate determines the present value of future rewards. As γ approaches 1, the agent weights future rewards more heavily; as γ approaches 0, it becomes more short-sighted.
- If γ = 0 then the agent only cares about maximizing immediate rewards.
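As a quick numerical illustration (the reward values are chosen arbitrarily), the sketch below computes the discounted return Gₜ for a short reward sequence and shows how γ = 0 reduces it to the immediate reward.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 2.0, 1.0]           # R_{t+1}, R_{t+2}, ...
print(discounted_return(rewards, 0.9))   # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
print(discounted_return(rewards, 0.0))   # only the immediate reward: 1.0
```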
Policies and Value Functions
To solve an MDP, an agent uses policies and value functions.
A policy is a mapping from states to the probability of selecting each possible action from those states.
- π(a|s) is the probability that Aₜ = a given that Sₜ = s or the likelihood of selecting action a from state s under the policy π.
Value functions estimate how good it is to be in a particular state, or to take a particular action from a given state, in an MDP.
- The state-value function vπ(s) = Eπ[Gₜ | Sₜ = s] is defined with respect to the policy π, and it satisfies a recursive relationship known as the Bellman equation for vπ: vπ(s) = Σₐ π(a|s) Σₛ′,ᵣ p(s′, r | s, a)[r + γvπ(s′)].
- It can be read as: the value of any state equals the reward expected along the way plus the discounted value of the possible next states, all weighted by the probability of those transitions occurring.
- Similar to the state-value function, we can also use an action-value function qπ(s, a) = Eπ[Gₜ | Sₜ = s, Aₜ = a], which returns the value of taking a particular action from a given state and then following π (a small numerical sketch of these ideas follows this list).
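To see the Bellman equation for vπ at work, the sketch below runs iterative policy evaluation for an equiprobable-random policy on the same invented two-state MDP used earlier: the equation is applied repeatedly as an update rule until the values stop changing. The MDP, the policy, and γ = 0.9 are all assumptions made for illustration.

```python
# Iterative policy evaluation on a tiny, invented two-state MDP.
# p[(s, a)] gives p(s', r | s, a) as a table, as in the dynamics example.
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 0.7, ("s0", 0.0): 0.3},
    ("s1", "left"):  {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s1", "right"): {("s1", 1.0): 1.0},
}
states, actions, gamma = ["s0", "s1"], ["left", "right"], 0.9
policy = {s: {"left": 0.5, "right": 0.5} for s in states}  # pi(a|s), equiprobable

v = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        # Bellman equation for v_pi used as an update:
        # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
        new_v = sum(
            policy[s][a] * prob * (r + gamma * v[s2])
            for a in actions
            for (s2, r), prob in p[(s, a)].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)  # the value of each state under the random policy
```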
Optimal Policies and Value Functions
For any MDP there is at least one optimal policy (possibly more than one) that defines the best behavior. The way the agent achieves its goal of maximizing expected return is to learn an optimal policy. All optimal policies share two things:
The optimal state-value function v*
- v*(s) = maxₐ Σₛ′,ᵣ p(s′, r | s, a)[r + γv*(s′)]. This is the Bellman optimality equation for v*.
The optimal action-value function q*
- q*(s, a) = Σₛ′,ᵣ p(s′, r | s, a)[r + γ maxₐ′ q*(s′, a′)]. This is the Bellman optimality equation for q*.
The equations may look complex but the intuition behind them is actually quite simple. They express the fact that the value of a state or a state-action pair under an optimal policy must equal the expected return of the best action from that state.
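One way to see the Bellman optimality equation in action is to use it as an update rule, which is what value iteration does (previewed below, and covered properly in the next chapter). The sketch reuses the same invented two-state MDP; reading the greedy action off q*(s, a) then gives an optimal policy.

```python
# Using the Bellman optimality equation for v* as an update rule
# (a preview of value iteration). Same invented two-state MDP as before.
p = {
    ("s0", "left"):  {("s0", 0.0): 1.0},
    ("s0", "right"): {("s1", 1.0): 0.7, ("s0", 0.0): 0.3},
    ("s1", "left"):  {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s1", "right"): {("s1", 1.0): 1.0},
}
states, actions, gamma = ["s0", "s1"], ["left", "right"], 0.9

def q_from_v(v, s, a):
    # q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
    return sum(prob * (r + gamma * v[s2]) for (s2, r), prob in p[(s, a)].items())

v_star = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        # v*(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v*(s')]
        new_v = max(q_from_v(v_star, s, a) for a in actions)
        delta = max(delta, abs(new_v - v_star[s]))
        v_star[s] = new_v
    if delta < 1e-8:
        break

# An optimal policy acts greedily with respect to q*(s, a).
optimal_policy = {s: max(actions, key=lambda a: q_from_v(v_star, s, a)) for s in states}
print(v_star, optimal_policy)
```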
If we can solve the Bellman optimality equations for v* and q*, then we can solve a given MDP, since the optimal policy follows directly from them. You could think of solving these equations as doing an exhaustive search across the MDP: considering every state you could end up in and every action you could take, and computing their values. As you can probably imagine, this works in theory but usually doesn’t work in practice, for three reasons:
- We would need to accurately know the dynamics of the environment (this is often unknown).
- We need enough computational and memory resources (a game like backgammon has about 10²⁰ states, far more than today’s computers could exhaustively evaluate).
- The problem must satisfy the Markov property, meaning the current state of the process contains all the information needed to predict future states.
Usually at least one of the above does not hold, so we cannot directly solve for v* and q*; instead, the agent learns to approximate them through experience.
Summary
In this chapter, we’ve learned about Markov Decision Processes, which are a formal way to define RL problems. We’ve learned how they can be represented visually through the agent-environment interface. We then learned about environment dynamics, the reward hypothesis, and how the goal of an MDP is defined as the maximization of the expected discounted return. Finally, we learned about policies and value functions and how solving or approximating the optimal value functions can be used to solve an MDP.
In the next chapter, we’ll visit our first set of solutions to MDPs — Dynamic Programming Solutions.