A Comprehensive Overview of Reinforcement Learning

--


This article aims to provide a comprehensive overview of reinforcement learning (RL).

Work in Progress

Section 1: Unraveling the Complexity of Reinforcement Learning in Machine Learning

The exploration of Reinforcement Learning (RL) within the machine learning spectrum reveals a sophisticated landscape where traditional learning paradigms undergo a significant transformation.

This complexity is further nuanced by the shift from static data representations to dynamic states and actions, where the objective is not merely to predict the correct outcomes but to maximize rewards through sequential decision-making.

Section 1.1: Reviewing Machine Learning

In the realm of machine learning, various paradigms are utilized to teach machines how to interpret data and make predictions or decisions. At the core of these paradigms, the symbols x and y are often used to delineate the process: x represents the input, while y symbolizes the ground-truth output we want our models to generate. The objective is to maximize the probability of the ground-truth output given the input, i.e., p(y|x).

Supervised learning is a paradigm that directly targets the optimization of p(y|x) by leveraging datasets where both x (input features) and y (target outcomes) are provided. The goal is to explicitly learn the mapping from x to y, thereby enabling the model to make predictions or decisions when presented with new, unseen instances of x.

Unsupervised learning comes into play when y is not provided. This paradigm focuses on discerning patterns, structures, or insights from data where the outcome, or y, is unknown. The learning process revolves around the analysis of x alone, without direct guidance on what the output should be.

Section 1.2: Fundamental Difference: Feedback-Driven Learning

To evaluate whether an output is good or bad, we need outputs according to a reward model R(y|x), i.e., we find the action y that maximizes R(y|x), rather than the output y that maximizes p(y|x) as in typical machine learning tasks. As defined in Sutton & Barto’s book (the canonical text in the field of RL), an RL agent aims to

learn how to map situations (states) to actions — so as to maximize a numerical reward signal.

Section 1.3: Fundamental Difference: Dynamic Learning

Firstly, “dynamic” comes from the requirement of making sequences of decisions: RL models have the job of generating a sequence of decisions. During this process, rather than a static mapping from an input to one or more ground-truth labels, both inputs and outputs change dynamically.

  • Dynamically changed x and y: The input is the perceived state from the environment. Obviously, within one example/trajectory, a sequence of different inputs is fed into the model, and correspondingly the outputs can differ.

Secondly, “dynamic” comes from the reward model being unknown during learning. Ideally, the agent knows the reward model and the transition model; then it only needs to adjust its actions (model output y) conditioned on a state (model input x) based on the feedback/reward it receives.
However, the reward model is often unknown. Hence, agents have to learn it through a sequence of trials and errors (i.e., experience) by interacting with an environment. This leads to:

  • Dynamically changed objective function: the objective is determined dynamically through the learning process.
  • Dynamically changed policy and y: the policy, and hence the best action for the same input, can change dynamically through the learning process.

Section 1.4: Fundamental Difference: Independent vs Dependent

Independent Assumption in Supervised/Unsupervised Learning Algorithms: This assumption exists in most supervised and unsupervised learning algorithms. In supervised learning, the independence assumption is crucial for the training process. It posits that each example in the dataset is independent of the others, meaning that the occurrence of one example does not influence the occurrence of another. This assumption simplifies model training by allowing the algorithm to treat each example as an isolated instance, which is essential for the effectiveness of statistical inference and the reliability of predictive performance. However, this assumption can be violated in real-world scenarios where data points are related, leading to challenges in model accuracy and generalization. For example, if a model assumes that each day’s stock price is independent of the previous day’s, it overlooks the critical temporal dependencies that often drive financial markets.

“Identically distributed” is another important assumption in common supervised/unsupervised learning algorithms. It ensures that the training data and future unseen data come from the same distribution, facilitating the model’s ability to generalize from the training set to the test set or to real-world scenarios. For example, a sentiment model trained on movie reviews might associate the word “avenger” with positive sentiment due to its positive connotation in that context, particularly for the widely appreciated “Avengers” movie series. When applied to a product review that uses “avenger” in a completely different context, e.g., “Total disappointment. This product turned me into an avenger, hunting for a refund. Save your money!”, the model may incorrectly predict the sentiment as positive due to its reliance on the learned association.

Dependent Examples in Sequential Decision Making in RL: In the realm of Reinforcement Learning (RL), the scenario is markedly different due to the sequential and dependent nature of decisions, as opposed to the independent examples typically found in supervised and unsupervised learning. The concept of dependency is pivotal: the outcome of a previous action (y) directly influences the subsequent state or observation (x) encountered by the agent. The first challenge this leads to is exploration vs. exploitation. This intertwined relationship necessitates a nuanced approach in RL, known as the balance between exploration and exploitation. Exploration involves trying new actions (a) to gauge their outcomes, essential for uncovering potentially superior strategies. Exploitation, on the other hand, leverages the best-known actions to maximize immediate rewards. The art of RL lies in effectively managing this balance, as it directly impacts learning efficiency and the agent’s ability to adapt to dynamic environments, thereby achieving optimal decision-making over time. The second challenge is the iterative calculation of the objective. The objective in supervised learning is simple: maximize p(y|x). In RL, however, this dependency naturally leads to an iterative process for calculating the objective, which is typically defined as the sum of future rewards. To make it solvable, the concept of value is devised: the value of a state s under a policy, V^π(s), is the expected return (cumulative discounted reward) starting from state s.

Thinking: Why do I use the term “expected return” rather than “expected reward”? The “return” in RL contexts refers to the sum of all rewards an agent receives, often discounted by a factor at each time step to account for the uncertainty or diminishing value of future rewards. This includes not just the immediate reward for the next action, but all subsequent rewards the agent will collect. The concept of return encompasses the entire sequence of rewards that follow a state or a state-action pair, reflecting the long-term consequences of actions.

Specifically, following a policy at the state s yields a random path. The utility (or return) of a policy is the (discounted) sum of the rewards on the path.

The return from timestep t is a random quantity, and a random variable cannot serve as an objective for optimizing the policy. Hence, we need the expectation over paths that start from the current state s and end at a terminal state or when a finite time horizon is reached.

(Optional) Section 1.5: An Example Using All Three Learning Paradigms — ChatGPT-like LLMs

Since I always boast about being a language intelligence researcher, let me show off my expertise. All of you know ChatGPT, a Large Language Model. However, what you may not know is that ChatGPT is trained using all three learning paradigms:

  • Unsupervised Learning: Self-supervised learning (SSL), a particular type of unsupervised learning, is used to let language models learn huge amounts of world knowledge. For detail, here is a very early but typical paper by Yoshua Bengio; I will release my own work later to demonstrate this more intuitively.
  • Supervised Learning: It is used to tune the naive ChatGPT to behave like a human. At this stage, the ChatGPT model is called a policy (a good time for you to refresh the terminology).
  • Reinforcement Learning: A reward model (again, a good time for you to refresh the terminology) is trained and then used to optimize the less naive ChatGPT.

Section 2: Refining Terminology

Now that we understand the unique characteristics of RL by comparing it with the other ML paradigms, let’s fix the notation and terminology for the RL setting.

  • a (not y) for action
  • s (not x) for the perceived state.
  • R(s, a) (not R(y|x)) is the expected reward for performing the action a in the state s.
  • Policy π: The model mapping from the state to the action is defined as a policy.

Actually, the output of a policy can be either a distribution over all possible actions (a stochastic policy) or a single deterministic action, e.g., one sampled from such a distribution.

  • Agent: An RL system that uses a policy model to perform actions is normally called an agent.
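
To make this terminology concrete, here is a minimal sketch of the agent-environment loop in Python. The `env` and `policy` objects are hypothetical placeholders (in the style of Gymnasium-like interfaces), not a specific library API.

```python
# Minimal sketch of an agent interacting with an environment.
# `env` and `policy` are hypothetical placeholders:
#   env.reset()  -> initial state s
#   env.step(a)  -> (next_state, reward, done)
#   policy(s)    -> action a (e.g., sampled from a distribution over actions)
def run_episode(env, policy):
    s = env.reset()                   # perceive the initial state s
    total_reward = 0.0
    done = False
    while not done:
        a = policy(s)                 # the policy π maps the state s to an action a
        s, r, done = env.step(a)      # the environment returns the next state and a reward
        total_reward += r             # accumulate the numerical reward signal
    return total_reward
```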

Section 3: Modeling Dynamic World

The model underlying RL is a decision process in which an action a taken in state s_t at time t results in a non-deterministic state s_{t+1} at time t+1, i.e., the environment is dynamic. Here is a simple running example.

For each round r = 1, 2, …
- You choose stay or quit.
- If quit, you get $10 and we end the game.
- If stay, you get $4 and then I roll a 6-sided dice.
- If the dice results in 1 or 2, we end the game.
- Otherwise, continue to the next round.

The reason we model this as an MDP problem is the non-deterministic resulting state when the action “stay” is performed in the state “in”, which is represented by the chance node (in, stay).

  • Action: In the toy example, the action space A = {stay, quit}.
  • States: In the toy example, the state space S = {in, end}.
  • Transition models: Commonly, the transition model can be represented by the distribution over possible next states and rewards, Pr(s′, r | s, a), for each state and each action.
    For the current state “in”: (the total number of transitions = the number of actions * the number of states)
    For the action “stay”, p(end|in, stay)=1/3, p(in|in, stay)=2/3;
    For the action “quit”, p(end|in, quit)=1, p(in|in, quit)=0.
    If we consider every state as the current state, the total number of transitions = the number of actions |A| × the number of states squared |S|².
  • Rewards: r(in, stay->end) = $4; r(in, stay->in) = $4; r(in, quit->end) = $10.
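
To keep the example concrete, here is a minimal sketch of this toy MDP as plain Python dictionaries; the layout (a dict keyed by state-action pairs) is my own illustrative choice, not a standard format.

```python
# Toy MDP for the stay/quit game.
# transitions[(state, action)] is a list of (next_state, probability, reward) triples.
ACTIONS = ["stay", "quit"]
STATES = ["in", "end"]

transitions = {
    ("in", "stay"): [("in", 2/3, 4.0),    # dice shows 3-6: stay in the game, earn $4
                     ("end", 1/3, 4.0)],  # dice shows 1-2: game ends, still earn $4
    ("in", "quit"): [("end", 1.0, 10.0)], # quitting pays $10 and ends the game
}
```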

Section 4: Objective and Policy Evaluation

As defined previously, a “policy” is a mapping from states of the environment to actions to be taken. Formally, it is a state-dependent distribution over the action space, π(a|s). The number of deterministic policies is |A| to the power of |S|, which is combinatorially large. If there are 11 states and 4 actions, it equals 4¹¹ (about 4.2 million).

Objective: Maximizing Utilities of Following A Policy π

The objective in supervised learning is simple: maximize p(y|x). The objective in RL is the cumulative reward of following a policy. Specifically, following a policy yields a random path. The utility (or return) of a policy is the (discounted) sum of the rewards on the path.

The return from timestep t, G_t, is a random quantity, and a random variable cannot be an objective for optimizing the policy. Hence, we need the expectation.
Note that a path starts from the current state s and ends at a terminal state or when it reaches a finite time horizon. For the toy example, different sampled paths yield different utilities.

Terminology: “Horizon” refers to the number of future steps or decisions a model considers when calculating values like rewards.

However, these utilities can give us an intuitive sense of how the policy performs if we simulate many of them and average them, which is the idea of the Monte Carlo method: the expected utility is estimated by sampling a set of trajectories and averaging their returns.
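
As a sketch, here is a Monte Carlo estimate of the utility of the “always stay” policy in the toy game, assuming no discounting; the function names are my own.

```python
import random

def sample_return_always_stay():
    """Sample one episode of the toy game under the 'always stay' policy
    and return its (undiscounted) utility."""
    total = 0.0
    while True:
        total += 4.0                   # staying pays $4 each round
        if random.randint(1, 6) <= 2:  # the dice shows 1 or 2: the game ends
            return total

# Monte Carlo estimate: average the returns of many sampled trajectories.
n = 100_000
estimate = sum(sample_return_always_stay() for _ in range(n)) / n
print(estimate)  # should be close to $4 per round * 3 expected rounds = $12
```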

Objective: Maximizing Value of Following A Policy π

The value refers to the expected utility (or return) received by following policy π.

Why do I use the term “expected return” rather than “expected reward”? As explained in Section 1.4, the return is the (often discounted) sum of all the rewards an agent receives following a state or state-action pair, not just the immediate reward for the next action, reflecting the long-term consequences of actions.

Each value is related to a particular state (hence the name state-value function). Now, what we are missing is the state: we can only follow a policy from a given state. Let’s rephrase in a more specific way by considering the state: the value is the expected return from each individual state while adhering to the policy. Why do I emphasize this? Because policy evaluation involves computing the value for each state under the current policy, not summing the values of all states.

  • The key point is that when evaluating a policy, you don’t add up these individual state values into one aggregate sum. Each state’s value is kept distinct because it serves a specific purpose in the context of policy evaluation and improvement.
  • For example, in algorithms like Policy Iteration or Value Iteration, these individual state values are used to inform decisions about which actions to take in each state (policy improvement) and to check for convergence to an optimal policy.
  • The calculation of V^π(s) for a particular state s takes into account the other states that may occur in the future under policy π. This expected return is not just about the immediate rewards but also includes all the future rewards the agent can expect to receive, appropriately discounted.

Objective: Defining Value Function for Maximization

Formally, the value function V^π(s) is defined as a mapping from a state s to the expected return. Intuitively, it answers the question: what is the expected return when starting in state s and following policy π?
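
For reference, this definition is commonly written as

V^π(s) = E_π[ G_t | S_t = s ] = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | S_t = s ],

where γ ∈ [0, 1] is the discount factor.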

Section 5: Using Model-based Approaches with Reward Model and Transition Model

Due to the probabilistic nature of state transitions, the value is calculated by aggregating (technically, via a weighted sum) the immediate rewards and the (possibly discounted) values of all one-timestep-ahead future states reached by taking an action according to the policy. Note that rewards may be aggregated with discounts, e.g., heavier discounts for the more distant future.

The latter part of the expression is a recursive decomposition of the value function of a policy in an MDP, which is known as the Bellman equation.
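
For a deterministic policy (one that always picks a single action π(s) in state s), the Bellman equation can be written as

V^π(s) = Σ_{s′} P(s′ | s, π(s)) [ R(s, π(s), s′) + γ V^π(s′) ],

i.e., the expected immediate reward plus the discounted value of the successor state, weighted by the transition probabilities.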

IMPORTANT: The Bellman equation is a fundamental concept in dynamic programming and reinforcement learning.

This recursive relationship provides a powerful tool for solving MDPs, especially when combined with algorithms like value iteration and policy iteration.

Generally, the Bellman equation is the backbone of algorithms like Q-learning, Value Iteration, and Policy Iteration

There is also a less common recursive form, which simplifies the immediate reward to be a direct consequence of being in state s alone (i.e., independent of the action taken).

Stochastic policy: We need two levels of expectation for a stochastic policy, where π(a′∣s′) gives the probability of taking action a′ when in state s′.
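
One standard way to write this with both expectations made explicit is

V^π(s) = Σ_a π(a | s) Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ],

where the outer sum is the expectation over the policy’s action distribution and the inner sum is the expectation over the transition distribution; the same expansion applies again at the next state s′ via π(a′ | s′).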

The sections below give a more detailed introduction to the model-based method, Dynamic Programming, and to model-free methods based on sampled experience.

Why Do These Expressions Make Sense? Why Is V(s) Always Stationary?

Markov Means “Memoryless”: The formulas and algorithms described here are often (not always) used for Markov Decision Processes (MDPs). They are “Markovian” because we assume that:

  1. State: The next state is probabilistically determined by only the current state and action.
  2. Value: No matter how we got to the current state, the value of the state is identical;
  3. Action: The optimal decision at any point depends only on the current state and not on the path taken to reach that state (how the state was reached);
  4. Hence, future optimal decisions (actions) are independent of past decisions (actions).

In summary, the sequential decision-making process is thus simplified to a series of independent decisions, each based solely on the current state. This is why the above recursive forms make sense, and algorithms like Value Iteration and Policy Iteration can systematically improve policies by considering the optimality of actions at each state independently of the actions that led to that state.

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)

Further Thought: What if it is not stationary?

Section 6: Model-based Approach: Dynamic Programming

Dynamic programming is used for policy evaluation via Bootstrapping and Recursion.

Policy Iteration and Value Iteration in the context of Dynamic Programming (DP) are indeed two parallel and distinct methods for finding the optimal policy in Markov Decision Processes (MDPs).

  • In DP, the solution to a larger problem (finding V(s) for all s) is built upon the solutions to smaller problems (finding V_{k−1}(s′) for all s′). This decomposition into subproblems is made possible by the model’s structure (the transition probabilities and rewards) and the recursive nature of the Bellman equations.
  • Model’s structure: In DP for MDPs, the model of the environment (specifically, the transition probabilities and reward structure) is known. This model tells us the probability of transitioning to various future states given the current state and action, as well as the rewards associated with these transitions.

Policy Iteration

Note: while value-based methods derive policies from value functions, policy-based methods optimize the policy directly, and actor-critic methods use a hybrid approach.

  • Under the assumption of an infinite horizon
    1. Stationary policy, i.e., the policy does not change over time: In policy iteration, a policy is evaluated and improved iteratively. During the evaluation phase, the policy is considered stationary; that is, it is assumed that the policy does not change while its value function is being calculated. Note that the Bellman equation is typically formulated under a stationary policy π. (I am not sure whether it can be applied to non-stationary policies.)
    2. Convergence of Value Function, i.e., V(s) becomes constant over time: When the value function under this stationary policy is computed (policy evaluation) over an infinite horizon, this value function converges to a constant as long as the policy remains unchanged. Specifically, the value function becomes stable and stops changing after sufficient iterations. Once the value function stabilizes, the policy is then improved based on the current value function, and the process repeats.

Policy Iteration involves two iterative steps: policy evaluation and policy improvement, iterated until convergence.
1. Policy Evaluation:

Thought: Why does the iterative update of the value function finally converge?

In this step, the value function for a given policy is calculated until it stabilizes. This involves solving a system of linear equations to find the expected returns from all states under the current policy.

The subscripts k and (k−1): Using just V(s) and V(s′) without specifying the iteration might imply that we are referring to a static value function or the final, converged value function. By using the subscripts, we explicitly acknowledge that we are working with the value function as estimated in a particular iteration, not some static or final value function. For example, k−1 denotes the iteration before k.

Bootstrapping refers to using the bootstrapped value estimate as a substitute for actually rolling out all future steps.

Specifically, the essence of DP is to estimate the expectation over all possible futures by using the values exactly one timestep ahead. In other words, what would the return be if I started from each possible state one timestep into the future? This is what bootstrapping refers to here.

  • We now have estimates of the values for all possible states. However, our goal is to estimate the expectation over all possible futures, so we must also know the distribution (probabilities) of reaching potential future states given the current state and action.
  • Now would be the best time to explain the word “Dynamic”. It actually refers to the dynamics of the environment given by the probability function above and the reward model that returns the immediate reward.

Note that the dynamic model (i.e., the probability distribution) and reward model above are the components of the Markov Decision Process Model.
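
Putting the pieces together, the iterative policy-evaluation update is commonly written as

V_k(s) = Σ_a π(a | s) Σ_{s′} p(s′ | s, a) [ r(s, a, s′) + γ V_{k−1}(s′) ],

where γ V_{k−1}(s′) is the bootstrapping term and p(s′ | s, a) and r(s, a, s′) are given by the MDP model.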

2. Policy Improvement: Once the value function is stable, the policy is greedily updated for each state s with respect to its value function V(s) under the policy improvement theorem. This means that for the state s, the policy chooses the action in s that leads to the highest expected return (immediate reward plus discounted value of the next state) according to the current value function.

3. Convergence to Optimal Policy: This process of evaluation and improvement is repeated until the policy no longer changes, at which point it is considered optimal.
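
Below is a minimal policy-iteration sketch for a tabular MDP stored in the dictionary format used for the toy game above; it is an illustrative implementation, not a library API.

```python
def policy_iteration(states, actions, transitions, gamma=0.9, theta=1e-8):
    """Tabular policy iteration. transitions[(s, a)] is a list of
    (next_state, probability, reward) triples; a missing key means a terminal state."""
    policy = {s: actions[0] for s in states}   # start from an arbitrary policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: sweep the Bellman expectation update until stable.
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(p * (r + gamma * V[s2])
                            for s2, p, r in transitions.get((s, policy[s]), []))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to the current V.
        stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2])
                        for s2, p, r in transitions.get((s, a), []))
                 for a in actions}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + 1e-12:  # switch only on strict improvement
                policy[s] = best
                stable = False
        # 3. The policy no longer changes: it is optimal for this model.
        if stable:
            return policy, V
```

Applied to the toy stay/quit MDP with gamma = 0.9, staying and quitting happen to be equally good (both are worth $10 from the state “in”).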

Value Iteration

  • Single Step Process: Value Iteration simplifies the process by combining policy evaluation and improvement into a single step.
  • Value Function Update: Instead of fully evaluating a policy, Value Iteration continuously updates the value function for each state. It does this by choosing the action at each state that maximizes the expected return, based on the current estimate of the value function.
  • Implicit Policy Improvement: The policy is implicitly improved in each iteration as the value function gets updated. The optimal policy can be easily derived from the converged value function by choosing the best action at each state.
  • Faster Convergence: Value Iteration often converges faster than Policy Iteration because it doesn’t require the value function to stabilize under a particular policy before improving it.
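
For reference, the value-iteration update applied to every state in each sweep is commonly written as

V_k(s) = max_a Σ_{s′} p(s′ | s, a) [ r(s, a, s′) + γ V_{k−1}(s′) ],

and once V has converged, the greedy policy is read off as π(s) = argmax_a Σ_{s′} p(s′ | s, a) [ r(s, a, s′) + γ V(s′) ].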

Simpler Methods

When the state and action spaces are small and the transition probabilities and reward structures are relatively simple, it might not be necessary to use complex algorithms like Value Iteration or Policy Iteration. In such cases, simpler methods might suffice.

Imagine a simple MDP with three states (S1, S2, S3) and two actions (A1, A2), where actions lead to different states with certain probabilities:

Transitions and Rewards:

  • From S1, taking A1 leads to S2 with a probability of 0.7 and to S3 with a probability of 0.3. The reward is +1 in either case.
  • From S2, taking any action leads back to S1 with a reward of 0.
  • From S3, taking any action leads back to S1 with a reward of -1.

Policy:

Consider a policy where:

  • In S1, the policy chooses A1.
  • In S2 and S3, the action choice doesn’t matter (as all actions lead to the same outcome).

Objective:

  • To evaluate this policy, we calculate the expected return from each state under the policy.

Setting Up the Equations:

  • Assuming a discount factor (γ = 0.9), the Bellman equations for our policy become:
  • For S1 under A1: V(S1) = 1 + 0.9 [0.7 V(S2) + 0.3 V(S3)]
  • For S2: V(S2) = 0 + 0.9 V(S1)
  • For S3: V(S3) = -1 + 0.9 V(S1)

Solving the System:

  • This system of linear equations needs to be solved to find the values of V(S1), V(S2), and V(S3).
  • This can be done using matrix algebra methods or iterative techniques like Gauss-Seidel iteration.
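
A minimal sketch of solving this particular system with NumPy (the values in the final comment are what the solution works out to, approximately):

```python
import numpy as np

# Rearranged into A v = b form:
#    V1 - 0.9*0.7*V2 - 0.9*0.3*V3 =  1
#   -0.9*V1 + V2                  =  0
#   -0.9*V1 + V3                  = -1
A = np.array([[ 1.0, -0.63, -0.27],
              [-0.9,  1.00,  0.00],
              [-0.9,  0.00,  1.00]])
b = np.array([1.0, 0.0, -1.0])

V1, V2, V3 = np.linalg.solve(A, b)
print(V1, V2, V3)  # roughly 3.84, 3.46, 2.46
```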

Section 7: Roll-out

In a rollout algorithm, one simulates or “rolls out” an episode from a given state until a terminal state or a predefined horizon is reached according to a policy. This approach is used to estimate the value function or the quality of an action at a state.

It can be considered as a form of Monte Carlo simulation, as it relies on averaging over multiple random samples of trajectories (or episodes).

Section 8: Model-free Approach: Monte-Carlo Methods — Learning from Experience

If we do not know the dynamics model and the reward model, we use other methods such as Monte Carlo and Temporal Difference methods.

Policy Evaluation (Converged Value Function)

The method calculates values (estimates the value function) by sampling a set of trajectories (episodes) and averaging their returns.

It only suits episodic MDPs (e.g., we cannot use it for optimizing robots on non-terminating tasks), but the good thing is that it does not rely on the Markov assumption on states.

  • It suits episodic tasks where episodes are guaranteed to terminate after a finite number of steps, allowing the calculation of the return for each state visited in the episode. Specifically, we sample actions from π, which leads to a sample episode: s_1, a_1, r_1, s_2, a_2, r_2, …, s_T.
  • On-policy: Monte Carlo policy iteration, where a policy is evaluated using the returns generated by episodes produced under the same policy, and then the policy is improved based on the evaluation.
  • Off-policy: Off-policy Monte Carlo methods evaluate or improve a policy different from the one generating the data. An example of this would be using importance sampling to re-weight the returns generated under a different policy (behavior policy) to evaluate or improve a target policy.
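
A minimal first-visit Monte Carlo policy-evaluation sketch in Python; it assumes each episode is given as a list of (state, action, reward) tuples, where the reward is the one received after taking that action in that state.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """First-visit Monte Carlo: average the return observed after the first
    occurrence of each state across many sampled episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Accumulate discounted returns G_t backwards through the episode.
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G[t] = r + gamma * G[t + 1]
        # Record the return only at each state's first visit.
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += G[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```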

Section 9: Model-free Approach: Temporal Difference

Now we can easily understand the temporal-difference (TD) update, where we simply estimate the future return G_t using the bootstrapping term from DP.
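
For reference, the TD(0) update can be written as

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ],

where α is a step size (learning rate) and the bracketed quantity, the one-step bootstrapped target minus the current estimate, is the TD error; it replaces the full sampled return G_t used by Monte Carlo methods.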

Compared to the Monte-Carlo method, it has the following advantages:

  1. Faster Learning: TD methods update the value estimate after each step rather than waiting for the end of an episode as in Monte Carlo methods. This means that TD learning can learn from incomplete sequences and update its estimates partway through an episode. This leads to potentially faster learning since the algorithm doesn’t have to wait for the episode to finish before making value updates.
  2. Handling of Continuous Tasks: Monte Carlo methods are primarily suited for episodic tasks where the episodes terminate. In contrast, TD methods can handle both episodic and continuous (non-terminating) tasks efficiently. This is because TD methods do not require the final outcome of an episode to update the value estimates.
  3. Reduced Variance: The updates in TD learning are based on the difference between estimated values of successive states (TD error), which often leads to lower variance in the updates compared to Monte Carlo methods. Monte Carlo updates are based on full returns, which can have high variance especially in stochastic environments.

~~~~ Incomplete Article (To Be Continued) ~~~~

~~~~ You are welcome to correct me if I made any mistakes here ~~~~

How do we optimize a policy model to maximize this objective?

1. Getting a reward model to calculate the reward?

2. Getting a value function (or an action-value function)?

3. Getting a dynamics model, i.e., the distribution over the next state s_{t+1} (and reward) that the state s_t transitions to after action a, Pr(s_{t+1}, r | s_t, a)?

Summary

--

--

Xinzhe Li, PhD in Language Intelligence

Expertise in Natural Language Processing (NLP) modeling and evaluation