Key concepts in Reinforcement Learning

Dhanoop Karunakaran
Intro to Artificial Intelligence
6 min read · May 25, 2020
Source: [6]

The goal of any Reinforcement Learning (RL) algorithm is to find the optimal policy, the one that maximises the expected reward. Before getting into any specific RL algorithm, it is important to understand a few key concepts.

Episode

An episode is a sequence of states and actions (together with the rewards collected along the way) that runs from an initial state until a terminal state is reached.

Reward and Return

The reward measures how good the action taken in state s was for reaching the next state. It is the crucial signal in RL that drives the agent's learning. The reward at a specific timestep t is given below [5]:
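
r_t = R(s_t, a_t, s_{t+1})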

This formula says that r_t is the reward at timestep t for taking action a_t in state s_t and reaching the new state s_{t+1}. R denotes the reward function.

The return, on the other hand, is the sum of rewards collected from the current state to the goal state. There are two types of return: the finite-horizon undiscounted return and the infinite-horizon discounted return [5].

Finite-horizon undiscounted return

It is the sum of rewards collected from the current state to the goal state over a fixed, finite number of timesteps T [5]:
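
R(τ) = Σ_{t=0}^{T} r_t

where τ denotes the trajectory, i.e. the sequence of states and actions (see Trajectories below).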

As the name suggests, this return is undiscounted: because the horizon is finite, the rewards are not multiplied by a discounting factor.

Infinite-horizon discounted return

It is the sum of all rewards ever obtained by the RL agent, where the discounting factor determines how much far-future rewards are taken into account [5]:
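
R(τ) = Σ_{t=0}^{∞} γ^t r_t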

Discounting factor γ

It determines how far into the future rewards are taken into account in the return. The value of γ is between 0 and 1. At one extreme, γ = 0 means the agent only cares about the immediate reward, while γ = 1 means all future rewards are taken into consideration with equal weight [4]. Intermediate values give different effective horizons: with γ = 0.9, rewards are effectively accounted for over roughly the first ten timesteps, whereas with γ = 0.99 they remain significant for on the order of a hundred timesteps (the effective horizon is roughly 1/(1 − γ)).

The discount factor is used for both an intuitive reason and a mathematical reason. Intuitively, a reward now is better than the same reward later. Mathematically, an infinite sum of rewards may not converge to a finite value, which is hard to deal with in the equations [5]. With a discount factor, far-future rewards contribute less and less, which lets the return converge to a finite value.
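
To see this numerically, here is a small sketch (a toy example, not taken from the original article) that sums a long stream of constant rewards r = 1 under the two discount factors; the discounted return converges to r / (1 − γ), roughly 10 for γ = 0.9 and 100 for γ = 0.99:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # Sum of gamma^t * r_t over the reward sequence.
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(discounts * rewards)

# A long stream of constant rewards r = 1 at every timestep.
rewards = np.ones(1000)

for gamma in (0.9, 0.99):
    # For 0 <= gamma < 1 the infinite sum converges to 1 / (1 - gamma).
    print(gamma, discounted_return(rewards, gamma), 1 / (1 - gamma))
```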

States and Observations

The state s is a complete description of the world; when the agent has access to it, the environment is fully observable. An observation o, in contrast, is only a partial description of the state of the world, in which case the environment is partially observable.

Action and action spaces

The agent performs an action in the environment to move from the current state to the next state. In a navigation task, for instance, turning left or turning right is an action. The set of all valid actions in a given environment is called the action space [5]. There are two types of action space: discrete and continuous. In a discrete action space, only a finite number of actions are possible, for example turning left or turning right. A continuous action space, on the other hand, allows an infinite number of actions, for example choosing a steering angle instead of just turning left or right.
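
As a small illustration (using the Gymnasium library, which the article itself does not mention), the two kinds of action space for the driving example could be declared like this:

```python
from gymnasium import spaces  # pip install gymnasium

# Discrete action space: exactly three possible actions,
# e.g. 0 = turn left, 1 = go straight, 2 = turn right.
discrete_actions = spaces.Discrete(3)

# Continuous action space: a single steering angle in [-1.0, 1.0],
# so there are infinitely many possible actions.
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,))

print(discrete_actions.sample())    # e.g. 2
print(continuous_actions.sample())  # e.g. [0.137]
```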

Policy

The policy is a mapping from states to actions. In other words, the policy determines how the agent behaves in a specific state. There are two types of policies: deterministic and stochastic.

Deterministic policy

A deterministic policy outputs a single action with probability one. For instance, in a car-driving scenario, consider three actions: turn left, go straight, and turn right. An RL agent with a deterministic policy always outputs exactly one of these actions for a given state; that is, it chooses the action without modelling any uncertainty. Deterministic policies are normally written with the following notation:
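
a_t = μ(s_t)

where μ denotes the deterministic policy that maps a state directly to an action.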

Stochastic policy

A stochastic policy outputs a probability distribution over the actions in a given state. For instance, consider the same three actions: turn left, go straight, and turn right. The output of the policy is a probability distribution over these actions, say 20% turn left, 50% go straight, and 30% turn right. This kind of policy is useful in non-deterministic environments. Stochastic policies are normally written with the following notation:
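
a_t ∼ π(·|s_t)

where π(·|s_t) is a probability distribution over actions conditioned on the state s_t.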

Trajectories

A trajectory τ is a sequence of states and actions [5]:
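
τ = (s_0, a_0, s_1, a_1, …)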

Value function

State-value function

The state-value Vπ(s) is the expected total reward when starting from state s and acting according to policy π thereafter. If the agent uses a given policy π to select actions, the corresponding value function is given by:
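
V^π(s) = E_{τ∼π}[ R(τ) | s_0 = s ]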

Optimal state-value function: it is the highest possible value function compared to all other value functions, i.e. the maximum expected return over all policies, for every state:
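
V*(s) = max_π E_{τ∼π}[ R(τ) | s_0 = s ]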

If we know the optimal value function, then the policy that corresponds to it is the optimal policy 𝛑*.

Action-value function

It is the expected return for an agent that starts in state s, takes an arbitrary action a, and forever after acts according to policy 𝛑:
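
Q^π(s, a) = E_{τ∼π}[ R(τ) | s_0 = s, a_0 = a ]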

The optimal Q-function Q*(s, a) is the highest possible Q-value for an agent that starts in state s and chooses action a. Therefore, Q*(s, a) is an indication of how good it is for the agent to pick action a while being in state s.

Since V*(s) is the maximum expected total reward when starting from state s, it is the maximum of Q*(s, a) over all possible actions. Therefore, the relationship between Q*(s, a) and V*(s) is easily obtained as:
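
V*(s) = max_a Q*(s, a)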

If we know the optimal Q-function Q*(s, a), the optimal policy can be extracted directly by choosing the action a that gives the maximum Q*(s, a) in state s:
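
a*(s) = argmax_a Q*(s, a)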

Policy iteration and value iteration

In policy iteration, we start with a random policy, compute the value function of that policy in the evaluation step, and then derive a new policy from that value function in the improvement step. The process repeats until it converges to the optimal policy. In this approach, the policy is manipulated directly.

Source: [8]
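
Below is a minimal sketch of policy iteration on a small tabular MDP. It assumes the transition probabilities P[s, a, s'] and rewards R[s, a] are known and stored as NumPy arrays, an assumption made here only for illustration:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=100):
    """P[s, a, s']: transition probabilities, shape (S, A, S); R[s, a]: rewards, shape (S, A)."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy

    while True:
        # 1. Policy evaluation: iteratively estimate V^pi for the current policy.
        V = np.zeros(n_states)
        for _ in range(eval_iters):
            V = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                          for s in range(n_states)])

        # 2. Policy improvement: act greedily with respect to V^pi.
        Q = R + gamma * P @ V                       # Q-values, shape (S, A)
        new_policy = Q.argmax(axis=1)

        if np.array_equal(new_policy, policy):      # policy is stable -> optimal
            return policy, V
        policy = new_policy
```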

In value iteration, we start with a random value function and repeatedly compute an improved value function until the process converges to the optimal value function. The intuition is that the policy that acts greedily with respect to the optimal value function is the optimal policy. Here, the policy is only manipulated implicitly.
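
Continuing the same toy setup (known P and R, again an assumption made only for illustration), value iteration applies the Bellman optimality backup directly to the value function:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Same MDP format as above: P has shape (S, A, S), R has shape (S, A)."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                          # arbitrary initial value function

    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s, a) + gamma * E[V(s')] ]
        new_V = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(new_V - V)) < tol:         # converged (approximately) to V*
            break
        V = new_V

    # The policy is only implicit: it acts greedily with respect to V*.
    policy = (R + gamma * P @ V).argmax(axis=1)
    return policy, V
```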

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
