• For an MDP we can define the state-value function for policy π, Vπ(s), as: $$V_\pi(s) = E[R_t \mid s_t = s]$$ where R_t is the discounted return, i.e. the discounted sum of future rewards: $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{i=0}^{\infty}\gamma^i r_{t+i+1}$$ where γ is the discount factor, taking values between 0 and 1.
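For a finite episode, the discounted return above can be computed directly; a minimal sketch (the function name and rewards are illustrative, not from the notes):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^i * rewards[i], i.e. R_t for a finite reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```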
• We can also define the state-action value function for policy π, Qπ(s, a), as: $$Q_\pi(s, a) = E[R_t \mid s_t = s, a_t = a]$$ The Q-value of action a is the expected discounted return if we take action a in state s and thereafter follow policy π.
• The optimal policy, π*, can be defined via the optimal state-action value function as: $$\pi^*(s) = \arg\max_a Q^*(s, a)$$ TD methods try to find the state-value or the state-action value function.
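Acting greedily with respect to a tabular Q function looks like this; a hedged sketch where the Q-table, state names, and action names are all illustrative:

```python
def greedy_policy(Q, state, actions):
    """pi*(s) = argmax_a Q(s, a) over the available actions."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy Q-table: in state "s0", "right" has the higher value.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(greedy_policy(Q, "s0", ["left", "right"]))  # right
```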
• The simplest TD algorithm is TD(0), which estimates the state-value function with the update rule: $$V(s_t) \gets V(s_{t}) + \alpha(r_{t+1} + \gamma V(s_{t+1}) - V(s_{t}))$$ If the state space is large, we cannot store a separate value for every possible state in memory, so we approximate the state-value function instead.
• Policy Gradient methods, on the other hand, try to directly approximate the policy π(s), using the policy gradient theorem.
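A hedged sketch of the policy-gradient idea on a two-armed bandit: a softmax policy over preferences θ is updated with the REINFORCE estimator ∇log π(a) · R. The bandit setup, names, and hyperparameters are all illustrative assumptions, not from the notes:

```python
import math, random

random.seed(0)
theta = [0.0, 0.0]         # action preferences (policy parameters)
true_means = [0.0, 1.0]    # arm 1 pays more on average
alpha = 0.1                # learning rate

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    reward = random.gauss(true_means[a], 0.1)
    # For a softmax policy: d/d theta_k log pi(a) = 1{k == a} - pi(k)
    for k in range(2):
        grad = (1.0 if k == a else 0.0) - probs[k]
        theta[k] += alpha * reward * grad

print(softmax(theta))  # the policy should come to favour arm 1
```

This is the gradient-ascent step the policy gradient theorem justifies: the expected update direction increases the expected return without ever estimating a value function.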