- For an MDP we can define the state-value function, for policy π, V_π(s), as: \begin{equation} V_\pi(s) = E[R_t \mid s_t = s] \end{equation} where R_t is the discounted reward sum \begin{equation} R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{i=0}^{\infty}\gamma^i r_{t+i+1} \end{equation} and γ is the discount factor, which takes values between 0 and 1.
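As a quick sanity check of the return formula, here is a minimal sketch that computes the discounted sum for a short, hypothetical reward sequence (the rewards and γ = 0.9 are made up for illustration):

```python
# Hypothetical reward sequence r_{t+1}, r_{t+2}, ... and discount factor.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]

# R_t = sum_i gamma^i * r_{t+i+1}, truncated to the rewards we have.
R_t = sum(gamma ** i * r for i, r in enumerate(rewards))
print(R_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```

With γ close to 0 the agent is myopic (only the next reward matters); with γ close to 1 distant rewards count almost as much as immediate ones.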
- We can also define the state-action value function, for policy π, Q_π(s, a), as: \begin{equation} Q_\pi(s, a) = E[R_t \mid s_t = s, a_t = a] \end{equation} The Q value of action a is the expected discounted reward sum if we take action a in state s and then continue following policy π.
- The optimal policy, π*, can be defined using the state-action value function as: \begin{equation} \pi^*(s) = \arg\max_a Q^*(s, a) \end{equation} TD methods try to find the state-value or the state-action value function.
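Extracting the greedy policy from a learned Q function is just an argmax over actions. A minimal sketch, using a toy Q-table with made-up states, actions, and values:

```python
# Toy Q-table (hypothetical values): Q[state][action].
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_policy(Q, s):
    # pi*(s) = argmax_a Q(s, a): pick the action with the highest Q value.
    return max(Q[s], key=Q[s].get)

print(greedy_policy(Q, "s0"))  # right
print(greedy_policy(Q, "s1"))  # left
```

This is why learning Q is enough to act optimally: no separate model of the environment is needed to recover π*.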
- The simplest TD algorithm is TD(0), which estimates the state-value function using the update rule: \begin{equation} V(s_t) \gets V(s_{t}) + \alpha(r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})) \end{equation} If the state space is large, we cannot store a separate value for every possible state in memory, so we approximate the state-value function instead.
- Policy Gradient methods, on the other hand, try to approximate the policy π(s) directly, using the policy gradient theorem.
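As a rough illustration of the idea, here is a minimal REINFORCE-style sketch (one common policy gradient method) on a hypothetical two-armed bandit; the softmax parameterization, reward scheme, and learning rate are all assumptions made for this example:

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]  # preferences for a softmax policy over two actions

def probs(theta):
    # pi(a) = exp(theta_a) / sum_b exp(theta_b)
    e = [math.exp(t) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reward(a):
    # Hypothetical bandit: action 1 pays 1.0, action 0 pays nothing.
    return 1.0 if a == 1 else 0.0

alpha = 0.1
for _ in range(2000):
    p = probs(theta)
    a = 0 if random.random() < p[0] else 1
    R = reward(a)
    # REINFORCE update: theta_i += alpha * R * d/d theta_i log pi(a)
    # For a softmax policy, d log pi(a) / d theta_i = 1{i == a} - pi(i).
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - p[i]
        theta[i] += alpha * R * grad_log

print(probs(theta)[1])  # probability of the rewarding action grows toward 1
```

The key contrast with TD methods: nothing here estimates V or Q; the sampled return directly reweights the gradient of the log-probability of the actions taken.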
