In one of the previous posts, we discussed the value function. To recap quickly, the value function represents the expected sum of rewards obtained from a given state onwards. Let’s see how the value function can be written in terms of the Bellman equation,
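The equation referenced here appears to be missing from the text; reconstructed in standard notation from the description that follows, it reads:

```latex
V(s) = R(s, a, s') + \gamma \, V(s')
```

where γ is the discount (reduction) factor.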
The value function for the state under consideration, s, is the sum of the immediate reward R(s, a, s') and the next-state value function V(s'), scaled by a reduction (discount) factor. V(s') in turn depends on its own successor state, and the same logic applies to all succeeding states. The above equation fits perfectly when we have a deterministic environment, i.e., when the agent makes a transition, it can only go to one next state.
But the state space (environment) can also be stochastic. Due to this nature, multiple next states may be reachable from the current state, each with its own transition probability. To bring this stochastic property into the equation, we include the transition probability as well. Consider a scenario where, from state s1, there is a 0.7 probability of going into s5 and a 0.3 probability of going into s7. Rewriting the above equation, we have
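The rewritten equation seems to be missing here; reconstructed from the surrounding description, it is the deterministic form with the next-state term averaged over the transition probabilities:

```latex
V(s) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V(s') \right]
```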
The additional term P(s’|s,a) captures the stochastic environment, and the summation over s’ ensures we aggregate the values of the different possible next states, each weighted by its transition probability.
We also know that not only the state space (environment) but also the policy can be stochastic. In a particular state, there could be multiple actions that can be undertaken, and the policy is a probability distribution over the action space.
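Folding the stochastic policy into the equation as well, the reconstructed form (written with the standard notation π(a|s) for the policy) becomes:

```latex
V(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V(s') \right]
```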
The first term, the policy probability π(a|s), expresses the stochastic characteristics of the policy. With it, the formula averages over all possible actions according to the policy’s probability distribution, just as the transition probabilities average over next states.
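The full stochastic Bellman equation above can be turned directly into code. Below is a minimal sketch of iterative policy evaluation on a small made-up MDP (the 3 states, 2 actions, transition matrix, rewards, and uniform policy are all hypothetical, chosen only to illustrate the backup):

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP, used only to illustrate the backup
# V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
n_states, n_actions = 3, 2
gamma = 0.9  # discount (reduction) factor

# P[s, a, s'] : transition probabilities (each row over s' sums to 1)
P = np.array([
    [[0.7, 0.3, 0.0], [0.2, 0.8, 0.0]],
    [[0.0, 0.6, 0.4], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # state 2 is absorbing
])

# R[s, a, s'] : immediate reward for each transition
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 2] = 1.0  # entering state 2 yields reward 1
R[2] = 0.0        # no further reward once absorbed

# pi[s, a] : stochastic policy, a probability distribution over actions
pi = np.full((n_states, n_actions), 0.5)

# Iterative policy evaluation: apply the Bellman backup until convergence
V = np.zeros(n_states)
for _ in range(1000):
    # einsum sums over actions a and next states s' (here 't')
    V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V.round(3))
```

Each iteration is one application of the equation: the inner bracket `R + gamma * V` broadcasts V(s') across transitions, and the `einsum` performs both summations at once, weighted by P(s'|s,a) and π(a|s).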