rib…nce in the Monte Carlo sampling. Effectively, there are T sources of variance, one contributed by each reward R_t. However, we can instead make use of the returns G_t: from the standpoint of optimizing the RL objective, rewards obtained before time t cannot be influenced by the action a_t, so they only add variance without changing the gradient in expectation. Hence, if we replace r(τ) by the discounted return G_t, we arrive at the classic algorithm Policy …
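
To make the substitution concrete, here is a minimal sketch of the reward-to-go computation and the resulting policy-gradient surrogate loss. It assumes PyTorch and a categorical policy; the names `rewards_to_go`, `reinforce_loss`, and the discount `gamma` are illustrative choices, not part of the original text.

```python
# Minimal REINFORCE-style sketch (illustrative, not the author's exact code).
# It shows the substitution of the full-trajectory reward r(tau) with the
# per-step discounted return G_t.
import torch
from torch.distributions import Categorical

gamma = 0.99  # assumed discount factor

def rewards_to_go(rewards, gamma):
    """Compute G_t = sum_{k >= t} gamma^(k - t) * r_k for every time step t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def reinforce_loss(log_probs, rewards, gamma):
    """Policy-gradient surrogate loss: each log pi(a_t | s_t) is weighted by
    its return-to-go G_t rather than the total trajectory reward, removing the
    variance that rewards from earlier time steps would otherwise contribute."""
    returns = torch.tensor(rewards_to_go(rewards, gamma))
    return -(torch.stack(log_probs) * returns).sum()
```

In practice one would collect `log_probs` by sampling actions from `Categorical(logits=policy_net(state))` while stepping a gym-style environment, then backpropagate through `reinforce_loss`; those surrounding pieces are omitted here since they are not specific to the variance-reduction argument.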