…arning is an off-policy TD control algorithm. It is almost identical to SARSA; the only difference is that it does not follow the current policy to pick the next action A’ for the update, but instead assumes the greedy action, the one with the highest Q value in the next state. Like SARSA, it aims to estimate the Q values, and its update rule is:
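In its standard textbook form, the Q-learning update is usually written as Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S’, a) − Q(S, A)], whereas SARSA bootstraps from Q(S’, A’) for the action A’ it actually takes next. The contrast can be sketched in a few lines of Python. This is a minimal illustration rather than a full agent: the table layout (a list of lists indexed as `Q[state][action]`), the function names, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; updates Q in place, returns the TD error."""
    # Off-policy target: bootstrap from the best action value in s_next,
    # no matter which action the agent will actually take there.
    td_target = r + gamma * max(Q[s_next])
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular SARSA step, shown only for comparison."""
    # On-policy target: bootstrap from the action a_next that the
    # current policy actually selected in s_next.
    td_target = r + gamma * Q[s_next][a_next]
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action table, purely for illustration.
    Q = [[0.0, 0.0], [1.0, 2.0]]
    q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.5)
    sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5)
    print(Q[0])
```

The two functions differ only in the target line: `max(Q[s_next])` versus `Q[s_next][a_next]`. When the policy happens to pick the greedy action, the two updates coincide.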
…f brevity of this post, I have assumed that readers know the Bellman equation. According to it, the utility of a state is the expected discounted sum of the rewards obtained from that state onward, as follows:
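One common way to write this is given here in textbook notation, which is an assumption and may differ from the symbols used elsewhere in this post: U(s) is the utility of state s, R(s) the reward received in s, γ the discount factor, and P(s' | s, a) the probability of reaching s' after taking action a in s. The first line is the definition of utility as an expected discounted sum of rewards; the second is the recursive Bellman form for the optimal utility.

```latex
\begin{align}
U(s) &= \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, s_0 = s \right] \\
U(s) &= R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
\end{align}
```

The recursive form says that a state's utility is its immediate reward plus the discounted utility of the best expected successor, which is the same "reward plus discounted lookahead" structure that TD targets approximate from samples.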