Mar 18, 2020
I understand it now. By accumulating negative rewards, you push the algorithm to explore the unexplored (i.e., not-yet-penalized) space. As the penalties pile up, the options it has already tried look worse than the ones it hasn't, so it tends to attempt new solutions until it happens to succeed… or something like that :)
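Here is a rough sketch of that intuition, not anyone's actual implementation. It assumes a toy setup I made up for illustration: a handful of discrete actions, value estimates that all start at 0, a purely greedy agent, and a hypothetical `greedy_search` function with made-up parameters (`step_penalty`, `success_reward`, `lr`). Every failed action gets a negative reward, so its estimate drops below the untouched ones and the greedy choice moves on to something untried.

```python
def greedy_search(n_actions, goal, step_penalty=-1.0, success_reward=1.0,
                  lr=0.5, max_steps=100):
    """Toy illustration: a greedy agent driven into untried actions by penalties.

    All names and parameters here are hypothetical, chosen for this sketch.
    """
    # Every action starts at 0.0: unexplored and not yet penalized.
    q = [0.0] * n_actions
    tried = []

    for _ in range(max_steps):
        # Purely greedy choice, no random exploration.
        # max() returns the first maximal item, so ties go to the lowest index.
        action = max(range(n_actions), key=lambda a: q[a])
        tried.append(action)

        # Negative reward for a failure, positive reward for the goal action.
        reward = success_reward if action == goal else step_penalty

        # Move the estimate part of the way toward the observed reward.
        q[action] += lr * (reward - q[action])

        if action == goal:
            break

    return tried, q


tried, q = greedy_search(n_actions=5, goal=3)
print(tried)  # [0, 1, 2, 3]
print(q)      # [-0.5, -0.5, -0.5, 0.5, 0.0]
```

Each penalized action ends up below 0, while the untried ones still sit at 0, so the agent walks through new actions without any explicit exploration rule. This only works because the starting estimates are higher than the penalty; if they started very low, the same greedy agent could get stuck on one action.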