Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C)
Arthur Juliani

Arthur, thank you for the amazing post. I really learned a lot. But there is something I don’t quite understand. You calculate policy loss using

self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)

This expression seems to be minimized by small probabilities (responsible_outputs) and small advantages. How, then, does your algorithm end up maximizing rewards?
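
To make the question concrete, here is a minimal sketch in plain Python (not code from the post) that evaluates the same expression for a single action. The helper name policy_loss and the sample values are hypothetical. It assumes the advantage is a fixed number fed into the loss, so only the probability changes during optimization.

```python
import math

def policy_loss(prob, advantage):
    """Single-sample version of -sum(log(prob) * advantage).

    `prob` plays the role of responsible_outputs and `advantage` is
    assumed to be a constant (not something the optimizer can change).
    """
    return -math.log(prob) * advantage

probs = [0.1, 0.5, 0.9]

# Loss as the action probability grows, for a positive and a negative advantage.
losses_pos = [policy_loss(p, 1.0) for p in probs]
losses_neg = [policy_loss(p, -1.0) for p in probs]

for p, lp, ln in zip(probs, losses_pos, losses_neg):
    print(f"prob={p:.1f}  loss(adv=+1)={lp:.3f}  loss(adv=-1)={ln:.3f}")
```

Since log(p) is negative for p < 1, the loss with a positive advantage shrinks as the probability approaches 1, while with a negative advantage it shrinks as the probability approaches 0. Under the stated assumption, minimizing the loss would therefore raise the probability of better-than-expected actions rather than lower it.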
