Nice post.
T SHAO
11

Hi T SHAO,

This is the policy gradient loss calculation. It is related to cross-entropy in the sense that we want to increase the log-likelihood of rewarding actions and decrease it for un-rewarding actions.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.