…ink of a Policy gradient network is one which produces explicit outputs. In the case of our bandit, we don’t need to condition these outputs on any state. As such, our network will consist of just a set of weights, with each corresponding to each of the …