I tried the same approach on your Two-Armed Bandit code (old one) but couldn’t get it to work.
Lucas Vazquez
1

Hi Lucas,

Getting ‘nan’ for the loss means that the values in the network have become either too big or too small to calculate anymore. In that case, you should reduce the learning rate.

In order to properly get the gradients for the loss, you will have to do something a little more complex than using a matrix multiply to replace the parts of the output layer we don’t want. You will need to get the values by their index.

indexes = tf.range(0, tf.shape(output_layer)[0]) * tf.shape(output_layer)[1] + action_holder
responsible_outputs = tf.gather(tf.reshape(output_layer, [-1]), indexes)

This should preserve the gradient you want when you make the update.

Show your support

Clapping shows how much you appreciated Arthur Juliani’s story.