Getting ‘nan’ for the loss means that the values flowing through the network have grown too large or too small to represent numerically. In that case, you should reduce the learning rate.
In order to get the gradients for the loss to flow properly, you will have to do something a little more involved than using a matrix multiply to zero out the parts of the output layer you don’t want. You will need to select the responsible values by their index.
Note that tf.shape returns the whole shape vector, so you need its components: [0] is the batch size and [1] is the number of actions.

indexes = tf.range(0, tf.shape(output_layer)[0]) * tf.shape(output_layer)[1] + action_holder
responsible_outputs = tf.gather(tf.reshape(output_layer, [-1]), indexes)
This should preserve the gradient you want when you make the update.
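As a sanity check, the same flat-index arithmetic can be sketched in plain NumPy (the array values here are made up for illustration; in the real graph, output_layer and action_holder are tensors):

```python
import numpy as np

# Hypothetical batch of outputs: 3 examples, 4 actions each.
output_layer = np.array([[0.1, 0.2, 0.3, 0.4],
                         [0.5, 0.6, 0.7, 0.8],
                         [0.9, 1.0, 1.1, 1.2]])
action_holder = np.array([2, 0, 3])  # action chosen for each example

# row_index * num_actions + chosen_action gives each value's
# position in the flattened output, just like the TF snippet.
num_rows, num_actions = output_layer.shape
indexes = np.arange(num_rows) * num_actions + action_holder
responsible_outputs = output_layer.reshape(-1)[indexes]
print(responsible_outputs)  # [0.3 0.5 1.2]
```

Because the gather only reads values out of the output layer (rather than recomputing them), the gradient with respect to the selected entries flows straight back through the network during the update.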