First of all, congratulations on your awesome posts about RL in TensorFlow.
I was wondering one thing:
Can't we use the softmax function for Qout and nextQ, together with a cross-entropy loss?
Something like this:
Qout = tf.nn.softmax(tf.matmul(inputs1, W))
nextQ_ph = tf.placeholder(shape=[1, 4], dtype=tf.float32)  # fed with the target Q-values
nextQ = tf.nn.softmax(nextQ_ph)
loss = tf.reduce_sum(-tf.reduce_sum(nextQ * tf.log(Qout), 1))
Am I just saying something stupid? Or could the softmax be used to balance the different utilities of the actions in each state, treating them as probabilities?
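To make the idea concrete, here is a minimal NumPy sketch of what I mean (the Q-values and all names here are made up, just for illustration): both the predicted and the target Q-values are squashed into probability distributions over the actions, and the loss is the cross-entropy between them.

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical Q-values for 4 actions in one state.
q_pred = np.array([1.0, 2.0, 0.5, 0.1])    # network output (Qout)
q_target = np.array([1.0, 3.0, 0.5, 0.1])  # Bellman target (nextQ)

# Turn both into probability distributions over actions.
p_pred = softmax(q_pred)
p_target = softmax(q_target)

# Cross-entropy between target and predicted distributions.
loss = -np.sum(p_target * np.log(p_pred))
print(loss)
```

Minimizing this loss would push the predicted action distribution toward the target one, rather than matching the raw Q-values themselves.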