Quick tour of major RL algorithms on PlaneStrike
I was very excited to see the release of TensorForce, which seems to offer a very slick interface to reinforcement learning. I previously tried rllab but couldn’t quite make it work. TensorForce looks simpler, and the built-in algorithms are more comprehensive. Judging from their blog, the overall design also seems more thoughtful. So I decided to run my good old PlaneStrike game through it and see how TensorForce does. It turned out to be quite easy: all I needed to do was create a simple environment and hit ‘run’. Code is here. The graph below shows smoothed reward per episode vs. iteration when I penalize repeated moves:

The following graph is for when I do not penalize repeated moves:

A few comments:
- It’s very cool to see things working :)
- Vanilla policy gradient (actor-critic) did the best. The trust-region methods TRPO and PPO seemed less sample-efficient here.
- I did not get to try A3C, since I did not have time to figure out how to set up cluster_spec and the rest of the distributed configuration.
- Penalizing repeated moves clearly helped all algorithms.
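
For the curious, here is a rough sketch of what such an environment with a repeated-move penalty could look like. This is plain Python with no TensorForce dependency, and the grid size, rewards, and game rules are illustrative assumptions, not the actual PlaneStrike implementation; it just follows the reset()/execute() shape that TensorForce-style environments expose:

```python
import numpy as np

class ToyStrikeEnv:
    """Toy PlaneStrike-style environment: find a hidden 'plane' on a grid.

    Illustrative only -- the rewards and rules here are assumptions,
    not the original game's. The agent picks cells; hits are rewarded,
    misses get a small penalty, and repeated moves can be penalized.
    """

    def __init__(self, size=6, repeat_penalty=1.0, seed=0):
        self.size = size
        self.repeat_penalty = repeat_penalty  # set to 0.0 to disable
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Hidden "plane": a random 1x3 horizontal strip of cells.
        row = int(self.rng.integers(self.size))
        col = int(self.rng.integers(self.size - 2))
        self.plane = {(row, col + i) for i in range(3)}
        self.tried = set()
        self.hits = 0
        # Observation: 0 = untried, 1 = miss, 2 = hit.
        self.board = np.zeros((self.size, self.size), dtype=np.int8)
        return self.board.copy()

    def execute(self, action):
        """action is a flat cell index; returns (state, terminal, reward)."""
        cell = divmod(int(action), self.size)
        if cell in self.tried:
            # Repeated move: apply the optional penalty, board unchanged.
            return self.board.copy(), False, -self.repeat_penalty
        self.tried.add(cell)
        if cell in self.plane:
            self.hits += 1
            self.board[cell] = 2
            reward = 1.0
        else:
            self.board[cell] = 1
            reward = -0.1
        terminal = self.hits == len(self.plane)
        return self.board.copy(), terminal, reward
```

With repeat_penalty at 0.0 the agent pays nothing for re-striking a known cell, which matches the second graph's setting; a positive penalty gives the shaping used in the first graph.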
Next I’ll try to apply TensorForce to a somewhat practical problem in ads yield management.
