Sep 5, 2018
Chip, addressing your questions:
- The values {-1, 0, 1} are only used in my example to represent the players on the board. The MCTS algorithm I implemented does not use these values as rewards. The sentence you quoted in the article refers to the probability of winning, X_j, of a given move j; being a probability, X_j always lies within [0, 1].
- On rewards — I chose to keep the implementation simple, and the “reward” for winning a playout is simply represented as an increase in the probability of winning X_j. Perhaps you are coming from a machine learning angle where rewards are parameterized, but here it is kept simple; the first sketch after this list shows the bookkeeping.
- About decisive/anti-decisive (D/AD) moves — it’s surprising to me that vanilla UCT (using UCB1) performs poorly on tic-tac-toe. I had thought it should perform remarkably well, especially given that the state space of TTT is tiny. In decisive scenarios, a UCT player should quickly converge on the single optimal play out of the 6 or so available moves. A UCT adversary would then be in an anti-decisive scenario, and should quickly converge on plays avoiding the decisive scenario. I do advise digging deeper into what’s going on there, since you mentioned that you did not see this behavior. Regardless, I’m glad you found that implementing them improves the performance of your players! The way I see it, D/AD moves essentially help out UCT by skipping the last 2 levels of every terminal branch of the tree, and that’s a lot of branches cut; there’s a rough sketch of such a check after this list. Be warned that, because D/AD moves seem to work well in exactly the most obvious scenarios where a win is imminent, they may mask a problem with the UCT algorithm itself.
- About getting player moves from my implementation —
MonteCarlo.bestPlay(state) does exactly this; a minimal usage sketch is at the end of this reply. I’m not sure how your Java implementation can interface with Node.js modules, but I’m sure you’ll work it out.
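
Since the first two points describe the same bookkeeping, here is a minimal sketch of how a playout result updates a move’s statistics and how X_j falls out of them. The class and field names (StatsNode, plays, wins) are illustrative, not the exact ones in my implementation.

```javascript
// Illustrative MCTS statistics bookkeeping (names are not from my actual code).
class StatsNode {
  constructor() {
    this.plays = 0 // number of playouts that passed through this move
    this.wins = 0  // how many of those playouts ended in a win
  }

  // Called on every node along the path from the simulated leaf back to the
  // root: the "reward" for winning a playout is just this increment.
  update(won) {
    this.plays += 1
    if (won) this.wins += 1
  }

  // X_j: the empirical win probability of this move, always within [0, 1].
  winProbability() {
    return this.plays === 0 ? 0 : this.wins / this.plays
  }

  // UCB1 value used during selection: exploit X_j, plus an exploration bonus
  // that shrinks as the move gets tried more often. (Assumes this move has
  // been tried at least once; unvisited moves are expanded before selection
  // ever reaches this comparison.)
  ucb1(parentPlays, c = Math.sqrt(2)) {
    return this.winProbability() + c * Math.sqrt(Math.log(parentPlays) / this.plays)
  }
}
```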
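
On the D/AD point, this is roughly how the check can short-circuit the playout policy. The game interface used here (legalPlays, resultOf, winnerOf) is assumed for illustration; adapt the names to whatever your Java implementation exposes.

```javascript
// Rough sketch of a decisive / anti-decisive (D/AD) playout policy.
// The game interface (legalPlays, resultOf, winnerOf) is assumed here.
function chooseSimulationPlay(game, state, player, opponent) {
  const plays = game.legalPlays(state)

  // Decisive: a move that wins immediately for the player to move.
  const winning = plays.find(
    (play) => game.winnerOf(game.resultOf(state, play, player)) === player
  )
  if (winning !== undefined) return winning

  // Anti-decisive: a move the opponent could win with on their next turn;
  // playing it ourselves blocks the immediate threat (tic-tac-toe-style games).
  const blocking = plays.find(
    (play) => game.winnerOf(game.resultOf(state, play, opponent)) === opponent
  )
  if (blocking !== undefined) return blocking

  // Otherwise, fall back to the usual uniformly random playout move.
  return plays[Math.floor(Math.random() * plays.length)]
}
```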
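
And for getting moves out of the Node.js module, usage looks roughly like this. Only bestPlay(state) is quoted exactly from your question; the require paths, constructors, and the search call (named runSearch here) are paraphrased from memory, so match them against the actual code.

```javascript
// Minimal usage sketch. Only bestPlay(state) is exact; the other names
// (require paths, Game/MonteCarlo constructors, start, nextState, runSearch)
// are paraphrased and may differ from the actual module.
const Game = require('./game')
const MonteCarlo = require('./monte-carlo')

const game = new Game()
const mcts = new MonteCarlo(game)

let state = game.start()

// Run playouts from the current state for some time budget, then ask for
// the statistically best play found so far.
mcts.runSearch(state, 1) // e.g. 1 second of simulations
const play = mcts.bestPlay(state)

// Apply the chosen play to advance the game.
state = game.nextState(state, play)
```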
Happy exploring!
