Published in

Oracle Developers

# Lessons from AlphaZero (part 3): Parameter Tweaking

This is part three of a series discussing how we replicated and extended DeepMind’s AlphaZero algorithm in interesting ways. You can find Part I here.

Hello again! In this week’s installment, we’ll be sharing some thoughts on tweaking (hyper)parameters in AlphaZero. These are parameters that are not learned during training (like network weights), but are fixed in advance. By picking good values, training might proceed faster or top out higher.

Many of these parameters affect the so-called exploration-exploitation tradeoff: the algorithm needs to balance between exploiting its existing knowledge and exploring new possibilities. Exploration is important when training AlphaZero to play against highly skilled opponents (as in the paper), but even more important when training it to play well in a wider variety of positions. As covered in the last post, our test set includes imperfect games. Therefore, we spent some time tuning our implementation to explore broadly enough to be not just a good optimal player, but a good general player.

# Hyperparameters in AlphaZero

Let’s take a look at some of the hyperparameters we might want to optimize. Here we’ll just take a glance at what they are and do.

## c_puct

During Monte-Carlo Tree Search (MCTS) simulation, the algorithm evaluates potential next moves based on both their expected game result, and how much it has already explored them. A constant c is used to control this trade-off.

Standard MCTS uses a variant of the Upper Confidence Bound-1 (UCB-1) formula called UCT (UCB for trees). AlphaZero uses a version called polynomial upper confidence trees (PUCT). If we are at state s and considering action a, we need three values to calculate PUCT(s, a):

• Q — The mean action value. This is the average game result across current simulations that took action a.
• P — The prior probabilities as fetched from the network.
• N — The visit count, or number of times we’ve taken this action during current simulations.

We calculate PUCT(s, a) = Q(s, a) + U(s, a), where U is calculated as follows:

Notice how we sum across all potential actions (b) in the numerator, and the visit count for the action under consideration (a) is in the denominator. Therefore, the less we have tried this action, the greater U will be. This encourages exploration.

By increasing c_puct, we put more weight toward this exploration term. By decreasing it, we more strongly value exploiting the expected result (Q).

After trying out a handful of values (from 1 to 6), we found that approximately 4 was optimal.

## Dirichlet ɑ (alpha)

One innovation of AlphaZero was introducing noise into MCTS.

During an MCTS simulation, we repeatedly simulate play from a root node (s) representing the current board state. The first thing we do is query the neural network for the prior probability vector (p) of potential actions from s.

Instead of using p as-is, AlphaZero adds noise according to the Dirichlet distribution with parameter ɑ, aka Dir(ɑ). To help understand what this means, let’s see some examples of the Dirichlet distribution with different ɑ. For these examples, we’ll assume that there are n potential next actions.

First, recall that the support of Dirichlet (i.e., the values where it is defined) is the set of vectors (x_1, x_2, …, x_n) whose coordinates are nonnegative and sum to 1. If n=3, then some examples would be (0.1, 0.5, 0.4) and (0.3, 0, 0.7). This set is also called the (n-dimensional) simplex.

Now note that Dir(1) is uniform. This means that sampling from it will produce either of the above two vectors (or any other valid vector) with equal chance. As ɑ gets smaller, Dirichlet starts to prefer vectors near the basis vectors: (0.98, 0.01, 0.01) or (0.1, 0.9, 0) will be more likely to be drawn than (0.3, 0.3, 0.4). For values greater than 1, the opposite is true — the basis vectors are de-emphasized, and more-balanced vectors are preferred.

To add noise to p, we sample a value d from Dir(ɑ), and take a weighted sum: x*p + (1-x)*d. Because both p and d have coordinates summing to one (and the weights x and 1-x sum to 1), this property is preserved in the weighted sum, making it a valid probability distribution. In the paper, x is set to 0.75. Below, we’ll use x=0.5 to exaggerate the effect of d.

When ɑ is larger than one, this tends to “flatten” p, making us less strongly prefer any one move:

For ɑ<1, it will cause a random component to become more emphasized:

In both cases, we are encouraged to explore away from our prior (which strongly preferred the middle move).

To pick a value, we first looked at the values they used for chess, shogi, and Go: 0.3, 0.15, and 0.03. The average number of legal moves in those games are 35, 92, and 250. It’s hard to know how to optimally extrapolate, but a reasonable first guess looks to be choosing ɑ = 10/n. Since there are approximately four legal moves in a Connect Four position, this gives us ɑ=2.5. It’s true that this is greater than one while the rest are less, and yet this number seemed to do well in our testing. With a little playing around, we found that 1.75 did even better.

UPDATE: while early versions of our training did best with a=1.75, we ultimately settled on a=1.0 as the optimal value for our training.

More research (and compute time) would be helpful to get more insight here.

## Temperature, τ (tau)

After MCTS has completed a round of simulations, and before it has chosen a move to play, it has accumulated a visit count (N above) for each potential next move. MCTS works such that good moves are eventually visited more often than bad moves, and are a good indication of where to play.

Normally, these counts are normalized and used as a distribution to choose the actual move. But a so-called temperature parameter (τ) can be used to first exponentiate these counts: N^(1/τ). They set τ=1 for the first 30 moves of the game (which has no effect) and then set it to an infinitesimal value for the rest of the game (which, post-normalization, suppresses all values except the maximum).

• N = (1, 2, 3, 4)
• Normalized N: (0.1, 0.2, 0.3, 0.4)
• Using τ=1: (0.1, 0.2, 0.3, 0.4)
• Using τ~0: (0, 0, 0, 1)

Therefore, the first 30 moves are somewhat exploratory, and the rest will play only the most-visited action.

We decided to play around with this a bit. First, we tried leaving τ at 1 instead of reducing it. This resulted in a significant improvement in performance. Emboldened by this success, we tried some living on the edge and bumped it up to 1.75 for the whole game:

• Using τ=1.75: (0.15, 0.23, 0.29, 0.34)

The larger τ gets, the more this flattens the distribution, leading to more even exploration. Although τ=1.75 gave us the best results amongst values we tried, we discovered that it was no better than simply increasing c_puct. So we simply left it at 1.

# Other tweaks

In addition to tuning hyperparameters, we tried a few other modifications.

## Nvidia-TensorRT and int8

Nvidia offers an inference optimization tool that sped up our inference by 3-4x. In addition to optimizing the TensorFlow graph, it enables 8-bit quantization, trading off precision for throughput.

A side-effect of this quantization is that it encourages exploration. For example, if the neural net predicts that two positions have slightly different results (a number in the range [-1, 1]), with a little loss of precision they may come out looking equal.

## Position deduplication

When training the neural net, positions are selected at random amongst the most recent games. Because Connect Four has very few reasonable opening lines, we found this led to a significant overrepresentation of early board positions. This caused the network to focus too much on them.

We mitigated this problem by deduplicating positions before randomly selecting them for training. In the process, we averaged the priors and result values. This seemed to work well.

# Conclusion

As in much of machine learning, the AlphaZero paper has lots of “magic numbers” that aren’t adequately explained. It’s hard to know how much exploration was done to settle on the provided values. For some parameters, a principled approach can help narrow the choices, but in general, it’s a game of “try it and see.”

Exploration vs exploitation is a central concept in reinforcement learning, and as such, much of the tweaking we did affected that tradeoff in various ways.

We certainly haven’t found the optimal choices, but through a lot of experimentation, we helped AlphaZero do much better on Connect Four than it did “out of the box.”

Part 4 is now published here.

--

--

## More from Oracle Developers

Aggregation of articles from Oracle engineers, Groundbreaker Ambassadors, Oracle ACEs, and Java Champions on all things Oracle technology. The views expressed are those of the authors and not necessarily of Oracle.