Machine Learning Writing Month: Reinforcement Learning

Cody Marie Wild
26 min read · Dec 2, 2017


[This post chronicles my quest to become more up to date with the field of deep reinforcement learning, in the form of summaries of relevant concepts and papers. Curious about why these posts are all associated with days of the month? Check out this explanation here]

Day 21: Reinforcement Learning Basics

This summary is not of any one individual paper, but is the result of reading multiple sources and distilling them down into a single shared understanding.

First, some important terminology. In reinforcement learning, the world is a collection of states (S), possible actions (A), and the reward (R) you receive from taking an action, conditional on your prior state. The goal is to find a pathway with the highest total reward (i.e. the reward experienced at every step in the pathway, with far-future rewards slightly discounted so they are worth less). A policy is your strategy for what you’ll do at each state. V(s) is the value of state s, defined as the reward you’d get if you started at state s and made your best possible decision at every point after that. Q(s, a) is the value of taking action a, starting out from state s. This asks a similar question: “what is the reward I get from taking this action, at this state, as calculated by the next-step reward plus the maximum Q(s’, a’) over the actions available to me at the state I land in?”
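To make those two definitions concrete, here is one standard way of writing them down. This is my own gloss rather than something pulled from the sources: γ is a discount factor and P(s’ | s, a) is the transition probability, neither of which I’ve formally introduced above.

```latex
V(s)    = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')
```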

Before diving into details, here are a few meta-bits of knowledge that might make the experience of reading this more clear:

  • For the purposes of this model, there are considered to be a fixed, enumerable number of states
  • Importantly: both value and policy iteration contain an important assumption, without which these explanations would be confusing. Both assume that the basic structure of the environment they’re in (the transition matrix from one state to another, conditional on action, and the reward from taking a given action at a given state) is fully known to them. This is an important distinction between these two approaches and Q learning, which I’ll cover last.

First, value iteration. The premise of value iteration is that the best way to make a choice about your best action is to have a good estimate of the “value” of the potential next states that you’re choosing between. Honestly, I don’t really like the word “value”, because it’s so vague. I like to instead think of it as “potential”. See, the fundamental thing that makes the calculation of “value” tricky is that the “value” of a state isn’t just its next action/reward, but the whole path that stretches out afterward. There could be a state with an awful next-step reward, but with the ability to access high-value states later on. Value iteration works by randomly assigning values to all of your states. Then, for each state, for each action, it looks at the next-step reward from that action, as well as our current estimate of the value of ending up in whatever next state the action got you to. Note: this is our current estimate, that table we just initialized with random noise, or maybe zeros. At the beginning, our estimates will be quite bad. However, as we accumulate information about next-state rewards for the different actions, we eventually converge onto a good estimate of the value of each state. Then, it’s straightforward to calculate a policy by just always choosing the action that leads to the state with the highest V value (since, by definition, that V now represents our estimate of the best possible path starting from that state).
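To make that loop concrete, here is a minimal sketch of tabular value iteration. The array shapes, names, and convergence tolerance are my own illustration rather than something from the sources I read, and, per the caveat above, it assumes the transition and reward structure is fully known.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration.

    P: array of shape (S, A, S), where P[s, a, s'] is the transition probability
    R: array of shape (S, A), expected immediate reward for taking a at s
    Returns the converged value estimate V and the greedy policy derived from it.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                      # could also be random init
    while True:
        # Q[s, a] = immediate reward + discounted value of wherever we land
        Q = R + gamma * P @ V                   # shape (S, A)
        V_new = Q.max(axis=1)                   # best action at each state
        if np.max(np.abs(V_new - V)) < tol:     # stop once estimates settle
            break
        V = V_new
    policy = Q.argmax(axis=1)                   # greedy policy w.r.t. V
    return V, policy
```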

One criticism of value iteration is that it’s overly slow, and that it learns more than it really needs to solve the problem. Your agent doesn’t really care about learning the value of every state in the universe; it just cares about developing a good policy. And, empirically, it can often take longer to develop a full model of all possible values than it does to just determine an optimal policy. Policy iteration is the approach built on this observation: start with a random policy (that is: a choice of action for each state), evaluate the value of each state *using that policy* to jump from state to state, and then, based on this value estimation, update your best possible action at each state (i.e. your policy). Repeat until the policy stops changing.
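Here is a matching sketch of policy iteration, under the same assumptions and conventions as the value-iteration snippet above; again, this is illustrative rather than taken from any particular source.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_tol=1e-6):
    """Tabular policy iteration, same P/R conventions as the value-iteration sketch."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)              # start with an arbitrary policy
    while True:
        # --- policy evaluation: value of following the *current* policy ---
        R_pi = R[np.arange(n_states), policy]            # reward under the policy, shape (S,)
        P_pi = P[np.arange(n_states), policy]            # transitions under the policy, shape (S, S)
        V = np.zeros(n_states)
        while True:
            V_new = R_pi + gamma * P_pi @ V
            if np.max(np.abs(V_new - V)) < eval_tol:
                V = V_new
                break
            V = V_new
        # --- policy improvement: act greedily w.r.t. the evaluated values ---
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):           # policy stable -> done
            return V, policy
        policy = new_policy
```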

Finally, as hinted before, Q learning is the canonical approach for when the parameters of the environment (its transition and reward structure) aren’t handed to you, so you need to learn your best strategy without knowing them up front. This can either be “model based” or “model free” (sometimes written “non model based”). The model-based approaches try to estimate the transition and reward matrices, and then use those to do value/policy iteration. Model-free means that you aren’t trying to learn those intermediate parameters of the model, but are instead trying to learn Q(s, a) directly. The approach most often used by ML researchers tends to be model-free. In practice, this means taking many rollouts during training, where the model samples an initial state and an action, and then, taking the best action known to you at each point afterward, you add up all your reward. This gets used as a training label for the value of that particular (s, a) pair, and we then update the model/network to be better.
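Here is a minimal sketch of the tabular, model-free version. It assumes a gym-style environment object with reset()/step() methods (simplified here to return (next_state, reward, done)), which is my own assumption for illustration, and it uses the standard one-step bootstrapped update rather than the full add-up-the-rewards description above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular, model-free Q learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # explore with probability epsilon, otherwise act greedily
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = Q[s].argmax()
            s_next, r, done = env.step(a)
            # move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```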

Day 23: How are Policy Gradients Different from Q Learning?

This post continues in the vein of Tuesday’s, where, instead of addressing a paper, I write up explanations or clarifications I’ve had about broader ML questions, generally by looking at a few different sources. While there’s a thrill to feeling like I’m understanding the cutting edge, in the specific domain of reinforcement learning, which I’ve never had much context to explore in the past, there are areas I need to brush up on before I can realistically follow along with recent work.

On that note, there’s another phrase I’ve heard thrown around in RL domains, without really having understood it: policy gradients. So: how would a policy gradient system work, and what are the meaningful differences between it, and the three approaches I addressed on Tuesday (value iteration, policy iteration, and Q learning)?

The mechanics of what’s going on in a policy gradient approach are actually quite straightforward. A policy, recall, is a way of taking in a state, and producing an action plan for that state. Policies can vary quite a bit. They can be discrete (if there are a fixed number of states, and you list the action for each) or continuous (taking in an input value, and outputting an action). They can also be deterministic or stochastic; in the former case, the policy is one single action, while in the latter case the policy is a distribution over actions, based on confidence about which is the optimal one. The key thing that makes a policy a policy is that it is a function between the input of a current state (s) and some set of actions (a). With policy gradients, policies are typically thought of as producing distributions over possible output actions (you can consider the case where a policy puts all of its weight on a single action as just a special case of this).

A policy, in this framework, is parameterized by some set of parameters, theta. If you had a very simple policy of, say, a weighted logistic regression over the different dimensions of your current-state vector, then the weights of that regression would be the theta. However, simple models like single-layer regressions typically aren’t sufficient for learning the relationship between input and optimal output, so more often these days you’ll find the bridge between input (current state) and output (distribution over next actions) to be a neural network, whose many internal weights would collectively be considered “theta” in this context.

A policy gradient works by generating a batch of paths (read: an action, which results in a new state and reward, which results in a new action, and so on) and keeping track of the reward at each point in the chain. We determine a gradient, a direction to move in parameter space, chosen so that updating theta increases expected reward. And, then, we update all of the weights based on that gradient, using everyone’s best friend, gradient descent (or, strictly, gradient ascent, since we’re increasing reward rather than decreasing a loss).
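As a concrete (if simplified) illustration, here is a REINFORCE-style policy gradient update in PyTorch. The network size, the discrete action space, and the use of raw discounted returns with no baseline are all my own choices for the sketch, not details from any particular paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# A tiny stochastic policy: state vector in, distribution over actions out.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update_on_batch(trajectories, gamma=0.99):
    """trajectories: list of (states, actions, rewards) tuples produced by
    running the *current* policy. REINFORCE-style update: increase the
    log-probability of actions in proportion to the discounted return
    that followed them."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        # discounted return-to-go at each step of the trajectory
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        states = torch.as_tensor(states, dtype=torch.float32)
        actions = torch.as_tensor(actions)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        dist = Categorical(logits=policy(states))
        # minimizing -logprob * return is gradient *ascent* on expected reward
        loss = loss - (dist.log_prob(actions) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```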

One interesting aspect of policy gradients, and, based on what I’ve found, the most salient practical difference from something like Q learning, is that the policy gradient is an “on-policy” approach rather than an “off-policy” one. The distinction between “on-policy” and “off-policy” relates to whether the set of data you train on is dependent on or independent of your current policy. In the case of policy gradients, in order to estimate the changes you need to make to update the policy, you need to run *this specific policy* for many trajectories, and aggregate its rewards to use in the objective. This has the implication that, if your model were to accidentally “parameter jump” into a very suboptimal place far away from the true optimal policy, that poor policy would also impact the data you were able to collect, since you train on trajectories created by running your policy on a range of inputs. This may leave you in a situation where it’s difficult, given your observations, to get sufficient signal to move yourself back into a good region of policy space. This is meaningfully worse than the kind of instability you can observe in run-of-the-mill neural nets, since however bad a mistake a supervised network makes, it can always hope to “wash out” that error by getting additional, independently-generated data to correct on. One advantage of Q learning is that it avoids this issue; by learning the value of different actions-at-a-state rather than our actual strategy, we can update our beliefs about that world model using data that doesn’t come from the policy we’ve learned so far. That said, there’s a possible critique going in the other direction: Q learning might spend too much of its resources learning about regions of the (state, action) space that are very far from the optimal policy, which makes it in some sense inefficient.

Day 24: Playing Atari With Deep RL

This paper was the one that kicked off the modern deep reinforcement learning phase, when DeepMind showed that it could use a single, deep-learning-powered algorithm to play dozens of different Atari games (49, in the 2015 Nature version of the work), many of them at or above human expert level, starting from raw pixels.

The actual algorithm is a Q learning system, like those I’ve mentioned a few times in my recent posts on RL. That means that the network is learning to predict Q, which is a function of state (s) and available action (a), and is a number that tells you how valuable each (state, action) pair is, if you assume that, after this current action, the agent performs the most optimal action for the rest of the trajectory. This assumption of “performing the optimal action from now on” is important for purposes of recursion, and thus tractability. The reason is that, under that definition, we can define Q(s, a) as the immediate reward that the agent gets for performing action a at state s, added to the largest possible Q that is reachable from the agent’s new state. An odd consequence of this setup is that the network actually uses its own outputs as a target, trying to match the network output Q(s, a) against [r + the maximum possible network output Q(s’, a’)], where s’ is the state jumped to after s.
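To make the “network uses its own outputs as a target” point concrete, here is a hedged sketch of how the DQN-style regression targets might be built. The batching conventions and names are mine, and the real system adds details I’m omitting (e.g. frame stacking, and a separate target network in the Nature version).

```python
import torch

def dqn_targets(q_net, transitions, gamma=0.99):
    """Build Q-learning regression targets from a batch of transitions.

    transitions: (states, actions, rewards, next_states, dones) tensors,
    where actions are int64 indices and dones are 0/1 floats.
    Note that the target is computed from the network's *own* output at s'.
    """
    states, actions, rewards, next_states, dones = transitions
    with torch.no_grad():                                   # targets are treated as fixed
        next_q = q_net(next_states).max(dim=1).values       # max_a' Q(s', a')
        targets = rewards + gamma * (1 - dones) * next_q    # r + gamma * max Q, unless terminal
    predicted = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return predicted, targets        # train by regressing predicted toward targets
```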

Other than applying Q learning in the context of convolutional neural networks, which are able to learn a complex representation of state from just raw pixels, another innovation introduced by this paper is experience replay. The basic idea is that you collect what are called “transitions”, where one transition is a tuple of (s, a, s’, r): initial state, action, new state, and the reward received along with the new state. These transitions are collected in the typical way: by starting at an initial state, and taking a series of actions until you get to an ending state. However, instead of learning on these transitions in order, starting from the first and working up to the last, an experience replay system randomly chooses transitions from among that chain of experience. The rationale for this is that it keeps the information your model is training on nicely balanced over the whole state space; if we learned in the order of the chain, we would have many iteration steps where we only considered states on, say, the left hand side of the board, which implicitly gives the model a skewed idea of the overall data distribution. Another benefit of experience replay is that, once you remove the need to learn on samples in sequence, you give yourself the ability to learn on one transition more than once. This is actually more valuable than it may at first appear, due to the dependence of the target value on the network’s estimate at the next state; if you re-use a sample, the sample itself may not have changed, but the network’s estimate of the Q value at the next step may have, which means there is new information.
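A replay buffer itself is a very small amount of code; here is a minimal sketch with uniform sampling and fixed capacity. The five-element transition tuple mirrors the description above, and everything else is my own illustrative choice.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, s', r, done) transitions, sampled uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def add(self, state, action, next_state, reward, done):
        self.buffer.append((state, action, next_state, reward, done))

    def sample(self, batch_size):
        # a uniform random sample breaks up the temporal correlation of the rollout
        return random.sample(self.buffer, batch_size)
```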

Looking back from the lofty heights of 2017, it may seem unremarkable to see a machine able to succeed at games in this way, but at the time — and still today, in some un-recently-biased absolute sense — this paper feels like a revolutionary first step towards a different kind of model.

Day 25: Double Q Learning

[word to the wise: this summary essentially operates in a series with the three RL summaries I posted last week, so it uses terminology, like “Q function” that I don’t exhaustively re-explain here, but for which more thorough explanations can be found in those earlier posts]

This summary goes over a paper which implements a fairly straightforward improvement to the initial 2015 DeepMind DQN paper, and which was published in the months following it, also by DeepMind, giving the impression that this was an area of research that had been in the works when the initial paper came out, but was not quite finalized at that point.

This paper is an update to the initial Q learning system, called “Double Q Learning”, that tries to correct for a source of instability in the training procedure of the original DQN. Before explaining the actual mechanism of this change, it’s important to first highlight a few aspects of the Q learning procedure that help motivate it. First, a fundamental fact of Q learning (and, in fact, many varieties of RL) is that it’s a bootstrapped estimator. By “bootstrapped”, I here just mean that the target that we compare our output to at each iteration is itself a function of the current estimate made by the model. For Q learning, we are comparing our current estimated Q [ Q(s, a) ] against the sum of (the reward that comes from taking action a at state s) and (the highest possible Q value achievable from the new state we reach after taking action a). This means that, while we can’t adjust a specific Q(s, a) against a ground truth value, we can adjust it to be consistent with its next-step neighbor; over time, the adjustments we make to account for the rewards we receive in these windows between states push us towards a more globally consistent Q estimate.

However, one problem with this approach is that, by adjusting the current Q estimate against the overall highest estimated Q at the next step, we become vulnerable to outliers and poor early estimation. If our estimate of the next-step Q function has some amount of noise in it, taking a maximum over noisy values tends to be biased upward: the algorithm will often pick up an amplified amount of that error, because it constructs the loss using the maximal value, which could be out in a long tail. We then use that over-large error in our loss function, and then in our gradients, making training more high-variance and noisy.

The solution that was proposed was to create two parallel Q-estimating networks, Q1 and Q2, alternate between them, and use information from one in order to construct the target for the other. The way this works is that, in the loss term, instead of taking max-over-all-actions Q1(s’, a) to update Q1, you split that max operator into two parts: you use the Q1 network to find which action corresponds to the maximal Q value, according to its current estimate, and then that action is evaluated by Q2, i.e. Q2(s’, best-action-according-to-Q1). This is the value (once added to the reward received by going from step s to step s+1) that the network trains against. The reciprocal happens, with the Q1/Q2 roles swapped, when we’re instead updating Q2; which one gets updated at each step is chosen by random selection. The premise of this method is that it pulls outliers in when they’re caused by a random estimation issue, since it’s unlikely that a (state, action) pair would have an anomalously high Q value in both of the Q1/Q2 functions if that high value was just the result of overfitting rather than actual meaningful learning.
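Here is a hedged sketch of that decoupled target, following the two-network, random-role-assignment scheme described above (which is closer to the original tabular Double Q learning than to the later Double DQN variant that reuses the existing target network); names and shapes are illustrative.

```python
import random
import torch

def double_q_target(selector_net, evaluator_net, reward, next_state, gamma=0.99):
    """One double-Q target: the selector picks the argmax action,
    the evaluator scores that action."""
    with torch.no_grad():
        best_action = selector_net(next_state).argmax(dim=1, keepdim=True)      # selection
        evaluated = evaluator_net(next_state).gather(1, best_action).squeeze(1)  # evaluation
    return reward + gamma * evaluated

# During training, flip a coin to decide which network gets updated this step:
# if random.random() < 0.5: regress q1 toward double_q_target(q1, q2, ...)
# else:                     regress q2 toward double_q_target(q2, q1, ...)
```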

In practice, this simple change reduces the jerkiness of training, and lets it progress much more smoothly.

Day 26: Prioritized Experience Replay

This paper is another of the significant steps forward in RL that came in the months immediately following the initial DeepMind announcement, and, like Double Q Learning, used a straightforward(ish) technique to achieve a substantial performance boost.

In the original DeepMind paper, one of their key innovations was the notion of experience replay (ER). Under an ER system, instead of applying updates from your transitions’ errors one by one, you store some number K of transitions as (State, Action, New-State, New-State-Reward) tuples, and then subsample from that K to create a minibatch of transitions, which are grouped together to calculate the error the model trains on. The theory behind this approach is that it reduces the correlation between transitions, and makes the set of transitions used for training more representative of the whole distribution, instead of having multiple repeated samples from the same region of state space, which would be the case in an online system, where each state is quite close to the last.

This paper adds an additional level of remove from the online case: in addition to sampling transitions out of the order in which they were collected, Prioritized Experience Replay additionally liberates your transitions from the frequency with which they were collected, allowing some transitions believed to contain more useful information to be used in updates more frequently. This prioritization happens on two levels:

  1. Before a transition has been evaluated, i.e. before its temporal-difference (TD) error* has been calculated, the transition is given maximal priority. This incentivizes not letting transitions languish in an un-evaluated state, and also incentivizes utilizing them soon after they are collected
  2. After a transition has been evaluated a first time, which happens the first time it’s used in an update, its priority is set to the absolute value of its TD error. This then prioritizes re-evaluating transitions that had high absolute error last time, so that you can “check” whether the updates you’ve made to your estimates have fixed the error they previously experienced

The paper suggests turning these priority scores into a true probability distribution (from which you could sample) in two ways.

  1. Proportional priority, where you use the error value directly
  2. Rank-based priority, where the priority value is equal to 1/rank, where rank is based on a sorted ordering of the error values (a rank of 1 is the highest error value).

These priority values are then divided by the sum of all priority values together, thus creating a sums-to-one probability distribution.
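Putting the two prioritization schemes and the normalization step together, a sketch might look like the following. The α exponent, which smoothly interpolates between uniform and fully greedy sampling, is a detail from the paper that isn’t discussed above; the rest of the names are my own.

```python
import numpy as np

def replay_probabilities(td_errors, mode="proportional", alpha=0.6, eps=1e-6):
    """Turn absolute TD errors into a sampling distribution over transitions.

    alpha=0 gives uniform sampling, alpha=1 gives fully greedy prioritization;
    eps keeps zero-error transitions sampleable.
    """
    td_errors = np.asarray(td_errors, dtype=float)
    if mode == "proportional":
        priorities = np.abs(td_errors) + eps
    else:  # rank-based: priority = 1 / rank, with rank 1 = largest error
        ranks = np.empty_like(td_errors)
        order = np.argsort(-np.abs(td_errors))        # indices sorted by descending |error|
        ranks[order] = np.arange(1, len(td_errors) + 1)
        priorities = 1.0 / ranks
    scaled = priorities ** alpha
    return scaled / scaled.sum()                      # normalize to sum to one
```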

The logic behind this system is that we should prioritize learning off of transition-examples that we’ve either never before seen, or on which the model has historically performed badly. You can think of this a little like boosting in a supervised learning context, where you try to push your classifier to perform well on samples that your current model performs the most poorly on.

*Temporal-difference (TD) error is the gap between our current estimate of Q(s, a) and (the reward resulting from action a + our current estimate of the best Q that you can reach on your next step, after taking action a). The idea here is that, if we had a perfect estimate, these two quantities would be exactly equal, because the Q value at (s, a) should already account for the reward it’s about to receive.
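Written out, for a single transition (s, a, r, s’) and discount factor γ, that gap is:

```latex
\delta = \underbrace{\big( r + \gamma \max_{a'} Q(s', a') \big)}_{\text{bootstrapped target}} \; - \; \underbrace{Q(s, a)}_{\text{current estimate}}
```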

Day 27: Asynch Advantage Actor-Critic

And now we come to an algorithm with an elegant solution to a recurring RL problem…but a name I can never verbalize correctly for the life of me (I always end up saying “AC3” instead of “A3C”): Asynchronous Advantage Actor-Critic. Though I shouldn’t criticize too much on the name front, because apparently, back in the 90s, there was an algorithm whose initials acronymed into “REINFORCE”. That one would have had some issues with Google-ability.

I think the clearest way to understand this approach overall is to split up the “Asynchronous” bit and the “Advantage Actor-Critic” bit, because they really are pretty distinct ideas. So, let’s start with “Asynchronous”:

To understand the approach proposed by this paper, it’s useful to go back, far into the depths of machine learning past (by which I mean: 2015), to the key innovation that kicked off the renewed interest in neural-net based RL: experience replay. The reason that experience replay was useful is that, prior to its introduction, reinforcement learning systems were trained in an online way, so the updates that an algorithm would see were non-stationary and highly correlated. This makes sense if you think about the process of exploring a space: when you take one move, you end up close to where you were, and the samples you would get from a 10-second set of frames right now are likely quite different in distribution from a similarly-sized set of frames from 40 seconds ago. Experience replay helped fix this because, for any given gradient update, you’d have a mixture of samples from the past, which, in aggregate, are a better representation of the overall performance of your current set of estimates.

However, the experience replay framework introduced limitations: it meant that, when creating your gradient batches, you would end up re-using transitions that were generated under a prior version of your parameters. This is perfectly fine for off-policy learning (for example, where you’re just trying to estimate the value of each state assuming that you follow the universally optimal policy), since, defined that way, the value of a state is independent of whatever policy you happened to be pursuing at the time you encountered it. However, this is an issue when you’re using an on-policy approach: for example, having a parameterized policy and using each transition to ascertain how good that policy is. In that case, having old samples isn’t workable, because you’d be creating a gradient on your current policy based on a loss function engendered by a prior version of your policy, which doesn’t really make sense. So that’s a lot of why you had previously seen more work on off-policy algorithms, like Q learning: because they fit into the experience replay framework.

With a goal of remedying this shortfall of ER, the team behind this paper (Googlers + DeepMind folks, subsequent to DeepMind having been bought by Google) came up with an alternative: using asynchronous actors. The notion here is that you have multiple agents traversing the environment, each with slightly different starting conditions*, but all pulling from the same shared parameters. Each agent accumulates some small batch of transitions, and then, at some pre-determined interval, sends the accumulated gradient of those transitions to a central thread whose job it is to receive updates from all the different actor-threads. Then, at some different (hyperparameter-set) frequency, each actor pulls down a new set of master parameters from that central aggregator. The overall effect of this scheme is that, because each actor is in a different region of state space at any given point in time, the central aggregation thread sees a mixed, reasonably uncorrelated set of examples when it does gradient updates, but, from the perspective of each learner, those updates are still applied in an online way, so we don’t re-use transitions that occurred under an old policy. This means that we can use on-policy algorithms in addition to off-policy ones on this system.

So, that covers the “Asynchronous” bit. Now, to the “Advantage Actor-Critic”. This part just comes from the DeepMind team using their asynchronous architecture on four different, previously-invented RL algorithms (SARSA, one-step Q learning, multi-step Q learning, and advantage actor-critic), and advantage actor-critic happening to do the best. It is indeed true that, because actor-critic is an on-policy method, you couldn’t have used it in an experience replay setup. A very brief summary of advantage actor-critic is that it combines the two main paradigms of RL: the value paradigm, where you optimize your understanding of the value of different states and only extract the optimal policy at the end, and the policy paradigm, where you optimize your strategy directly, without learning an explicit model of the states and their values. It does this by calculating the value obtained from following the current policy at this point, and subtracting out an estimate of the average value of the state at that point. That way, you calculate the “advantage”, or how much better this outcome was relative to the expected average value at this state. This approach is seen as combining the best of both worlds: converging faster than a pure value algorithm, which tries to learn about states it won’t need in an optimal policy, and being lower variance than a pure policy algorithm, whose raw returns can differ wildly across states if you don’t normalize them.
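As a rough sketch of what the “advantage” part buys you, here is one simplified way the two loss terms of an advantage actor-critic update might be written. This is PyTorch, and the use of observed returns as the target (rather than the n-step bootstrapped estimates the actual A3C paper uses) is a simplification of mine.

```python
import torch

def actor_critic_losses(policy_logits, values, actions, returns):
    """Minimal advantage actor-critic losses for one batch of on-policy data.

    values: the critic's V(s) estimates; returns: observed discounted returns.
    The advantage measures how much better the outcome was than the critic expected.
    """
    advantages = returns - values                        # A(s, a) ~= R - V(s)
    dist = torch.distributions.Categorical(logits=policy_logits)
    policy_loss = -(dist.log_prob(actions) * advantages.detach()).mean()  # actor term
    value_loss = advantages.pow(2).mean()                                  # critic term
    return policy_loss, value_loss
```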

*I think what differs is starting conditions, but it was also mentioned that they have different “exploration policies”; I haven’t gotten a good explanation of how an “exploration policy” differs from an old fashioned policy policy, so I’m a bit fuzzy on this bit atm

Day 28: Alpha Go (the original one)

In which there are altogether too many acronyms

This AlphaGo paper is interesting to me on both an object level and a meta level; the latter because so many of the ways in which it is (to me, at least) unintuitive and complex are artifacts of its status as a crossover between fields, pulling in some techniques I’d never heard of (like Monte Carlo Tree Search) and interspersing them with ones I’m more familiar with. It also has the cobbled-together feel of “let’s take an already existing framework, and graft our neural-network-based systems onto it”. And, don’t get me wrong, it works enormously well. But it’s a bit hard to explain, because of the number of different components, and the sometimes unintuitive ways they interact with each other. All of that being said by way of slightly excusing myself for this being a bit hard to follow: let’s dive in.

Since I find it valuable to have a view of the full picture, I’m going to explain somewhat in reverse: starting with the ultimate algorithm, which combines many components, and then explaining how those components came to be. In general, it works by:

  • At any given state in the game, run several simulations. The goal of these simulations is, broadly speaking, “get a sense of the value of the possible moves you could make”. However, in a game with roughly 250 legal moves at each turn and a typical game length of around 150 moves before any end-game reward, fully searching the full depth of each tree in the space is not computationally feasible. So, we use neural networks to help us sample actions intelligently for these simulations, and also to provide an estimate of the value of a state without having to search the full depth of that tree, all the way to a reward.
  • As best I understand it, at any given point in the simulations, the algorithm chooses whichever child node has the highest value of an objective function that weighs 1) some initial probability P(s, a) of choosing that action in that state, 2) an estimate of the value of the state we’d end up in, and 3) a bias towards picking moves we’ve never picked before, to explore the space more broadly (there’s a small sketch of this selection rule after this list).
  • The mechanical way a simulation works is that, along this tree of possible moves, each edge has an “estimated value of this state-action” [Q(s, a)] and a “prior probability of this state-action” [P(s, a)]. It walks the tree of possible moves until it gets to a point in the tree that it hasn’t reached yet. At that point, the network wants an estimate of how valuable the state resulting from this untried move would be.
  • For this, the algorithm combines two methods. The first is fairly simple: use a simple and quick-to-run policy to “roll out” the rest of the game from that state onward, picking an action according to your “fast policy” at each juncture, and eventually hitting some reward (+1 for a win, -1 for a loss). The second is a learned value network, v(s). These two outputs, the rollout reward and the value network’s estimate, are combined in a very straightforward linear combination (literally something like λ*value + (1-λ)*rollout, where the mixing weight λ is a chosen hyperparameter)
  • This combined value estimate is then used to update the estimated state-action value of the state-action that we started this whole process with, which will then make that pathway likelier to be chosen on future simulations.
  • After simulations are complete, the network plays the action that was visited the greatest number of times across all the simulations.
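Referring back to the selection rule two bullets up, here is a rough sketch of what choosing a child move might look like. The tree/edge data structure and the exact form of the exploration bonus (a PUCT-style term scaled by the prior and damped by the visit count) are my own illustrative choices, not the paper’s exact formula.

```python
import math

def select_action(node, c_puct=1.0):
    """Pick the child edge maximizing Q(s, a) plus an exploration bonus U(s, a).

    node.edges maps action -> edge with fields: Q (mean value), P (prior
    probability from the policy network), N (visit count). The bonus is
    largest for high-prior, rarely-visited actions.
    """
    total_visits = sum(edge.N for edge in node.edges.values())

    def score(edge):
        u = c_puct * edge.P * math.sqrt(total_visits + 1) / (1 + edge.N)
        return edge.Q + u

    return max(node.edges, key=lambda a: score(node.edges[a]))
```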

If you were playing along at home, there are 3 components in that above explanation whose provenance didn’t get explained.

  1. The initial probabilities, P(s, a)
  2. The value function, v(s)
  3. The fast policy function

Both 3) and 1) were policy network models, trained on a supervised learning (SL) dataset of human expert games. These systems were updated to make the kinds of moves made by human experts more likely to be made in the future. The difference is that the first one (often referred to as the SL network in the paper) is more complex, with 13 layers and convolutions and whatnot. The fast policy is, by design, very simple and fast. It’s not very successful on its own, only winning ~27% of the time. But since it is vastly faster to evaluate than the non-fast policy, this fast policy is quite useful for the rollouts mentioned earlier, since it’s very quick to traverse from a given state to the end of the game. The more complex SL network outputs a distribution over potential actions given a state, and those are used to give prior/initial probabilities to each of the transitions.

2), the value estimator, is trained to predict the scalar end-of-game reward when two AI players play each other; its goal is to estimate that end-game reward as closely as possible. An interesting quirk of the AlphaGo setup is that the policy used by the two self-playing networks in this setup isn’t actually the supervised learning one mentioned above, but is instead a reinforcement learning policy, in the sense that it takes the SL solution as a starting point and learns from there by playing games against older versions of itself.

Interestingly, this more complex, RL-fine-tuned policy is better for helping train the value function, but works less well for providing the priors inside the tree search than the un-fine-tuned supervised learning network that was initially trained on expert human data.

Overall, these three policy networks + value network combine to be one of the most notable AI systems to date, and beat a professional player at Go.

Day 29: AlphaGo Zero

Just over a month ago, DeepMind released a paper that was both breathtaking in its performance and also an obvious next step, given their prior work and clear ambitions: a neural network that decisively surpassed the version of AlphaGo that beat the world’s most renowned human Go players, and that learned entirely from the rules of the game, without human knowledge. In a lot of ways, the things that make this achievement so astounding also make it considerably more straightforward to summarize than yesterday’s original AlphaGo (Alpha Go Negative One? I’m going to go with AGNO for the remainder, just to have an easy handle; the new system is AlphaGo Zero, or AGZ). Where AGNO felt very cobbled together, this system looks downright elegant, even though it’s based on some similar principles.

So, what are some similarities and differences between the two systems?

  • Both generate their moves using a tree search procedure (MCTS) based on simulations, where, starting from the current board state, we at each step take the action that maximizes the sum of Q(s, a) [the expected value of that action at that state] and U(s, a) [an exploration term that gives higher values to actions that have high probability under some learned policy, or that have not been visited much in the past]. When we hit a leaf node, we gather its prior policy probability P(s, a) and some estimate of the value of the state, V(s).
  • However, in AGNO, this estimated value was created by combining the output of a separately-trained value network with a value derived from executing a simple heuristic policy from the state in question to the end of the game, to see what result follows. In AGZ, there is one single network that outputs both the estimated value v and the policy probabilities, and no heuristic play-till-end-of-game rollouts are used.
  • In both systems, after a leaf node has been evaluated, the information on all the edges traversed earlier in the tree is updated. N(s, a), which represents the number of simulations that have traversed that edge (an edge here is a particular action from a particular state), is incremented. Q(s, a), which represents the average expected value of that state-action, is updated using the estimated value, v. It’s easiest to think about this being done through an intermediate value, W, that keeps track of the total value that has come through this edge; W is updated to be W + v, and then Q is updated to be this newly-updated value of W divided by the newly-updated counter N.
  • These updated N and Q values will influence the decisions made by the next simulation, making actions with better estimated values later on more likely
  • After all simulations (this is a hyperparameter; the network in the paper used 1600) have been completed, we come up with a new “empirical action probability” distribution, denoted by pi. (this would be a symbol if FB could render LaTeX). This pi is indexed by state, and it tells you, for a given state, the distribution over the actions taken by all the simulations. So, if there were two possible actions, and 1200 simulations took action A, with 400 taking action B, pi would be set to (0.75, 0.25).
  • The network makes the move that has the highest value in the distribution pi
  • Then, the opposing player (also the network) does the same thing, and the game progresses onward
  • In AGNO, the policy network was initially trained on expert human moves (as in: at a given state, predict the move that a human would have played here), and then “fine tuned” by using self-play. The value network was learned using self-play of the reinforcement learning network. In AGZ, the policy and value networks are trained together, all using self-play. Self-play works by having the network play itself, and update parameters after each game has come to an end
  • These updates are done by taking a uniform sample over all the time steps in the game (putting both “players” into the pot before taking the sample), and calculating a joint loss function that A) tries to push the probabilities P(s, a) closer to the empirical probabilities pi that were generated by the MCTS simulation process, and B) tries to push the value estimate for each state closer to +1 if that player eventually won after being at that state, and closer to -1 if the player eventually lost after being in that state (there’s a small sketch of this loss after this list)
  • Then, another self-play game is started, using these newly-updated parameters
  • Mostly orthogonal to the whole Reinforcement Learning system, another change between AGNO and AGZ was the switch from a “vanilla” convolutional network to a residual convolutional network, which has demonstrated notably excellent performance in non-RL Deep Learning domains. And, empirically, it does look like a sizable proportion of the improvement in final Elo ranking is attributable to just this change.
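For the parameter update described a few bullets up, a hedged sketch of the joint loss might look like this. It’s PyTorch, the regularization coefficient and tensor conventions are illustrative, and only the overall shape of the loss (value error plus policy cross-entropy plus weight regularization) follows the paper’s description.

```python
import torch
import torch.nn.functional as F

def agz_loss(policy_logits, value_pred, pi_target, z_outcome, params, c_reg=1e-4):
    """Joint loss over one batch of sampled self-play positions.

    pi_target: MCTS visit-count distributions (the "pi" above), shape (B, moves).
    z_outcome: +1 / -1 game results from the perspective of the player to move.
    """
    value_loss = F.mse_loss(value_pred.squeeze(-1), z_outcome)   # push v toward the game outcome z
    policy_loss = -(pi_target * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()  # push p toward pi
    l2 = sum((p ** 2).sum() for p in params)                     # weight regularization
    return value_loss + policy_loss + c_reg * l2
```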

In total, this system experienced 4.9 million games of self-play, over the course of 3 days, running on a single machine with four TPUs (Tensor Processing Units). By contrast, AGNO took several months to train, and played using 48 TPUs.

4.9 million games is undoubtedly more games of Go than any single human could ever complete in their lifetime; it makes me wonder (slightly philosophically) if the right dimension along which to be thinking of AI superiority is not so much superior computing power, but instead the ability to learn very quickly from tasks that take humans much longer to perform. An interesting thing to consider is: what other tasks have the property of being able to be simulated very quickly in a high-quality way? It seems like useful criteria for this property are “involves manipulating the world or physics in a way where it’s trivial to calculate an ultimate end state without human involvement”, and “doesn’t involve interacting with actual 3D objects in the world, or with human systems”. I’d be quite interested to hear your thoughts on what tasks you think are similar to Go along the relevant dimensions, and which are not.
