TD-Gammon algorithm

Swasti Khurana · Published in Clique Community · Jul 9, 2020
Initial arrangement of black and white pieces in Backgammon. Source: https://www.ancientgames.org/nard-original-backgammon/backgammon-board-and-initial-position-according-to-bozorgmehrs-rules/

Introduction

In this article I will discuss TD learning, the TD-Gammon algorithm, its implementation, and why it was such a big success.

Temporal difference (TD) learning is an approach to learning how to predict a quantity that depends on future values of a given signal. The name TD derives from its use of changes, or differences, in predictions over successive time steps to drive the learning process. The prediction at any given time step is updated to bring it closer to the prediction of the same quantity at the next time step. TD algorithms are often used in reinforcement learning to predict a measure of the total amount of reward expected over the future, but they can be used to predict other quantities as well.

TD learning is a combination of Monte Carlo ideas and dynamic programming ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like dynamic programming, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome.

The basic idea of TD methods is that learning is based on the difference between temporally successive predictions. In other words, the goal of learning is to make the learner’s prediction for the current input pattern more closely match the prediction at the next time step.

In layman’s terms, TD is a method in which learning takes place after each interaction (or perhaps a few interactions) with the environment. This is in contrast to Monte Carlo learning, where learning takes place only at the end of an episode.
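To make this concrete, here is a minimal sketch of the tabular TD(0) update for state-value prediction. The tiny three-state chain, the reward, and the step size below are made-up illustrations, not anything from TD-Gammon:

```python
# Minimal tabular TD(0) sketch: after each transition, move the value
# estimate of the current state toward the one-step target r + gamma * V(s').
alpha = 0.1   # learning rate (illustrative value)
gamma = 1.0   # no discounting, as in episodic games such as backgammon

V = {"s0": 0.0, "s1": 0.0, "terminal": 0.0}  # value estimates per state

def td0_update(V, s, r, s_next):
    """Update V[s] toward the temporal-difference target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# One hypothetical episode: s0 -> s1 with reward 0, then s1 -> terminal
# with reward 1 (a "win"). Learning happens after every single transition.
td0_update(V, "s0", 0.0, "s1")
td0_update(V, "s1", 1.0, "terminal")
print(V)  # {'s0': 0.0, 's1': 0.1, 'terminal': 0.0}
```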

The formation of TD(λ) algorithm using TD(0) and TD(1) can be found here: https://www.youtube.com/watch?v=V0nhLV0ssEc

A brief description of Backgammon

Backgammon is one of the oldest known board games. If you know the game, you might know that the number of possible backgammon positions is enormous, and so is the number of moves from each position. For a typical dice roll, there might be about 20 different ways of playing. Furthermore, the future moves of the opponent also have to be considered. The game tree has an effective branching factor of about 400, which is far too large to permit effective use of the conventional heuristic search methods that have proved effective in games like chess and checkers.

Although the game is highly stochastic (due to the rolling of the dice), a complete description of the game’s state is available at all times. The game evolves over a sequence of moves and positions until it finally ends in a win for one player or the other.

A position from the middle of a game of Backgammon, with the dice showing a five and a two and one black piece on the bar. Source: http://incompleteideas.net/book/first/ebook/node108.html#nonlinearTDrule

TD-Gammon

TD-Gammon is considered the greatest success story of Reinforcement Learning. This is because it required little backgammon knowledge yet learned to play extremely well, near the level of the world’s strongest grandmasters.

TD-Gammon was designed as a way to explore the capability of multilayer neural networks trained by TD(λ) to learn complex nonlinear functions. TD-Gammon contains a neural network at its heart (represented in the figure below), organized in a standard multilayer perceptron (MLP) architecture. The MLP architecture can be thought of as a generic nonlinear function approximator. Its output is computed by a feed-forward flow of activation from the input nodes to the output nodes, passing through one or more layers of internal nodes called hidden nodes. Each of the connections in the network is parameterized by a real-valued weight. Each node in the network outputs a real number equal to a weighted linear sum of the inputs feeding into it, followed by a nonlinear sigmoidal squashing operation that maps the total summed input into the unit interval.

An illustration of the MLP architecture used in TD-Gammon’s neural network. Source: https://dl.acm.org/doi/pdf/10.1145/203330.203343

The nonlinearity of the squashing function enables the neural network to compute nonlinear functions of its input, and the precise function implemented depends on the actual values of the weights. The process of learning in an MLP consists of applying a formula for changing the weights so that the function implemented by the network more closely approximates a desired target function.

The training procedure for TD-Gammon is as follows: the network observes a sequence of board positions starting at the opening position and ending in a terminal position characterized by one side having removed all its checkers. The board positions are fed as input vectors x[1], x[2], …, x[f] to the neural network, encoded using a representation scheme described later in this article (each time step in the sequence corresponds to a move made by one side, i.e., a “ply” or a “half-move” in game-playing terminology). For each input pattern x[t] there is a neural network output vector Y[t] indicating the network’s estimate of the expected outcome for pattern x[t]. (For this system, Y[t] is a four-component vector corresponding to the four possible outcomes of either White or Black winning either a normal win or a gammon.) At each time step, the TD(λ) algorithm is applied to change the network’s weights.
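The weight update is given by (Tesauro, 1995):

wₜ₊₁ − wₜ = 𝛼 (Yₜ₊₁ − Yₜ) Σₖ₌₁ᵗ λᵗ⁻ᵏ ∇wYₖ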

where,

𝛼 — learning rate

w — vector of weights

∇wYₖ — gradient of the network output Yₖ with respect to the weights

The quantity λ is a heuristic parameter controlling the temporal credit assignment of how an error detected at a given time step feeds back to correct previous estimates. When λ=0, no feedback occurs beyond the current time step, while when λ=1, the error feeds back without decay arbitrarily far in time. Intermediate values of λ provide a smooth way to interpolate between these two limiting cases.

At the end of each game, a final reward signal z is given, based on the outcome of the game. The preceding equation is used to change the weights, except that the difference (z − Y[f]) is used instead of (Y[t+1] − Y[t]).

First version: TD-Gammon 0.0

TD-Gammon used a nonlinear form of the TD(λ) algorithm. The estimated value Vₜ(s) of any state s (board position) was meant to estimate the probability of winning starting from state s. To achieve this, rewards were defined to be zero at all times except when the game was won. To implement the value function, TD-Gammon used a standard multilayer neural network. The input to the network was a representation of a backgammon position, and the output was an estimated value of that position.

Backgammon positions were represented to the network in a relatively direct way that involved little backgammon knowledge. However, it did involve substantial knowledge of how neural networks work and how best to present information to them.

There are a total of 198 input units to the neural network. For each point on the backgammon board, four units encode the number of white pieces on that point and another four encode the number of black pieces. The 198 units break down as follows:

- 24 points × 2 colours × 4 units per point: 192 units
- Pieces on the bar (one unit per colour): 2 units
- Pieces already borne off (one unit per colour): 2 units
- Whose turn it is to move: 2 units
- Total: 198 units

Tesauro made little attempt to minimise the number of units. He provided one unit for each conceptually distinct possibility that seemed likely to be relevant, and he scaled them to roughly the same range, in this case between 0 and 1.
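As a rough illustration of this encoding, here is a Python sketch that builds a 198-unit input vector along these lines, following the description in Sutton and Barto; the board representation and function signature are assumptions made for the example, not Tesauro’s actual code:

```python
import numpy as np

def encode_position(white, black, white_bar, black_bar,
                    white_off, black_off, white_to_move):
    """Sketch of a 198-unit TD-Gammon-style input encoding.
    `white` and `black` are length-24 lists giving the number of pieces
    of each colour on every point (an assumed board representation)."""
    units = []
    for counts in (white, black):              # 24 points x 2 colours
        for n in counts:                       # 4 units per point per colour
            units += [
                1.0 if n >= 1 else 0.0,
                1.0 if n >= 2 else 0.0,
                1.0 if n >= 3 else 0.0,
                (n - 3) / 2.0 if n > 3 else 0.0,   # extra pieces, scaled
            ]
    units += [white_bar / 2.0, black_bar / 2.0]    # pieces on the bar
    units += [white_off / 15.0, black_off / 15.0]  # pieces borne off
    units += [1.0, 0.0] if white_to_move else [0.0, 1.0]  # whose turn
    return np.array(units)    # 24 * 2 * 4 + 2 + 2 + 2 = 198 units
```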

Given a representation of a backgammon position, the network computed its estimated value in the standard way. Corresponding to each connection from an input unit to a hidden unit was a real-valued weight. Signals from each input unit were multiplied by their corresponding weights and summed at the hidden unit. The output, h(j), of hidden unit j was a nonlinear sigmoid function of the weighted sum:
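h(j) = σ( Σᵢ wᵢⱼ ϕ(i) ) = 1 / (1 + e^( −Σᵢ wᵢⱼ ϕ(i) ))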

where

ϕ(i) — the value of the ith input unit

wᵢⱼ — the weight of its connection to the jth hidden unit.

The output of the sigmoid is always between 0 and 1, and has a natural interpretation as a probability based on a summation of evidence. The computation from hidden units to the output unit was entirely analogous. Each connection from a hidden unit to the output unit had a separate weight. The output unit formed the weighted sum and then passed it through the same sigmoid nonlinearity.
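Putting the two stages together, here is a minimal sketch of the forward pass of such a network. The layer sizes and random weights are placeholders rather than TD-Gammon’s trained values, and the single output unit is a simplification of the four-output vector described earlier:

```python
import numpy as np

def sigmoid(x):
    """Squash a weighted sum into the unit interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_in, w_out):
    """Feed-forward pass: encoded position -> hidden sigmoid units -> a
    single sigmoid output interpreted as the estimated value of x."""
    hidden = sigmoid(w_in @ x)      # h(j) = sigma(sum_i w_ij * phi(i))
    return sigmoid(w_out @ hidden)  # same squashing at the output unit

# Hypothetical sizes: 198 inputs and 40 hidden units (an example size only).
rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.1, size=(40, 198))   # input -> hidden weights
w_out = rng.normal(scale=0.1, size=(40,))      # hidden -> output weights
x = np.zeros(198)                              # an encoded board position
print(forward(x, w_in, w_out))                 # estimated value in (0, 1)
```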

TD-Gammon used the gradient-descent form of the TD(λ) algorithm, in its backward (eligibility-trace) view.

In gradient-descent methods, the parameter vector is a column vector with a fixed number of real-valued components,
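θₜ = ( θₜ(1), θₜ(2), …, θₜ(n) )ᵀ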

(the T here denotes transpose), and Vₜ(s) is a smooth differentiable function of θₜ for all s ∈ S.

A backward view of gradient descent TD(λ) is given by:
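θₜ₊₁ = θₜ + 𝛼 δₜ eₜ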

where δₜ is the usual TD error,
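δₜ = rₜ₊₁ + 𝛾 Vₜ(sₜ₊₁) − Vₜ(sₜ)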

and eₜ is a column vector of eligibility traces, one for each component of θₜ, updated by
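eₜ = 𝛾 λ eₜ₋₁ + ∇θₜ Vₜ(sₜ),  with e₀ = 0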

A complete algorithm for on-line gradient-descent TD(λ):

Source: http://incompleteideas.net/book/first/ebook/node87.html#eq:TDlforward
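The linked algorithm can be sketched in Python roughly as follows; the environment and value-function interfaces, the policy, and the parameter values are hypothetical placeholders, not part of any published implementation:

```python
import numpy as np

def online_td_lambda(env, value_fn, policy, num_episodes,
                     alpha=0.1, gamma=1.0, lam=0.7):
    """On-line gradient-descent TD(lambda), following the equations above.
    Assumed (hypothetical) interfaces:
      env.reset() -> state;  env.step(action) -> (next_state, reward, done)
      policy(state) -> action
      value_fn.value(s) -> float;  value_fn.grad(s) -> dV/dtheta
      value_fn.theta   -> flat numpy parameter vector, updated in place."""
    for _ in range(num_episodes):
        s = env.reset()
        e = np.zeros_like(value_fn.theta)        # eligibility trace, e_0 = 0
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_next = 0.0 if done else value_fn.value(s_next)
            delta = r + gamma * v_next - value_fn.value(s)   # TD error
            e = gamma * lam * e + value_fn.grad(s)           # trace update
            value_fn.theta += alpha * delta * e              # weight update
            s = s_next
    return value_fn
```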

For the backgammon application, in which 𝛾 = 1 and the reward is always zero except upon winning, the TD error portion of the learning rule simplifies to V(sₜ₊₁) − V(sₜ).

Tesauro obtained an unending sequence of games by playing his learning backgammon player against itself. Initially, the weights of the network were set randomly to small values, so the initial evaluations were arbitrary. Since moves were selected on the basis of these evaluations, the initial moves were inevitably poor, and the initial games often lasted hundreds or thousands of moves before one side or the other won, almost by accident. After a few dozen games, however, performance improved rapidly. After playing about 300,000 games against itself, TD-Gammon 0.0, as described above, learned to play approximately as well as the best previous backgammon computer programs. This was a striking result because all the previous high-performance computer programs had used extensive backgammon knowledge.
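For completeness, here is a sketch of how moves can be chosen from such evaluations during self-play: score the position resulting from each legal move with the current network and keep the best one. The helper functions (legal_moves, apply_move, encode) and the value-function interface are hypothetical stand-ins:

```python
def choose_move(board, dice, player, value_fn, legal_moves, apply_move, encode):
    """Greedy one-ply move selection: evaluate the position after each legal
    move with the current network and return the highest-valued move.
    All helpers passed in are hypothetical stand-ins for real game logic."""
    best_move, best_value = None, float("-inf")
    for move in legal_moves(board, dice, player):
        afterstate = apply_move(board, move, player)
        v = value_fn.value(encode(afterstate))   # estimated P(player wins)
        if v > best_value:
            best_move, best_value = move, v
    return best_move
```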

Subsequent versions

After the success of TD-Gammon 0.0 with essentially zero backgammon knowledge, the natural next step was to add specialized backgammon features. This produced TD-Gammon 1.0, which included such features while keeping the self-play TD learning method. Later versions of the program, TD-Gammon 2.0 (40 hidden units) and TD-Gammon 2.1 (80 hidden units), were augmented with a selective two-ply search procedure. To select moves, these programs looked ahead not just to the positions that would immediately result, but also to the opponent’s possible dice rolls and moves, assuming the opponent always took the move that appeared immediately best for it. TD-Gammon 3.0 uses 160 hidden units and a selective three-ply search.
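Extending the one-ply selection sketch above, here is a rough illustration of such a two-ply lookahead: for each of our candidate moves, average the evaluation over the opponent’s 21 distinct dice rolls, assuming the opponent replies with the move that is worst for us. The helpers are again hypothetical, and the pruning of candidates that makes the real search “selective” is omitted:

```python
from itertools import combinations_with_replacement

# The 21 distinct dice rolls and their probabilities (doubles occur with
# probability 1/36, every other combination with probability 2/36).
DICE_ROLLS = [((d1, d2), (1 if d1 == d2 else 2) / 36.0)
              for d1, d2 in combinations_with_replacement(range(1, 7), 2)]

def choose_move_two_ply(board, dice, player, opponent, value_fn,
                        legal_moves, apply_move, encode):
    """Sketch of a two-ply lookahead with hypothetical helpers: the opponent
    is assumed to pick the reply that minimises our estimated winning chances,
    and our candidate move with the best expected value is returned."""
    best_move, best_value = None, float("-inf")
    for move in legal_moves(board, dice, player):
        after = apply_move(board, move, player)
        expected = 0.0
        for roll, prob in DICE_ROLLS:
            replies = legal_moves(after, roll, opponent)
            if replies:
                worst = min(value_fn.value(encode(apply_move(after, r, opponent)))
                            for r in replies)
            else:
                worst = value_fn.value(encode(after))   # opponent cannot move
            expected += prob * worst
        if expected > best_value:
            best_move, best_value = move, expected
    return best_move
```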

Why did TD-Gammon work?

Although TD-Gammon is one of the major successes in machine learning, it has not led to similarly impressive breakthroughs in temporal difference learning for other applications or even other games (like chess, Go, and checkers). Pollack and Blair (1997) showed that some of TD-Gammon’s success could be replicated by developing a competitive evaluation function on a 4,000-parameter feed-forward neural network, without using back-propagation, reinforcement learning, or temporal difference learning methods. Instead, they applied simple hill-climbing in a relative fitness environment. These results and further analysis suggest that the surprising success of Tesauro’s program had more to do with the co-evolutionary structure of the learning task and the dynamics of the backgammon game itself.
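As a heavily simplified sketch of what hill-climbing in a relative-fitness setting looks like, the loop below perturbs the current weights, lets the challenger play the champion, and moves toward the challenger only if it wins the match; the mutation scale, match length, blending rule, and the play_match helper are illustrative assumptions, not Pollack and Blair’s exact procedure:

```python
import numpy as np

def hill_climb(play_match, num_generations, num_weights=4000,
               noise_scale=0.05, games_per_match=4, blend=0.05):
    """Simplified hill-climbing against a relative-fitness criterion.
    `play_match(w_a, w_b, n)` is a hypothetical helper that plays n games of
    backgammon between two weight vectors and returns the wins of the first."""
    rng = np.random.default_rng(0)
    champion = rng.normal(scale=0.1, size=num_weights)
    for _ in range(num_generations):
        challenger = champion + rng.normal(scale=noise_scale, size=num_weights)
        wins = play_match(challenger, champion, games_per_match)
        if wins > games_per_match / 2:
            # Blend toward the challenger rather than replacing the champion
            # outright, to smooth out lucky wins.
            champion = (1 - blend) * champion + blend * challenger
    return champion
```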

Self-playing learners usually develop eccentric and brittle strategies which allow them to draw each other, yet play poorly against humans and other programs. Yet Tesauro’s 1992 result showed that this self-play approach could be powerful.

Tesauro (1992) pointed out some of the features of backgammon that make it suitable for approaches involving self-play and random initial conditions. Unlike chess, a draw is impossible, and a game played by an untrained network making random moves will eventually terminate (though it may take much longer than a game between competent players). Moreover, the randomness of the dice rolls leads self-play into a much larger part of the search space than it would be likely to explore in a deterministic game. It is not simply the dice rolls that overcome the problems of self-play learning, however; the dynamics of backgammon contain something critical that sets it apart from other games with random elements, like Monopoly. The outcome of the game remains uncertain until all contact is broken and one side has a clear advantage. What many observers find exciting about backgammon, and what helps a novice sometimes overcome an expert, is the number of situations where one dice roll, or an improbable sequence, can dramatically reverse which player is expected to win.

The primary cause of TD-Gammon’s success therefore appears to be the dynamics of backgammon combined with the power of co-evolutionary learning. If we can isolate the features of the backgammon domain that enable co-evolutionary learning to work so well, it may lead to a better understanding of when such self-play methods can be expected to succeed more generally.

Conclusion

Let me conclude by saying that TD-Gammon was a novel and interesting approach to the game with an astounding result. It exhibited the potential of Reinforcement Learning in an unprecedented manner. But, like everything, it had its shortcomings: its approach did not transfer to other games such as chess, Go, and checkers. Nevertheless, this algorithm opened up new possibilities in the field of Reinforcement Learning and prompted researchers to delve deeper into it.

References

[1] Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58–68. https://dl.acm.org/doi/pdf/10.1145/203330.203343

[2] Pollack, J. B., & Blair, A. D. (1997). Why did TD-Gammon work? In Advances in Neural Information Processing Systems (pp. 10–16). https://papers.nips.cc/paper/1292-why-did-td-gammon-work.pdf

[3]http://incompleteideas.net/book/first/ebook/node108.html#nonlinearTDrule

[4]http://incompleteideas.net/book/first/ebook/node87.html#eq:TDlforward

[5] Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

[6] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. https://arxiv.org/pdf/1312.5602.pdf

[7] Andrew G. Barto (2007). Temporal difference learning. Scholarpedia, 2(11):1604. http://www.scholarpedia.org/article/Temporal_difference_learning
