Remember Your Path to Success — Why Machine Learning has the potential to exceed human knowledge.

Patrick Michalsky
Published in omi-uulm · 12 min read · Apr 14, 2021

Nowadays, many companies rely on the deployment of machine learning algorithms to solve increasingly challenging optimisation problems. However, not every machine learning technique is equally suitable. Limiting factors such as accuracy, precision, recall, runtime efficiency and memory efficiency have a significant influence on the operational scope of each algorithm.

In this blog post, we describe AlphaZero, an all-purpose machine learning algorithm that improves both the runtime efficiency and the overall performance of automated decision-making by combining the non-linear function approximation of an artificial neural network with the powerful, domain-independent Monte Carlo Tree Search algorithm. The decisive advantage of AlphaZero lies in its independent and unbiased learning of potential solutions, which enables it not only to outperform established decision-making algorithms, but also to overcome the entrenched thinking patterns of human knowledge.

Human Knowledge

To understand how it is possible to design algorithms capable of solving existing optimisation problems while overcoming explicit human knowledge, it is essential to first take a look at the general process of knowledge creation.

Generally, most of what we understand as knowledge can be simplistically interpreted as the accumulation of skills acquired through experience. The distinction between the experiences we have had ourselves and the knowledge we have gained through the experiences of others can be considered negligible in this context. However, to acquire knowledge, we as individuals need to gain additional experience, either by educating ourselves or by trialling unfamiliar situations.

Both procedures come with downsides. Conducting practical and theoretical experiments takes a lot of time and does not guarantee the generation or discovery of new findings, and existing human knowledge is not always consistent and can be refuted at any time by new insights. One of the most important questions is therefore how the process of future knowledge acquisition can be made faster, more efficient and more reliable.

A promising answer is provided by the concept of reinforcement learning. Instead of generating mathematical models from explicit knowledge, reinforcement learning combines ideas from optimal control and behaviourist psychology to simulate an experience-driven learning process. The fundamental approach is trial-and-error learning: an autonomous agent continuously interacts with a given environment and alters its behaviour depending on the observed consequences.

Reinforcement Learning: An agent performs actions to maximize the resulting reward

The more interactions are performed in a given environment, the more experience can be gathered and converted into knowledge. In return, the more knowledge is acquired, the higher the probability of selecting beneficial interactions, which leads to a continuous improvement of the reinforcement learning algorithm. Combined with high-performance servers, which reduce the required wall-clock time of machine learning algorithms to a minimum, it is thus possible to simulate the elementary human knowledge acquisition process in a far more time-efficient manner.
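
To illustrate the trial-and-error principle, here is a minimal Python sketch of an agent interacting with a hypothetical two-armed bandit environment. The environment, the reward probabilities and the exploration rate are made up for illustration and do not refer to any specific library:

```python
import random

# Toy environment: a two-armed bandit where action 1 pays off more often.
REWARD_PROB = {0: 0.2, 1: 0.8}   # hypothetical, for illustration only

def step(action):
    """The environment returns a reward for the chosen action."""
    return 1.0 if random.random() < REWARD_PROB[action] else 0.0

value = {0: 0.0, 1: 0.0}   # the agent's current "knowledge"
counts = {0: 0, 1: 0}

for interaction in range(10_000):
    # Occasionally try something unfamiliar, otherwise exploit experience.
    if random.random() < 0.1:
        action = random.choice([0, 1])
    else:
        action = max(value, key=value.get)

    reward = step(action)
    counts[action] += 1
    # Incremental average: each new experience refines the knowledge.
    value[action] += (reward - value[action]) / counts[action]

print(value)  # the estimate for action 1 should approach 0.8
```

After enough interactions, the agent's value estimates converge towards the true reward probabilities, mirroring the conversion of experience into knowledge described above.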

In comparison to humans, who may take years of intense study to reach a specific level of knowledge, machine learning algorithms can achieve similar results in a fraction of the time and by trialling far more unfamiliar situations. Moreover, the application of reinforcement learning allows explicit knowledge to be omitted entirely, providing the necessary power to unleash the potential for superhuman performance.

Game Playing Approaches

A breakthrough in the field of machine learning was achieved in March 2016 when AlphaGo, an experience-based search tree algorithm, beat the world's best Go player Lee Sedol. In contrast to established Go engines, AlphaGo uses artificial neural networks pre-trained on the strategies of human experts to reduce the complexity of the Monte Carlo Tree Search algorithm.

To become less dependent on human knowledge, AlphaGo Zero, a more generic version of AlphaGo, was launched in October 2017. Instead of relying on pre-learned strategies, AlphaGo Zero starts without any previous knowledge about the game environment and continuously improves the behaviour of its artificial neural network by learning from the policies suggested by the Monte Carlo Tree Search algorithm. In December 2017, the AlphaGo Zero algorithm was applied under the name AlphaZero to a broader class of game rules.

Without additional parameters, AlphaZero mastered the games of Go, Chess and Shogi through self-play, overcoming the extraordinary calculating capabilities of the world's strongest game engines such as Stockfish and IBM's Deep Blue, which rely on thousands of handcrafted rules and heuristics. However, it was AlphaZero's style of play that caused astonishment. In Chess, for example, AlphaZero independently discovered and played common human concepts such as openings, king safety and pawn structure. Furthermore, it developed its own strategies that augmented, and in part overturned, centuries of chess knowledge with a wide range of innovative game-playing approaches.

The implications go far beyond my beloved chessboard… Not only do these self-taught expert machines perform incredibly well, but we can actually learn from the new knowledge they produce.

(Garry Kasparov — former World Chess Champion)

Monte Carlo Tree Search

But how did AlphaZero generate a deeper understanding of its game-playing environment? Let us start with the Monte Carlo Tree Search algorithm, a heuristic search method for computing optimal decisions in a designated domain by combining the precision of tree search with the generality of random sampling. The basic process is subdivided into four repetitive computational steps: selection, expansion, simulation, and backpropagation.

Monte Carlo Tree Search

In chess, for example, the Monte Carlo Tree Search is applied iteratively to compute the optimal interaction for a given board state by simulating various future scenarios and selecting the move that promises the best possible long-term outcome. The nodes within the search tree represent certain board states, while each connection between nodes represents the movement of a game piece. In mathematical terms, a specific node (s, a) can thus be determined by the combination of a certain board situation s and the respective move a performed on it.

Each node of the tree represents a combination of a certain board situation and the move applied to it.

During the execution of the algorithm, each simulated node is assigned an efficiency value, indicating whether the combination of board state and performed move had a positive or negative effect on the current player's performance. Usually, the efficiency value of a node is defined as the ratio of the number of victories Q(s, a) to the number of simulations N(s, a). For example, if a node led to a victory only once but was simulated four times, its efficiency value is ¼. Nodes with higher efficiency values are consequently more profitable and are preferred when selecting subtrees.
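
As a rough sketch of how such a node could be represented in code (the class and attribute names are illustrative, not taken from any particular implementation):

```python
class Node:
    """One tree node (s, a): a board situation s reached by move a."""
    def __init__(self, state, move, parent=None):
        self.state = state          # board situation s
        self.move = move            # move a that led to s
        self.parent = parent
        self.children = []
        self.visits = 0             # N(s, a): number of simulations
        self.wins = 0.0             # Q(s, a): accumulated victories

    @property
    def efficiency(self):
        """Q(s, a) / N(s, a): e.g. one win in four simulations gives 0.25."""
        return self.wins / self.visits if self.visits > 0 else 0.0
```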

For the simulation of possible board interactions as well as the resulting board situations, the Monte Carlo Tree Search algorithm is provided with perfect knowledge of the game rules and considers possible opponent reactions by simulating the player's and the opponent's moves alternately. To account for the opposing interests of the two players, the outcome of an opponent's move is given a negative sign. This procedure is similar to the Minimax algorithm, but differs in that not all, but only a limited number of nodes need to be traversed.

Selection

Let’s get into a little more detail. Starting at the root node (sₒ,⋅), a child selection policy 𝜋 is applied recursively to the search tree until a non-terminal board state sₖ with unvisited child nodes (sₖ, a) is reached. The most popular child selection policy is the simple and efficient Upper Confidence bounds applied to Trees (UCT) algorithm, which is guaranteed to be within a constant factor of the best possible bound on the growth of regret. The UCT algorithm treats the choice among all possible child nodes as an independent multi-armed bandit problem, selecting an optimal action by maximizing the following policy:

Upper Confidence bounds applied to Trees (UCT)

Consider a current non-terminal board state sₘ and a set of subsequent board states sₙ, each connected to the node (sₘ,⋅) by a respective move a. The variable N(sₘ,⋅) represents the number of times sₘ has been investigated in total, while N(sₘ, a) describes the number of visits to each of the subsequent nodes (sₘ, a). If a specific value of N(sₘ, a) equals zero, the UCT value of node (sₘ, a) is defined as infinity. This definition addresses the exploration-exploitation dilemma, ensuring that unvisited child nodes are considered at least once before any other subsequent node is expanded further. In addition, the constant Cₚ > 0 can be raised or lowered to regulate the influence of the exploration term. The ratio of Q(sₘ, a) and N(sₘ, a) represents the efficiency value of node (sₘ, a), which lies within the interval from zero to one. If more than one node (sₘ, a) has the same maximal value, an action is chosen randomly.
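
Continuing the node sketch from above, the UCT policy could look as follows. The concrete form of the exploration term and the default value of Cₚ are common choices from the literature rather than fixed requirements:

```python
import math

def uct_value(child, c_p=1 / math.sqrt(2)):
    """UCT score of a child node (s_m, a), as described above."""
    if child.visits == 0:
        return math.inf                            # unvisited children come first
    exploitation = child.wins / child.visits       # efficiency value in [0, 1]
    exploration = 2 * c_p * math.sqrt(2 * math.log(child.parent.visits) / child.visits)
    return exploitation + exploration

def select(node):
    """Descend the tree along maximal UCT scores until unvisited children remain.

    Note: max() breaks ties by order; a random tie-break would match the text exactly.
    """
    while node.children and all(c.visits > 0 for c in node.children):
        node = max(node.children, key=uct_value)
    return node
```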

Expansion

Whenever the traversal of the search tree examines a combination of a non-terminal board state sₘ and a selected move a that is not yet represented by a corresponding tree node, the expansion step extends the preceding node by creating the missing child node (sₘ, a).

Simulation

In the next step, the game outcome ∆ for the newly created leaf node (sₘ, a) is calculated by simulating a sequence of random actions until a final game state is reached. Depending on the application area of the Monte Carlo Tree Search algorithm, this simulation can vary in complexity. Furthermore, the expected return ∆ may be a discrete or continuous value for simpler domains, or a vector for more complex multi-agent domains. For example, in an environment in which the possible outcomes are limited to winning or losing a game, ∆ would be either 1 or 0. Since each of the created nodes is generated without previous knowledge about the search space, the generality of random sampling is preserved.
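
A simulation step of this kind might be sketched as follows. The helpers legal_moves, apply_move, is_terminal and result stand in for the perfect knowledge of the game rules mentioned earlier and are hypothetical placeholders:

```python
import random

def simulate(state, legal_moves, apply_move, is_terminal, result):
    """Play random moves from `state` until the game ends and return the outcome Δ."""
    while not is_terminal(state):
        state = apply_move(state, random.choice(legal_moves(state)))
    return result(state)   # Δ ∈ {0, 1} in a simple win/loss domain
```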

Backpropagation

In the last step, the simulated outcome ∆ of the newly created leaf node (sₘ, a) is backpropagated through the preceding tree nodes to update the outcome statistics Q(s, a) of each node (s, a) in the selected subtree. The update is usually realised by adding ∆ to the value of Q(s, a). In addition, the visit count N(s, a) of each node is incremented.
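
Building on the node sketch above, the backpropagation step could be written like this. With outcomes in {0, 1} as in the simulation sketch, the perspective change between player and opponent becomes 1 − Δ instead of a sign flip; this is one common convention, not the only one:

```python
def backpropagate(node, delta):
    """Propagate the simulated outcome Δ back to the root, updating Q and N."""
    while node is not None:
        node.visits += 1        # N(s, a) ← N(s, a) + 1
        node.wins += delta      # Q(s, a) ← Q(s, a) + Δ
        delta = 1 - delta       # alternate between player and opponent perspective
        node = node.parent
```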

Termination

The traversal of the Monte Carlo tree search algorithm is repeated until a time, memory, or iteration constraint is reached. As soon as the search terminates, one of the possible actions a of the root node sₒ is selected as the solution to the initial decision problem. Generally, there are four different criteria for selecting the best possible action (a short sketch follows the list):

  • max child: select the root child with the highest efficiency value.
  • robust child: select the most visited root child.
  • max-robust child: select the root child with both the highest visit count and the highest efficiency value. If none exists, then continue searching until an acceptable visit count is achieved.
  • secure child: select the child which maximizes a lower confidence bound.
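
Continuing the sketches above, the final action selection might look like the following. The lower confidence bound used for the secure child is an illustrative choice, and the max-robust criterion is omitted because it would require continuing the search whenever the two conditions disagree:

```python
import math

def best_action(root, criterion="robust child"):
    """Select the final move at the root once the search budget is exhausted."""
    if criterion == "max child":          # highest efficiency value
        chosen = max(root.children, key=lambda c: c.efficiency)
    elif criterion == "robust child":     # most visited
        chosen = max(root.children, key=lambda c: c.visits)
    elif criterion == "secure child":     # maximise a lower confidence bound
        chosen = max(root.children,
                     key=lambda c: c.efficiency - 1.0 / math.sqrt(c.visits + 1))
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return chosen.move
```

Combined with the selection, expansion, simulation and backpropagation sketches above, a complete search simply repeats those four steps in a loop until the budget is exhausted and then calls best_action on the root.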

Experience-based Tree Search

In contrast to traditional Monte Carlo tree search algorithms, which initially treat all upcoming board situations as equally worth investigating, AlphaZero uses previous search experience, represented by an artificial neural network model, to exclude search branches that are expected to be unpromising. As a result, the number of board situations that have to be examined can be reduced dramatically.

Experience-based Tree Search

In Chess, for example, AlphaZero searches only 60 thousand positions per second, compared to roughly 60 million for the open-source chess engine Stockfish. To realise this, each node (s, a) of the AlphaZero search tree consists of a prior probability of success P(s, a), a visit count N(s, a), and an expected outcome, represented again by the ratio of Q(s, a) and N(s, a). Starting from a current board state sₒ, the AlphaZero algorithm simulates multiple board situations by iteratively selecting the board interactions that maximise the upper confidence bound, until an unvisited child node (sₘ,⋅) is encountered.

Upper Confidence bound and neural network predictions

The position of the encountered unvisited child node is then expanded and evaluated by the artificial neural network model f_θ, which generates the prior probabilities P(sₘ,⋅) of the subsequent child nodes (sₘ, a) as well as the prognosticated outcome V(sₘ) of the position itself.

Since the prognosticated outcome V(sₘ) ∈ [-1, 1] is already generated by the artificial neural network model, no further simulations are needed to evaluate ensuing board interactions. In the following step, the prognosticated value of the encountered child node (sₘ, a) is backpropagated through the preceding nodes to update the average outcome statistics, described by Q(s, a) and N(s, a), of the nodes on the selected path.
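
A simplified sketch of one such guided search iteration is shown below. The node type, the constant c_puct and the exact form of the exploration term follow the general pattern of AlphaZero-style implementations and should not be read as DeepMind's exact code; f_theta, legal_moves and apply_move are hypothetical placeholders for the network and the game rules:

```python
import math
from dataclasses import dataclass, field

@dataclass
class AZNode:
    """Search tree node extended with a prior probability P(s, a)."""
    state: object
    move: object = None
    prior: float = 1.0
    parent: "AZNode" = None
    children: list = field(default_factory=list)
    visits: int = 0          # N(s, a)
    value_sum: float = 0.0   # accumulated outcomes, so Q/N = value_sum / visits

def puct(child, c_puct=1.5):
    """Upper confidence bound: exploitation term Q/N plus a prior-weighted bonus."""
    q = child.value_sum / child.visits if child.visits else 0.0
    u = c_puct * child.prior * math.sqrt(child.parent.visits) / (1 + child.visits)
    return q + u

def search_iteration(root, f_theta, legal_moves, apply_move):
    """One pass of selection, network-based expansion/evaluation and backpropagation."""
    node = root
    while node.children:                       # selection along maximal bounds
        node = max(node.children, key=puct)

    priors, value = f_theta(node.state)        # expansion + evaluation by the network
    for move in legal_moves(node.state):
        node.children.append(AZNode(apply_move(node.state, move), move,
                                    prior=priors[move], parent=node))

    while node is not None:                    # backpropagate the predicted outcome
        node.visits += 1
        node.value_sum += value
        value = -value                         # flip the sign for the opponent
        node = node.parent
```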

After 800 iterations, the Monte Carlo tree search algorithm of AlphaZero returns the board interaction belonging to the most visited node (sₒ, a) as the solution to the current decision problem sₒ. In addition, the visit counts of the branches (sₒ, a) are transformed into a probability distribution 𝜋, calculated either proportionally or greedily with respect to the visit counts of the root node (sₒ,⋅). The distribution 𝜋, representing the latest search experience, is then used to adjust the parameters θ of the artificial neural network f_θ.
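
The transformation of visit counts into the distribution 𝜋 could be sketched as follows, with a temperature parameter switching between the proportional and the greedy variant (the temperature scheme is a common convention in AlphaZero-style implementations, not a detail taken from this post):

```python
import numpy as np

def search_policy(root, temperature=1.0):
    """Turn root visit counts N(s0, a) into the distribution π described above."""
    visits = np.array([child.visits for child in root.children], dtype=float)
    if temperature == 0:                       # greedy: all mass on the most visited move
        pi = np.zeros_like(visits)
        pi[visits.argmax()] = 1.0
    else:                                      # proportional, sharpened by 1/temperature
        scaled = visits ** (1.0 / temperature)
        pi = scaled / scaled.sum()
    return pi
```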

At the beginning of the self-play reinforcement learning process, the neural network is initialised with random weights θ. During a game, AlphaZero uses the guided Monte Carlo tree search to find a solution for each of the emerging board situations. The resulting probability distributions 𝜋ₜ and the corresponding board situations sₜ are saved until the current game terminates. By definition, a game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length.
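
Combining the guided search and the policy extraction from the previous sketches, one self-play game might be collected as follows. The game helpers (initial_state, legal_moves, apply_move, is_terminal, outcome) are again hypothetical placeholders, and the sketch glosses over details such as sampling from 𝜋 instead of playing greedily, resignation thresholds, and adjusting the sign of z to each player's perspective:

```python
def self_play_game(f_theta, initial_state, legal_moves, apply_move,
                   is_terminal, outcome, num_iterations=800):
    """Play one game against itself and collect (s_t, pi_t, z) training examples."""
    examples, state = [], initial_state
    while not is_terminal(state):
        root = AZNode(state)
        for _ in range(num_iterations):                 # guided tree search
            search_iteration(root, f_theta, legal_moves, apply_move)
        pi = search_policy(root, temperature=1.0)       # distribution over root moves
        examples.append((state, pi))
        move = root.children[int(pi.argmax())].move     # greedy play, for brevity
        state = apply_move(state, move)
    z = outcome(state)                                  # final result in {-1, 0, 1}
    # In a full implementation, z would be negated for positions where the
    # opponent was to move, so that every example is from the mover's perspective.
    return [(s, pi, z) for s, pi in examples]
```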

The emerging board situations s are then transferred to the artificial neural network model as a stack of input images of size N × N × (M ⋅ T + L), concatenating T sets of M planes of size N × N. Each set of planes represents the board position at a specific timestep and is oriented to the perspective of the current player. The M feature planes are binary planes indicating the presence of the player's pieces, while each of the L additional constant-valued input planes denotes supplementary information such as the player's colour, the total move count or game-dependent restrictions.
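
As a rough illustration of such an input stack, the following sketch encodes a hypothetical game with made-up values for N, M, T and L and a channels-first layout; the actual plane layout in AlphaZero is game-specific and differs in detail:

```python
import numpy as np

# Illustrative sizes for a hypothetical game on an N x N board.
N, M, T, L = 8, 4, 8, 3

def encode(history, player_colour, move_count):
    """Stack T past positions (M binary planes each) plus L constant planes."""
    planes = np.zeros((M * T + L, N, N), dtype=np.float32)
    for t, board in enumerate(history[-T:]):
        for m in range(M):                      # one binary plane per piece type/owner,
            planes[t * M + m] = (board == m + 1)  # assuming boards hold codes 1..M
    planes[M * T + 0] = player_colour           # constant plane: colour to move
    planes[M * T + 1] = move_count / 100.0      # constant plane: normalised move count
    planes[M * T + 2] = 1.0                     # constant plane: e.g. a rule flag
    return planes
```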

The combination of final outcome z ∈ {-1, 0, 1}, probability distributions 𝜋, prognosticated outcomes v := V(sₒ), prior probabilities p := P(sₒ,⋅) and corresponding board situations s is then used to update the parameters of the artificial neural network model f_θ by minimising the following loss function l. Specifically, the parameters θ are adjusted by gradient descent, applied to the mean-squared error and the cross-entropy of the value and probability outputs of the artificial neural network model. The value c is introduced as a parameter to control the level of L2 weight regularisation:

Loss function which needs to be minimized
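
The loss described in the AlphaZero paper combines the squared error of the value output, the cross-entropy of the policy output and an L2 penalty, l = (z − v)² − 𝜋ᵀ log p + c‖θ‖². A minimal NumPy sketch, with a small epsilon added for numerical stability, could look like this:

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi · log p + c * ||theta||^2

    z: final game outcome, v: predicted outcome, pi: search distribution,
    p: predicted move probabilities, theta: flattened network weights.
    """
    value_loss = (z - v) ** 2                       # mean-squared error
    policy_loss = -np.sum(pi * np.log(p + 1e-8))    # cross-entropy
    regularisation = c * np.sum(theta ** 2)         # L2 weight regularisation
    return value_loss + policy_loss + regularisation
```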

To learn each game, an untrained artificial neural network model plays millions of games against itself. At first, it plays completely randomly, but over time the system learns from wins, losses, and draws, making it more likely to choose advantageous moves in the future. The amount of training the network needs depends on the style and complexity of the game, taking approximately 9 hours for chess, 12 hours for shogi, and 13 days for Go — a fraction of the time it would take us, as humans, to master the game’s mechanics. Similar results can be observed in comparison with the world’s strongest game engines:

Comparison between AlphaZero and the world’s strongest game engines

Research

Since experience-based search tree algorithms are no longer limited to the analytical description of explicit human knowledge, similar machine learning techniques could be applied to other domains requiring recognition of complex patterns, long-term planning, and decision-making. At our institute, we are deeply inspired by the overwhelming results of AlphaZero and are currently investigating alternative application areas. Stay tuned.
