The University of Alberta, in Collaboration with DeepMind, Introduces a Novel Search Algorithm: Policy-Guided Heuristic Search with Guarantees

Gayan Samuditha · Published in Expo-MAS
5 min read · Dec 11, 2021

========================================

Most machine learning techniques operate under the assumption that the training and test data were generated from the same statistical distribution (i.i.d., independent and identically distributed). But it is not that simple in the real world, where adversaries may supply data that violates such statistical assumptions. Typical examples are adversarial games such as Go and chess.

RESEARCH ABSTRACT:

The use of a policy and a heuristic function for guiding search can be quite effective in adversarial problems, as demonstrated by AlphaGo and its successors, which are based on the PUCT search algorithm. While PUCT can also be used to solve single-agent deterministic problems, it lacks guarantees on its search effort and it can be computationally inefficient in practice. Combining the A* algorithm with a learned heuristic function tends to work better in these domains, but A* and its variants do not use a policy. Moreover, the purpose of using A* is to find solutions of minimum cost, while we seek instead to minimize the search loss (e.g., the number of search steps). LevinTS is guided by a policy and provides guarantees on the number of search steps that relate to the quality of the policy, but it does not make use of a heuristic function. In this work, we introduce Policy-guided Heuristic Search (PHS), a novel search algorithm that uses both a heuristic function and a policy and has theoretical guarantees on the search loss that relates to both the quality of the heuristic and of the policy. We show empirically on the sliding-tile puzzle, Sokoban, and a puzzle from the commercial game ‘The Witness’ that PHS enables the rapid learning of both a policy and a heuristic function and compares favorably with A*, Weighted A*, Greedy Best-First Search, LevinTS, and PUCT in terms of the number of problems solved and search time in all three domains tested.

Paper Title: Policy-Guided Heuristic Search with Guarantees

Download here: arXiv.

========================================

Policy-Guided Heuristic Search with Guarantees
Laurent Orseau¹, Levi H. S. Lelis²
¹ DeepMind, UK
² Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada
lorseau@google.com, levi.lelis@ualberta.ca

========================================

  • In previous research in this area, as demonstrated by DeepMind’s AlphaGo and its successors, search is guided by a policy and a heuristic (value) function through the polynomial upper confidence trees (PUCT) algorithm, which can be quite effective in adversarial games. However, PUCT lacks guarantees on its search effort and can be computationally inefficient in practice. Other methods such as LevinTS provide guarantees on the number of search steps, but they make no use of a heuristic function.

To address the deficiencies of such search algorithms, a research team from DeepMind and the University of Alberta has proposed Policy-guided Heuristic Search (PHS), a novel search algorithm that uses both a heuristic function and a policy while also offering guarantees on the search loss that relate to the quality of both the heuristic and the policy.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

“ The idea behind the PUCT search algorithm is to guarantee that a model’s value function converges to the true value function in the limit of exploration. This guarantee, however, does not hold when actual rewards are replaced with estimated values, and there is no assurance that PUCT’s level of generality is the best fit for difficult deterministic single-agent problems. ”

Combining A* with a learned heuristic function, which trades off solution quality for search time, tends to work better than PUCT in these domains. But the purpose of using A* is to find solutions of minimum cost, which is inconsistent with the goal of minimizing the search loss (e.g., the number of search steps).

Levin Tree Search (LevinTS), meanwhile, uses a learned policy to guide its search and provides guarantees on the number of search steps that account for the quality of the policy, but it makes no use of a heuristic function.
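For orientation, these baselines differ mainly in the evaluation function used to order the search frontier (the node with the smallest value is expanded first). A quick side-by-side in Python, using standard textbook definitions, with g the path cost, h the heuristic estimate, d the depth, and pi the policy’s probability of the path to the node:

    # Frontier-ordering evaluation functions (smallest value expanded first).
    def f_astar(g, h):           # A*: cost-optimal with an admissible heuristic
        return g + h

    def f_wastar(g, h, w=1.5):   # Weighted A*: w >= 1 trades cost optimality for speed
        return g + w * h

    def f_gbfs(g, h):            # Greedy Best-First Search: ignores path cost entirely
        return h

    def f_levints(d, pi):        # LevinTS (unit costs): depth over path probability
        return d / pi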

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Algorithm 1:

The proposed PHS combines LevinTS’s policy-guided search with the heuristic-guided search of A* in a novel algorithm. PHS is designed so that, given a set of unknown tasks (and starting with little to no knowledge of them), it solves all of them as quickly as possible. Unlike traditional search algorithms that seek minimum-cost solution paths, PHS minimizes the total search loss, so its solutions need not be path-cost optimal.

(The paper’s Algorithm 1 gives the full pseudocode; a simplified sketch follows below.)
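At its core, PHS is a best-first search whose frontier is ordered by an evaluation function combining the accumulated loss g(n), the policy’s path probability pi(n), and a heuristic factor eta(n) >= 1 (as described in the next section). The Python sketch below is a minimal illustration, not the authors’ reference implementation: the function names and problem interface are assumptions, and it uses a PHSh-style factor eta(n) = 1 + h(n)/g(n), which gives phi(n) = (g(n) + h(n)) / pi(n).

    import heapq
    import itertools

    def phs(start, is_goal, successors, policy, heuristic):
        """Minimal, illustrative sketch of Policy-guided Heuristic Search (PHS).

        Assumed interface (not from the paper's reference code):
          successors(state) -> iterable of (action, next_state, loss), loss >= 0
          policy(state)     -> dict mapping each action to its probability
          heuristic(state)  -> nonnegative estimate of the remaining loss

        Nodes are expanded in increasing order of
            phi(n) = eta(n) * g(n) / pi(n)
        with accumulated loss g(n), path probability pi(n), and heuristic
        factor eta(n) = 1 + h(n)/g(n), so phi(n) = (g(n) + h(n)) / pi(n).
        """
        counter = itertools.count()  # unique tie-breaker so states are never compared
        # Frontier entries: (phi, tie, state, g, pi, path-of-actions)
        frontier = [(heuristic(start), next(counter), start, 0.0, 1.0, [])]
        expanded = set()  # simplification: the paper's analysis is on search trees
        while frontier:
            _, _, state, g, pi, path = heapq.heappop(frontier)
            if is_goal(state):
                return path, g
            if state in expanded:
                continue
            expanded.add(state)
            probs = policy(state)
            for action, nxt, loss in successors(state):
                pi2 = pi * probs.get(action, 0.0)
                if pi2 <= 0.0:
                    continue  # the policy puts no mass on this action
                g2 = g + loss
                phi2 = (g2 + heuristic(nxt)) / pi2
                heapq.heappush(frontier, (phi2, next(counter), nxt, g2, pi2, path + [action]))
        return None, float("inf")  # no solution found

Intuitively, a node’s priority worsens both when the policy assigns its path little probability (small pi inflates phi) and when the heuristic suggests it is far from a goal, so the search concentrates effort where the two signals agree.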

*******************************************************************

PHS generalizes LevinTS by allowing an arbitrary nonnegative loss rather than fixing the loss to one, and by introducing a heuristic factor that is greater than or equal to one. The researchers prove that, for any nonnegative loss function and for any given policy and heuristic factor, PHS will return a solution node. They also show that the search loss of PHS has an upper bound, and that an accurate heuristic can greatly reduce the number of search steps.
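As a sanity check on that generalization, specializing the sketch above recovers LevinTS: a unit loss on every edge makes the accumulated loss g(n) equal to the depth d(n), and a zero heuristic forces the factor eta(n) = 1 + h/g down to one, collapsing the ordering to LevinTS’s.

    def levints(start, is_goal, successors, policy):
        # LevinTS as a special case of the phs() sketch above: with a zero
        # heuristic, and assuming each edge carries loss 1.0, g(n) equals
        # the depth d(n), so phi(n) = d(n) / pi(n), LevinTS's ordering.
        return phs(start, is_goal, successors, policy, heuristic=lambda s: 0.0)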

*******************************************************************

The researchers compared their PHS algorithm with the policy-guided and heuristic search algorithms PUCT, LevinTS, A*, Weighted A* (WA*), and Greedy Best-First Search (GBFS). They evaluated the algorithms on the 5×5 sliding-tile puzzle, Sokoban (Boxoban), and a puzzle from the 2016 video game The Witness.

Improved results for LevinTS are likely due to the cooperation of search and learning during training, which allows gradients to be gathered from harder problems.

Table 1: Results on the test sets. Lengths are solution depths, and the last three columns are averaged over solved problems only, so their numbers cannot be meaningfully compared across solvers that solved different sets of problems.

The PHSh and PHS* algorithms obtained the best results on Sokoban. On The Witness, where learning a good heuristic function is difficult, the policy-guided best-first-search algorithms (PUCT, PHS*, LevinTS, and PHSh) were the winners; GBFS performed poorly there because it relies exclusively on the quality of the heuristic. On the sliding-tile puzzle, meanwhile, WA* and PHS* solved the problems, clearly outperforming all other algorithms.

Overall, the study shows that PHS enables the rapid learning of both a policy and a heuristic function, and that it performs well in all three test domains in terms of the number of problems solved and the search time required.

*******************************************************************


Gayan Samuditha
Expo-MAS

Software Engineer, Biologist, Techie. Trying to save human lives by combining medical informatics and AI.