MuZero: the undefeatable player

AI Club, IITM
4 min read · Sep 8, 2022


A look at MuZero, one of the most advanced reinforcement learning algorithms ever built.

You are given this task: learn to play a game, but you won’t be told the game’s rules or be given someone to play against. All you will be told is whether a particular move is legal and whether the game has ended.

What do you think? Confused? Do you think you can learn to play this game?

Google DeepMind’s MuZero model can, and it learns to play at a superhuman level. It has been termed “unbeatable”.

A glimpse of the past

In 2016, DeepMind’s AlphaGo became the first computer program to defeat a professional human Go player, and it is arguably the strongest Go player in history! All of this was achieved with the help of deep learning, showing the true potential of neural networks.

In 2017, DeepMind made yet another significant step forward in the world of AI by introducing AlphaZero, a single system that taught itself from scratch how to master the games of chess, shogi and Go, given only each game’s rules. AlphaZero’s play was considered groundbreaking and highly dynamic, again demonstrating the power of deep learning. That same year, AlphaZero defeated Stockfish, then the world’s strongest chess engine, in a one-sided 100-game match.

“I can’t disguise my satisfaction that it plays with a very dynamic style, much like my own!”

- Garry Kasparov, Former World Chess Champion

But it didn’t stop there! In 2019, DeepMind achieved something that is difficult even to imagine: a program that can learn to play a game at an expert level without ever being told its rules. Doesn’t that sound astounding? It does exist, so let us try to build an intuitive feel for how it works.

How did MuZero achieve this?

MuZero is a reinforcement learning (RL) algorithm. RL algorithms are built around a system of rewards that the machine can earn for taking the right actions from a particular state.
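
In its most generic form, the RL loop looks like this (a sketch of our own, not specific to MuZero):

```python
import random

# A minimal RL interaction loop: the agent acts, the environment replies
# with the next state and a reward signal for that action.
state = 0
for step in range(5):
    action = random.choice([0, 1])               # the agent picks a move
    state = state + (1 if action == 1 else -1)   # the environment transitions
    reward = 1.0 if state > 0 else 0.0           # reward for the "right" actions
    print(step, action, state, reward)
```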

However, unlike traditional model-based RL systems, MuZero is never given the rules or dynamics of its environment. Instead, it learns its own model of the environment, using an internal representation that keeps only the information useful for planning. Concretely, MuZero learns to predict three quantities (a toy sketch of them follows the list):

  • Value: How good is the current state? What are the chances of winning from the current state?
  • Policy: Which move will give the highest chance of winning from the current state?
  • Rewards: How good was the last action? What is the reward obtained for that action?
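
To make these three predictions concrete, here is a minimal Python sketch; the class and field names are our own illustration, not MuZero’s actual code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Prediction:
    """What the network is asked to predict for one position."""
    value: float         # expected final outcome from here, e.g. +1 = win, -1 = loss
    policy: List[float]  # one probability per legal move; higher = more promising
    reward: float        # immediate reward predicted for the last action taken

# In board games like chess, the only nonzero reward arrives when the game ends,
# so the value and policy predictions carry most of the learning signal.
```
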
How does MuZero work?

These quantities are learned using a neural network. The network is split into three essential functions (sketched in code after this list):

  • A representation function h, which maps the current observation (like the position of the pieces on a chess board) to a hidden state, an embedding the rest of the network can work with.
  • A dynamics function g, which takes a hidden state and a candidate action and returns the next hidden state along with a predicted reward; this is how MuZero plays out future possibilities.
  • A prediction function f, which takes a hidden state and outputs a policy (which moves look promising) and a value (how good the position is).
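
Here is a toy sketch of how the three functions fit together, with random weights standing in for trained networks; the sizes and names below are our own, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HIDDEN, ACTIONS = 8, 32, 4   # toy sizes, not MuZero's real dimensions

# Random weight matrices standing in for trained networks.
W_h = rng.normal(size=(OBS, HIDDEN))
W_g = rng.normal(size=(HIDDEN + ACTIONS, HIDDEN + 1))  # -> next state + reward
W_f = rng.normal(size=(HIDDEN, ACTIONS + 1))           # -> policy logits + value

def h(observation):
    """Representation: raw observation -> hidden state."""
    return np.tanh(observation @ W_h)

def g(state, action):
    """Dynamics: (hidden state, action) -> (next hidden state, reward)."""
    one_hot = np.eye(ACTIONS)[action]
    out = np.concatenate([state, one_hot]) @ W_g
    return np.tanh(out[:-1]), out[-1]

def f(state):
    """Prediction: hidden state -> (policy over moves, value of the position)."""
    out = state @ W_f
    logits, value = out[:-1], np.tanh(out[-1])
    policy = np.exp(logits) / np.exp(logits).sum()   # softmax over move logits
    return policy, value
```

The key design choice is that g and f never see the real environment: once h has encoded the observation, all planning happens on hidden states.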

Using these functions, the model learns the best action at each position. At each step, MuZero converts the observation into a hidden state, and in subsequent steps this hidden state is updated by the dynamics function alone. The hidden state is not constrained to capture everything about the environment, only what is useful for predicting value, policy and reward; this drastically reduces the amount of information the model has to predict and maintain compared to a traditional model-based system.
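
In reality MuZero runs a Monte Carlo tree search over these hidden states; continuing the toy functions above, here is a much cruder sketch of the same idea of planning entirely in latent space:

```python
def imagine(observation, actions):
    """Score a sequence of imagined moves without touching the real environment."""
    state = h(observation)                 # encode the real observation once
    total_reward = 0.0
    for action in actions:
        state, reward = g(state, action)   # step forward purely in latent space
        total_reward += reward
    _, value = f(state)                    # estimate how good the final state is
    return total_reward + value

obs = rng.normal(size=OBS)
# Compare two imagined two-move plans; the higher-scoring one would be preferred.
print(imagine(obs, [0, 2]), imagine(obs, [1, 1]))
```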

MuZero in action

All this is great, but DeepMind has bigger plans for MuZero. MuZero took on a real-world challenge: improving video compression for YouTube. A mechanism called self-competition converts the rich set of codec requirements into a simple WIN/LOSS signal by comparing the agent’s current performance against its own historical performance, giving the agent a single objective it can optimise. By combining the power of search with its ability to learn a model of the environment and plan accordingly, MuZero achieves superhuman performance here as well.
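
As a rough illustration of self-competition (the function and numbers below are made up; DeepMind’s actual objective also balances bitrate against video quality):

```python
import random

def self_competition_signal(current_bitrate, past_bitrates):
    """Turn a continuous codec metric into a simple WIN/LOSS reward.

    The agent 'wins' if it compresses the video better (lower bitrate)
    than a randomly sampled attempt from its own history.
    """
    baseline = random.choice(past_bitrates)
    return 1.0 if current_bitrate < baseline else -1.0

history = [5.2, 5.0, 4.9]                     # bitrates from earlier episodes
print(self_competition_signal(4.7, history))  # 1.0: a win against its past self
```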

Applying MuZero beyond research environments illustrates how RL agents can solve real-world problems.

Compared with previous techniques, MuZero relies on its own learned predictions of the future and generalises to far more environments. This generalisation capability is just as important as raw performance. In other words, given a narrow algorithm that is the best chess player ever built and a slightly-below-world-champion AI that can play any game imaginable, the latter triumphs. It is this generalisation that extends MuZero’s applications to real-life situations.

Credits: Niveath, Priyanka, Gokul, Ruthwik

May your model converge.

Signing off,

Abhiram & Archish
