Solving a Rubik’s Cube with Reinforcement Learning (Part 1)

A Deep Q-Learning approach to solving a common puzzle

Matthew Dalton
Analytics Vidhya
6 min read · Aug 11, 2020


Source: https://openai.com/blog/solving-rubiks-cube/. OpenAI made headlines in Fall 2019 for solving the related but more difficult RL task of teaching a physical hand to manipulate the cube. The task of actually solving the cube was done using Kociemba’s Algorithm.

Last year, I started my journey into machine learning through a Master’s program at Cornell Tech. One topic that particularly caught my eye was reinforcement learning (RL), which we approached from both the traditional direction of Markov Decision Processes (MDP) and from the direction of Deep Learning (DL). While the coursework was very informative, I wanted to take it a step further. Here, I have documented my (ongoing) attempt to do just that, by training an agent to solve a Rubik’s Cube.

A couple introductory notes:

  1. Solving a Rubik’s Cube with reinforcement learning is not a new problem, and I will be basing most of my work on this paper by Stephen McAleer et al., with some modifications.
  2. For certain concepts, I will try to go into as much detail as possible about my task-specific implementations, so some familiarity with probability, machine learning, RL, and DL is recommended. That being said, I am by no means an expert, and any and all feedback is much appreciated!

Introducing the Markov Decision Process

A Markov Decision Process captures how an agent takes actions in an environment. Each action puts the agent in a different environmental state, usually according to some probability distribution, where the agent then has the possibility of receiving some reward. The agent’s goal is to learn a policy (i.e. the appropriate action to take in a given state) in order to maximize the long-run reward that the agent receives.

Source: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

More concretely, an MDP is defined by a tuple (S, A, Pₐ, Rₐ), where:

S is the set of all possible states in the environment

A is the set of all possible actions the agent can take

Pₐ defines the probability distribution for transitioning from state s to state s′ when taking action a

Rₐ specifies the reward received for taking action a in state s
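
In symbols (using s′ for the state the agent lands in, a convention I’ll keep for the equations below), the transition distribution is:

```latex
P_a(s, s') = \Pr\big(s_{t+1} = s' \mid s_t = s,\; a_t = a\big)
```

and Rₐ(s, s′) denotes the reward the agent receives when action a takes it from s to s′ (following the formulation in Sutton & Barto, cited above).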

Formulating the Rubik’s Cube task as an MDP

Source: https://mathematica.stackexchange.com/questions/195064/pascal-records-and-mathematica-programming

To train our agent, we must specify an MDP to represent the task of solving a Rubik’s Cube, which we do by defining each component of the tuple.

Our set of possible states, S, is all possible permutations of the Rubik’s Cube. We should note that there are ~4.3 x 10¹⁹ possible states! This will play a major role in how our agent learns to complete the task.
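
As a quick sanity check on that figure, the standard counting argument (8 corner pieces with 3 orientations each, 12 edge pieces with 2 orientations each, and only 1 in 12 of the raw combinations reachable by legal turns) reproduces it:

```python
from math import factorial

# 8 corner pieces: 8! arrangements x 3^8 orientations
# 12 edge pieces: 12! arrangements x 2^12 orientations
# Only 1 in 12 of these assignments is reachable from the solved cube.
n_states = factorial(8) * 3**8 * factorial(12) * 2**12 // 12
print(n_states)  # 43252003274489856000, i.e. ~4.3 x 10^19
```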

Our set of possible actions, A, consists of the 12 possible 90-degree rotations of the cube. By convention, each move is named for which of the six faces of the cube is turned (Front, Back, Right, Left, Up, or Down) and whether that face is rotated clockwise or counterclockwise, giving the following notation: (F, B, R, L, U, D, F’, B’, R’, L’, U’, D’).

Specifying the state transition probability distribution, Pₐ, is fortunately very easy, since each action changes the state deterministically. For example, executing the F action will only move the cube from its current state to the state where its front face is rotated clockwise 90 degrees. If we call these states s₁ and s₂ respectively, then p(s₁ → s₂ ; F) = 1 and p(s₁ → s′ ; F) = 0 for all s′ ≠ s₂. We can then apply this same logic to all other state/action combinations to get the full probability distribution.

Finally, we have to specify the reward function (i.e. what reward, if any, the agent receives for taking each action in each state). To ensure our agent cares only about solving the cube and not some secondary objective, we will say that the agent receives a reward of 1 when it takes an action that leads to the solved cube and 0 when an action leads to any other state.
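
To make these three definitions concrete, here is a minimal Python sketch of the action set, transition probability, and reward function. The `apply_move(state, move)` helper, which would return the cube state produced by a single quarter turn, and the state representation itself are hypothetical stand-ins; only the structure of Pₐ and Rₐ matters here.

```python
# The 12 quarter-turn actions: each face clockwise, plus each inverse.
MOVES = ["F", "B", "R", "L", "U", "D",
         "F'", "B'", "R'", "L'", "U'", "D'"]

def transition_prob(state, move, next_state, apply_move):
    """P_a(s, s'): the dynamics are deterministic, so all of the
    probability mass sits on the one state the move produces."""
    return 1.0 if apply_move(state, move) == next_state else 0.0

def reward(state, move, apply_move, solved_state):
    """R_a(s, s'): 1 only when the move lands the cube in the solved
    state, 0 otherwise (the sparse reward described above)."""
    return 1.0 if apply_move(state, move) == solved_state else 0.0
```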

A Strategy for Solving the Cube

The choice of which action an agent should take in each state of an MDP is referred to as an agent’s policy. In particular, we want to learn an optimal policy that will maximize the long term reward the agent receives. Given the reward function we specified above, the optimal policy for the Rubik’s cube agent is a policy that eventually leads to the solved state, as that is the only state which gives a reward.

Policies are functions that map states to actions and are denoted by π(•). For simple MDPs, the optimal policy can be determined by solving a set of recursive equations known as the Bellman equations.

The Bellman Equation
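
In the notation defined above, with γ as a discount factor between 0 and 1, the Bellman equation for a fixed policy π can be written as:

```latex
V^{\pi}(s) \;=\; \sum_{s'} P_{\pi(s)}(s, s') \Big[ R_{\pi(s)}(s, s') \;+\; \gamma\, V^{\pi}(s') \Big]
```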

Intuitively, the left-hand side of the above equation can be thought of as the “value” of the agent being in state s while following policy π. The Bellman equations say that this value equals the expected reward of taking the action specified by the policy (the rewards from each state reachable by that action, weighted by the likelihood of ending up in that state) plus the similarly weighted “value” of those successor states, discounted by some factor γ. From these equations, an optimal policy can be defined as taking, in each state, the action that maximizes this expected reward plus discounted value.

Optimal policy definition
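
Written out, this greedy choice over the Bellman backup is:

```latex
\pi^{*}(s) \;=\; \arg\max_{a \in A} \sum_{s'} P_a(s, s') \Big[ R_a(s, s') \;+\; \gamma\, V(s') \Big]
```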

While the Bellman equations can be solved for simple MDPs using dynamic programming, this quickly becomes infeasible for larger action and state spaces. For these larger problems, we instead have to take a reinforcement learning approach to solving the MDP. One such approach is called Q-learning, which utilizes a related function called the Q-function. Once the Q-function is specified (or at least estimated), the agent can then follow a policy of taking the action which maximizes the Q-function in each state.

Q-Function
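
In the same notation, the Q-function satisfies a Bellman-style recursion of its own, and the induced policy simply takes the best action under it:

```latex
Q(s, a) \;=\; \sum_{s'} P_a(s, s') \Big[ R_a(s, s') \;+\; \gamma \max_{a'} Q(s', a') \Big],
\qquad
\pi(s) \;=\; \arg\max_{a} Q(s, a)
```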

The Q-function in its simplest form can be thought of as a giant look-up table with dimensions |S| x |A|. However, in the Rubik’s Cube example, where this would result in a table with dimensions of 4.3 x 10¹⁹ x 12, we quickly see it’s infeasible to determine every entry of the table. Instead, we will need a way to estimate the Q-function for all state-action pairs.
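
To make the look-up-table view concrete, here is a minimal sketch of classic tabular Q-learning backed by a dictionary. The `env` object (with `reset`/`step` methods) is a hypothetical environment interface, not something from the original post; the point is that the table Q needs one entry per state-action pair, which is hopeless at ~4.3 x 10¹⁹ states.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, moves, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Classic tabular Q-learning: only workable when the state space is
    small enough to enumerate."""
    # The |S| x |A| "table", filled in lazily as states are visited.
    Q = defaultdict(lambda: {a: 0.0 for a in moves})
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < eps:
                a = random.choice(moves)
            else:
                a = max(Q[s], key=Q[s].get)
            s_next, r, done = env.step(a)  # hypothetical env interface
            # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a').
            target = r + gamma * max(Q[s_next].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```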

Enter Deep Learning

The machinery we will use to estimate the Q-function is a neural network. Specifically, we will leverage the Universal Approximation Theorem, which states that, under certain conditions, a neural network can be used to approximate any function.
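
As a generic illustration of what “using a neural network to estimate the Q-function” looks like (this is not the architecture from the paper, which I’ll cover in Part 2, and the input size below is just a placeholder for however the cube state gets encoded), a small fully connected network maps an encoded state to one Q-value per move:

```python
import torch
import torch.nn as nn

N_MOVES = 12      # F, B, R, L, U, D and their inverses
STATE_DIM = 480   # placeholder size for the encoded cube state

# Generic fully connected Q-network: state encoding in, one Q-value per move out.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, N_MOVES),
)

def greedy_action(state_vec: torch.Tensor) -> int:
    """Policy: pick the move with the highest estimated Q-value."""
    with torch.no_grad():
        return int(q_net(state_vec).argmax().item())
```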

In my next post, I will cover the seminal algorithm used to train a neural network to estimate the Q-function (if you’d like to read on in the meantime, you can check out the original paper here), the specific model architecture I used for the Rubik’s Cube task, as well as some preliminary results.

Thank you for reading my first blog post on Medium! I hope you found it interesting and would love to discuss any feedback or further the conversation in the comments below. Stay tuned for Part 2!
