Solving Rubik’s Cube with Deep Reinforcement Learning

A Deep Reinforcement Learning approach to solving a 3-D common puzzle

Yakup Bilen
7 min read · Sep 1, 2022
Photo by Volodymyr Hryshchenko on Unsplash

Preface

  • English is not my native language, so there are probably some mistakes in this article. Sorry about that.
  • During this process, I tried and changed different methods to achieve success. This article describes the final state of the project.
  • Code to reproduce the results is available at: https://github.com/yakupbilen/drl-rubiks-cube
  • Being familiar with DL (Deep Learning) and RL (Reinforcement Learning) will help you understand the article.
  • Some of the methods (ADI, BWAS) used in this project were taken from two separate papers in the references section and applied with some modifications.
  • Feedback will help me improve the article, the project and myself.

Introduction

The Rubik’s Cube was invented by the Hungarian architect Ernő Rubik in 1974 and was originally called the Magic Cube. It was renamed the Rubik’s Cube by the company that took over its production, and with more than 350 million copies sold as of 2018 it is the world’s most popular puzzle toy.

Photo by Kenny Eliason on Unsplash

A regular (3x3) Rubik’s Cube has 6 faces, and each face has 9 colored squares. The solved/target state is the one in which the squares on each face all have the same color. Due to movement restrictions, the squares in the middle of each face never change position. The middle square of each face therefore shows what color that entire face will be in the solved state. The best human-devised algorithms can solve any state of the cube in at most 26 moves.

My goal is to let the computer learn how to solve the Rubik’s Cube without feeding it any human knowledge like the symmetry of the cube.

Cube Actions

Challenges

  • At first sight, the cube, with 9 squares on each face, looks simple. The challenge is that only one state out of an extremely large state space (43,252,003,274,489,856,000 states) is the solved (target) state.
  • On the practical side of the problem, generating cubes and training the model on the generated cubes require multiple high-performance GPUs because of the size of the state space.

Before we discuss how to approach this problem, first, let’s see data representation and actions.

Data Representation

Each state is represented as a NumPy vector of size 54, with values between 0 and 5 according to the colors of the squares; each number represents a color. For the neural network, every state is one-hot encoded into a 54 × 6 tensor.
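As a minimal sketch of this encoding (the actual function in the repository may look different), one way to do it with NumPy:

    import numpy as np

    def one_hot_encode(state):
        """Encode a length-54 color vector (values 0-5) as a 54 x 6 one-hot array."""
        encoded = np.zeros((54, 6), dtype=np.float32)
        encoded[np.arange(54), state] = 1.0
        return encoded

    # Example: a solved cube with faces numbered 0..5, nine squares per face.
    solved = np.repeat(np.arange(6), 9)
    print(one_hot_encode(solved).shape)  # (54, 6)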

Actions (Moves)

Actions are the possible rotations that we can apply to any given state (cube). Note that only 90° actions are defined in my implementation.

Each face of the Rubik’s Cube can be rotated in two directions: clockwise or counterclockwise. The cube has 6 faces, so in total we have 12 actions. Each action is named according to which face of the cube is rotated and in which direction. The cube faces are named left, right, top, bottom, front and back.

Action names are derived from the first letter of the face name. For example, rotating the right face clockwise is named R (the first letter of the face). For counterclockwise actions, an apostrophe is added to the end of the action name (R’, L’, T’, B’, F’, B’).

In my code, actions are implemented using NumPy indexing because of the huge speed advantage it provides compared to a Python for loop.
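Since every move is a fixed permutation of the 54 squares, a move can be applied with a single indexing operation instead of looping over squares. A minimal sketch, where MOVE_PERMUTATIONS is a hypothetical placeholder table (the real index arrays are defined in the repository):

    import numpy as np

    # Each action is a fixed permutation of the 54 positions. A real table maps
    # every position to its source square; the identity below is only a placeholder.
    MOVE_PERMUTATIONS = np.tile(np.arange(54), (12, 1))  # shape (12, 54)

    def apply_action(state, action):
        """Apply one of the 12 moves to a length-54 state vector in one indexing op."""
        return state[MOVE_PERMUTATIONS[action]]

    def apply_action_batch(states, action):
        """Apply the same move to a whole batch of states at once."""
        return states[:, MOVE_PERMUTATIONS[action]]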

Deep Reinforcement Learning

Deep Reinforcement Learning is an area of machine learning that combines at least one neural network with Reinforcement Learning. It is especially used in situations where classical Reinforcement Learning is insufficient because of the size and complexity of the state/action space, or where inputs are high-dimensional (video, images, etc.).

Deep Reinforcement Learning uses artificial neural networks to get around the memory limits of tabular methods. With Deep Reinforcement Learning, the system can also make a decision for a situation it has never encountered before.

Deep Reinforcement Learning has been used for various applications in many fields (robotics, video games, board games, natural language processing, computer vision, education, transportation, finance, health, etc.).

Process of DRL

NN architecture

The network architecture is shown in the figure.

NN Architecture

My NN architecture is here.
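The exact layer sizes are in the linked code; the sketch below is only an illustration of the kind of fully connected value network used in the referenced papers, with a one-hot 54 × 6 input and a single value output. The hidden sizes here are placeholders, not the ones used in this project:

    import torch
    import torch.nn as nn

    class CubeValueNet(nn.Module):
        """Fully connected value network: one-hot encoded state in, scalar value out."""
        def __init__(self, hidden1=4096, hidden2=2048):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(54 * 6, hidden1), nn.ELU(),
                nn.Linear(hidden1, hidden2), nn.ELU(),
            )
            self.value_head = nn.Linear(hidden2, 1)

        def forward(self, x):
            # x: (batch, 54, 6) one-hot tensor, flattened before the linear layers
            return self.value_head(self.body(x.flatten(start_dim=1)))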

ADI

The method implemented here is mostly from the paper Solving the Rubik’s Cube Without Human Knowledge. First of all, our states aren’t labelled, and we need labelled data (states) to train our model. ADI (Autodidactic Iteration) provides us with labelled states. In my implementation I didn’t use a policy output, only the value.

In ADI we start from the solved state and apply a sequence of random actions of some user-defined length N. This process gives us N states. For each state s we do the following steps:

  • Apply all actions (12 in total) to the state s separately.
  • Pass all 12 substates to our neural network, asking for a value for each substate.
  • The target value (label) for state s is calculated as shown below the list.
  • A(sᵢ, a) is the state after action a is applied to state sᵢ, and R(s) equals 0 if s is the solved state and -1 otherwise.
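Written out with the notation above, the one-step look-ahead target from the referenced paper (which my implementation follows, minus the policy part) is:

y(sᵢ) = maxₐ [ R(A(sᵢ, a)) + v(A(sᵢ, a)) ]

where v is the value predicted by the Generator NN for each substate.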

In the figure below (taken from the paper), x (the state) corresponds to s in the formula above.

Source : Solving the Rubik’s Cube Without Human Knowledge

I assigned a higher training (loss) weight to samples that are closer to the solved cube than to samples that are further away from the solution. The loss weight assigned to each sample state is shown below:

Loss weight
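The exact formula is shown as an image above. In the referenced paper the weight is the reciprocal of the scramble depth, i.e. a state sᵢ reached after D(sᵢ) random moves from the solved cube gets the weight

W(sᵢ) = 1 / D(sᵢ)

so states close to the solution contribute more to the loss. My weighting follows the same idea.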

My ADI implementation is here.
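A condensed sketch of the labelling step described above, assuming NumPy state vectors, the hypothetical one_hot_encode/apply_action helpers sketched earlier, and a PyTorch value network; the names and details are illustrative, not the repository’s API:

    import numpy as np
    import torch

    def is_solved(state):
        """A cube is solved when every face shows a single color."""
        return all(len(set(state[9 * f:9 * (f + 1)])) == 1 for f in range(6))

    def generate_labelled_states(generator_net, n_scrambles, device="cpu"):
        """Scramble the solved cube n_scrambles times and label every visited state."""
        state = np.repeat(np.arange(6), 9)            # solved cube as a length-54 vector
        states, depths = [], []
        for depth in range(1, n_scrambles + 1):
            state = apply_action(state, np.random.randint(12))
            states.append(state.copy())
            depths.append(depth)

        targets = []
        for s in states:
            substates = np.stack([apply_action(s, a) for a in range(12)])
            rewards = np.array([0.0 if is_solved(sub) else -1.0 for sub in substates])
            with torch.no_grad():
                x = torch.tensor(np.stack([one_hot_encode(sub) for sub in substates]),
                                 device=device)
                values = generator_net(x).squeeze(1).cpu().numpy()
            targets.append(float((rewards + values).max()))   # y(s) = max_a [R + v]

        weights = 1.0 / np.array(depths)              # higher weight close to the solution
        return np.stack(states), np.array(targets), weights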

In previous versions of the project, generating labelled data with ADI and training the solver model were done with a single model. When satisfactory results could not be obtained, two different NN models with the same architecture were used for these two processes, and better results were achieved with this change. To avoid confusion, these are referred to as the Generator NN/Generator model and the NN/model, respectively, in the rest of the article.

Training

In every iteration of training, we do the following steps:

  • Generate labelled states by ADI with the help of the Generator NN. The number of states generated in each iteration is the pre-defined B (batch size).
  • Train the NN with the labelled states.
  • Every U iterations, the learning rate is decayed by a learning rate decay factor between 0 and 1.
Decay
  • Every G iterations, the Generator NN is updated with a portion (0–1) of the NN. Note that the Generator NN parameters are updated only here.
Updating Generator NN
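The two update rules are shown as images above; below is a minimal sketch of what they amount to, assuming the “portion” acts as a blending coefficient between the two networks (the names are placeholders, not the repository’s):

    import torch

    def decay_learning_rate(optimizer, decay_rate):
        """Every U iterations: lr <- lr * decay_rate, with decay_rate between 0 and 1."""
        for group in optimizer.param_groups:
            group["lr"] *= decay_rate

    def update_generator(generator_net, net, portion):
        """Every G iterations: blend a portion (0-1) of the NN into the Generator NN."""
        with torch.no_grad():
            for gen_p, p in zip(generator_net.parameters(), net.parameters()):
                gen_p.mul_(1.0 - portion).add_(p, alpha=portion)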

My implementation of the training process is here.

A*

The A* search algorithm is a heuristic graph traversal and pathfinding algorithm used to find the shortest path between a start node and a goal node. Starting from the start node, A* keeps expanding nodes until it reaches the goal node. This expansion is guided by the f value of each node. The f value of each node on the path is defined by the equation f(x) = g(x) + h(x).

g(x) is the path cost between the starting node and node x. h(x) is a heuristic function that estimates the distance from node x to the goal node. In this project, h(x) is computed by the NN that we have trained.

A*

Weighted A*

In the weighted version of the A* search, g(x) is multiplied by a constant λ between 0 and 1.

There is a trade-off between search time and path length to consider when choosing the value of λ. The closer λ is to one, the longer the algorithm may take to reach the goal node, but the closer the result gets to the optimum path length.

The f value of each node is defined by the equation f(x)=λg(x)+h(x).

Batch Weighted A*

The method implemented here is mostly from the paper Solving the Rubik’s cube with deep reinforcement learning and search. While searching for the shortest path from the start node to the goal node, the Batch Weighted A* algorithm expands B (batch size) nodes at a time. This method was used because the h(x) value can be computed for many nodes in parallel.

My Batch Weighted A* implementation is here.
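Below is a compressed sketch of the idea rather than the repository’s code: repeatedly pop the B lowest-f nodes, expand all of their children, evaluate every child with the network in one batch, and push the children back with f(x) = λg(x) + h(x). It reuses the hypothetical helpers (apply_action, one_hot_encode, is_solved) sketched earlier, and hashing a state by its raw bytes is just one possible choice:

    import heapq
    import numpy as np
    import torch

    def batch_weighted_a_star(start, net, lam=0.6, batch_size=100, max_iters=10_000):
        """Search from `start` (length-54 vector) to the solved state; returns a move list."""
        if is_solved(start):
            return []
        open_heap = [(0.0, 0, start.tobytes())]        # entries are (f, g, state key)
        parents = {start.tobytes(): (None, None)}      # key -> (parent key, action taken)
        g_cost = {start.tobytes(): 0}

        for _ in range(max_iters):
            # Pop up to B nodes with the lowest f values.
            batch = [heapq.heappop(open_heap) for _ in range(min(batch_size, len(open_heap)))]
            children, child_g = [], []
            for _, g, key in batch:
                state = np.frombuffer(key, dtype=np.int64)   # assumes int64 state vectors
                for a in range(12):
                    child = apply_action(state, a)
                    ckey = child.tobytes()
                    if ckey in g_cost and g_cost[ckey] <= g + 1:
                        continue
                    g_cost[ckey] = g + 1
                    parents[ckey] = (key, a)
                    if is_solved(child):
                        return reconstruct_path(parents, ckey)
                    children.append(child)
                    child_g.append(g + 1)
            if not children:
                return None
            # Heuristic for all children in a single forward pass.
            with torch.no_grad():
                x = torch.tensor(np.stack([one_hot_encode(c) for c in children]))
                h = (-net(x).squeeze(1)).cpu().numpy()   # cost-to-go from the value estimate
            for child, g, h_val in zip(children, child_g, h):
                heapq.heappush(open_heap, (lam * g + h_val, g, child.tobytes()))
        return None

    def reconstruct_path(parents, key):
        """Walk back through the parent links to recover the action sequence."""
        actions = []
        while parents[key][0] is not None:
            key, action = parents[key]
            actions.append(action)
        return actions[::-1]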

Parameters

  • You can find the NN parameters here.
  • You can find the BWAS parameters here.

🎉🎉Ta Daa 🎉🎉

Result

I believe that if we train the model with more states, the results will converge to the optimum path length and the accuracy will increase.

Thank you for reading my blog post on Medium! I hope you found it interesting.

References

  • McAleer, S., Agostinelli, F., Shmakov, A., & Baldi, P. (2018). Solving the Rubik’s Cube Without Human Knowledge. arXiv preprint arXiv:1805.07470.
  • Agostinelli, F., McAleer, S., Shmakov, A., & Baldi, P. (2019). Solving the Rubik’s Cube with Deep Reinforcement Learning and Search. Nature Machine Intelligence, 1, 356–363.
