A Deep Q-Learning approach to solving a common puzzle

Matthew Dalton
Analytics Vidhya
10 min read · Sep 13, 2020


Source: https://openai.com/blog/solving-rubiks-cube/. OpenAI made headlines in Fall 2019 for solving the related but more difficult RL task of teaching a physical hand to manipulate the cube. The task of actually solving the cube was done using Kociemba’s Algorithm.

Welcome to the second installment of my attempt to solve a Rubik’s Cube via reinforcement learning (RL). Last time, I provided an intro to Markov Decision Processes (MDPs) and formulated the task of solving a Rubik’s Cube as an MDP. If you missed this post or would like a quick refresher, you can check it out here.

At the end of my last post, I left off with a discussion of the Q-function and how we will need to approximate it for our task since the space of state-action pairs is too large. In this post, I will implement a neural network to do exactly that. Along the way, we will explore how the network is trained via the Experience Replay algorithm and provide some initial experimental results. In case you are curious, my actual Python implementation is here. As always, any comments, questions, or feedback is much appreciated!

Architecture

Recall that the Q-function takes a state-action pair (s, a) as input and returns a scalar value that roughly represents how “good” it is to take the action a in state s with respect to the ultimate goal of solving the Rubik’s cube. Furthermore, recall that at each solving step, the agent’s strategy will be to take the action which maximizes the Q-function for the current state.

With these two ideas in mind, we will want to build a neural network that takes some representation of the cube as input and returns a 12-dimensional vector as output. For a given input state s’, each of the 12 output neurons represents the value of the Q-function for one of the 12 valid actions in that state. For example, the first neuron could represent Q(s’, Front rotation) and the twelfth neuron could represent Q(s’, Left rotation). Note: the exact mapping between actions and neuron positions in the output layer is arbitrary.
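For concreteness, the mapping might look something like the snippet below. The move labels and their ordering here are my own illustrative choice (the six faces, each rotated clockwise or counterclockwise), not necessarily the ones used in the linked implementation.

```python
# One hypothetical mapping from the 12 quarter-turn actions to output indices.
# As noted above, the exact ordering is arbitrary; it just has to stay consistent.
ACTIONS = [
    "F", "F'",   # front face: clockwise / counterclockwise
    "B", "B'",   # back
    "U", "U'",   # up
    "D", "D'",   # down
    "R", "R'",   # right
    "L", "L'",   # left
]
ACTION_TO_INDEX = {a: i for i, a in enumerate(ACTIONS)}
# If q_values is the network's 12-dimensional output for a state s', then
# q_values[ACTION_TO_INDEX["F"]] corresponds to Q(s', Front rotation).
```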

Example architecture used for Q-function approximation

Now that the output has been covered, I will also need a way to represent the Rubik’s Cube state as a numerical input. To accomplish this, we will rely on the concept of embeddings. An embedding is a dense, fixed dimension vector used to represent some object. For example, in NLP, word embeddings are often used to represent individual words and the relationship between two word embedding vectors is meant to capture the semantic relationship between the two corresponding words.

In our case, each of the 27 pieces that make up a 3x3x3 Rubik’s Cube will have its own unique dₑ-dimensional embedding. These embeddings will be learned along with the other weights in the network during training. For implementation details, see here.
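As a rough sketch of what this looks like in code (using PyTorch here, with an arbitrary embedding dimension; see the linked implementation for the real details):

```python
import torch
import torch.nn as nn

# A minimal sketch of the piece embeddings, not the exact implementation.
# Each of the 27 pieces gets its own learned d_e-dimensional vector; a cube
# state is then represented by the indices of the pieces occupying the 27 positions.
d_e = 8  # embedding dimension, chosen arbitrarily for illustration

piece_embedding = nn.Embedding(num_embeddings=27, embedding_dim=d_e)

state = torch.randint(0, 27, (27,))   # placeholder for a real encoded cube state
embedded = piece_embedding(state)     # shape: (27, d_e)
```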

Between the input layer and the output layer, I have chosen to use a 3D convolutional layer, followed by a flattening step and one more fully connected layer before the 12-dimensional output layer. ELU activations are used on all layers. The rationale for using ELUs, as well as for including the convolutional layer, is explored in the Initial Results section below.
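Putting the pieces together, a minimal PyTorch sketch of this architecture might look as follows. The specific layer sizes (embedding dimension, convolutional channels, hidden units) are illustrative assumptions rather than the values used in the linked implementation.

```python
import torch
import torch.nn as nn

class CubeQNetwork(nn.Module):
    """A minimal sketch of the architecture described above. The layer sizes
    are illustrative assumptions, not necessarily those of the linked implementation."""

    def __init__(self, d_e: int = 8, conv_channels: int = 32, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(27, d_e)        # one embedding per cube piece
        self.conv = nn.Conv3d(d_e, conv_channels, kernel_size=2)  # 3x3x3 grid -> 2x2x2
        self.fc = nn.Linear(conv_channels * 2 * 2 * 2, hidden)
        self.out = nn.Linear(hidden, 12)              # one Q-value per action
        self.act = nn.ELU()                           # ELU on every layer, as described above

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, 27) tensor of piece indices, one per cube position
        x = self.embedding(state)                     # (batch, 27, d_e)
        x = x.view(-1, 3, 3, 3, x.shape[-1])          # arrange the 27 positions on a 3x3x3 grid
        x = x.permute(0, 4, 1, 2, 3)                  # (batch, d_e, 3, 3, 3) for Conv3d
        x = self.act(self.conv(x))
        x = x.flatten(start_dim=1)
        x = self.act(self.fc(x))
        return self.act(self.out(x))                  # (batch, 12) vector of Q-values
```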

Experience Replay

The actual training of the neural network is done via the Experience Replay algorithm. This algorithm is described in the seminal paper on Deep Q-Learning, Playing Atari with Deep Reinforcement Learning by Volodymyr Mnih et al. For the reader’s reference, I have included the pseudo-code from the paper. However, I will first describe the procedure at a high level.

Experience replay is an iterative algorithm in which the agent attempts to solve the cube many times. During each attempt, the agent starts with a scrambled cube and takes the actions that it thinks are best (according to its current approximation of the Q-function). Every time the agent takes an action, the current state, the action, the resulting state, and the reward received (a measure of the action’s immediate “goodness”) are all stored as a tuple in something called a replay buffer. The replay buffer is simply a collection of these past “experience” tuples. Mini-batches of these tuples are then sampled from the buffer to make updates to the neural network via gradient descent. Performing updates to the network based on past episodes is a form of off-policy learning. This is in contrast to making updates based only on experience generated by the current policy, which is referred to as on-policy learning (another major branch of RL algorithms).
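To make this concrete, a bare-bones replay buffer could look like the sketch below (a simplified stand-in for the one in the linked implementation; the done flag marking whether the resulting state is solved is an extra field I include for the loss sketch later on).

```python
import random
from collections import deque, namedtuple

# Each experience stores the state, the action taken, the reward received,
# and the resulting state, as described above (plus a terminal flag).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity: int = 128):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def push(self, experience: Experience) -> None:
        self.buffer.append(experience)

    def sample(self, batch_size: int = 16):
        # Sample a mini-batch of past experiences for an off-policy gradient update.
        return random.sample(self.buffer, batch_size)

    def __len__(self) -> int:
        return len(self.buffer)
```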

During the first attempts to solve the cube, the neural network is untrained, so the agent takes nearly random actions. However, this series of random actions will eventually lead the agent to solve the cube by chance during one of the episodes. At this point a reward signal will be added to the replay buffer. As the agent keeps training, it will start to include the “experience” tuple containing this reward signal in the mini-batches used for the gradient descent weight updates. This will slowly improve the Q-function approximation and cause the agent’s actions in subsequent training episodes to become a little less random. This will result in further instances of solved cubes and more “good” episodes being added to the replay buffer. This process is a virtuous cycle and will lead to the agent becoming better at solving the task.

However, one thing to note is that sometimes during this process the agent can get stuck in a local minimum of sorts. For example, the agent may only know how to solve the cube from certain shuffled states but not others. To try and push the agent out of these local minima, every so often the agent will take a random action instead of the action prescribed by its Q-function approximation. This random action gives the agent a chance to solve the cube in a completely new way and learn something new.

The likelihood of taking this random action is controlled by a parameter epsilon (between 0 and 1), which is why this strategy is usually referred to as an epsilon-greedy strategy. More broadly, within RL, the epsilon-greedy strategy is a method of exploration. Exploration, or discovering new ways to complete the task at hand, is often contrasted with exploitation, or leveraging the agent’s existing knowledge to solve the task. These two concepts are usually described in terms of a trade-off and are an important component of reinforcement learning.
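In code, the epsilon-greedy rule comes down to a few lines; this sketch assumes the network maps a state tensor to the 12 Q-values described earlier.

```python
import random
import torch

def select_action(network, state: torch.Tensor, epsilon: float, num_actions: int = 12) -> int:
    """Epsilon-greedy selection: with probability epsilon explore with a random
    move, otherwise exploit the current Q-function approximation."""
    if random.random() < epsilon:
        return random.randrange(num_actions)        # explore
    with torch.no_grad():
        q_values = network(state.unsqueeze(0))      # shape (1, 12)
    return int(q_values.argmax(dim=1).item())       # exploit
```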

Pseudo-code for the Experience Replay algorithm described in: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

Initial Results

For my initial round of experimentation, I started with a very simple set-up with the goal being to help decide the architecture for my Q-function approximation. The task was to solve a Rubik’s Cube that was a shuffle distance of 3 from the solved state. By shuffle distance of 3, I simply mean that starting from a solved cube, three random rotations are applied sequentially to the cube. This shuffled state is the starting point from which the agent attempts to reach the solved state during each episode (the cube is shuffled in a different way before each episode). Furthermore, the agent is limited to at most 5 moves per episode to solve the cube.

Training was done for 1000 episodes, with the exploration rate, epsilon, starting at 100% and linearly decaying to 10% over the first 100 episodes, then staying constant at 10% for the remainder of training. A replay buffer of size 128 was used, meaning only the most recent 128 “experience” tuples across all training episodes were stored at a time. Finally, mini-batches of size 16 were used to make the weight updates.
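For concreteness, that exploration schedule can be written as a small function; this is a direct translation of the schedule just described rather than a snippet from the actual implementation.

```python
def epsilon_schedule(episode: int) -> float:
    """Linear decay from 1.0 to 0.1 over the first 100 episodes,
    then held constant at 0.1 for the rest of training."""
    if episode < 100:
        return 1.0 - 0.9 * (episode / 100)
    return 0.1
```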

In evaluating the results of these experiments, there are three metrics that I found helpful to track performance:

Loss (Mean Squared Error): The loss is the squared difference between the network’s current approximation of the Q-value and the observed long-run reward for taking a certain action in a given state, averaged over all “experience” tuples in the mini-batch (see the sketch after this list). This loss is in turn averaged over all mini-batches in an episode to get the per-episode loss. As a general rule, the loss should converge as the network learns, with occasional spikes when the agent rethinks its strategy.

Validation Accuracy: A more direct approach to evaluating agent performance is to use a hold-out set of shuffled cubes. Specifically, I used a set of 100 cubes with a shuffle distance of 3 and tested the network every 25 episodes of training. The number of cubes from this set that the agent was able to correctly solve in 5 moves or less, divided by 100, is reported as the validation accuracy.

Avg. Max Q Value: The final evaluation metric is taken from the Atari paper. Every 25 episodes, for the same set of cubes used for validation accuracy, the initial shuffled states for each cube are passed through the network and the maximum Q-value over actions is recorded. These maximum Q-values are then averaged together. The rationale for doing this is that as the agent gets smarter, it will have a stronger sense of how a particular action will lead to solving the cube. This means the Q-value for that action (which is the sum of the current reward and all discounted future rewards) will hopefully increase as well. Over training, if the agent is improving, we will therefore expect to see this metric increase.
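For readers who prefer code, here is a rough sketch of how the first and third metrics could be computed, assuming the standard one-step targets from the Atari paper; the discount factor and variable names are illustrative assumptions, and the mini-batch is assumed to have already been collated into tensors.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor: an assumed value, not stated in the text

def dqn_loss(network, states, actions, rewards, next_states, dones):
    """Mean squared error between the current Q-value estimates and the observed
    one-step targets (reward plus discounted max future Q-value, cut off at a
    terminal/solved state)."""
    q_pred = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = network(next_states).max(dim=1).values
        targets = rewards + GAMMA * q_next * (1 - dones.float())
    return F.mse_loss(q_pred, targets)

def avg_max_q(network, validation_states):
    """Average, over the hold-out cubes, of the maximum Q-value at each
    shuffled starting state."""
    with torch.no_grad():
        return network(validation_states).max(dim=1).values.mean().item()
```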

Experiment 1: Convolutions vs Fully Connected Layers

The first experiment was to assess whether a convolutional layer was actually necessary. The alternative was to skip it and simply concatenate all of the embeddings into a flat (27 * dₑ)-dimensional vector after the embedding layer. Comparing these two architectures across all three performance metrics, the results are inconclusive, with perhaps a slight edge toward using the convolutional layer. The model with a convolutional layer has a slightly higher validation accuracy over training. Furthermore, the average max Q value seems to increase much more smoothly for the model with a convolutional layer. An unfounded hunch tells me this is indicative of a more stable training process.

Based more on the validation accuracy and a prior belief that convolutional layers would work well with the symmetry of the Rubik’s Cube, I elected to keep the convolutional layer in my architecture. A perhaps more grounded rationale is that the convolutional layer has far fewer parameters than the fully connected layer (8,020 vs. 67,550).
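As an aside, parameter counts like these are easy to check in PyTorch; the layer shapes below are illustrative stand-ins rather than the exact dimensions behind the 8,020 and 67,550 figures.

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    # Total number of trainable weights and biases in a layer or model.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

conv_layer = nn.Conv3d(in_channels=8, out_channels=32, kernel_size=2)  # assumed dims
fc_layer = nn.Linear(27 * 8, 256)                                      # assumed dims

print(count_parameters(conv_layer), count_parameters(fc_layer))
```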

Comparison of model w/ Conv Layer vs. model w/ FC Layer. The Conv Layer model has slightly better validation accuracy over training.

Experiment 2: ELU vs ReLU Activation

The second experiment I ran was to help choose the activation function for each layer of the neural network. Initially, I had chosen ReLU activations and was noticing some strange behavior. At a shuffle distance of 3, the agent was not learning anything, as evidenced by the diverging loss and the validation accuracy hovering around random chance. The key diagnostic, however, was that the average max Q-value was very quickly flatlining to 0 during training.

ReLU (Blue) and ELU (Orange) activation functions

This observation was evidence of a dying ReLU problem. Dying ReLUs occur when the layer preceding the ReLU activation function produces negative values. For reference, ReLU outputs a constant value of 0 for all negative inputs. This means that the gradient through the activation will also be 0 during backprop and no weight update will occur. Unless the underlying layer can produce a positive value at some point later in training, the layer output will always be 0. We see this problem in the graph below, where the average max Q-value (essentially the max value of the output neurons) for the ReLU model is 0.

Fortunately, this problem can be avoided by using an ELU activation function, which outputs a small negative value (rather than a constant 0) for inputs less than 0. This means that it has a small but non-zero gradient during backprop, which allows some gradient to flow and prevents the neuron from completely shutting off. When I switched from ReLUs to ELUs, I saw a vast improvement in training across all three evaluation metrics, as seen below.

Switching to ELU activations is necessary for the agent to solve the task
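To see the mechanics behind this fix in isolation, here is a tiny autograd example: for a negative pre-activation, ReLU’s gradient is exactly zero, while ELU’s is small but non-zero.

```python
import torch
import torch.nn.functional as F

# A negative pre-activation, as produced by a "dying" layer.
x = torch.tensor([-2.0], requires_grad=True)

relu_out = F.relu(x)
relu_out.backward()
print(x.grad)    # tensor([0.]) -> no gradient flows, so no weight update

x.grad = None    # reset before the second pass
elu_out = F.elu(x)
elu_out.backward()
print(x.grad)    # tensor([0.1353]) -> exp(-2), small but non-zero
```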

On the basis of the above experiments, the architecture that I settled on for further experimentation is the one described earlier: piece embeddings feeding into a 3D convolutional layer, followed by a fully connected layer and the 12-dimensional output layer, with ELU activations throughout.

To improve the performance of the agent, future work could be directed at increasing the shuffle distance of cubes that the agent is able to solve, thereby increasing the complexity of the task. One possible way to do this would be to combine the output of the neural network with Monte Carlo Tree Search, a common approach in RL and one that is used in the previously mentioned McAleer paper as well.

Thank you for reading part two of my Rubik’s Cube series! I hope you found it interesting, and I would love to hear any comments or continue the conversation in the comments below.
