# Reinforcement Learning Using Quantum Boltzmann Machines

In this article, we cover a third paper, which deals with Reinforcement Learning using Quantum Boltzmann Machines.

In previous articles, we explained Boltzmann Machines, Reinforcement Learning, and quantum computing. If you have not read them, here is a link to the first one; you will find the others, in order, on my profile.

The original paper can be found here.

In this paper, the authors **simulated quantum annealing** to demonstrate the benefit of reinforcement learning using Quantum Boltzmann Machines over their classical counterparts on small problems.

As mentioned, in a previous article we covered reinforcement learning and an algorithm that lets Restricted Boltzmann Machines perform RL tasks. Here we will cover a similar algorithm, applied to the QBM.

The paper argues that the DBM is a better candidate than the RBM for reinforcement learning tasks. One advantage, demonstrated in prior research, is that the DBM has a greater capacity to model higher-order relations than the RBM, possibly because it has more hidden layers and, therefore, more neurons.

An important advantage of the DBM design for implementation on a quantum annealer is that the proximity and coupling of the qubits in its quantum layout are similar to those of D-Wave Systems' computers, which suggests that such designs could be built in the near future. That is why, instead of embedding a Boltzmann machine inside an existing quantum annealer, the authors assumed that the network itself is the native connectivity structure of a future quantum annealer. Then, using numerical methods, they studied its applicability to reinforcement learning.

In this way, assuming that the quantum layout of the DBM is similar to the qubit structure of the D-Wave computers, and using SQA to optimize a Hamiltonian similar to that of an Ising model, they applied this type of technology to reinforcement learning problems.
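To give a feel for this kind of optimization, here is a minimal sketch of *classical* simulated annealing on a small Ising Hamiltonian, E = -Σ J_ij s_i s_j - Σ h_i s_i. This is a simplified stand-in for SQA (which additionally simulates quantum fluctuations via replicated spin slices); the cooling schedule and parameters are illustrative choices, not those of the paper.

```python
import math
import random

def ising_energy(spins, J, h):
    """Energy of an Ising configuration: E = -sum J_ij s_i s_j - sum h_i s_i."""
    e = -sum(w * spins[i] * spins[j] for (i, j), w in J.items())
    e -= sum(h[i] * s for i, s in enumerate(spins))
    return e

def simulated_annealing(J, h, n, steps=5000, t0=2.0, t1=0.05, seed=0):
    """Metropolis single-spin flips under a geometric cooling schedule."""
    rng = random.Random(seed)
    spins = [rng.choice([-1, 1]) for _ in range(n)]
    energy = ising_energy(spins, J, h)
    for k in range(steps):
        t = t0 * (t1 / t0) ** (k / steps)   # temperature decays from t0 to t1
        i = rng.randrange(n)
        spins[i] *= -1                      # propose flipping one spin
        new_energy = ising_energy(spins, J, h)
        if new_energy <= energy or rng.random() < math.exp((energy - new_energy) / t):
            energy = new_energy             # accept the flip
        else:
            spins[i] *= -1                  # reject: undo the flip
    return spins, energy

# A 4-spin ferromagnetic chain: the ground state has all spins aligned (E = -3).
J = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
h = [0.0] * 4
spins, energy = simulated_annealing(J, h, n=4)
```

For a toy instance like this, the annealer reliably reaches the aligned ground state; the point of SQA (and real quantum annealing) is to keep doing so as the Hamiltonian grows.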

The proposed problem was to find the optimal path within a grid: a 2D board, similar to a chessboard, with r rows and c columns, in which our agent is free to move in the four directions (up, down, right, and left); see Figure 1.

Figure 1 shows the mentioned grid. Each cell represents a state (S) of the system and contains a value: neutral, reward (the objective), penalty, or obstacle. Figure 1 (a) and (b) show two example grids, and (c) shows a solution to the grid in (a).

The main objective of the algorithm, therefore, is to find the policy π that best solves the problem. Each cell has a numerical value, as mentioned, and the possible actions are 𝑎 ∈ {↑, ↓, ←, →, ↺}. In the established model, neutral cells are worth 100, rewards 200, and penalties 0. An agent in a given state must choose the best action to reach the desired objective, and this choice is exactly what the aforementioned reinforcement learning algorithms learn. Different algorithms were studied, and the policies obtained by each were compared.

The algorithms used were very similar but applied to different Boltzmann machines: RBM, DBM, and QBM. The first is the algorithm introduced in the following article; the second is implemented in a structure like the one shown in Figure 2, where the input layer encodes the states and the output layer the actions to be taken.
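The common idea behind these algorithms (covered in the earlier RBM article) is to clamp a state-action pair (s, a) onto the machine's visible units and use the negative free energy as the Q-value, Q(s, a) ≈ -F(s, a). Here is a minimal sketch for the RBM case; the network sizes and random weights are placeholders, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_action = 5, 3               # hypothetical one-hot encodings
n_visible = n_state + n_action
n_hidden = 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # visible-hidden weights
b_v = np.zeros(n_visible)              # visible biases
b_h = np.zeros(n_hidden)               # hidden biases

def free_energy(v):
    """RBM free energy with the visible layer clamped to v (state + action bits)."""
    x = b_h + v @ W                    # total input to each hidden unit
    # log-sum over each hidden unit's two states: log(1 + exp(x)) = logaddexp(0, x)
    return -v @ b_v - np.sum(np.logaddexp(0.0, x))

def q_value(state_bits, action_bits):
    """Q(s, a) is approximated by the negative free energy of the clamped machine."""
    v = np.concatenate([state_bits, action_bits])
    return -free_energy(v)

s = np.zeros(n_state); s[0] = 1.0      # example state
a = np.zeros(n_action); a[1] = 1.0     # example action
q = q_value(s, a)
```

For the DBM and QBM the free energy is no longer available in closed form, which is exactly where the SA/SQA sampling of the previous section comes in: samples from the annealer are used to estimate it.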

For the comparison of the algorithms, the fidelity measure was used. In expression (1), S represents the set of states of the system, Tr the number of training runs, A(s, i, l) denotes the action assigned at training step l of run i for state s, and α*(𝑠) is the set of optimal actions.
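In plain terms, fidelity is the fraction of assignments whose chosen action lies in the optimal set α*(s), averaged over states and runs. A minimal sketch, with hypothetical toy data:

```python
def fidelity(assigned, optimal):
    """Fraction of assigned actions that are optimal.

    assigned[s] is the list of actions chosen for state s (one per run);
    optimal[s] is the set of optimal actions α*(s).
    """
    total = sum(len(runs) for runs in assigned.values())
    hits = sum(a in optimal[s] for s, runs in assigned.items() for a in runs)
    return hits / total

# Hypothetical example: two states, three runs each.
assigned = {"s0": ["up", "up", "left"], "s1": ["right", "right", "right"]}
optimal = {"s0": {"up"}, "s1": {"right", "down"}}
# 5 of the 6 assignments are optimal, so the fidelity is 5/6.
```

A fidelity of 1 means every run learned an optimal action for every state, which is why the curves below are the natural way to compare RBM-RL, DBM-RL, and QBM-RL.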

In the experiments, each algorithm was run 1440 times, and for each run 𝑇𝑠 = 500 training samples were generated. The following figure shows the fidelity of the policy generated by each algorithm when solving two 3×5 grids.

In Figure 3 (a), the algorithms were run on the grid of Figure 1 (a); in Figure 3 (b), on the grid of Figure 1 (b). As the images clearly show, the quantum algorithm, whose sampling was done using SQA, performs considerably better than its classical counterparts. On the other hand, as noted at the beginning, DBM-RL considerably surpasses RBM-RL, making the DBM a better candidate for reinforcement learning tasks.

The paper mentions that as the grid gets larger, RBM-RL stops performing well, while DBM-RL continues to train well (with fidelities similar to those in the graphs).

It should be mentioned that the DBM-RL algorithm uses both optimization methods: SA (Simulated Annealing) and SQA. Regarding structure, DBM-RL is the same as the algorithm introduced in the article mentioned above, but with more nodes and layers. QBM-RL, in turn, is identical to DBM-RL except that the Hamiltonian of the D-Wave quantum computer (similar to the Ising model) is optimized; for the full algorithms, see [1].

# Conclusions

In the last three articles, we covered three different papers that show the potential of quantum computing in machine learning tasks. All three problems share the same structure, known as a variational algorithm.

We are going to talk more about this kind of algorithm in the following articles.

Stay tuned!

# References

[1] Daniel Crawford, Anna Levit, Navid Ghadermarzy, Jaspreet S. Oberoi, and Pooya Ronagh, "Reinforcement Learning Using Quantum Boltzmann Machines," arXiv:1612.05695 [quant-ph].