# How to write a high-performance quantum computing simulator in just a few lines of code

There are dozens of quantum simulators out there. Written in all possible languages. Some take advantage of the GPU. Some use advanced processor instruction sets like AXV, FMA, SIMD etc. Some are hardware designed for such operations. Why bother write a new one? And a simple one for that matter. The answer is I don’t really know. I’m passionate about quantum computing and I just thought I make the contribution I know best: writing code.

The moara simulator is written in a single file, in under 300 lines of code. This includes parsing the circuit, finding the gate definitions, computing the state-vector and performing the measurement sampling. If we strip off the boilerplate code and focus on the part that performs the evolution of the state-vector in the circuit, we see that it is comprised of just 3 methods. Of which one represents the action of single qubit gates and another deals with controlled operations. You can check out the said file here.

## High-performance, really?

Saying that this is a high performance simulator is a bold claim. To convince you of this I will fist present the results of benchmarking against the most popular simulators in the ecosystem: Qiskit’s qasm simulator, Riggeti’s qvm and qsim, recommended by Cirq. Finally, if after all that you are still interested, you can read how this simulator was created in such a concise form.

But first I need to quickly describe the benchmarking technique. And I need to give a shout-out to my good friend Voicu Tomut who came up with it. Here it goes: We have a sub-circuit composed of layers of single-qubit and two-qubit gates. We want to see how many times we can repeat this sub-circuit in a simulation, such that the simulation finishes in less than 500ms. Basically we create a circuit composed of this sub-circuit and we simulate it. If the simulator returns in less than 500ms we add another repetition of the the sub-circuit to the previous one and try again. The cycle finishes if either the simulator takes too long, or if we reached 100 repetitions of the sub-circuit. Then we move on to a higher number of qubits and repeat the cycle.

The sub-circuit needs to be as general as possible. Meaning no diagonal or anti-diagonal gates, entanglement gates across qubits close and far from each other. No gates that can be merged together. In this concrete case the sub-circuit is made up of a layer of X rotations followed by a layer of controlled-X rotations on neighboring qubits, then a layer of Y rotations followed by a layer of controlled-X rotations between qubits far from each other. Below is the sub-circuit used for 3 qubits. In the experiments “depth” means the number of repetitions of this sub-circuit.

` ┌─────────┐ ┌─────────┐ ┌─────────┐`

0: ┤ RX(π/3) ├────■────┤ RY(π/5) ├────────────────■─────┤ RX(π/7) ├

├─────────┤┌───┴───┐└─────────┘┌─────────┐ │ └────┬────┘

1: ┤ RX(π/3) ├┤ RX(3) ├─────■─────┤ RY(π/5) ├─────┼──────────┼─────

├─────────┤└───────┘ ┌───┴───┐ ├─────────┤┌────┴────┐ │

2: ┤ RX(π/3) ├──────────┤ RX(3) ├─┤ RY(π/5) ├┤ RX(π/7) ├─────■─────

└─────────┘ └───────┘ └─────────┘└─────────┘

All tests were done on a laptop with an Intel Core i7–8550U @ 1.80 GHz CPU, 4 cores, 8 logical processors and 16GB RAM. The range of qubits was meant to go from 0 to 29. The results are not presented in the same graph because for each quantum platform I used the native circuit format. Moara is compatible with qiskit, pyquil and cirq so for comparing with qasm-simulator I used the circuit as returned from the qiskit transpiler. For the other platform I did the same, trying to get as close to the way they run natively.

## moara vs qasm-simulator

Moara has an adapter called MoaraBackend which just plugs in to the `execute()`

method from `qiskit`

so you can run circuits defined in qiskit seamlessly. Below you can see the results of performing the benchmarking on the two simulators.

We see that for circuits of 2 and 3 qubits both simulators could execute circuits of depth 100 in under 500ms. Between 4 and 13 qubits the qasm-simulator cannot reach 100 repetitions in half a second, while moara can. From 14 qubits onward none can get to 100.

It’s not really visible from these graphs, but there is an intersection point at around 19 qubits. From then on the qasm-simulator starts performing better, but for under 19 qubits moara is faster.

## moara vs qvm

Next up Riggeti’s qvm. Running quil circuits is straight forward.

I did use the qvm with the `— compiled`

flag on, if anyone was wondering. Here to it is obvious that moara is performing better. The qvm cannot reach depth 100 even for 2 qubits. The slope of the qvm is more gradual so it is possible that there is a cut-off point in the range 21–30 qubits, but the aim is to see which one runs better on a normal laptop with the focus on up to 20 qubits so I did not investigate any further.

## moara vs qsim

Comparing to qsim was a little trickier. I could not find a way to run qsim on Windows and moara only runs on Windows. With the help of the qsim dev team I managed to compile it and run it in a docker container.

Even so, for 27 or more qubits qsim crashed in my configuration so I could only benchmark circuits up to 26 qubits.

Below you can see the performance comparison.

A few things to notice here. Qsim performs better from 15 qubits onward and the slopes are similar so it doesn’t look like moara could catch up at more qubits. But there is a visible decrease in performance for qsim around 13–14 qubits and, even if I performed benchmarking multiple times, that dip is still noticeable. So I decided to compare the performance for fewer qubits by lowering the limit to **200ms** instead of 500ms.

From this we can see that for circuits larger than 15 qubits qsim is faster, but, for circuits up to 15, qubits moara is better. At this point I was just curious to see how low do I have to make the limit (in ms) so the limitation of this simulator become visible for fewer qubits. It turns out that in this qubit range moara is so fast that lowering the limit to 100ms does not change the graph much. It is only at a limit of **50ms **that the slope becomes apparent.

I think that’s impressive

## How it was designed

Moara was written in rust. I have a lot of programming experience, but I’m new to low level coding. Rust drove me crazy with it’s borrowing model, but in the end it was worth it.

Now let’s look at how one might go about executing a quantum circuit. A quantum circuit is made up of a series of steps in which different gates act on a set of qubits. In each step some or all of the qubits are affected by a quantum operation. In one step a single qubit cannot be part of more than one operation. This is more visible in cirq where we have the concept of a “moment”. But this is present in all platforms, even if it’s not explicit in the circuit definition. The operations are modeled as matrices with complex elements and the state of the system is a vector of complex values. The operations in the same step can be composed as a tensor product into one large operation matrix. They act on the vector state by matrix-vector multiplication and obey the rules of linear algebra. Finally to get a measurement value from the state-vector one needs to find the real probabilities of the state of the system by squaring the norm of each vector element and then sampling from the resulting vector of probabilities. A random number generator can be used to do the sampling.

Naively, to simulate a circuit one would first tensor all operations in each step (or moment) substituting the identity gate where no gate is acting on a given qubit. Then compose all step-matrices into a single matrix that represents the whole circuit. After that, by multiplying the initial state-vector with this matrix, we get the final state and we can do the sampling. This would give all the mathematical information about the circuit, but it’s not very efficient. The size of a matrix for a step would be *2^n* where *n* is the number of qubits. We’ll label this size *N*. Multiplying two of these matrices has complexity O(N³) and the complexity of applying the tensor product is O(N⁴).

A better way would be to apply each gate to the state-vector separately. This would mean tensoring with identity gates on all qubits that are not affected by the gate but that will turn out to be simpler. Let’s look at how the operation matrix looks when we have a single operation on a step. We’ll take a 4 qubit system and model a single qubit operation. We’ll vary the affected qubit and see how the resulting matrix changes. The single qubit operation will be a generic 2x2 matrix with complex elements *[a, b, c, d]*.

R = [ a b ][

[ c d ]a b0 0 0 0 0 0 ]

┌───┐ [c d0 0 0 0 0 0 ]

0: ┤ R ├ [ 0 0a b0 0 0 0 ]

└───┘ [ 0 0c d0 0 0 0 ]

1: ───── => [ 0 0 0 0a b0 0 ]

[ 0 0 0 0c d0 0 ]

2: ───── [ 0 0 0 0 0 0a b]

[ 0 0 0 0 0 0c d] [a0b0 0 0 0 0 ]

0: ───── [ 0a0b0 0 0 0 ]

┌───┐ [c0d0 0 0 0 0 ]

1: ┤ R ├ => [ 0c0d0 0 0 0 ]

└───┘ [ 0 0 0 0a0b0 ]

2: ───── [ 0 0 0 0 0a0b]

[ 0 0 0 0c0d0 ]

[ 0 0 0 0 0c0d] [a0 0 0b0 0 0 ]

0: ───── [ 0a0 0 0b0 0 ]

[ 0 0a0 0 0b0 ]

1: ───── => [ 0 0 0a0 0 0b]

┌───┐ [c0 0 0d0 0 0 ]

2: ┤ R ├ [ 0c0 0 0d0 0 ]

└───┘ [ 0 0c0 0 0d0 ]

[ 0 0 0c0 0 0d]

We notice a few things. First of all we see that each row has only two elements and one of them is on the main diagonal. That means that each element in the state vector is only affected by its self and another element. Secondly we notice that each each column has only two elements and one of them is on the main diagonal. That means that each element in the state vector only affects itself and another element. Thirdly we can see that the distance between elements on the rows and between those on the columns is the same and is *2^q,* where *q* is the qubit index on which the gate is placed. This means that there are pairs of elements on the state vector, at distance* 2^q* from one another that only affect each other. If we compute them in the same step we can get away with changing the state vector in the same memory location. This is really important because the size of this vector grows exponentially with the number of qubits. Plus this means we can perform a single qubit operation with complexity O(N).

Now let’s move on to controlled gates. Let’s see how their step matrix looks like.

` [ `**1** 0 0 0 0 0 0 0 ]

┌───┐ [ 0 **1** 0 0 0 0 0 0 ]

0: ┤ R ├ [ 0 0 **a b** 0 0 0 0 ]

└─┬─┘ [ 0 0 **c d** 0 0 0 0 ]

1: ──■── => [ 0 0 0 0 **1** 0 0 0 ]

[ 0 0 0 0 0 **1** 0 0 ]

2: ───── [ 0 0 0 0 0 0 **a b** ]

[ 0 0 0 0 0 0 **c d** ]

[ **1** 0 0 0 0 0 0 0 ]

┌───┐ [ 0 **1** 0 0 0 0 0 0 ]

0: ┤ R ├ [ 0 0 **1** 0 0 0 0 0 ]

└─┬─┘ => [ 0 0 0 **1** 0 0 0 0 ]

1: ──┼── [ 0 0 0 0 **a b** 0 0 ]

│ [ 0 0 0 0 **c d** 0 0 ]

2: ──■── [ 0 0 0 0 0 0 **a b** ]

[ 0 0 0 0 0 0 **c d** ]

[ **1** 0 0 0 0 0 0 0 ]

[ 0 **a** 0 **b** 0 0 0 0 ]

0: ──■── [ 0 0 **1** 0 0 0 0 0 ]

┌─┴─┐ [ 0 **c** 0 **d** 0 0 0 0 ]

1: ┤ R ├ => [ 0 0 0 0 **1** 0 0 0 ]

└───┘ [ 0 0 0 0 0 **a** 0 **b** ]

2: ───── [ 0 0 0 0 0 0 **1** 0 ]

[ 0 0 0 0 0 **c** 0 **d** ]

From this we can immediately notice that the pattern for a controlled gate is similar to the one for the same operation as a single qubit gate. The difference is that some elements have vanished. The elements that have disappeared are the ones where the control qubit is in state |0>. So this patter depends on the control qubit position. Surprisingly, performing a controlled operation is faster than performing a single qubit operation because there are elements of the state vector that don’t change! We can do this no mater how far apart the control qubit is from the target.

So there you have it. Applying a generic single qubit gate or a controlled generic gate are really two variants of the same operation. It’s only a matter of finding a pair of state vector elements and computing them together in the same step. If you can do that fast then you are all set to create a pretty fast simulator. I’m not going to go into details about how I did it. But again you can find the code here. Maybe you can find a faster method than mine.

Of course, as we saw in the benchmarking section, for larger number of qubits advanced processor methods begin to show their value. At even larger numbers simulating quantum operations becomes impossible on classical computers. As a quantum programmer, I think this is a really good thig ;)

## Finishing up

Quantum computing needs our involvement and our dedication. As a computer scientist going into quantum, I thought writing a quantum simulator was one of the least interesting thing I could do. It turned out I learned a lot. I hope this experience feeds my intuition about what problems are the ones to be pursued with QC technology.

I encourage you too to bring your personal expertise to the quantum field. I think it would be worth it for both you and the quantum community as a whole.