Accelerating Quantum Computing Simulations w/ FPGAs

Joe Meng
17 min read · Aug 19, 2020

--

Project repo: https://github.com/joem96/qsim_fpga_project

Background

As quantum computing matures, the number of quantum developers and open-source tools is rising steeply. At the same time, technological restrictions slow down the expansion of shared learning and development. One of these restrictions is the cost of building and maintaining quantum computers: they are insanely expensive, and only companies like Google and IBM have produced machines with 50 or more qubits. As a result, not all developers get access to these machines, and when they do, as in the case of IBM's open-source effort (IBM Q Experience), they may face long queues. Developers have therefore turned to simulators that model the behavior of quantum computers. However, simulators introduce another big restriction: computation speed. Quantum computers are fast because their qubits exist in superpositions of states, letting them act on many states simultaneously. If we simulate that behavior on a classical computer, we essentially take away the magical parallelism introduced by quantum mechanics, so the simulation slows down exponentially as the number of qubits grows.

Since linear algebra is a big component of quantum simulation, I figured it would be interesting to use Field Programmable Gate Arrays (FPGAs) to accelerate the simulation of quantum computation, because FPGAs are very well-equipped to handle matrix multiplication. We use vectors to represent the possible states of the qubits, holding the amplitude (square-root probability) of each possible combination of qubit states. We use matrices to represent quantum gates because a matrix maps an input vector to an output vector, mathematically capturing how the qubits' possible states change as they propagate, unobserved, through a quantum gate.
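
To make that concrete, here is a minimal Numpy sketch of this representation (just the underlying math, not code from the repo): a state is a vector of amplitudes, a gate is a matrix, and applying a gate is a matrix-vector multiply.

```python
import numpy as np

# A single qubit's state: a vector of amplitudes (square-root probabilities)
zero = np.array([1.0, 0.0])           # the |0> state

# A quantum gate: a matrix. Applying the gate is a matrix-vector multiply.
H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)  # Hadamard gate
print(H @ zero)                       # equal superposition: [0.707 0.707]

# Multi-qubit states and gates are built with Kronecker products
H2 = np.kron(H, H)                    # 4x4 gate acting on 2 qubits
print(H2 @ np.kron(zero, zero))       # four equal amplitudes of 0.5
```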

Project Overview

I decided to create a project that serves as a prototype for hardware acceleration of quantum computing simulations using FPGA/SoC designs. The idea is that simulating quantum circuits purely in software on a CPU gets exponentially slower as the circuit grows. If we offload some of the heavy lifting from the software to special-purpose hardware like an FPGA, we can reduce latency, and as we simulate bigger quantum circuits (higher-qubit computations), the speed advantage over pure software simulation keeps growing. This advantage comes with a price, though: the more speed we want, the more we have to pay in literal money (hardware resources and memory are not free). Without spending too much on hardware, I implemented the biggest quantum simulator I could in a way that demonstrates both the acceleration and the scalability. My prototype includes:

  • A software interface for developers to construct their quantum algorithm / circuit up to 5 qubits (more on why I chose 5 later).
  • An FPGA/SoC design that receives the circuit information and simulates the algorithm in an accelerated manner.
  • Graphical plots that show the results of the simulated computation and the metrics of the speed-up.

Final Product

In my repo, my Python program includes an example quantum circuit (5-qubit Grover's Algorithm). If you want to construct a different algorithm, just follow the same format but use different symbols. Below is what Grover's Algorithm looks like in the user interface code, along with what exactly it represents in circuit form. I set the oracle such that state 10111 (decimal 23) gets phase-flipped as the qubits propagate through it.
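
For readers who want to poke at the algorithm itself, here is a hedged Numpy model of the same 5-qubit Grover circuit. This mirrors the math being simulated, not the repo's symbol format (see the repo for that).

```python
import numpy as np

n = 5
N = 2 ** n            # 32 possible states
marked = 0b10111      # state 23, the one the oracle phase-flips

# Oracle: identity matrix with a -1 on the marked state's diagonal entry
oracle = np.eye(N)
oracle[marked, marked] = -1

# Diffusion operator: 2|s><s| - I, where |s> is the uniform superposition
s = np.full((N, 1), 1 / np.sqrt(N))
diffusion = 2 * (s @ s.T) - np.eye(N)

# Start in uniform superposition and apply oracle + diffusion repeatedly
state = s.copy()
for _ in range(4):                        # 4 iterations, as in the plots
    state = diffusion @ (oracle @ state)

print(np.round((state ** 2).ravel(), 3))  # probability of measuring each state
```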

The graphs below are the resulting plots after everything runs correctly. I included runs at different iteration counts (1 to 4); they show that we become more likely to measure the marked state (state 23) the more times we query the oracle. To the right, I compared the time the Numpy (a Python library) computation took against the time the FPGA computation took.

5-Qubit Grover’s Algorithm Results w/ Iterations 1–4

Performance & Impact

I plotted the difference in computation time between Numpy and my FPGA design over a larger range of repetition counts for my 5-qubit simulator. I implemented a function in Python that runs the Numpy computation with iteration counts from 1 to 100; at each count, I ran the computation 10 times and took the average elapsed time.
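
A minimal sketch of what that benchmark loop can look like (function and variable names here are mine, not the repo's):

```python
import time
import numpy as np

# For each iteration count, time the repeated matrix-vector multiply
# over several runs and keep the average.
def benchmark_numpy(matrix, vector, max_iters=100, runs=10):
    averages = []
    for iters in range(1, max_iters + 1):
        total = 0.0
        for _ in range(runs):
            state = vector
            start = time.perf_counter()
            for _ in range(iters):
                state = matrix @ state
            total += time.perf_counter() - start
        averages.append(total / runs)
    return averages

# Example: a random 32x32 circuit matrix and a 32-element state vector
rng = np.random.default_rng(0)
times = benchmark_numpy(rng.random((32, 32)), rng.random(32))
```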

FPGA vs Numpy Speed Comparison (at Varying Iterations)

I also created a plot that compares the runtime at varying qubit counts, which demonstrates the bigger goal of this project: acceleration as the design scales. With my Arty A7 FPGA board, I can implement simulations with 1-5 qubits by altering parameters in my hardware architecture. For 6-12 qubits, I calculated theoretical runtimes, i.e. the runtimes of the higher-qubit simulators assuming my hardware design fits onto bigger (and more expensive) FPGAs.

FPGA vs Numpy Speed Comparison (at Varying Qubits)

At 5 qubits, the speed-up is negligible (at a fixed circuit size, the runtime difference is only a constant factor). But when we compare the runtimes across qubit counts, we see an exponential speed-up. It turns out my hardware architecture has O(n) time complexity (n being the number of qubits) while the Numpy runtime is close to O(2²ⁿ). I explain this in more detail in the Overall Speedup Analysis section.

Design Overview

Top-Level Block Diagram

The user interface is a Python script in which users construct their quantum algorithm. When the script runs on the computer, it compiles the data that represent the algorithm and sends it off to the FPGA through the USB connector and UART interface. The soft-core processor, a Microblaze implemented on the FPGA, runs a C++ program that receives the transmitted UART data, organizes it, and re-transmits it to my custom IP design (Hardware Accelerator) through the AXI interface. The Hardware Accelerator configures its memory from the received data and starts the computation. It then sends the results back to the Microblaze and eventually to the computer, where the Python script picks them up and maps them to a graphical plot.
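
On the PC side, the serial link boils down to a few Pyserial calls. A hedged sketch follows; the port name, baud rate, and payload framing are assumptions, and the real protocol lives in the repo.

```python
import serial  # the Pyserial library mentioned later in this post

payload = bytes(128)  # placeholder for the encoded matrix/vector data

with serial.Serial("COM3", baudrate=115200, timeout=5) as port:
    port.write(payload)     # Python -> UART -> Microblaze (C++) -> HW_Accelerator
    result = port.read(64)  # 32 result values x 16 bits = 64 bytes back
```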

Hardware Accelerator (Hardware / RTL Design)

The hardware / RTL design is the most design-heavy part of my project because it is the key to accelerating the computation. In my block diagram above, you will notice that most of the components in the FPGA are Vivado-generated IP, which I color-coded in gray. This includes my HW_Accelerator block, which was also generated by Vivado as an AXI peripheral IP. I did this because I wanted Vivado to handle the AXI communication between the Microblaze and my custom RTL design. With that handled, the only thing left was to customize the HW_Accelerator block by adding logic that translates reads/writes of the AXI registers into directions for my custom RTL. My top-level custom RTL block, soft_if, interprets those directions and further translates them into configuration signals for my main component, mat_ctrl.

mat_ctrl Block Diagram

The mat_ctrl holds the main architecture that executes the matrix multiplication. This means it needs to store the matrix, which represents the quantum circuit, and the vector, which represents the initial states of the qubits. Since I am simulating 5-qubit circuits in this project, I need to store a vector of 32 16-bit values (32 because 2⁵ = 32), each of which represents the square-root probability of a possible state, along with a 32x32 matrix. In addition to storage, mat_ctrl needs to access those values and send them off to be matrix-multiplied.

Reason for Speedup #1:

I decided to store each row of the matrix in a separate BRAM so that I can process the rows in parallel, since each row can be independently multiplied with the vector. As a result, you can see that I instantiated 32 BRAMs to store the 32 rows of the matrix and also instantiated 32 row_vec_mult components so that each row can go through the multiplication process without waiting. This parallelism in accessing memory is the first reason the computation speeds up.

This design choice is also the reason I ended up with 5 qubits for my project. I am using an Arty A7 board, which carries an Artix-7 FPGA with only 50 BRAMs. Since my design stores matrix rows in parallel BRAMs, the number of BRAMs doubles for each additional qubit being simulated: 6 qubits would require 64 BRAMs, which is not feasible on the Artix-7. As a result, 5 is the maximum number of qubits I can implement with this design choice without buying new hardware, as the quick check below shows.
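
```python
# One BRAM per matrix row means 2^n BRAMs for n qubits, and the
# Artix-7 on the Arty A7 only has 50 of them.
AVAILABLE_BRAMS = 50

for n in range(1, 8):
    rows = 2 ** n
    verdict = "fits" if rows <= AVAILABLE_BRAMS else "does not fit"
    print(f"{n} qubits -> {rows} BRAMs: {verdict}")
```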

Reason for Speedup #2:

The second reason also takes advantage of parallelism, but this time within each row of the matrix. Ideally, we could access a whole matrix row in one clock cycle and then multiply that row with the whole vector in the next cycle. In reality, we can't: a matrix row doesn't fit into one row of BRAM (one BRAM row only holds 64 bits). So we have to break the matrix row down into smaller segments and store each segment in one row of the BRAM. In the mat_ctrl block diagram above, you can see that the matrix row is broken down into 8 segments, so each BRAM row only stores 4 matrix elements. That may seem unfortunate, because we now need 8 clock cycles just to read out each matrix row. However, the advantage comes from the fact that the processing of the segments overlaps in time. While we are accessing the second segment of the matrix row, the first segment is simultaneously being multiplied; while we are accessing the third segment, the first and second segments continue advancing to the next stage of the process. This is all possible because, at the end of the day, these hardware / RTL components are implemented in digital logic, which runs concurrently by nature. When we send the first segment into the row_vec_mult block, we don't need to wait for it to propagate to the end before sending the second segment. Instead, each segment flows through the same computation process as if flowing through a pipe, end-to-end.
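
A toy cycle count shows why this pipelining pays off. The 3-stage depth here is an illustrative assumption, not the real design's pipeline depth.

```python
segments = 8        # BRAM reads needed per matrix row
stages = 3          # e.g. fetch, multiply, accumulate

serial_cycles = segments * stages          # wait for each segment: 24 cycles
pipelined_cycles = segments + stages - 1   # segments overlap: 10 cycles
print(serial_cycles, pipelined_cycles)
```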

FPGA vs. CPU Processing Concept

The image above depicts my FPGA running a computation process, with a CPU running a similar process added as a comparison. The FPGA design runs at a lower clock speed, but the parallelism ultimately makes the processing of the complete matrix row faster. Of course, my CPU illustration is simplified: CPUs usually run at roughly 10x the clock speed of FPGA designs (several GHz vs. hundreds of MHz), not the 2x in my example. But it is also important to note that steps like "ele-wise multiplication" and "add elements together" usually require multiple clock cycles on a CPU because software is instruction-based.

Parameterization

I put a big effort into parameterizing my whole RTL design. I have a package file, mat_ctrl_pkg, which lists all the constants: the number of segments the vector is broken down into, the number of values per segment, the number of bits each element is broken down into, etc.

mat_ctrl_pkg file

This is useful for two things. The first is debugging. When I was building my RTL design, I had a particular architecture in mind, but I couldn't know in advance the exact parameter values that would let my design "fit" into the FPGA. One prime example is my add_all component, a submodule of row_vec_mult that takes a vector of elements and adds them all up. That seems like a really simple task until you consider the parameters. I felt I needed to parameterize this block because the choice of values really affects static timing analysis later on. Say we have 8 elements. On one end, we can add 2 elements at a time, which takes log₂8 = 3 clock cycles to reach the final sum. On the other end, we can add all 8 elements in 1 clock cycle. The latter is obviously faster, but it could also run into timing violations later during place & route due to the extra logic generated between registers. Parameterizing the block eases the process of finding the optimal point between speed and making timing.

I designed add_all so that the number of stages is a parameter. The block diagram below illustrates an example: 11 inputs and 3 stages.

add_all Block Diagram

First, I implemented a function that computes num_hand, the number of elements we add at once, given the number of elements and stages. I also implemented a function that further calculates num_comp, how many elements we have per stage. Both functions are evaluated before runtime. The hard part about parameterizing is that I needed to construct my architecture in terms of these parameters. To achieve that, I went with multi-dimensional arrays whose sizes are determined by the parameters. The green 2D array, reg_sigs, stores all the registered elements per stage, while the gray 3D array, conc_sigs, stores all the concurrent signals that build the logic between registers. conc_sigs has an extra dimension because it needs to organize which elements are grouped together for adding. One concern I had was that even though the number of elements shrinks exponentially each stage (as seen in the diagram), I couldn't create a 2D array whose entries are differently-sized arrays, so I thought I was wasting a lot of registers. Fortunately, Vivado was smart enough to synthesize only the registers that are actually used.
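
As a rough Python model of what those two functions work out (my reconstruction from the diagram, not the RTL itself):

```python
import math

def num_hand(n_elements, n_stages):
    # Smallest adder fan-in k such that k-input adders can reduce
    # n_elements down to 1 within n_stages (k^n_stages >= n_elements).
    return math.ceil(n_elements ** (1 / n_stages))

def num_comp(n_elements, fan_in):
    # How many elements remain at the input of each successive stage.
    counts = [n_elements]
    while counts[-1] > 1:
        counts.append(math.ceil(counts[-1] / fan_in))
    return counts

k = num_hand(11, 3)        # the 11-input, 3-stage example above -> 3
print(k, num_comp(11, k))  # 3 [11, 4, 2, 1]
```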

The other debugging example came when I initially broke my vector down into 8 segments of 4 elements each. After synthesis, Vivado showed that my FPGA couldn't afford to build 4 multipliers per row because it would use up too many LUTs (utilization was 150%). After changing the parameters in my package file to 16 segments of 2 elements each, the synthesized result fit. The number of multipliers was halved, and on top of that, Vivado implemented the multipliers with DSPs instead of LUTs, since the Artix-7 could afford to implement the 64 multipliers with its 90 DSPs, making the usage of resources far more efficient. This was one of the moments that made me glad I parameterized, because I only had to change a constant value in the package file.

Besides debugging, parameterizing helps demonstrate one of the goals of this project: scalability. As mentioned previously, the acceleration of a 5-qubit simulation is too insignificant for actual developers. The main point, however, is the project's concept and its applicability to bigger quantum simulations (more qubits). With a more expensive FPGA board, one could take the same RTL design and literally only change the package file's constants to build a bigger quantum simulator. Hypothetically, with current Xilinx products, one could apply this design to a Virtex UltraScale+ chip and build an 11-qubit simulator that runs at about the same speed as my current project, assuming everything meets timing. But that would also cost $70k+. At that point, I would probably change up my design and start looking into DDR storage for the computation's matrix rather than making parallel BRAMs the dominant memory feature of the design.

Hardware Final Build

The final hardware build successfully fit onto the Arty A7's Artix-7 FPGA. As mentioned previously, I changed one of the parameters, the number of segments per matrix row, from 8 to 16, which traded the over-usage of LUTs for efficient usage of DSPs. It also changed the number of BRAMs from 40 to 24: 16 segments per matrix row means there are only 2 elements per BRAM row, so the BRAM width only needs to be 32 bits (instead of 64 bits). As a result, Vivado inferred 18Kb BRAMs (32-bit width) instead of 36Kb BRAMs (64-bit width). In the utilization report below, Vivado counts an 18Kb BRAM as 0.5 BRAMs, yielding 16 BRAMs in total for storing the matrix. The remaining 8 BRAMs are allocated to the Microblaze.

Vivado’s Utilization Report

Timing is also not a problem for this design as the clock speed is only 100 MHz.

Vivado’s Timing Report

Overall Speedup Analysis

With the hardware architecture in mind, I think it is important to analyze the actual speedup as the design scales. In software, when we simulate n qubits propagating through quantum gates, we are essentially multiplying a 2ⁿ-by-2ⁿ matrix with a 2ⁿ-element vector. This operation is O(2²ⁿ), since for each of the 2ⁿ rows we have to multiply and accumulate 2ⁿ elements.

In my hardware design, I use parallel BRAMs to store each row independently, so we reduce the time complexity by a factor of 2ⁿ (refer back to Reason for Speedup #1). In addition, by breaking each matrix row into s segments with a per-segment processing time of p, we don't need to wait for each segment to finish before starting the next one (refer back to Reason for Speedup #2), so the time complexity for each row is O(s+p) instead of O(s⋅p). Finally, p depends on the number of cycles we need to add all the elements in a segment together (refer back to the add_all block diagram under Parameterization). In the best case, the FPGA is so big that we can always add all of a segment's elements together in a constant number of clock cycles. In the worst case, we add 2 elements at a time, so with 2ⁿ/s elements in one segment, p = O(log(2ⁿ/s)) clock cycles.

So, in the best case, our overall computation time complexity is constant, and in the worst case it is O(s + p) = O(s + log(2ⁿ/s)), which for a fixed number of segments is O(n). Refer back to the Performance & Impact section to see the O(n) vs O(2²ⁿ) behavior in the graphs.
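
A quick back-of-the-envelope model of this analysis in Python (the segment count and pipeline depth are illustrative assumptions):

```python
import math

def fpga_cycles(n, s=16, depth=5):
    elements_per_segment = max(2 ** n // s, 1)
    # Worst case: 2-input adders, so log2(elements) extra stages per segment
    p = depth + math.ceil(math.log2(max(elements_per_segment, 2)))
    return s + p  # all 2^n rows are processed in parallel (Speedup #1)

def numpy_ops(n):
    return (2 ** n) ** 2  # multiply-accumulate over every matrix element

for n in range(1, 13):
    print(n, fpga_cycles(n), numpy_ops(n))  # O(n) growth vs O(2^(2n))
```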

Middleman (C++ Design)

The C++ program running on the Microblaze basically serves as a middleman between the PC's program and the FPGA. It reads from the UART IP (which gets its data from the Python program) to obtain the vector and matrix information, then configures the Hardware Accelerator IP by writing to different register addresses. The nice thing about Vivado handling all the AXI IP is that I just have to write to an address in C++ and I know the corresponding register will be written in my custom RTL (in this case, the HW_Accelerator). A REGISTERS_DEFINITION.txt file in my repo shows which registers the program reads/writes to send vector data, send matrix data, start the computation, read the computation results, etc. After reading the results from the RTL, the C++ program writes them to the UART IP (which then transmits the data back to the Python program).

If you look at the helper file, you will see tons of functions I didn't use (the ones under "UNUSED QUANTUM CONSTANTS…"). These exist because I originally had the C++ do all the user-interface processing (building the circuit). However, the Microblaze kept running out of space, and the available C++ libraries are limited, so I moved those processes into the Python program, leaving the C++ program as a pure intermediary. Nevertheless, it was fun writing C++ functions like the low-level decimal multiplication or the Kronecker product using pointers. These might be useful if I ever move the user-interface processing back onto the FPGA, which would be something to consider if I use a Zynq board.

User Interface (Python Design)

I implemented the user interface and some of the preprocessing in Python. The user constructs the desired quantum circuit using symbols; the program then takes those symbols and starts building the matrix representation of the quantum gates. Implementing this in Python was very convenient because the Numpy library took care of the matrix multiplications and the Kronecker products. The only thing I had to handle myself was the decimal-to-hex conversion. My FPGA design represents each square-root probability as a 16-bit signed number ranging from -1 to 1:

16-bit signed decimal (s1p14 format)

So, the Python program can't simply send the decimal straight to the FPGA; there needs to be a conversion first. I wrote the frac_to_hex and hex_to_frac functions to accommodate sending data to and receiving data from the FPGA, respectively. These two functions shift the decimal to the corresponding whole-number binary representation (or vice versa) and then call a 2's-complement function (which I found on geeksforgeeks.org) if the decimal is negative. Finally, the Pyserial library made it possible to read/write the PC's COM port, which the FPGA board connects to through USB.
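
Here is a minimal sketch of what such a conversion pair can look like, assuming 14 fractional bits per the s1p14 format shown above; the repo's actual functions may differ.

```python
def frac_to_hex(x, frac_bits=14, width=16):
    val = int(round(x * (1 << frac_bits)))  # scale decimal to fixed point
    return val & ((1 << width) - 1)         # 2's-complement wrap if negative

def hex_to_frac(h, frac_bits=14, width=16):
    if h >= 1 << (width - 1):               # undo the 2's complement
        h -= 1 << width
    return h / (1 << frac_bits)

print(hex(frac_to_hex(-0.5)))  # 0xe000
print(hex_to_frac(0x2000))     # 0.5
```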

I used Matplotlib to plot a graph of the computation results. I also implemented the same computation locally in Python using Numpy, to compare the results of a software computation with a hardware-accelerated one, and plotted a second graph showing the comparison.

To measure the Numpy computation's elapsed time, I just used the time library. For the FPGA computation, it was a little more complicated: using a software library alone would be inaccurate because of overhead like sending the data over before the computation actually starts. So, I built an internal timer into my Hardware Accelerator custom IP. The timer starts whenever the custom IP gets the signal to start the computation and stops when the computation results are ready for software to read. The internal timer is just a clock-triggered counter, so the elapsed time is the counter value times the clock period (10 ns, since our clock is 100 MHz). I also calculated the average percent difference between the Numpy and FPGA computations. The percent difference increases with the repetition count, most likely because the rounding error in my FPGA design becomes more apparent as more multiplications take place. This happens in the Python computation too, of course, but Python is less prone to the error because it uses 64-bit floating-point representation rather than my 16-bit fixed-point.
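
In code, that bookkeeping amounts to something like the following sketch (the names and the exact percent-difference formula are my assumptions):

```python
import numpy as np

CLOCK_PERIOD_NS = 10  # 100 MHz clock

def fpga_elapsed_seconds(counter_value):
    # Hardware counter ticks once per clock cycle while the computation runs
    return counter_value * CLOCK_PERIOD_NS * 1e-9

def avg_percent_difference(sw_result, hw_result):
    sw = np.asarray(sw_result, dtype=float)
    hw = np.asarray(hw_result, dtype=float)
    denom = np.maximum(np.abs(sw), 1e-12)  # guard against divide-by-zero
    return float(np.mean(np.abs(sw - hw) / denom) * 100)
```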

The third graph also compares the computation speed between the FPGA and Numpy, but this time at a varying number of qubits. With Numpy, I measured the computation time of randomly-generated quantum circuits ranging from 1 to 12 qubits. With the FPGA, I could only build up to 5 qubits; the computation times for 6-12 qubits are theoretical, calculated on the assumption that my hardware architecture could be applied to bigger FPGAs.

Conclusion

I ended up achieving my goal of building a 5-qubit quantum simulator on my Arty A7 board. I think the prototype demonstrates the acceleration well, and most importantly, the design's parameterization enables further scaling. In my results, the FPGA computation was about 100x faster than the software computation (at 4 iterations), but honestly, quantum developers would not even feel these microsecond differences. However, scaling the design to higher qubit counts would widen that difference exponentially, and that is when it would really matter and could really speed up quantum algorithm development (refer back to the Performance & Impact section to see the speed-up on a graph as the project scales to higher qubit counts, and to the Overall Speedup Analysis section for why that's the case).

It took me around 3 months to finish this project. I thought most of the work would be RTL design, and while I did spend a big chunk of my time on that and on verification, a lot of my time went into exploring the different communication channels, like Microblaze-to-custom-IP over AXI, or Microblaze-to-Python over USB/UART. I also spent a considerable amount of time on the C++ and Python portions. Because of that, the project turned from an FPGA project into more of an SoC project, and I learned a lot about embedded system design along the way.

Note: file names, component names, variable names, software libraries, section titles, and time complexities are italicized

--


Joe Meng

Hardware Engineer specializing in FPGAs and interested in topics like DSP, Quantum Computing, & Fintech.