Building the world’s smallest quantum classifier

Written together with Mark Fingerhuth

How can we use quantum computing to solve problems in machine learning? Researchers — most of them physicists — increasingly started asking this question in the past couple of years. Some first answers approached the problem in the way things are typically done in quantum computing: Take a classical machine learning algorithm and find a quantum algorithm that computes the same solution faster. (And, for those not familiar with computational complexity, “faster” in this context does not mean “absolutely faster”, but with better performance for growing problem sizes, such as more data.)

In our paper “Implementing a distance-based classifier with an interference circuit” (Schuld, Fingerhuth and Petruccione 2017), which was awarded a runner-up placing at the IBM Q Best Paper Award, we turned this approach around. Can we come up with a simple machine learning model, a so-called classifier, that can be implemented by a small quantum circuit? In a nutshell, classifiers are models, functions or algorithms that deduce from sample data points {(x, y)} of inputs x and class labels y to guess y for unseen inputs x.

Example of a supervised learning problem where x contains the pixels of an image and y can be an object shown on the picture. Image credits: Schuld & Petruccione: Supervised learning with quantum computers, Springer 2018/ Gertrud Deres and Caesar.

Idea

The quantum classification circuit we came up with is indeed very short: It consists of a single Hadamard gate and two single qubit measurements acting on an ancilla and a designated class qubit. (One of the measurements involves post-selection, but with reasonable success rates in practice.)

Data preparation

For this classifier to do something useful, we need to first encode the data into the amplitudes of a quantum state by a circuit U_data. In fact, one needs to first normalise the data such that it can be represented as a vector on a high-dimensional Bloch sphere. Quantum routines to encode data in amplitudes, so called arbitrary state preparation routines, are known to do this with a runtime that is linear in the data size, and this is arguably the best our algorithm can do in terms of runtime, since the data is the input to the problem.

The classifier

Once the data has been prepared, the single-gate-circuit can implement a very specific classifier which computes a weighted average of the class labels of all training inputs. The weight is higher the closer the input is to the new input we want to classify.

The strength of the contribution (saturation of color) of a training input to the final prediction depends on the squared distance to the new input (black square). In this case, the blue points are closer and have a stronger contribution, so the classifier will decide that the black square lies in class 1, as one would expect.

In the simplest version “closeness” is measured by the squared Euclidean distance, but different state preparation circuits U_data can change the closeness measure. This is similar to the kernel trick in classical machine learning.

Implementation

We wanted to run the example on the IBM Quantum Experience, which was at the time restricted to a 5-qubit machine. This turned out to be not as easy as we thought, since even for a 2-d example the state preparation circuit, which in principle consists of a handful of gates, was easily swallowed by noise. What Mark came up with was a hand-crafted circuit for two selected input vectors to do a proof-of-principle experiment.

Circuit for a proof-of-principle demo of the Hadamard classifier. Here is the Qiskit code.
Postselection probability p_acc, probability of guessing class 0 and probability of guessing class 1 for experiments with a dataset of a single training vector. The experimented was repeated with two different training vectors x’ and x’’ which were both of class 0. The quantum circuit classifies both correctly, but with high errors.

Even with this carefully crafted algorithm, the circuit theoretically prepared the states with low precision, in addition to the hardware errors. As you can see in the table to the left, the exact values (asterisk) and simulated results with the low-precision state preparation (triangle) where still rather far away from the results generated by the hardware (top number in each cell).

All in all, it seems that even for the world’s smallest classifier we need a little more time until we can do “hardware numerics”. On the upside, things are developing fast, so stay tuned!


If you want to dive a little bit deeper into the technical details, maybe you want to try this example:

Example

Say you have a sample x¹=(0,-1) of class y¹=0 and a sample x²=(-0.789, -0.615) of class y²=1, and you want to know which of the two classes a new data sample x=(-0.948, 0.318) belongs to.

Data for the toy example. We assume preprocessing has already normalised the data to lie on a unit circle. These are in fact the first two features of three vectors of the famous Iris dataset.

Referring to the paper for details, the state to prepare with U_data is proportional to this:

As you might see, two copies of the new input got encoded into the upper half of the superposition marked by the ancilla being 0. The training data is encoded into the lower half, and the class qubit entangled with each data point is set to the class the data point belongs to. The two qubits in the middle register simply index the entries of the data vectors.

Now, the Hadamard gate interferes amplitudes. In our encoding this means it computes the sum and difference of data points.

The postselective measurement “kills” half of the amplitudes.

The probability of the classifier to predict 1 is defined as the probability to measure the class qubit in state |1⟩, which is proportional to the sum of the two amplitudes. Due to the normalisation of the data, this weighs the training points according to their squared distance, which means that close training points of class 1 will contribute strongly to the probability of measuring |1⟩, while points that are further away can only contribute weakly.