Self-Supervised Quantum Machine Learning of Production Data with Amazon Braket

Published in awsblackbelt · Nov 6, 2023

Written by Dr. Mykhailo Saienko, Sichen Zhang, Dr. Mike Kohlhoff

Cracks and rough surface on cast materials. Image by Freepik

Leveraging Industrial Data with Quantum Computing

Though classical machine learning methods are ubiquitous in industrial applications, their performance can become costly or unstable under certain circumstances, such as when handling unannotated data. Quantum algorithms, specifically the Quantum Approximate Optimization Algorithm (QAOA), can be used efficiently to alleviate the challenges of classical deep machine learning methods. This combination of classical and quantum methods, known as Quantum Machine Learning, can be an ideal, though challenging, compromise. In this blog post we present how QAOA can be incorporated to improve the performance of a self-supervised machine learning algorithm.

With this explorative approach we offer a solution blueprint for our customers to evaluate and demonstrate industry-ready use cases. We identify problems relevant to our customers where combining classical data processing with the power of quantum algorithms promises to be beneficial.

A Hybrid Quantum Version of a Contrastive Learning Algorithm

We have used QAOA to modify the contrastive learning model called SwAV (Swapping Assignments between multiple Views, see the original paper). SwAV is an unsupervised algorithm in which predictions of contrasting feature representations are swapped for comparison during the training process. Here, the modification is to replace the classical cluster assignment in the training process with a QAOA-based (or, for small instances, brute-force) solver. The modified SwAV model is trained on a public image dataset containing materials with and without cracks (see the Kaggle dataset). The performance of the modified model is compared to the original SwAV model, and the results are promising: while the classical SwAV may take up to 15–20 epochs before it converges, the modified models deliver comparable or better results after 1 or 2 training epochs in the majority of runs when the modified code is executed on a simulated circuit.

In this blog post, however, we don’t want to focus on the details of the training procedure but rather elaborate on how we managed to:

  • adapt the basic algorithm to solve weighted Maximum Cut problems;
  • optimize it with respect to the number of tasks and shots;
  • extend it so as to accept an arbitrary problem size by means of a divide-and-conquer strategy.

Although there exist classical approximation algorithms that currently outperform the QAOA-based algorithms we use, we show in this blog post that the QAOA-based algorithms exhibit very reliable performance in practical applications such as ours. Furthermore, QAOA-based algorithms might eventually exceed the best classically attainable approximation ratios, provided the underlying quantum circuits are large enough. Once larger and more precise quantum devices become available, we may re-evaluate these bounds and their practical consequences in the machine learning setting.

Classical Background: Contrastive Learning

Binary classification problems are ubiquitous in industry. From deciding on the creditworthiness of a loan applicant to finding which molecules have a mutagenic effect on cells to quality assurance in production, binary decisions permeate virtually every element of the value chain across all industrial sectors.

Many binary problems are “easy” in the sense that they can be solved with a basic supervised learning approach: labelled data are available at no or low cost, the two classes are equally represented in the dataset, and the features crucial for classification are easy to compute or estimate yet distinctive enough for the binary classifier to achieve acceptable classification performance. In some cases, however, one may be able to produce a reasonable number of unlabeled images, yet it may be costly or impossible to label them.

Such lack of labelled data can be compensated for by using self-supervised techniques such as contrastive representation learning. This technique produces networks that output for a given data point — an image in our case — a small set of abstract “features” containing all relevant information about the data point. These features are subsequently used as input by a much smaller “downstream” model, such as a linear classifier, to solve the original task.

To ensure that the features are suitable for classification tasks, self-supervised contrastive techniques such as MoCo, DeepCluster, SimCLR, or SwAV usually split all data points in a batch into a pre-configured number of clusters and minimize the distance (in the feature space) between data points belonging to the same cluster while simultaneously maximizing the distance between data points from different clusters. The number of clusters is rather high, typically ranging between 64 and 256, so that batch sizes of 256 images are considered “small” (see the original SwAV paper).

To avoid learning the so-called collapsed representation, in which all data points are assigned to one cluster and all features are near-constant for all data points (which would yield the minimal loss of zero), the clustering algorithm is configured to assign the same number of data points to each cluster. However, enforcing such an equipartition on a heavily imbalanced dataset can be difficult.
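SwAV and related methods enforce this constraint with a few Sinkhorn-Knopp normalization steps (the classical cluster-assignment algorithm we replace below). The following is a minimal illustrative sketch of that iteration, not the SwAV reference implementation:

import numpy as np

def sinkhorn_equipartition(scores: np.ndarray, n_iters: int = 3) -> np.ndarray:
    """Softly assign a batch of B images (rows) to K clusters (columns) such
    that each image is fully assigned and every cluster receives the same
    total mass. Minimal sketch of the Sinkhorn-Knopp iteration used by
    SwAV-style methods, not the SwAV reference implementation."""
    q = np.exp(scores)  # non-negative assignment weights from raw scores
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)  # push clusters toward equal total mass
        q /= q.sum(axis=1, keepdims=True)  # make each image's assignment sum to 1
    return q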

Quantum SwAV: Clustering with Brute Force and QAOA

We found an elegant way of addressing cluster assignment by framing it in terms of an optimized partitioning of a graph. We replaced the classical cluster-assignment algorithm (e.g., Sinkhorn-Knopp) with an approach called Weighted Maximum Cut (or MaxCut for short). This approach interprets the N data points in a batch — or rather the N arrays of produced features which represent the data points in the batch — as vertices in a graph, while the weighted edges represent normalized dissimilarities between pairs of points. Edges between identical points have a weight of zero while those between two maximally different points have a weight of 1. In our case, we compute the absolute value of the dot product between the output features of each pair of points in a batch and normalize all such values by the maximal value across the entire batch.
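As a concrete illustration, such a weight matrix could be assembled as follows. This is a sketch only; in particular, the final 1 - s step turning the normalized similarity into a dissimilarity is our reading of the description above, which assigns weight 0 to identical points and weight 1 to maximally different ones:

import numpy as np

def batch_to_weights(features: np.ndarray) -> np.ndarray:
    """Build the N x N edge-weight matrix of the MaxCut graph from a batch
    of N feature vectors (one row per data point)."""
    # Absolute pairwise dot products between the output features
    sims = np.abs(features @ features.T)
    # Normalize by the maximal value across the entire batch
    sims = sims / sims.max()
    # Turn the normalized similarity into a dissimilarity (assumption, see above)
    weights = 1.0 - sims
    # Zero the diagonal: self-edges play no role in a cut
    np.fill_diagonal(weights, 0.0)
    return weights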

The goal of the MaxCut approach is to partition the vertices of the graph into two subsets so that the sum of the weights, or dissimilarities, of the edges running between the two subsets is maximal. The two vertex subsets produced are not bound to be of the same size across batches in the dataset, but each cluster is guaranteed to contain at least one vertex. This removes the equipartition problem while keeping the risk of obtaining a collapsed representation low.

Figure 1: Example of a graph cut separating vertices 1 and 2 from vertices 3, 4, and 5. The weight sum of the cut edges is 4+2+5=11. This is not the maximum cut, however: clustering vertices 1, 2, and 5 together would yield a cut with a total sum of 4+2+5+6=17.

However, finding the optimal MaxCut solution is an NP-hard problem, meaning that the brute-force method of going through all possible vertex bipartitions of the graph has exponential runtime. Worse still, the MaxCut problem is also APX-hard, i.e., unless P = NP, there is no polynomial-time algorithm which gets arbitrarily close to the optimal solution. The best polynomial-time solution is the Goemans-Williamson algorithm with an approximation ratio of around 0.878, which is conjectured to be the highest classically attainable approximation ratio for MaxCut. Furthermore, since the edge weights are strictly non-negative, we cannot simply negate all weights and transform the problem into an instance of the weighted minimum cut problem, for which more efficient solutions exist.

The brute-force method is quite fast for smaller MaxCut instances with up to 20 data points. However, since we normally deal with batches of 256 points, we cannot rely on the brute-force method alone.
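For small graphs, such a brute-force solver fits in a few lines. The following sketch assumes the weight-matrix representation from above; fixing vertex 0 in one subset halves the 2^N search space:

import itertools
import numpy as np

def brute_force_maxcut(weights: np.ndarray):
    """Exhaustively search all bipartitions and return the one with the
    maximal cut value. Exponential in the number of vertices, hence only
    practical for small graphs (roughly N <= 20)."""
    n = weights.shape[0]
    best_value, best_assignment = -1.0, None
    # Fixing vertex 0 in subset 0 avoids counting every cut twice
    for bits in itertools.product([0, 1], repeat=n - 1):
        assignment = (0,) + bits
        # Sum the weights of all edges whose endpoints fall into different subsets
        value = sum(
            weights[i, j]
            for i in range(n)
            for j in range(i + 1, n)
            if assignment[i] != assignment[j]
        )
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment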

Luckily, the MaxCut problem can be encoded in the Hamiltonian of a special quantum system, which can be optimized efficiently on a quantum computer using the Quantum Approximate Optimization Algorithm — or QAOA, for short (see this Pennylane QAOA Tutorial for more details). QAOA discretizes the process of finding the minimal energy of a quantum state into a given number p of steps by applying the Trotter product formula to the Hamiltonian. The higher p is, the more precisely the minimal energy is computed. In the original paper, Farhi estimated the approximation ratio for the MaxCut problem on 3-regular graphs to be around 0.6924 for p=1 and at least 0.7559 for p=2, although the bound is likely to be significantly higher on non-bipartite graphs. Larger p-values increase the bound further, but it is currently not known how high p must be to achieve consistently higher approximation ratios than the Goemans-Williamson algorithm.

This rendered the QAOA-powered MaxCut solution more interesting to us than the well-understood Goemans-Williamson algorithm, so we decided to investigate QAOA’s behavior before resorting to Goemans-Williamson, using the optimal brute-force solution as a benchmark against which to measure the QAOA-based method.

Weighted MaxCut and QAOA: Technical Details and Service Infrastructure

At the time of writing this blog post, there was no suitable package providing an efficient QAOA-based (weighted) MaxCut solver, so we decided to implement one from scratch.

Figure 2: Infrastructure for the quantum experiments on AWS services.

The starting point was the above-mentioned Pennylane tutorial on QAOA for solving basic (unweighted) MaxCut problems. However, while the tutorial exposes the inner workings of the QAOA method in quite some detail, its code had to be significantly modified for our use case.

First, the code had to be generalized to solve weighted MaxCut problems. This was done by extending the U_C(gamma) function to accept weighted Hamiltonians. As shown in the following code listing, this amounted to changing the one line of code which governs the extent of the Z-rotation operation:

# unitary operator U_C
def U_C(gamma, distances, graph):
    for edge in graph:
        wire1 = edge[0]
        wire2 = edge[1]

        qml.CNOT(wires=[wire1, wire2])
        # The edge weight scales the Z-rotation: this is the one-line change
        # compared to the unweighted version, which uses qml.RZ(gamma, wires=wire2)
        qml.RZ(distances[wire1, wire2] * gamma, wires=wire2)
        qml.CNOT(wires=[wire1, wire2])
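For completeness: the circuits below also use the mixer unitary U_B, which is unchanged from the tutorial and not reproduced there in this post. A definition consistent with the tutorial’s X-rotation mixer and the signature used in our code would look like this:

# unitary operator U_B: the QAOA mixer layer of single-qubit X-rotations
def U_B(beta, n_wires):
    for wire in range(n_wires):
        qml.RX(2 * beta, wires=wire)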

Second, the implementation provided in the tutorial is not scalable, as it lacks several crucial code optimizations. The most critical issue was that the original code executed the circuit once for each edge in the graph while computing the expected value of the Hamiltonian or sampling the cuts.

def objective(params):
    gammas = params[0]
    betas = params[1]
    neg_obj = 0
    for edge in graph:
        # objective for the MaxCut problem
        neg_obj -= 0.5 * (1 - circuit(gammas, betas, edge=edge, n_layers=n_layers))
    return neg_obj

Third, the way the circuit was configured, it always computed the “empirical” expected value rather than the analytical one, i.e., it sampled N cuts (N being the number of shots) based on a configured Hamiltonian, computed their values, and returned the average as the expected value.

Having only one shot,

dev = qml.device("default.qubit", wires=n_wires, shots=1) 

the circuit from the tutorial would sample one cut from the configured Hamiltonian, compute its value and return it as the expected value.

Furthermore, the same circuit was used to produce “empirical” probabilities of cuts in a rather inefficient manner: it was called 100 times with one shot each to sample single cuts, and the cut with the highest occurrence count was picked as the potential maximum cut.

n_samples = 100
for i in range(0, n_samples):
    bit_strings.append(bitstring_to_int(circuit(params[0], params[1], edge=None, n_layers=n_layers)))

We have addressed all those issues by introducing an objective function which calls the circuit exactly once to compute the analytical expected cut value without any shots, and by producing the most probable cut with a single circuit execution, without any manual sampling loops.

# Compute expectation value analytically rather than by computing cuts
dev_exp = qml.device(device, wires=n_wires, shots=None)
# Computing probabilities can only be done empirically, but use N shots rather than N tasks
dev_sample = qml.device(device, wires=n_wires, shots=n_shots)

def exp_circuit(graph, distances, gammas, betas, n_layers=1):
    # Apply Hadamards to get the n-qubit |+> state
    for wire in range(n_wires):
        qml.Hadamard(wires=wire)
    # p instances of unitary operators
    for i in range(n_layers):
        U_C(gammas[i], distances, graph)
        U_B(betas[i], n_wires)

    # The full observable is the sum of sigma_z^j sigma_z^k over all edges (j, k) in the graph
    sum_of_z_values = None
    for (i, j) in graph:
        if sum_of_z_values is None:
            sum_of_z_values = qml.PauliZ(i) @ qml.PauliZ(j)
        else:
            sum_of_z_values += qml.PauliZ(i) @ qml.PauliZ(j)
    # Group the Hamiltonian terms to enable a one-task, no-shot execution afterwards
    sum_of_z_values.compute_grouping()

    # During the optimization phase we evaluate the objective using expval
    return qml.expval(sum_of_z_values)

def sample_circuit(graph, distances, gammas, betas, n_layers=1):
    # Apply Hadamards to get the n-qubit |+> state
    for wire in range(n_wires):
        qml.Hadamard(wires=wire)
    # p instances of unitary operators
    for i in range(n_layers):
        U_C(gammas[i], distances, graph)
        U_B(betas[i], n_wires)

    # Compute cut probabilities directly instead of sampling them. Results in
    # one task with N shots and is executed much faster on simulators
    return qml.probs(wires=range(n_wires))

node_exp = qml.QNode(exp_circuit, dev_exp)
node_sample = qml.QNode(sample_circuit, dev_sample)

def objective(params):
    gammas = params[0]
    betas = params[1]

    # The expectation value of sum_(j,k) sigma_z^j sigma_z^k, computed for the
    # entire graph in a single call rather than once per edge
    exp_value = node_exp(graph, distances, gammas, betas, n_layers=n_layers)
    neg_obj = 0.5 * (exp_value - len(graph))
    return neg_obj
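To show how these pieces fit together, here is an illustrative driver loop. Only the optimizer type (AdaGrad) and the rough number of steps are fixed by what we describe in the results section below; the parameter initialization and stepsize are assumptions:

from pennylane import numpy as np

# Illustrative only: one gamma and one beta per QAOA layer
params = 0.01 * np.random.rand(2, n_layers, requires_grad=True)
opt = qml.AdagradOptimizer(stepsize=0.5)
for step in range(15):
    params = opt.step(objective, params)

# One sampling task with n_shots shots: read off the most probable cut
probs = node_sample(graph, distances, params[0], params[1], n_layers=n_layers)
best_index = int(np.argmax(probs))
cut = [int(bit) for bit in format(best_index, f"0{n_wires}b")]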

However, even with the more efficient QAOA-based MaxCut solver, it takes more than a minute on a simulator to find a MaxCut for a graph with more than 20–22 nodes (i.e., to perform clustering for 20–22 images at a time). Furthermore, the runtime on a simulator increases essentially exponentially with the graph size, so that solving a MaxCut problem for a 30-node graph takes between 4 and 10 hours depending on the graph and the simulator. Considering that our batch sizes varied between 128 and 512, it was unrealistic to find the MaxCut solutions for an entire batch in one go.

To circumvent this problem, we devised a heuristic which splits the entire batch into N samples of a pre-configured size (this parameter is called sample_size in the code), finds the MaxCut solution using QAOA — or brute force if configured — on each of those samples, and then merges the clusters across the samples based on their average distance to each other, as sketched below.

Figure 3: Illustration of the cluster-merging heuristic on a presumed set of labels. Suppose we found clusters 1A and 1B on the first sample and 2A and 2B on the second. With no labels available, it is unknown which clusters are “ok” and which are “nok”. However, we can estimate which cluster from the first sample is closest to which cluster from the second. To this end, we compute the average distance between the images in 1A and those in 2A and repeat the procedure for (1A, 2B), (1B, 2A), and (1B, 2B). If the sum of the distances (1A, 2A) and (1B, 2B) is lower than that of (1A, 2B) and (1B, 2A), we merge 1A with 2A and 1B with 2B. Otherwise, we merge 1A with 2B and 1B with 2A. In the illustrated case, it is visually evident that 1A will be merged with 2B while 1B will be merged with 2A.
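Expressed as code, the pairing rule reads roughly as follows (an illustrative sketch: avg_distance and the cluster arrays are hypothetical stand-ins for our actual implementation):

import numpy as np

def avg_distance(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    # Mean pairwise dissimilarity between the feature vectors of two clusters
    # (hypothetical helper, mirroring the edge weights of the MaxCut graph)
    return float(np.mean([1.0 - abs(x @ y) for x in cluster_a for y in cluster_b]))

def merge_two_samples(c1a, c1b, c2a, c2b):
    """Merge the two clusters found on the second sample into those of the
    first, choosing the pairing with the lower total average distance."""
    straight = avg_distance(c1a, c2a) + avg_distance(c1b, c2b)
    crossed = avg_distance(c1a, c2b) + avg_distance(c1b, c2a)
    if straight <= crossed:
        return np.concatenate([c1a, c2a]), np.concatenate([c1b, c2b])
    return np.concatenate([c1a, c2b]), np.concatenate([c1b, c2a])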

This heuristic, combined with our efficient QAOA-based MaxCut solver, enables us to apply the MaxCut-based clustering method to any number of images.

Quantum Experiment Results and Outlook

Before the above-mentioned optimizations, it was difficult to solve the MaxCut problem for more than 20 vertices.

With the AdaGrad optimizer we required around 15 optimization steps to converge to near-optimal results. Even with one QAOA layer, i.e., only one beta and one gamma parameter to optimize, we needed to evaluate the objective function up to six times per optimization step: once to evaluate its value, once to compute the probabilities of different solutions, and up to four times to compute the gradients. Furthermore, since the circuit was executed for each edge of the graph, and the graph was in most cases fully connected, an essentially quadratic number of tasks with respect to the number of vertices was required to compute the objective function once. All in all, we needed around 15 * 5 * 400 = 30,000 tasks to complete the optimization. Considering that essentially all circuits were executed “empirically” with 100 shots or more, this resulted in nearly 3 million simulation runs for each instance of the MaxCut problem.

In the optimized version of the code, we have reduced the number of tasks required for executing the objective function or computing the probabilities of different solutions from quadratic in the number of vertices to exactly 1 (albeit the circuit evaluated is now more complex). Moreover, we have turned all but one circuit call from empirical, i.e., requiring 100 or more shots, into purely analytical ones.

This has brought the computation time on the local simulator on an ml.g4dn.xlarge instance down from multiple days to around 12 minutes for graphs with 20 vertices. For smaller graphs with 8 vertices, the computation time varied between 2 and 3 seconds. With these computation times, one can apply the divide-and-conquer heuristic with a sample size of up to 20 to solve the MaxCut problem for graphs of almost any size within reasonable time.

Nonetheless, these runtimes are still too high for this method to be used routinely in critical loops such as training neural nets with SwAV on larger datasets. For example, compared to the original routines, the QAOA-based SwAV training is still around 50–60 times slower on an epoch-by-epoch basis. At the same time, real NISQ devices still yield results that are too noisy to be of practical use.

But once mid-sized quantum devices become more commonplace, we may achieve potentially even better approximation ratios than Goemans-Williamson while running in a fraction of the latter’s time. This may be a possible way to show quantum supremacy for specific practical tasks that are relevant in industrial applications.
