Layerwise learning for Quantum Neural Networks with Qiskit
By Gopal Ramesh Dahale
Barren plateaus make it exceedingly difficult to train the quantum circuits we use in quantum neural networks (QNNs). In this blog post, we take a look at how layerwise learning for QNN training, as introduced in [1], can help.
Layerwise learning for QNN training involves the gradual addition of individual circuit components during the training process. The circuit’s structure and associated parameters grow step-by-step during training. At the same time, we constrain randomization effects to specific parameter subsets throughout all training stages. This approach not only circumvents the issue of initiating training on a plateau but also diminishes the likelihood of accidentally reaching a plateau during the training process.
In the article below, we’ll explain how we conducted layerwise learning for QNN training using Qiskit. But before we do that, let’s first take a closer look at barren plateaus—what they are, and why they pose a problem for QNNs.
Barren Plateaus
Barren plateaus are large, flat regions of the quantum circuit training landscape in which both the gradient and higher-order derivatives vanish [2]. This phenomenon is akin to the vanishing-gradient problem in classical neural networks. Put simply, when parameterized quantum circuits (PQCs) are sufficiently deep and randomly configured, the resulting expectation values concentrate around the same value regardless of the parameters. As a result, the partial derivatives of an objective function built from these expectation values have a mean close to zero and a vanishingly small variance. These characteristics make it extremely difficult to train quantum circuits using gradient-based methods on real hardware.
The figure above illustrates how the variance of partial derivatives decays exponentially as the qubit and layer counts increase, for circuits of various sizes with pairwise CZ entanglement in each layer. The variance of the gradients was computed across 500 random circuits. It's worth noting that an initial layer of Ry(π/4) gates was included in each circuit to avoid biasing the gradients through initialization in the all-zero state. For each qubit count, the variance converges as the number of layers grows, and the converged value decays exponentially with the number of qubits.
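To make the setup concrete, here is a minimal sketch of how such a variance estimate can be produced with statevector simulation and the parameter-shift rule. The circuit construction (random Ry rotations and pairwise CZ entanglement after the Ry(π/4) layer), the helper names, and the sample count are illustrative, not the exact code behind the figure.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit import Parameter
from qiskit.quantum_info import SparsePauliOp, Statevector

def random_layered_circuit(num_qubits, num_layers, rng):
    """Ry(pi/4) initial layer, then layers of random Ry rotations followed by
    pairwise CZ entanglement. One rotation stays symbolic so we can
    differentiate with respect to it."""
    theta = Parameter("θ")
    qc = QuantumCircuit(num_qubits)
    qc.ry(np.pi / 4, range(num_qubits))
    for layer in range(num_layers):
        for q in range(num_qubits):
            if layer == 0 and q == 0:
                qc.ry(theta, q)  # the parameter we differentiate
            else:
                qc.ry(rng.uniform(0, 2 * np.pi), q)
        for q in range(0, num_qubits - 1, 2):
            qc.cz(q, q + 1)
        for q in range(1, num_qubits - 1, 2):
            qc.cz(q, q + 1)
    return qc, theta

def expectation(qc, theta, value, observable):
    return Statevector(qc.assign_parameters({theta: value})).expectation_value(observable).real

rng = np.random.default_rng(0)
num_qubits, num_layers, num_samples = 4, 20, 100  # the figure uses 500 samples
observable = SparsePauliOp("I" * (num_qubits - 2) + "ZZ")  # ZZ on the first two qubits

grads = []
for _ in range(num_samples):
    qc, theta = random_layered_circuit(num_qubits, num_layers, rng)
    value = rng.uniform(0, 2 * np.pi)
    # Parameter-shift rule: d<O>/dθ = (<O>(θ + π/2) - <O>(θ - π/2)) / 2
    grads.append(0.5 * (expectation(qc, theta, value + np.pi / 2, observable)
                        - expectation(qc, theta, value - np.pi / 2, observable)))

print("Var[∂<ZZ>/∂θ] ≈", np.var(grads))
```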
To help us visualize the problem and get a sense of what is happening in the cost landscape, let's look at two circuits, each with 50 layers, containing 4 and 12 qubits respectively. We will use the circuit shown below, which has only two trainable parameters, θ and φ. We take ZZ on the first two qubits as the observable and visualize the landscape for different values of θ and φ.
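A rough sketch of how such a landscape can be computed is shown below. Since the exact circuit from the figure isn't reproduced here, we assume the two trainable parameters sit in the first rotation layer and all remaining rotations are fixed at random values, with pairwise CZ entanglement between layers; changing `num_qubits` to 12 reproduces the flattening discussed next.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit import Parameter
from qiskit.quantum_info import SparsePauliOp, Statevector

def landscape_circuit(num_qubits, num_layers, rng):
    """Illustrative circuit: θ and φ in the first layer, all remaining
    rotations fixed at random values, pairwise CZ entanglement."""
    theta, phi = Parameter("θ"), Parameter("φ")
    qc = QuantumCircuit(num_qubits)
    qc.ry(theta, 0)
    qc.ry(phi, 1)
    for q in range(2, num_qubits):
        qc.ry(rng.uniform(0, 2 * np.pi), q)
    for _ in range(num_layers - 1):
        for q in range(0, num_qubits - 1, 2):
            qc.cz(q, q + 1)
        for q in range(1, num_qubits - 1, 2):
            qc.cz(q, q + 1)
        for q in range(num_qubits):
            qc.ry(rng.uniform(0, 2 * np.pi), q)
    return qc, theta, phi

rng = np.random.default_rng(42)
num_qubits, num_layers = 4, 50
qc, theta, phi = landscape_circuit(num_qubits, num_layers, rng)
observable = SparsePauliOp("I" * (num_qubits - 2) + "ZZ")  # ZZ on the first two qubits

# Evaluate <ZZ> on a grid of (θ, φ) values to map out the cost landscape,
# which can then be plotted, e.g. with matplotlib's contourf.
grid = np.linspace(0, 2 * np.pi, 30)
landscape = np.array([
    [Statevector(qc.assign_parameters({theta: t, phi: p})).expectation_value(observable).real
     for p in grid]
    for t in grid
])
```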
For the 4-qubit circuit, the cost landscape is trainable. However, for the 12-qubit circuit, much of the cost landscape flattens out and becomes difficult to train. This effect worsens as the number of qubits and layers increases.
Layerwise Learning
The LL algorithm is characterized by two distinct phases:
Phase I
In the initial phase of the algorithm, we build the ansatz by progressively adding layers. We start with a circuit of s layers whose parameters are initialized to zero. We train this circuit for a fixed number of epochs, after which we add another set of layers and freeze the parameters of the previous layers. Which parameters are optimized depends on two hyperparameters, p and q: p governs the number of layers added in each step, while q determines how many of the most recent layers remain trainable before earlier ones are frozen. For instance, with p = 2 and q = 4, two layers are appended in each step, and layers more than four behind the current one are frozen.
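Here is a minimal, self-contained sketch of phase I on a small toy problem (4 qubits, p = q = 2, so only the newly added layers are trainable). The cost function (the expectation of ZZ on the first two qubits) and the plain parameter-shift gradient descent are stand-ins for a real loss and optimizer; the helper names are mine.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit.library import TwoLocal
from qiskit.quantum_info import SparsePauliOp, Statevector

num_qubits = 4
observable = SparsePauliOp("I" * (num_qubits - 2) + "ZZ")

def cost(circuit, values):
    """Toy cost: expectation of the observable for the given parameter values."""
    return Statevector(circuit.assign_parameters(values)).expectation_value(observable).real

def train(circuit, values, trainable, epochs=20, lr=0.1):
    """Parameter-shift gradient descent on the trainable parameters only."""
    for _ in range(epochs):
        for prm in trainable:
            plus, minus = dict(values), dict(values)
            plus[prm] += np.pi / 2
            minus[prm] -= np.pi / 2
            grad = 0.5 * (cost(circuit, plus) - cost(circuit, minus))
            values[prm] -= lr * grad
    return values

# Phase I: grow the ansatz by p layers per step; with q = p, only the newly
# added layers are trainable and all earlier layers stay frozen.
p, steps = 2, 4
ansatz = QuantumCircuit(num_qubits)
values = {}

for step in range(steps):
    block = TwoLocal(num_qubits, "ry", "cz", "linear", reps=p,
                     parameter_prefix=f"s{step}", skip_final_rotation_layer=True)
    ansatz = ansatz.compose(block)
    values.update({prm: 0.0 for prm in block.parameters})  # new layers start at zero
    values = train(ansatz, values, trainable=list(block.parameters))
    print(f"step {step}: {ansatz.num_parameters} parameters, cost = {cost(ansatz, values):.4f}")
```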
Phase II
The second phase of the algorithm involves further training of the pre-trained circuit from phase I. Here, larger contiguous partitions of layers are trained simultaneously. A hyperparameter r specifies the percentage of parameters trained within a single step. We train each partition alternately until convergence, so at any one time we are training a larger portion of the circuit than in phase I. By constraining randomness to shallower sub-circuits throughout the entire training process, the algorithm also reduces the likelihood of encountering barren plateaus that could arise from stochastic or hardware noise during sampling.
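Continuing the toy example from the phase I sketch above (reusing `ansatz`, `values`, `cost`, and `train`), phase II with r = 0.5 amounts to splitting the parameters into two contiguous halves and training them alternately:

```python
# Phase II: split the trained circuit's parameters into contiguous partitions
# (here r = 0.5, i.e. two halves) and train them alternately, keeping the
# other partition fixed.
all_params = list(ansatz.parameters)
half = len(all_params) // 2
partitions = [all_params[:half], all_params[half:]]

num_sweeps = 4  # in practice, alternate until convergence
for sweep in range(num_sweeps):
    for part in partitions:
        values = train(ansatz, values, trainable=part)
    print(f"sweep {sweep}: cost = {cost(ansatz, values):.4f}")
```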
In contrast to layerwise learning, traditional complete depth learning (CDL) trains all parameters together. In a noisy environment, however, a single unfavourable update can affect the entire circuit and trap it in a barren plateau.
Binary Classification
Let’s use Qiskit to look at an example of layerwise learning (LL) in action for the task of binary classification of MNIST digits—specifically the classification of digits six and nine.
To encode the training data into the quantum circuit, we use angle encoding. First, we run principal component analysis (PCA) on the data and keep the top 8 principal components, i.e. those with the highest variance. These components are scaled to lie within [0, 2π] and encoded into a data layer consisting of Rx gates.
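A sketch of this preprocessing and encoding step is shown below. As a lightweight, runnable stand-in for MNIST, it uses scikit-learn's 8×8 digits dataset filtered to the classes 6 and 9; with the full MNIST data the pipeline is the same. Names such as `feature_map` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector

# Stand-in for MNIST: scikit-learn's 8x8 digits dataset, filtered to 6 and 9.
digits = load_digits()
mask = np.isin(digits.target, [6, 9])
X = digits.data[mask]
y = (digits.target[mask] == 9).astype(int)  # labels: 1 for digit 9, 0 for digit 6

# Reduce to the top 8 principal components and scale them to [0, 2π].
X_reduced = PCA(n_components=8).fit_transform(X)
X_scaled = MinMaxScaler(feature_range=(0, 2 * np.pi)).fit_transform(X_reduced)

# Data layer: one Rx rotation per feature (angle encoding).
num_qubits = 8
x = ParameterVector("x", num_qubits)
feature_map = QuantumCircuit(num_qubits)
for q in range(num_qubits):
    feature_map.rx(x[q], q)

# Bind the first sample's features to the data layer.
encoded = feature_map.assign_parameters(dict(zip(x, X_scaled[0])))
```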
For the ansatz, we use the TwoLocal circuit template from Qiskit's circuit library. In this configuration, we add two layers in each step and freeze the parameters of layers that are more than two behind the current one, so only two layers are trained at a time. Each set of layers is optimized for 20 epochs, and the process is repeated eight times, giving a cumulative depth of 16 layers. The starting layer is shown below.
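A sketch of the starting block, composed with the data layer from the previous snippet, might look as follows; it mirrors the phase I sketch above, and the exact rotation and entanglement choices in the figure may differ.

```python
from qiskit.circuit.library import TwoLocal

# Starting ansatz: the first p = 2 layers of Ry rotations with linear CZ
# entanglement; later steps append further TwoLocal blocks in the same way.
start_block = TwoLocal(num_qubits, "ry", "cz", "linear", reps=2,
                       parameter_prefix="s0", skip_final_rotation_layer=True)

# Full model circuit for the first training step: data layer + starting ansatz.
qnn_circuit = feature_map.compose(start_block)
print(qnn_circuit.draw())
```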
In Phase II, the circuit is divided into two halves that are trained alternately. For parameter updates, we use the Adam optimizer. The plots below show the convergence of the loss during both phases of training.
Conclusion
We have observed the effects of barren plateaus within the training landscapes of Quantum Neural Networks (QNNs), and we’ve explored how the layerwise learning approach effectively addresses this challenge.
When evaluated with noiseless simulations and exact analytical gradients, layerwise learning (LL) and complete depth learning (CDL) demonstrate comparable performance. However, LL exhibits superior results on average under experimentally realistic measurement strategies. This advantage is twofold: firstly, LL mitigates excessive randomization, and secondly, it concentrates the training gradient contributions into a smaller number of parameters.
To sum up, layerwise learning significantly enhances the likelihood of successfully training a QNN. This attribute is particularly valuable when applied to Noisy Intermediate-Scale Quantum (NISQ) devices.
The full code is available on GitHub.
References
[1] Skolik, A., McClean, J. R., Mohseni, M. et al. Layerwise learning for quantum neural networks. Quantum Mach. Intell. 3, 5 (2021). https://doi.org/10.1007/s42484-020-00036-4
[2] McClean, J. R., Boixo, S., Smelyanskiy, V. N. et al. Barren plateaus in quantum neural network training landscapes. Nat Commun 9, 4812 (2018). https://doi.org/10.1038/s41467-018-07090-4