Optimizing quantum computations with the quantum natural gradient
The most successful types of quantum algorithms for use on near-term noisy quantum hardware are the so-called variational quantum algorithms: a low-depth parametrized quantum circuit is chosen, and measured expectation values are used to create a cost function. A classical optimization loop is then used to find the circuit parameters that minimize this cost function on the quantum device. Popular examples of such algorithms include the variational quantum eigensolver (VQE), the quantum approximate optimization algorithm (QAOA), and quantum neural networks (QNNs).
Most recent implementations of variational quantum algorithms have used gradient-free or numerical classical optimization methods, such as the Nelder-Mead algorithm. However, the parameter-shift rule (as implemented in PennyLane) allows the user to automatically compute analytic gradients of quantum circuits. This opens up the possibility of training quantum hardware using gradient descent, the same method used to train deep learning models. One caveat, however, has surfaced with gradient descent: how do we choose the optimal step size for each parameter update of our variational quantum algorithms, to ensure successful and efficient optimization?
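As a minimal sketch of what this looks like in PennyLane, the snippet below builds a toy two-qubit circuit and evaluates its analytic gradient with the parameter-shift rule; the device, gates, and parameter values are illustrative choices rather than anything from the paper.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev, diff_method="parameter-shift")
def cost(params):
    # A toy variational circuit: two parametrized rotations and an entangler.
    qml.RY(params[0], wires=0)
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

params = np.array([0.1, 0.2], requires_grad=True)

# Analytic gradient, evaluated via shifted circuit executions for each parameter.
print(qml.grad(cost)(params))
```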
This is a question we tackled in our latest paper, a collaboration with our colleagues James Stokes and Giuseppe Carleo at the Flatiron Institute in New York City. Before we can fully explore the quantum implications, we will need to take a short detour through natural gradient descent, an idea first proposed over two decades ago in machine learning.
The natural gradient
In standard gradient descent, each optimization step is given by

θₜ₊₁ = θₜ - η ∇L(θₜ),

where L(θ) is the cost as a function of the parameters θ, and η is the learning rate or step size. In essence, each optimization step calculates the steepest-descent direction around the local value of θₜ in the parameter space, and updates θₜ to θₜ₊₁ by this vector.
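Using the toy circuit sketched above, a vanilla gradient descent loop with PennyLane's built-in optimizer might look like this (the step size and number of iterations are arbitrary choices):

```python
# Vanilla gradient descent on the toy cost function defined earlier.
opt = qml.GradientDescentOptimizer(stepsize=0.1)

for _ in range(100):
    params = opt.step(cost, params)

print(params, cost(params))
```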
The problem with the above approach is that each optimization step is strongly connected to a Euclidean geometry on the parameter space. The choice of parametrization is not unique, and different parametrizations can distort distances within the optimization landscape.
For example, consider a cost function L(θ) parametrized using two different coordinate systems, (θ₀, θ₁) and (ϕ₀, ϕ₁).
By performing gradient descent in the (θ₀, θ₁) parameter space, we are updating each parameter by the same Euclidean step size, without taking into account the fact that the cost function might vary at a different rate with respect to each parameter. This can cause the optimization to struggle to find the minimum, or miss it altogether.
Instead, if we perform a change of coordinate system (or re-parametrization) of the cost function, we might find a parameter space where variations in L are similar across different parameters. This is the case with the new parametrization (ϕ₀, ϕ₁); the cost function is unchanged, but we now have a nicer geometry in which to perform gradient descent, and a more informative step size. This leads to faster convergence, and can help avoid optimization becoming stuck in local minima.
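As a toy illustration of this effect (not taken from the paper), consider a quadratic cost that is badly scaled in one coordinate system but isotropic after rescaling one coordinate; plain gradient descent behaves very differently in the two parametrizations:

```python
import numpy as np

# The same cost in two coordinate systems:
#   L(theta) = theta0**2 + 100 * theta1**2          (badly scaled)
#   L(phi)   = phi0**2 + phi1**2, with phi1 = 10 * theta1  (isotropic)
def grad_theta(theta):
    return np.array([2 * theta[0], 200 * theta[1]])

def grad_phi(phi):
    return 2 * phi

theta = np.array([1.0, 1.0])
phi = np.array([1.0, 10.0])  # the same starting point, expressed in phi-coordinates

for _ in range(100):
    theta = theta - 0.009 * grad_theta(theta)  # step size limited by the stiff direction
    phi = phi - 0.5 * grad_phi(phi)            # isotropic geometry allows a large step

# theta is still noticeably away from the minimum; phi converged in a single step.
print(np.linalg.norm(theta), np.linalg.norm(phi))
```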
However, what if we avoid gradient descent in the parameter space altogether? If we instead consider the optimization problem as a probability distribution of possible output values given an input (i.e., maximum likelihood estimation), a better approach is to perform the gradient descent in the distribution space, which is dimensionless and invariant with respect to the parametrization. As a result, each optimization step will always choose the optimal step size for every parameter, regardless of the parametrization.
In classical machine learning, the above process is known as natural gradient descent, and was first introduced over two decades ago by Amari (1998). The standard gradient descent update is modified as follows:

θₜ₊₁ = θₜ - η F⁻¹ ∇L(θₜ),

where F is the Fisher information matrix. The Fisher information matrix acts as a metric tensor, transforming the steepest descent in the Euclidean parameter space into the steepest descent in the distribution space.
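Schematically, one natural gradient update then looks like the sketch below, where the Fisher information matrix is assumed to be given; in practice it has to be estimated from the model's output distribution:

```python
import numpy as np

def natural_gradient_step(params, grad, fisher, eta=0.01):
    # Precondition the gradient with the (pseudo-)inverse Fisher information
    # matrix, so that the step is effectively taken in distribution space
    # rather than in the raw parameter space.
    return params - eta * np.linalg.pinv(fisher) @ grad
```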
The quantum natural gradient
In a similar vein, it has been shown that the standard Euclidean geometry is sub-optimal for the optimization of quantum variational algorithms (Harrow and Napp, 2019). In our paper, we leveraged the fact that the space of quantum states instead possesses an invariant metric known as the Fubini-Study metric tensor, represented by a matrix g, which can be used to construct a quantum analog of natural gradient descent:

θₜ₊₁ = θₜ - η g⁺(θₜ) ∇L(θₜ),

where g⁺ refers to the pseudo-inverse of g. The Fubini-Study metric tensor has many interesting properties. For instance, it reduces to the Fisher information matrix when information is encoded in an orthonormal basis. It also has intriguing theoretical connections to imaginary-time evolution variational algorithms, as proposed in McArdle et al. (2018) and Yuan et al. (2018).
Evaluating the metric tensor on quantum hardware
While the full Fubini-Study metric tensor cannot be evaluated on quantum hardware, we can in fact compute a block-diagonal approximation. Luckily for us, it turns out that quantum natural gradient descent using even this block-diagonal approximation can be advantageous over standard (or vanilla) gradient descent.
For example, consider the following variational quantum circuit.
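One possible realization of such a circuit in PennyLane is sketched below. The fixed state-preparation rotations, entangling gates, and measured observable are illustrative assumptions; what matters for the following discussion is the structure: two parametrized layers with two parameters each, the first consisting of rotations about the z-axis.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=3)

@qml.qnode(dev)
def circuit(params):
    # Fixed (non-trainable) state preparation; the angles are arbitrary choices.
    qml.RY(np.pi / 4, wires=0)
    qml.RY(np.pi / 3, wires=1)
    qml.RY(np.pi / 7, wires=2)

    # Parametrized layer 0: two rotations about the z-axis.
    qml.RZ(params[0], wires=0)
    qml.RZ(params[1], wires=1)

    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])

    # Parametrized layer 1: two more single-qubit rotations.
    qml.RY(params[2], wires=1)
    qml.RX(params[3], wires=2)

    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])

    return qml.expval(qml.PauliY(0))
```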
Each parametrized layer corresponds to a block on the diagonal of the Fubini-Study metric tensor, with dimension determined by the number of parameters in that layer. Since there are two layers, each with two parameters, the block-diagonal approximation consists of two 2×2 matrices.
To numerically determine these two matrices, we simply construct subcircuits of the original variational circuit, according to the following two rules:
- Truncate the circuit just before the parametrized layer in question.
- Measure the generators of the gates appearing in that layer.
For example, for the submatrix corresponding to the first parametrized layer, we truncate the circuit just before the first two rotations around the z-axis, and measure the corresponding qubits in the Pauli Z basis. The submatrix g⁽⁰⁾ is then simply the matrix of covariances between these measurement results.
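Continuing with the illustrative circuit above (and reusing its device and imports), the corresponding subcircuit and the resulting block might be written as follows; the factor of 1/4 appears because the generator of an RZ gate is proportional to PauliZ/2.

```python
@qml.qnode(dev)
def layer0_subcircuit():
    # The circuit truncated just before the two RZ rotations of layer 0.
    qml.RY(np.pi / 4, wires=0)
    qml.RY(np.pi / 3, wires=1)
    qml.RY(np.pi / 7, wires=2)
    # The three observables commute, so one measurement setting suffices.
    return (
        qml.expval(qml.PauliZ(0)),
        qml.expval(qml.PauliZ(1)),
        qml.expval(qml.PauliZ(0) @ qml.PauliZ(1)),
    )

z0, z1, z0z1 = layer0_subcircuit()

# Covariance matrix of the layer-0 generators (note that Z**2 is the identity).
g0 = 0.25 * np.array([
    [1 - z0 ** 2, z0z1 - z0 * z1],
    [z0z1 - z0 * z1, 1 - z1 ** 2],
])
```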
A similar procedure is then performed to compute the second submatrix g⁽¹⁾, corresponding to the second parametrized layer.
And that’s it! We can now construct a block-diagonal Fubini-Study metric approximation, and perform quantum natural gradient descent.
This functionality is built into the latest version of PennyLane: simply define your variational circuit QNode, and use the QNode.metric_tensor() method to evaluate the block-diagonal approximation to the Fubini-Study metric tensor directly on quantum hardware. Internally, we also perform some measurement optimization, ensuring that all commuting observables are measured simultaneously. So if your quantum circuit contains L parametrized layers, then this calculation only requires L measurement settings!
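For example, using the illustrative circuit from above (note that the exact entry point depends on your PennyLane version: recent releases expose the same functionality as the qml.metric_tensor transform, which is what the sketch below uses; the parameter values are arbitrary):

```python
params = np.array([0.11, 0.12, 0.13, 0.14], requires_grad=True)

# Block-diagonal approximation of the Fubini-Study metric tensor,
# evaluated by running the subcircuits described above.
g = qml.metric_tensor(circuit, approx="block-diag")(params)
print(np.round(g, 6))
```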
Quantum natural gradient optimization
In addition to the new metric tensor functionality, the latest version of PennyLane also ships with a new QNGOptimizer, which automates the process of performing quantum natural gradient descent. Simply feed in your variational circuit cost function and your initial parameters, and the optimizer will automatically compute the metric tensor at each step. This can potentially speed up the convergence of your variational algorithms and help avoid getting stuck in local minima.
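A minimal sketch, again reusing the illustrative circuit from above (the step size and number of iterations are arbitrary choices):

```python
opt = qml.QNGOptimizer(stepsize=0.01)

params = np.array([0.11, 0.12, 0.13, 0.14], requires_grad=True)

for step in range(200):
    # Each step re-evaluates the block-diagonal metric tensor and returns
    # the updated parameters along with the cost before the step.
    params, prev_cost = opt.step_and_cost(circuit, params)
    if step % 50 == 0:
        print(f"Step {step:3d}: cost = {prev_cost:.6f}")

print("Optimized cost:", circuit(params))
```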
For more details, including example code for generating the Fubini-Study metric tensor and using it to optimize a variational quantum circuit, be sure to check out our extended quantum natural gradient tutorial. For the mathematical specifics, including detailed derivations showing that the quantum natural gradient reduces to the classical natural gradient in the classical limit, the best resource is our paper, available on the arXiv.
Side note: something exciting is coming soon! Keep an eye on https://qhack.ai for more details over the coming weeks.