Quantum Natural Gradient From the Ground Up

A more “natural” way of optimizing VQEs.

Lana Bozanic
8 min read · May 6, 2020

None of us are strangers to optimization. You probably learned the concept of optimization in a beginner calculus course — it’s simply solving for a maximum or minimum of a given function.

Turns out, this simple concept is very important to machine learning, specifically when it comes to minimizing a given cost function, which in turn improves the performance of a given model. In fact, it’s so important that we’re always trying to find new ways to efficiently optimize our models.

And quantum machine learning (QML) is no different — well, actually it’s significantly different, since we’re dealing in a completely different space when optimizing our algorithms. But for the longest time we’ve been using classical optimization methods that are built to take advantage of classical geometric spaces.

Don’t get me wrong, these methods of optimization have been doing pretty well for us; the thing is we can do a lot better.

Since variational algorithms have shown a lot of promise for near-term NISQ computation, it’s important to find better ways to optimize them.

That’s why today, we’re gonna take a deep dive into the Quantum Natural Gradient Descent (as presented in Stokes et al. (2019)) and how taking advantage of a parametrized quantum space allows for a more efficient method of optimization, compared to ordinary gradient descent.

But before we dive into all the exciting quantum stuff… we should probably take a step back and understand how classical gradient descent works.

Plain ol’ Gradient Descent

The most common and simplest form of gradient descent is (you guessed it) regular gradient descent.

Gradient descent can be described with the following formula:

θ_(n+1) = θ_n − η∇L(θ_n)

where θ_(n+1) is the new parameter, θ_n is our current parameter, η is our “step size” (a set amount we want to move our parameters by), and ∇L(θ) is the gradient of our cost function. Note that we subtract the gradient times the step size: the gradient gives us the direction of steepest ascent, but we want to move towards a minimum, so we take the negative to get the steepest descent.
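To make the update rule concrete, here is a minimal NumPy sketch of a few gradient descent steps on a toy quadratic cost (the cost function, starting point, and step size are all made up for illustration):

```python
import numpy as np

def cost(theta):
    # Toy cost: a quadratic bowl with its minimum at (1, -2)
    return (theta[0] - 1) ** 2 + (theta[1] + 2) ** 2

def grad(theta):
    # Analytic gradient of the toy cost
    return np.array([2 * (theta[0] - 1), 2 * (theta[1] + 2)])

theta = np.array([4.0, 3.0])  # current parameters
eta = 0.1                     # step size

for _ in range(50):
    theta = theta - eta * grad(theta)  # move against the gradient

print(theta)  # approaches the minimum at [1, -2]
```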

Typically, we use gradient descent whenever we want to optimize a certain set of parameters so that we move lower and lower in a given cost function space.

[Animation: regular gradient descent]

To visualize, think of it as a point moving through the cost function space, like in the animation above.

Seems pretty simple, right? Just move towards the minimum and you’re done! Easy.

Well, the problem is, our cost function space often isn’t as perfect as the one above, and there can be several local minima that our point can get stuck in.

Another problem is that we’re also “blindly” updating our parameters, based only on how we’re moving in the parameter space. By optimizing with regular GD, we’re assuming that all our parameters hold equal value, which in many (if not all) cases is untrue.

When optimizing a given model, what we really care about is our output distribution, which tells us more about how our model is performing.

However, regular gradient descent doesn’t account for how a model’s output distribution is affected, it only cares about moving each parameter by a certain distance (step-size) in the direction of steepest descent. As a result, the optimization can be less efficient.

This is where a bunch of really smart people came together and said “enough” — time for a better gradient descent. And that’s how…

(Classical) Natural Gradient Descent

…was born.

(Classical) natural gradient descent is very similar to regular gradient descent, but instead of optimizing in the parameter space, it tackles the output distribution space, leading to a much more (in theory) efficient optimization process.

But how do we do this?

First things first: we have to define an entirely new metric. In regular gradient descent, we use a simple Euclidean metric to measure distances between parameters, but in the case of output distributions, this simply won’t cut it.

To understand why, take a look at these two Gaussians:

from https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/

In terms of a Euclidean metric (the red line), these two Gaussians have been moved by the same distance. However, observing the overlap between the initial and final Gaussian, we can tell that the second image underwent a much larger transformation than the first. For this reason, we can’t work in a Euclidean parameter space if we want to take into account the output distribution realized by a set of parameters.

When measuring the difference between two output distributions, we have to use something called the Kullback-Leibler (KL) divergence.

The KL divergence is used to measure the overlap and closeness between two output distributions. Although it’s not technically a metric (since it’s not symmetric), it can act approximately symmetric in specific cases, allowing us to use it. Using it allows us to work in the distribution space, rather than the parameter space, since what we care about most is the output distribution of our model.
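As a quick illustration, here is a tiny NumPy sketch (the distributions are chosen arbitrarily) that computes the KL divergence between two discrete distributions in both directions, showing that it isn’t symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    return np.sum(p * np.log(p / q))

# Two arbitrary discrete probability distributions
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # D_KL(p || q)
print(kl_divergence(q, p))  # D_KL(q || p), generally a different number
```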

The last step to understanding the natural gradient descent is learning about the Fisher Information Matrix.

We know that we need to use the KL divergence to optimize in the distribution space, and the Fisher information matrix provides exactly that: it defines the local curvature in distribution space for which the KL divergence is the metric. [1]

So, putting all the pieces together, we end up with the following definition for natural gradient descent:

θ_(n+1) = θ_n − ηF⁻¹∇L(θ_n)

It looks identical to regular gradient descent, except for that giant F inverse right in the middle of everything. That’s our Fisher information matrix, which lets us optimize in the distribution space rather than in the parameter space, giving us both more information about the geometry and control over the model’s movement in the distribution space, separately from its movement in the cost space.
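To see what that F inverse buys us, here is a toy NumPy sketch (the model and numbers are made up, not taken from any paper) of natural gradient descent fitting a Gaussian parameterized by its mean and log standard deviation, where the Fisher information matrix is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # toy data to fit

def nll_grad(theta):
    # Gradient of the average negative log-likelihood w.r.t. (mu, log_sigma)
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    d_mu = np.mean((mu - data) / sigma**2)
    d_log_sigma = np.mean(1.0 - (data - mu) ** 2 / sigma**2)
    return np.array([d_mu, d_log_sigma])

def fisher(theta):
    # Closed-form Fisher information matrix for the (mu, log_sigma) parameterization
    sigma = np.exp(theta[1])
    return np.diag([1.0 / sigma**2, 2.0])

theta = np.array([0.0, 0.0])  # start at mu = 0, sigma = 1
eta = 0.1

for _ in range(200):
    # Natural gradient step: theta <- theta - eta * F^(-1) * gradient
    theta = theta - eta * np.linalg.solve(fisher(theta), nll_grad(theta))

print(theta)  # mu approaches 2.0, log_sigma approaches log(0.5)
```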

Well, now that we know more about regular gradient descent and (classical) natural gradient descent, it’s finally time to get into…

Quantum Natural Gradient Descent

QNG is very similar to the last two gradient descent methods, except for one detail.

Just like how in classical natural gradient we use the Fisher matrix and KL divergence to optimize in the distribution space, in QNG we take advantage of the geometric properties of parameterized quantum states, which is super useful when trying to optimize a variational algorithm.

Once again, we need an entirely new metric. Instead of using the Fisher metric, we switch over to the Fubini-Study metric (or, the quantum Fisher metric 😉).

All we are doing with this metric is defining distance within a quantum geometric space, which for quantum algorithms, really does seem… natural.

We can define the Fubini-Study metric tensor with:

g_ij = Re[⟨∂_i φ(θ)|∂_j φ(θ)⟩ − ⟨∂_i φ(θ)|φ(θ)⟩⟨φ(θ)|∂_j φ(θ)⟩]

where |φ(θ)⟩ is our parameterized ansatz state, and |∂_i φ(θ)⟩ = ∂|φ(θ)⟩/∂θ_i is the partial derivative of |φ(θ)⟩ with respect to the parameter θ_i.
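In practice, you rarely evaluate this tensor by hand. PennyLane, for instance, provides qml.metric_tensor to compute (an approximation of) it for a circuit; the toy circuit below is just for illustration, and the exact transform arguments may vary between PennyLane versions:

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(params):
    # A small toy parameterized circuit
    qml.RY(params[0], wires=0)
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

params = np.array([0.1, 0.2], requires_grad=True)
# Block-diagonal approximation of the Fubini-Study metric tensor
g = qml.metric_tensor(circuit, approx="block-diag")(params)
print(g)
```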

A Worked Example

Now that we know how QNG works, let’s go through a worked example: using it to optimize a simple VQE problem from Yamamoto (2019) in Xanadu’s PennyLane, and comparing it to a regular gradient descent optimization. The following example contains code modified from PennyLane’s tutorials.

(check out the full tutorial at https://github.com/lanabozanic/quantum_natural_gradient/blob/master/qng_tutorial.ipynb)

1. Import necessary packages
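A minimal version of the imports might look like this (the full code lives in the notebook linked above):

```python
import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt
```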

2. Define our device and our ansatz. For this example, we are using the hardware-efficient ansatz defined in Yamamoto (2019).
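Here is a sketch of this step. The exact gate sequence of the hardware-efficient ansatz is in the notebook; the four-parameter, two-qubit circuit below is a plausible stand-in, not necessarily the one from the paper:

```python
dev = qml.device("default.qubit", wires=2)

def ansatz(params):
    # Hardware-efficient-style ansatz: single-qubit rotations plus an entangler
    qml.RY(params[0], wires=0)
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    qml.RY(params[2], wires=0)
    qml.RY(params[3], wires=1)
```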

3. Define the coefficients and observables of our H2 Hamiltonian, map our qnodes based on the ansatz and observables, and finally define our VQE cost function (the expectation value) with our coefficients and qnodes.
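A sketch of this step, grouping the observables into a single qml.Hamiltonian and one QNode instead of mapping separate qnodes; the coefficients below are placeholders standing in for the real H2 values used in the notebook:

```python
# Placeholder coefficients and observables (the actual H2 values are in the notebook)
coeffs = [0.5, 0.3, -0.2, 0.1]
obs = [
    qml.PauliZ(0),
    qml.PauliZ(1),
    qml.PauliZ(0) @ qml.PauliZ(1),
    qml.PauliX(0) @ qml.PauliX(1),
]
H = qml.Hamiltonian(coeffs, obs)

@qml.qnode(dev)
def cost(params):
    ansatz(params)
    return qml.expval(H)  # VQE cost: expectation value of the Hamiltonian
```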

4. Optimize & Run!

In this example, we are going to compare the performance of vanilla (regular) gradient descent, QNG, and QNG with a diagonal approximation (a more time-efficient QNG that only computes the diagonal terms of the FS metric). All the optimizers will start from the same initial parameters, (-0.2, -0.2, 0, 0).

We repeat the same optimization loop for each optimizer, swapping in its respective modification (e.g. the FS metric or its diagonal approximation), as sketched below.
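A sketch of that loop using PennyLane’s built-in optimizers (argument names such as approx have changed across PennyLane versions, so treat this as illustrative):

```python
max_steps = 500
conv_tol = 1e-6
step_size = 0.01

optimizers = {
    "vanilla GD": qml.GradientDescentOptimizer(stepsize=step_size),
    "QNG": qml.QNGOptimizer(stepsize=step_size),
    "QNG (diag approx)": qml.QNGOptimizer(stepsize=step_size, approx="diag"),
}

energies = {}
for name, opt in optimizers.items():
    params = np.array([-0.2, -0.2, 0.0, 0.0], requires_grad=True)
    history = [cost(params)]
    for _ in range(max_steps):
        params, prev_energy = opt.step_and_cost(cost, params)
        energy = cost(params)
        history.append(energy)
        if np.abs(energy - prev_energy) < conv_tol:  # stop once converged
            break
    energies[name] = history
```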

Finally, all that’s left to do is…

5. Plot our results
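A quick matplotlib sketch, continuing from the loop above:

```python
for name, history in energies.items():
    plt.plot(history, label=name)

plt.xlabel("Optimization step")
plt.ylabel("Energy")
plt.legend()
plt.show()
```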

Taking a look at the performance of all three optimizers, it’s clear that QNG outperformed vanilla gradient descent by a long shot. QNG and the QNG diagonal approximation found the ground state at steps 56 and 101, respectively, whereas vanilla gradient descent took 350 steps (with a convergence tolerance of 1e-06).

Taking advantage of the quantum geometric space worked out in our favour after all!

For me, it was super interesting to play around with this new algorithm, and I can’t wait to see its other use-cases in the future!

Before I end off, I wanted to give a special shout out to my mentor, Hannah Sim, and my partner, Maggie Li, for collaborating on this project. For the past couple of months, we’ve taken a deeper dive into the world of quantum natural gradient descent, working through papers & examples and developing some fun stuff. Make sure to stay tuned for more QNG fun in the future as we move forward with the project! 😉

Another special thank you to the amazing people at the Quantum Open Source Foundation (QOSF) for running the quantum mentorship program. I got to meet Hannah and partner up with Maggie to collaborate on this project through the program, and I’m super grateful for the opportunity. 🙏
