Quantum Machine Learning: Inference on Bayesian Networks

Published in

Analytics Vidhya

11 min readAug 17, 2019

Over the last three months I’ve been doing a lot of work on probabilistic graphical models— a way of expressing probabilistic relationships in the form of a graph. These models are super useful when applied to machine learning tasks, with Bayesian networks, hidden Markov models, and many other important classes of learning models directly implementing different kinds of probabilistic graphs.

In this article, we will look at a really cool way to implement a Bayesian network on a quantum computer that makes computing conditional probabilities much faster, by reducing the time it takes to sample from the model.

In case you aren’t up to speed on graphical models or machine learning, don’t worry— we’re starting with a crash course on Bayesian networks.

What even is a (classical) Bayesian network?

Bayesian networks (Bayes nets for short) are a type of probabilistic graphical model, meaning they work by creating a probability distribution that best matches the data we feed them with.

With Bayes nets, we need to use a part of our data to train a model before we can use it to solve machine learning tasks, in contrast to an unsupervised learning model like k-means clustering (read my tutorial on building a quantum distance estimator for k-means here), which does not require separate training data.

For example, if we wanted to create a Bayes net that would give us movie suggestions, we might feed a Bayes net with the data below so that it can learn a good distribution:

After it learns this data, we might ask it to find the probability that we like another bunch of movies, and the model would give us something like this:

So how does it work?

Circles, arrows and the Markov assumption

We can represent a probability distribution in many ways: through a graph, using nodes (also called vertices) to represent variables, and edges to represent dependencies, or by using a table of probabilities, or even by expressing the distribution as a product of many factors.

To understand this better, let’s start with a super simple network, made up of just three variables:

Each circle represents a node, and each arrow a relationship between the two circles it connects.

The directions of the arrows are important as well — when two nodes are connected by an arrow, the node connected to the head of the arrow (called the child node) depends on the node connected to the tail of the arrow (called the parent node).

For example, in the graph above, node H depends on node E, which depends on node P.

We can represent the probability of the network variables taking on certain values using the product of multiple factors, one for each variable:

Real world networks can have hundreds of nodes — what would the factors look like then? Each variable would depend on all the variables that came before it — making calculating probabilities computationally expensive.

Bayes nets work around this problem by using a rule called the Markov assumption: a node is considered independent of all other nodes in a graph when we know the values of the nodes in its Markov blanket — the set of its parent and child nodes, as well as all the other parents of the child nodes.

Let’s look at the factors for the same graph, but this time using the Markov assumption:

Notice how the H isn’t dependent on P anymore? That’s because P isn’t in the Markov blanket of H, so the two nodes are conditionally independent given E, and their relationship is ignored.

In our example the computation only becomes a little easier — but as the distributions get more complex and more variables are involved, the difference in computation increases more, and Bayes nets really start to shine.

Now we know why Bayesian networks are fast — but how can we solve problems using them?

Sampling and approximate inference

Once we have trained a Bayesian network, we want to use it to answer questions that interest us, which we do using a process called inference. Some of the simplest inference methods involve randomly generating samples from distributions and using those samples to compute probabilities.

Let’s look at how we can do this with our example network:

We can give a meaning to each variable:

P: Preparation level
E: Exam performance
H: Graduating with honors

For simplicity, we can assume that each variable can only take one of two values, 0 or 1, with 1 indicating the more desirable option for a student (better preparation, exam performance, and graduating with honors), and 0 indicating the less desirable option.

Each variable is associated with a probability table that we use to find out how likely it is that the variable takes on a certain value given the assignment to its parents. For example, this is the table for the variable H:

We can generate random samples from our distribution using a technique called forward sampling:

Start with nodes that have no parents — in this case, P — and assign to them a random value based on the associated probability tables.
Move to the children of the nodes assigned values in the previous step, and update their probability tables to reflect the assignments of their parent nodes — so if the value of P was 1, generate a sample for E using the second row of its probability table.
Repeat step 2 until all variables have been assigned a value.

If we do this enough times, we can get a pretty good estimate of the probability that a variable takes on a certain value. But what if we want to compute probabilities conditioned on evidence?

For example, we might want to find out the probability of measuring H = 1 if P = 0.

The solution is surprisingly straightforward: we perform the same sampling procedure, but discard the samples that do not contain the evidence. If we have evidence E = (P= 0), we would check every sample we generate to see if it contained the right evidence, and keep the ones which do. This technique — throwing away samples with incorrect evidence — is called rejection sampling.

Now that we’ve caught up on Bayes nets and inference, let’s take things deeper into the quantumverse.

Representing variables using qubits

How do we encode the variables that make up the Bayesian network into qubits?

The easiest way would be to map each random variable to an individual qubit, and use rotations to change the probability that the qubit is measured as 0 or 1.

Let’s take a look at how this works (in case this looks unfamiliar, take a look at this article on qubit representation):

We start with a zero-initialized qubit — for which the angle θ has a value of zero— and apply a rotation to it to change the value of θ to something between 0 and π. The larger the angle of rotation is, the closer the qubit is to the |1> state at the bottom of the Bloch sphere, and the larger the probability that it will be measured as in the |1> state.

But how do we decide the angle of rotation? To answer that, we need to know how the probability of measuring a qubit as in the |1> state changes as you rotate it:

If we rearrange the terms to rewrite the second equation for θ:

Using this rule and the transformation from the last section, we can implement a Bayesian network on a quantum computer, and with rejection sampling, we also have a way to use the network to answer questions — but this method can also be wasteful.

To see why, suppose our evidence is E = (E = 1, H = 0) — looking at the probability table for H from above, only about one in five samples we draw will be useful.

To work around this, we need a way to to increase the probability of states that contain matching evidence — we can achieve this using a technique called amplitude amplification.

Oracles and reflections

Once we prepare a set of qubits to represent a Bayes net, we have a superposition of states with evidence that matches and is different from ours:

ψ represents the quantum state, e the evidence, and Q the target state: the superposition of all possible assignments to the variables when the evidence matches e.

To identify which states contain the right evidence, we use a special gate called an oracle, which is designed so that it inverts quantum states that contain the right evidence:

When the oracle is applied to states with incorrect evidence, nothing changes:

After applying the oracle O, our quantum state would look like this:

We can understand this better if we look at the amplitudes of the different parts of ψ. The chart below shows ψ before applying the oracle:

And after:

The dashed lines show the average amplitude of the states that make up ψ.

Next we apply a gate U to reflect every state about the average amplitude, which increases the probability of measuring the states with correct evidence:

U is equivalent to a reflection about the initial state ψ. The sequence of gates that implements a reflection about ψ, however, is usually long and complicated — instead, we can use a shortcut to perform an equivalent, but way simpler, operation.

Let’s first transform our qubit system into another basis where implementation is simpler. The key to this transformation is the circuit A that converts a zero-initialized state into ψ:

So ψ in the standard basis is equivalent to the zero state in the A basis — which means that all we need to do to implement U is to go into the A basis, reflect about the zero state, and then go back to the standard basis.

Since we only care about the magnitude of the amplitude of a state with the right evidence, not its sign, we can just flip the sign of the zero state, which is equivalent to reflecting all the other states about it:

If we combine the O and U gates, along with a minus sign, we get the Grover iterate:

For large networks with many possible states, we have to apply the Grover iterate multiple times, but since ours is small, we only need to do it once or twice.

We’ve learnt how to implement, and now sample from, a Bayes net — so let’s go build one.

Building a quantum Bayesian network

The network we’ll be building is based on the example we’ve been working with so far, with evidence E = (P = 1) :

The question we’ll be trying to answer is this: what is P(H = 0 | P = 1)? Before we start, let’s look at the probability tables for each variable:

Let’s import qiskit, create the registers, and the quantum circuit we need:

To make the rotations easier, let’s set up a small function that converts probabilities into angles:

Now we can set up the qubits for the variables — we use the P(1) value from the first table for the qubit representing P, and since we know that P = 1, we need to use the P(1) value from the second row of the P(E|P) table.

Let’s use the P(1) value from the first row of the P(H|E) table — we can change it later if we have to, once we measure E:

To make sure we sample from a state with correct evidence values, we use amplitude amplification, designing and oracle that flips the signs of states with evidence matching ours:

In case you’re not sure how the long sequence of CU1 and CX gates make up a triple controlled Z gate, you can read up on it here.

We now apply the Grover iterate twice, and then set up H — to do that we measure E, and rotate H conditionally. After that we can sample from our network, through measurement:

Finally, to calculate P(H = 0 | P = 1), we run the job multiple times, generating a new sample with every run, and throwing away the samples with incorrect evidence:

After running the algorithm and drawing 1000 samples, I got these results:

If we use the probability tables from above to compute P(H = 0 | P = 1), we get 0.2832 — which means our rejection sampling algorithm works!

Wrapping up

Super cool!

We just built a Bayesian network on a quantum computer, and used it to perform rejection sampling, which is amazing. We also saw that using amplitude amplification before sampling makes the algorithm so much more efficient — it increased the sample acceptance rate from 35% to almost 100% .

You can grab the entire project here . If you have any questions, drop a comment here, or get in touch — I would be happy to help!