# Understanding Probabilistic Graphical Models Intuitively

Neural Networks and Deep Learning are a rage in today’s world but not many of us are aware of the power of Probabilistic Graphical models which are virtually everywhere. Before I explain what Probabilistic Graphical Models(PGMs) are, let’s first have a look at some of their applications.

Google is based on a very simple Graph algorithm called page rank. It works somewhat like message passing algorithm.

• Netflix, Amazon, facebook all use Probabilistic Graphical Models to recommend what is best for you. Check out the Netflix algorithm below. It uses Latent factor models and non-negative matrix factorization (NNMF).
• PGMs can also apparently infer whether one is a Modi Supporter or Kejriwal Supporter. The lastest example about application of PGMs was seen during America Opinion Poll predictions. Heard the name “Nate Silver” orFiveThirtyEight”. FiveThirtyEight is a company that makes predictions about American Presidential Polls using Probabilty Graphical Models. The opinion polls, news articles, prior knowledge, candidate history, etc. are taken as input. Not only was it used to predict the winning chance percentages but also the constituency specific predictions that differentiated the constituencies between ‘Strongholds’ and ‘Swing States’.

Relation between Neural Networks and Probabilistic Graphical Models

Neural Networks and PGMs both are capable of solving inference and learning problems but the major difference between them arises is to how they incorporate prior knowledge in the existing model, here is where the PGMs win. Apparently PGMs are better than neural networks in case you are in search of a girlfriend. To better understand the relationship between the two go through this funny Girlfriend Search”example by Bharath Hariharan (PhD, UC Berkley):

“You can constrain a PGM to do what the NN does, and you can torture an NN to get at some of the information a PGM would give you. How requires some explaining.
Let’s say I like a girl I know through N mutual acquaintances, and want to ask her out on a date. Being low on self-esteem, I want to ask some subset of our mutual friends to boost me to her. However, a referral from someone she mistrusts may have only a weak positive influence, while a referral from someone she dislikes might even have a negative impact.
Now suppose n of my friends have already had the same idea, and have tried doing this before (assume, for the purposes of this argument, that the young lady in question is unbearably desirable). So I have in my possession, a set of observations, which I can compactly represent as an n bit vector {1,0,1,0 ….} where 1 is a boost, 0 means no contact.
The computational problem I need to solve is to find the set of friends that I should ask to boost me, i.e. the best vector. Note that this need not be one of the n vectors already tried.
I could try solving this problem using a neural network: the input layer would have N nodes that obtain the appraisal input, the output layer would be the girl’s response (again binary), and I can throw in a hidden layer of M neurons that all have N connections back up to the input layer, and encode various quanta of me-friend-girl dynamics that when pooled together in M forward connections to the output layer, determine her interest in me.
When I train this neural network with the n vectors I have, I will learn weights between 0 and 1 on the NM input-mid-layer connections and the M mid-layer-output connections. Intuitively, combinations of inputs that predict the output well (say her 3 closest friends) will be repeatedly reinforced, and will become bigger across n observations. So will the weight corresponding to the mid-level neuron that is receiving inputs from specifically that clique. However, there will be no easy way of knowing which of the M mid-layer neurons contains the 3-closest-friends information. The NN will function as a Delphic oracle — you can ask it about the fate of individual vectors, but not for reasons explaining its prediction.
I could also treat this problem as one of Bayesian reasoning, where I potentially receive observations of approval from N nodes, which leads to the formation of an impression (a random variable), which causes date acceptance (an observable event). In this case, I get to see the likelihood probability p(approval from friend i|impression), from which I have to estimate the conditional posterior probability p(impression|vector of all approvals) using Bayes theorem.
Going from p(approval i|impression) to p(all approvals|impression) though, is hard. Usually, machine learners tend to assume conditional independence for all i approvals, i.e. p(all approvals|impression) = product of all p(approval i | impression). This is simple to compute, but gives up on the possibility of modeling non-trivial correlations between inputs. For example, if hearing good things from either A,B and C or from C and F, but not from A and C together impresses the girl (assume that the girl’s social life is extremely rich), such effects won’t show up in such ‘naive’ Bayesian predictions.
To summarize, in any situation where a number of causes could contribute to an observable effect, you can try to learn the structure of cause-effect using either a neural network or a Bayesian model. Thus, the key similarity between NNs and PGMs is that they can both be used to learn network functions.”

This is one of the best examples I could find to explain the complex Neural Nework-PGM relationship. Now we know about the power of PGMs and understand the basic difference between PGMs and other learning techniques, so lets dive into the theory of PGMs. Note that the purpose of this write-up is just to give a basic understanding of PGMs to the reader so we will not go deep into the mathematics behind it.

Revise the basic theorems of probability

The main aim of Probabilistic Graphical Models is to provide an intuitive understanding of joint probability among random variables. So to understand PGMs one should be comfortable with joint probability terminologies.

Joint Probability Basics

Why PGMs?

Two variables X, Y each taking 10 possible values. Listing P(X, Y ) for each possible value of X, Y requires specifying/computing (10^2) many probabilities.

What if we have 1000 variables each taking 10 possible values?

⇒ (10^1000) many probabilities. Difficult to store, and query naively.

Solution : Structured Learning, specially Probabilistic Graphical Models (PGMs). PGMs use graphs to represent the complex probabilistic relationships between random variables P(A, B, C, …)

Benefits:

1. Compact and Intuitive representation of distributions of variables.
2. Relation between variables are intuitive (such as conditional independences) have fast and general algorithms to query without enumeration. e.g. ask for P(A|B = b, C = c)

When there is dependence between variables graphical models can help to reduce the computation required to infer something. With placement tensions at its peak, a good/simple graphical model to represent situation of final year students will be :

Understanding various types of Graphs and Conversions from one form to other

There are three important ways by which we can represent the random variables, these are :

### a. Directed Acyclic Graph (DAG):

Terminology : Parent and Descendant. Pa(x) = Parent of x.

Have a look at the graph above and see what you can infer from it. Is A independent of B? Is B conditionally independent of C given A? To answer these questions we need better understanding of directed graphs.

(The Inverted-V Graph) Draw a directed graph with three variables A,B,C such that A and B are conditionally independent given C :

p(a, b|c) = p(a|b, c)p(b|c) = p(a|c)p(b|c)

Solution : The graph above satisfies the condition given in problem statement. C is the parent of A & B. Lets see how. Using the bayes chain rule we saw earlier :

p(a, b|c) = p(a|b, c)p(b|c). Note that :

1. C is the parent of A
2. B is non-descendant of A. So we can say that :

⇒ p(a|b, c) = p(a|c), which implies :

⇒ p(a, b|c) = p(a|c)p(b|c)

Hence A and B are conditionally independent given C.

But what if C is not observed will the same answer hold then? Lets find out :

Now lets try one more example to clarify the concept.

DAG : The V Graph problem

Case 1 : C is not an observed Variable

p(a, b, c) = p(a)p(b|a)p(c|a, b) . But since A is not a parent or descendant of B, we have

• p(a, b, c) = p(a)p(b)p(c|a, b)

Now to comment on conditionality lets find out P(a,b).

Case-2 : C is observed

Now you must be thinking if we have to solve all the marginalizing summations to reach the final result, what the hell is the use of graphs. To answer the above questions without any calculations one should understand the concept of D-separation.

D-Separation : Fast way of commenting on Variable Independence

D-separation is nothing but graph terminology to comment on independence. Professor Daphne Koller in her Coursera course gives a nice way of remembering the D-separation rules.

Rules to find path of water trail:

• If there exists at least one path(water trail) between two variables, it means they are not D-Separated.
• In inverted-V graph shape or any other linear shape in case we encounter a observed variable our path is blocked.
• Only in case of V-shaped graph, it is must that either the vertex node of V or any of its descendants are observed.

Understanding through examples:

### b. Markov Random Fields or Undirected Acycic Graph (UAG):

A Markov random field, also known as a Markov network or an undirected graphical model , has a set of nodes each of which corresponds to a variable or group of variables, as well as a set of links each of which connects a pair of nodes. The links are undirected, that is they do not carry arrows. In the case of undirected graphs, it is convenient to begin with a discussion of conditional independence properties.

Why a need of Undirected Graphs felt at the first place? Christopher bishop explains it nicely :

In the case of directed graphs, we will see that it was possible to test whether a particular conditional independence property holds by applying a graphical test called d-separation. This involves testing whether or not the paths connecting two sets of nodes were ‘blocked’. The definition of blocked, however, was somewhat subtle due to the presence of paths having head-to-head nodes. We might ask whether it is possible to define an alternative graphical semantics for probability distributions such that conditional independence is determined by simple graph separation. This is indeed the case and corresponds to undirected graphical models. By removing the directionality from the links of the graph, the asymmetry between parent and child nodes is removed, and so the subtleties associated with head-to-head nodes no longer arise.

In undirected graphs we don’t have any arrows but the information is stored in something else called potential variables. To understand what is potential variable one must understand concept of “cliques”, which are defined as a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset.

• Each variable of a clique must be connected to every other variable of the clique.
• A Maximal Clique is a clique such that it is not possible to include any other nodes from the graph in the set without it ceasing to be a clique.

The property of D-separation needs to be redefined for Un-directed graphs.

• In case of un-directed graphs: One set of nodes(A) is said to be conditionally independent of another set of nodes(B) given some node set(C)then there exist at least one path between any one node of A to some node of B.
• Rule : Any observed node in the path blocks the path.

Markov Blanket : In un-directed graphs there is no concept of Parent or descendant but to understand the relation of node with surrounding nodes we use the concept of markov blanket. The immediate surrounding neighbors of any nodes forms the markov blanket for that node. As we know any observed node blocks the path, we can infer the following rule :

For an undirected graph, a node is independent of any other node given its markov blanket.

Now lets learn about potential and energy functions of cliques :

The graph below has two independent cliques given by {x1, x2}, and {x2, x3, x4} (We may make some other combinations also if we wish, like {x1, x3} and and {x2, x3, x4}). The set {x1, x2, x3, x4} is not a clique because of the missing link from x1 to x4.

Let us denote a clique by C and the set of variables in that clique by xC. Then the joint distribution is written as a product of potential functions ψC(xC) over the maximal cliques of the graph:

Here Z is called the partition function is a normalization constant which ensures that the distribution p(x) is correctly normalized.

By considering only potential functions which satisfy ψC(xC) >=0 we ensure that p(x) >=0. Let’s make this concept of cliques and potential function more clear with an example. Have a look at the expression of p(x) for undirected graphs. It is nothing but multiplication of all the clique potentials.

The arguments of each clique potential are the member nodes of that clique.

But the main question is what actually is this potential function ψ?

We do not restrict the choice of potential functions to those that have a specific probabilistic interpretation as marginal or conditional distributions. This is in contrast to directed graphs in which each factor represents the conditional distribution of the corresponding variable, conditioned on the state of its parents. Since there is no restriction on what can be the potential function apart from the fact that they are strictly positive, lets go with one of the easiest yet a flexible function : the exponential.

where E(xC) is called an energy function, and the exponential representation is called the Boltzmann distribution. The joint distribution is defined as the product of potentials, and so the total energy is obtained by adding the energies of each of the maximal cliques.

One consequence of the generality of the potential functions ψC(xC) is that their product will in general not be correctly normalized. We therefore have to introduce an explicit normalization factor(Z). The presence of this normalization constant is one of the major limitations of undirected graphs. If we have a model with M discrete nodes each having K states, then the evaluation of the normalization term involves summing over KM states and so (in the worst case) is exponential in the size of the model.

If there is no restriction on choosing potential functions how in general do we decide them for our models?

This can be done by viewing the potential function as expressing which configurations of the local variables are preferred to others. Global configurations that have a relatively high probability are those that find a good balance in satisfying the (possibly conflicting) influences of the clique potentials. We turn now to a specific example to illustrate the use of undirected graphs.

Final Concept : How do we relate Directed and Undirected graphs?

Lets try to convert a directed graph into an undirected one with the knowledge of how we represent the joint probability p(x) in both cases.

The most obvious way to convert this into undirected graph is :

Whether this representation is correct or not can be gauged by only one yardstick : possibility of such potential function. If there exists a potential function such that p(x)_directed = p(x)_undirected, then this representation is correct. Let us see if this is indeed true.

But we won’t always be this lucky. Sometimes its not possible to get undirected graph from a directed graph by replacing arrows with undirected links. Have a look at the example below :

Can the directed graph (a) be represented as undirected graph (b), lets find out.

All those cases where undirected graphs can’t be directly obtained by replacement of arrows with undirected links can be summed up through this one rule :

If in the directed graphs any node has more than one parent which are not married(connected with each other), we can’t get a undirected graph from this directed graph without adding more links to it.

So if a node has more than one parent they should be married before they can be represented as undirected graph. So indirectly the undirected graph says that you cannot have a child unless you are married. (In other words Undirected graphs are super conservatives). So to not upset the moral compass of undirected graphs we always marry the co-parents for conversion into undirected graphs. This process is aptly named as “Moralization”.

Summing up the procedure :

To convert a directed graph into an undirected graph, we first add additional undirected links between all pairs of parents for each node in the graph and then drop the arrows on the original links to give the moral graph. Then we initialize all of the clique potentials of the moral graph to 1. We then take each conditional distribution factor in the original directed graph and multiply it into one of the clique potentials. There will always exist at least one maximal clique that contains all of the variables in the factor as a result of the moralization step. Note that in all cases the partition function is given by Z = 1.

Some final notes :

• Figure below shows an example of a directed graph that is a perfect map for a distribution satisfying the conditional independence properties A⊥B | ∅ and A⊥B | C. There is no corresponding undirected graph over the same three variables that is a perfect map.
• There can be more than one way of making undirected graph from a directed graph.

I will end the first part of the PGM tutorial here and give a brief intro to what we will be covering in the next post.

At no point we should lose track of what we are there to accomplish : using available data to comment on unobserved data. This problem can be broken down into two parts : Inference and Learning.

In next post we will learn more about Inference. For dealing with inference problem optimally direct and indirect graphs don’t suffice and we have to move further to Factor graphs. Factor Graphs are very similar to undirected graphs but offer lot more flexibility in usage. The important algorithms that we use for inference are :

• Sum-Product Algorithm (Marginal Query)
• Max-Product Algorithm (MAP Query)

These algorithms give exact inference. For most probabilistic models of practical interest, exact inference is intractable, and so we have to resort to some form of approximation. There are two ways of getting this approximate inference:

• Use algorithms based on Deterministic Inference(will not be covered in next blog)
• approximate inference methods based on numerical sampling, also known as Monte Carlo techniques

For approximate inference first we will cover the basic sampling algorithms:

• Rejection sampling
• Adaptive rejection sampling
• Importance sampling

But the above sampling techniques suffer from severe limitations particularly in spaces of high dimensionality. We therefore turn in this section to a very general and powerful framework called Markov chain Monte Carlo (MCMC), which allows sampling from a large class of distributions, and which scales well with the dimensionality of the sample space.

• Markov Chain
• The Metropolis-Hastings algorithm
• Gibbs Sampling (Special case of Metropolis-Hastings Algo)

Hope you enjoyed reading this blog and got some intuitive idea about Probabilistic Graphical Models. In case you have any problems or suggestions regarding what is discussed here feel free to comment. I will try my best to clarify your doubts.

(References — All the images which are not hyperlinked or cited are either made by me or edited version of figures from Christopher Bishop’s Book — Pattern Recognition and Machine Learning)