Restricted Boltzmann Machine, a complete analysis. Part 2: A Markov Random Field model

Nguyễn Văn Lĩnh
Published in datatype
7 min read · Jan 7, 2020

This second post on the Restricted Boltzmann Machine explains its core properties and establishes the fundamental tools and knowledge needed to solve it. We will cover:

  • The Restricted Boltzmann Machine is a Markov Random Field model.
  • A Markov Random Field is a graphical model.
  • What is a graphical model? Why do they exist?
  • The definition of a Markov Random Field.
  • Why is the Restricted Boltzmann Machine a Markov Random Field model?
  • How do we solve/find the parameters of a Markov Random Field model from data?

The motivation for this writing is to show how to solve an RBM from the beginning, working through all the formulations needed to build it. Nowadays, we can simply write a model in PyTorch or TensorFlow, use the auto-gradient feature, and use built-in gradient descent functions to solve it.

Of course, this is the best way to solve common problems that have big data, or where the data is easy to collect. But if you have the closed form of the loss function’s derivative, everything will be a thousand times faster.

  • Transfer learning only works for well-developed application domains: computer vision, natural language processing.
  • Things usually get better if you can put your prior knowledge about the data/problem into the algorithm. Understanding and using principled algorithms alongside the newest AI frameworks is the best way to improve training time and accuracy. I apply this principle to our daily work and it is really effective.

I keep the math in this writing as simple as possible, focusing more on the reasons why we need it and on my personal experience.

Restricted Boltzmann Machine properties

Let’s get back to the main formula. Recall that:

  • x is the data generated by the model
  • h is the latent variable: n hidden units h_1, h_2, …, h_n that capture the dependencies between observed features

Why restricted?

The RBM is a special case of the Boltzmann Machine: the term “restricted” means there are no edges among nodes within the same group (layer), while the general Boltzmann Machine allows them.

A Markov Random Field model

Markov Random Fields (MRF) are a class of probabilistic graphical models over undirected graphs of random variables satisfying the Markov properties. The RBM is a specific MRF model formed by a bipartite graph.

What about the Graphical Model?

From the observation that complex systems are combinations of simple parts [1], graphical models use probability theory and graph theory to capture this property.

  • Graph nodes represent random variables, and edges show the dependence between these variables.
  • Graph theory gives the data structure for encoding the factorization of information, or the independence between sets of variables.
  • The probabilistic side governs how parts interact or combine to build the model, and provides the inference procedure.
  • Thus the probabilistic side guarantees the capability of uncertainty modeling, and graph theory handles the complexity.

From [1], this is a natural way to design a new system for solving problems in engineering. Many famous algorithms in machine learning, for example Naive Bayes, Hidden Markov Models, and Conditional Random Fields, are graphical models.

Graphical models in a nutshell. From: Sutton, Charles, and Andrew McCallum. ”An introduction to conditional random fields.” Foundations and Trends in Machine Learning 4.4 (2012): 267–373

At this point, we have already seen the connection between the RBM and other classes of algorithms, along with a brief introduction to them. Then:

  • The way to solve a MRF algorithm is the way to solve the Restricted Boltzmann Machine.

Markov Random Fields

It would take ages to explain the MRF in full detail. Instead, I will explain the basics of MRFs, why the RBM is an MRF, and the tools from MRF theory used to solve the RBM: the energy function and MRF inference.

A Markov Random Field is a set of random variables that satisfies the Markov properties on an undirected graph. Simply put, the Markov properties formulate the dependencies among random variables as conditional independence statements at three levels: pairs of variables, a variable and its neighborhood, and subsets of variables.

  • Pairwise Markov property: Two non-adjacent random variables Xu, Xv are conditionally independent given all other variables.
  • Local Markov property: A variable Xv is conditionally independent of all other variables given its neighbors Nv.
  • Global Markov property: Given three subsets A, B, S ⊆ V, if every path from a vertex in A to a vertex in B passes through S, then the variables in A are independent of those in B given S.
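The pairwise property can be checked by brute force on a tiny example. The sketch below uses a 3-node path graph u — w — v with invented edge potentials: since u and v are not adjacent, they should be conditionally independent given the remaining variable w.

```python
import numpy as np
from itertools import product

# 3-node path graph u - w - v: u and v are NOT adjacent, so the pairwise
# Markov property says X_u is independent of X_v given X_w.
# The edge potential values are made up for illustration.
potential = np.array([[3.0, 1.0],
                      [1.0, 2.0]])

# Joint over binary variables: p(u, w, v) proportional to phi(u, w) * phi(w, v)
joint = np.zeros((2, 2, 2))
for u, w, v in product([0, 1], repeat=3):
    joint[u, w, v] = potential[u, w] * potential[w, v]
joint /= joint.sum()

# For each value of w, p(u, v | w) should factor into p(u | w) * p(v | w).
for w in [0, 1]:
    cond = joint[:, w, :] / joint[:, w, :].sum()  # p(u, v | w)
    p_u = cond.sum(axis=1)                        # p(u | w)
    p_v = cond.sum(axis=0)                        # p(v | w)
    assert np.allclose(cond, np.outer(p_u, p_v))
```

The assertion passes for any choice of positive edge potentials, because fixing w splits the unnormalized joint into a function of u times a function of v.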

Why conditional independence?

The biggest benefit is reduced model complexity: fewer variables and terms, thanks to conditional independence.

For example, take a simple Bayesian network with 3 random variables X1, X2, X3, and the joint probability derived by the chain rule:

p(X1, X2, X3) = p(X1) · p(X2 | X1) · p(X3 | X1, X2)

Joint probability without conditional independence

If we apply the structure of the network, where each variable depends only on its parents and is independent of the others given its parents, the factorization gives a shorter joint probability formula:

p(X1, X2, X3) = p(X1) · p(X2 | X1) · p(X3 | X2)

Joint probability with conditional independence

Indeed, not only are the formulas shorter, we also need fewer parameters to store the joint distribution: 2 + 2 + 1 = 5 versus 2³ − 1 = 7 for binary variables.
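A quick numerical check of this parameter counting, assuming a chain structure X1 → X2 → X3 (which matches the 2 + 2 + 1 count); all probability values below are invented for illustration:

```python
import itertools

# Hypothetical chain network X1 -> X2 -> X3 over binary variables.
p_x1 = {0: 0.6, 1: 0.4}                      # 1 free parameter
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},
                 1: {0: 0.2, 1: 0.8}}        # 2 free parameters
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1},
                 1: {0: 0.5, 1: 0.5}}        # 2 free parameters

# Build the full joint from the factorization p(x1) p(x2|x1) p(x3|x2).
joint = {}
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    joint[(x1, x2, x3)] = p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The 5 stored numbers define a valid distribution over all 2^3 outcomes,
# whereas a full table would need 2^3 - 1 = 7 free parameters.
assert abs(sum(joint.values()) - 1.0) < 1e-12
```

The saving grows quickly with the number of variables: a full table over n binary variables needs 2ⁿ − 1 parameters, while a chain needs only 2n − 1.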

Why is the RBM an MRF?

There are no connections among the nodes in the hidden layer, so the hidden units are conditionally independent given the observed features:

p(h | x) = ∏_j p(h_j | x)

Since the RBM satisfies all Markov properties and is itself an undirected graph, the RBM is an MRF model.
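To see this factorization concretely, here is a small numerical sketch with made-up weights, using the standard RBM energy E(x, h) = −b′x − c′h − x′Wh: the brute-force conditional p(h | x) matches the product of per-unit sigmoid terms.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Tiny RBM: 3 visible units x, 2 hidden units h (weights are arbitrary).
W = rng.normal(size=(3, 2))   # visible-hidden weights
b = rng.normal(size=3)        # visible biases
c = rng.normal(size=2)        # hidden biases

def energy(x, h):
    # Standard RBM energy: E(x, h) = -b'x - c'h - x'Wh
    return -b @ x - c @ h - x @ (W @ h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])
hs = [np.array(h, dtype=float) for h in product([0, 1], repeat=2)]

# Brute-force conditional p(h | x) from the unnormalized joint exp(-E).
unnorm = np.array([np.exp(-energy(x, h)) for h in hs])
p_h_given_x = unnorm / unnorm.sum()

# Factorized form implied by conditional independence:
# p(h_j = 1 | x) = sigmoid(c_j + x . W[:, j])
q = sigmoid(c + x @ W)
for h, p in zip(hs, p_h_given_x):
    assert np.isclose(p, np.prod(np.where(h == 1, q, 1 - q)))
```

The same argument, with the roles of x and h swapped, shows that the visible units are conditionally independent given the hidden units.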

Energy Function

By the Hammersley–Clifford theorem, a positive distribution p(x) > 0 satisfies the Markov properties of an undirected graph G if and only if p factorizes over G as a product of factors, one for each maximal clique. It’s worth mentioning that a clique is a complete subgraph (a subset of vertices forming a complete graph).

Each maximal clique C is associated with a nonnegative potential function ψ_C. The probability distribution of a Markov Random Field can then be written as:

p(x) = (1/Z) ∏_C ψ_C(x_C)    (4)

The denominator Z = Σ_x ∏_C ψ_C(x_C) is the normalization constant, called the partition function.

Equivalently, (4) can be written as:

p(x) = (1/Z) exp(−E(x))    (5)

where E(x) is called the energy function:

E(x) = −Σ_C ln ψ_C(x_C)    (6)

In particular, this energy-based form of the distribution is known as the Gibbs distribution, which is commonly used in physics. The function E(x) measures the total energy of a variable configuration across the cliques. Obviously, a higher probability p(x) corresponds to a lower energy E(x).
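The clique potentials, the partition function, and the energy fit together as in this toy sketch (the chain graph and the potential values are invented for illustration):

```python
import numpy as np
from itertools import product

# Chain MRF x1 - x2 - x3 over binary variables; each edge is a maximal
# clique with a made-up potential that favors equal neighbors.
def psi(a, b):
    return 2.0 if a == b else 1.0   # nonnegative clique potential

configs = list(product([0, 1], repeat=3))

def energy(x):
    # E(x) = -(ln psi(x1, x2) + ln psi(x2, x3)), a sum of log-potentials
    return -(np.log(psi(x[0], x[1])) + np.log(psi(x[1], x[2])))

# p(x) = exp(-E(x)) / Z, with Z the partition function.
unnorm = np.array([np.exp(-energy(x)) for x in configs])
Z = unnorm.sum()
p = unnorm / Z

# Low energy corresponds to high probability: the all-equal
# configurations (0,0,0) and (1,1,1) are the most probable.
assert np.isclose(p.sum(), 1.0)
assert configs[int(np.argmax(p))] in [(0, 0, 0), (1, 1, 1)]
```

Note that Z is computed here by summing over all 2³ configurations; in a real model this sum is exponential in the number of variables, which is exactly what makes MRF inference hard.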

A simple recap:

  • The Hammersley–Clifford theorem provides the theoretical background for formulating a Markov Random Field.
  • According to it, we can formulate the probability distribution of an MRF in terms of its maximal cliques.
  • Assign a nonnegative potential function to each clique and take the product of them.
  • Using the relationship between products and sums of logarithms, we get the energy function for the probability distribution in exponential form, the Gibbs distribution.

MRF inference

Given a dataset L of l samples, L = {x(1), x(2), …, x(l)}, we assume these samples are generated from the same distribution and are independent of each other (i.i.d.).

The inference task is to find a parameter setting for the probabilistic model that captures the underlying data distribution. In particular, once the structure of an MRF is identified, the energy function is associated with a parameter set θ, and inference amounts to adjusting θ.

For each data point x, the likelihood that x is generated by a probabilistic model with parameter set θ is p(x|θ). When inferring a graphical model whose structure is known and whose variables are fully observed, the learning phase uses the maximum likelihood criterion to find θ. The log-likelihood function for all samples in the dataset is:

ln L(θ | L) = Σ_i ln p(x(i) | θ)

As there is no closed-form analytical solution for maximizing the likelihood, we use the gradient ascent method on the log-likelihood function. In general, gradient ascent is an iterative first-order method for finding a local optimum of a given objective function. To seek the maximum objective value, it moves proportionally to the positive gradient direction at each step:

θ(t+1) = θ(t) + α ∂/∂θ ln L(θ(t) | L)

where α is the learning rate, the proportion of the move, and θ(t) is the current value of θ at time t.
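As a minimal illustration of this update rule, here is gradient ascent on the log-likelihood of a simple Bernoulli model, a toy stand-in for the MRF case (the data, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.8, size=1000)    # i.i.d. binary samples

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Model: p(x | theta) = sigmoid(theta)^x * (1 - sigmoid(theta))^(1 - x).
# The derivative of sum_i ln p(x_i | theta) w.r.t. theta is sum_i (x_i - sigmoid(theta)).
theta = 0.0
alpha = 1.0                               # learning rate
for t in range(200):
    grad = np.mean(data - sigmoid(theta)) # averaged log-likelihood gradient
    theta = theta + alpha * grad          # gradient ascent step

# At the maximum-likelihood solution, sigmoid(theta) matches the empirical mean.
assert abs(sigmoid(theta) - data.mean()) < 1e-6
```

For an MRF, the gradient of the log-likelihood is not this simple, which is what the next part of this series deals with, but the update loop has exactly this shape.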

At this point we already know:

  • The definition of the MRF probability distribution p(x) in terms of the energy function E(x).
  • The target: maximize the likelihood of the parameters given the data, p(x|θ).
  • Solve it by gradient ascent.

REFERENCE:

  • [1] Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.
