Restricted Boltzmann Machine, a complete analysis. Part 3: Contrastive Divergence algorithm

Nguyễn Văn Lĩnh
Published in datatype · Jan 9, 2020

This post is a bit math-heavy, as it is usually hard to find a concrete write-up explaining every step of the Contrastive Divergence algorithm for RBMs. But probabilistic machine learning will be more appealing to you after reading it. Just another practice!

Job to be done:

  • As the RBM contains latent variables, we need to add them to the probability distribution and the log-likelihood function.
  • The RBM energy function and probability distribution.
  • The RBM’s likelihood function and its derivative.
  • The Contrastive Divergence algorithm.
  • Relationship with modern algorithms.

Latent variables models

In order to capture different dependencies between the visible data features, the Restricted Boltzmann Machine introduces hidden variables. For example, in a set of documents, the hidden variables describe the different abstract “topics” underlying them.

When including latent variables, the marginal distribution of the visible features is the sum of the joint probability of the visible features and every configuration of the hidden variables. Denote by H = (h_1, h_2, …, h_n) the set of n hidden variables.

Equation 10: marginal distribution of latent variable models.
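The original equation image is not reproduced here; from the definitions above, Equation (10) should read (a reconstruction of the standard marginal):

```latex
p(x) = \sum_{h} p(x, h)
```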

In the energy-based model for the Gibbs distribution of an MRF, the equivalent form of Equation (10) is:

Equation 11: marginal distribution in terms of the energy function

where the partition function Z = Σ_{x,h} exp(−E(x,h)) sums over all possible configurations of x and h.
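Written out, the marginal in Equation (11) takes the standard form (reconstructed from the definitions above):

```latex
p(x) = \frac{1}{Z} \sum_{h} e^{-E(x, h)},
\qquad
Z = \sum_{x, h} e^{-E(x, h)}
```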

Log likelihood

When associated with the model parameters θ, the log-likelihood with hidden variables is:

Log likelihood of latent variable based models
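The equation image is missing; the standard log-likelihood of a latent variable model, consistent with Equations (10) and (11), is:

```latex
\ln \mathcal{L}(\theta \mid x)
= \ln p(x \mid \theta)
= \ln \sum_{h} e^{-E(x, h)} - \ln \sum_{x, h} e^{-E(x, h)}
```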

Derivative of the log likelihood

To apply the gradient ascent method to find θ, we need to take the derivative of L with respect to θ.

Differentiating both log-sum-exp terms and applying Bayes’ rule, the final derivative form of the log-likelihood is:

Equation 14. Derivation of the log likelihood function
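The standard form of Equation (14), reconstructed to match the derivation above, is a difference of two expectations of the energy gradient:

```latex
\frac{\partial \ln \mathcal{L}(\theta \mid x)}{\partial \theta}
= -\sum_{h} p(h \mid x) \frac{\partial E(x, h)}{\partial \theta}
+ \sum_{x, h} p(x, h) \frac{\partial E(x, h)}{\partial \theta}
```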

This is a crucial step in defining latent variable based models. What we need to find next for the RBM case:

  • Definition of the energy function E for RBM.
  • The derivative of E(x,h) with respect to the model parameters.

Modeling the Restricted Boltzmann Machine

Energy function

An energy-based model: in Figure 1, there are m visible nodes for input features and n hidden nodes for latent features. We solve the binary RBM first, where the random variables take values (x, h) ∈ {0, 1}. There are m input features for x and n hidden variables.

RBM energy function

where the real weight w_ij is associated with the edge connecting visible node x_j to hidden node h_i. Each visible node x_j also has a real bias value b_j, and each h_i a bias c_i, respectively. This energy function E(x,h) returns a scalar energy value for each configuration of the model parameters θ = {W, B, C}, where W = {w_ij}, B = {b_j}, C = {c_i}; ∀i ∈ {1, 2, …, n}, j ∈ {1, 2, …, m}.
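The energy-function image is not shown; for a binary RBM with the parameters just defined, it should read:

```latex
E(x, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i x_j
- \sum_{j=1}^{m} b_j x_j
- \sum_{i=1}^{n} c_i h_i
```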

An important property of the RBM is that there are no connections between nodes in the same layer, so the latent variables are conditionally independent given the input features, and vice versa:
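These factorizations, standard for the RBM, are:

```latex
p(h \mid x) = \prod_{i=1}^{n} p(h_i \mid x),
\qquad
p(x \mid h) = \prod_{j=1}^{m} p(x_j \mid h)
```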

Taking the derivative

The final step in defining the optimization procedure for the binary RBM is finding its derivative of the log-likelihood function as in Equation (14). For example, the derivative with respect to a weight w_ij that connects hidden node i to visible node j:

The first term can be rewritten more compactly as:

Since the sum of p(h_i|x) over all values of h_i equals 1, and remembering that h_i is binary, i.e. h_i ∈ {0, 1}, the derivative of the log-likelihood for one sample is:

Equation 28: Derivation of log likelihood of RBM for one data point
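Reconstructing the missing Equation (28) from the steps above (this is the standard per-sample gradient for a binary RBM):

```latex
\frac{\partial \ln \mathcal{L}(\theta \mid x)}{\partial w_{ij}}
= p(h_i = 1 \mid x)\, x_j
- \sum_{x} p(x)\, p(h_i = 1 \mid x)\, x_j
```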

We apply the same reasoning to get the derivatives for the bias terms b and c (equivalently, the biases can be absorbed into the weight matrix by appending a constant 1 to each layer).

For all samples in the dataset L, this quantity becomes:

where q(x) is the empirical distribution of the visible features given the dataset L, so p(h|x)q(x) is the empirical distribution of all model variables on the given dataset, while p(h,x) is the model distribution over all 2^m · 2^n possible values of the visible and hidden variables. This leads to a general rule:

Equation 29: Log likelihood derivative as the difference between empirical distribution and model distribution
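In expectation form, the missing Equation (29) should read (reconstructed to match the description above):

```latex
\frac{\partial \ln \mathcal{L}(\theta)}{\partial \theta}
= -\,\mathbb{E}_{p(h \mid x)\, q(x)}\!\left[ \frac{\partial E(x, h)}{\partial \theta} \right]
+ \mathbb{E}_{p(x, h)}\!\left[ \frac{\partial E(x, h)}{\partial \theta} \right]
```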

This derivative is the difference of two expectations: one with x clamped to a data point and one with x free. When the two terms match, the gradient becomes zero and the learning algorithm stops. In other words, the objective of the learning algorithm is to make the empirical expectation approach the model expectation.

Algorithms

Common algorithms for training an RBM model approximate the second expectation term in Equation (28) in order to do the gradient ascent step. The first proposed algorithm utilizes simulated annealing [1], and [2] uses a mean-field approximation to evaluate the gradient. Another approach uses MCMC methods to approximate this gradient, but it requires many steps to reach the equilibrium state. These algorithms are therefore inefficient at getting a fast and accurate approximation.

The first efficient algorithm is Contrastive Divergence (CD) [3], which is the standard way to train an RBM model nowadays.

The idea is to run k steps of Gibbs sampling, typically with k = 1. This applies to the gradient for one sample (Equation (28)) by doing k rounds of sampling as follows: at time t, sample h^t ∼ p(h|x^t) and then sample x^(t+1) ∼ p(x|h^t). Or simply:

  • Sample hidden variable values h from the visible variables x at time t.
  • Use these just-sampled hidden variables h to regenerate the visible variables x.

Then, the approximation form of Equation (28) is

where x^0 is the original input vector at time 0 and x^k is generated by the RBM model at time k. Since the hidden variables are independent given the visible variables and vice versa, we can sample each layer in parallel.
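The CD-1 update described above can be sketched in NumPy-style Python. This is an illustrative sketch, not the original author’s code; the function and variable names (`cd1_update`, `sample` via `rng.random`) are my own, and the sigmoid conditionals are the standard binary-RBM formulas:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x0, W, b, c, lr=0.1):
    """One CD-1 step for a binary RBM (illustrative sketch).

    x0 : (m,) binary visible vector
    W  : (n, m) weights; b : (m,) visible biases; c : (n,) hidden biases
    """
    # Positive phase: p(h = 1 | x^0) and a binary sample h^0
    ph0 = sigmoid(W @ x0 + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: reconstruct x^1 ~ p(x | h^0), then p(h = 1 | x^1)
    px1 = sigmoid(W.T @ h0 + b)
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigmoid(W @ x1 + c)

    # CD-1 approximation of Equation (28): data term minus model term
    W += lr * (np.outer(ph0, x0) - np.outer(ph1, x1))
    b += lr * (x0 - x1)
    c += lr * (ph0 - ph1)
    return W, b, c
```

Note that both sampling steps operate on whole layers at once, exploiting the conditional independence within each layer.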

Gibbs sampling for RBM, reproduced from Figure 27.31 of [4]. By stepwise sampling: from a vector, generate a corresponding latent vector, and repeat until reaching the equilibrium state.
One-step Contrastive Divergence algorithm

By processing one sample at a time and following its gradient direction, this algorithm belongs to the stochastic gradient family. The advantage of this approach is clearly visible on a large dataset, where conventional gradient methods need to compute the sum of gradients over all data samples at a tremendously expensive cost. Thanks to this property, the RBM method is quite suitable for large-scale machine learning problems.

Sampling distribution

What distributions are p(h|x) and p(x|h) to sample from?

Since x_l ∈ {0,1}, it follows a Bernoulli distribution, and we only need to find the conditional distribution of a particular x_l given h and the rest of x except the l-th index: p(x_l = 1 | x_(−l), h). We denote the set of all visible variables except l as x_(−l) = {x_1, x_2, …, x_(l−1), x_(l+1), …, x_m}. Then, the energy function can be rewritten as a function of (x_l, x_(−l), h):

It is important to group the terms into the E_2 function as

Then the derivation of p(x_l = 1 | x_(−l), h) follows Bayes’ rule:

Clearly, this has the sigmoid function form, so

The final probabilistic models for the visible and latent variables show that each variable depends only on the states of its adjacent nodes and its own bias:

where Ber stands for the Bernoulli distribution, as we are in the binary RBM case.
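The resulting conditionals, reconstructed in the notation of the energy function above, are the standard binary-RBM formulas:

```latex
p(h_i = 1 \mid x) = \sigma\!\left( \sum_{j=1}^{m} w_{ij} x_j + c_i \right),
\qquad
p(x_j = 1 \mid h) = \sigma\!\left( \sum_{i=1}^{n} w_{ij} h_i + b_j \right)
```

where σ(z) = 1/(1 + e^(−z)) is the sigmoid function.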

Relationship with modern algorithms.

Clearly, the RBM is both a generative and an unsupervised model.

A generative model

The most interesting part of the RBM is that it can generate both visible and hidden layers by sampling from a distribution. After that, the information used to update the model parameters is the difference between ‘generated’ and ‘real’ data. Immediately, the links to these algorithms become clearly visible: Variational auto-encoders (VAE) and Generative adversarial networks (GAN).

Or in more detail:

Taxonomy of generative models based on the tractability of the density distribution. Adapted from Ian Goodfellow, Tutorial: Generative adversarial networks. https://www.semanticscholar.org/paper/Deep-generative-models%3A-Survey-Oussidi-Elhassouny/64a1e4d18c2fab4904f85b3ee1a9af4263dae348/figure/7

There are some differences in the density setup, the number of latent layers, and the architecture among the algorithms above. They concern:

  • What your algorithm wants to capture: latent-space modelling, directly optimizing the likelihood, …
  • How the distribution of interest is set up.
  • How to sample from the distribution: MCMC, Gibbs sampling, or through the neural network.
  • How your loss function relates the actual data and the data generated from the distribution.

Usually, these papers are quite short or mathematically dense, as their audience is researchers. This binary RBM post tries to give a clear explanation of an original, simple-enough case as a way to approach these new algorithms. Essentially they are the same; we just need to learn more, in the right way.

REFERENCE

[1] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

[2] Hilbert J. Kappen and Francisco de Borja Rodríguez. Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10(5):1137–1156, 1998.

[3] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[4] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
