Restricted Boltzmann Machine, a complete analysis. Part 1: introduction & model formulation

Nguyễn Văn Lĩnh
Published in datatype · Sep 12, 2018

Some time ago, a colleague asked me about the RBM algorithm and how it differs from other learning algorithms. It is a really interesting algorithm: Netflix used it alongside SVD in their product. Last year, it took me a full month to understand RBM properly, in terms of both theory and practice.

This series of articles is my attempt to write up that analysis concisely, along with demo source code for RBM. It is a detailed story of why RBM was created, which assumptions it makes, how the resulting problem is solved, and the formulas that lead to the final algorithm. I find that RBM papers typically start from the final formula and build on top of it, which is hard to accept and understand for highly curious people like me.

RBM Timeline

Let's start with some history. The Restricted Boltzmann Machine is an undirected graphical model that plays a major role in today's deep learning frameworks. It is a restricted version of the Boltzmann Machine, in which connections within a layer are removed.

  • 1986: introduced as the Harmonium. Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
  • 2002: a fast approximate training algorithm, Contrastive Divergence (CD). Geoffrey E. Hinton. “Training products of experts by minimizing contrastive divergence”. In: Neural Computation 14.8 (2002).
  • 2007: first big-data application (collaborative filtering on Netflix movie ratings). Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. “Restricted Boltzmann Machines for Collaborative Filtering”. In: International Conference on Machine Learning (2007).
  • 2006 → 2010: stacked RBM models form the Deep Belief Network. Geoffrey E. Hinton and Ruslan R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks”. In: Science (2006).
  • 2010 → now: interest has slowed down, due to the superiority of other deep learning architectures; RBM is now mostly used for initializing network parameters.

Briefly, RBM is a Swiss army knife. It is a probabilistic model that naturally ends up containing the sigmoid unit (non-linearity), and thanks to its binary hidden factors it also acts as a clustering algorithm. In other words, it is at once a probabilistic model that captures the data distribution, a nonlinear factor analysis, a clustering algorithm, and a nonlinear data representation.

In terms of practice, it is a very fast learning algorithm with a short implementation, and it handles binary and real-valued input data, even input vectors with missing values. Accordingly, numerous applications of RBM have been found in:

  • Dimensionality reduction: Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
  • Collaborative filtering: Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. Restricted Boltzmann Machines for Collaborative Filtering. International Conference on Machine Learning, pages 791–798, 2007.
  • Classification: Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. ICML, pages 536–543, 2008.

In 2007 came the first application of RBM that could handle big data: millions of user ratings in the Netflix competition, used to model and predict user preferences on movies. In the period from 2006 to 2010, the Deep Belief Network was invented by stacking multiple RBM models to build higher, more “abstract” data representations. This is when people really got into “deeper” model structures.

At this point, we can clearly see the need to understand the RBM algorithm: it helped set the stage for the comeback of deep learning.

RBM: model definition

Usually, in high-dimensional data, only a small number of degrees of freedom explain (or embed) most of the data variance; these correspond to latent factors. For example, among a thousand face images, some underlying latent factors are gender, lighting condition, pose, and emotion. One way to find these latent factors is manifold learning, which can be done with one of my favorite algorithms: Locally Linear Embedding.

Locally Linear Embedding example. https://cs.nyu.edu/~roweis/lle/faces.html
Density Estimation for a Gaussian mixture. http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_pdf.html

For modelling complex, high-dimensional data distributions, a popular probabilistic approach is the mixture model; a typical algorithm is the Gaussian Mixture Model (GMM). In GMM, the assumption is that all data points are generated by a mixture (weighted sum) of Gaussian components, as written below.
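For concreteness, this is the standard GMM density, using the usual notation π_k, μ_k, Σ_k for the mixing weights, means, and covariances of the K components (these symbols are not defined elsewhere in this article):

$$
p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .
$$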

Instead of modelling a sum of components as in GMM, RBM takes a product of simple Bernoulli distributions, a “product of experts”, with the aim of inferring a sharper posterior distribution (according to Hinton's paper). In Figure 1, RBM contains two layers of units: the visible units x_1, x_2, ..., x_m stand for the m input data features, and the n binary hidden units h_1, h_2, ..., h_n capture the dependencies between the observed features. According to this figure, each h_j models a particular dependency or relation among the features.

Figure 1: RBM structure with m visible units x_1, ..., x_m and n binary hidden units h_1, ..., h_n. From my writing.
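For reference, here is a sketch of the standard energy-based formulation of a binary RBM. The notation is assumed rather than taken from the figure: W_ij is the weight between visible unit x_i and hidden unit h_j, b_i and c_j are the visible and hidden biases, and Z is the partition (normalizing) constant:

$$
E(x, h) = -\sum_{i=1}^{m} b_i x_i - \sum_{j=1}^{n} c_j h_j - \sum_{i=1}^{m}\sum_{j=1}^{n} x_i W_{ij} h_j,
\qquad
p(x, h) = \frac{1}{Z}\, e^{-E(x, h)} .
$$

Because there are no connections within a layer, the conditional distributions factorize into independent sigmoid (Bernoulli) units:

$$
p(h_j = 1 \mid x) = \sigma\!\Big(c_j + \sum_{i=1}^{m} W_{ij} x_i\Big),
\qquad
p(x_i = 1 \mid h) = \sigma\!\Big(b_i + \sum_{j=1}^{n} W_{ij} h_j\Big),
$$

where σ(·) is the sigmoid function.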

Although the RBM structure looks like a two-layer neural network, RBM has a closed-form representation of the data distribution (up to the normalizing constant Z).
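As a minimal sketch of what that closed form looks like in code (assuming NumPy; the parameter names W, b, c and their random values are illustrative only, not taken from this article), the snippet below evaluates the unnormalized p(x), which factorizes as a product of experts over the hidden units:

```python
import numpy as np

# Minimal sketch: the closed-form (unnormalized) data distribution of a binary RBM.
# W (m x n weights), b (visible biases), c (hidden biases) are illustrative
# parameters with arbitrary values, not values from the article.
rng = np.random.default_rng(0)
m, n = 6, 3                          # number of visible / hidden units
W = rng.normal(scale=0.1, size=(m, n))
b = np.zeros(m)
c = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_hidden_given_visible(x):
    # p(h_j = 1 | x): each hidden unit is an independent sigmoid (Bernoulli) unit.
    return sigmoid(c + x @ W)

def unnormalized_p(x):
    # Closed form up to the partition function Z:
    # p(x) ∝ exp(b·x) * prod_j (1 + exp(c_j + W[:, j]·x))  -- a product of experts.
    return np.exp(b @ x) * np.prod(1.0 + np.exp(c + x @ W))

x = rng.integers(0, 2, size=m)       # a random binary input vector
print(p_hidden_given_visible(x))     # posterior over the hidden factors
print(unnormalized_p(x))             # relative (unnormalized) probability of x
```

Summing unnormalized_p(x) over all 2^m binary input vectors would give the partition function Z, which is exactly what makes exact normalization expensive.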

Recap:

  • RBM is an algorithm with a long history and a strong influence.
  • A 2-layer model: an input (visible) layer and a hidden layer, both originally binary (following the Bernoulli distribution).
  • The data distribution is represented as a product over the hidden nodes (a product of experts).
  • Swiss army knife
