Intuition behind Restricted Boltzmann Machines

Suman Goel
11 min read · Aug 31, 2019


In this post, I will try to shed some light on the intuition behind Restricted Boltzmann Machines and the architecture behind them.

What is a Boltzmann Machine?

A Boltzmann machine is a network of symmetrically connected nodes that make stochastic decisions about whether to be on or off. It is an unsupervised machine learning algorithm that helps discover latent features present in a dataset. Boltzmann machines are non-deterministic generative Deep Learning models with only two types of nodes: hidden and visible nodes. There are no output nodes! This is because every node is treated the same; each node generates states of the system, which is what makes it a generative model. When input is provided, Boltzmann machines are able to capture the parameters, patterns and correlations among the data. This is why they are called generative models and fall into the class of Unsupervised Deep Learning.

Why Restricted Boltzmann Machine?

In a Boltzmann machine, each node is connected to every other node, and all connections are undirected. Unfortunately, this fully connected architecture has not proven useful for practical machine learning problems. By placing certain restrictions, namely no intra-layer connections within either the visible layer or the hidden layer, it can be made efficient.

What is a Restricted Boltzmann Machine (RBM)?

Restricted Boltzmann Machines are shallow neural nets that learn to reconstruct data by themselves in an unsupervised fashion. They belong to the family of energy-based models. RBMs were popularized by Geoffrey Hinton and can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, topic modeling, and even as building blocks of Deep Belief Networks. An RBM is a generative model. Let me explain this by first describing the difference between generative and discriminative models.
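Before getting to that comparison, a quick note on what “energy-based” means here: every joint configuration of visible and hidden units is assigned an energy, and lower-energy configurations are more probable. As a rough sketch (standard RBM notation, not the training code used later in this post), the energy could be computed like this:

import numpy as np

def rbm_energy(v, h, W, vb, hb):
    # standard RBM energy: E(v, h) = -vb.v - hb.h - v^T W h
    # v: visible states (n_visible,), h: hidden states (n_hidden,)
    # W: weights (n_visible, n_hidden), vb/hb: bias vectors
    # lower energy means higher probability, since p(v, h) is proportional to exp(-E(v, h))
    return -np.dot(vb, v) - np.dot(hb, h) - np.dot(v, np.dot(W, h))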

Discriminative: Consider a classification problem in which we want to learn to distinguish between Sedan cars (y = 1) and SUV cars (y = 0), based on some features of the cars. Given a training set, an algorithm like logistic regression tries to find a straight line (that is, a decision boundary) that separates the SUVs and Sedans.

Generative: Looking at cars, we can build a model of what Sedan cars look like. Then, looking at SUVs, we can build a separate model of what SUV cars look like. Finally, to classify a new car, we can match the new car against the Sedan model, and match it against the SUV model, to see whether the new car looks more like the SUV or Sedan.

Generative models specify a probability distribution over a dataset of input vectors. We can do both supervised and unsupervised tasks with generative models:

  • In an unsupervised task, we try to form a model for P(x), where P(x) is the probability of observing the input vector x.
  • In a supervised task, we first form a model for P(x|y), where P(x|y) is the probability of x given y (the label for x). For example, if y = 0 indicates that a car is an SUV and y = 1 indicates that it is a Sedan, then p(x|y = 0) models the distribution of SUVs’ features, and p(x|y = 1) models the distribution of Sedans’ features. If we manage to find P(x|y) and P(y), then we can use Bayes’ rule to estimate P(y|x), because:
  • p(y|x) = p(x|y) p(y) / p(x)
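To make Bayes’ rule concrete, here is a tiny numeric sketch with made-up likelihoods and priors (purely illustrative values, not learned from any data):

# made-up numbers, purely to illustrate Bayes' rule
p_x_given_sedan = 0.30   # p(x | y = 1): how likely the Sedan model finds features x
p_x_given_suv   = 0.05   # p(x | y = 0): how likely the SUV model finds features x
p_sedan         = 0.60   # prior p(y = 1)
p_suv           = 0.40   # prior p(y = 0)

# p(x) by the law of total probability
p_x = p_x_given_sedan * p_sedan + p_x_given_suv * p_suv

# posterior p(y = 1 | x) = p(x | y = 1) * p(y = 1) / p(x)
p_sedan_given_x = p_x_given_sedan * p_sedan / p_x
print(p_sedan_given_x)   # 0.90, so the new car looks much more like a Sedan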

An RBM is undirected and has only two layers, an input (visible) layer and a hidden layer. Every visible unit is connected to every hidden unit, and no intra-layer connections exist. The original Boltzmann machine had connections between all the nodes. Since the RBM restricts intra-layer connections, it is called a Restricted Boltzmann Machine. It is important because it can extract meaningful features from a given input.

How does it work?

An RBM is a two-layer neural network. In an RBM, we have a symmetric bipartite graph where no two units within the same group are connected. Simply put, the RBM takes the inputs and translates them into a set of binary values that represent them in the hidden layer. These numbers can then be translated back to reconstruct the inputs. Through several forward and backward passes the RBM is trained, and a trained RBM can reveal which features are the most important ones when detecting patterns.

(Figure: Restricted Boltzmann Machine)

Now let’s import the required libraries. If any of them are not installed, simply go to the terminal and run pip install followed by the library name, and you are done!

import tensorflow as tf   # this post uses the TensorFlow 1.x API
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data   # MNIST loader (TF 1.x)
from PIL import Image
from utils import tile_raster_images   # helper for tiling weight images (from a local utils.py)
import matplotlib.pyplot as plt
%matplotlib inline

RBM layers: The first layer is the visible layer (input layer) and the second layer is the hidden layer, which in our case has 50 neurons. Each hidden node takes a value of 0 or 1 with a probability that is a logistic function of the inputs it receives from the visible units.

Each node in the visible layer also has a bias. We will collect these into a single bias vector, denoted “vb”, with one entry per visible unit.

Here we define the bias of the second layer as well. We will denote the hidden units’ bias vector as “hb”, with one entry per hidden unit.

We will be using the MNIST dataset to practice the usage of RBMs. The following cell loads the MNIST dataset and then defines placeholders for vb and hb.

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels
vb = tf.placeholder("float", [784])  # one bias per visible unit (28x28 = 784 pixels)
hb = tf.placeholder("float", [50])   # one bias per hidden unit

We also have to define the weights between the visible and hidden layer nodes. In the weight matrix, the number of rows equals the number of visible (input) nodes and the number of columns equals the number of hidden nodes. Let W be a 784 x 50 tensor that represents the weights between the visible and hidden neurons.

W = tf.placeholder("float", [784, 50])  # weight matrix: 784 visible x 50 hidden

How does inference work?

An RBM has two phases:
1. Forward pass
2. Backward pass, or reconstruction

Forward pass: Input one training sample (one image) X through all the visible nodes and pass it to all the hidden nodes. Processing happens in each node of the hidden layer. This computation begins by making stochastic decisions about whether to transmit the input or not (i.e., by determining the state of each hidden unit). At the hidden layer’s nodes, X is multiplied by the weights w_ij and added to hb. The result of those two operations is fed into the sigmoid function, which produces the node’s output p(h_j), where j is the unit number.

p(h_j) = σ(∑_i w_ij x_i + hb_j), where σ() is the logistic (sigmoid) function.
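As a quick worked example (with made-up toy numbers, not the MNIST values), computing p(h_j) for a single hidden unit looks like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy example: 3 visible units feeding a single hidden unit j
x    = np.array([1.0, 0.0, 1.0])    # states of the visible units
w_j  = np.array([0.5, -0.2, 0.3])   # weights from each visible unit into hidden unit j
hb_j = -0.1                         # bias of hidden unit j

p_hj = sigmoid(np.dot(x, w_j) + hb_j)
print(p_hj)  # about 0.668: the probability that hidden unit j turns on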

Now let’s see what p(h_j) represents. It is the probability that hidden unit j is on; taken together, these values define a probability distribution over the hidden units. That is, the RBM uses the inputs x to make predictions about the hidden node activations.

As a result, for each row in the training set a vector of hidden probabilities is generated, which in our case is of size [1 x 50]; for n training rows we get p(h) of size [n x 50].

We then turn unit h_j on with probability p(h_j|v), and turn it off with probability 1 − p(h_j|v).

Therefore, the conditional probability of a configuration of h given v (for a training sample) factorizes as p(h|v) = ∏_j p(h_j|v). In code:

v0_state = tf.placeholder("float", [None, 784])
h0_prob = tf.nn.sigmoid(tf.matmul(v0_state, W) + hb) #probabilities of the hidden units
h0_state = tf.nn.relu(tf.sign(h0_prob - tf.random_uniform(tf.shape(h0_prob)))) #sample_h_given_X

Reconstruction part:

v1_prob = tf.nn.sigmoid(tf.matmul(h0_state, tf.transpose(W)) + vb) 
v1_state = tf.nn.relu(tf.sign(v1_prob - tf.random_uniform(tf.shape(v1_prob))))

Calculate the error: In each epoch, we compute the error as the mean squared difference between the input and its reconstruction, i.e., the error shows how far the reconstruction is from the data.

err = tf.reduce_mean(tf.square(v0_state - v1_state))

Training the model: We want to assign high probability to the input data we train on. So, in order to train an RBM, we have to maximize the product of the probabilities assigned to all rows v (images) in the training set V (a matrix in which each row is treated as a visible vector v). That means we have to update the weights w_ij so as to increase p(v) for all v in our training data, which requires computing the derivative of the log-likelihood. This cannot be done exactly with typical gradient descent, so we use another approach, which has two steps:

  1. Gibbs Sampling
  2. Contrastive Divergence

Gibbs Sampling:

First, given an input vector v, we use p(h|v) to predict the hidden values h.

  • p(h|v) = sigmoid(X ⊗ W + hb)
  • h0 = sampleProb(h0)

Then, knowing the hidden values, we use p(v|h) to reconstruct new input values v.

  • p(v|h) = sigmoid(h0 ⊗ transpose(W) + vb)
  • v1 = sampleProb(v1) (sample v given h)

This process is repeated k times. After k iterations we obtain another input vector v_k, which was recreated from the original input values v_0 (i.e., X).
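As a minimal sketch of such a chain (reusing the v0_state, W, vb and hb placeholders defined above; the training code later in this post effectively uses k = 1), the k-step Gibbs chain could be built like this:

def gibbs_step(v, W, hb, vb):
    # sample h given v
    h_prob = tf.nn.sigmoid(tf.matmul(v, W) + hb)
    h_state = tf.nn.relu(tf.sign(h_prob - tf.random_uniform(tf.shape(h_prob))))
    # reconstruct and sample v given h
    v_prob = tf.nn.sigmoid(tf.matmul(h_state, tf.transpose(W)) + vb)
    v_state = tf.nn.relu(tf.sign(v_prob - tf.random_uniform(tf.shape(v_prob))))
    return v_state

k = 3                  # illustrative number of Gibbs steps (CD-1 below uses a single step)
vk_state = v0_state    # start the chain at the data
for _ in range(k):
    vk_state = gibbs_step(vk_state, W, hb, vb)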

Reconstruction steps:

  • Get one data point from the data set, such as x, and pass it through the net:
  • Pass 0: (x) ⇒ (h0) ⇒ (v1) (v1 is the reconstruction of the first pass)
  • Pass 1: (v1) ⇒ (h1) ⇒ (v2) (v2 is the reconstruction of the second pass)
  • Pass 2: (v2) ⇒ (h2) ⇒ (v3) (v3 is the reconstruction of the third pass)
  • Pass n: (vk) ⇒ (hk+1) ⇒ (vk+1) (vk+1 is the reconstruction of the nth pass)

What is sampling here (sampleProb)?

In the forward pass: We randomly set the value of each h_j to be 1 with probability sigmoid(v ⊗ W + hb).

  • To sample h given v means to sample from the conditional probability distribution P(h|v). In other words, we ask what the probability is of getting a specific set of values for the hidden neurons, given the values v of the visible neurons, and then sample from that distribution. In the reconstruction: We randomly set the value of each v_i to be 1 with probability sigmoid(h ⊗ transpose(W) + vb).
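The relu(sign(...)) expressions in the code above implement exactly this sampling step. Purely for illustration (this helper is not used elsewhere in the post), it could be wrapped as:

def sample_prob(probs):
    # Bernoulli sampling: for each probability p, draw u ~ Uniform(0, 1);
    # sign(p - u) is +1 with probability p and -1 otherwise,
    # and relu maps +1 to 1 and -1 to 0
    return tf.nn.relu(tf.sign(probs - tf.random_uniform(tf.shape(probs))))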

What is Contrastive Divergence?

Step 1: Feed the input vector to the visible nodes.

Step 2: Update the states of all hidden nodes in parallel. All of the units in one layer are updated in parallel given the current states of the units in the other layer.

Step 3: Reconstruct the input vector using the same weights, now going from the hidden nodes back to the visible nodes. Even though we use the same weights, the reconstructed input will be different, as multiple hidden nodes contribute to the reconstruction.

Step 4: Compare the input to the reconstructed input based on KL divergence.

Step 5: Reconstruct the input vector again, and keep repeating for all the input data and for multiple epochs. This is repeated until the system reaches its equilibrium distribution.

h1_prob = tf.nn.sigmoid(tf.matmul(v1_state, W) + hb)  # hidden probabilities given the reconstruction
h1_state = tf.nn.relu(tf.sign(h1_prob - tf.random_uniform(tf.shape(h1_prob))))  # sample h given v1
alpha = 0.01  # learning rate
# CD-1 update: positive phase (data) minus negative phase (reconstruction)
W_Delta = tf.matmul(tf.transpose(v0_state), h0_prob) - tf.matmul(tf.transpose(v1_state), h1_prob)
update_w = W + alpha * W_Delta
update_vb = vb + alpha * tf.reduce_mean(v0_state - v1_state, 0)
update_hb = hb + alpha * tf.reduce_mean(h0_state - h1_state, 0)

Now let’s start the session and initialize the variables:

# since W, vb and hb are placeholders, we keep their current and previous values in NumPy arrays
cur_w = np.zeros([784, 50], np.float32)
cur_vb = np.zeros([784], np.float32)
cur_hb = np.zeros([50], np.float32)
prv_w = np.zeros([784, 50], np.float32)
prv_vb = np.zeros([784], np.float32)
prv_hb = np.zeros([50], np.float32)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

Let’s look at the error of the first run:

sess.run(err, feed_dict={v0_state: trX, W: prv_w, vb: prv_vb, hb: prv_hb})

epochs = 5
batchsize = 100
weights = []
errors = []

for epoch in range(epochs):
    for start, end in zip(range(0, len(trX), batchsize), range(batchsize, len(trX), batchsize)):
        batch = trX[start:end]
        cur_w = sess.run(update_w, feed_dict={v0_state: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
        cur_vb = sess.run(update_vb, feed_dict={v0_state: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
        cur_hb = sess.run(update_hb, feed_dict={v0_state: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
        prv_w = cur_w
        prv_vb = cur_vb
        prv_hb = cur_hb
        if start % 10000 == 0:
            errors.append(sess.run(err, feed_dict={v0_state: trX, W: cur_w, vb: cur_vb, hb: cur_hb}))
            weights.append(cur_w)
    print('Epoch: %d' % epoch, 'reconstruction error: %f' % errors[-1])

plt.plot(errors)
plt.xlabel("Batch Number")
plt.ylabel("Error")
plt.show()

What are the final weights after training?

uw = weights[-1].T

We can take each hidden unit and visualize the connections between that hidden unit and each element of the input vector. In our case we have 50 hidden units, so let’s visualize them.

Let’s plot the current weights: tile_raster_images helps generate an easy-to-grasp image from a set of samples or weights. It transforms uw (equivalently cur_w.T here, with one flattened image per row of size 784) into a single tiled array in which the images are reshaped and laid out like tiles on a floor.

image = Image.fromarray(tile_raster_images(X=cur_w.T, img_shape=(28, 28), tile_shape=(5, 10), tile_spacing=(1, 1)))
plt.rcParams['figure.figsize'] = (18.0, 18.0)
imgplot = plt.imshow(image)
imgplot.set_cmap('gray')

Each tile in the visualization above corresponds to the vector of connections between one hidden unit and the visible layer’s units. Let’s look at the learned weights of one hidden unit as an example. In that square, gray represents a weight of 0; the whiter a pixel, the more positive the weight, and the darker a pixel, the more negative the weight. Positive pixels increase the probability of activation of the hidden unit (after multiplying by the input/visible pixels), and negative pixels decrease the probability of the hidden unit being 1 (activated). Why is this important? Because it shows that this particular square (hidden unit) has learned to detect a specific feature (e.g. a “/” shape) when it appears in the input.

image = Image.fromarray(tile_raster_images(X=cur_w.T[10:11], img_shape=(28, 28), tile_shape=(1, 1), tile_spacing=(1, 1)))
### Plot image
plt.rcParams['figure.figsize'] = (4.0, 4.0)
imgplot = plt.imshow(image)

Example: Imagine that we have a corrupted (“destructed”) image of the digit 3. Let’s see if our trained network can fix it.

First we plot the image:

img = Image.open('destructed3.jpg')
img

Now let’s pass this image through the net:

# convert the image to a 1d numpy array
sample_case = np.array(img.convert('I').resize((28,28))).ravel().reshape((1, -1))/255.0

Feed the sample case into the network and reconstruct the output:

hh0_p = tf.nn.sigmoid(tf.matmul(v0_state, W) + hb)  # hidden probabilities for the sample
#hh0_s = tf.nn.relu(tf.sign(hh0_p - tf.random_uniform(tf.shape(hh0_p))))
hh0_s = tf.round(hh0_p)  # deterministic rounding instead of stochastic sampling
hh0_p_val, hh0_s_val = sess.run((hh0_p, hh0_s), feed_dict={v0_state: sample_case, W: prv_w, hb: prv_hb})
# reconstruct the visible layer from the hidden states
vv1_p = tf.nn.sigmoid(tf.matmul(hh0_s_val, tf.transpose(W)) + vb)
rec_prob = sess.run(vv1_p, feed_dict={hh0_s: hh0_s_val, W: prv_w, vb: prv_vb})

Here we plot the reconstructed image:

img = Image.fromarray(tile_raster_images(X=rec_prob, img_shape=(28, 28),tile_shape=(1, 1), tile_spacing=(1, 1)))
plt.rcParams['figure.figsize'] = (4.0, 4.0)
imgplot = plt.imshow(img)
imgplot.set_cmap('gray')

Conclusion:

This post walked through a simple Restricted Boltzmann Machine architecture. There are many variations of and improvements on RBMs and the algorithms used for their training and optimization (which I hope to cover in future posts). I hope this helped you understand and get an idea of this awesome generative algorithm. In the next post, we will apply RBMs to build a recommendation system for movies!

Have a cup of coffee, take a small break if required, and head to Part 2 of this article, where we will discuss what will actually make you stand out in the crowd of Unsupervised Deep Learning and build a movie recommender system.
