A Complete Guide to Boltzmann Machine — Deep Learning

Soumallya Bishayee
10 min read · Apr 4, 2023


A Boltzmann Machine (BM) is an undirected, unsupervised, generative deep learning network, often used for recommender systems. The basic structure of a BM consists of visible nodes and hidden nodes; it has no output nodes. Input data are fed into the visible nodes, and every node is connected to every other node. Learning is all about updating the weights of the synapses between the nodes. Before we dive deep into the Boltzmann Machine, here are the contents of this article:

  1. Theory [1.a) Intuition, 1.b) Structure, 1.c) Working principle, 1.d) Its special type: the Restricted Boltzmann Machine, 1.e) Contrastive Divergence, 1.f) Deep Belief Networks, 1.g) Advantages & Disadvantages of RBM, 1.h) Practical Applications]
  2. Implementation of the Code in PyTorch
https://en.wikipedia.org/wiki/Boltzmann_machine

1.a) Intuition :

  1. When we have little information or data, we can proceed with a Boltzmann Machine, because it uses neurons that are connected not only to neurons in other layers but also to neurons within the same layer. It does not wait for input data; it generates data within the network. When some parameters of a system are unknown, the Boltzmann Machine estimates them via its hidden nodes and generates their data. This yields a model showing what is normal or abnormal in the system. Once we have trained the network on training data, it finds the correlations between the nodes, which tells us how the system behaves in the real world.
  2. It gives us recommendations, telling us which features a new user/viewer is likely to enjoy.
  3. It can deal with datasets that data visualization tools (like Power BI or Tableau) cannot process.

1.b) Structure : The Boltzmann Machine consists of visible (input) nodes and hidden nodes.

https://www.superdatascience.com/blogs/boltzmann-machine-boltzmann-machine

As in the picture above, the visible nodes are colored blue and the hidden nodes red. Every node is connected to every other node via synapses/links, but the network does not discriminate between hidden and visible nodes: it treats all nodes the same. Every link carries a weight. There is no output layer, because we never supply an output value. An RBM, by contrast, is bipartite: there are no intra-layer connections, and every connection runs between a visible node and a hidden node.

1.c) Working principle of BM: To understand the Boltzmann Machine, we first need the concept of an Energy Based Model.

Energy Based Model (EBM) : The Boltzmann probability distribution p(x) is defined as

p(x) = exp{−E(x)/T} / Z , where Z = ∑ₓ exp{−E(x)/T} ,

E(x) is the energy of state x, and T is a free parameter (like temperature). The equation describes the probability of finding a system in a state x when the system is in thermal equilibrium with a heat bath at temperature T: the lower the energy, the higher the probability of observing that state. Z is called the partition function.
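As a quick illustration, here is a minimal NumPy sketch of this distribution, using made-up energy values for a four-state system (all numbers are hypothetical):

import numpy as np

# Hypothetical energies of four states of a system (arbitrary units)
energies = np.array([1.0, 2.0, 0.5, 3.0])
T = 1.0  # temperature-like free parameter

# Boltzmann distribution: p(x) = exp(-E(x)/T) / Z
unnormalized = np.exp(-energies / T)
Z = unnormalized.sum()       # partition function
p = unnormalized / Z

print(p)           # the lowest-energy state (E = 0.5) gets the highest probability
print(p.sum())     # the probabilities sum to 1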

As in the picture above, system A has gas molecules concentrated in a high-density region in one corner, while system B has the same number of molecules distributed uniformly throughout. System B has uniform density, and as per the Boltzmann distribution, system B is more stable, so its probability of occurring is higher.

The energy function E(v, h) is defined as

E(v, h) = −∑ᵢ aᵢvᵢ − ∑ⱼ bⱼhⱼ − ∑ᵢ∑ⱼ vᵢwᵢⱼhⱼ

https://en.m.wikipedia.org/wiki/Energy_based_model

where E(v, h) is the energy of the state, vᵢ is the visible (input) unit i, hⱼ is the hidden unit j, aᵢ and bⱼ are the biases of vᵢ and hⱼ respectively, and wᵢⱼ is the element of the weight matrix associated with vᵢ and hⱼ.
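To make the formula concrete, here is a minimal NumPy sketch that evaluates E(v, h) for randomly initialized, purely hypothetical parameters:

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3

v = rng.integers(0, 2, size=n_visible)        # binary visible state
h = rng.integers(0, 2, size=n_hidden)         # binary hidden state
a = rng.normal(size=n_visible)                # visible biases
b = rng.normal(size=n_hidden)                 # hidden biases
W = rng.normal(size=(n_visible, n_hidden))    # connection weights

# E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j
E = -a @ v - b @ h - v @ W @ h
print(E)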

We can now connect the EBM to the BM: the weights and biases of the BM define the energy. Once the system is trained, the Restricted Boltzmann Machine always tries to find the lowest-energy state possible. If we substitute the energy E(x) into p(x), we see that p decays exponentially as E(x) grows, which matches our Boltzmann Machine intuition: the lower the energy, the higher the probability, and the higher the energy, the lower the probability.

1.d) Restricted Boltzmann Machine (RBM) : To avoid overfitting, we drop the connections between nodes within the same layer. Visible nodes connect only to hidden nodes: there are no connections among input nodes within the input layer and none among hidden nodes within the hidden layer. RBMs are used for pattern recognition in datasets with many features, and also in recommender systems.

https://www.latentview.com/blog/restricted-boltzmann-machine-and-its-application/

The example RBM above is designed as a recommender system. Here a dataset with movies as features and viewers as rows goes through the training process. Genre A, Genre B, Actor X, Award Y and Director Z are the preferences expressed by the viewers. We will build a network that recommends which movie a new viewer with certain preferences will want to see. To build this recommender system we train the network: to reach good accuracy, the network must repeatedly adjust the weights of the synapses between the nodes, and this is exactly what the RBM does for us.

https://www.andreaperlato.com/aipost/boltzmann-machine/

In the first step, the input data are fed into the network. The features of the dataset are five movies: The Matrix, Fight Club, Forrest Gump, Pulp Fiction and The Departed. Viewers rated the movies, and we have another dataset with parameters such as the genre of the movie, whether it won an Oscar, and the name of its actor/director. When we combine the datasets and feed them in, as in the picture above, the network receives the values of the input nodes row by row. For the first movie, The Matrix, the network checks whether any hidden node matches it. No matching values are found, because The Matrix is neither a drama nor an Oscar-winning movie, and it features neither DiCaprio nor Tarantino. The second movie, Fight Club, has no data at all. The third one, Forrest Gump, is a drama, and Titanic is also a drama, so the drama hidden node is learnt from Forrest Gump and Titanic. In the same way, the DiCaprio node matches Titanic, and the Oscar node matches Forrest Gump and Titanic. The matched hidden nodes are colored yellow and the unmatched ones red. The network now knows which input nodes are activated for which hidden nodes. Then the backward pass happens: the RBM reconstructs the inputs from the hidden nodes. During training, if the reconstruction is incorrect, the weights are adjusted; then reconstruction happens again, and again the weights are adjusted if it is incorrect. This process continues until we achieve the best possible accuracy in the network.

During this reconstruction, the vacant input nodes are filled in by the network, and this is what tells us whether a new user will want to watch a movie or not. For example, the second movie, Fight Club, will not be recommended to a new viewer, because it shares none of the parameters that viewers liked in the other movies; so a value of 0 is written into the input node for Fight Club. The last movie, The Departed, learns from the hidden nodes and matches drama, DiCaprio and Oscar, so a new viewer is likely to enjoy it, because it shares parameters that viewers liked in other movies; so a value of 1 is written into the input node for The Departed.

1.e) Contrastive Divergence : This is the algorithm that allows an RBM to learn and update its weights. Ordinary gradient descent through backpropagation will not work here, because the network is undirected. The first step is called Gibbs sampling. Starting from an input vector v0, we sample the hidden values h0 from p(h0|v0); then we use p(v1|h0) to obtain the reconstruction v1. After k such iterations, the reconstructed input vector is vk.

To understand this, let's take a simple example of 5 input nodes and 5 hidden nodes. In the first step, each hidden node is computed from all the input nodes; this way all the hidden nodes are created. Then all the hidden nodes together reconstruct the input nodes one by one. After this reconstruction, an input node's value is no longer the same as before, even though the weights have not changed, because each hidden node was built from all the input nodes together. In this way all the input nodes change; but remember that each hidden node is also based on the input nodes, so the hidden nodes will change too. This process continues until the energy of the state is minimized, just as we learnt from the Energy Based Model: the weights of the RBM define its energy. If we plot the energy of the state against the epoch, we get a curve like this:

The further we go in contrastive divergence, the lower the energy of the state becomes and the higher its probability, just as in the picture above: E at the 3rd state < E at the 2nd state < E at the 1st state. From this we get the change of probability with respect to the weights: the gradient is proportional to ⟨vi⁰hj⁰⟩ − ⟨vi^∞hj^∞⟩, where ⟨vi⁰hj⁰⟩ is taken in the initial state of the system and ⟨vi^∞hj^∞⟩ in the final state. This sampling procedure is Gibbs sampling: we drag the system down the energy curve, which technically means adjusting the weights while resampling the input values until we reach the minimum-energy state.
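To make the procedure concrete, here is a minimal NumPy sketch of one CD-k update for a hypothetical 5-visible/5-hidden RBM (all names and values are illustrative, not taken from the article's PyTorch code below):

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tiny RBM: 5 visible units, 5 hidden units
n_v, n_h = 5, 5
W = rng.normal(scale=0.1, size=(n_v, n_h))
a = np.zeros(n_v)   # visible biases
b = np.zeros(n_h)   # hidden biases
lr = 0.1            # learning rate
k = 10              # number of Gibbs steps (CD-k)

v0 = rng.integers(0, 2, size=n_v).astype(float)  # one training example

# Positive phase: p(h|v0)
ph0 = sigmoid(v0 @ W + b)

# Gibbs chain: alternate sampling of h and v for k steps
vk = v0.copy()
for _ in range(k):
    hk = (rng.random(n_h) < sigmoid(vk @ W + b)).astype(float)
    vk = (rng.random(n_v) < sigmoid(W @ hk + a)).astype(float)
phk = sigmoid(vk @ W + b)

# CD-k update: <v0 h0> - <vk hk>
W += lr * (np.outer(v0, ph0) - np.outer(vk, phk))
a += lr * (v0 - vk)
b += lr * (ph0 - phk)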

1.f) Deep Belief Networks (DBN) : The structure of a DBN is equivalent to stacked RBMs. This is a complex and advanced type of network.

https://www.researchgate.net/figure/Stacked-RBM-abstraction-of-a-DBN-19_fig5_318277197

Multiple RBMs are stacked on top of each other, and training happens greedily, layer by layer. If there are four layers, we first train the network with the input data at the input nodes; the first hidden layer is built from these data, and then the second and third hidden layers are built consecutively, one by one. When reconstruction starts at the top of the network, first the second hidden layer is reconstructed, then the first hidden layer (because the second hidden layer is based on the first), and finally the input layer is reconstructed, minimizing the energy of the state as per the energy based model.
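Here is a minimal NumPy sketch of the greedy, layer-by-layer idea: each RBM is trained with CD-1, then its hidden activations become the training data for the next RBM in the stack. The train_rbm helper and all sizes are hypothetical:

import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1; return (W, b, hidden activations)."""
    n_samples, n_visible = data.shape
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + b)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            v1 = (rng.random(n_visible) < sigmoid(W @ h0 + a)).astype(float)
            ph1 = sigmoid(v1 @ W + b)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            a += lr * (v0 - v1)
            b += lr * (ph0 - ph1)
    # hidden activation probabilities become the next layer's training data
    return W, b, sigmoid(data @ W + b)

# Toy binary data: 20 samples, 12 visible units
data = rng.integers(0, 2, size=(20, 12)).astype(float)
layer_sizes = [8, 4]          # two stacked RBMs -> a small DBN
stack = []
for n_h in layer_sizes:
    W, b, data = train_rbm(data, n_h)
    stack.append((W, b))      # greedy: freeze this layer, train the next on top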

1.g) Advantages & Disadvantages of RBM :

Advantages : 1) Because of the restriction in the RBM, it works faster than the original BM, since the input nodes do not communicate with each other. 2) The output of the hidden layer can be used as input by another machine learning model or neural network.

Disadvantages : 1) Adjusting the weights through contrastive divergence is not as efficient as gradient descent through backpropagation.

1.h) Practical Application :

  1. In handwritten digit recognition, an RBM can be used: if the output of the hidden layers, which reconstruct the input layers, is fed into a CNN, we can recognize the digits.
  2. Any type of recommender system that a company wants to build for its customers/users/viewers.

Now we will implement the code; anyone can reuse the same syntax for the same type of Restricted Boltzmann Machine problem.

2. Implementation of the Code in PyTorch

We will be working with a dataset of movie ratings given by users.

These are the datasets of movies, users and ratings.
# Importing the libraries
import numpy as np
import pandas as pd
import torch
# Importing the dataset (the ml-1m files are loaded here for inspection;
# training below uses the ml-100k train/test split)
movies = pd.read_csv('ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
users = pd.read_csv('ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
ratings = pd.read_csv('ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
# Preparing the training set and the test set
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set = np.array(training_set, dtype = 'int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int')
# Getting the number of users and movies
nb_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))
# Converting the data into an array with users in lines and movies in columns
def convert(data):
    new_data = []
    for id_users in range(1, nb_users + 1):
        id_movies = data[:, 1][data[:, 0] == id_users]
        id_ratings = data[:, 2][data[:, 0] == id_users]
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings  # unrated movies stay 0
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)
# Converting the data into Torch tensors
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)
# Converting the ratings into binary ratings 1 (Liked) or 0 (Not Liked);
# unrated movies (0 before) become -1 so they can be masked out later
training_set[training_set == 0] = -1
training_set[training_set == 1] = 0
training_set[training_set == 2] = 0
training_set[training_set >= 3] = 1
test_set[test_set == 0] = -1
test_set[test_set == 1] = 0
test_set[test_set == 2] = 0
test_set[test_set >= 3] = 1
# Creating the architecture of the Neural Network
class RBM():
    def __init__(self, nv, nh):
        self.W = torch.randn(nh, nv)   # weights between hidden and visible units
        self.a = torch.randn(1, nh)    # biases of the hidden units
        self.b = torch.randn(1, nv)    # biases of the visible units
    def sample_h(self, x):
        # p(h|v) and a Bernoulli sample of the hidden units
        wx = torch.mm(x, self.W.t())
        activation = wx + self.a.expand_as(wx)
        p_h_given_v = torch.sigmoid(activation)
        return p_h_given_v, torch.bernoulli(p_h_given_v)
    def sample_v(self, y):
        # p(v|h) and a Bernoulli sample of the visible units
        wy = torch.mm(y, self.W)
        activation = wy + self.b.expand_as(wy)
        p_v_given_h = torch.sigmoid(activation)
        return p_v_given_h, torch.bernoulli(p_v_given_h)
    def train(self, v0, vk, ph0, phk):
        # Contrastive divergence update: <v0 h0> - <vk hk>
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)
nv = len(training_set[0])   # number of visible units = number of movies
nh = 100                    # number of hidden units (features to detect)
batch_size = 100
rbm = RBM(nv, nh)
# Training the RBM with CD-10 (k = 10 Gibbs steps)
nb_epoch = 10
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(0, nb_users - batch_size, batch_size):
        vk = training_set[id_user:id_user + batch_size]
        v0 = training_set[id_user:id_user + batch_size]
        ph0, _ = rbm.sample_h(v0)
        for k in range(10):
            _, hk = rbm.sample_h(vk)
            _, vk = rbm.sample_v(hk)
            vk[v0 < 0] = v0[v0 < 0]   # keep the -1 (unrated) entries frozen
        phk, _ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0 >= 0] - vk[v0 >= 0]))
        s += 1.
    print('epoch: ' + str(epoch) + ' loss: ' + str(train_loss / s))

# Testing the RBM with a single blind Gibbs step per user
test_loss = 0
s = 0.
for id_user in range(nb_users):
    v = training_set[id_user:id_user + 1]   # training ratings activate the hidden units
    vt = test_set[id_user:id_user + 1]      # test ratings are the target
    if len(vt[vt >= 0]) > 0:
        _, h = rbm.sample_h(v)
        _, v = rbm.sample_v(h)
        test_loss += torch.mean(torch.abs(vt[vt >= 0] - v[vt >= 0]))
        s += 1.
print('test loss: ' + str(test_loss / s))
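Once trained, the RBM can also be used directly to produce recommendations for a single user. Here is a minimal sketch, reusing the rbm object and tensors defined above (the user_id and the 0.5 threshold are hypothetical choices):

# Predicting preferences for one user (a sketch following the code above)
user_id = 0  # hypothetical example user
v = training_set[user_id:user_id + 1]
_, h = rbm.sample_h(v)        # activate hidden features from the known ratings
p_v, _ = rbm.sample_v(h)      # reconstruct probabilities for all movies

# Recommend previously unrated movies (-1 entries) with a high predicted probability
unrated = (v[0] < 0)
recommended = torch.nonzero(unrated & (p_v[0] > 0.5)).flatten()
print(recommended[:10])       # indices of up to 10 recommended movies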
