Restricted Boltzmann Machine (RBM) with Practical Implementation

Amir Ali
The Art of Data Science
20 min read · May 26, 2019

In this chapter of the Deep Learning book, we will discuss the Boltzmann Machine. It is an unsupervised deep learning technique, and we will cover both the theory and a practical implementation from scratch.

This chapter spans 6 parts:

  1. What is the Boltzmann Machine?
  2. An Energy-Based Model.
  3. Restricted Boltzmann Machine.
  4. How Does a Restricted Boltzmann Machine Work?
  5. Contrastive Divergence.
  6. Practical Implementation of Restricted Boltzmann Machine.

1. What is the Boltzmann Machine?

The Boltzmann Machine was first introduced in 1985 by Geoffrey Hinton, a professor at the University of Toronto, together with Terry Sejnowski. Hinton is a leading figure in the deep learning community and is referred to by some as the “Godfather of Deep Learning”.

· Boltzmann Machine is a generative unsupervised model, which involves learning a probability distribution from an original dataset and using it to make inferences about never before seen data.

· Boltzmann Machine has an input layer (also referred to as the visible layer) and one or several hidden layers.

· Boltzmann Machine uses neural networks with neurons that are connected not only to other neurons in other layers but also to neurons within the same layer.

· Everything is connected to everything: connections are bidirectional, visible neurons are connected to each other, and hidden neurons are also connected to each other.

· Boltzmann Machine doesn’t expect input data; it generates data. Neurons generate information regardless of whether they are hidden or visible.

· For the Boltzmann Machine, all neurons are the same; it doesn’t discriminate between hidden and visible neurons. The whole thing is one system, and the machine generates states of that system.

The best way to think about it is through the example of a nuclear power plant.

· Suppose, for example, we have a nuclear power station. There are certain things we can measure in a nuclear power plant, like the temperature of the containment building, how quickly the turbine is spinning, the pressure inside the pump, etc.

· There are also lots of things we are not measuring, like the speed of the wind, the moisture of the soil at a specific location, whether it is a sunny or rainy day, etc.

· All these parameters together form a system; they all work together. All these parameters are binary, so we get a whole bunch of binary numbers that tell us something about the state of the power station.

· What we would like to do is notice when the plant is going into an unusual state, a state that is not like the normal states we have seen before. And we don’t want to use supervised learning for that, because we don’t want to need any examples of states that cause it to blow up.

· We would rather be able to detect that it is going into such a state without ever having seen such a state before. We can do that by building a model of the normal state and noticing that the current state is different from the normal states.

· That is what a Boltzmann Machine represents.

· The way this system works is that we feed our training data into the Boltzmann Machine as input to help it adjust its weights. It comes to resemble our system, not just any nuclear power station in the world.

· It learns from the input what the possible connections between all these parameters are and how they influence each other, and it therefore becomes a machine that represents our system.

· We can then use this Boltzmann Machine to monitor our system.

· The Boltzmann Machine learns how the system behaves in its normal states from good examples.

Boltzmann Machine consists of a neural network with an input layer and one or several hidden layers. The neurons in the neural network make stochastic decisions about whether to turn on or off based on the data we feed during training and the cost function the Boltzmann Machine is trying to minimize.

By doing so, the Boltzmann Machine discovers interesting features about the data, which help model the complex underlying relationships and patterns present in the data.

This Boltzmann Machine uses neural networks with neurons that are connected not only to other neurons in other layers but also to neurons within the same layer. That makes training an unrestricted Boltzmann Machine very inefficient, and as a result unrestricted Boltzmann Machines have had very little commercial success.

Boltzmann Machines are primarily divided into two categories: Energy-based Models (EBMs) and Restricted Boltzmann Machines (RBMs). When RBMs are stacked on top of each other, they are known as Deep Belief Networks (DBNs).

2. An Energy-Based Model.

Energy is a term that may not be associated with deep learning in the first place. Rather, energy is a quantitative property from physics. For example, gravitational energy describes the potential energy a body with mass has in relation to another massive object due to gravity. Yet some deep learning architectures use the idea of energy as a metric for the measurement of the model’s quality.

One purpose of deep learning models is to encode dependencies between variables. Dependencies are captured by associating a scalar energy with each configuration of the variables, which serves as a measure of compatibility. High energy means poor compatibility. An energy-based model always tries to minimize a predefined energy function. The energy function for RBMs is defined as follows:
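$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}$$

Here vi and hj are the states of the visible and hidden units, ai and bj are their biases, and wij is the weight between them (the same notation is used again in the Contrastive Divergence section).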

As can be seen, the value of the energy function depends on the configuration of the visible/input states, the hidden states, the weights, and the biases. Training an RBM consists of finding parameters for the given input values such that the energy reaches a minimum.

3. Restricted Boltzmann Machine.

What makes RBMs different from Boltzmann machines is that the visible nodes aren’t connected to each other, and the hidden nodes aren’t connected to each other. Other than that, RBMs are exactly the same as Boltzmann machines.

As you can see below:

· An RBM is a neural network that belongs to the family of energy-based models.

· It is a probabilistic, unsupervised, generative deep machine learning algorithm.

· RBM’s objective is to find the joint probability distribution that maximizes the log-likelihood function.

· An RBM is undirected and has only two layers: an input layer and a hidden layer.

· All visible nodes are connected to all the hidden nodes. Since an RBM has two layers, a visible (input) layer and a hidden layer, with connections only between the layers, it forms a symmetrical bipartite graph.

· No intralayer connection exists between the visible nodes. There is also no intralayer connection between the hidden nodes. There are connections only between input and hidden nodes.

· The original Boltzmann machine had connections between all the nodes. Since the RBM restricts intralayer connections, it is called a Restricted Boltzmann Machine.

Since RBMs are undirected, they don’t adjust their weights through gradient descent and backpropagation. They adjust their weights through a process called contrastive divergence. At the start of this process, weights for the visible nodes are randomly generated and used to generate the hidden nodes. These hidden nodes then use the same weights to reconstruct the visible nodes. The weights used to reconstruct the visible nodes are the same throughout. However, the reconstructed nodes are not identical to the originals, because the visible units aren’t connected to each other and are each reconstructed independently from the hidden units.

4. How Does a Restricted Boltzmann Machine Work?

In an RBM, we have a symmetric bipartite graph where no two units within the same group are connected. Multiple RBMs can also be stacked and can be fine-tuned through the process of gradient descent and back-propagation. Such a network is called a Deep Belief Network. Although RBMs are occasionally used, most people in the deep-learning community have started replacing their use with Generative Adversarial Networks or Variational Autoencoders.

An RBM is a stochastic neural network, which means that each neuron will have some random behavior when activated. There are two other layers of bias units (hidden bias and visible bias) in an RBM. This is what makes RBMs different from autoencoders. The hidden bias helps the RBM produce the activations on the forward pass, and the visible bias helps the RBM reconstruct the input during the backward pass. The reconstructed input is always different from the actual input, as there are no connections among the visible units and therefore no way of transferring information among themselves.

The above image shows the first step in training an RBM with multiple inputs. The inputs are multiplied by the weights and then added to the bias. The result is then passed through a sigmoid activation function, and the output determines whether the hidden state gets activated or not. The weights form a matrix with the number of input nodes as the number of rows and the number of hidden nodes as the number of columns. The first hidden node will receive the inputs multiplied by the first column of weights before the corresponding bias term is added to it.

And if you are wondering what a sigmoid function is, here is the formula:
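$$\sigma(x) = \frac{1}{1 + e^{-x}}$$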

So the equation that we get in this step would be,
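$$h^{(1)} = \sigma\left(W^{T} v^{(0)} + a\right)$$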

where h(1) and v(0) are the corresponding vectors (column matrices) for the hidden and the visible layers, with the superscript denoting the iteration (v(0) means the input that we provide to the network), and a is the hidden layer bias vector.

(Note that we are dealing with vectors and matrices here and not one-dimensional values.)

Now this image shows the reverse phase or the reconstruction phase. It is similar to the first pass but in the opposite direction. The equation comes out to be:
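$$v^{(1)} = \sigma\left(W h^{(1)} + b\right)$$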

where v(1) and h(1) are the corresponding vectors (column matrices) for the visible and the hidden layers, with the superscript denoting the iteration, and b is the visible layer bias vector.

4.1. The learning process

Now, the difference v(0)−v(1) can be considered as the reconstruction error that we need to reduce in subsequent steps of the training process. So the weights are adjusted in each iteration so as to minimize this error and this is what the learning process essentially is. Now, let us try to understand this process in mathematical terms without going too deep into mathematics. In the forward pass, we are calculating the probability of output h(1) given the input v(0) and the weights W denoted by:
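$$p\left(h^{(1)} \mid v^{(0)};\, W\right)$$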

And in the backward pass, while reconstructing the input, we are calculating the probability of output v(1) given the input h(1) and the weights W denoted by:
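$$p\left(v^{(1)} \mid h^{(1)};\, W\right)$$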

The weights used in both the forward and the backward pass are the same. Together, these two conditional probabilities lead us to the joint distribution of inputs and the activations:
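$$p\left(v^{(0)}, h^{(1)}\right)$$

(its explicit form in terms of the energy function appears in the Contrastive Divergence section below).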

Reconstruction is different from regression or classification in that it estimates the probability distribution of the original input instead of associating a continuous/discrete value to an input example. This means it is trying to guess multiple values at the same time. This is known as generative learning as opposed to discriminative learning that happens in a classification problem (mapping input to labels).

Let us try to see how the algorithm reduces loss or simply put, how it reduces the error at each step. Assume that we have two normal distributions, one from the input data (denoted by p(x)) and one from the reconstructed input approximation (denoted by q(x)). The difference between these two distributions is our error in the graphical sense and our goal is to minimize it, i.e., bring the graphs as close as possible. This idea is represented by a term called the Kullback–Leibler divergence.
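For two distributions p(x) and q(x), the KL divergence is defined as

$$D_{KL}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$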

KL-divergence measures the non-overlapping areas under the two graphs and the RBM’s optimization algorithm tries to minimize this difference by changing the weights so that the reconstruction closely resembles the input. The graphs on the right-hand side show the integration of the difference in the areas of the curves on the left.

This gives us intuition about our error term. Now, to see how this is actually done for RBMs, we will have to dive into how the loss is being computed. All common training algorithms for RBMs approximate the log-likelihood gradient given some data and perform gradient ascent on these approximations.

5. Contrastive Divergence.

Boltzmann Machines (and RBMs) are Energy-based models and a joint configuration, (v,h) of the visible and hidden units has energy given by:
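$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}$$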

Where vi, hj are the binary states of visible unit i and hidden unit j, ai, bj are their biases and wij is the weight between them.

The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors:
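$$p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$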

Z here is the partition function and is given by summing over all possible pairs of visible and hidden vectors:
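$$Z = \sum_{v, h} e^{-E(v, h)}$$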

This gives us:
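$$p(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{v, h} e^{-E(v, h)}}$$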

The log-likelihood gradient or the derivative of the log probability of a training vector with respect to weight is surprisingly simple:
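$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$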

Where the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:
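$$\Delta w_{ij} = \alpha \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$$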

where α is a learning rate. For more information on what the above equations mean or how they are derived, refer to the Guide on training RBMs by Geoffrey Hinton. The important thing to note here is that, because there are no direct connections between hidden units in an RBM, it is very easy to get an unbiased sample of ⟨vihj⟩data. Getting an unbiased sample of ⟨vihj⟩model, however, is much more difficult. This is because it would require us to run a Markov chain until the stationary distribution is reached (which means the energy of the distribution is minimized, i.e., equilibrium) to approximate the second term.

So instead of doing that, we perform Gibbs sampling from the distribution. It is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximately drawn from a specified multivariate probability distribution when direct sampling is difficult (like in our case). The Gibbs chain is initialized with a training example v(0) of the training set and yields the sample v(k) after k steps.

Each step t consists of sampling h(t) from p(h∣v(t)) and sampling v(t+1) from p(v∣h(t)) subsequently (the value k=1 surprisingly works quite well).

The learning rule now becomes:
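Writing the expectation over the reconstructions obtained after k Gibbs steps as ⟨vihj⟩recon, this is

$$\Delta w_{ij} = \alpha \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$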

The learning works well even though it only crudely approximates the gradient of the log probability of the training data. The learning rule much more closely approximates the gradient of another objective function, called the Contrastive Divergence, which is the difference between two Kullback–Leibler divergences.

When we apply this, we get:
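In Hinton’s formulation, with p0 denoting the data distribution, pk the distribution after k steps of Gibbs sampling, and p∞ the model’s equilibrium distribution, this can be written as

$$CD_k = D_{KL}(p_0 \,\|\, p_\infty) - D_{KL}(p_k \,\|\, p_\infty)$$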

where the second term is obtained after k steps of Gibbs sampling.

Here is the pseudo-code for the CD algorithm:
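As an illustrative sketch (written in NumPy; the variable names are ours, not from the original figure), one CD-k update for a binary RBM looks roughly like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, a, b, k=1, lr=0.01, rng=np.random.default_rng()):
    """One CD-k update for a binary RBM.
    v0: (n_visible,) binary input vector
    W:  (n_visible, n_hidden) weights; a: hidden bias; b: visible bias."""
    # Positive phase: hidden probabilities and a sample driven by the data
    ph0 = sigmoid(v0 @ W + a)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    vk = v0.copy()
    # k steps of Gibbs sampling (the reconstruction / "negative" phase)
    for _ in range(k):
        pv = sigmoid(h @ W.T + b)
        vk = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(vk @ W + a)
        h = (rng.random(ph.shape) < ph).astype(float)
    phk = sigmoid(vk @ W + a)
    # Updates follow <v h>_data - <v h>_recon
    W += lr * (np.outer(v0, ph0) - np.outer(vk, phk))
    a += lr * (ph0 - phk)
    b += lr * (v0 - vk)
    return W, a, b
```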

Example: Recommender System of Movies

Let us assume that some people were asked to rate a set of movies on a scale of 1–5 and each movie could be explained in terms of a set of latent factors such as drama, fantasy, action and many more. Restricted Boltzmann Machines are used to analyze and find out these underlying factors.

The analysis of hidden factors is performed in a binary way, i.e., the user only indicates whether they liked a specific movie (rating 1) or not (rating 0), and these values represent the inputs for the input/visible layer. Given the inputs, the RBM then tries to discover latent factors in the data that can explain the movie choices, and each hidden neuron represents one of the latent factors.

Let us consider the following example where a user likes Lord of the Rings and Harry Potter but does not like The Matrix, Fight Club and Titanic. The Hobbit has not been seen yet so it gets a -1 rating. Given these inputs, the Boltzmann Machine may identify three hidden factors Drama, Fantasy and Science Fiction which correspond to the movie genres.

Using Latent Factors for Prediction

After the training phase, the goal is to predict a binary rating for the movies that have not been seen yet. Given the training data of a specific user, the network is able to identify the latent factors based on the user’s preferences, and a sample from a Bernoulli distribution can be used to find out which of the visible neurons then become active.

The image shows the new ratings after using the hidden neuron values for the inference. The network identified Fantasy as the preferred movie genre and rated The Hobbit as a movie the user would like.

The process from training to the prediction phase goes as follows:

  • Train the network on the data of all users
  • During inference-time, take the training data of a specific user
  • Use this data to obtain the activations of hidden neurons
  • Use the hidden neuron values to get the activations of input neurons
  • The new values of the input neurons show the ratings the user would give to yet-unseen movies

Note: If you want this article, check out my Academia.edu profile.

6. Practical Implementation of Restricted Boltzmann Machine.

Recommender System

From Amazon product suggestions to Netflix movie recommendations, good recommender systems are very valuable in today’s world. And specialists who can create them are some of the top-paid Data Scientists on the planet.

We will work on a dataset that has exactly the same features as the Netflix dataset: plenty of movies, thousands of users, who have rated the movies they watched. The ratings go from 1 to 5, exactly like in the Netflix dataset, which makes the Recommender System more complex to build than if the ratings were simply “Liked” or “Not Liked”.

Your final Recommender System will be able to predict the ratings of the movies the customers didn’t watch. Accordingly, by ranking the predictions from 5 down to 1, your Deep Learning model will be able to recommend which movies each user should watch.

Our model will be a Restricted Boltzmann Machine, the building block of Deep Belief Networks. And you will even be able to apply it to yourself or your friends. The list of movies will be explicit, so you will simply need to rate the movies you have already watched, input your ratings into the dataset, execute your model, and voila! The Recommender System will tell you exactly which movies you would love on a night when you are out of ideas of what to watch on Netflix!

Let’s solve the problem

Part 1: Data Preprocessing

In this part, we are doing Data Preprocessing.

1.1 Import the Libraries

In this step, we import the libraries needed for the data preprocessing part. Basically, a library is a tool that you can use to do a specific job. First we import the numpy library, used for multidimensional arrays, then the pandas library, used to import the dataset. Then we import torch, the PyTorch library, along with several of its packages: torch.nn as nn for initializing the neural network, torch.nn.parallel for parallel computation, torch.optim as optim for the optimizer, torch.utils.data for data loading and processing, and torch.autograd for automatic differentiation.
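A sketch of these imports (following the description above) would be:

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable  # automatic differentiation
```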

1.2 Import the dataset

In the next step, we import the users, ratings, and movies datasets. In our case, the dataset is separated by double colons. The dataset does not have any headers, so we pass the header argument as None. We then set the engine to Python to ensure the dataset is imported correctly.

We then use the Latin-1 encoding type since some of the movies have special characters in their titles. The first column of the rating dataset is the user ID, the second column is the movie ID, the third column is the rating and the fourth column is the timestamp.
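A sketch of this step (the file names below assume the MovieLens 1M layout and are only illustrative; adjust the paths to wherever your data lives):

```python
# sep='::' handles the double-colon separator; engine='python' and
# encoding='latin-1' match the description above
movies = pd.read_csv('ml-1m/movies.dat', sep='::', header=None,
                     engine='python', encoding='latin-1')
users = pd.read_csv('ml-1m/users.dat', sep='::', header=None,
                    engine='python', encoding='latin-1')
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', header=None,
                      engine='python', encoding='latin-1')
```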

1.3 Preparing the training set and test set

Let’s now prepare our training set and test set. Our training and test sets are tab-separated; therefore we’ll pass in the delimiter argument as \t. As we know very well, pandas imports the data as a data frame. However, we need to convert it to an array so we can use it in PyTorch tensors. We do that using the np.array command from NumPy. We also specify that our array should contain integers since we’re dealing with integer data types.
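A sketch of this step (the u1 train/test split of the MovieLens 100k data is an assumption here; use whichever tab-separated files you have):

```python
training_set = pd.read_csv('ml-100k/u1.base', delimiter='\t', header=None)
training_set = np.array(training_set, dtype='int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter='\t', header=None)
test_set = np.array(test_set, dtype='int')
```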

1.4 Getting the Number of Users and Movies

In order to build the RBM, we need a matrix with the users’ ratings. This matrix will have the users as the rows and the movies as the columns. The matrix will contain a user’s rating of a specific movie. Zeros will represent observations where a user didn’t rate a specific movie.

In order to create this matrix, we need to obtain the number of movies and the number of users in our dataset. For no_users we pass in zero since it’s the index of the user ID column. The way we obtain the number of users is by getting the max in the training and test set, and then using the max utility to get the maximum of the two. We then force the obtained number to be an integer by wrapping the entire function inside an int.
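In code, this might look like the following (column 0 holds the user IDs and column 1 the movie IDs):

```python
no_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
no_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))
```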

1.5 Converting the data into an array with users in lines and movies in columns

Next, we create a function that will create the matrix. The reason for doing this is to set up the dataset in a way that the RBM expects as input. The function takes in our data as input and converts it into the matrix.

First, we create an empty list called new_data. We then create a loop that will go through the dataset and fetch all the movies rated by a specific user, along with the ratings by that same user. Notice that we loop up to no_users + 1 to include the last user ID, since the range function doesn’t include the upper bound.

Since there are movies that the user didn’t rate, we first create a matrix of zeros. We then update the zeros with the user’s ratings. When appending the movie ratings, we use id_movies - 1 because indices in Python start from zero; we therefore subtract one so that the first movie lands at index zero. We append the ratings to new_data as a list. This will create a list of lists. Now let’s use our function and convert our training and test data into matrices.
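A sketch of this function (the name convert and the loop variable names are our own choices):

```python
def convert(data):
    new_data = []
    for id_users in range(1, no_users + 1):
        # All movies rated by this user and the corresponding ratings
        id_movies = data[:, 1][data[:, 0] == id_users]
        id_ratings = data[:, 2][data[:, 0] == id_users]
        # Start from a row of zeros (unrated) and fill in the known ratings
        ratings = np.zeros(no_movies)
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data

training_set = convert(training_set)
test_set = convert(test_set)
```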

1.6 Converting the data into Torch tensors

Since we’re using PyTorch, we need to convert the data into Torch tensors. The way we do this is by using the FloatTensor utility. This will convert the dataset into PyTorch tensors.
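In code:

```python
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)
```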

1.7 Converting the Rating into Binary rating 1 (Liked) or 0 (Not Liked)

Next, we convert these ratings into binary ratings, since we want to make a binary classification. Remember that we already have zero ratings in the dataset, representing the movies that a user didn’t rate; we replace those zeros with -1 to mark movies that a user never rated. We then convert the ratings of 1 and 2 to 0, and the ratings of 3, 4, and 5 to 1. We do this for both the test set and the training set.
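A sketch of this re-coding step:

```python
# 0 marks "not rated": re-code it as -1, then binarize the real ratings
training_set[training_set == 0] = -1
training_set[training_set == 1] = 0
training_set[training_set == 2] = 0
training_set[training_set >= 3] = 1
test_set[test_set == 0] = -1
test_set[test_set == 1] = 0
test_set[test_set == 2] = 0
test_set[test_set >= 3] = 1
```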

Part 2: Building our Model

In this Second Part, we will Build our Restricted Boltzmann Machine (RBM).

2.1 Creating the RBM Architecture

Now we need to create a class to define the architecture of the RBM. Inside the __init__ function we specify two parameters: the first is the number of visible nodes nv, and the second is the number of hidden nodes nh.

Next, we initialize the weights and biases. We do this randomly, using a normal distribution via torch.randn. The weight matrix is of size nh by nv. We then define two types of biases: a is the bias used for the probability of the hidden nodes given the visible nodes, and b is the bias used for the probability of the visible nodes given the hidden nodes. In declaring them we pass 1 as the first dimension, which represents the batch.

The next step is to create a function sample_h which will sample the hidden nodes. It takes x as an argument, which represents the visible neurons.

Next, we compute the probability of h given v where h and v represent the hidden and visible nodes respectively. This represents the sigmoid activation function and is computed as the product of the vector of the weights and x plus the bias a. The product is done using the mm utility from Torch. Since we’re doing binary classification, we also return Bernoulli samples of the hidden neurons.

Next, we create a function sample_v that will sample the visible nodes. The function is similar to the sample_h function.

The next function we create is the training function. It takes the following parameters: the input vector containing the movie ratings, the visible nodes obtained after k samplings, the vector of probabilities, and the probabilities of the hidden nodes after k samplings.
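Putting Section 2.1 together, a sketch of the class might look like this (the argument names v0, vk, ph0, phk are our own labels for the quantities just described):

```python
class RBM():
    def __init__(self, nv, nh):
        # Weights (nh x nv) and biases drawn from a normal distribution
        self.W = torch.randn(nh, nv)
        self.a = torch.randn(1, nh)  # bias for p(h | v); the 1 is the batch dimension
        self.b = torch.randn(1, nv)  # bias for p(v | h)

    def sample_h(self, x):
        # x: visible neurons; returns p(h | v) and a Bernoulli sample of it
        wx = torch.mm(x, self.W.t())
        activation = wx + self.a.expand_as(wx)
        p_h_given_v = torch.sigmoid(activation)
        return p_h_given_v, torch.bernoulli(p_h_given_v)

    def sample_v(self, y):
        # y: hidden neurons; returns p(v | h) and a Bernoulli sample of it
        wy = torch.mm(y, self.W)
        activation = wy + self.b.expand_as(wy)
        p_v_given_h = torch.sigmoid(activation)
        return p_v_given_h, torch.bernoulli(p_v_given_h)

    def train(self, v0, vk, ph0, phk):
        # Contrastive-divergence update of the weights and biases
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)
```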

Now we set the number of visible nodes to the length of the training set and the number of hidden nodes to 100. The number of visible nodes corresponds to the number of features in our training set. The number of hidden nodes determines the number of features that we’d like our RBM to detect. We also set a batch size of 100 and then call the class RBM.
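In code, that is:

```python
nv = len(training_set[0])  # number of visible nodes = number of movies
nh = 100                   # number of hidden features to detect
batch_size = 100
rbm = RBM(nv, nh)
```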

Part 3: Training the RBM Model

The first step in training the RBM is to define the number of epochs. We then define a loop through which all of the training set will pass. After each epoch, the weights will be adjusted in order to improve the predictions. Finally, we obtain the visible nodes with the ratings of the movies that were not rated by the users.
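A sketch of such a training loop, with 10 epochs and k = 10 Gibbs steps (both numbers are illustrative choices):

```python
nb_epoch = 10
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(0, no_users - batch_size, batch_size):
        v0 = training_set[id_user:id_user + batch_size]  # original ratings (targets)
        vk = v0.clone()                                   # will hold the reconstruction
        ph0, _ = rbm.sample_h(v0)
        for k in range(10):
            _, hk = rbm.sample_h(vk)
            _, vk = rbm.sample_v(hk)
            vk[v0 < 0] = v0[v0 < 0]  # keep the unrated (-1) entries frozen
        phk, _ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0 >= 0] - vk[v0 >= 0])).item()
        s += 1.
    print('epoch: ' + str(epoch) + ' loss: ' + str(train_loss / s))
```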

Part 4: Testing the RBM Model

Next, we test our RBM. In this stage, we use the training-set data of each user to activate the hidden neurons and obtain the output. This is how we get the predicted output for the test set. We then use the mean absolute difference between the predictions and the test ratings to compute the test loss.
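A sketch of the test loop (following the same conventions as above):

```python
test_loss = 0
s = 0.
for id_user in range(no_users):
    v = training_set[id_user:id_user + 1]   # inputs used to activate the hidden neurons
    vt = test_set[id_user:id_user + 1]      # target ratings held out in the test set
    if len(vt[vt >= 0]) > 0:
        _, h = rbm.sample_h(v)
        _, v = rbm.sample_v(h)
        test_loss += torch.mean(torch.abs(vt[vt >= 0] - v[vt >= 0])).item()
        s += 1.
print('test loss: ' + str(test_loss / s))
```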

If you want the dataset and code, you can also check my GitHub profile.

End Notes

If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

To stay more aware of the world of machine learning, follow me. It’s the best way to find out when I write more articles like this.

You can also follow me on GitHub for the code and dataset, follow me on Academia.edu for this article, reach out on Twitter, email me directly, or find me on LinkedIn. I’d love to hear from you.

That’s all folks, Have a nice day :)
