An Introduction to Boltzmann Machines with Memory: Dynamic Boltzmann Machines
Part 1: Energy Based Models
Before diving into Dynamic Boltzmann Machines, let’s start by jogging your memory a little… or maybe more than a little. Let’s time-travel back to your physics lectures from school. Can you remember your teacher telling you why gas is spread evenly across the room and not collected in one corner? Does the term Maxwell-Boltzmann Distribution ring a bell? Yes, that's where the fundamentals of Dynamic Boltzmann Machines originate. The connection might seem vague at the moment, but by the end of this article it should all make sense.
For now, remember that the Maxwell-Boltzmann equation, which forms the basis of the Kinetic Theory of Gases, defines the distribution of speed for a gas at a certain temperature. The key takeaway is:
“At room temperature, gases are more likely to be spread out evenly across a room as this configuration minimizes the total energy of the system they are a part of.”
In fact, this is the core of energy-based models: their goal is to find model configurations that minimize the energy of the system they represent.
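To make the physics intuition concrete, here is a minimal sketch of the Boltzmann distribution itself: the probability of a configuration is proportional to exp(-E/T), so lower-energy configurations are exponentially more likely. (The function name and the convention of absorbing Boltzmann's constant into the temperature are illustrative choices, not from the original paper.)

```python
import math

def boltzmann_probabilities(energies, temperature=1.0):
    """Probability of each configuration under the Boltzmann distribution:
    p_i is proportional to exp(-E_i / T). Boltzmann's constant is
    absorbed into the temperature here."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)  # the partition function, which normalizes the weights
    return [w / z for w in weights]

# Lower-energy configurations are exponentially more likely.
probs = boltzmann_probabilities([0.0, 1.0, 2.0])
```

This is exactly the sense in which an evenly spread gas is the "most probable" state: it is the configuration with the lowest energy, and therefore the highest Boltzmann probability.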
Part 2: Boltzmann Machines
Keeping this thought in mind, let’s look at a Boltzmann Machine:
Some key features stand out immediately:
- Each node is connected to all other nodes.
- There is no single output in the machine, only hidden units and visible units, where the visible units represent the data. This model tries to understand the distribution of the data and recreate the data based on that distribution.
Even with only a few nodes, a Boltzmann Machine has too many connections to compute with efficiently. To resolve this, researchers came up with Restricted Boltzmann Machines (RBMs), in which there are no connections among the visible nodes or among the hidden nodes; every connection runs between the visible layer and the hidden layer. (See the architecture in Figure 2, for example).
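The "restriction" shows up directly in the RBM's energy function, which couples visible units only to hidden units. A minimal sketch (the weights here are random placeholders, just to show the shape of the computation):

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """Energy of a joint (visible, hidden) configuration in an RBM:
    E(v, h) = -v.W.h - b_v.v - b_h.h
    Note there are no visible-visible or hidden-hidden terms:
    that is the 'restriction'."""
    return -(v @ W @ h) - (b_v @ v) - (b_h @ h)

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=6).astype(float)  # 6 visible units
h = rng.integers(0, 2, size=3).astype(float)  # 3 hidden units
W = rng.normal(scale=0.1, size=(6, 3))        # visible-to-hidden weights
E = rbm_energy(v, h, W, np.zeros(6), np.zeros(3))
```

Training an RBM means adjusting W and the biases so that the configurations corresponding to real data end up with low energy, and therefore high probability.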
Before we see why the Boltzmann Machine works, let’s see how it works. The following example explains the functioning of a Boltzmann Machine in a crisp, concrete way.
Use Case: Predict whether a user will like a song or not.
Let’s say we have trained an RBM on a music dataset, where rows are users, and columns are songs. Cell values are either 0 or 1 depending on whether the user disliked or liked the song, and blank if the user has not rated the song.
We want to use this machine to predict whether a user will like a song they haven’t yet rated.
Figure 2 is our sample data, indicating some songs this particular user liked or disliked. There are some songs we don’t have any information for. A training period provides the optimal weights for the machine.
The nodes that contain a user’s preference information (green nodes) activate some of the hidden (blue) nodes based on the connection weights, as can be observed in Figure 3. The hidden nodes represent some of the important features the Boltzmann Machine has extracted from the training data. The nodes have been labelled for our understanding; in reality, they can be anything.
Based on the activation of the hidden nodes, we recreate the training data, which gives us some values in place of the missing values. These are, in fact, the predictions the model has made for this particular user. Pretty neat, right?
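The two steps above (activate the hidden features, then reconstruct the visible layer) can be sketched as a single up-down pass. The weights below are hand-picked toy values purely for illustration; in a real system they would come from training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_reconstruct(v, W, b_v, b_h):
    """One up-down pass through a trained RBM: visible -> hidden
    probabilities -> reconstructed visible probabilities. The
    reconstructed values for unrated songs are the model's predictions."""
    h_prob = sigmoid(v @ W + b_h)         # activate hidden feature nodes
    v_prob = sigmoid(h_prob @ W.T + b_v)  # reconstruct the visible layer
    return v_prob

# Toy example: 5 songs, 2 hidden features. Songs 0, 1, and 4 share a
# feature (first column of W); the user liked songs 0 and 1.
W = np.array([[2., 0.], [2., 0.], [0., 2.], [0., 2.], [2., -2.]])
v = np.array([1., 1., 0., 0., 0.])  # song 4 is unrated (treated as 0 here)
pred = rbm_reconstruct(v, W, np.zeros(5), np.zeros(2))
```

In this toy setup, `pred[4]` comes out above 0.5: the shared hidden feature activated by songs 0 and 1 pushes the reconstruction of the unrated song 4 toward "like".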
Perhaps you might be scratching your head and wondering why this is working. Well, the Boltzmann Machine algorithm is based on three simple ideas:
- Associative Memory in Action (Hebb’s Rule): The weights for positively correlated inputs are increased. In other words, neurons that fire together, wire together.
- Weights are chosen to maximize the probability of the observed data by minimizing the corresponding energy. (Yes, that’s the idea we described in the beginning.)
- The network can reconstruct the input using symmetric weights for correlated items. In the music recommendation example, once we have the weights for our system, we can fill in the missing information using the user ratings we have based on the weights of the system.
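The first two ideas meet in the standard training rule for RBMs, contrastive divergence. A rough sketch of one CD-1 step (a common training procedure for RBMs, though the article above doesn't specify which variant was used; the visible bias is omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_h, lr=0.1):
    """One contrastive-divergence (CD-1) weight update for an RBM.

    The positive phase strengthens weights between units that are
    active together in the data (Hebb's rule: fire together, wire
    together); the negative phase weakens weights for the model's own
    reconstruction. On average this lowers the energy of the observed
    data, raising its probability."""
    rng = np.random.default_rng(0)
    h_prob = sigmoid(v_data @ W + b_h)                    # up pass
    h_sample = (rng.random(h_prob.shape) < h_prob) * 1.0  # sample hidden units
    v_recon = sigmoid(h_sample @ W.T)                     # down pass
    h_recon = sigmoid(v_recon @ W + b_h)
    positive = np.outer(v_data, h_prob)    # data-driven correlations
    negative = np.outer(v_recon, h_recon)  # model-driven correlations
    return W + lr * (positive - negative)

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(5, 3))
W_new = cd1_update(np.array([1., 1., 0., 0., 1.]), W, np.zeros(3))
```

The `positive - negative` difference is Hebb's rule in action: correlations present in the data but not in the model's reconstructions get their weights increased.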
Fascinating, isn’t it?
Imagine what this technology could achieve if we were to add a time component – that is, if the data for a node included not only how it interacted with other nodes in the present, but also how it interacted with those nodes in the past. In fact, researchers at IBM Research - Tokyo proposed exactly this process in 2015 in the journal Scientific Reports.
Let’s consider how it works.
Part 3: Dynamic Boltzmann Machines
Let’s say that the target sequence for this “Boltzmann Machine with Memory” is a 7 by 35 bitmap image of the word SCIENCE:
We want this machine to generate the entire training dataset on its own or to generate the entire training dataset based on a cue from the original data. For example, given a cue bitmap image containing “SCI”, it would generate “ENCE” on its own from the weights it has learned.
To get a picture of the system in our mind’s eye, consider Figure 5.
The machine contains seven nodes. A 7 by 35 bitmap image representing the word SCIENCE is the target sequence. One training period consists of showing the machine this target sequence once. The target is broken down into 35 strips of 7 values and fed into the machine in the same order as they appear in the target. Figure 5 shows what five such input strips would look like. We see that the first strip of our bitmap, consisting of all 1s, is reflected in the first value of the inputs of the nodes. This is how we input values into the machine during training.
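The slicing step above is simple to picture in code. Here the bitmap is a random stand-in for the SCIENCE image (with the first column forced to all 1s, as described for Figure 5), just to show the shape of the input stream:

```python
import numpy as np

# A stand-in for the 7x35 SCIENCE bitmap (random bits for illustration).
rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(7, 35))
bitmap[:, 0] = 1  # the first strip is all 1s, as in the figure

# Feed the image to the machine one 7-value strip (column) at a time,
# in left-to-right order: one training period shows all 35 strips once.
strips = [bitmap[:, t] for t in range(bitmap.shape[1])]
```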
When we first initialize the machine with arbitrary weights and ask it to generate a sequence, it creates something completely random. After we train the machine for 130,000 training periods, it's able to generate the entire sequence on its own. This means that during training, it optimizes its weights to learn not just the co-occurrence of bits in one timestep, but across an entire sequence.
Magical, isn’t it? But there is solid logic behind this magic trick. Let’s unravel the mystery and see why this “Boltzmann Machine with Memory” works.
This is the structure of a Dynamic Boltzmann Machine (DyBM). In a Boltzmann Machine, a node contains information about which nodes activate it at a certain point in time. This makes it aware of events that occur together, but it does not give it the ability to look back and build associations across different timesteps. In a DyBM, by contrast, the connections between the nodes represent how the nodes interact over time, not just at any particular timestep. The DyBM facilitates this by adding a conduction delay between the nodes.
With this new architecture, a node has information about which other nodes catalyzed its activation at some timestep T = t through their own activities in the past timesteps T = t – 1, T = t – 2, and so on. This “memory” is added to a node in the form of a memory unit. This unit alters the probability that a node is activated at any moment, depending on the previous values of other nodes and its own associated weights.
Let’s say we have two nodes, A and B. At a high level, we’re extending the notion that “neurons that fire together, wire together” across the dimension of time. For example, imagine that the activation of A consistently leads to the activation of B after two timesteps. A non-dynamic Boltzmann Machine doesn’t capture this pattern, but a DyBM can: since the value of A travels to B after some delay, it can capture the pattern that B = 1 sometime after A = 1. Now the probability that B = 1 at timestep T = t will vary based not only on the value of A at T = t, but also on the values of A at T = t – 1, T = t – 2, and so on, depending on the amount of conduction delay between A and B.
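A conduction delay is, mechanically, just a FIFO buffer on the connection: a value sent by A at time t arrives at B at time t + d. A minimal sketch (the class name is ours, not from the paper):

```python
from collections import deque

class DelayedSynapse:
    """A conduction delay of d timesteps between two nodes: the value
    node A sends at time t arrives at node B at time t + d."""

    def __init__(self, delay):
        # Pre-fill with zeros so the first d outputs are "nothing yet".
        self.buffer = deque([0] * delay, maxlen=delay)

    def step(self, value_from_a):
        arriving_at_b = self.buffer[0]     # oldest value arrives now
        self.buffer.append(value_from_a)   # newest value enters the pipe
        return arriving_at_b

# A fires at t=0; with delay 2, B sees that spike at t=2.
synapse = DelayedSynapse(delay=2)
arrivals = [synapse.step(x) for x in [1, 0, 0, 0]]
```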
The machine stores the values so that recent values are given higher weight, which makes sense since in general the most recent parts of a time series are the most informative about the latest trend. A DyBM stores this information in the eligibility traces. The Synaptic Eligibility Trace of B contains the weighted sum of the values that have reached B from A after some conduction delay. The Neural Eligibility Trace of B contains the weighted sum of its past values.
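The "recent values weigh more" behavior comes from geometric decay: each trace is multiplied by a decay rate every timestep before the newest value is added in. A sketch of the bookkeeping (the decay rates here are illustrative placeholders, not the paper's values):

```python
def update_traces(synaptic_trace, neural_trace, x_arriving, x_own,
                  decay_synaptic=0.5, decay_neural=0.5):
    """One timestep of eligibility-trace bookkeeping for a node B.

    synaptic_trace: decayed sum of the values that reached B from
        another node after the conduction delay.
    neural_trace: decayed sum of B's own past values.
    Multiplying by a decay rate in (0, 1) each step gives recent
    values exponentially higher weight than old ones."""
    synaptic_trace = decay_synaptic * synaptic_trace + x_arriving
    neural_trace = decay_neural * neural_trace + x_own
    return synaptic_trace, neural_trace

# Recent inputs dominate: the final 1 contributes a full 1.0, while
# the 1 from three steps ago has decayed to 0.125.
s, n = 0.0, 0.0
for x in [1, 0, 0, 1]:
    s, n = update_traces(s, n, x, x)
```

These traces are what the DyBM's weights multiply when computing the activation probability of a node, so the learned parameters effectively decide how much the past should influence the present.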
Similar to standard recurrent neural networks, we can unfold the DyBM through time. The unfolded DyBM is a Boltzmann Machine having an infinite number of units, each representing the value of a node at a particular time.
So, this is the Dynamic Boltzmann Machine: an architecture that has the power to recreate the training data not just at one point in time, but across a sequence of that data.
DyBMs are fascinating, and the part that follows drives the point home.
Part 4: Faceoff Between RNN-Gaussian-DyBM and LSTM
All the examples we’ve seen till now have dealt with binary data (Bernoulli Distribution). IBM researchers went one step further and created a DyBM that could model Gaussian distributions and made it possible for users like us to model time-series data using DyBM and its variations.
To check the efficiency of a DyBM, I ran some tests comparing an RNN-Gaussian-DyBM (a DyBM with an RNN layer) against a current state-of-the-art model, the Long Short-Term Memory (LSTM) network. The results were exciting. Feel free to run these tests on your own, based on the script available here.
Let’s see how a DyBM compares to an LSTM on a time-series use case.
Use Case: Predict the value of the next sunspot number.
We’ll use data containing the Monthly Sunspot Number calculated in a lab in Zurich from the year 1749 to 1983. The data is open source and available from Datamarket - Monthly sunspot number, Zurich, 1749-1983.
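For one-step-ahead prediction, both models see a window of recent values and predict the next one. A sketch of that preprocessing step (this is illustrative, not the exact script linked above, and we use a synthetic series as a stand-in for the sunspot data so the snippet is self-contained):

```python
import numpy as np

def make_one_step_pairs(series, window=10):
    """Turn a univariate series into (input window, next value) pairs
    for one-step-ahead prediction."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

# With the real data you'd load the monthly sunspot numbers here;
# a synthetic sine wave stands in just to show the shapes.
series = np.sin(np.linspace(0, 20, 200))
X, y = make_one_step_pairs(series, window=10)
```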
First, let’s see how the LSTM performed.
- Architecture: LSTM Dimension = 10.
- Performance over 10 epochs: Mean Test Score LSTM = 0.08877 RMSE
- Per epoch time to learn: 8.689403 sec.
Now we’ll see how an RNN-Gaussian-DyBM performed on the same data.
Brace yourself, this is going to be a very interesting journey of discovery… Ready?
- Architecture: RNN Dimension = 10 and Input Dimension = 1
- Performance over 10 epochs: Mean Test Score DyBM = 0.07848 RMSE
- Per epoch time to learn: 0.90547 sec.
Not only does the RNN-Gaussian-DyBM run nearly 10 times faster in this case, it also offers better predictive performance.
As we scale the number of epochs for the two models, the gap in training time grows drastically. Because the DyBM trains much faster, the train-test-deploy cycle shrinks and you can iterate on models much more quickly.
But we haven’t yet discussed the best part about DyBMs: You can speed them up with GPU acceleration. A DyBM running on a GPU on IBM Watson Studio Local with Power AI IBM Cloud Service can make predictions for over 2000 time series, each of length more than 500, in less than 10 seconds per epoch. By contrast, a DyBM running on CPU will take slightly over 30 minutes to do the same task. Compare this result with the performance of an LSTM on CPU from the previous example. Just imagine the computational power this provides. See here for more information about accelerating DyBMs with GPUs. It’s worth noting that you can also accelerate LSTMs with GPUs, and the performance comparison between an accelerated DyBM and an accelerated LSTM will vary.
Next time you want to solve a time-series problem, give Dynamic Boltzmann Machines a try. Consider starting with the IBM Research Tokyo GitHub repository for Dynamic Boltzmann Machines, which you can find here.
See below for additional in-depth research about Energy-Based Models, Boltzmann Machines and Dynamic Boltzmann Machines:
- A Tutorial on Energy-Based Learning – Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang
- Boltzmann Machines – Geoffrey Hinton
- Boltzmann Machines and Energy-Based Models – Takayuki Osogami (IBM Research - Tokyo)
- Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity – Takayuki Osogami and Makoto Otsuka
- Nonlinear Dynamic Boltzmann Machines for Time-Series Prediction – Sakyasingha Dasgupta and Takayuki Osogami (IBM Research – Tokyo)