# Introducing Paysage

**Paysage** is a new **PyTorch**-powered Python library for machine learning with Restricted Boltzmann Machines. We built Paysage from scratch at **Unlearn.AI** in order to bring the power of GPU acceleration, recent developments in machine learning, and our own new ideas to bear on the training of this model class.

We are excited to release this toolkit to the community as an open-source software library. This first release is version 0.2.0. We will be pushing regular updates from here on out, with many new features in the pipeline. 😉

*The next release (0.3.0) is scheduled for July 10, 2018.*


**Paysage is a powerful library for training RBMs and, more generally, energy-based neural network models.**

Paysage allows one to construct models

- composed of several different layer types,
- of any depth.

Paysage implements a number of training methods, such as:

- layerwise training for deep models,
- (persistent) contrastive divergence with MCMC,
- TAP-based, sampling-free extended mean field methods.

Paysage is built on top of two backends: NumPy and PyTorch. The latter unlocks the power of the GPU for greatly accelerated training and sampling.

We at Unlearn are excited to release this project and let the community take it to new heights.

**What is an energy-based model?**

Energy-based models (EBMs) are machine learning models which arise from the specification of an energy function **E(X; θ)**. This function yields a scalar value for each configuration of some specified variables **X**, and depends on some *model parameters* **θ** which can be adjusted. The energy function therefore forms a parameterized family of surfaces over the variables **X**. Training such a model involves searching through the parameters to find the energy surface which closely matches some desired configuration. A probabilistic perspective on such models is available by virtue of the Boltzmann distribution,

**p(X; θ) = exp(−E(X; θ)) / Z(θ)**, where **Z(θ) = Σ_X exp(−E(X; θ))**.

So an EBM also represents a probability distribution over **X** for each choice of **θ**. From this perspective, training can be thought of as fitting the probability distribution **p(X; θ)** to some desired distribution, perhaps one which is measured by some given dataset.
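To make the Boltzmann distribution concrete, here is a minimal NumPy sketch (not Paysage code) that turns an energy function over a small discrete space into a normalized probability distribution; the energy function and the parameters `theta` are hypothetical:

```python
import itertools
import numpy as np

def boltzmann_distribution(energy_fn, configs):
    """Convert an energy function into probabilities via the Boltzmann distribution."""
    energies = np.array([energy_fn(x) for x in configs])
    weights = np.exp(-energies)   # unnormalized Boltzmann weights
    Z = weights.sum()             # the partition function Z(theta)
    return weights / Z

# A toy energy over three binary variables, with hypothetical parameters theta:
theta = np.array([1.0, -0.5, 2.0])
energy = lambda x: -np.dot(theta, x)

# Enumerate all 2^3 configurations of X and compute their probabilities:
configs = [np.array(x) for x in itertools.product([0, 1], repeat=3)]
probs = boltzmann_distribution(energy, configs)
assert np.isclose(probs.sum(), 1.0)  # a valid probability distribution
```

Low-energy configurations receive high probability: here the most likely configuration is (1, 0, 1), which minimizes the energy.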

**Visible and hidden variables**

It is often the case that only some subset of the configuration variables **X** are observed in a training set. In this case the observed variables are called *visible* and the rest *hidden* or *latent* variables of the model. So **X** splits into **X = (V, H)**, and the family of probability distributions defined by the model which are relevant to training are the marginals over the hidden variables,

**p(V; θ) = Σ_H p(V, H; θ)**.
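Marginalizing out the hidden variables is just a sum over the hidden configurations. A tiny NumPy illustration, with a hypothetical joint probability table standing in for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical joint distribution over 4 visible configs and 4 hidden configs,
# stored as a table indexed by (visible config, hidden config):
joint = rng.random((4, 4))
joint /= joint.sum()            # normalize so it is a probability table

# Marginalizing out the hidden variables sums over the hidden axis:
p_visible = joint.sum(axis=1)   # p(v) = sum_h p(v, h)
assert np.isclose(p_visible.sum(), 1.0)
```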

**EBMs are natural tools in unsupervised learning**

Suppose one observes a bunch of samples of variables **V**. These samples can be thought of as random draws from some hypothetical *data distribution* **p_d(V)**. An EBM can be trained to approximate the hypothetical data distribution by fitting **p(V; θ)** to **p_d(V)**. The resulting model allows one to:

- draw new representative samples from the model distribution, thus simulating the processes which generate the training data,
- answer counter-factual questions about the data variables,
- make discriminative predictions between training data variables by conditioning the model distribution,
- impute missing data from the model distribution.

Furthermore, the fact that an EBM describes an explicit, parametrized probability distribution is a benefit compared to some of the other kinds of generative models.

**Energy-based neural networks**

An energy-based neural network is an EBM in which the energy function **E(v, h; θ)** arises from the architecture of an artificial neural network. It is probably best to understand this by way of example. Consider a two-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM). Such a model has, say, **n** real-valued *visible* units and, say, **m** binary-valued *hidden* units. So a configuration of visible and hidden variables constitutes a pair of vectors **(v, h)**, with **v** an **n**-dimensional real vector and **h** an **m**-dimensional binary vector. The visible and hidden units are separated into two *layers*: the visible is called a *Gaussian layer* and the hidden a *Bernoulli layer*.

The energy function for this model takes the form,

**E(v, h; m, s, b, W) = Σ_i (v_i − m_i)² / (2 s_i²) − Σ_j b_j h_j − Σ_{ij} (v_i / s_i) W_{ij} h_j.**

This energy function has a contribution from each layer (a quadratic potential from the Gaussian layer and a linear potential from the Bernoulli layer) along with a quadratic term which connects the layers mediated by the *weight matrix* **W**.
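The three contributions described above can be written down directly. Here is a NumPy sketch of a GBRBM energy under this common parameterization (the shapes and parameter values are hypothetical, and this is an illustration rather than Paysage's internal implementation):

```python
import numpy as np

def gbrbm_energy(v, h, m, s, b, W):
    """Energy of a Gaussian-Bernoulli RBM: a quadratic potential on the
    visible units, a linear potential on the hidden units, and a bilinear
    coupling of the rescaled visibles to the hiddens through W."""
    gaussian_term = np.sum((v - m) ** 2 / (2 * s ** 2))
    bernoulli_term = -np.dot(b, h)
    coupling_term = -np.dot(v / s, W @ h)
    return gaussian_term + bernoulli_term + coupling_term

# Hypothetical sizes: n = 4 visible units, 3 hidden units.
rng = np.random.default_rng(0)
v = rng.normal(size=4)
h = rng.integers(0, 2, size=3).astype(float)
W = rng.normal(size=(4, 3))
E = gbrbm_energy(v, h, m=np.zeros(4), s=np.ones(4), b=np.zeros(3), W=W)
```

With zero weights and biases and a standard Gaussian layer, the energy reduces to the quadratic potential alone, which is a quick sanity check on the implementation.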

From this one specification we can in principle write down the model probability distribution function **p(v; m, s, b, W)**. The most popular training scheme is to maximize the log-likelihood of the data given the model. That is, we want to compute,

**θ\* = argmax_θ Σ_{v ∈ data} log p(v; θ).**

The goal of this optimization problem is to find the parameters **θ** which maximize the likelihood of having observed the training data from the model distribution. There are more details on training further down in this blog post.

**Restricted Boltzmann Machines**

Restricted Boltzmann Machines (RBMs) are an important special class of energy-based neural networks. By definition an RBM has at least one hidden layer and carries the systematic restriction that the weight matrices only connect units in adjacent layers. This restriction allows a means of sampling the model distribution (block Gibbs sampling) which is vastly more efficient than the alternatives in the unrestricted case. Paysage was primarily designed to facilitate training of RBMs.
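The efficiency of block Gibbs sampling comes from the restriction itself: with no within-layer connections, all units in a layer are conditionally independent given the adjacent layer, so an entire layer can be resampled at once. A minimal NumPy sketch for a Bernoulli-Bernoulli RBM (the sizes and parameters are hypothetical; this is not Paysage's sampler):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_step(v, W, a, b, rng):
    """One block Gibbs update for a Bernoulli-Bernoulli RBM. Each layer's
    units are conditionally independent given the other layer, so each
    layer is sampled in a single vectorized step."""
    p_h = sigmoid(b + v @ W)                       # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + W @ h)                       # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))             # hypothetical: 6 visible, 4 hidden
a, b = np.zeros(6), np.zeros(4)

# Run the chain for a while to draw approximate samples from the model:
v = rng.integers(0, 2, size=6).astype(float)
for _ in range(100):
    v, h = block_gibbs_step(v, W, a, b, rng)
```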

What follows is some more in-depth commentary about a couple of the features of Paysage and an annotated example of training a 2-layer Gaussian-Bernoulli Boltzmann Machine on MNIST.

**Generalities on model construction in Paysage**

Paysage allows you to construct any model whose energy function takes the form:

in which **H_i()** is the layer **i**’s *energy* function, **r_i()** it’s *rescale* function, **W_i** is the weight matrix connecting layers **i** and **i-1**. (Here we have conveniently set** h_0 = v** to compactify the formula)

In Paysage, training and sampling such models rely on block Gibbs sampling, so the layer energy functions must lead to conditional probabilities **p(v|rest)** and **p(h_i|rest)** that are relatively efficient to compute.

Paysage provides three built-in layer types in the layers module: GaussianLayer, BernoulliLayer, and OneHotLayer. Building a model is as simple as stacking such layers together:

```python
from paysage.models.dbm import BoltzmannMachine
from paysage.layers import GaussianLayer, BernoulliLayer, OneHotLayer

# Instantiate a deep model with four layers:
dbm = BoltzmannMachine([BernoulliLayer(300), GaussianLayer(100),
                        BernoulliLayer(30), OneHotLayer(10)])
```

**How training works:**

Maximizing the log likelihood over the parameters can be attempted via stochastic gradient descent. The expectation of the gradient of the log likelihood is,

**∇_θ ⟨log p(v; θ)⟩_data = −⟨∂E/∂θ⟩_data + ⟨∂E/∂θ⟩_model,**

which involves evaluating expectations over the model distribution. The primary difficulty in training these kinds of EBMs is the ‘intractability’ of the probability distributions **p(v, h; θ)**. In particular, the denominator **Z(θ)** in the definition is often impossible to compute analytically. Fortunately, there are a number of numerical schemes for approximating expectations over distributions of this exponential type.

Paysage implements Markov-chain Monte Carlo via block Gibbs sampling to evaluate the gradient above (see [1] for the classic presentation of ‘Boltzmann learning’). Alternately, Paysage implements a sampling-free training scheme for RBMs arising from extended mean-field approximations to the free energy of the model (adapted from [2]).
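To see how the two phases of the gradient fit together, here is a NumPy sketch of a single persistent contrastive divergence (PCD) weight update for a Bernoulli RBM. The function name, hyperparameters, and omission of bias terms are simplifications for illustration, not Paysage's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_weight_update(v_data, v_chain, W, lr=1e-3, mc_steps=1, rng=None):
    """One PCD update of the weight matrix. The positive phase estimates
    <v h^T> under the data; the negative phase advances a persistent Markov
    chain by mc_steps of block Gibbs sampling and estimates <v h^T> under
    the model. Biases are omitted for brevity."""
    rng = rng or np.random.default_rng()
    # Positive phase: expectations with the visible units clamped to the data.
    h_data = sigmoid(v_data @ W)
    positive = v_data.T @ h_data / len(v_data)
    # Negative phase: advance the persistent chain with block Gibbs sampling.
    v = v_chain
    for _ in range(mc_steps):
        h = (rng.random((len(v), W.shape[1])) < sigmoid(v @ W)).astype(float)
        v = (rng.random((len(v), W.shape[0])) < sigmoid(h @ W.T)).astype(float)
    h_model = sigmoid(v @ W)
    negative = v.T @ h_model / len(v)
    # Ascend the log-likelihood gradient; return the new chain state too.
    return W + lr * (positive - negative), v

rng = np.random.default_rng(0)
v_data = rng.integers(0, 2, size=(100, 6)).astype(float)  # hypothetical minibatch
W = rng.normal(scale=0.01, size=(6, 4))
chain = v_data.copy()                                     # initialize the chain
W_new, chain = pcd_weight_update(v_data, chain, W, mc_steps=10, rng=rng)
```

The key difference from plain contrastive divergence is that the chain state is carried over between updates rather than restarted at the data, which gives the negative phase a better chance of tracking the model distribution.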

[1] Ackley, David H; Hinton Geoffrey E; Sejnowski, Terrence J (1985), “A learning algorithm for Boltzmann machines”, *Cognitive science*, Elsevier, **9** (1): 147–169

[2] Tramel, Eric W; Gabrié, Marylou; Manoel, Andre; Caltagirone, Francesco; Krzakala, Florent, “A Deterministic and Generalized Framework for Unsupervised Learning with Restricted Boltzmann Machines”

**An annotated example**

Let’s now look in detail at the code required to build and train a basic RBM on a real dataset. We can take the binary-valued MNIST example as a reference. A utility to download this dataset is contained in Paysage’s mnist example folder. The binarized data consists of 28x28 images of digits with pixel values in **{0, 1}**.

Since we are going to train with minibatches, we need to instantiate a data batcher. The ‘in_memory_batch’ object puts the whole dataset in memory; when using the GPU backend, this means all the data will be loaded into VRAM, so it is only suitable for relatively small datasets like MNIST. There are 60,000 samples in the training set. We will use a batch size of 100 in this example.

```python
import pandas

from paysage import backends as be
from paysage import batch
from paysage import preprocess as pre

# We have to convert the data to {0, 1}:
transform = pre.Transformation(pre.binarize_color)

# fraction in the train/validate split
train_fraction = 0.95
batch_size = 100

samples = be.float_tensor(
    pandas.read_hdf('path_to_mnist_data.h5', key='train/images').as_matrix())
data = batch.in_memory_batch(samples, batch_size, train_fraction,
                             transform=transform)
```

In Paysage a model is specified by a sequence of layer types. Each layer is a standalone object in the ‘layers’ module. The RBM model above is instantiated as a BoltzmannMachine object like so:

```python
from paysage.models.dbm import BoltzmannMachine
from paysage.layers import BernoulliLayer

num_hidden_units = 200

vis_layer = BernoulliLayer(data.ncols, center=False)
hid_layer = BernoulliLayer(num_hidden_units, center=True)

# fix the gauge by holding the hidden-layer parameters constant
hid_layer.set_fixed_params(hid_layer.get_param_names())

rbm = BoltzmannMachine([vis_layer, hid_layer])
```

The next step is to initialize the model.

```python
rbm.initialize(data, 'pca', epochs=20, verbose=False)
```

Here we select PCA initialization. This scheme initializes the weight columns to the PCA components of the data — iteratively so as to work on arbitrarily large datasets.
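The core idea can be sketched in a few lines of NumPy. Paysage's actual scheme is iterative so that it also works out of core; this simplified version just takes the leading principal components of one in-memory batch, and the function name and shapes are hypothetical:

```python
import numpy as np

def pca_weight_init(samples, num_hidden):
    """Initialize RBM weight columns from the leading principal components
    of a batch of data. A sketch only: an iterative scheme (as in Paysage)
    is needed for datasets too large to fit in memory."""
    centered = samples - samples.mean(axis=0)
    # Rows of vt are the orthonormal principal directions of the data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_hidden].T   # shape: (num_visible, num_hidden)

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 20))   # hypothetical data batch
W = pca_weight_init(samples, num_hidden=8)
assert W.shape == (20, 8)
```

Starting the weights along the directions of greatest data variance tends to give training a much better starting point than random initialization.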

Now we select the fitting method. We will train the model on MNIST with the most standard optimization algorithm, stochastic gradient descent.

```python
from paysage import fit

sgd = fit.SGD(rbm, data, fantasy_steps=10)
```

During training, a number of generative metrics are evaluated on the validation data after each epoch. We add an extra metric that is not part of the default set; the ‘PerformanceMonitor’ is a member of sgd:

```python
from paysage import metrics as M

sgd.monitor.generator_metrics.append(M.JensenShannonDivergence())
```

Next we select a learning rate schedule and an optimizer.

```python
from paysage import schedules
from paysage import optimizers

learning_rate = schedules.PowerLawDecay(initial=1e-3, coefficient=5)
opt = optimizers.ADAM(stepsize=learning_rate)
```
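The exact decay law inside Paysage's PowerLawDecay is internal to the library, but a common power-law schedule looks like the following sketch; the class name and formula here are assumptions for illustration only:

```python
class PowerLawDecaySketch:
    """A sketch of a power-law learning-rate schedule: the rate decays as
    initial / (1 + coefficient * t) with the step count t. The exact form
    used by Paysage's PowerLawDecay may differ."""

    def __init__(self, initial, coefficient):
        self.initial = initial
        self.coefficient = coefficient
        self.t = 0

    def get_lr(self):
        lr = self.initial / (1 + self.coefficient * self.t)
        self.t += 1
        return lr

sched = PowerLawDecaySketch(initial=1e-3, coefficient=5)
rates = [sched.get_lr() for _ in range(3)]
# the rate decays monotonically from the initial value
assert rates[0] == 1e-3 and rates[0] > rates[1] > rates[2]
```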

ADAM is the default optimizer, and PowerLawDecay is a decaying learning-rate schedule. Finally, we train with the most standard version of Boltzmann learning: persistent contrastive divergence (PCD).

```python
num_epochs = 200
mc_steps = 10
beta_std = 0.95

sgd.train(opt, num_epochs, method=fit.pcd, mcsteps=mc_steps, beta_std=beta_std)
```

Here we have selected PCD with 10 iterations of Markov-chain Monte Carlo per gradient update. The model is sampled with a *driven sampler*, an advanced sampling method which accelerates training.

Once training completes we’ll want to see the metrics progression and fantasy particles.

```python
import mnist_util as util  # plotting helpers from paysage's mnist example folder

util.show_metrics(rbm, sgd.monitor)

valid = data.get('validate')
util.show_reconstructions(rbm, valid, show_plot, n_recon=10, vertical=False)
util.show_fantasy_particles(rbm, valid, show_plot, n_fantasy=5)
util.show_weights(rbm, show_plot, n_weights=100)
```

mnist_util is an MNIST-specific utility which provides plotting functions to evaluate the performance of models. Here we show the model metrics, some reconstructions from the model, some random samples drawn from the model distribution, and then the columns of the learned weight matrix presented as images.

Here are some fantasy particles:

And some weights chosen at random plotted as images:

There you have it! Hopefully this little introductory tutorial will get you started experimenting with Paysage yourself.

There are more examples of models trained with Paysage in the recent ML survey, *A High-bias, Low-variance Introduction to Machine Learning for Physicists*.

We hope that this open-source repository will help the community engage with the power of RBMs. Contributions welcome!