Python/MXNet Tutorial #1: Restricted Boltzmann Machines using NDArray.
This post series introduces the main concepts of the deep learning library MXNet in Python. We will focus on implementing simple machine learning models and will give a brief overview of each model where required.
MXNet is an open-source machine learning library with key contributors from a variety of organisations such as Amazon, Microsoft, Stanford University, Carnegie Mellon University and the University of Washington. It is especially well suited for the development of deep neural networks and comes with support for state-of-the-art models such as convolutional neural networks (CNNs), and long short-term memory networks (LSTMs).
There are three major factors engineers and scientists weigh when selecting a deep learning framework, and MXNet shines at all of them:
- Scalability. MXNet supports not only CPU but also GPU computing and is able to scale across multiple GPUs on multiple hosts. Even more impressive is the near-linear speedup obtained from multiple GPUs. You can find a benchmark published by Amazon CTO Werner Vogels.
- Portability. MXNet runs in a wide variety of environments, from your Windows desktop to Amazon Linux in the cloud, or even a smartphone. On top of that, you can run trained models inside a browser.
- Flexibility. From a programming perspective, MXNet offers a mixture of both imperative and declarative paradigms. On the one hand, you can do tensor computations similar to NumPy; on the other, you can use symbolic expressions to declare computation graphs, similar to Theano and TensorFlow. In the context of deep learning, declarative programming is useful for specifying the structure of a neural network, while imperative programming is used for parameter updates and debugging.
Imperative programming: NDArray
In the first part of this tutorial, we discuss the imperative programming scenario. The basic building block for imperative programming in MXNet is NDArray. This object represents a tensor (a multi-dimensional array) and is indeed similar to NumPy’s ndarray. The main difference is how NDArray is handled by MXNet’s engine: all operations on it are asynchronous and computed lazily, when needed. They represent future computations, which the engine gathers to construct an internal dataflow graph. This technique enables MXNet to perform a number of different optimizations as well as to easily blend imperative code with declarative symbolic expressions.
You can read a much more detailed and interesting overview of MXNet implementation in this paper.
Restricted Boltzmann Machines overview
If you are already familiar with Restricted Boltzmann Machines (RBMs), save yourself some time and skip this section. If not, let’s first briefly remind ourselves of the theory behind them. If you are interested in a more detailed theoretical introduction to RBMs, you may want to check this tutorial.
Popularized by Geoff Hinton, a Restricted Boltzmann Machine is a two-layer stochastic neural network (stochastic meaning the neurons’ activations have a probabilistic element), which can be used, for instance, for dimensionality reduction, classification, or feature learning. The first layer of the RBM is called the visible, or input, layer, and the second is the hidden layer. The RBM forms a complete bipartite undirected graph: every visible node is connected to every hidden node by a weighted undirected edge, but no two nodes within the same layer are linked. The standard type of RBM has binary-valued nodes and also bias weights.
The way it works is, the RBM tries to learn a binary code, or representation, of the input. Specifically, it learns to store each input you show it as a pattern of activation of its hidden units, so that when you set the hidden units to exactly that pattern, the RBM can reproduce the input. In this way, an RBM can be seen as similar to an autoencoder, except that the weights are the same in both directions.
We already know that the neurons in an RBM are stochastic. The probability that a neuron is on (one versus zero) is given by the logistic sigmoid function of the input it receives plus its bias. Here is the mathematical notation for computing the values of the hidden and visible layers:
Training an RBM: Contrastive Divergence
To train an RBM, that is, to estimate its weights, we will use the Contrastive Divergence (CD) algorithm. CD is based around the concept of “fantasies”. A “fantasy” is the input you obtain if you sample the hidden units given the visible ones, then sample the visible units given the hidden, and repeat this n times (this is called Gibbs sampling). The CD algorithm updates the weights in such a way as to make true observations (i.e. training samples) more likely and their “fantasies” less likely. Here is the summarized procedure for single-step Contrastive Divergence (CD-1):
- Compute the hidden units using a training sample as the input.
- Compute the outer product of the input and hidden units, called “positive phase”.
- Make a Gibbs Sampling Step (i.e. resample visible units from hidden, then again resample hidden units from the newly obtained visible).
- Compute the outer product of the “fantasy” visible and hidden units, called the “negative phase”.
- Update the weights by the difference between the “positive phase” and the “negative phase”, times a learning rate.
- Update the visible and hidden biases similarly, using the difference between the data and “fantasy” activations.
At this point, we should have enough domain knowledge to proceed to the practical part of implementing Restricted Boltzmann Machines in MXNet.