NTM-Lasagne: A Library for Neural Turing Machines in Lasagne

The first part of this post is about the theory behind Neural Turing Machines. The second is about the library itself, and benchmarks on tasks from Google DeepMind’s original paper.


Recurrent Neural Networks (RNNs) have become a major player in sequence modeling over the past few years. Due to their efficiency and flexibility, it is tempting to think that they will constitute the future of sequence learning. However, research in Deep Learning is moving at such a fast pace that no one knows what that future is going to be made of.

In fact, among the recent innovations in Deep Learning, the idea of adding some form of external memory to existing architectures has led to promising results, especially in Natural Language Processing (NLP). For the most part, these memory-augmented networks have been motivated by the need to reproduce the notion of working memory theorized by neurosciences, which is responsible for inductive reasoning and the creation of new concepts.

Here, we will focus on one of the earliest examples of memory-augmented neural networks: the Neural Turing Machines (Graves, Wayne & Danihelka, 2014). We will see in details how they work, as well as how they can be suitable for learning algorithmic tasks.

Along with this blog post, Snips is also open sourcing NTM-Lasagne on Github, our Neural Turing Machines library built with Theano, on top of the excellent Lasagne. It includes code, pre-trained models, as well as all the examples covered in this post.

Memory-augmented Neural Networks

A feature that all neural networks share, whether they are recurrent or feed-forward, is that they map data into an abstract representation. However, adding knowledge from the external environment, like a whole thesaurus in NLP, is very hard.

Rather than artificially increasing the size of the hidden state in the RNN, we would like to arbitrarily increase the amount of knowledge we add to the model while making minimal changes to the model itself. Basically, we can augment the model with an independent memory that acts as a knowledge base from which the network can read and write on demand. You can think of the neural network as the processing unit (CPU), and this new external memory as the RAM.

The choice of the structure for the memory is crucial to keep the read/write operations fast, no matter the size of the memory. Multiple designs have been proposed by the Deep Learning community. For example the Neural Turing Machines, as well as the Neural Random Access Machines (Kurach & Andrychowicz et al., 2015) and the Neural GPU (Kaiser et al., 2015), use a tape with read and write heads. In Grefenstette et al., 2015, the authors use continuous versions of structures like stacks or (double-ended) queues.

An interesting side-effect of these memory-augmented networks is the ability to keep track of intermediate computations. For instance, in the case of Question-Answering (QA) problems in NLP, it can be valuable to memorize a story as the model reads it to eventually answer a question. The Memory Networks (Sukhbaatar et al., 2015) and the Dynamic Memory Networks (Kumar et al., 2015), for example, take advantage of the memory to perform well on QA tasks.

The idea of adding extra memory to RNNs is not new. In fact, the Long Short Term Memory networks (LSTMs, Hochreiter & Schmidhuber, 1997) already have a basic memory cell on which information is stored at every time-step. For an introduction to LSTMs, I recommend reading this great post by Chris Olah that details step-by-step how they work, including the role of this memory cell. To make it short, the main function of this cell is to simplify the learning of RNNs, and maintain long term dependencies between elements in the input sequence.

So, what is the point of adding a memory with such low-level addressing mechanisms, compared to LSTMs? The answer is structure.

LSTMs typically have a distributed representation of information in memory, and perform global changes across the whole memory cell at each step. This is not an issue in itself, and experience has shown that they sure can memorize some structure out of the data.

Unlike LSTMs, memory-augmented networks encourage (but don’t necessarily ensure) local changes in memory. This happens to help not only to find the structure in the training data, but also to generalize to sequences that are beyond the generalization power of LSTMs, such as longer sequences in algorithmic tasks (we will see some examples below).

You can picture the value of memory-augmented networks over LSTMs through the idea of the cocktail party effect: imagine that you are at a party, trying to figure out what is the name of the host while listening to all the guests at the same time. Some may know his first name, some may know his last name; it could even be to the point where guests know only parts of his first/last name. In the end, just like with a LSTM, you could retrieve this information by coupling the signals from all the different guests. But you can imagine that it would be a lot easier if a single guest knew the full name of the host to begin with.

Neural Turing Machines

In their respective domains, Computability Theory and Machine Learning have helped pushing forward the abilities of computers. The former defined what a computer is capable of computing, whereas the latter allowed computers to perform tasks that are easy for humans but were long thought to be impossible for computers, like computer vision for example.

Alan Turing himself studied the strong connections between computing and prospects of Artificial Intelligence in the 1940s. His Turing Machine is a classical computational model that operates on an infinite memory tape through a head that either reads or writes symbols. Although this abstract computer has a very low level interface, it is powerful enough to simulate any algorithm. The Neural Turing Machine (NTM) takes inspiration from these two fields to make the best of both worlds.

A Neural Turing Machine unfolded in time. The controller (green), a feed-forward neural network, depends on both the input x and the read vector r. It emits read (blue) and write (red) heads to interact with the memory M.

It is worth noting that there is an interesting connection between Turing Machines and RNNs. The latter are known to be Turing-complete (Siegelmann, 1995). It means that just like with Turing Machines, any algorithm can be encoded by a RNN with carefully hand-picked parameters (such parameters may not be learnable from data, though).

The NTM can be seen as a differentiable version of the Turing Machine. Similarly to a Turing Machine, it has two main components: a (bounded) memory tape, and a controller that is responsible for making the interface between the external world (ie. the input sequence and the output representation) and the memory through read and write heads. This architecture is said to be differentiable, in the sense that both the controller and the addressing mechanisms (the heads) are differentiable. The parameters of the model can then be learned using Stochastic Gradient Descent (SGD). Let’s describe these components in more details.

Controller

The controller is a neural network that provides the internal representation of the input that is used by the read and write heads to interact with the memory. Note that this inner representation is not identical to the one that is eventually stored in memory, though the latter is a function of this representation.

The type of the controller represents the most significant architectural choice for a Neural Turing Machine. This controller can be either a feed-forward, or recurrent neural network. A feed-forward controller has the advantage over a recurrent controller to be faster, and offers more transparency. This comes at the cost of a lower expressive power though, as it limits the type of computations the NTM can perform per time-step.

Another example of Neural Turing Machine, where the controller is an LSTM.

Read / Write mechanisms

The read and write heads make the Neural Turing Machines particularly interesting. They are the only components to ever interact directly with the memory. Internally, the behavior of each head is controlled by its own weight vector that gets updated at every time-step. Each weight in this vector corresponds to the degree of interaction with each location in memory (the weight vector sums to 1). A weight of 1 focuses all the attention of the NTM only on the corresponding memory location. A weight of 0 discards that memory location.

Parameters for the weight updates. These parameters are specific to each head (5 parameters per head). Each box corresponds to a 1-layer neural network with the corresponding activation function.

Moreover, we would like these weights to follow two requirements: they should support local changes (read or write) on the memory while keeping their updates differentiable, as we want to train the NTM end to end. To this end, the weight vector gets updated through a series of intermediate smooth operations.

There is a total of four operations per update: content addressing, interpolation, convolutional shift and sharpening. They all depend on parameters produced by the controller. More precisely, these parameters are functions of the hidden state emitted by the controller.

Details of the four operations in the update of one of the heads’ weight vector. At a higher level, this example can be interpreted as “GOTO pointer, SHIFT LEFT”.

In the content-addressing operation, the NTM focuses its attention on the memory locations that are “close” to some key. This allows the model to retrieve specific informations in memory. Roughly speaking, a bit like pointers in C. This content-based weighting is then gated with the weight vector from the previous time-step in the interpolation operation. If the gate is zero, then the content-addressing is skipped. On the contrary, if the gate is one, then the previous weights are ignored.

The convolutional shift then smoothly shifts the weights left or right. This is very close to the head shifting in a classical Turing Machine. Finally, the shifted vector is sharpened to get a weight vector as focalized as possible. Keep in mind, though, that none of these operations actually ensures local changes on the memory, despite the strong emphasis to produce sparse weights. As a consequence, you could end up with a “blurry” weight vector to weakly queries the whole memory.

Once the head has updated its weight vector, it is ready to operate on the memory. If it is a read head, it outputs a weighted combination of the memory locations: the read vector. The latter is then fed back to the controller at the following time-step. If it is a write head, the content of the memory is modified softly (depending on its weights) with an erase and an add vector, both produced by the controller.

NTM-Lasagne benchmarks and examples

Inspired by the paper from Google DeepMind, we benchmarked our NTM library on algorithmic tasks. In each experiment, the goal is to learn an algorithm only by looking at a series of examples of inputs and outputs. More specifically, we only feed the Neural Turing Machine with random inputs, with the corresponding expected outputs from the algorithm we intend to learn. No prior knowledge on the nature of the algorithmic task is given to the NTM. In a way this is a convenient problem, as we can sample as much data as we want, contrary to working with limited and noisy real world data.

Example of the training procedure of the NTM. The model is given random input sequences and updates its parameters at every step. The output image represents the output from the model at that stage of training, with the corresponding error with regards to the expected output. The state of the memory as the model reads the input and write the output is displayed on the right, with the positions of the corresponding heads. Inspired by A. Graves’ animation at NIPS

We are here especially interested in the capacity of the NTM to understand the concept of the algorithms, meaning that the NTM should be able to generalize to sequences of any length, including sequences longer than what it saw during training. This is what we will call the generalization performance.

The code for all the following experiments is made available with the library, along with pre-trained models, so that you can test them by yourself. All the hyper-parameters and architectural details are also available, and we will briefly remind them here. You can skip the following section if you want the avoid the technical details.

In all these examples, we use a single-layer Neural Turing Machine with a 100-units feed-forward controller. The memory has 128 locations controlled by 1 read head and 1 write head, just like in the original paper. One key thing we found particularly helpful is the use of rectify activations (ReLU). The latter showed better generalization performance over the typical tanh activation, and helped getting sparse representations of the sequences.

Example of the Neural Turing Machine on a test sequence, with the state of the memory.

Copy

The Copy task is the most basic algorithmic task, in which we test the capacity of the NTM to store and access information from memory. The goal is to reproduce the input sequence as an output. Of course, this task only tests a small fraction of all the operations available in the addressing mechanism.

It is important to point out that we are not trying to learn an autoencoder, where the goal would be to reconstruct the whole input at once. Instead, the NTM reads the input sequence one item at a time, and writes its output one item at a time as well. Think about it as reading a book (word by word) and then having to copy it back, just from memory. This is already a lot more challenging.

What is even more challenging is that we are not only interested in the capacity of the NTM to copy sequences of length similar to what it was trained on, but to effectively learn the copy algorithm and generalize to longer sequences. To continue the parallel with books, let say that you are learning to copy pocket books but eventually you are asked to copy the Odyssey.

In this example, we trained a Neural Turing Machine on random sequences of length up to 5. It showed perfect generalization (ie. an error below 1 bit per sequence) on test sequences of length up to 120. In fact, this is not surprising since the NTM actually hits the limit of the memory, thus is not able to generalize any further.

This dashboard shows how the Neural Turing Machine learns an algorithmic task (in this case, the copy task). The model is given the input sequence on the right and outputs the sequence given on the bottom right hand side. The two figures on the left show the status of the read & write heads, unfolded in time (each column represent the weight vector at a given time-step). The red line marks the limit of the data in the input sequence. This test sequence has a length of 20 (more than 4 times longer than the training sequences).

Repeat Copy

To test increasingly complex algorithms, it is interesting to check if the NTM is able to learn simple algorithmic primitives, like a for loop. In the Repeat Copy task, the NTM receives an input sequence, along with a representation of the desired number of copies. The goal is to reproduce the data described in the input sequence a specific number of times.

Contrary to Google DeepMind, we used a unary representation of this number of copies (ie. a repeated binary flag) instead of a scalar in [0, 1]. The scalar representation, while suitable for training, strongly relies on normalization and does not leave much room for generalization. As a matter of fact, a similar effect was reported in their paper.

Example of the Repeat Copy, on an input sequence of length 5 with 5 repeats. The red line marks the end of the data in the input sequence. The NTM was trained on sequences of length up to 5 with a number of repeats up to 5.

Note that the NTM reads the raw input sequence and does not get any hint on where the data and the number of copies are. The model has to figure out by itself where the relevant information is during training.

Associative Recall

The Associative Recall task is a more advanced example that learns a basic query/answer system. Taking advantage of the content addressing, this task allows us to test if the NTM is able to learn C-like pointers. In this simple case, the NTM is given items consisting of multiple binary vector, separated by delimiters, along with a query item at the end of the input sequence. The objective is to retrieve the item (answer) that immediately follows the query item in the input sequence.

Once again the NTM gets no hints about the nature of the data. In particular, it has to understand that the items are delimited by a special binary flag, and distinguish it both from the actual data and the query delimiters.

Example of the Associative Recall task, with 5 items of length 5. The red line marks the end of the data and the green line the end of the query. The NTM was trained on sequences of at most 6 items of length up to 3.

Dyck words

In addition to all the regression examples presented above, we are also interested in testing Neural Turing Machines on classification tasks. Taking inspiration from formal languages, we could test the recognizability of some language only from examples. A good compromise in complexity is to test the NTM on the language consisting of strings of balanced parentheses (also called Dyck words).

Example of a Dyck word

Dyck words have been extensively studied and are closely related to many objects in combinatorics. In the context of recurrent networks, they are also very interesting: well formatted parentheses typically involve long-term dependencies, known to be the Achilles heel of RNNs.

A very simple algorithm can be used to check whether a word of parentheses is a Dyck word though. This algorithm involves a stack and is described below.

Experience proves that NTMs are actually able to encode such a procedure! However, contrary to the previous examples, the NTM uses its heads not necessarily to store information in memory, but as a way to keep track of the status of the stack. This shows that the NTM is not only able to use its memory as a linear storage, but also to perform complex computations with long-term dependencies, regardless of its content.

Example of the Dyck Word task on some random Dyck word of length 100. The model uses the read head to keep track of the status of the stack. Every time the NTM sees an opening parenthesis, the read head shifts to the right. Every time it sees a closing parenthesis, the read head shifts to the left. The NTM was trained on sequences of length at most 40.

Conclusion

As we explained in this post, Neural Turing Machines show promising properties in terms of generalization compared to LSTMs. The way information is stored and modified locally, in an external memory, makes it an interesting solution to handle problems that require reasoning.

We hope that the library we are releasing with this post will help the Deep Learning community to make progress in the exploration of this fascinating topic. Up to now, this is the first open-source library that is able to reproduce the Copy, Repeat Copy and Associative Recall tasks from Google DeepMind’s original paper with this level of quality.

A lot of progress has been made towards artificial perception with Deep Learning, and Convolutional Neural Networks in particular. The NTMs open new possibilities to go a step further and understand how concepts are created. After all, algorithms are concepts for computers. This is a new take on Artificial Intelligence: trying to teach machines things they can do, the same way we would learn them. Let’s see where this gets us!


You can download the NTM-Lasagne library on Github.

If you want to hear more about Neural Turing Machines, Machine Learning or Artificial Intelligence in general, follow us on Twitter at @tristandeleu and @snips.

By the way, if you are interested in AI, Snips is hiring!