Neural Turing Machines: an “artificial” working memory?

Benjamin Etienne
Aug 2, 2018


The main challenges to a perfect Neural Turing Machine model.

TL;DR: Neural Turing Machines are a new recurrent architecture inspired by human memory, and though they show promising results, they still suffer from drawbacks common to neural network architectures.


Neural networks have been known to the scientific community for a while, but it was only with the shift towards connectionism in the late 1980s that they came to be considered a promising approach to modeling human intelligence. However, they still suffer from many drawbacks, which lead to two main criticisms.

The first deals with the fixed size of neural networks. Since a neural network is trained for a specific task with a fixed number of parameters, it seems impossible for it to handle variable-length inputs. Time-dependent inputs such as sequences (speech and text, for instance) are perfect examples of variable-length inputs. One option is to “pad” sequences with non-informative values to “fill in” the gaps so that all inputs have the same length. In practice this is sometimes computationally inefficient, not to mention that it is definitely not how the brain deals with sequences. The second option is to use Recurrent Neural Networks (RNNs), which are designed to loop over sequence items without any length limitation.
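As a quick illustration of the padding workaround, here is a minimal NumPy sketch; the toy sequences, the pad value and the target length are made up for the example.

```python
import numpy as np

# Toy variable-length sequences (hypothetical data).
sequences = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]
max_len = 4      # fixed length chosen for the batch
pad_value = 0    # non-informative filler

# Right-pad every sequence so the whole batch becomes a fixed-size array.
padded = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq

print(padded)
# [[3 1 4 0]
#  [1 5 0 0]
#  [9 2 6 5]]
```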

The second drawback concerns the capacity of the network to store and retrieve pieces of information. Though they are known to handle long-term dependencies in sequences, RNNs are not good at storing specific information at specific locations and suffer from the “vanishing gradient” problem. LSTMs have become popular RNN alternatives, but can they really store pieces of information across hundreds or thousands of time steps?

In this article, I propose to explore the mechanisms of a Neural Turing Machine and give a very brief explanation of its link to human working memory. I will present the results of experiments I replicated from research papers and let you judge whether or not Neural Turing Machines can be considered working memory algorithms.

Background

Until the 2010s, sequence modeling was mostly the domain of probabilistic models such as Hidden Markov Models and Conditional Random Fields. These were, and still are, used in various domains such as speech recognition, sequence classification, part-of-speech tagging, and so on.

Though the ideas behind deep learning date back to the 1950s, advances in computer hardware, leading to more powerful machines and higher computation capacity, have triggered a surge of enthusiasm for these methods over the last decade.

In the field of sequence modeling, Recurrent Neural Networks have become standard in many applications ranging from speech recognition to sequence labeling and Natural Language Processing. The capacity of recurrent neural networks to “store” information over several time steps overcomes many limitations of HMMs, one of them being the Markov assumption: in a first-order HMM, the current state depends only on the previous state (and each observation only on the current state), thus forgetting all the earlier history of the sequence. Though RNNs are subject to limitations as well, such as the vanishing gradient problem, newer architectures like Long Short-Term Memory cells (LSTM), introduced by Hochreiter & Schmidhuber (1997), and Gated Recurrent Units (GRU), introduced by Cho et al. (2014), tend to alleviate these problems by adding gating mechanisms inside the cells. These gates allow the cell to control the flow of information, and more specifically the proportion to which it accepts new input, modifies its state, and outputs new information. These cells have demonstrated their capability to outperform vanilla RNNs in various tasks by storing a larger history of information through time.
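To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM step; the packing of the gate weights and the variable names are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b pack the input, forget, output and candidate transforms."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:hidden])                # input gate: how much new information to let in
    f = sigmoid(z[hidden:2 * hidden])      # forget gate: how much of the old state to keep
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much of the state to expose
    g = np.tanh(z[3 * hidden:])            # candidate cell state
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state / output
    return h, c

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
x_dim, hidden = 3, 4
W = rng.normal(size=(4 * hidden, x_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = lstm_step(rng.normal(size=x_dim), np.zeros(hidden), np.zeros(hidden), W, U, b)
```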

But RNNs, even when enhanced with LSTM or GRU cells, present one major drawback: their storage capacity. In order to add more capacity, one needs to add more units to the recurrent cell. This makes it difficult for RNNs to deal with increasing amounts of information, and it does not scale well since the computation time of a recurrent cell grows as O(n²) in the number of hidden units. Another issue with RNNs is the way they deal with their internal “memory”, i.e. their internal state: it is modified, heavily or slightly, at each computation step, which makes it hard to store a piece of information reliably.

Let us now step into human cognition. When we perform a task such as adding two numbers, we need to store intermediate information in a temporary buffer, or “working memory”. Once the task has been performed, this memory is flushed. We will not dig into all the processes involved in a computation or the different types of memory (long-term, episodic, etc.), but let’s keep in mind for the moment that working memory is a component which allows us to store information that will be retrieved later on. The idea of incorporating a form of external memory into a neural network is quite recent but has gained a lot of attention in the field of deep learning. By adding an external memory bank, researchers hope to increase the model’s storage capacity without having to increase the size of the model itself.

In 2014, Alex Graves and his colleagues at DeepMind (paper) drew a parallel between the working memory mechanism and a Turing Machine. A Turing Machine is a mathematical model imagined by Alan Turing in 1936 and considered a foundational concept of computer science. It consists of an infinite tape on which symbols, for instance the binary digits 0 and 1, are written. These symbols are read and modified by a read/write head. Turing is believed to have inspired John von Neumann, who designed the stored-program computer architecture, known today as the von Neumann architecture, on which modern computers are built.

Neural Turing Machines

A Neural Turing Machine is made up of two components: a controller and a memory bank. The controller handles the inputs and outputs of the cell, while the memory is accessed, i.e. “read from” and “written to”, by parametrised read and write heads.

The controller is the interface between the other layers of the network and the memory. It mediates access to the memory via “heads”, which are responsible for writing to or reading from the memory. Two types of controller were proposed by Graves in his original paper: recurrent and feedforward. Recurrent controllers are typically LSTMs, which already have a kind of memory storage mechanism, although it still has the limitations mentioned earlier. A feedforward controller can be as simple as a dense layer; it therefore lacks the temporal storage capacities of an RNN, but has the advantage of being faster. Basically, the steps of an NTM cell are the following (a minimal structural sketch in code is given after the list):

  1. Take an input vector X
  2. Pass it through the controller to produce an output vector Y
  3. From Y, compute the parameters of the read and write heads
  4. Let the heads perform their read and write operations based on the weight vectors they compute
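The structural sketch below shows how these four steps could fit together in code; every callable passed in (controller, heads, read, write) is a placeholder for the mechanisms detailed in the rest of the article, not an actual implementation.

```python
def ntm_step(x, memory, w_read_prev, w_write_prev, controller, heads, read, write):
    """One NTM time step, following the four steps listed above (structural sketch).

    `controller`, `heads`, `read` and `write` are placeholder callables standing in
    for the addressing, reading and writing mechanisms described below.
    """
    y = controller(x)                          # steps 1-2: input X -> controller output Y
    read_head, write_head = heads(y)           # step 3: head parameters derived from Y
    w_read = read_head(w_read_prev, memory)    # step 4: each head computes a weighting...
    w_write = write_head(w_write_prev, memory)
    r = read(memory, w_read)                   # ...then reads from the memory
    memory = write(memory, w_write)            # ...and writes to it (erase + add)
    return y, r, memory, w_read, w_write
```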

Read and write heads are responsible for outputting weight vectors which are used to interact with the memory. They take as inputs:

  • the controller output Y
  • the previous read/write weightings
  • the memory.

In order to be able to use backpropagation, these weightings cannot be discrete, i.e. the read and write weight vectors cannot attend to one single location. To overcome this issue, Graves’ idea was to use a distribution over memory locations. To put it simply, the heads attend to each location to various degrees: they read and write “everywhere”. Of course, reading and writing “everywhere” does not mean that every location is read or written to in equal proportion: the heads can attend sharply or weakly to the various locations. All weight vectors therefore obey the following constraints:

$\sum_i w_t(i) = 1, \qquad 0 \le w_t(i) \le 1$ for every location $i$ (source: original paper)

Addressing
In order to produce a read/write weight vector, a head takes as inputs the controller output, the previous weight vector and the memory. Two strategies can be used, and combined, to produce a weight vector: content-based addressing and location-based addressing.

Content addressing


The first step is content addressing. An M-length key k is compared to each row of the memory matrix using a similarity measure such as the cosine similarity; the resulting scores are scaled by a key strength scalar β and then normalized to produce an N-length weight vector w. The equations are written below:

Eq (5): $w^c_t(i) = \frac{\exp\big(\beta_t \, K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}{\sum_j \exp\big(\beta_t \, K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)}$. A similarity measure computes the proximity between the key k and each of the memory rows; the key strength β emphasizes the proximity scores. (source: original paper)
Eq (6): $K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$. The similarity measure used is the cosine similarity. (source: original paper)
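A minimal NumPy sketch of this content-addressing step, assuming a memory of shape (N, M); the function and variable names are illustrative.

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Eqs (5)-(6): cosine similarity between the key and each memory row,
    scaled by the key strength beta and normalized with a softmax."""
    eps = 1e-8  # avoids division by zero on all-zero rows
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * sim
    scores -= scores.max()       # numerical stability before the softmax
    w = np.exp(scores)
    return w / w.sum()           # N-length weighting that sums to 1

# Usage: 128 locations of size 20, as in the experiments below.
memory = np.random.randn(128, 20)
key = np.random.randn(20)
w_c = content_addressing(memory, key, beta=5.0)
```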

We have a weight vector, so we could stop right here and move to the read and write parts, or we could use a more sophisticated strategy to compute our weight vectors, called location addressing.

Location addressing

Location-based addressing is done in three steps (a combined code sketch of all three is given after the sharpening step).

  • Interpolation

The first step of location-based addressing is called the interpolation step. The aim of this step is to blend the previous weight vector with the weight vector obtained from the content addressing step. A scalar parameter between 0 and 1 controls this blend:

Equation (7): $w^g_t = g_t \, w^c_t + (1 - g_t) \, w_{t-1}$. Interpolation is a convex combination of the previous weight vector and the weighting obtained from content addressing, controlled by a gate $g_t \in [0, 1]$. (source: original paper)
  • Shifting

The next step of location-based addressing is the convolutional shift. Just as a Turing Machine can move its head left or right along the tape to access different locations, an NTM is capable of shifting its attention focus left or right, or leaving it unchanged.


For a given maximum shift k, the allowed shifts span the integers between -k and +k. The shift is applied as a circular convolution, i.e. indices are taken modulo N, since the shift vector s is shorter than the weighting vector. Bear in mind that the shift vector is normalized (for instance with a softmax) so that it defines a distribution over the allowed shifts.

Equation (8): $\tilde{w}_t(i) = \sum_{j=0}^{N-1} w^g_t(j) \, s_t(i - j)$, with indices taken modulo N. The shift vector moves the weighting left or right. (source: original paper)
  • Sharpening

After the head has moved (or not) its focus with the convolutional shift, we may find ourselves with a slightly blurred weighting, which is not what we want, given that the interest of an NTM lies in its ability to attend to specific locations. To sharpen the weighting, each entry is raised to a power γ greater than or equal to 1 and the result is renormalized so that it still sums to one:

Equation (9): $w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$. The sharpening operation encourages the heads to focus on specific locations rather than keeping a blurry, near-uniform weighting. (source: original paper)
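Putting the three steps together, here is a minimal NumPy sketch of location-based addressing on top of a content weighting; the variable names and the softmax over the shift vector are illustrative choices.

```python
import numpy as np

def location_addressing(w_content, w_prev, g, shift_logits, gamma):
    """Eqs (7)-(9): interpolation, circular convolutional shift, sharpening."""
    # Eq (7): blend the content weighting with the previous weighting.
    w_g = g * w_content + (1.0 - g) * w_prev

    # Eq (8): circular convolution; shift_logits covers shifts -k..+k.
    s = np.exp(shift_logits - shift_logits.max())
    s /= s.sum()                                  # normalized shift distribution
    k = len(s) // 2
    w_shifted = np.zeros_like(w_g)
    for offset, s_val in zip(range(-k, k + 1), s):
        w_shifted += s_val * np.roll(w_g, offset)

    # Eq (9): raise to the power gamma (>= 1) and renormalize.
    w_sharp = w_shifted ** gamma
    return w_sharp / w_sharp.sum()

# Usage with made-up values (128 memory locations).
w_content = np.random.rand(128); w_content /= w_content.sum()
w_prev = np.full(128, 1.0 / 128)
w = location_addressing(w_content, w_prev, g=0.9,
                        shift_logits=np.array([0.1, 2.0, 0.1]), gamma=3.0)
```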

Reading & writing
The read and write operations are both driven by a normalized weighting over the memory locations, similar to attention mechanisms. These weightings define a continuous distribution over the memory locations, which makes the whole operation differentiable. The read operation is a simple linear combination of the memory rows:

Eq (2): $\mathbf{r}_t = \sum_i w_t(i) \, \mathbf{M}_t(i)$. The read vector is a weighted (convex) combination of the memory rows. (source: original paper)

The write operation is a combination of erasing and adding. In addition to its weighting, a write head outputs two vectors: an erase vector e and an add vector a. Writing to the memory is then done by first erasing at the locations defined by the write weighting, then adding at the locations defined by the same weighting. Again, notice that erasing and writing over all locations in different proportions makes the whole operation differentiable:

Equation (3): $\tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i) \, \big[\mathbf{1} - w_t(i) \, \mathbf{e}_t\big]$. Parts of the memory are erased according to the weighting vector. (source: original paper)
Equation (4): $\mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i) \, \mathbf{a}_t$. New information is added at the locations defined by the weighting. (source: original paper)
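A minimal NumPy sketch of the read and erase/add write operations, assuming a memory of shape (N, M):

```python
import numpy as np

def read(memory, w):
    """Eq (2): the read vector is a weighted combination of the memory rows."""
    return w @ memory                                # shape (M,)

def write(memory, w, erase, add):
    """Eqs (3)-(4): erase then add, in proportions given by the weighting w."""
    memory = memory * (1.0 - np.outer(w, erase))     # erase step
    return memory + np.outer(w, add)                 # add step

# Usage with illustrative shapes (128 locations of size 20).
memory = np.random.randn(128, 20)
w = np.random.rand(128); w /= w.sum()
erase = np.random.rand(20)      # erase vector, entries in [0, 1]
add = np.random.randn(20)       # add vector
r = read(memory, w)
memory = write(memory, w, erase, add)
```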

Experiments

Unless specified otherwise, all experiments below were run under TensorFlow 1.8 with the following parameters (gathered into a small configuration sketch after the list):

  • Learning rate: 1e-4
  • Optimizer: RMSProp
  • Memory locations: 128
  • Size of memory rows: 20
  • Controller: LSTM with 100 units
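For reference, these settings could be gathered in a single configuration dictionary; the key names are illustrative, not taken from the actual experiment code.

```python
# Hypothetical configuration mirroring the settings listed above.
config = {
    "learning_rate": 1e-4,
    "optimizer": "RMSProp",
    "memory_locations": 128,   # N
    "memory_row_size": 20,     # M
    "controller": "LSTM",
    "controller_units": 100,
}
```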

Copy
In this experiment the network is presented with a random-length sequence of 8-bit vectors. During training, the length is drawn between 1 and 20. The goal is to teach the NTM to reproduce its input. However, there is one subtlety: the network cannot output anything before the whole input sequence has been presented to it. Otherwise the task would be trivial: the NTM could simply echo each vector as soon as it arrives, which is just the identity function…

The copy task. Notice how the network has to wait for the end token before being asked to output the sequence (Image by the author)
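Here is a minimal sketch of how such a training example could be built: random 8-bit vectors followed by an end-of-sequence marker on an extra channel. The exact encoding is an assumption for illustration, not the code used in the experiments.

```python
import numpy as np

def make_copy_example(max_len=20, width=8, rng=np.random.default_rng()):
    """Build one copy-task example: show the sequence, then a delimiter,
    then ask for the sequence back."""
    length = rng.integers(1, max_len + 1)
    seq = rng.integers(0, 2, size=(length, width)).astype(np.float32)

    # Input: sequence, then a delimiter flagged on an extra channel, then blanks.
    inputs = np.zeros((2 * length + 1, width + 1), dtype=np.float32)
    inputs[:length, :width] = seq
    inputs[length, width] = 1.0          # end-of-sequence marker

    # Target: blanks while the input is shown, then the sequence to reproduce.
    targets = np.zeros((2 * length + 1, width), dtype=np.float32)
    targets[length + 1:, :] = seq
    return inputs, targets

x, y = make_copy_example()
```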

The NTM therefore has to find a way to store the whole sequence before spitting it out. Let’s imagine a concrete example with a human:

  • A list of numbers between 1 and 64 is read aloud to the subject
  • The list can be between 1 and 20 items long
  • The subject cannot write down the numbers when the list is read
  • When the experimenter has finished dictating, the subject has to recite back the list in one shot

This is the copy experiment. To check whether the NTM had successfully learned a mechanism to store and retrieve sequential information, we tested it on sequences of lengths 30 and 50 and found that it generalized well. The NTM therefore learned to memorize sequences of more than 20 items, which is better than human working memory, which can usually hold around seven items (Miller, 1956).

We ran the experiment and obtained the following graphs. The animation on the left shows how the NTM learns (steps shown are taken every 500 computation steps).

Copy experiment: Inputs are on the top row, outputs on the bottom row (Image by the author)

Repeat copy
In this experiment, the network is presented with the same sequences as in the copy experiment, along with an additional number that tells it how many times it must output the input sequence.

The animation is slightly different from the one above: the figures in the top row show, from top to bottom, the input and the output replicated three times, while the bottom row shows, from top to bottom, the input sequence and the NTM output. Notice also that this time we observe jumps in the loss on the evaluation sequences (blue curves). This instability is a well-known issue with NTMs and makes them difficult to train.

Repeat copy experiment: Input are on the top row, outputs are on the bottom row. (Image by the author)

The results look quite promising at first glance, but the Neural Turing Machine nevertheless suffers from several issues:

  • 1) Like many deep learning architectures, the NTM offers a lot of choices for hyperparameter tuning: the memory size (number of locations and row size), the memory initialization, the controller type, the number of read and write heads, etc. These choices are usually made through experiments and are therefore rarely justified from a mathematical point of view.
  • 2) The NTM is limited by the size of its memory. If the sequences to be remembered are longer than the number of memory locations, the algorithm fails to produce the expected output. Experiments showed that an NTM trained on sequences of length up to 20, with only 20 memory locations, converges during training but does not generalize to sequences of length 30 and 50.
  • 3) The experiments we carried out also showed that the choice of the optimizer is very important in order to reach convergence. Adam seems to fail with the same learning rate as RMSProp…
  • 4) Finally, NTM memories have to be reset after each training case. This is perhaps not how working memory behaves. I am not an expert in human psychology, but it seems that working memory does not wipe out its contents when new information arrives. Instead, it can handle several pieces of information for a limited amount of time and still recall them.

Variants of the Neural Turing Machine

Wei Zhang (Zhang et al., 2015) argues in his paper that although convergence can be reached with small memory sizes, the model struggles to converge when the memory gets larger. He proposes slightly modified versions of the NTM, whose main innovation lies in the use of several memories instead of a single memory bank. Using several memory layers and controlling the memory dynamics should help the model avoid overfitting through overly large updates to the memory. A distinction is made between the controller memory, which is modified directly by the controller outputs through the write heads, and the hidden memory, which is not. The hidden memory is a “smoothed” copy of the controller memory, in the sense that it keeps a record of it through a convex combination, as shown below:

NTM1 Model from Zhang (2015). Mc is the memory interacting with the controller, Mh the “hidden” memory
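One possible reading of this smoothing, assuming a scalar mixing coefficient α; the symbol and the update rule are an illustration of “a convex combination”, not equations taken from Zhang’s paper.

```python
import numpy as np

alpha = 0.9                        # assumed mixing coefficient in [0, 1]
M_c = np.random.randn(128, 20)     # controller memory, written to by the heads
M_h = np.zeros((128, 20))          # hidden memory, never written to directly

# After each write to M_c, the hidden memory keeps a smoothed record of it:
M_h = alpha * M_h + (1.0 - alpha) * M_c
```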

Though Zhang finds that his structured-memory NTMs reach convergence faster than traditional NTMs, our experiments show no real evidence for this on the copy task. We will call Zhang’s model NTM1.

  • NTM vs NTM1 — location addressing & LSTM controller

Both models converge; however, the Zhang model is unstable on long sequences. Furthermore, we do not notice any significant speed-up when using NTM1.

Blue: NTM1, Orange: Standard NTM (Image by the author)
  • NTM vs NTM1 — content addressing & LSTM controller

Content addressing seems to slow down convergence compared to location addressing.

Blue: NTM, Pink: NTM1 (Image by the author)
  • NTM vs NTM1 — location addressing & FeedForward controller

The feedforward controller seems to cause instability for long sequences during evaluation. During training, NTM1 is slower than NTM.

Green: NTM, Grey: NTM1 (Image by the author)
  • NTM vs NTM1 vs LSTM

Neural Turing Machines seem to perform much better than LSTMs on the copy task. Convergence is faster and they generalize better to longer sequences.

Blue: NTM1, Orange: NTM, Red: LSTM (Image by the author)

Conclusion

We have reviewed the principles underlying the Neural Turing Machine, showing how it differs from Recurrent Neural Networks and why it could represent a promising direction for future research.

The NTM has shown that it can beat RNNs on simple tasks like copying a sequence one or more times, and can therefore mimic the human capacity to listen to a sequence and output it right after its completion. However, this task is still extremely simple, and the NTM only manages to learn it properly under certain circumstances. We think that the lack of justification for hyperparameter choices and the instability of NTMs, coupled with their memory limitations when dealing with long sequences, pour some cold water on these promising results.

Human capacities are still ahead of deep learning.

Code will be released soon on the RogerVoice Github account. Find out more about RogerVoice at www.rogervoice.com

Benjamin is a Data Scientist at RogerVoice. He works on Speech Recognition as well as Emotion Detection from Speech and Speaker Diarization. He’s a huge theater fan and loves Knopfler as well as Kundera.
