Hierarchical Multiscale Recurrent Neural Networks

James Vanneman
Published in Paper Club
Jul 27, 2017

This week we chose to review a relatively recent paper on RNNs, Hierarchical Multiscale Recurrent Neural Networks: https://arxiv.org/abs/1609.01704.

Background summary:

Previous work has shown models self-organizing into a hierarchical structure, as well as RNNs whose layers are stacked in reverse order of update frequency. Stacking the layers in reverse order of update frequency lets the model build up higher and higher levels of abstraction: word -> sentence -> paragraph -> chapter. These are known as multiscale networks because the layers represent different levels of abstraction. Self-organization removes the need to manually specify each layer's role and allows the network to learn the relevant roles by itself. Other models have been built such that the layer hierarchy is hardcoded, meaning that layers update at a fixed interval that is a function of the layer number. This has the clear disadvantage that the network doesn't learn a hierarchy, so the hardcoded one is unlikely to be optimal. Despite the recent successes of RNNs, previous work is lacking because RNNs have not yet been shown to self-organize and identify hierarchical structure when the data is unbounded.

Specific Questions:

  1. CNNs are really good at discovering hierarchical information about images. For example, the first layers recognize basic properties like edges while the later layers see complex structures like an entire face. This type of hierarchy is present in temporal data as well (audio, video, text), but temporal data is typically unbounded: sentences differ in length, the same sentence can be said slower or faster, etc. Can RNNs learn the hierarchical multiscale structure of unbounded temporal data?
  2. Does the structure learned for temporal data improve training accuracy?
  3. Is the straight-through estimator an efficient way of training a model?

Methods:

In order to understand the novel architecture introduced in this paper, let’s do a quick review of the LSTM architecture.

LSTM Cell. i, f, and o are the input, forget, and output gates respectively

At every time step, the LSTM does the following (a rough sketch in code follows this list):

  • Combines the previous hidden state with the current input.
  • Updates the cell state after passing the previous cell state through a "forget gate".
  • Produces a new output (the hidden state) from the updated cell state.
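To make the gate math concrete, here is a minimal NumPy sketch of a single LSTM step. The names (lstm_step, W, U, b) are my own for illustration, not anything from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, and b hold the stacked parameters for the
    input (i), forget (f), output (o), and cell-proposal (g) gates."""
    z = W @ x + U @ h_prev + b       # combine the current input with the previous hidden state
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                   # cell proposal vector (candidate new content)
    c = f * c_prev + i * g           # forget part of the old cell state, write the new content
    h = o * np.tanh(c)               # new output / hidden state
    return h, c
```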

An important point here is that these updates occur at every step. The act of "forgetting" with the forget gate is nuanced: the model forgets certain parts of the previous cell state, but not necessarily all of it.

An RNN can also be viewed in its unrolled form:

Regular unrolled RNN

The architecture in this paper introduces an HM-LSTM (hierarchical multiscale LSTM) that looks like this in unrolled form:

Unrolled HM-LSTM. The solid lines represent updates that occur. Notice that updates within a layer happen while that layer is detecting boundaries, and updates between layers happen after a boundary has been detected.

The key distinction here is that the dotted lines represent updates that don't happen. The model has an additional parameter that it passes between layers called a binary boundary detector (we'll call this Z). This simply passes a value of 0 or 1 to the layer above. Let's talk about these updates in the context of the second layer in the picture above (not drawn): the first layer looks for words, and the second layer detects phrases. This is the HM part of the paper's title: hierarchical because of the stacked layers, and multiscale because phrases take a bigger-picture view than words.

  • Update (current layer's Z was previously 0 and the Z of the layer below is now 1): We haven't found a phrase yet, but the layer below us just found a word. Update our current cell state, building on the previous words we've found.
  • Copy (current layer's Z was previously 0 and the layer below's Z is also 0): The layer below hasn't found a word yet, so there's nothing for the current layer to do; keep the same cell state we had before.
  • Flush (current layer's Z was previously 1): We found a boundary (phrase) in the last time step. Pass the phrase to the layer above us, possibly a layer that's determining paragraphs, and reset our current state to empty. There's no point in keeping the last phrase around since it's not relevant for detecting the next phrase.

The Flush operation here is very different from a typical RNN because it completely erases all information obtained in this cell.
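As a rough sketch of how these three cases select the update (my own simplification, not the paper's code; it reuses the hypothetical lstm_step from the LSTM sketch above and ignores the top-down connection from the layer above):

```python
import numpy as np  # lstm_step is the illustrative function from the LSTM sketch earlier

def hm_lstm_layer_step(x_below, h_prev, c_prev, z_prev_own, z_below, params):
    """Pick between UPDATE, COPY, and FLUSH for one HM-LSTM layer.
    z_prev_own: this layer's boundary detector at the previous time step (0 or 1).
    z_below:    the boundary detector of the layer below at the current time step."""
    if z_prev_own == 1:
        # FLUSH: we emitted a boundary last step, so start again from an empty cell state
        return lstm_step(x_below, h_prev, np.zeros_like(c_prev), *params)
    if z_below == 1:
        # UPDATE: the layer below just closed a segment; fold it into our state
        return lstm_step(x_below, h_prev, c_prev, *params)
    # COPY: nothing new from below; carry the previous state forward unchanged
    return h_prev, c_prev
```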

More formally, these operations are defined by the following formulas.

Mathematical definitions for updates. f, c, i, and g are the forget gate, cell state, input gate, and cell proposal vector respectively
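If I reconstruct the cell-state rule from the paper (superscripts index the layer, subscripts the time step, ⊙ is element-wise multiplication), it is roughly the following; treat this as my reconstruction rather than a quote of the paper:

```latex
c_t^{\ell} =
\begin{cases}
f_t^{\ell} \odot c_{t-1}^{\ell} + i_t^{\ell} \odot g_t^{\ell} & \text{UPDATE } (z_{t-1}^{\ell} = 0,\; z_t^{\ell-1} = 1) \\
c_{t-1}^{\ell} & \text{COPY } (z_{t-1}^{\ell} = 0,\; z_t^{\ell-1} = 0) \\
i_t^{\ell} \odot g_t^{\ell} & \text{FLUSH } (z_{t-1}^{\ell} = 1)
\end{cases}
\qquad
h_t^{\ell} =
\begin{cases}
h_{t-1}^{\ell} & \text{COPY} \\
o_t^{\ell} \odot \tanh(c_t^{\ell}) & \text{otherwise}
\end{cases}
```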

Based on the math, it looks like the hidden state on a FLUSH operation would maintain some information about the cell state. Shouldn’t a RESET completely remove this state?

The Cell Proposal Vector is not a part of a standard LSTM and is not explained in the context of this model. How does this fit into the model?

An interesting thing to note is that the binary boundary detector is either 0 or 1 and is therefore non-differentiable. In order to compute the gradient during backpropagation, the authors use a method called the straight-through estimator (Hinton, 2012), which keeps the hard step function in the forward pass but substitutes a differentiable surrogate when computing gradients.
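A minimal sketch of the trick, assuming PyTorch for autograd (not the paper's code): keep the hard 0/1 value in the forward pass, but make the backward pass treat the thresholding as the identity.

```python
import torch

def straight_through_binarize(z_tilde):
    """Hard-threshold in the forward pass, but let gradients flow
    through as if the thresholding were the identity function."""
    z_hard = (z_tilde > 0.5).float()
    return z_tilde + (z_hard - z_tilde).detach()   # value equals z_hard, gradient follows z_tilde

logits = torch.randn(4, requires_grad=True)
z_tilde = torch.sigmoid(logits)            # soft boundary "probability"
z = straight_through_binarize(z_tilde)     # hard values in {0, 1}
z.sum().backward()                         # gradients still reach `logits`
print(logits.grad)
```

If I remember right, the paper actually uses a hard sigmoid (with a slope annealing trick) as the surrogate rather than the plain identity, but the mechanics are the same.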

I don't fully understand the implications of this; it seems like it's kind of cheating. I probably need to read the Hinton paper to see how this is used.

Results:

Performance of the model was measured on writing samples because a clear hierarchy exists: words, phrases, etc. On the Text8 dataset (text from Wikipedia that contains only letters and spaces), the model achieves state-of-the-art performance:

On the Hutter Prize Wikipedia dataset, the model ties the previous neural-network state-of-the-art result, and on Penn Treebank it comes close to but does not beat the current state of the art (1.24 BPC vs. 1.23 BPC).
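As an aside (my gloss, not the paper's): bits per character is just the average negative log2-probability the model assigns to each ground-truth character, i.e. the cross-entropy measured in bits, so lower is better.

```python
import numpy as np

def bits_per_character(char_probs):
    """Average number of bits the model needs per character:
    the mean negative log2-probability assigned to the true characters."""
    return float(np.mean(-np.log2(char_probs)))

# e.g. probabilities the model assigned to three ground-truth characters
print(bits_per_character(np.array([0.5, 0.4, 0.38])))  # ~1.24 bits per character
```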

Visualizing the results, we can see that the first layer typically fires on word boundaries.

On the Penn Treebank data, we can see the first layer capturing prefixes ("tele") and the second layer capturing larger segments.

The author claims that these results are separated on reasonable semantic boundaries. This is interesting and certainly fairly clear for the words. The second layer's boundaries are less clear but also seem to be picking up reasonable segments: "consumers may" — "want to move their telephones a" — "little closer to the tv set <unk>".

One advantage the author claims over a typical RNN is that the structure is more interpretable. For text, I definitely agree that the binary boundary detector makes it easier to inspect the network; forgetting a portion of the previous state is much harder to visualize.

The author also makes several claims about performance. Intuitively, since these updates don't have to occur at every step, a performance increase makes sense. However, if you're going to make claims about performance, you need to provide benchmarks (training speed, loss decrease per epoch, etc.). Without these metrics it's hard for me to believe the author's claim.

The model was also tested on handwriting sequence generation and performed better than a standard LSTM.

The comparison is against a standard LSTM only and does not include current state-of-the-art models, so it's hard to interpret. All I can do is assume it is worse than the current state of the art, but I don't know by how much. This result on its own is not impressive.

Words and concepts I had to look up:

  • Multiscale RNN — RNN that operates on abstractions that build on one another. Viewing this from a CNN perspective, the multiscale part would be layer 1 "seeing" edges and the following layers "seeing" more sophisticated structures.
  • temporal coherency
  • vanishing gradient problem — As networks get very deep, the gradients tend towards zero in the earlier layers, and backpropagation loses its effectiveness in helping those layers learn. LSTM is an RNN variant that helps solve this problem.
  • update rates / update times ~ learning rate??
  • binary boundary detector — Additional parameter to an RNN that tells the layer above whether or not the current layer found a relevant sequence.
  • straight-through estimator — Use a replacement function that is similar to the step function but differentiable, so that it's still possible to train.
  • hierarchical RNN — RNN where layers are updated with different weights
  • recurrent generalization of stochastic depth
  • layer-wise capacity control
  • parameterized boundary detector
  • bits-per-character evaluation metric?
  • How much faster was the network to train(claims about computation time)?
  • cell state
  • Cell Proposal vector
