Notes on Hierarchical Multiscale Recurrent Neural Networks

Introduces a novel update mechanism to learn latent hierarchical representations from data.

Introduction

State-of-the-art on PTB, Text8 and IAM On-Line Handwriting DB. Tied for SotA on Hutter Wikipedia.

Lots of prior work on hierarchy (hierarchical RNN / stacked RNN) and on multiple timescales (LSTM, clockwork RNN), but it all relies on pre-defined boundaries, pre-defined scales, or soft non-hierarchical boundaries.

Two benefits of discrete hierarchical representations:

  • Mitigates the vanishing gradient problem, since information is held at higher levels for more steps and gradients pass through fewer updates.
  • More computationally efficient in the discrete case, since higher layers update less frequently.

Model

Uses a parameterized binary boundary detector at each layer. Avoids “soft” gating, which leads to the “curse of updating every timestep”.

Boundary detectors determine operations for modifying RNN state: COPY, FLUSH, UPDATE:

  • UPDATE: similar to LSTM but sparse, according to boundary detector.
  • COPY: copies cell and hidden states from the previous timestep to the current timestep. Similar to Zoneout (a recurrent generalization of stochastic depth), which uses a Bernoulli distribution to copy hidden state across timesteps.
  • FLUSH: sends summary to next layer and re-initializes current layer’s state.
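A minimal sketch of how the three operations could be selected and applied for a single layer. It assumes the selection depends on this layer's boundary at the previous timestep and the lower layer's boundary at the current timestep, and that the gates f, i, o and the candidate g have already been computed. The function names and the use of plain Python lists are my own choices for illustration, and passing the summary up to the next layer is not modelled here.

```python
import math

def select_op(z_same_prev, z_below_now):
    """Pick the operation for one layer at one timestep.

    z_same_prev: this layer's boundary at the previous timestep (0 or 1).
    z_below_now: the lower layer's boundary at the current timestep (0 or 1).
    Assumption: a boundary in this layer takes priority over everything else.
    """
    if z_same_prev == 1:
        return "FLUSH"
    if z_below_now == 1:
        return "UPDATE"
    return "COPY"

def apply_op(op, c_prev, h_prev, f, i, o, g):
    """Apply the chosen operation elementwise; all arguments are lists of floats."""
    if op == "COPY":
        # Nothing changes: carry cell and hidden state forward untouched.
        return list(c_prev), list(h_prev)
    if op == "UPDATE":
        # Usual LSTM cell update.
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    else:
        # FLUSH: drop the old cell state and restart from the candidate.
        c = [ik * gk for ik, gk in zip(i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return c, h
```

For example, `select_op(0, 0)` gives COPY, so a layer whose lower neighbour has not hit a boundary does no work at that timestep, which is where the efficiency gain comes from.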

Discrete (binary) decisions are difficult to optimize because the step function has zero gradient almost everywhere. Uses the straight-through estimator (as an alternative to REINFORCE) to learn the discrete variables. The simplest variant uses a step function on the forward pass and a hard sigmoid on the backward pass for gradient estimation.
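A rough scalar sketch of the straight-through idea, with the backward pass written by hand rather than through an autodiff framework. The hard sigmoid form max(0, min(1, (a·x + 1) / 2)) with slope a is an assumption on my part, and the function names are hypothetical.

```python
def hard_sigmoid(x, slope=1.0):
    """Piecewise-linear sigmoid: 0 below -1/slope, 1 above 1/slope."""
    return max(0.0, min(1.0, (slope * x + 1.0) / 2.0))

def boundary_forward(x, slope=1.0):
    """Forward pass: a hard binary decision via a step function."""
    return 1.0 if hard_sigmoid(x, slope) > 0.5 else 0.0

def boundary_backward(x, upstream_grad, slope=1.0):
    """Backward pass: use the hard sigmoid's gradient in place of the step's.

    The hard sigmoid has gradient slope / 2 on its linear region and 0 elsewhere.
    """
    inside = -1.0 / slope < x < 1.0 / slope
    return upstream_grad * (slope / 2.0 if inside else 0.0)
```

The forward output is always exactly 0 or 1, while the backward pass still returns a nonzero gradient near the decision threshold, which is what lets the boundary detector train. The mismatch between the two passes is the source of the bias.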

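The slope annealing trick gradually increases the slope of the hard sigmoid during training so that it approaches the step function, which reduces the bias of the estimator. The experimental results show only minimal improvement, though, and it introduces more hyperparameters.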
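One way such a schedule might look is a linear ramp with a cap. The function name and the default constants below are illustrative placeholders, not values taken from these notes.

```python
def annealed_slope(epoch, start=1.0, step=0.04, max_slope=5.0):
    """Linearly increase the hard sigmoid slope per epoch, capped at max_slope.

    start, step and max_slope are hypothetical defaults; each one is an
    extra hyperparameter that would need tuning.
    """
    return min(max_slope, start + step * epoch)
```

As the slope grows, the linear region of the hard sigmoid narrows and the backward pass looks more like the step function used on the forward pass.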

Implemented as a variant of LSTM (HM-LSTM) with the custom operations above. No experimental results for the variant built on a regular RNN (HM-RNN).

Results

Learns useful boundary detectors, visualized in the paper.

The learned boundaries are possibly imperfect, or at least not the ones a human would pick: they fire on spaces, tree breaks, some bigrams, and some prefix delineations (“dur”: during, duration, durable).

Results are reported only on character-level compression tasks and handwriting; there are no explicit NLP tasks such as machine translation, question answering, or named entity recognition.

Conclusion

Thanks to those who attended the reading group session for their discussion of this paper! Lots of good insights from everyone.

Follow me on Twitter for more posts like these. If you’d like help deploying similar models in production, I do consulting.
