Awesome AI papers: How to make long distance relationships work

Erm, I meant, in the context of language modelling and backpropagation.

Rowen Lee
Nurture.AI
3 min read · Apr 23, 2018


This article is part of a weekly series of AI paper summaries. Check out more at the nurture.ai medium publication or the official nurture.ai website.

[Figure: model architecture, from the paper]

TL;DR

Long-term dependencies in RNNs are typically learned using backpropagation through time (BPTT). However, BPTT tends to suffer from vanishing or exploding gradients on long sequences, and its memory requirement grows in proportion to sequence length, so it can become infeasible when the input sequence is very long. Existing ways to address these weaknesses include LSTMs, gradient clipping and synthetic gradients. This paper introduces an alternative: adding auxiliary losses.
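To make the memory point concrete, here is a minimal sketch (mine, not from the paper) of plain truncated BPTT in PyTorch: the hidden state is detached every few hundred steps, so the computation graph that backpropagation must store never spans more than one truncation window. The layer sizes and the classification head are illustrative choices; only the 300-step window echoes the truncation length mentioned later in this summary.

```python
import torch
import torch.nn as nn

# Truncated BPTT sketch: detach the hidden state every `trunc_len` steps,
# so gradients (and the activations they require) never span more than
# one truncation window.
rnn = nn.LSTM(input_size=1, hidden_size=128, batch_first=True)
head = nn.Linear(128, 10)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, trunc_len=300):
    """x: (batch, seq_len, 1) long input sequences, y: (batch,) class labels."""
    state = None
    for start in range(0, x.size(1), trunc_len):
        chunk = x[:, start:start + trunc_len]
        out, state = rnn(chunk, state)
        # Cut the gradient path here: earlier chunks are never revisited.
        state = tuple(s.detach() for s in state)
    logits = head(out[:, -1])          # classify from the final hidden output
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```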

The “aha” moment

Gradients do not need to be backpropagated all the way to the beginning of an input sequence. The paper combines truncated BPTT with auxiliary losses: random time steps are selected as anchor points, and from each anchor the gradients of an auxiliary loss flow backwards or forwards for only a truncated number of steps. The anchor points act as a form of temporary memory, pushing the recurrent network to remember past events, or anticipate future ones, in the sequence.

Plan of attack

The goal is to train an LSTM that reads and classifies a sequence with the help of an unsupervised auxiliary loss inserted at anchor points. The auxiliary loss either reconstructs a subsequence that comes before an anchor point or predicts the subsequence that comes after it. The former is applied at randomly selected anchor points in a network named r-LSTM (reconstruction), the latter in a p-LSTM (prediction). Each variant is trained with one of two optimisation schemes, i.e. full or truncated backpropagation. With full backpropagation, gradients flow to the end (p-LSTM) or the beginning (r-LSTM) of the input sequence; with truncated backpropagation, gradients stop after 300 timesteps. During training, the auxiliary loss is minimised jointly with the main objective (the classification loss).
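Below is a minimal PyTorch sketch of the r-LSTM idea as I read it: one random anchor point per batch, a small decoder that tries to reproduce the subsequence just before the anchor, and the resulting auxiliary loss added to the classification loss. This is not the authors' code; the sizes, the single anchor, the teacher-forced decoder inputs and the MSE reconstruction loss are my simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RLSTM(nn.Module):
    """Toy r-LSTM-style model: a main LSTM classifier plus an auxiliary
    decoder that reconstructs the subsequence preceding a random anchor."""
    def __init__(self, input_size=1, hidden=128, n_classes=10, sub_len=50):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden, batch_first=True)
        self.decoder = nn.LSTM(input_size, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)
        self.reconstruct = nn.Linear(hidden, input_size)
        self.sub_len = sub_len

    def forward(self, x):
        out, _ = self.encoder(x)                       # (batch, seq, hidden)
        logits = self.classifier(out[:, -1])           # main classification head

        # Auxiliary loss: pick one random anchor and re-predict the
        # sub_len inputs that came just before it, one step at a time.
        anchor = torch.randint(self.sub_len, x.size(1), (1,)).item()
        target = x[:, anchor - self.sub_len:anchor]    # (batch, sub_len, input)
        h0 = out[:, anchor].unsqueeze(0).contiguous()  # decoder starts from the
        c0 = torch.zeros_like(h0)                      # encoder state at the anchor
        dec_out, _ = self.decoder(target[:, :-1], (h0, c0))   # teacher forcing
        aux_loss = F.mse_loss(self.reconstruct(dec_out), target[:, 1:])
        return logits, aux_loss

# The auxiliary loss is simply added to the main classification loss.
model = RLSTM()
optimizer = torch.optim.Adam(model.parameters())
x = torch.randn(8, 784, 1)                  # e.g. pixel-by-pixel MNIST
y = torch.randint(0, 10, (8,))
logits, aux_loss = model(x)
loss = F.cross_entropy(logits, y) + aux_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch the auxiliary gradients still flow through the encoder all the way back from the anchor; the paper additionally truncates that path, which is what keeps the memory cost bounded.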

Results

The r-LSTM and p-LSTM are evaluated on pixel-by-pixel image classification (where pixels are fed into the model one at a time) and character-level document classification. Both show comparable or better performance than a standard LSTM trained without an auxiliary loss.

Competing approaches

One way to sidestep BPTT for long input sequences is the Transformer, which relies on an attention mechanism instead of recurrence. However, the Transformer requires access to the entire input sequence during inference, which works well only when sequences are reasonably short.

Industry Implications

This method of learning long-term dependencies could be useful in online learning, where input data arrives in long, continuous streams. More broadly, it epitomises the push towards AI models that are increasingly efficient in memory and compute. Such models will integrate seamlessly into our lives, hopefully for the best.

Questions left open

  1. Could this approach be applied to the Transformer?
  2. Would the Transformer's performance improve if an auxiliary loss were incorporated into its architecture?
  3. Should we just scrap backpropagation? There has long been scepticism about backpropagation, which to date shows no evidence of being how the human brain learns. Although other optimisation methods exist, backpropagation remains the default in the deep learning community, at least until someone comes up with a groundbreaking alternative that reshapes the landscape of Artificial Intelligence.

Read the full paper here.

Interested in reading more? Head over to nurture.ai to view more weekly paper summaries and discuss interesting questions left open by the paper here.

Rowen is a research fellow at Nurture.AI. She believes the barrier to understanding powerful knowledge is convoluted language and excessive jargon. Her aim is to break difficult concepts down into easily digestible pieces.
