RNN cell structures: STAR, GRU, and LSTM, respectively.

Deep Multi-layer RNNs That Can Be Trained

Mehmet Özgür Türkoğlu
Published in EcoVisionETH · May 8, 2020

Recurrent Neural Networks (RNNs) have established themselves as a powerful tool for modeling sequential data. They have led to significant progress for a variety of applications, notably language processing and speech recognition. This post briefly explains a new recurrent unit named STAR that makes building deeper multi-layer RNNs possible. For an in-depth description, please see our publication: Gating Revisited: Deep Multi-layer RNNs That Can Be Trained.

Sequential data is any kind of data where the order matters. It can be a time series, such as a speech signal or a video, but also any other ordered data, such as a DNA sequence.

What is a Recurrent Neural Network (RNN)?

The basic building block of an RNN is a computational unit (or cell) that combines two inputs: the data of the current time step in the sequence and the unit’s own output from the previous time step. Formally, an RNN cell is a non-linear transformation that maps the input signal x at time t and the hidden state h of the previous time step to the current hidden state:

h_t = f(x_t, h_{t-1}; W),

with W the trainable parameters of the cell. The input sequences have an overall length T, which can vary. Depending on the task and the available labels, the loss is computed either from the final state or from the complete sequence of states. Learning amounts to fitting W to minimize the loss, usually with stochastic gradient descent.
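As a concrete illustration, below is a minimal PyTorch sketch of such a cell, here a plain tanh RNN unrolled over a sequence; the class name, layer sizes, and weight layout are placeholders for illustration, not code from the paper.

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """Minimal RNN cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.in_proj = nn.Linear(input_size, hidden_size)                 # W_x x_t + b
        self.rec_proj = nn.Linear(hidden_size, hidden_size, bias=False)   # W_h h_{t-1}

    def forward(self, x_t, h_prev):
        return torch.tanh(self.in_proj(x_t) + self.rec_proj(h_prev))

# Unroll the cell over a sequence of length T = 100.
cell = VanillaRNNCell(input_size=32, hidden_size=64)
x = torch.randn(8, 100, 32)        # (batch, T, features)
h = torch.zeros(8, 64)
for t in range(x.size(1)):
    h = cell(x[:, t], h)           # after the loop, h is the final state h_T
```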

When stacking multiple RNN cells on top of each other, the hidden state of the lower level is passed on as input to the next-higher level l (figure below). In mathematical terms, this corresponds to the recurrence relation:

h_t^l = f(h_t^{l-1}, h_{t-1}^l; W^l),

Temporal unfolding leads to a two-dimensional lattice with depth L and length T, as shown in the figure below. The forward pass runs from left to right and from bottom to top, whereas gradients flow in the opposite direction during the backward pass.
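Continuing the sketch from above (and reusing the hypothetical VanillaRNNCell), the wiring of such a lattice can be written as two nested loops, one over time and one over layers; this only illustrates the structure, not any specific architecture from the paper.

```python
class DeepRNN(nn.Module):
    """Stack of L cells; layer l receives the hidden state of layer l-1 as its input."""
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        sizes = [input_size] + [hidden_size] * num_layers
        self.cells = nn.ModuleList(
            VanillaRNNCell(sizes[l], sizes[l + 1]) for l in range(num_layers)
        )

    def forward(self, x):                                 # x: (batch, T, features)
        batch, T, _ = x.shape
        h = [x.new_zeros(batch, c.rec_proj.in_features) for c in self.cells]
        for t in range(T):                                # left to right in the lattice
            inp = x[:, t]
            for l, cell in enumerate(self.cells):         # bottom to top in the lattice
                h[l] = cell(inp, h[l])
                inp = h[l]                                # lower state feeds the next layer
        return h[-1]                                      # final top-layer state

model = DeepRNN(input_size=32, hidden_size=64, num_layers=12)
final_state = model(torch.randn(8, 100, 32))              # (8, 64)
```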

The general structure of an unfolded deep multi-layer RNN.

Deeper RNNs to model more complex sequences

In general, abstract features are often represented better by deeper architectures [1]. Especially in computer vision, it is very common to use deeper networks in which early layers capture low-level cues, e.g. edges and blobs, while deeper layers capture more object-specific cues, e.g. object parts. In the same way that multiple hidden layers can be stacked in traditional feed-forward networks, multiple recurrent cells can also be stacked on top of each other, i.e., the output (or hidden state) of the lower cell is connected to the input of the next-higher cell, allowing each layer to model different dynamics.

The problem: gradient instability and exploding number of parameters

Several works [2, 3, 4] have shown the ability of deeper recurrent architectures to extract more complex features from the input and make better predictions. However, such architectures usually consist of just two or three layers, because training deeper recurrent architectures remains an open problem. More specifically, deep RNNs suffer from two main shortcomings:

  • they are difficult to train because of gradient instability, i.e., the gradient either explodes or vanishes during training;
  • the large number of parameters contained in every single cell makes deep architectures extremely resource-intensive.

Both issues restrict the practical use of deep RNNs and particularly their usage for image-like input data, which generally requires multiple convolutional layers to extract discriminative, abstract representations.
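To make the second point concrete, here is a back-of-the-envelope parameter count per cell (input size n, hidden size m), using the standard gate counts of one set of weights for a vRNN, three for a GRU, and four for an LSTM; exact counts in specific implementations differ slightly (e.g. extra bias vectors).

```python
def cell_params(n, m, num_gates):
    """Rough count: num_gates * (input-to-hidden + hidden-to-hidden + bias)."""
    return num_gates * (n * m + m * m + m)

n, m = 128, 128
print("vRNN:", cell_params(n, m, 1))   # 1 weight set ->  32,896
print("GRU: ", cell_params(n, m, 3))   # 3 gates      ->  98,688
print("LSTM:", cell_params(n, m, 4))   # 4 gates      -> 131,584
```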

To investigate the first issue, we look at how the magnitude of the gradients changes across the computation lattice, starting with the vanilla RNN (vRNN) with orthogonal initialization and the LSTM (long short-term memory):

The magnitude of gradients across the computation lattice.

As the gradients flow back through time and layers, for a network of vRNN units they get amplified whereas for LSTM units they get attenuated. A theoretical argument in our publication also supports this numerical analysis.
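One simple way to reproduce this kind of analysis is to retain the gradients of every hidden state in the lattice and inspect their norms after a backward pass. The sketch below reuses the hypothetical VanillaRNNCell from earlier and a dummy loss; it illustrates one possible setup, not the exact protocol of the paper.

```python
def gradient_norms(cells, x):
    """Forward a deep RNN, backprop a dummy loss, and report gradient norms over the lattice."""
    batch, T, _ = x.shape
    h = [x.new_zeros(batch, c.rec_proj.in_features) for c in cells]
    states = []                                  # states[t][l] = hidden state of layer l at time t
    for t in range(T):
        inp = x[:, t]
        states.append([])
        for l, cell in enumerate(cells):
            h[l] = cell(inp, h[l])
            h[l].retain_grad()                   # keep gradients of these non-leaf tensors
            states[t].append(h[l])
            inp = h[l]
    loss = states[-1][-1].pow(2).mean()          # dummy loss on the final top-layer state
    loss.backward()
    return [[s.grad.norm().item() for s in row] for row in states]

cells = nn.ModuleList(VanillaRNNCell(32 if l == 0 else 64, 64) for l in range(12))
norms = gradient_norms(cells, torch.randn(8, 100, 32))
print(norms[0])                                  # gradient norms at t = 0, one value per layer
```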

STAckable Recurrent (STAR) Network

Based on this analysis, we introduce a novel RNN cell designed to avoid vanishing or exploding gradients while reducing the number of parameters.

The STAR cell structure.

Our proposed STAR cell in layer l at time t takes the input h from the previous layer (in the first layer, x) at time t and nonlinearly projects it into the space where the hidden vector h lives. In addition, the hidden state from the previous time step and the new input are combined into a gating variable k, which controls how information from the previous hidden state and the new input is mixed into the new hidden state. The complete dynamics of the STAR unit is given by the expressions:
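For intuition, here is a minimal PyTorch sketch of a STAR-like cell that follows this description: one nonlinear projection of the layer input and a single gate k that blends the previous hidden state with that projection. The exact parameterization (weight matrices, biases, output nonlinearity) is my reading of the paper, so please check it against the publication and the released code rather than treating this as the reference implementation.

```python
import torch
import torch.nn as nn

class STARCell(nn.Module):
    """STAR-style cell: a single gate k blends the previous state with the projected input."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.proj = nn.Linear(input_size, hidden_size)                    # projected input z_t
        self.gate_in = nn.Linear(input_size, hidden_size)                 # input part of gate k
        self.gate_rec = nn.Linear(hidden_size, hidden_size, bias=False)   # recurrent part of gate k

    def forward(self, x_t, h_prev):
        z = torch.tanh(self.proj(x_t))                                    # nonlinear projection
        k = torch.sigmoid(self.gate_in(x_t) + self.gate_rec(h_prev))      # gating variable k
        return torch.tanh((1.0 - k) * h_prev + k * z)                     # blend old state and new input
```

Compared with an LSTM cell, which maintains two state vectors and four gated transformations, a single gate and a single state vector are used here, which is where the reduction in parameters comes from.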

We repeat the same numerical simulations as we did earlier, adding the STAR cell, and find that it indeed maintains healthy gradient magnitudes throughout most of the deep RNN:

The magnitude of gradients across the computation lattice.

How well does STAR perform?

Pixel-by-pixel MNIST

The pixel-by-pixel MNIST task is a common benchmark for RNNs. The grey-scale images of handwritten digits from MNIST are flattened into vectors, and the values are presented to the RNN one pixel at a time. The model’s task is to predict the digit after having seen all pixels. The second task, pMNIST, is more challenging: before flattening, the pixels of each image are shuffled with a fixed random permutation, turning correlations between spatially close pixels into non-local long-range dependencies. As a consequence, the model needs to remember dependencies between distant parts of the sequence to classify the digit correctly.
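As an illustration of the data preparation (the torchvision usage and data path below are my own choices, not taken from the post), the two variants differ only in whether a fixed permutation is applied before flattening:

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST("./data", train=True, download=True,
                       transform=transforms.ToTensor())

perm = torch.randperm(28 * 28)                  # drawn once, reused for every image

def to_sequence(img, permute=False):
    """Flatten a 1x28x28 image into a length-784 sequence of single pixel values."""
    seq = img.view(-1)                          # (784,)
    if permute:                                 # pMNIST: apply the fixed permutation
        seq = seq[perm]
    return seq.unsqueeze(-1)                    # (784, 1): one pixel value per time step

img, label = mnist[0]
x_mnist = to_sequence(img)                      # pixel-by-pixel MNIST
x_pmnist = to_sequence(img, permute=True)       # permuted MNIST (pMNIST)
```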

The following figure shows the average gradient norms per layer at the start of training for 12-layer networks built from different RNN cells. Propagation through the network increases the gradients for the vRNN and shrinks them for the LSTM. As the optimization proceeds, we find that STAR remains stable, whereas all other units see a rapid decline of the gradients already within the first epoch, except for RHN, where the gradients explode (middle). Consequently, STAR is the only unit for which a 12-layer model can be trained, as also confirmed by the evolution of the training loss (right).

Left: Mean gradient norm per layer at the start of training. Middle: Evolution of gradient norm during 1st training epoch. Right: Loss during 1st epoch.

The following figure confirms that stacking layers into deeper architectures does benefit RNNs (except for vRNN), but it increases the risk of a catastrophic training failure. STAR is significantly more robust in that respect and it outperforms all the baselines.

Accuracy vs layers. Left: MNIST. Right: pMNIST.

Crop Type Classification

We evaluate model performance on a more realistic sequence modelling problem where the aim is to classify agricultural crop types using sequences of satellite images. In this case, time-series modelling captures phenological evidence, i.e. different crops have different growing patterns over the season.

For this task, we use two datasets, TUM and BreizhCrop. In the TUM dataset, the input is a time series of 26 multi-spectral Sentinel-2A satellite images with a ground resolution of 10 m, collected over a large area north of Munich, Germany. In the BreizhCrop dataset, the input is a time series of 45 multi-spectral Sentinel-2A satellite images with 4 spectral channels, covering 580k field parcels in the region of Brittany, France. As input, we use patches of 3×3 pixels recorded in all the available spectral channels and flattened into vectors.
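A hedged sketch of how such inputs could be assembled (the tensor shapes are assumptions for illustration, not the datasets' actual formats): each time step becomes a 3×3 spatial patch with its spectral channels flattened into one feature vector.

```python
import torch

def patches_to_sequences(images):
    """
    images: (T, C, H, W) time series of satellite images over one area.
    Returns (num_patches, T, 9 * C): one flattened 3x3-patch sequence per location.
    """
    T, C, H, W = images.shape
    patches = images.unfold(2, 3, 3).unfold(3, 3, 3)   # (T, C, H//3, W//3, 3, 3)
    patches = patches.permute(2, 3, 0, 1, 4, 5)        # (H//3, W//3, T, C, 3, 3)
    return patches.reshape(-1, T, C * 9)               # sequences ready for the RNN

seqs = patches_to_sequences(torch.randn(26, 13, 48, 48))  # e.g. 26 dates, 13 spectral bands
print(seqs.shape)                                          # torch.Size([256, 26, 117])
```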

Examples of short sequences from the TUM dataset with the corresponding label.

As shown in the following table, STAR significantly outperforms all the baselines, including TCN and the recently proposed indRNN [5], which also aims to build deep multi-layer RNNs. The performance gain is larger on the BreizhCrop dataset, probably because the sequences are longer and the depth of the network helps to capture more complex dependencies in the data.

Performance comparison for time series crop classification.

Hand-Gesture Recognition from Video

We also evaluate STAR on sequences of images, using convolutional layers, on a gesture recognition from video task. The 20BN-Jester dataset V1 is a large collection of short video clips, where each clip contains a predefined hand gesture performed by a worker in front of a laptop camera or webcam.

Examples from the 20BN-Jester dataset.

The consecutive frames are sequentially presented to the convolutional RNN. In the end, the model predicts a gesture class via an averaging layer over all time steps.
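A minimal sketch of this kind of classifier head (the convolutional recurrent update here is a generic placeholder, not the paper's convolutional STAR): a convolutional hidden state is updated frame by frame, pooled spatially, classified per time step, and the per-step predictions are averaged over the clip.

```python
import torch
import torch.nn as nn

class ConvRNNClassifier(nn.Module):
    """Generic convolutional RNN with temporal averaging for clip classification."""
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        # Simple convolutional recurrent update: h_t = tanh(conv(x_t) + conv(h_{t-1})).
        self.in_conv = nn.Conv2d(in_channels, hidden_channels, 3, padding=1)
        self.rec_conv = nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1, bias=False)
        self.classifier = nn.Linear(hidden_channels, num_classes)

    def forward(self, video):                            # video: (batch, T, C, H, W)
        b, T, _, H, W = video.shape
        h = video.new_zeros(b, self.rec_conv.in_channels, H, W)
        logits = []
        for t in range(T):
            h = torch.tanh(self.in_conv(video[:, t]) + self.rec_conv(h))
            feat = h.mean(dim=(2, 3))                    # global spatial average pooling
            logits.append(self.classifier(feat))
        return torch.stack(logits, dim=1).mean(dim=1)    # average the predictions over time

model = ConvRNNClassifier(in_channels=3, hidden_channels=32, num_classes=27)  # 27 Jester gestures
scores = model(torch.randn(2, 16, 3, 64, 64))            # two 16-frame clips
print(scores.shape)                                       # torch.Size([2, 27])
```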

The outcome for the convolutional RNNs is consistent with the previous results. Going deeper improves the performance of all four tested convRNNs. The improvement is strongest for the convolutional STAR, and the best performance is reached with a deep, 12-layer model. In summary, the results confirm both of our intuitions: depth is particularly useful for convolutional RNNs, and STAR is better suited to deeper architectures, where it achieves higher performance with better memory efficiency.

Performance comparison for the gesture recognition task.

To learn more about the details of our method or to see more results, you can have a look at our publication:

Turkoglu, Mehmet Ozgur, Stefano D’Aronco, Jan Wegner, and Konrad Schindler. “Gating Revisited: Deep Multi-layer RNNs That Can Be Trained.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).

If you want to use our RNN in your project, the code is available here.

References

[1] Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[2] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated Feedback Recurrent Neural Networks. In ICML, 2015.

[3] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent Highway Networks. In ICML, 2017.

[4] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, 2016.

[5] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently Recurrent Neural Network (IndRNN): Building a Longer and Deeper RNN. In CVPR, 2018.
