Simplifying RNNs for Neural Machine Translation

Mattia Di Gangi
Machine Translation @ FBK
6 min read · Oct 15, 2018

Recurrent neural networks (RNNs) are a staple algorithm for virtually every task in natural language processing (NLP). Their sequential processing fits naturally with a view of language as a “flat” sequence of tokens. Moreover, RNNs, especially in their gated forms (LSTMs, GRUs), can capture more complex structure in a sequence and model long-range dependencies.
However, LSTMs are also known for their slow execution, which leads to long training times. Faster alternatives for neural machine translation (NMT) include convolutional networks and Transformer networks. This post is about a network that overcomes the limitations of LSTMs in order to train faster with no loss in NMT performance.

LSTMs are inefficient

RNNs can achieve state-of-the-art results on many language tasks, but they suffer from a computational inefficiency due to their sequential nature and the high number of operations performed at every time step. LSTMs were proposed to solve the “vanishing gradient problem”, which prevents vanilla RNNs from learning long-range dependencies effectively. The solution introduced by LSTMs consists in computing several functions of the same input and hidden state at each time step. These functions are called “gates” because they take values between 0 and 1, and can thus block the signal propagation (0) or let it pass (1). With several gates, an LSTM cell can control the information flow more precisely, but the cost is a higher parameter count and possibly redundant operations.
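
To see where the cost comes from, here is a minimal PyTorch sketch of a single LSTM time step written out gate by gate. All names and sizes are illustrative, not taken from any specific model; the point is that two full matrix multiplications are needed at every step, and the one involving the previous hidden state cannot be parallelized over the sequence.

```python
import torch

d = 512
x_t = torch.randn(1, d)       # current input token embedding
h_prev = torch.randn(1, d)    # previous hidden state
c_prev = torch.randn(1, d)    # previous cell state

W = torch.randn(4 * d, d)     # input-to-hidden weights (i, f, o, g stacked)
U = torch.randn(4 * d, d)     # hidden-to-hidden weights
b = torch.zeros(4 * d)

# Two matrix multiplications per step; the one on h_prev is inherently sequential.
z = x_t @ W.t() + h_prev @ U.t() + b
i, f, o, g = z.chunk(4, dim=-1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in [0, 1]
g = torch.tanh(g)                                               # candidate update

c_t = f * c_prev + i * g       # gated cell update
h_t = o * torch.tanh(c_t)      # gated output
```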

Figure 1: LSTMs perform a slow sequential processing of the input. Each red block groups several functions of the input and the hidden state.

The slow training of LSTMs is the problem we addressed in our paper Deep Neural Machine Translation with Weakly-Recurrent Units, where we propose a deep learning architecture for neural machine translation that is still recurrent but brings two improvements over LSTMs:

  1. some operations can be performed in parallel across all time steps, and
  2. the number of operations per time step is greatly reduced compared to LSTMs.

The reduced number of operations makes it possible to stack several layers in both the encoder and the decoder without training more slowly than an LSTM network with fewer layers, while at the same time improving translation quality.

Deep Networks with Fast Recursion

Our model, which we call Simple Recurrent Neural Machine Translation (SR-NMT), incorporates:

  • layer normalization to reduce the internal covariate shift (and make the training of deep networks possible)
  • an attention network in each decoder layer, to further improve translation quality and the information flow between encoder and decoder
  • highway networks, which ease the training of deep networks
Figure 2: SR-NMT can parallelize most of the operations.

Figure 2 shows a schema of the building blocks of SR-NMT. The first green block represents a time-distributed linear layer that projects the input (of size d) to a vector of size 3d, which is then layer normalized. The chain of these two operations can be called a normalized transformation. The vector of size 3d is split into three parts:

  • a candidate state that is used to propagate the signal forward,
  • a vector that, after a sigmoid activation, becomes the gate used for the recursion, and
  • another gate for the highway network.

The recursion is performed via an element-wise weighted average of the current and the previous state, where the weights are given by the gate activation. This type of recursion is much faster than the LSTM recursion, as it involves no learned parameters and only element-wise operations instead of costly matrix multiplications. In figure 2 this recursion is represented by the orange blocks.
Layer normalization and highway networks allow the overall network to be deep, and thus more expressive, while also converging faster during training. The topmost green block in figure 2 represents the highway network and, in the decoder, the optional attention layer.
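
Putting the pieces together, the following PyTorch sketch shows one possible unidirectional unit built from the description above. The class name, the zero initial state, and the exact way the highway gate mixes input and output are my own assumptions; the precise formulas are in the paper.

```python
import torch
import torch.nn as nn

class WeaklyRecurrentUnit(nn.Module):
    """Illustrative sketch of a single unidirectional SR-NMT-style unit."""

    def __init__(self, d):
        super().__init__()
        # Time-distributed "normalized transformation": one linear layer d -> 3d
        # followed by layer normalization; applied to ALL time steps at once.
        self.proj = nn.Linear(d, 3 * d)
        self.norm = nn.LayerNorm(3 * d)

    def forward(self, x):
        # x: (batch, time, d)
        z = self.norm(self.proj(x))             # parallel over time
        x_tilde, f, r = z.chunk(3, dim=-1)      # 3d -> three vectors of size d
        f = torch.sigmoid(f)                    # recursion gate
        r = torch.sigmoid(r)                    # highway gate

        # Cheap recursion: element-wise weighted average of the current candidate
        # and the previous state; no matrix multiplication inside the loop.
        h_prev = torch.zeros_like(x_tilde[:, 0])
        states = []
        for t in range(x.size(1)):
            h_prev = f[:, t] * h_prev + (1 - f[:, t]) * x_tilde[:, t]
            states.append(h_prev)
        h = torch.stack(states, dim=1)

        # Highway combination of the unit output with its input,
        # again element-wise and parallel over time.
        return r * h + (1 - r) * x

# Example usage: units like this would be stacked to build the encoder.
unit = WeaklyRecurrentUnit(512)
out = unit(torch.randn(8, 20, 512))   # (batch, time, d) -> (batch, time, d)
```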

The architecture

Figure 3: Encoder (left) — Decoder (right) architecture of SR-NMT

Finally, we come to the description of the architecture (for the formulas, refer to the paper). The encoder, shown on the left-hand side of figure 3, is a stack of bidirectional units composed of the operations described above, but with a bidirectional recursion. Thus, the output of the first layer normalization is split into four vectors of size d/2 and one of size d (used for the highway). The vectors of size d/2 are used to compute the same operations in the two directions.
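
As a rough illustration of the bidirectional split (names and sizes are my own choices, not taken from the paper), the 3d-dimensional normalized transformation could be cut up as follows; the recursion itself is the same element-wise update as in the sketch above, run left-to-right and right-to-left.

```python
import torch
import torch.nn as nn

d = 512
proj = nn.Linear(d, 3 * d)
norm = nn.LayerNorm(3 * d)

x = torch.randn(8, 20, d)                 # (batch, time, d)
z = norm(proj(x))                         # (batch, time, 3d), parallel over time
fwd_cand, fwd_gate, bwd_cand, bwd_gate, hw_gate = z.split(
    [d // 2, d // 2, d // 2, d // 2, d], dim=-1)
# fwd_* feed a left-to-right recursion, bwd_* a right-to-left one; the two
# (d/2)-sized state sequences are concatenated back to size d and then
# combined with the unit input through the highway gate hw_gate.
```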
The decoder (right-hand side of figure 3) is another stack of units, but this time more complex. Every decoder unit computes the recursion as the encoder units do, but performs additional operations before the highway network. We apply a normalized transformation to the output of the recursion, which is then used to compute the attention over the encoder; each decoder unit features its own attention network. A normalized transformation is then applied separately to the recursion output and the attention output, and the results are combined with a fully-connected layer followed by a tanh non-linearity. The output of the tanh is used to compute the highway network.
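
Here is a sketch of that per-layer attention step, assuming a simple dot-product scoring function; the module names and the scoring choice are illustrative assumptions, and the real formulas are in the paper.

```python
import torch
import torch.nn as nn

d, batch, src_len, tgt_len = 512, 8, 25, 20

norm_q = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))  # query transform
norm_h = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))  # on recursion output
norm_c = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))  # on attention context
combine = nn.Linear(2 * d, d)

h_dec = torch.randn(batch, tgt_len, d)       # recursion output of this decoder layer
enc = torch.randn(batch, src_len, d)         # top encoder states

# Each decoder layer owns its attention over the encoder.
queries = norm_q(h_dec)
weights = torch.softmax(torch.bmm(queries, enc.transpose(1, 2)), dim=-1)
context = torch.bmm(weights, enc)            # (batch, tgt_len, d)

# Normalized transformations of recursion output and context, merged by a
# fully-connected layer and a tanh; `merged` then feeds the highway network.
merged = torch.tanh(combine(torch.cat([norm_h(h_dec), norm_c(context)], dim=-1)))
```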

Translation Quality on par with more complex architectures

Figure 4: SR-NMT is comparable with strong LSTM (GNMT) and CNN (ConvS2S).

We performed experiments on two benchmark datasets: WMT14 English-German, which is widely used for comparisons with the state of the art, and WMT16 English-Romanian, which represents a different, low-resource setting where most of the data comes from back-translation.
The results on WMT14 are very promising: SR-NMT with 8 layers in both encoder and decoder outperforms GNMT, a large network with 8 LSTM layers in encoder and decoder. The result is even more interesting because in our experiments SR-NMT uses only 500 hidden units, while GNMT uses 1024. Moreover, SR-NMT with 9 layers is only 0.12 BLEU points below FAIR's ConvS2S.
The experiments on WMT16 English-Romanian show that SR-NMT is a viable option in this setting as well: we reused exactly the same hyper-parameters as for English-German and still obtained a score higher than the winning WMT16 submission.
This architecture certainly requires experiments on more diverse benchmarks to prove its validity, but the results we obtained are already promising.

Faster with fewer parameters

Figure 5: Training speed comparison of SR-NMT and an LSTM network with the same hidden size for an increasing number of layers.

To compare training time, we ran experiments with both SR-NMT and LSTMs using our repository based on OpenNMT-py, on a machine equipped with an NVIDIA GTX 1080 GPU and PyTorch 0.3.1. The LSTM implementation is the one provided by the PyTorch APIs (implemented directly in C++), while SR-NMT is implemented entirely in PyTorch. The plot in figure 5 shows a substantial speed-up, which however shrinks as the number of layers grows. This is explained by the attention network inside each decoder layer, which increases the complexity of the model. Nonetheless, SR-NMT trains faster and reaches a higher BLEU score than its LSTM-based counterpart.

Figure 6: Comparison of the number of parameters of SR-NMT and an LSTM network with the same hidden size.

The number of parameters is also significantly reduced: for instance, SR-NMT with 5 layers has the same number of parameters as an LSTM network with 3 layers. As shown in the previous plot, their training speed is also the same, but the deeper model converges earlier, which makes SR-NMT preferable in this case.

Conclusion

SR-NMT is a neural machine translation architecture that shows it is possible to perform recurrent computation while significantly reducing the complexity of the recurrent units. Despite its simplified units, it achieves translation quality comparable to that of LSTM- and CNN-based architectures.
The comparison with the Transformer is unfavorable, but a recent paper by Google suggests that this may just be a matter of hyper-parameters.

Call to Action

If you want to know more about SR-NMT, you can read the scientific paper, clone the repository, or take a look at the conference presentation slides.
