Review — Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation (Deep-ED & Deep-Att)

Using Fast-Forward Connections to Help Gradient Propagation

Sik-Ho Tsang, published in Geek Culture, Nov 23, 2021


In this story, Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation, (Deep-ED & Deep-Att), by Baidu Research, is reviewed. In this paper:

  • A new type of linear connection, named the fast-forward (F-F) connection, based on deep Long Short-Term Memory (LSTM), is introduced.
  • These F-F connections help propagate gradients and enable building a deep topology of depth 16.

This is a paper in 2016 TACL with over 180 citations, where TACL has an impact score of 6.43. (Sik-Ho Tsang @ Medium)


  1. F-F connections
  2. Deep-ED and Deep-Att: Network Architecture
  3. Experimental Results

1. F-F connections

1.1. F-F connections in RNN

Figure: RNN models — (a) basic RNN, (b) equivalent split into “f” and “r” blocks, (c) two stacked layers with F-F connections.
  • (a): Basic RNN: when an input sequence {x1, …, xm} is given to a recurrent layer, the output ht at each time step t is computed as ht = σ(W·xt + U·h(t−1)).
  • (b): Basic RNN redrawn with an intermediate computational state and the sum operation (+) followed by the activation. It consists of block “f” and block “r”, and is equivalent to (a).
  • The computation can be equivalently split into two consecutive steps:
  • “f” block, feed-forward computation (left part in (b)): ft = W·xt, with no activation and no recurrence.
  • “r” block, recurrent computation (right part, with the sum operation (+) followed by the activation in (b)): ht = σ(ft + U·h(t−1)).
  • (c): Two stacked RNN layers with F-F connections, denoted by dashed red lines: the “f” block of the upper layer takes the concatenation [ft, ht] of the lower layer’s outputs as input, where [ , ] denotes concatenation.
  • F-F connections connect two feed-forward computation blocks “f” of adjacent recurrent layers.
  • The path of F-F connections contains neither nonlinear activations nor recurrent computation. It provides a fast path for information to propagate.
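The split into “f” and “r” blocks and the F-F concatenation can be sketched in a few lines. This is a minimal NumPy sketch, not the paper’s LSTM: plain sigmoid RNN cells, toy sizes, and random weights are all assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4                         # toy hidden/input size (not from the paper)

# Layer 1: f-block weight W1 (input -> f), r-block recurrent weight U1
W1 = rng.normal(size=(d, d))
U1 = rng.normal(size=(d, d))
# Layer 2: its f-block consumes the concatenation [f1, h1] via the F-F connection
W2 = rng.normal(size=(d, 2 * d))
U2 = rng.normal(size=(d, d))

def step(x_t, h1_prev, h2_prev):
    # "f" block of layer 1: pure feed-forward, no activation, no recurrence
    f1 = W1 @ x_t
    # "r" block of layer 1: sum with the recurrent term, then activation
    h1 = sigmoid(f1 + U1 @ h1_prev)
    # F-F connection: layer 2's f-block sees [f1, h1] concatenated,
    # so f1 reaches the layer above without passing through any nonlinearity
    f2 = W2 @ np.concatenate([f1, h1])
    h2 = sigmoid(f2 + U2 @ h2_prev)
    return h1, h2

h1 = h2 = np.zeros(d)
for x_t in rng.normal(size=(3, d)):   # toy sequence of length 3
    h1, h2 = step(x_t, h1, h2)
```

Note that `f1` flows into `f2` through concatenation and a linear map only, which is exactly the fast, activation-free path the bullets above describe.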

1.2. F-F connections in Bidirectional LSTM

  • Similarly, the computations for the deep bi-directional LSTM model with F-F connections:
  • Further, two more operations are introduced in the above equations:
  • Half(f) denotes the first half of the elements of f, and Dr(h) is the Dropout operation.
  • Half() reduces the parameter size of the layer above without affecting performance.

2. Deep-ED and Deep-Att: Network Architecture

Figure: Deep-ED and Deep-Att network architecture.

2.1. Encoder

  • The LSTM layers are stacked with alternating directions, forming the so-called interleaved bidirectional encoder.

2.2. Interface

  • As a consequence of the introduced F-F connections, there are 4 output vectors at the top of the encoder (ht and ft of the last layer in both columns).
  • For Deep-ED, et is static.
  • For Deep-Att, the 4 output vectors at each time step are concatenated to obtain et, and a soft attention mechanism (as in Attention Decoder) calculates the final representation ct from et as a weighted sum over source positions, with the weights given by a softmax over alignment scores.
  • For Deep-Att, in order to reduce the memory cost, the concatenated vector et is linearly projected with Wp to a vector of 1/4 the dimension.
  • (Please feel free to read Attention Decoder if interested.)
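The interface computation can be sketched as follows. This assumes Bahdanau-style additive scoring as in Attention Decoder/RNNSearch; the names Wp, Wa, Ua, va and all dimensions here are toy assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_e = 5, 8                  # toy source length and concatenated e_t size
e = rng.normal(size=(m, d_e))  # e_t: 4 top-layer outputs concatenated per step
Wp = rng.normal(size=(d_e, d_e // 4))
e_proj = e @ Wp                # projection to 1/4 dimension to save memory

def attend(s, e_proj, Wa, Ua, va):
    # Additive attention (assumed scoring form):
    # score_t = va . tanh(Wa s + Ua e_t); weights = softmax(scores); c = sum_t a_t e_t
    scores = np.tanh(s @ Wa + e_proj @ Ua) @ va
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ e_proj          # context vector c_t

d_s, d_a = 6, 4                # toy decoder-state and attention sizes
Wa = rng.normal(size=(d_s, d_a))
Ua = rng.normal(size=(d_e // 4, d_a))
va = rng.normal(size=(d_a,))
c = attend(rng.normal(size=d_s), e_proj, Wa, Ua, va)
```

Projecting before attending means both the scoring and the weighted sum operate on the smaller `e_proj`, which is where the 1/4-dimension memory saving comes from.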

2.3. Decoder

  • There is a single column of nd stacked LSTM layers before the softmax. F-F connections are used here as well.

2.4. Other Details

  • 256 dimensional word embeddings for both the source and target languages.
  • All LSTM layers have 512 memory cells.
  • The dimension of ct is 5120 and 1280 for Deep-ED and Deep-Att respectively.
  • Beam search is used.
  • Training uses 4 to 8 GPU machines (each with 4 K40 GPU cards) running for 10 days, with parallelization at the data batch level. Each pass over the data takes nearly 1.5 days.
  • The Dropout ratio pd is 0.1.
  • In each batch, there are 500~800 sequences.
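Beam search, mentioned in the details above, keeps only the top-k partial translations at each decoding step. A toy sketch with a fixed next-token log-probability table standing in for the decoder’s softmax (the table and beam size here are hypothetical, not the paper’s setup):

```python
import math

def beam_search(step_logprobs, beam_size=3, max_len=4):
    # Each beam is (token sequence, cumulative log-probability).
    beams = [([], 0.0)]
    for _ in range(max_len):
        # Expand every beam by every possible next token...
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(step_logprobs)]
        # ...then keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy distribution over a 3-token vocabulary, favouring token 2
table = [math.log(0.2), math.log(0.3), math.log(0.5)]
best = beam_search(table, beam_size=3)  # greedy and beam agree here: [2, 2, 2, 2]
```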

3. Experimental Results

3.1. English-to-French

Results on the English-to-French task:
  • The previous best single NMT encoder-decoder model (Enc-Dec) with six layers achieves BLEU=31.5.
  • Deep-ED obtains the BLEU score of 36.3, which outperforms Enc-Dec model by 4.8 BLEU points, and outperforms Attention Decoder/RNNSearch.
  • For Deep-Att, the performance is further improved to 37.7.
  • The previous state-of-the-art performance from a conventional SMT system is also listed (Durrani et al., 2014) with the BLEU of 37.0.

This is the first time that a single NMT model trained in an end-to-end form beats the best conventional system on this task.

Effect of F-F Connections
  • F-F connections bring an improvement in BLEU.
Different LSTM layer width in Deep-Att
  • Even after doubling the LSTM layer width to 1024, Deep-Att without F-F connections only obtains a BLEU score of 33.8, still behind the corresponding Deep-Att with F-F connections.
Effect of the interleaved bi-directional encoder
  • There is a gap of about 1.5 BLEU points between the bidirectional and uni-directional encoders for both Deep-Att and Deep-ED.
Deep-Att with different model depths
  • With ne=9 and nd=7, the best score for Deep-Att is 37.7.
Encoders with different number of columns and LSTM layer width
  • A 1.1 BLEU points degradation with a single encoding column is found.

3.2. English-to-German

Results on the English-to-German task:
  • The proposed single model result with BLEU=20.6 is similar to the conventional SMT result of 20.7 (Buck et al., 2014), and outperforms Attention Decoder/RNNSearch.

There are also other results, such as post-processing and ensemble results. If interested, please feel free to read the paper.


[2016 TACL] [Deep-ED & Deep-Att]
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

Natural Language Processing (NLP)

Sequence Model: 2014 [GRU] [Doc2Vec]
Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling]
Sentence Embedding: 2015 [Skip-Thought]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]

My Other Previous Paper Readings


