Review — Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation (Deep-ED & Deep-Att)

Using Fast-Forward Connections to Help Gradient Propagation

Sik-Ho Tsang, published in Geek Culture, Nov 23, 2021


In this story, Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation, (Deep-ED & Deep-Att), by Baidu Research, is reviewed. In this paper:

  • A new type of linear connection, named the fast-forward (F-F) connection, based on deep Long Short-Term Memory (LSTM), is introduced.
  • These F-F connections help propagate gradients and enable building a deep topology of depth 16.

This is a paper in 2016 TACL with over 180 citations, where TACL has an impact score of 6.43. (Sik-Ho Tsang @ Medium)


  1. F-F connections
  2. Deep-ED and Deep-Att: Network Architecture
  3. Experimental Results

1. F-F connections

1.1. F-F connections in RNN

Figure: RNN models — (a) basic RNN, (b) equivalent split into “f” and “r” blocks, (c) two stacked layers with F-F connections.
  • (a): Basic RNN: when an input sequence {x1, …, xm} is given to a recurrent layer, the output ht at each time step t is computed as ht = σ(W·xt + U·h(t−1)).
  • (b): Basic RNN redrawn with an intermediate computational state and the sum operation (+) followed by the activation. It consists of block “f” and block “r”, and is equivalent to (a).
  • The computation can be equivalently split into two consecutive steps:
  • “f” block, feed-forward computation (left part in (b)): ft = W·xt, with no activation and no recurrence.
  • “r” block, recurrent computation (right part, with the sum operation (+) followed by the activation in (b)): ht = σ(ft + U·h(t−1)).
  • (c): Two stacked RNN layers with F-F connections, denoted by dashed red lines: the “f” block of the upper layer takes the concatenation [ft, ht] of the lower layer’s outputs as input, where [ , ] denotes concatenation.
  • F-F connections connect two feed-forward computation blocks “f” of adjacent recurrent layers.
  • The path of F-F connections contains neither nonlinear activations nor recurrent computation. It provides a fast path for information to propagate.
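The split into “f” and “r” blocks and the F-F concatenation can be sketched in a few lines. This is a minimal NumPy sketch, not the paper’s LSTM: plain sigmoid RNN cells, toy sizes, and random weights are all assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4                         # toy hidden/input size (not from the paper)

# Layer 1: f-block weight W1 (input -> f), r-block recurrent weight U1
W1 = rng.normal(size=(d, d))
U1 = rng.normal(size=(d, d))
# Layer 2: its f-block consumes the concatenation [f1, h1] via the F-F connection
W2 = rng.normal(size=(d, 2 * d))
U2 = rng.normal(size=(d, d))

def step(x_t, h1_prev, h2_prev):
    # "f" block of layer 1: pure feed-forward, no activation, no recurrence
    f1 = W1 @ x_t
    # "r" block of layer 1: sum with the recurrent term, then activation
    h1 = sigmoid(f1 + U1 @ h1_prev)
    # F-F connection: layer 2's f-block sees [f1, h1] concatenated,
    # so f1 reaches the layer above without passing through any nonlinearity
    f2 = W2 @ np.concatenate([f1, h1])
    h2 = sigmoid(f2 + U2 @ h2_prev)
    return h1, h2

h1 = h2 = np.zeros(d)
for x_t in rng.normal(size=(3, d)):   # toy sequence of length 3
    h1, h2 = step(x_t, h1, h2)
```

Note that `f1` flows into `f2` through concatenation and a linear map only, which is exactly the fast, activation-free path the bullets above describe.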

1.2. F-F connections in Bidirectional LSTM

  • Similarly, the computations for the deep bi-directional LSTM model with F-F connections:
  • Further, two more operations are introduced in the above equations:
  • Half(f) denotes the first half of the elements of f, and Dr(h) is the Dropout operation.
  • Half() reduces the parameter size of the layer above without affecting performance.

2. Deep-ED and Deep-Att: Network Architecture

Figure: Deep-ED and Deep-Att network architecture.

2.1. Encoder

  • The LSTM layers are stacked with alternating directions, forming the so-called interleaved bidirectional encoder.

2.2. Interface

  • As a consequence of the introduced F-F connections, there are 4 output vectors at the top of the encoder (ht and ft of the last layer in both columns).
  • For Deep-ED, et is static.
  • For Deep-Att, the 4 output vectors at each time step are concatenated to obtain et, and a soft attention mechanism (as in Attention Decoder) calculates the final representation ct from et as a weighted sum over source positions, with the weights given by a softmax over alignment scores.
  • For Deep-Att, in order to reduce the memory cost, the concatenated vector et is linearly projected with Wp to a vector of 1/4 the dimension.
  • (Please feel free to read Attention Decoder if interested.)
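The interface computation can be sketched as follows. This assumes Bahdanau-style additive scoring as in Attention Decoder/RNNSearch; the names Wp, Wa, Ua, va and all dimensions here are toy assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_e = 5, 8                  # toy source length and concatenated e_t size
e = rng.normal(size=(m, d_e))  # e_t: 4 top-layer outputs concatenated per step
Wp = rng.normal(size=(d_e, d_e // 4))
e_proj = e @ Wp                # projection to 1/4 dimension to save memory

def attend(s, e_proj, Wa, Ua, va):
    # Additive attention (assumed scoring form):
    # score_t = va . tanh(Wa s + Ua e_t); weights = softmax(scores); c = sum_t a_t e_t
    scores = np.tanh(s @ Wa + e_proj @ Ua) @ va
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ e_proj          # context vector c_t

d_s, d_a = 6, 4                # toy decoder-state and attention sizes
Wa = rng.normal(size=(d_s, d_a))
Ua = rng.normal(size=(d_e // 4, d_a))
va = rng.normal(size=(d_a,))
c = attend(rng.normal(size=d_s), e_proj, Wa, Ua, va)
```

Projecting before attending means both the scoring and the weighted sum operate on the smaller `e_proj`, which is where the 1/4-dimension memory saving comes from.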

2.3. Decoder

  • There is a single column of nd stacked LSTM layers before the softmax. F-F connections are used here as well.

2.4. Other Details

  • 256 dimensional word embeddings for both the source and target languages.
  • All LSTM layers have 512 memory cells.
  • The dimension of ct is 5120 and 1280 for Deep-ED and Deep-Att respectively.
  • Beam search is used.
  • Training uses 4 to 8 GPU machines (each with 4 K40 GPU cards) running for 10 days, with parallelization at the data batch level. Each pass over the data takes nearly 1.5 days.
  • The Dropout ratio pd is 0.1.
  • In each batch, there are 500~800 sequences.
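Beam search, mentioned in the details above, keeps only the top-k partial translations at each decoding step. A toy sketch with a fixed next-token log-probability table standing in for the decoder’s softmax (the table and beam size here are hypothetical, not the paper’s setup):

```python
import math

def beam_search(step_logprobs, beam_size=3, max_len=4):
    # Each beam is (token sequence, cumulative log-probability).
    beams = [([], 0.0)]
    for _ in range(max_len):
        # Expand every beam by every possible next token...
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(step_logprobs)]
        # ...then keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy distribution over a 3-token vocabulary, favouring token 2
table = [math.log(0.2), math.log(0.3), math.log(0.5)]
best = beam_search(table, beam_size=3)  # greedy and beam agree here: [2, 2, 2, 2]
```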

3. Experimental Results

3.1. English-to-French

Results on the English-to-French task:
  • The previous best single NMT encoder-decoder model (Enc-Dec) with six layers achieves BLEU=31.5.
  • Deep-ED obtains the BLEU score of 36.3, which outperforms Enc-Dec model by 4.8 BLEU points, and outperforms Attention Decoder/RNNSearch.
  • For Deep-Att, the performance is further improved to 37.7.
  • The previous state-of-the-art performance from a conventional SMT system is also listed (Durrani et al., 2014) with the BLEU of 37.0.

This is the first time that a single NMT model trained in an end-to-end form beats the best conventional system on this task.

Effect of F-F Connections
  • F-F connections bring an improvement in BLEU.
Different LSTM layer width in Deep-Att
  • Even after doubling the LSTM layer width to 1024, Deep-Att without F-F connections only obtains a BLEU score of 33.8, still behind the corresponding Deep-Att with F-F connections.
Effect of the interleaved bi-directional encoder
  • There is a gap of about 1.5 BLEU points between the bidirectional and uni-directional encoders for both Deep-Att and Deep-ED.
Deep-Att with different model depths
  • With ne=9 and nd=7, the best score for Deep-Att is 37.7.
Encoders with different number of columns and LSTM layer width
  • A 1.1 BLEU points degradation with a single encoding column is found.

3.2. English-to-German

Results on the English-to-German task:
  • The proposed single model result with BLEU=20.6 is similar to the conventional SMT result of 20.7 (Buck et al., 2014), and outperforms Attention Decoder/RNNSearch.

There are also other results, such as post-processing and ensemble results. If interested, please feel free to read the paper.


[2016 TACL] [Deep-ED & Deep-Att]
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

Natural Language Processing (NLP)

Sequence Model: 2014 [GRU] [Doc2Vec]
Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling]
Sentence Embedding: 2015 [Skip-Thought]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]

My Other Previous Paper Readings


