Nov 23, 2021

# Review — Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation (Deep-ED & Deep-Att)

## Using Fast-Forward Connections to Help Gradient Propagation

In this story, **Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation**, (Deep-ED & Deep-Att), by Baidu Research, is reviewed. In this paper:

- A new type of linear connection, named **fast-forward (F-F) connections**, based on deep Long Short-Term Memory (LSTM), is introduced. **These F-F connections help propagate the gradients** and build a deep topology of depth 16.

This is a paper in **2016 TACL** with over **180 citations**, where TACL has an **impact score of 6.43**. (Sik-Ho Tsang @ Medium)

# Outline

1. **F-F connections**
2. **Deep-ED and Deep-Att: Network Architecture**
3. **Experimental Results**

# 1. F-F connections

## 1.1. F-F connections in RNN

**(a): Basic RNN:** When an input sequence {*x*1, …, *xm*} is given to a recurrent layer, **the output *ht* at each time step *t*** can be computed as:
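Written out with bias terms omitted, the recurrence takes the familiar form below. The symbols $W_f$ (input weights), $W_r$ (recurrent weights) and $\sigma$ (activation) are assumed names, chosen to match the “f” and “r” blocks of panel (b).

$$
h_t = \sigma\left(W_f x_t + W_r h_{t-1}\right)
$$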

**(b): Basic RNN with intermediate computational state and the sum operation (+) followed by activation.** It consists of block “f” and block “r”, and is **equivalent to (a)**.

- This computation can be equivalently split into two consecutive steps:
  - **“f” block, Feed-Forward Computation**: the left part in (b).
  - **“r” block, Recurrent Computation**: the right part and the sum operation (+) followed by activation in (b).
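The equivalence between (a) and (b) can be checked numerically. The following is a minimal sketch in plain Python, assuming tanh as the activation and hypothetical weight names `W_f` (feed-forward) and `W_r` (recurrent); bias terms are left out and the toy values are illustrative only.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_f, W_r, x_t, h_prev):
    """(a): fused update h_t = tanh(W_f x_t + W_r h_{t-1})."""
    a = matvec(W_f, x_t)
    b = matvec(W_r, h_prev)
    return [math.tanh(p + q) for p, q in zip(a, b)]

def f_block(W_f, x_t):
    """'f' block: feed-forward part only, no activation and no recurrence."""
    return matvec(W_f, x_t)

def r_block(W_r, f_t, h_prev):
    """'r' block: add the recurrent term to f_t, then apply the activation."""
    return [math.tanh(p + q) for p, q in zip(f_t, matvec(W_r, h_prev))]

# Toy weights and inputs (assumed values, not from the paper).
W_f = [[0.5, -0.2], [0.1, 0.3]]
W_r = [[0.4, 0.0], [-0.3, 0.2]]
x_t, h_prev = [1.0, 2.0], [0.1, -0.1]

h_a = rnn_step(W_f, W_r, x_t, h_prev)           # form (a)
h_b = r_block(W_r, f_block(W_f, x_t), h_prev)   # form (b): "f" then "r"
```

Both `h_a` and `h_b` hold the same hidden state; the split only exposes the intermediate value `f_t`, which is what the F-F connections later tap into.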

**(c): Two stacked RNN layers with F-F connections denoted by dashed red lines**:
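A sketch of the stacked update consistent with this description, with the layer index $k$ written as a superscript. The symbols $W_f^k$ (feed-forward weights), $W_r^k$ (recurrent weights) and $\sigma$ (activation) are assumed notation, and bias terms are omitted:

$$
f_t^k =
\begin{cases}
W_f^k \, [\, f_t^{k-1},\; h_t^{k-1} \,], & k > 1 \\
W_f^k \, x_t, & k = 1
\end{cases}
\qquad
h_t^k = \sigma\left(f_t^k + W_r^k h_{t-1}^k\right)
$$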

- where **[ , ]** denotes the **concatenation**.
- F-F connections **connect two feed-forward computation blocks “f” of adjacent recurrent layers.**
- The path of F-F connections contains neither nonlinear activations nor recurrent computation. It provides **a fast path for information to propagate**.
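To illustrate why this path is "fast", here is a minimal sketch in plain Python of one time step through two stacked layers. It assumes that each upper layer's “f” block reads the concatenation of the lower layer's “f” output and hidden state, uses tanh as the activation, and picks toy weights (all hypothetical) so that layer 2 simply copies the “f” output of layer 1.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def stacked_step(layers, x_t, h_prev):
    """One time step through stacked RNN layers with F-F connections.

    layers: list of (W_f, W_r) pairs, bottom layer first.
    h_prev: previous hidden state of each layer.
    Returns (fs, hs): the 'f' outputs and hidden states of every layer.
    """
    fs, hs = [], []
    for k, (W_f, W_r) in enumerate(layers):
        # Bottom layer reads the input; upper layers read [f, h] of the layer below.
        inp = x_t if k == 0 else fs[-1] + hs[-1]  # list '+' is concatenation
        f_k = matvec(W_f, inp)                    # linear, no activation
        h_k = [math.tanh(p + q) for p, q in zip(f_k, matvec(W_r, h_prev[k]))]
        fs.append(f_k)
        hs.append(h_k)
    return fs, hs

# Layer 2's W_f passes through the f-part of its input and ignores the h-part.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 0.5]]),
    ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]),
]
fs, hs = stacked_step(layers, [0.3, -0.7], [[0.1, 0.2], [0.0, 0.0]])
```

With these weights, `fs[1]` equals `fs[0]`: the signal reaches the second layer through matrix products only, without passing through tanh or the recurrent term.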

## 1.2. F-F connections in Bidirectional LSTM

- Similarly, the computations for the **deep bi-directional LSTM model with F-F connections**:

- Further, **two more operations** are introduced in the above equations:

- **Half(*f*)** denotes the **first half of the elements of *f***, and **Dr(*h*)** is the **Dropout** operation.
- **The use of Half() is to reduce the parameter size** and does not affect the performance.
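The two operations are simple enough to sketch directly. The plain-Python version below is illustrative only: the "inverted" rescaling in `dropout` is an assumption (the text only says Dr is Dropout), and `next_layer_input` is a hypothetical helper showing one plausible way the two results could be concatenated before the next layer.

```python
import random

def half(f):
    """Half(f): keep the first half of the elements of f."""
    return f[: len(f) // 2]

def dropout(h, p, rng, training=True):
    """Dr(h): zero each element with probability p.

    Assumption: 'inverted' dropout, where kept elements are rescaled by
    1 / (1 - p) during training so nothing needs rescaling at test time.
    """
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]

def next_layer_input(f, h, p, rng, training=True):
    """Hypothetical combination: concatenate Half(f) with Dr(h)."""
    return half(f) + dropout(h, p, rng, training)

rng = random.Random(0)  # seeded for repeatability
inp = next_layer_input([1.0, 2.0, 3.0, 4.0], [0.5, -0.5], p=0.1, rng=rng)
```

Because `half` discards half of *f* before it is fed onward, the weight matrix that consumes it needs correspondingly fewer columns, which is where the parameter saving comes from.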

# 2. Deep-ED and Deep-Att: Network Architecture

## 2.1. Encoder

- The LSTM layers are stacked to form a so-called interleaved bidirectional encoder.

## 2.2. Interface

- As a consequence of the introduced F-F connections, we have **4 output vectors** (*h* and *f* of the top encoder layer *ne* at time step *t*, from both columns).
- For **Deep-ED**, *et* is static.
- For **Deep-Att**, **only the 4 output vectors at each time step are concatenated to obtain *et***, and a soft attention mechanism in Attention Decoder is used to calculate the final representation *ct* from *et*:

- where:

- For **Deep-Att**, in order to reduce the memory cost, **the concatenated vector *et* is linearly projected with *Wp* to a vector with 1/4 of the dimension size**.
- (Please feel free to read Attention Decoder if interested.)
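The projection and the soft attention step for Deep-Att can be sketched as follows. This is a minimal plain-Python illustration, not the paper's implementation: the dot-product scoring against a decoder `query` vector is a simplified stand-in for the alignment function of the Attention Decoder, and the names `attend`, `W_p` and `query` as well as the toy sizes are assumptions.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(e_seq, W_p, query):
    """Project each e_t with W_p, then return (c, alphas).

    Assumption: scores are plain dot products between `query` and the
    projected vectors; the real alignment function may differ.
    """
    projected = [matvec(W_p, e_t) for e_t in e_seq]
    scores = [sum(q * v for q, v in zip(query, p_t)) for p_t in projected]
    alphas = softmax(scores)  # attention weights over source positions
    dim = len(projected[0])
    c = [sum(a * p_t[i] for a, p_t in zip(alphas, projected)) for i in range(dim)]
    return c, alphas

# Toy sizes: e_t has 8 elements, the projection keeps 8 // 4 = 2 of them.
W_p = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
e_seq = [[1.0, 0.0] + [0.0] * 6, [0.0, 1.0] + [0.0] * 6]
c, alphas = attend(e_seq, W_p, query=[0.0, 0.0])
```

Projecting before attending means every source position stores a vector one quarter of the original size, which is the memory saving mentioned above.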

## 2.3. Decoder

- There is a single column of *nd* stacked LSTM layers before softmax. F-F connections are also used.

## 2.4. Other Details

- **256 dimensional word embeddings** are used for both the source and target languages.
- All **LSTM** layers have **512 memory cells**.
- **The dimension of *ct*** is 5120 and 1280 for Deep-ED and Deep-Att respectively.
- Beam search is used.
- **4-to-8 GPU machines** (each has 4 K40 GPU cards) run for **10 days** to train the full model with parallelization at the data batch level. It takes nearly 1.5 days for each pass.
- The Dropout ratio *pd* is 0.1.
- In each batch, there are 500~800 sequences.

# 3. Experimental Results

## 3.1. English-to-French

- The previous best single NMT encoder-decoder model (Enc-Dec) with six layers achieves BLEU=31.5.
- **Deep-ED** obtains the **BLEU score of 36.3**, which outperforms the Enc-Dec model by 4.8 BLEU points, and outperforms Attention Decoder/RNNSearch.
- For **Deep-Att**, the performance is further improved to **37.7**.
- The previous state-of-the-art performance from a conventional SMT system is also listed (Durrani et al., 2014) with the BLEU of 37.0.

This is the **first time** that a **single NMT model** trained in an end-to-end form **beats the best conventional system** on this task.

- F-F connections bring an improvement in BLEU.

- After using a **two times larger LSTM layer width** of 1024, Deep-Att without F-F can only obtain a BLEU score of 33.8. It is **still behind the corresponding Deep-Att with F-F**.

**With bidirectional LSTM**, there is **a gap of about 1.5 points** between these two encoders for both Deep-Att and Deep-ED.

**With *ne*=9 and *nd*=7**, the **best** score for Deep-Att is 37.7.

**A degradation of 1.1 BLEU points** is found with a **single encoding column**.

## 3.2. English-to-German

- The proposed **single model** result with BLEU=**20.6** is **similar to the conventional SMT result of 20.7** (Buck et al., 2014), and outperforms Attention Decoder/RNNSearch.

There are also other results, such as post-processing and ensemble results. If interested, please feel free to read the paper.

## Reference

[2016 TACL] [Deep-ED & Deep-Att]

Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

## Natural Language Processing (NLP)

- **Sequence Model: 2014** [GRU] [Doc2Vec]
- **Language Model: 2007** [Bengio TNN’07] **2013** [Word2Vec] [NCE] [Negative Sampling]
- **Sentence Embedding: 2015** [Skip-Thought]
- **Machine Translation: 2014** [Seq2Seq] [RNN Encoder-Decoder] **2015** [Attention Decoder/RNNSearch] **2016** [GNMT] [ByteNet] [Deep-ED & Deep-Att]
- **Image Captioning: 2015** [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]