Review — Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation (Deep-ED & Deep-Att)
Using Fast-Forward Connections to Help Gradient Propagation
In this story, Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation (Deep-ED & Deep-Att), by Baidu Research, is reviewed. In this paper:
- A new type of linear connection, named fast-forward (F-F) connections, is introduced for deep Long Short-Term Memory (LSTM) models.
- These F-F connections help propagate the gradients and make it possible to build a deep topology of depth 16.
This is a paper in 2016 TACL with over 180 citations, where TACL has an impact score of 6.43. (Sik-Ho Tsang @ Medium)
Outline
- F-F connections
- Deep-ED and Deep-Att: Network Architecture
- Experimental Results
1. F-F connections
1.1. F-F connections in RNN
- (a): Basic RNN: when an input sequence {x1, …, xm} is given to a recurrent layer, the output ht at each time step t can be computed as ht = σ(W·xt + U·ht−1), where σ is the activation function.
- (b): Basic RNN with intermediate computational state and the sum operation (+) followed by activation. It consists of block “f” and block “r”, and is equivalent to (a).
- This computation can be equivalently split into two consecutive steps:
- “f” block, the feed-forward computation (left part in (b)): ft = W·xt.
- “r” block, the recurrent computation (right part, i.e., the sum operation (+) followed by the activation in (b)): ht = σ(ft + U·ht−1).
- (c): Two stacked RNN layers with F-F connections, denoted by dashed red lines: the “f” block of the upper layer takes the concatenation [ft, ht] of the lower layer’s feed-forward output and hidden state as its input (see the sketch after this list),
- where [ , ] denotes the concatenation.
- F-F connections connect two feed-forward computation blocks “f” of adjacent recurrent layers.
- The path of F-F connections contains neither nonlinear activations nor recurrent computation. It provides a fast path for information to propagate.
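To make the wiring concrete, below is a minimal NumPy sketch of two stacked recurrent layers with an F-F connection, using a plain tanh RNN in place of the paper’s LSTM layers; the dimensions and weight names are illustrative assumptions, not the authors’ implementation.
```python
import numpy as np

# Minimal NumPy sketch (not the authors' code) of two stacked recurrent
# layers with an F-F connection. A plain tanh RNN stands in for the paper's
# LSTM layers; dimensions and weight names are illustrative assumptions.

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                                   # input and hidden sizes (assumed)

# Layer 1: feed-forward ("f") and recurrent ("r") weights.
W1 = rng.standard_normal((d_h, d_in)) * 0.1
U1 = rng.standard_normal((d_h, d_h)) * 0.1
# Layer 2: its "f" block reads the concatenation [f1, h1] -- the F-F path.
W2 = rng.standard_normal((d_h, 2 * d_h)) * 0.1
U2 = rng.standard_normal((d_h, d_h)) * 0.1

def step(x_t, h1_prev, h2_prev):
    """One time step of the two-layer RNN with an F-F connection."""
    f1 = W1 @ x_t                         # "f" block, layer 1: linear, no recurrence
    h1 = np.tanh(f1 + U1 @ h1_prev)       # "r" block, layer 1
    ff = np.concatenate([f1, h1])         # F-F connection: f1 is passed straight up
    f2 = W2 @ ff                          # "f" block, layer 2
    h2 = np.tanh(f2 + U2 @ h2_prev)       # "r" block, layer 2
    return h1, h2

h1 = h2 = np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):          # a toy input sequence of length 5
    h1, h2 = step(x_t, h1, h2)
print(h2.shape)                                      # (8,)
```
Note how the F-F path (f1 into the concatenation) passes through no nonlinearity and no recurrence, which is exactly what gives information and gradients a fast route across layers.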
1.2. F-F connections in Bidirectional LSTM
- Similarly, F-F connections are added to the deep bidirectional LSTM model; the computations follow the same pattern, with the LSTM cell replacing the simple recurrent computation above.
- Further, two more operations are introduced in these equations:
- Half(f) denotes the first half of the elements of f, and Dr(h) is the Dropout operation (a small sketch of these two operations follows this list).
- The use of Half() reduces the parameter size and does not affect the performance.
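As a quick illustration (not the paper’s code), Half() and Dr() could be implemented as below; exactly where they are applied inside the bidirectional LSTM equations follows the paper and is not reproduced here.
```python
import numpy as np

# Hypothetical helpers illustrating the Half() and Dr() operations described
# above; the paper's exact dropout variant and the exact place where these
# operations enter the LSTM equations are not reproduced here.

def half(f):
    """Keep the first half of the elements of f (halves the F-F path width)."""
    return f[: f.shape[-1] // 2]

def dropout(h, p=0.1, rng=None):
    """Inverted dropout with ratio p (the paper uses a dropout ratio of 0.1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

f = np.arange(8.0)
print(half(f))            # [0. 1. 2. 3.]
print(dropout(f).shape)   # (8,)
```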
2. Deep-ED and Deep-Att: Network Architecture
2.1. Encoder
- The LSTM layers are stacked with their reading directions alternating from layer to layer, forming the so-called interleaved bidirectional encoder (see the sketch below).
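The sketch below only illustrates this alternating-direction layer ordering, assuming that “interleaved” means adjacent stacked layers read the sequence in opposite directions; a cumulative sum stands in for a real LSTM layer.
```python
import numpy as np

# Illustration only: stacked "recurrent" layers whose reading direction
# alternates layer by layer. A cumulative sum stands in for a real LSTM.

def toy_recurrent_layer(seq):
    """Stand-in for an LSTM layer: each output depends on all earlier inputs."""
    return np.cumsum(seq, axis=0)

def interleaved_encode(seq, n_layers=4):
    out = seq
    for depth in range(n_layers):
        if depth % 2 == 1:                       # odd layers read the sequence backwards
            out = toy_recurrent_layer(out[::-1])[::-1]
        else:                                    # even layers read it forwards
            out = toy_recurrent_layer(out)
    return out

print(interleaved_encode(np.ones((5, 3))).shape)  # (5, 3)
```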
2.2. Interface
- As a consequence of the introduced F-F connections, there are 4 output vectors at each time step (the top-layer ht and ft of both encoder columns, i.e., of layer ne).
- For Deep-ED, et is static.
- For Deep-Att, the 4 output vectors at each time step are concatenated to obtain et, and a soft attention mechanism as in Attention Decoder is used to calculate the final representation ct from et:
- ct is a weighted sum of the et, where the weights come from a softmax over alignment scores between the previous decoder state and each et (a minimal sketch is given after this list).
- For Deep-Att, in order to reduce the memory cost, the concatenated vector et is linearly projected with Wp to a vector with 1/4 of the dimension.
- (Please feel free to read Attention Decoder if interested.)
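The sketch below shows this soft attention step in isolation (standard additive attention in the style of Attention Decoder/RNNSearch); the sizes and the weight names W_a, U_a, v_a are illustrative assumptions, not taken from the paper.
```python
import numpy as np

# Minimal NumPy sketch of the soft attention step (standard additive
# attention as in Attention Decoder/RNNSearch). The sizes and the weights
# W_a, U_a, v_a are illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)
d_e, d_s, d_a, m = 16, 8, 12, 6        # e_t size, decoder state size, attention size, source length

E = rng.standard_normal((m, d_e))      # encoder interface outputs e_1 .. e_m
s_prev = rng.standard_normal(d_s)      # previous decoder hidden state

W_a = rng.standard_normal((d_a, d_s)) * 0.1
U_a = rng.standard_normal((d_a, d_e)) * 0.1
v_a = rng.standard_normal(d_a) * 0.1

scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ e_j) for e_j in E])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                   # softmax over the m source positions
c_t = alpha @ E                        # context: weighted sum of the e_j

print(alpha.shape, c_t.shape)          # (6,) (16,)
```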
2.3. Decoder
- There is a single column of nd stacked LSTM layers before the softmax output layer. F-F connections are also used here.
2.4. Other Details
- 256-dimensional word embeddings for both the source and target languages.
- All LSTM layers have 512 memory cells.
- The dimension of ct is 5120 and 1280 for Deep-ED and Deep-Att respectively.
- Beam search is used for decoding (a minimal sketch follows this list).
- Training the full model takes 4 to 8 machines (each with 4 K40 GPU cards) running for 10 days, with parallelization at the data batch level. Each pass through the data takes nearly 1.5 days.
- The Dropout ratio pd is 0.1.
- In each batch, there are 500~800 sequences.
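For the decoding step, here is a minimal, generic beam-search sketch; the toy next-token scorer, vocabulary, beam size, and length limit are all assumptions standing in for the real decoder LSTM and softmax.
```python
import math

# Generic beam-search sketch (not the paper's decoder). The toy next-token
# scorer, vocabulary, beam size, and length limit are all assumptions.

VOCAB = {0: "<eos>", 1: "a", 2: "b"}

def next_token_logprobs(prefix):
    """Toy stand-in for the model: a fixed distribution regardless of prefix."""
    return [math.log(p) for p in (0.2, 0.5, 0.3)]

def beam_search(beam_size=2, max_len=5):
    beams = [([], 0.0)]                               # (token sequence, log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(next_token_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:     # keep the best beam_size expansions
            (finished if seq[-1] == 0 else beams).append((seq, score))
        if not beams:                                 # every kept hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])

seq, score = beam_search()
print([VOCAB[t] for t in seq], round(score, 3))
```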
3. Experimental Results
3.1. English-to-French
- The previous best single NMT encoder-decoder model (Enc-Dec) with six layers achieves BLEU=31.5.
- Deep-ED obtains a BLEU score of 36.3, which outperforms the Enc-Dec model by 4.8 BLEU points and also outperforms Attention Decoder/RNNSearch.
- For Deep-Att, the performance is further improved to 37.7.
- The previous state-of-the-art performance from a conventional SMT system (Durrani et al., 2014) is also listed, with a BLEU of 37.0.
This is the first time that a single NMT model trained in an end-to-end form beats the best conventional system on this task.
- F-F connections bring a clear improvement in BLEU.
- Even with a two-times-larger LSTM layer width of 1024, Deep-Att without F-F connections only obtains a BLEU score of 33.8, still behind the corresponding Deep-Att with F-F connections.
- Comparing the bidirectional LSTM encoder with its unidirectional counterpart, there is a gap of about 1.5 BLEU points between these two encoders for both Deep-Att and Deep-ED.
- With ne=9 and nd=7, the best score for Deep-Att is 37.7.
- A degradation of 1.1 BLEU points is found when only a single encoding column is used.
3.2. English-to-German
- The proposed single model result with BLEU=20.6 is similar to the conventional SMT result of 20.7 (Buck et al., 2014), and outperforms Attention Decoder/RNNSearch.
There are also other results, such as post-processing and ensemble results. If interested, please feel free to read the paper.
Reference
[2016 TACL] [Deep-ED & Deep-Att]
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation
Natural Language Processing (NLP)
Sequence Model: 2014 [GRU] [Doc2Vec]
Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling]
Sentence Embedding: 2015 [Skip-Thought]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]