Addressing the Limitations of RNNs in NLP Problems by Using Transformer-XL
Limitations of recurrent neural networks
A Recurrent Neural Network (RNN) offers a way to learn from a sequence of inputs. The drawback is that it is difficult to optimize due to the vanishing gradient problem. The Transformer (Al-Rfou et al., 2018) was introduced to overcome this limitation of RNNs. By design, a fixed-length segment is defined to reduce resource consumption.
However, this introduces another problem, called context fragmentation. If the input sequence is longer than the pre-defined segment length, it has to be split up, and information cannot be captured across segments. Transformer-XL was introduced by Dai et al. (2019) to overcome this limitation.
Vanilla Transformer
To reduce computing resources, the input sequence is split into fixed-length segments. Dai et al. refer to this model as the vanilla Transformer.

The first limitation is that information cannot be shared across segments. Although the Transformer is less affected by the vanishing gradient problem, its capability is limited when the input sequence length is fixed. The second limitation is caused by padding: since fixed-length input is required, padding is added whenever the input is shorter than the pre-defined length. The split does not respect sentence or semantic boundaries.
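To make this concrete, here is a minimal sketch (not from the original article; the function name and padding id are illustrative) of how a vanilla Transformer prepares its input: the token sequence is chopped into fixed-length segments and the last segment is padded. Tokens in one segment can never attend to tokens in another, which is exactly the context fragmentation described above.

```python
def split_into_segments(token_ids, segment_len, pad_id=0):
    """Split a list of token ids into fixed-length segments, padding the last one."""
    segments = []
    for start in range(0, len(token_ids), segment_len):
        segment = token_ids[start:start + segment_len]
        # Pad the final segment so every segment has the same length.
        segment = segment + [pad_id] * (segment_len - len(segment))
        segments.append(segment)
    return segments

# Example: a 10-token sequence with a segment length of 4.
tokens = list(range(1, 11))          # [1, 2, ..., 10]
print(split_into_segments(tokens, 4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
# Token 5 cannot attend to tokens 1-4, and the split ignores
# sentence or semantic boundaries.
```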
Transformer-XL
Transformer-XL (extra long) was born to tackle the vanilla Transformer's limitations.
Instead of leaving segments disconnected, the hidden state sequence of the previous segment is reused when computing the next segment. In theory, multiple previous segments can be cached so that the current segment can reach even more information across segments, as the sketch below illustrates.
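The following is a simplified sketch of this segment-level recurrence, not the paper's full architecture: it assumes a single layer, uses PyTorch's standard `torch.nn.MultiheadAttention` as a stand-in for Transformer-XL's attention (which also uses relative positional encodings), and picks arbitrary shapes. The key idea it shows is that the cached hidden states of the previous segment are concatenated to the keys and values, with gradients stopped, so the current segment can attend beyond its own boundary.

```python
import torch

def attend_with_memory(attn_layer, current_hidden, memory):
    """current_hidden: [seg_len, batch, d_model]; memory: [mem_len, batch, d_model] or None."""
    if memory is not None:
        # Gradients are not propagated into the cached segment (stop-gradient).
        context = torch.cat([memory.detach(), current_hidden], dim=0)
    else:
        context = current_hidden
    # Queries come from the current segment only; keys/values include the memory.
    out, _ = attn_layer(current_hidden, context, context)
    return out

d_model, seg_len = 16, 4
attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=2)
memory = None
for step in range(3):                            # process three consecutive segments
    segment = torch.randn(seg_len, 1, d_model)   # [seq, batch, d_model]
    hidden = attend_with_memory(attn, segment, memory)
    # Cache this segment's hidden states for the next segment
    # (several previous segments could be kept for an even longer context).
    memory = hidden
print(memory.shape)  # torch.Size([4, 1, 16])
```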
Another feature of the input is the positional encoding. Instead of absolute positions, relative positional encodings are leveraged to prevent ambiguity when segments are reused. Therefore, any word has a relative distance to every single…