
Addressing the Limitations of RNNs in NLP Problems by Using Transformer-XL

Limitations of recurrent neural networks

Edward Ma · Published in Towards AI · 3 min read · Aug 12, 2019


A Recurrent Neural Network (RNN) offers a way to learn from a sequence of inputs. The drawback is that it is difficult to optimize due to the vanishing gradient problem. The Transformer (Al-Rfou et al., 2018) was introduced to overcome this limitation of RNNs. By design, the input is split into fixed-length segments to reduce resource consumption.

However, this creates another problem, called context fragmentation. If the input sequence is longer than the pre-defined segment length, it has to be split up, and information cannot be captured across segments. Transformer-XL was introduced by Dai et al. (2019) to overcome this limitation.

Vanilla Transformer

To reduce computing resources, the input sequence is split into fixed-length segments. Dai et al. refer to this architecture as the vanilla Transformer.

Information cannot be shared across segments under the vanilla Transformer architecture (Dai et al., 2019)

The first limitation is that information cannot be shared across segments. Although the Transformer is less affected by the vanishing gradient problem, its capability is limited when the input sequence length is fixed. The second limitation is caused by padding: because a fixed-length input is required, padding is added whenever the input is shorter than the pre-defined length, and the split does not respect sentence or semantic boundaries.
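To make the splitting concrete, here is a minimal Python sketch (not from the paper) of how a fixed-length segmenter with padding behaves; the segment length of 4 and pad id of 0 are illustrative assumptions.

```python
# A minimal sketch of vanilla Transformer preprocessing: the token
# sequence is chopped into fixed-length segments and the last segment
# is padded. Tokens in one segment never attend to tokens in another,
# which is the source of context fragmentation.

SEGMENT_LEN = 4   # pre-defined fixed segment length (illustrative)
PAD_ID = 0        # hypothetical padding token id

def split_into_segments(token_ids):
    segments = []
    for start in range(0, len(token_ids), SEGMENT_LEN):
        segment = token_ids[start:start + SEGMENT_LEN]
        # pad the final segment up to the fixed length,
        # regardless of sentence or semantic boundaries
        segment = segment + [PAD_ID] * (SEGMENT_LEN - len(segment))
        segments.append(segment)
    return segments

print(split_into_segments([11, 12, 13, 14, 15, 16, 17, 18, 19, 20]))
# [[11, 12, 13, 14], [15, 16, 17, 18], [19, 20, 0, 0]]
# Each inner list is processed independently; token 15 never sees 11-14.
```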

Transformer-XL

Transformer-XL (extra long) was designed to tackle the vanilla Transformer's limitations.

Instead of treating segments as disconnected, the hidden-state sequence of the previous segment is reused when computing the next segment. In theory, we can cache multiple previous segments so that the current segment can access information further back across segment boundaries.
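A rough numpy sketch of this recurrence, assuming a toy stand-in layer rather than a real Transformer block: the previous segment's hidden states are cached and concatenated as extra context for the current segment.

```python
import numpy as np

# Segment-level recurrence sketch (illustrative only). The hidden states
# of the previous segment are kept as "memory" and concatenated with the
# current segment; in the real model, gradients do not flow into memory.

d_model, seg_len = 8, 4
W = np.random.randn(d_model, d_model) * 0.1  # toy layer weights

def toy_layer(segment_embed, memory):
    # context = cached previous-segment states + current segment
    context = np.concatenate([memory, segment_embed], axis=0)
    # stand-in for self-attention over the extended context:
    # every position in the current segment can also see memory positions
    attn = segment_embed @ context.T                        # (seg_len, mem+seg)
    attn = np.exp(attn) / np.exp(attn).sum(-1, keepdims=True)
    return (attn @ context) @ W                             # new hidden states

memory = np.zeros((seg_len, d_model))   # empty memory for the first segment
for step in range(3):                   # iterate over consecutive segments
    segment_embed = np.random.randn(seg_len, d_model)
    hidden = toy_layer(segment_embed, memory)
    memory = hidden                     # cache for the next segment
```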

Another key ingredient is the positional encoding. Instead of absolute positions, relative positional encodings are used to avoid ambiguity when hidden states are reused across segments. Therefore, any word has a relative distance to every single…
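As a rough illustration of the idea (a minimal sketch under assumed memory and segment lengths, not the paper's exact formulation), attention can be biased by the distance between query and key positions rather than by their absolute indices:

```python
import numpy as np

# Relative positions: each query in the current segment measures its
# distance to every key position, including cached memory positions.
# The same distance always maps to the same encoding, no matter which
# segment we are in, so reused hidden states are not misinterpreted.

mem_len, seg_len = 4, 4                           # illustrative lengths
query_pos = np.arange(seg_len) + mem_len          # current segment positions
key_pos = np.arange(mem_len + seg_len)            # memory + current positions

rel_dist = query_pos[:, None] - key_pos[None, :]  # (seg_len, mem_len+seg_len)
print(rel_dist)
# Row i holds the distances from query i to all keys, e.g. [4 3 2 1 0 -1 -2 -3]
```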
