Towards Building Your Own AI: Unpack the Secret of Seq2Seq Learning and Bahdanau Attention Mechanism

Eric S. Shi 舍予
Published in Artificial Corner · May 30, 2023

Artificial intelligence (AI) is a rapidly growing field with the potential to revolutionize many aspects of our lives. From self-driving cars to medical diagnosis, AI is already used to solve some of the world’s most pressing problems [1, 2]. AI will only become more powerful and capable as it continues to develop. How can we learn to build our own AI to make ourselves a part of this revolution?

One of the most important advances in AI in recent years has been the development of sequence-to-sequence (Seq2Seq) learning. On the surface, it allows computers to learn to translate between different data sequences. Deep inside, this technology enables a wide variety of AI applications, with the most obvious ones including machine translation, speech recognition, and text generation.

Another important advance in AI is the attention mechanism. As discussed in my earlier articles, Selective Attention: The Key to Unlocking the Full Potential of Deep Learning [3] and Attention Mechanisms in Transformers [4], the attention mechanism is a technique that allows a model to focus on specific parts of a sequence when learning to translate between sequences. This technique has been shown to significantly improve the accuracy of Seq2Seq models.

This article is the first in a series aimed at helping novices create basic AIs from their home-office desks. It explores the secrets of the encoder-decoder algorithm, Seq2Seq learning, and the Bahdanau attention mechanism, discusses how these technologies work, and lays down a theoretical foundation to assist you in building your own AIs.

1. Encoder-decoder Architecture

To build an AI, you need a computer equipped with algorithms (often also called models) that can learn and improve certain skills. The encoder-decoder algorithm (i.e., architecture) is a widely used approach for building machine learning (ML) capabilities.

This architecture is typically made of two parts: an encoder and a decoder. The encoder takes the input sequence and converts it into a vector representation. Such a representation is often referred to as a hidden or “latent” representation. The decoder then takes this hidden representation and generates the output sequence.

The encoder and decoder can be implemented using a variety of neural networks (NNs), such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs). RNNs are good at processing sequential data, while CNNs are good at processing non-sequential (e.g., spatial) data.

For most NLP tasks, such as machine translation, text summarization, and question answering, an RNN encoder paired with an RNN decoder is likely a good choice, because most of the data to be processed in an NLP task are sequential.
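To make the encoder half concrete, here is a minimal PyTorch sketch of an RNN (GRU) encoder. The vocabulary size and layer dimensions are arbitrary placeholders chosen for illustration, not values from any particular system.

# A minimal sketch of an RNN (GRU) encoder; sizes are illustrative assumptions only.
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        outputs, hidden = self.rnn(embedded)   # outputs: per-step states, hidden: final state
        return outputs, hidden                 # hidden acts as the "latent" summary

# Usage: encode a batch of two 5-token sentences (the token ids are placeholders).
encoder = RNNEncoder()
tokens = torch.randint(0, 1000, (2, 5))
outputs, context = encoder(tokens)
print(outputs.shape, context.shape)   # torch.Size([2, 5, 128]) torch.Size([1, 2, 128])

A decoder of the same flavor would consume the final hidden state and emit the output sequence token by token, as discussed in the sections below.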

On the other hand, if the data at hand include both sequential and non-sequential types, an RNN-encoder/CNN-decoder architecture can be expected to achieve better performance, because the RNN encoder handles the sequential data well and the CNN decoder handles the non-sequential data well. Case examples where both sequential and non-sequential data are involved include:

  • Image captioning
  • Visual question answering
  • Speech recognition
  • Natural language generation
  • Machine translation

An RNN-encoder/CNN-decoder architecture can be a better choice for at least the following reasons:

  • It can perform better on tasks requiring sequential and non-sequential data processing.
  • It can be more robust to noise and outliers.
  • It can be more efficient to train.

For your easy reference, Table 1 summarizes the input and output data characteristics for the above-listed five tasks:

Table 1. Characteristics of the input and output data of certain NLP tasks

If the input data contain both sequential and non-sequential components, it is, in theory, best to incorporate both an RNN and a CNN in the encoder. The flip side, however, is increased encoder complexity. For the purpose of this article, i.e., laying down a foundation to help you build your own AI from your home-office desk, we will not go there for now.

2. Selections Based on Data Characteristics

The examples below illustrate how the target language texts may be machine translated in different orders:

  • An English sentence, “The cat sat on the mat”, can be machine translated into French as “Le tapis était sur le chat” or “Le chat était assis sur le tapis”. One of them reverses the meaning of the original.
  • A Spanish sentence, “El perro come la carne”, can be machine translated into English as “The dog eats the meat” or “The meat eats the dog”. One of them reverses the meaning of the original.
  • In German, the sentence “Der Mann sieht die Frau” can be machine translated into English as “The man sees the woman” or “The woman sees the man”. One of them reverses the meaning of the original.

Generally speaking, in machine translation, the source language text is sequential, but the target language text may not adopt the same sequential structure as the source. In ML, this is loosely described as the target language text being non-sequential. This characteristic is why either an RNN or a CNN may be used as the decoder in language models.

For instance, in the above-mentioned case of translating the English sentence into French, without a proper attention mechanism to capture long-range relationships (i.e., the broad semantic context), an RNN-based machine translator may have a difficult time deciding whether “Le tapis était sur le chat” or “Le chat était assis sur le tapis” should be rejected. Comparatively, a CNN decoder-based machine translator would likely reject “Le tapis était sur le chat” easily, because the CNN decoder can learn long-range relationships among the words of a text and therefore recognize that the word order in “Le tapis était sur le chat” reverses the intended meaning and is not a plausible rendering of the source sentence.

A CNN-decoder has an advantage over an RNN-decoder in handling such target language texts because CNNs are able to learn to represent patterns in data without relying on learning the order of the data. In contrast, an RNN decoder will have to rely on the learned sequential order of the data. This makes CNNs well-suited for handling target language texts that can be translated in arbitrary order.

Even if such cases are not often encountered in routine machine translation (most languages follow a subject-verb-object order), they are important to consider because they show that target language texts can be non-sequential. Moreover, some languages place the verb before the subject or even allow free word order. Ideally, then, a machine translation model should be able to handle both sequential and non-sequential target language texts.

In addition to using a CNN decoder, as described above, using a beam search decoder is another way to handle non-sequential data. A beam search decoder generates multiple candidate translations for a given source language text and then selects the one that is most likely to be correct; a minimal sketch follows.
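The sketch below illustrates the idea in plain Python. The step_fn argument is a hypothetical stand-in for a trained decoder that returns log-probabilities for the next token; a real system would plug in its own model and typically add refinements such as length normalization.

# A minimal beam search sketch. `step_fn` is a hypothetical stand-in for a trained decoder.
import math

def beam_search(step_fn, beam_width=3, max_len=10, eos="<eos>"):
    beams = [([], 0.0)]                       # (tokens so far, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:  # finished hypotheses are kept as-is
                candidates.append((tokens, score))
                continue
            log_probs = step_fn(tokens)       # dict: token -> log-probability of next token
            for tok, lp in log_probs.items():
                candidates.append((tokens + [tok], score + lp))
        # keep only the `beam_width` highest-scoring partial translations
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                        # most likely complete sequence

# Toy usage with a fake, uniform "model" over a three-word vocabulary.
vocab = ["le", "chat", "<eos>"]
fake_step = lambda tokens: {tok: math.log(1.0 / len(vocab)) for tok in vocab}
print(beam_search(fake_step))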

3. Feature Maps for Images

Handling images differs significantly from language translation, as 2D or 3D images are typically non-sequential data. They are spatial data.

Although an RNN encoder is suited to handling sequential data, it can still be trained to handle spatial data in a “twisted” way, i.e., to represent the spatial information of an image by attending to different parts of the image in sequence.

This is done by first converting the image into a sequence of feature maps. Each feature map represents a different spatial location in the image. The RNN encoder then attends to each feature map in turn and learns to associate each feature map with a set of tokens (e.g., words or phrases). This allows the RNN encoder to learn a representation of the image that is both spatial and semantic. This is illustrated in Figure 1.

Figure 1. An illustration of converting an image to a sequence of feature maps (drawn by the author).
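As a concrete (if simplified) version of Figure 1, the PyTorch sketch below runs a dummy image through a small CNN and flattens the resulting grid into a sequence of feature vectors, one per spatial location. The layer sizes are illustrative assumptions, not those of any published model.

# A minimal sketch of turning an image into a sequence of feature-map vectors.
import torch
import torch.nn as nn

conv = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # 224x224 -> 112x112
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 112x112 -> 56x56
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((7, 7)),                           # pool down to a 7x7 spatial grid
)

image = torch.randn(1, 3, 224, 224)          # a dummy RGB image
feature_grid = conv(image)                   # (1, 64, 7, 7)

# Flatten the 7x7 grid into a sequence of 49 feature vectors, one per spatial location,
# which an RNN encoder (or an attention mechanism) can then attend to in turn.
batch, channels, h, w = feature_grid.shape
feature_seq = feature_grid.view(batch, channels, h * w).permute(0, 2, 1)
print(feature_seq.shape)                     # torch.Size([1, 49, 64])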

For example, consider an image of a cat sitting on a chair. The encoder might first attend to the feature map representing the cat’s face. This would allow the encoder to learn that the word “cat” is associated with this feature map. The encoder might then attend to the feature map that represents the chair. This would allow the encoder to learn that the word “chair” is associated with this feature map. By attending to different parts of the image, the encoder can learn to represent the spatial information in the image and associate this spatial information with specific words or phrases.

The whole image can thus be represented as a sequence of feature maps, each representing a different spatial location: for example, the first feature map might represent the top-left corner of the image, the second the top-right corner, and so on.

The spatial information in the image can be associated with specific words or phrases by using an attention mechanism. The attention mechanism allows the encoder to learn which feature maps are most relevant to each word or phrase.

The attention mechanism works by first calculating a score for each feature map, based on the similarity between the feature map and the word or phrase. The feature map with the highest score is then attended to, and the process is repeated for each word or phrase. By attending to the most relevant feature maps, the encoder can learn to associate spatial information with specific words or phrases. (See references [3] and [4] for more discussion of attention mechanisms.)
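The toy sketch below shows the scoring step with random stand-in tensors: each feature-map vector is scored against a word embedding, and a softmax turns the scores into attention weights. It uses a soft weighted sum over all locations rather than hard-selecting only the top-scoring map; both variants appear in the literature.

# A minimal sketch of scoring feature maps against a word embedding (dimensions are placeholders).
import torch
import torch.nn.functional as F

feature_seq = torch.randn(49, 64)      # 49 spatial locations, 64-dim features
word_vec = torch.randn(64)             # embedding of a word such as "cat"

scores = feature_seq @ word_vec        # dot-product similarity per location
weights = F.softmax(scores, dim=0)     # attention weights sum to 1
attended = weights @ feature_seq       # weighted sum: the image content "for" this word
print(weights.argmax().item(), attended.shape)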

The ability to learn to represent spatial information in images is important for image captioning models. By learning to represent spatial information, image captioning models can generate more accurate and informative captions.

4. The Fundamentals of Seq2Seq Learning

Seq2Seq learning is a type of ML that uses NNs (e.g., RNNs or CNNs) to learn long-range dependencies between input and output sequences. The neural network learns to map the input sequence to the output sequence by learning the long-range dependencies between the words in the sequences.

For instance, an NN might learn that the word “the” is typically followed by a noun and that the noun is typically followed by a verb. The NN might also learn that the word “is” is often used to connect a noun to a verb (present continuous tense). The NN can learn to translate a sentence from one language to another by learning these long-range dependencies.

Seq2Seq learning has been used to achieve state-of-the-art results in machine translation, text summarization, and other sequence-to-sequence tasks. It has been proven powerful.

The principles of Seq2Seq learning are based on the following two ideas:

  • NNs can learn long-range dependencies between sequences.
  • The encoding of the input sequence and the generation of the output sequence can be decoupled through an intermediate representation.

The first idea is based on the fact that NNs can learn to represent data sequences. E.g., RNNs do this by using a hidden state to store information about the sequence that has been seen so far. This hidden state is then used to predict the next element in the sequence.

The second idea is based on the fact that the encoding and decoding steps can be decoupled: the encoder summarizes the input sequence into a context representation, and the decoder generates the output sequence conditioned on that representation (and the tokens generated so far) rather than on the raw input directly.

The following techniques are typically incorporated into successful Seq2Seq learning:

  • Encoder-decoder architecture: The encoder-decoder architecture is a common architecture for Seq2Seq learning. The encoder takes the input sequence and produces a hidden state representing the sequence. The decoder then takes the hidden state as input and produces the output sequence. The encoder-decoder architecture may consist of two RNNs or an RNN plus a CNN.
  • Attention: Attention is a technique that can be used to improve the performance of Seq2Seq models. Attention allows the model to focus on specific parts of the input sequence when generating the output sequence.
  • Beam search: Beam search is a decoding algorithm that can be used to generate output sequences from a Seq2Seq model. Beam search works by considering a set of possible output sequences and then selecting the most likely correct sequence.

Another architecture that can be used for Seq2Seq learning is the transformer. The transformer does not use RNNs; instead, it relies on a self-attention mechanism to learn the long-range dependencies between the input and output sequences.

Seq2Seq learning has been used to achieve state-of-the-art results in a variety of tasks, including:

  • Machine translation: to translate text from one language to another.
  • Text summarization: to generate summaries of text documents.
  • Question answering: to answer questions about text documents.
  • Chatbots: to hold conversations with humans.

5. Computational Realization of Seq2Seq Learning

Computationally, the goal of seq2seq learning is to map an input sequence, x = (x_1, x_2, …, x_T), to an output sequence, y = (y_1, y_2, …, y_{T'}).

The input and output sequences may have different lengths, and the mapping between them can be many-to-many, one-to-many, or many-to-one. E.g., in machine translation, the input sequence is a sentence in one language, and the output sequence is the corresponding sentence in another language.

The encoder-decoder architecture with attention mechanisms can be trained as follows: first, the encoder takes in the input sequence and produces a fixed-shape context vector that summarizes it; then, the decoder takes in the context vector and generates the output sequence token by token, conditioned on the previous output tokens and the context vector.
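The generation half of that procedure can be sketched as a simple greedy loop, shown below. The tensor sizes, the <sos>/<eos> token ids, and the randomly initialized context vector are all illustrative assumptions; a trained model would supply real parameters and the encoder's actual context.

# A minimal sketch of token-by-token (greedy) decoding from a context vector.
import torch
import torch.nn as nn

hidden_dim, embed_dim, vocab_size = 128, 64, 1000
SOS, EOS = 1, 2                               # assumed start/end token ids

embedding = nn.Embedding(vocab_size, embed_dim)
decoder_cell = nn.GRUCell(embed_dim, hidden_dim)
to_vocab = nn.Linear(hidden_dim, vocab_size)

context = torch.randn(1, hidden_dim)          # stand-in for the encoder's context vector
state, token = context, torch.tensor([SOS])
generated = []
for _ in range(20):                           # cap the output length
    state = decoder_cell(embedding(token), state)   # condition on previous token + state
    token = to_vocab(state).argmax(dim=-1)          # greedy: pick the most likely next token
    if token.item() == EOS:
        break
    generated.append(token.item())
print(generated)

Replacing the argmax with the beam search shown earlier would keep several hypotheses alive instead of committing to one token per step.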

Mathematically, the encoder-decoder model for the seq2seq learning can be expressed as follows:

Encoder:

h_t = f_enc(x_t, h_{t-1}),    c = q(h_1, h_2, …, h_T)

where f_enc is the encoder's state-update function, h_t is the encoder hidden state at step t, and q combines the hidden states into the context vector c (with attention, a separate context vector c_t is recomputed at every decoding step).

Decoder:

s_t = f_dec(s_{t-1}, y_{t-1}, c_t),    p(y_t | y_{<t}, x) = softmax(W_s * s_t + b_s)

where f_dec is the decoder's state-update function and s_t is the decoder hidden state at step t (the same form reappears in Section 6).

The training objective of the model is to maximize the conditional (log-)probability of the output sequence given the input sequence:

log p(y | x) = \sum_{t=1}^{T'} log p(y_t | y_{<t}, x)
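In practice, maximizing this conditional log-probability is implemented as minimizing a token-level cross-entropy loss, as in the minimal sketch below. The logits tensor is a random stand-in for the decoder's per-step scores, which during training are usually produced with teacher forcing (feeding the ground-truth previous token at each step).

# A minimal sketch of the seq2seq training objective as token-level cross-entropy.
import torch
import torch.nn.functional as F

vocab_size, out_len = 1000, 6
logits = torch.randn(out_len, vocab_size, requires_grad=True)  # stand-in decoder scores per step
target = torch.randint(0, vocab_size, (out_len,))              # reference output tokens

# cross_entropy averages -log p(y_t | y_<t, x) over the output positions
loss = F.cross_entropy(logits, target)
loss.backward()                                                # gradients flow back to the model
print(loss.item())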

One limitation of the encoder-decoder architecture with attention mechanisms for seq2seq learning is that it can be computationally expensive and slow to train, especially for long input and output sequences. To address this, researchers have proposed various techniques to improve the efficiency and speed of the model, such as using CNNs instead of RNNs in the encoder or decoder, or using multi-head attention instead of single-head attention.

6. The Bahdanau Attention Mechanism

The Bahdanau attention mechanism is commonly used in the encoder-decoder architecture for ML models. It was introduced by Bahdanau, Cho, and Bengio in 2014 [5, 6] and has since become a popular approach for improving the performance of seq2seq models.

Mathematically, the Bahdanau attention mechanism can be expressed as follows:

Given an input sequence, x = (x_1, x_2, …, x_T), and the encoder hidden states, h = (h_1, h_2, …, h_T), the attention weights a_{t,i} for decoding step t can be computed as follows:

energy_{t,i} = v_a^T * tanh(W_a * h_i + U_a * s_{t-1} + b_a)
a_{t,i} = softmax_i(energy_{t,i})

where W_a, U_a, and b_a are learnable parameters of the model, v_a is a learnable parameter vector, h_i is the encoder hidden state for input position i, s_{t-1} is the previous hidden state of the decoder, tanh is the hyperbolic tangent function, and the softmax is taken over the input positions i = 1, …, T.

The attention weights, a_{t,i}, indicate the relevance of each input position i to the current decoding step t, and are used to compute a weighted sum of the encoder hidden states, which serves as an additional input to the decoder at each time step:

c_t = \sum_{i=1}^{T} a_{t,i} * h_i 

where c_t is the context vector at time step t.

The context vector, c_t, can then be concatenated with the previous output, y_{t-1}, and fed into the decoder to generate the current output y_t:

s_t = f(s_{t-1}, y_{t-1}, c_t) 
p(y_t | y_{<t}, x) = softmax(W_s * s_t + b_s)

where f is the decoder function, W_s and b_s are learnable parameters of the model, and p(y_t | y_{<t}, x) is the probability distribution over the possible output tokens.

The computational scheme for the Bahdanau attention mechanism is illustrated in Figure 2.

Figure 2. An illustration of the computational scheme for the Bahdanau attention mechanism (drawn by the author).
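To connect the equations above to code, here is a minimal PyTorch sketch of the Bahdanau (additive) attention step. The linear layers play the roles of W_a, U_a, b_a, and v_a; all dimensions and the random input tensors are illustrative assumptions.

# A minimal sketch of Bahdanau (additive) attention; dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, enc_dim=128, dec_dim=128, attn_dim=64):
        super().__init__()
        self.W_a = nn.Linear(enc_dim, attn_dim, bias=True)    # W_a * h_i + b_a
        self.U_a = nn.Linear(dec_dim, attn_dim, bias=False)   # U_a * s_{t-1}
        self.v_a = nn.Linear(attn_dim, 1, bias=False)         # v_a^T * tanh(...)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, T, enc_dim), dec_state: (batch, dec_dim)
        energy = self.v_a(torch.tanh(self.W_a(enc_states) +
                                     self.U_a(dec_state).unsqueeze(1)))   # (batch, T, 1)
        weights = F.softmax(energy.squeeze(-1), dim=1)                    # a_{t,i} over i
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)  # c_t = sum_i a_{t,i} * h_i
        return context, weights

# Usage with random stand-ins for the encoder states and the previous decoder state.
attn = BahdanauAttention()
h = torch.randn(2, 7, 128)      # 7 encoder hidden states per example
s_prev = torch.randn(2, 128)    # previous decoder hidden state s_{t-1}
c_t, a_t = attn(h, s_prev)
print(c_t.shape, a_t.shape)     # torch.Size([2, 128]) torch.Size([2, 7])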

One key advantage of the Bahdanau attention mechanism is that it allows the model to focus selectively on different parts of the input sequence as it generates the output sequence. It accomplishes this by computing a set of attention weights that indicate the relevance of each input token to the current decoding step. These attention weights are then used to compute a weighted sum of the encoder hidden states, which serves as an additional input to the decoder at each time step.

This can be especially useful in tasks where certain parts of the input sequence are more relevant to the output than others, such as in machine translation. For example, the Google Neural Machine Translation system used a sequence-to-sequence model with attention mechanisms to achieve state-of-the-art performance in machine translation at the time.

Another application is image captioning, where the model takes in an image and generates a natural-language description of it, selectively focusing on different parts of the image to produce a more accurate and descriptive caption. For example, the Show, Attend and Tell model uses a convolutional neural network (CNN) as the encoder and an LSTM network as the decoder with attention mechanisms, and achieved state-of-the-art performance in image captioning at the time.

In conclusion, the encoder-decoder architecture with attention mechanisms is a powerful tool for building ML models that can handle variable-length input and output sequences. The encoder converts the input sequence into a fixed-shape context variable, while the decoder generates the output sequence token by token, conditioned on the context variable. By incorporating attention mechanisms, the model can selectively focus on relevant information and achieve better performance in various applications.

References

[1] Eric S. Shi, Simulated Biomedicine: The Next Big Thing in Medicine, Medium, 2023. https://medium.com/artificial-corner/simulated-biomedicine-the-next-big-thing-in-medicine-209c2b94e9c3

[2] Eric S. Shi, Brain-Computer Interfaces: The Next Frontier in Human Technology — A Conversation with Bard That Will Change Your Perspective, Medium, 2023. https://medium.com/generative-ai/brain-computer-interfaces-the-next-frontier-in-human-technology-8980522c4452

[3] Eric S. Shi, Selective Attention: The Key to Unlocking the Full Potential of Deep Learning, Medium, 2023. https://medium.com/artificial-corner/selective-attention-the-key-to-unlocking-the-full-potential-of-deep-learning-5ee6bc55b419

[4] Eric S. Shi, Attention Mechanisms in Transformers: A Deep Dive into Queries, Keys, Values, and Pooling Techniques, Medium, 2023. https://medium.com/artificial-corner/attention-mechanisms-in-transformers-29c1768f83f3

[5] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs, stat].

[6] Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. arXiv:1412.1602v1 [cs.NE] 4 Dec 2014.

Allow me to take this opportunity to thank you for being here! I would be unable to do what I do without people like you who follow along and take that leap of faith to read my postings.

If you like my content, please (1) leave me a few claps and (2) press the “Follow” button below my photo. I can also be reached on LinkedIn, Facebook, or Twitter.
