Attention in NLP: One Key to Making Pre-Training Successful

Pre-Training, Part 2: the mechanism that tells your model where to put its focus

dan lee
AI³ | Theory, Practice, Business
7 min read · Mar 30, 2020


On the job at Yodo1, one task of our AI team is to classify users into different groups. Recently, a model we used showed better performance with attention than without — enough to convince me that it’s a method worth trying, and worth sharing with you!

So today, we will be talking about the attention mechanism.

What does the attention mechanism do?

In the Encoder-Decoder model, the attention mechanism’s job is to focus on different parts of the encoder’s output, depending on what is most relevant at each step of the decoding process.

For a bit of background, the 2014 paper Neural Machine Translation by Jointly Learning to Align and Translate is generally considered to be where the attention mechanism was originally put forward. It has only become more and more popular since.

Moving beyond this original attention — which we will refer to in this post as traditional attention — the mechanism has been developed into several variants, which mainly include:

  • Soft vs hard attention;
  • Global vs local attention;
  • Self-attention.

Most of these variants will not be addressed here: for now, it is enough simply to know that many exist. We will discuss self-attention in my next post as it is the basic idea behind the transformer which, in turn, is the basic framework used in GPT and BERT.

Both GPT and BERT, as introduced in part one of this series, are important models in pre-training. And as you may recall, GPT adopted the decoder part of the transformer while BERT adopted the encoder part.

Traditional attention is the basis of all of these ideas.

So, in this post, we’re going to learn more about the traditional attention mechanism, take a quick look at how it can be used, and dive into a deeper understanding with seq2seq attention.

Let’s make it happen!

What Is The Attention Mechanism?

To explain attention simply, we can use the task of translation as an example.

In the process of generating a translation in the target language, we focus on corresponding words or positions in the source language at each step.

In other words, when we do the translation, we don’t keep our focus fixed on the entire source sentence. Instead, as the translation progresses, our focus travels step by step through the sentence, from one word to another.

The attention mechanism is designed to imitate this process. At each step of a task, it scores previous pieces of information by their relevance to the current step. This allows the model to highlight the previous information that matters most right now and achieve better performance.
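To make the idea concrete, here is a minimal NumPy sketch of that process: score each piece of previous information against the current step, normalize the scores, and take a weighted sum. The vectors and sizes below are toy values of my own choosing, not taken from any real model.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Three pieces of previous information (e.g. encoder states) and the
# current step's state, each a 4-dimensional toy vector.
prev_info = np.array([[1.0, 0.0, 0.5, 0.2],
                      [0.1, 1.0, 0.3, 0.7],
                      [0.4, 0.2, 1.0, 0.1]])
current = np.array([0.2, 0.9, 0.4, 0.6])

scores = prev_info @ current     # relevance of each piece to the current step
weights = softmax(scores)        # normalized attention weights (sum to 1)
attended = weights @ prev_info   # weighted sum: the information the model "focuses" on

print(weights, attended)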


A Quick Look: Using The Attention Layer In TensorFlow 2.1

TensorFlow 2.1 recently provided us with an attention API, tensorflow.keras.layers.Attention, which we can use to quickly and directly build a model that contains the attention mechanism.

Let’s have a look at this CNN + attention model as an example:

import tensorflow as tf

# Variable-length int sequences.
query_input = tf.keras.Input(shape=(None,), dtype='int32')
value_input = tf.keras.Input(shape=(None,), dtype='int32')

# Embedding lookup (max_tokens = vocabulary size, dimension = embedding size).
token_embedding = tf.keras.layers.Embedding(max_tokens, dimension)
# Query embeddings of shape [batch_size, Tq, dimension].
query_embeddings = token_embedding(query_input)
# Value embeddings of shape [batch_size, Tv, dimension].
value_embeddings = token_embedding(value_input)

# CNN layer.
cnn_layer = tf.keras.layers.Conv1D(
    filters=100,
    kernel_size=4,
    # Use 'same' padding so outputs have the same shape as inputs.
    padding='same')

# Query encoding of shape [batch_size, Tq, filters].
query_seq_encoding = cnn_layer(query_embeddings)
# Value encoding of shape [batch_size, Tv, filters].
value_seq_encoding = cnn_layer(value_embeddings)

# Query-value attention of shape [batch_size, Tq, filters].
query_value_attention_seq = tf.keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])

# Reduce over the sequence axis to produce encodings of shape
# [batch_size, filters].
query_encoding = tf.keras.layers.GlobalAveragePooling1D()(query_seq_encoding)
query_value_attention = tf.keras.layers.GlobalAveragePooling1D()(
    query_value_attention_seq)

# Concatenate query and document encodings to produce a DNN input layer.
input_layer = tf.keras.layers.Concatenate()(
    [query_encoding, query_value_attention])

# Add DNN layers and create a Model.
# ...

Of course, you would have to make some modifications to the code based on your scenario if you really wanted your system to perform better. But this example should be enough to show how you can introduce attention to your model for a quick experiment.
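If you just want to see the layer in action before wiring up a full model, one quick way is to call it on random tensors and check the output shape. The batch size, sequence lengths, and feature size below are arbitrary choices for illustration.

import tensorflow as tf

# Toy shapes: batch of 2, query length Tq=5, value length Tv=7, 16 features.
query = tf.random.normal((2, 5, 16))
value = tf.random.normal((2, 7, 16))

# Dot-product attention over the value sequence for each query position.
attended = tf.keras.layers.Attention()([query, value])
print(attended.shape)  # (2, 5, 16): one attended vector per query position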

A Deeper Understanding: Seq2seq Attention

Background Knowledge Needed

To understand this part, you need to have some background knowledge of:

  • The Encoder-Decoder framework;
  • LSTM input and output.

To understand the attention model in deep learning, you must be familiar with the Encoder-Decoder framework, to which most current attention mechanisms are attached. At the same time, we should understand that the attention mechanism is a general concept that does not depend on any particular framework.

The specific models used by the Encoder and Decoder can be various combinations of CNN, RNN, LSTM, GRU, etc. In machine translation models, the Encoder-Decoder framework used is often LSTM-LSTM.

We will also be using the LSTM-LSTM encoding-decoding framework in our second example, to help us gain a deeper understanding of the attention mechanism.

If you are not familiar with the above terms, I recommend spending a few minutes googling them. You don’t need a deep understanding; simply knowing what they look like will do for now.

How Attention Works, Step by Step

Let’s look at what we call the seq2seq attention model, where attention is used in the LSTM-LSTM encoding-decoding framework.

To make it easier to explain, we will use the simplest architecture — in which the encoder and decoder both have only one LSTM layer each.

Our task is to translate “C’est la vie!” into English, to show us step-by-step how seq2seq with an attention mechanism works.

Step 1: We get the embedding vector of each word. Then we feed these embedding vectors into the encoder LSTM and get the hidden states h1, h2, h3.

Step 2: In the decoder, we input the embedding of <start> and h3 into the LSTM to get g1.
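Here is one way these two steps might look in code, using small tf.keras LSTM layers. The embedding and hidden sizes are toy values, the random tensors stand in for real word embeddings, and initializing the decoder with the encoder’s final state is one common way of passing h3 to it.

import tensorflow as tf

units = 8  # toy hidden size

# Step 1: encoder. Three source tokens ("C'est", "la", "vie!") as toy embeddings.
src_embeddings = tf.random.normal((1, 3, 4))          # (batch, src_len, emb_dim)
encoder = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
enc_outputs, enc_h, enc_c = encoder(src_embeddings)   # enc_outputs holds h1, h2, h3

# Step 2: decoder. Feed the <start> embedding, with the encoder's final
# state (h3 and its cell state) as the decoder's initial state.
start_embedding = tf.random.normal((1, 1, 4))
decoder = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
dec_outputs, g1, dec_c = decoder(start_embedding, initial_state=[enc_h, enc_c])

print(enc_outputs.shape, g1.shape)  # (1, 3, 8) and (1, 8)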

Step 3: Now we need to get the attention output, which is called the context vector c. The first thing to do is compute the dot products h1·g1, h2·g1, h3·g1.

Each dot product indicates how relevant (similar) g1 is to h1, h2, and h3.

Note: The dot product is not the only way to compute this relevance score. Two other popular methods are (a sketch of all three follows this list):

  • General (bilinear): hᵀWg, where W is a trainable weight matrix;
  • Concat (additive): vᵀtanh(W[h;g]), which concatenates h and g first.
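For reference, here is a small NumPy sketch of all three scoring functions. The dimensions are toy values, and the random matrices W and vector v stand in for weights that would be learned during training.

import numpy as np

d = 4
h = np.random.randn(d)   # an encoder hidden state
g = np.random.randn(d)   # the current decoder hidden state

# 1. Dot product: h · g
score_dot = h @ g

# 2. General (bilinear): h^T W g, with a trainable matrix W.
W = np.random.randn(d, d)
score_general = h @ W @ g

# 3. Concat (additive): v^T tanh(W [h; g]), concatenating h and g first.
W_cat = np.random.randn(d, 2 * d)
v = np.random.randn(d)
score_concat = v @ np.tanh(W_cat @ np.concatenate([h, g]))

print(score_dot, score_general, score_concat)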

Step 4: Use softmax to normalize h1·g1, h2·g1, h3·g1, and we get the attention weights alpha1, alpha2, alpha3.

Step 5: Now we compute the context vector as a sum over the Tx source positions: ci = Σj alphaij·hj. In this example, c1 = alpha1·h1 + alpha2·h2 + alpha3·h3. We can see that the context vector c is a weighted sum.
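Putting Steps 3 to 5 together in NumPy (with random toy vectors standing in for the real h1, h2, h3 and g1):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H = np.random.randn(3, 8)    # rows are h1, h2, h3 (toy encoder states)
g1 = np.random.randn(8)      # toy decoder state

scores = H @ g1              # Step 3: h1·g1, h2·g1, h3·g1
alpha = softmax(scores)      # Step 4: alpha1, alpha2, alpha3 (they sum to 1)
c1 = alpha @ H               # Step 5: c1 = alpha1·h1 + alpha2·h2 + alpha3·h3

print(alpha.sum(), c1.shape)  # 1.0, (8,)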

Step 6: Now that we have g1 and c1, we can compute y1: y1 (before softmax) = f(c1, g1).

We concatenate c1 and g1 and feed the result into a function, which can be W[c1;g1], where W is a trainable weight matrix. Then we apply softmax to get a distribution over the words of the English vocabulary, and pick the word with the highest probability; here, that word is “It”.
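Step 6 might look like this in NumPy. The vocabulary and hidden sizes are toy values, and the random W stands in for a trained projection matrix.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, vocab_size = 8, 10000
c1 = np.random.randn(hidden)   # context vector from Step 5
g1 = np.random.randn(hidden)   # decoder state from Step 2

W_out = np.random.randn(vocab_size, 2 * hidden)    # trainable projection W
logits = W_out @ np.concatenate([c1, g1])          # f(c1, g1) = W [c1; g1]
probs = softmax(logits)                            # distribution over the English vocabulary
predicted_word_id = int(np.argmax(probs))          # the most probable word, e.g. "It"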

Result: By repeating steps 2 to 6 (the encoder states from step 1 only need to be computed once) to get g2, y2, and so on, we finally get the English translation: “It’s (the) life.”
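To tie the loop together, here is a compact sketch of the repeated decoding steps. The decoder_step function is a hypothetical stand-in for one step of a real trained decoder LSTM, and all weights are random toy values; a real model would also stop when it emits an <end> token.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, vocab_size, max_len = 8, 10000, 10
H = np.random.randn(3, hidden)                    # h1, h2, h3 from the encoder (Step 1)
W_out = np.random.randn(vocab_size, 2 * hidden)   # output projection (Step 6)

def decoder_step(prev_word_id, prev_state):
    # Hypothetical stand-in for one step of the decoder LSTM (Step 2).
    rng = np.random.default_rng(prev_word_id)
    return rng.standard_normal(hidden)

word_id, state, output_ids = 0, np.zeros(hidden), []   # 0 stands for <start>
for _ in range(max_len):
    g = decoder_step(word_id, state)                 # Step 2
    alpha = softmax(H @ g)                           # Steps 3-4
    c = alpha @ H                                    # Step 5
    probs = softmax(W_out @ np.concatenate([c, g]))  # Step 6
    word_id = int(np.argmax(probs))                  # greedily pick the next word
    output_ids.append(word_id)
    state = g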

And voila! That’s a complete picture of the seq2seq attention model.

The implementation details of attention variants can differ, but the basic process follows what we described above.

If you are interested in different kinds of attention, I highly encourage you to research the topic! And to learn about self-attention with me, you can come back next month.

In Conclusion

In the calculation process, attention associates each piece of input information with the current step by calculating its relevance. This greatly shortens the path between long-distance dependent features, which makes them much easier for the model to use effectively.

Next up, we’ll see how self-attention works. Don’t miss it!

Thanks for reading! If you enjoyed this article, please hit the clap button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge.

Feel free to share your questions and comments here and follow me for the latest content.


dan lee
AI³ | Theory, Practice, Business

NLP Engineer, Google Developer Expert, AI Specialist at Yodo1