Attention Mechanism and Softmax

Sayan Mondal
Published in Analytics Vidhya · Dec 23, 2020

In natural language processing, the Seq2Seq model is one of the important models for processing language. In this model, the encoder reads the input sentence once and encodes it. At each time step, the decoder uses this encoding and produces an output. But humans don’t translate a sentence like this. We don’t memorize the input and try to recreate it; we would likely forget certain words if we did. Also, is the entire sentence important at every time step, while producing every word? No. Only certain words are important. Ideally, we need to feed only the relevant information (the encodings of the relevant words) to the decoder.
“Learn to pay attention only to certain important parts of the sentence.”

What is Attention?
In psychology, attention is the mental process of selectively concentrating on one or some things while ignoring others.

A neural network is considered to be an attempt to mimic human brain actions in a simplified manner. The attention mechanism is likewise an attempt to implement this same action of selectively concentrating on a few relevant things while ignoring others in deep neural networks.

Let me explain what this means. Let’s say you are looking at a group photo from your first school. Typically, there will be a group of kids sitting across several rows, and the teacher will be sitting somewhere in between. Now, if anyone asks the question, “How many people are there?”, how will you answer it?

Simply by counting heads, right? You don’t need to consider anything else in the photo. Now, if anyone asks a different question, “Who is the teacher in the photo?”, your brain knows exactly what to do. It will simply start looking for the features of an adult in the photo. The rest of the features will simply be ignored. This is the ‘attention’ that our brain is so adept at implementing.

Goal:
Our goal is to come up with a probability distribution which says, at each time step, how much importance (attention) should be paid to each of the input words.

Attention is simply a vector, often the output of a dense layer with a softmax function. Before the attention mechanism, translation relied on reading a full sentence and compressing all of its information into a fixed-length vector; as you can imagine, a long sentence compressed into such a vector will surely cause information loss, inadequate translation, and so on. Attention partially fixes this problem. It allows the machine translator to look over all the information the original sentence holds and then generate the proper word according to the word it is currently working on and the context. It can even allow the translator to zoom in or out (focus on local or global features). Attention is not mysterious or complex. It is just an interface formulated by parameters and simple math. You can plug it in anywhere you find it suitable, and potentially the result will be improved.
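To see what that softmax step does, here is a tiny made-up example (the scores are arbitrary, chosen only for illustration): softmax turns raw relevance scores over the input words into a probability distribution that sums to one.

```python
import numpy as np

# Made-up relevance scores, one per input word
scores = np.array([2.0, 0.5, 0.1, 3.0])

# Softmax: exponentiate and normalize so the weights sum to 1
weights = np.exp(scores) / np.exp(scores).sum()
print(weights)  # ~ [0.24, 0.05, 0.04, 0.66] -- the highest-scoring word gets the most attention
```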
The core of a probabilistic language model is to assign a probability to a sentence using the Markov assumption. Since sentences contain varying numbers of words, an RNN is naturally introduced to model the probability among words.
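In symbols (a standard way to write this), the Markov assumption conditions each word on a short history of length $k$, while an RNN conditions each word on the entire prefix:

$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k}, \dots, w_{i-1}) \quad \text{(Markov assumption of order } k\text{)}$$

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \quad \text{(RNN factorization)}$$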

The basic RNN structure often gets stuck when modeling language, for two reasons:

  1. Structure dilemma: in the real world, the lengths of outputs and inputs can be totally different, while a vanilla RNN can only handle fixed-length problems, which makes alignment difficult. Consider an EN-FR translation example: “he doesn’t like apples” → “Il n’aime pas les pommes”.
  2. Mathematical nature: it suffers from vanishing/exploding gradients, which means it is hard to train once sentences get long (in practice the usable context may be only a handful of words).

Translation often requires arbitrary input and output lengths. To cater to the deficits above, the encoder-decoder model is adopted, and the basic RNN cell is replaced by a GRU or LSTM cell (and in plain RNNs the hyperbolic tangent activation is sometimes replaced by ReLU to ease gradient flow). We use GRU cells here.

An embedding layer maps discrete words into dense vectors for computational efficiency. The embedded word vectors are then fed into the encoder, i.e. the GRU cells, sequentially. What happens during encoding? Information flows from left to right, and each hidden state is learned from not only the current input but also all previous words. When the sentence has been read completely, the encoder generates an output and a hidden state at the final time step (time step 4 for a four-word sentence) for further processing. For the decoding part, the decoder (GRUs as well) grabs the hidden state from the encoder, is trained by teacher forcing (a mode in which the previous cell’s ground-truth output is fed as the current input), and then generates the translation words sequentially.
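As a minimal sketch of this encoder-decoder (in PyTorch, with vocabulary sizes, dimensions, and token ids assumed only for illustration), the encoder embeds and reads the source sentence, and the decoder is fed the ground-truth previous tokens (teacher forcing):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # discrete words -> dense vectors
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                                   # src: (batch, src_len)
        embedded = self.embedding(src)                        # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)                  # outputs for every step, hidden for the last step
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):                           # teacher forcing: feed the ground-truth
        embedded = self.embedding(tgt)                        # previous words as the inputs
        outputs, hidden = self.gru(embedded, hidden)
        return self.out(outputs), hidden                      # logits over the target vocabulary

# Usage: encode a 4-token source sentence and decode with teacher forcing.
src = torch.tensor([[5, 6, 7, 8]])                            # assumed token ids
tgt = torch.tensor([[1, 9, 10, 11, 12]])                      # <sos> followed by target token ids
enc, dec = Encoder(1000), Decoder(1000)
enc_outputs, hidden = enc(src)                                # encoder hidden state at the last time step
logits, _ = dec(tgt, hidden)                                  # (1, 5, vocab_size)
```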

Similar to the basic encoder-decoder architecture, this mechanism plugs a context vector into the gap between the encoder and the decoder. In the schematic, blue represents the encoder and red represents the decoder; we can see that the context vector takes all the encoder cells’ outputs as input to compute the probability distribution over target-language words for every single word the decoder wants to produce. By utilizing this mechanism, the decoder can capture somewhat global information rather than inferring solely from one hidden state.
And forming a context vector is fairly simple. For a fixed target word, we first loop over all the encoder states, comparing the target and source states to come up with a score for each encoder state. Then we use softmax to normalize all the scores, which generates the probability distribution conditioned on the target state. At last, these weights are used to form the context vector, keeping the whole mechanism easy to train. That’s it. The math is shown below:
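Written out explicitly, this is the standard Luong-style formulation that the equation numbers below refer to, with $h_t$ the current target (decoder) state and $\bar{h}_s$ the source (encoder) states:

$$\alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'=1}^{S} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \quad \text{[attention weights]} \quad (1)$$

$$c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s \quad \text{[context vector]} \quad (2)$$

$$a_t = f(c_t, h_t) = \tanh\big(W_c [c_t; h_t]\big) \quad \text{[attention vector]} \quad (3)$$

$$\mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^{\top} W \bar{h}_s & \text{[Luong’s multiplicative style]} \\ v_a^{\top} \tanh\big(W_1 h_t + W_2 \bar{h}_s\big) & \text{[Bahdanau’s additive style]} \end{cases} \quad (4)$$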

To understand the seemingly complicated math, we need to keep three key points in mind:

  • During decoding, a context vector is computed for each output word. So we will have a 2D matrix whose size is the number of target words multiplied by the number of source words. Equation (1) demonstrates how to compute one such value given one target word and a set of source words.
  • Once the context vector is computed, the attention vector can be computed from the context vector, the target word, and an attention function f.
  • We need the attention mechanism to be trainable. According to equation (4), both styles offer trainable weights (W in Luong’s, W1 and W2 in Bahdanau’s). Thus, different styles may lead to different performance. A minimal code sketch of the whole computation follows this list.
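Here is a minimal NumPy sketch of equations (1)-(3), assuming Luong’s multiplicative (“general”) score from equation (4) and randomly initialized weights standing in for trained ones; it produces exactly the 2D alignment matrix (number of target words × number of source words) described in the first bullet.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # hidden size
S, T = 5, 6                               # number of source / target words
enc_states = rng.normal(size=(S, d))      # encoder hidden states, h_bar_s
dec_states = rng.normal(size=(T, d))      # decoder hidden states, h_t
W  = rng.normal(size=(d, d)) * 0.01       # trainable score weight (eq. 4, Luong's style)
Wc = rng.normal(size=(d, 2 * d)) * 0.01   # trainable attention weight (eq. 3)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = dec_states @ W @ enc_states.T    # (T, S) raw scores, eq. (4)
alignment = softmax(scores, axis=1)       # (T, S) 2D matrix of attention weights, eq. (1)
contexts = alignment @ enc_states         # (T, d) one context vector per target word, eq. (2)
attention = np.tanh(np.concatenate([contexts, dec_states], axis=1) @ Wc.T)  # (T, d) attention vectors, eq. (3)
print(alignment.shape, contexts.shape, attention.shape)
```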

Below is a summary table of several popular attention mechanisms and their corresponding alignment score functions:
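As a rough stand-in for that table, here are sketches of the most commonly listed score functions (NumPy; v, W, W1, and W2 stand for trained parameters):

```python
import numpy as np

def score_dot(h_t, h_s):                      # Dot-product (Luong et al., 2015)
    return h_t @ h_s

def score_scaled_dot(h_t, h_s):               # Scaled dot-product (Vaswani et al., 2017)
    return (h_t @ h_s) / np.sqrt(h_s.shape[-1])

def score_general(h_t, h_s, W):               # General / multiplicative (Luong et al., 2015)
    return h_t @ W @ h_s

def score_additive(h_t, h_s, W1, W2, v):      # Additive / concat (Bahdanau et al., 2015)
    return v @ np.tanh(W1 @ h_t + W2 @ h_s)
```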


Sayan Mondal is an avid reader, full stack application developer, data science enthusiast, and NLP specialist. Write to me at sayanmondal2098@gmail.com.