Sequence To Sequence Models

Dharti Dhami
Dec 19, 2018


Let’s explore a basic seq2seq model for machine translation and then learn about beam search and the attention model.

To train a neural network that takes an input sequence x and outputs the translation as a sequence y, we use an encoder network and a decoder network. The encoder can be built as an RNN (with GRU or LSTM cells) that reads the input one word at a time. After ingesting the whole input sequence, the RNN outputs a fixed-length vector that represents the input sentence.
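As a rough illustration, here is what such an encoder might look like in PyTorch. The layer sizes, names, and the choice of a GRU are my own assumptions for the sketch, not something the model prescribes:

```python
# A minimal encoder sketch, assuming PyTorch and a single-layer GRU.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer word indices
        embedded = self.embedding(src_tokens)
        outputs, hidden = self.rnn(embedded)
        # hidden is the fixed-length vector summarizing the source sentence
        return hidden
```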

After that, we build a decoder network which takes as input the encoding produced by the encoder and is trained to output the translation one word at a time, until eventually it outputs the end-of-sentence token.
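A matching decoder could be sketched like this (again with illustrative names and sizes). At each step it takes the previously generated word and the current hidden state, and produces scores over the target vocabulary:

```python
# A minimal decoder sketch, assuming PyTorch; one step generates one word.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1) index of the previously generated word
        # hidden: (1, batch, hidden_dim), initially the encoder's output vector
        embedded = self.embedding(prev_token)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output)  # scores over the target vocabulary
        return logits, hidden
```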

An architecture very similar to this also works for image captioning. Given an image, we want to generate a caption such as “a cat sitting on a chair”.

We input an image into a convolutional network, such as a pre-trained AlexNet, and let it learn an encoding, i.e. a set of features, of the input image. We remove the final softmax unit and take the 4096-dimensional feature vector from the layer just before it. We then feed that vector into an RNN, which generates the caption one word at a time.
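A rough sketch of that captioning setup, assuming torchvision's pretrained AlexNet and a hypothetical projection layer that maps the 4096-dimensional features to the caption decoder's hidden size:

```python
# Illustrative only: extract AlexNet features and use them to initialize
# the hidden state of a caption-generating RNN decoder.
import torch
import torch.nn as nn
from torchvision import models

alexnet = models.alexnet(weights="DEFAULT")
# Drop the final classification layer, keeping the 4096-d feature vector.
alexnet.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

project = nn.Linear(4096, 512)  # map image features to the RNN hidden size

def init_caption_state(image_batch):
    with torch.no_grad():
        features = alexnet(image_batch)        # (batch, 4096)
    return project(features).unsqueeze(0)       # initial decoder hidden state
```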

There are some similarities between the sequence-to-sequence machine translation model and a language model, but there are significant differences as well. We will see that machine translation can be built as a conditional language model.

A language model estimates the probability of a sentence, and we can also use it to generate novel sentences. In a machine translation model, by contrast, we use an encoder-decoder network in which the decoder looks identical to a language model. But instead of modeling the probability of any sentence, as a language model does, the translation model models the probability of the output English translation conditioned on an input sentence in a different language.
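In symbols (standard notation, not taken from this article), the two models factor as:

```latex
% Language model: probability of a sentence y = (y^{<1>}, \dots, y^{<T_y>})
P(y^{<1>}, \dots, y^{<T_y>}) = \prod_{t=1}^{T_y} P\big(y^{<t>} \mid y^{<1>}, \dots, y^{<t-1>}\big)

% Conditional language model for translation: the same factorization,
% but every term is also conditioned on the source sentence x
P(y^{<1>}, \dots, y^{<T_y>} \mid x) = \prod_{t=1}^{T_y} P\big(y^{<t>} \mid y^{<1>}, \dots, y^{<t-1>}, x\big)
```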

When we use this model for machine translation, we don't sample words at random from this distribution. What we want is to find the English sentence y that maximizes the conditional probability. But the number of possible English sentences is exponentially large: with a 10,000-word vocabulary and translations up to ten words long, there are 10,000^10 possible ten-word sentences. That is far too large a space to score exhaustively, which is why the problem is solved with an approximate search algorithm.
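So decoding amounts to the following search problem:

```latex
% Pick the whole output sentence with the highest conditional probability,
% rather than sampling or choosing greedily word by word
y^{*} = \arg\max_{y^{<1>}, \dots, y^{<T_y>}} \; P\big(y^{<1>}, \dots, y^{<T_y>} \mid x\big)
```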

The most common algorithm for doing this is called beam search. Let’s take a look at that here.
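To make the idea concrete, here is a compact, illustrative sketch of beam search. The `step` function is hypothetical: it stands in for one step of the decoder and is assumed to return log-probabilities over the vocabulary for the next word, given the partial translation so far.

```python
def beam_search(step, beam_width=3, max_len=10, eos="<eos>"):
    # Each hypothesis is a (log-probability, list of words so far) pair.
    beams = [(0.0, [])]
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_prob, prefix in beams:
            # step(prefix) is assumed to return {word: log P(word | prefix, x)}
            for word, lp in step(prefix).items():
                candidates.append((log_prob + lp, prefix + [word]))
        # Keep only the top `beam_width` partial translations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, prefix in candidates[:beam_width]:
            if prefix[-1] == eos:
                completed.append((score, prefix))
            else:
                beams.append((score, prefix))
        if not beams:
            break
    best = max(completed + beams, key=lambda c: c[0])
    return best[1]
```

With a beam width of 1 this reduces to greedy decoding; widening the beam keeps more candidate translations alive at each step at the cost of more computation.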
