TensorFlow — Sequence to Sequence

Today I want to show an example of a Sequence to Sequence model using the latest TensorFlow APIs [as of TF 1.3].

Seq2Seq models are very useful when both your input and output have some structure or a time component. The most popular applications are all in the language domain, but one can use them to process time series, trees, and many other kinds of intrinsically structured data.

Translation has been the domain where these models have advanced the most, as it has large enough datasets to train large and complicated models and provides clear value from advancing the state of the art.

If you haven’t seen them, here are a few papers and posts on Neural Machine Translation with Seq2Seq models: https://arxiv.org/abs/1409.3215, https://arxiv.org/abs/1609.08144, https://research.googleblog.com/2016/09/a-neural-network-for-machine.html

A Seq2Seq model is separated into two components: an Encoder and a Decoder.

The Encoder consumes tokens in one language and produces a fixed-size encoding of the input sequence. The Decoder takes that encoding and produces output tokens one by one, conditioning on the previously decoded tokens.

The Encoder is structured similarly to a Text Classification model: it reads the input sequence token by token using an RNN cell. The internal state of the RNN encodes the model’s understanding of the sequence.
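To make the encoder loop concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a simple tanh RNN cell, and the embedding table and weight matrices are tiny made-up values for illustration only; a real model would learn them and would more likely use a GRU or LSTM cell.

```python
import math

def rnn_step(state, x, w_state, w_input):
    """One step of a plain tanh RNN cell: new_state = tanh(W_s*state + W_x*x)."""
    new_state = []
    for i in range(len(state)):
        total = sum(w_state[i][j] * state[j] for j in range(len(state)))
        total += sum(w_input[i][k] * x[k] for k in range(len(x)))
        new_state.append(math.tanh(total))
    return new_state

def encode(token_ids, embeddings, w_state, w_input):
    """Read the input token by token; return every state and the final one."""
    state = [0.0] * len(w_state)  # start from an all-zero state
    outputs = []
    for token_id in token_ids:
        state = rnn_step(state, embeddings[token_id], w_state, w_input)
        outputs.append(state)
    return outputs, state

# Hypothetical toy parameters: 4-token vocab, 2-d embeddings, 2-d state.
EMBEDDINGS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_STATE = [[0.5, -0.1], [0.2, 0.3]]
W_INPUT = [[0.4, 0.1], [-0.3, 0.6]]

encoder_outputs, final_state = encode([1, 2, 3], EMBEDDINGS, W_STATE, W_INPUT)
```

The final state is what a plain decoder would start from, while the per-step outputs are what an attention decoder would look back at.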

After the input sequence is finished (a “<DONE>” token is used to indicate that to the model), the Decoder starts processing, producing output tokens one by one. There are a number of different ways to implement the decoder. The two most common are a plain RNN decoder and a decoder with attention.

A plain RNN decoder just starts from the output of the Encoder step and, on each RNN step, uses the previous token [the correct one, or the one decided by the model] and the hidden state of the RNN to produce the next token.
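The difference between feeding the correct previous token and feeding the model’s own choice can be sketched in a few lines of plain Python. Everything here is a toy assumption: `toy_step` is a hypothetical stand-in for an RNN cell plus output projection, and the `GO` start token is my own addition to give the decoder a first input.

```python
GO, DONE, VOCAB_SIZE = 0, 4, 5  # hypothetical token ids

def toy_step(state, prev_token):
    """Stand-in for RNN cell + projection: scores the token after
    prev_token highest (capped at DONE); state just counts steps."""
    favored = min(prev_token + 1, DONE)
    logits = [1.0 if t == favored else 0.0 for t in range(VOCAB_SIZE)]
    return logits, state + 1

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def decode_teacher_forced(initial_state, targets):
    """Training-style decoding: the next input is always the correct token."""
    state, prev, predictions = initial_state, GO, []
    for target in targets:
        logits, state = toy_step(state, prev)
        predictions.append(argmax(logits))
        prev = target  # feed the ground truth, not the prediction
    return predictions

def decode_greedy(initial_state, max_steps=10):
    """Inference-style decoding: the next input is the model's own pick."""
    state, prev, predictions = initial_state, GO, []
    for _ in range(max_steps):
        logits, state = toy_step(state, prev)
        prev = argmax(logits)
        predictions.append(prev)
        if prev == DONE:  # stop once the end token is produced
            break
    return predictions

print(decode_teacher_forced(0, [2, 1, 4]))  # [1, 3, 2]
print(decode_greedy(0))                     # [1, 2, 3, 4]
```

In the teacher-forced run each prediction depends on the target token that was fed in, not on what the model predicted one step earlier; in the greedy run the model’s own output is fed back until it emits the end token.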

An attention decoder doesn’t just take the hidden state of the RNN and the previous token; it also uses the hidden state of the decoder RNN to “attend”, that is, to select information from the encoder output states. This produces an alignment between each output token and some set of input tokens.
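As a rough illustration of one attention step, here is a plain Python sketch. Note the assumption: for brevity it scores with a simple dot product between the decoder state and each encoder output, which is a simplification and not the additive scoring used by Bahdanau-style attention. The vectors are made-up toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Score each encoder output against the decoder state, then return
    the alignment weights and the weighted sum (the context vector)."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(size)]
    return weights, context

# Three encoder output states (one per input token) and one decoder state.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([0.0, 2.0], encoder_outputs)
```

With these toy values the second input position receives the largest weight, which is the alignment the decoder would use for this output step. The context vector is then combined with the decoder state to predict the next token.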

The attention mechanism is a very important concept in deep learning, so if you are not familiar with it, you may want to read: http://arxiv.org/abs/1409.0473

Attention is so powerful that a model built only on attention can actually outperform RNN-based Seq2Seq models at language translation: https://arxiv.org/pdf/1706.03762.pdf [shameless plug ;) ].

I also want to mention the Tensor2Tensor project, recently open sourced by Łukasz Kaiser: https://github.com/tensorflow/tensor2tensor. It contains a large library of battle-tested and tuned models [Seq2Seq and Transformer included] as well as input readers for Machine Translation and other datasets.

Alright, let’s look at the code for this in TensorFlow:

This is a lot of code, but the main ideas are:

  • make_input_fn creates an input function that reads data and generates padded pairs of (input, output) of equal length.
  • The seq2seq model uses the same embeddings to encode both inputs and outputs [I use a shared vocab, which simplifies things even in the case of different languages].
  • The decoder is dynamically sized [i.e. different batches can have a different number of tokens per sequence].
  • TrainingHelper and GreedyEmbeddingHelper are used to do teacher forcing for training and greedy sampling for inference.
  • To get attention working with the dynamic decoder, use the BahdanauAttention + AttentionWrapper combination.
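The padding idea from the first bullet can be sketched without TensorFlow. This is not the actual make_input_fn; `pad_batch`, `pad_pairs` and `PAD_ID` are hypothetical names, and the sketch assumes each side of the batch is padded to the length of its longest sequence, with the true lengths kept alongside.

```python
PAD_ID = 0  # hypothetical id reserved for padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one; also return true lengths."""
    max_len = max(len(s) for s in sequences)
    lengths = [len(s) for s in sequences]
    padded = [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]
    return padded, lengths

def pad_pairs(pairs, pad_id=PAD_ID):
    """Pad the input side and the output side of (input, output) pairs."""
    inputs, input_lengths = pad_batch([p[0] for p in pairs], pad_id)
    outputs, output_lengths = pad_batch([p[1] for p in pairs], pad_id)
    return inputs, input_lengths, outputs, output_lengths

pairs = [([5, 6, 7], [8, 9]), ([5], [8, 9, 10, 11])]
inputs, input_lengths, outputs, output_lengths = pad_pairs(pairs)
print(inputs)          # [[5, 6, 7], [5, 0, 0]]
print(output_lengths)  # [2, 4]
```

Keeping the true lengths next to the padded batch lets later steps, such as the loss or a length-aware decoder, ignore the padded positions.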

You can find the latest version of the code, with a sample data generator, here: https://github.com/ilblackdragon/tf_examples/tree/master/seq2seq