seq2seq: the clown car of deep learning

Dev Nag
4 min read · Nov 11, 2016


tl;dr: Translating arbitrary-length sequences back and forth is easier than you think

Where are all these flamboyant tensors coming from?

seq2seq (“sequence-to-sequence”) confuses many deep learning first-timers, both in terms of its raw architecture and its performance characteristics.

On the one hand, seq2seq seems to do quite well in a wide range of tasks, including translation, image captioning, and even interpreting dialects of Python. On the other hand, how is it really possible to map arbitrary-length sequences to other arbitrary-length sequences using fixed-size architectures?

Or metaphorically, where are all those clowns hiding in that tiny car? That seems to violate some law of information theory…or at least, our intuition. I’ll show you how seq2seq works and how to apply it to various problems.

Let’s start by diving into one of the problems that led to seq2seq — automated translation. Suppose you want to translate from one language (say, English) to another (say, German). You can’t just map word-token to word-token, obviously; some tokens disappear in the translation, others appear from nowhere, some depend heavily on the tokens around them, and some tokens converge or diverge, like my personal favorite, “Rechtsschutzversicherungsgesellschaften” (or, in English, “insurance companies that provide legal protection”). Easy examples like this drive home the point that translation just isn’t a token-level function.

this dude had some high standards

Now, a professional literary translator would say that translation isn’t a sentence-level function, either (it’s worth noting that both Nabokov and Borges translated the works of others, and considered it an act of literary creativity in its own right) — but at least this simplification gets closer to the task.

Now that we’ve decided on sentence-level translation as the task, we have to ask ourselves — how can we possibly model that with deep learning? Whatever architecture you pick (feedforward vs. recurrent, deep vs. shallow, etc.), you still have to fix the size of the input layer, the size of the output layer, and all the machinery within. But sentences come in all sorts of different lengths! Do you build a different network for each length (say, one network that translates all of the 12-word sentences in English into 8-word sentences in German, and so on)? That seems absurd, and it is. Don’t do this.

seq2seq solves this problem much the way backpropagation-through-time (BPTT) solves the task of training cyclic networks. With BPTT, we took a temporal, self-referential network and unrolled it into a spatial, non-self-referential one. Here, with seq2seq, we reinterpret a spatial problem (a variable-length sequence of tokens) as a temporal one (tokens generated over time). In essence, we cram some number of clowns into the car (variable-length input), then pull some other sequence of clowns back out until we get a marker indicating we’re done (variable-length output).

Let’s look at a diagram of the whole process, going from left to right (adapted from Sutskever et al., 2014 and Cho et al., 2014):

Tokens go in, tokens go out. Can’t explain that!

We essentially have two different recurrent neural networks tied together here — the encoder RNN (bottom left boxes) listens to the input tokens until it gets a special <DONE> token, and then the decoder RNN (top right boxes) takes over and starts generating tokens, also finishing with its own <DONE> token.

The encoder RNN evolves its internal state (depicted by light blue changing to dark blue while the English sentence tokens come in), and then once the <DONE> token arrives, we take the final encoder state (the dark blue box) and pass it, unchanged and repeatedly, into the decoder RNN along with every single generated German token. The decoder RNN also has its own dynamic internal state, going from light red to dark red.
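
To make this concrete, here’s a minimal sketch of such an encoder/decoder pair in PyTorch. This is my own toy version, not code from the papers: the hidden size, vocabulary sizes, and <DONE> token index are made up, there’s no training loop, and the batch size is fixed at one for readability.

```python
import torch
import torch.nn as nn

HIDDEN = 128                       # size of the fixed "clown car" state
SRC_VOCAB, TGT_VOCAB = 1000, 1200  # hypothetical vocabulary sizes
DONE = 0                           # hypothetical index of the <DONE> token

class EncoderRNN(nn.Module):
    """Reads source tokens one at a time; its final hidden state is the summary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, HIDDEN)
        self.gru = nn.GRU(HIDDEN, HIDDEN, batch_first=True)

    def forward(self, src_ids):               # src_ids: (1, src_len) of token ids
        _, h = self.gru(self.embed(src_ids))
        return h                               # (1, 1, HIDDEN): the "dark blue box"

class DecoderRNN(nn.Module):
    """Generates target tokens; sees the fixed context vector at every step."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, HIDDEN)
        self.gru = nn.GRU(HIDDEN * 2, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)

    def step(self, prev_id, h, context):       # prev_id: (1, 1), context: (1, 1, HIDDEN)
        x = torch.cat([self.embed(prev_id), context], dim=-1)
        out, h = self.gru(x, h)
        return self.out(out), h                # logits over the target vocabulary

def translate(encoder, decoder, src_ids, max_len=30):
    """Greedy decoding: keep emitting tokens until the decoder says <DONE>."""
    context = encoder(src_ids)                 # fixed-size summary of the whole input
    h = context                                # decoder state starts from that summary
    prev = torch.tensor([[DONE]])              # reuse <DONE> as a start marker (a common simplification)
    tokens = []
    for _ in range(max_len):
        logits, h = decoder.step(prev, h, context)
        prev = logits.argmax(dim=-1)           # greedy pick of the next token
        if prev.item() == DONE:
            break
        tokens.append(prev.item())
    return tokens
```

At training time you’d run the same decoder step with teacher forcing (feeding it the ground-truth previous token); the greedy loop above is just the simplest way to generate at inference time.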

Voila! Variable-length input, variable-length output, from a fixed-size architecture.

This makes sense for language. How about for image captioning? An image isn’t really a series of tokens, so how do we apply this same approach to taking an arbitrary image like:

and being able to generate a sequence of tokens like “a person riding a motorcycle on a dirt road”?

All we need to do is replace the encoder half of the encoder/decoder pair with a convolutional neural network (CNN), the kind typically used in image processing, and then do the same token-based decoding we did above. The CNN encodes the image into some hidden state tensor, and the decoder then uses that state as the internal representation to ‘describe’.
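
Here’s a hypothetical sketch of that swap, reusing the toy PyTorch decoder from the earlier block: any front end works as long as it hands the decoder a hidden state of the shape it expects. A real captioning system would use a large pretrained CNN (a ResNet, say) rather than this two-layer toy.

```python
import torch
import torch.nn as nn

HIDDEN = 128  # must match the decoder's hidden size from the earlier sketch

class ImageEncoder(nn.Module):
    """A tiny stand-in CNN: image in, one HIDDEN-sized state vector out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # collapse the spatial dimensions
        )
        self.project = nn.Linear(64, HIDDEN)   # map CNN features to the decoder's size

    def forward(self, image):                  # image: (1, 3, height, width)
        x = self.features(image).flatten(1)    # (1, 64)
        return self.project(x).unsqueeze(0)    # (1, 1, HIDDEN), same shape as the
                                               # RNN encoder's final state
```

Because the translate() helper above only ever calls encoder(...) to get a context vector, translate(ImageEncoder(), decoder, image_tensor) decodes a caption with exactly the same greedy loop.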

In probabilistic terms, the output sequence is conditionally independent of the input sequence given the hidden state. That state is the informational bottleneck, or shared description between the source format (a sentence, or image) and the target format (a translation, or caption).
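
Spelled out (my notation, not the papers’): if h is the encoder’s final state for the input sequence x, the decoder defines

```latex
h = \mathrm{enc}(x_1, \ldots, x_S), \qquad
p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p\left(y_t \mid y_{<t},\, h\right)
```

so the input influences the output only through h.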

It’s a fairly modular approach — and it’s easy to imagine swapping many other “front ends” onto this architecture that take some input data (in many formats), transform it into a hidden state/representation, then use our token decoder to describe, literally, anything.

