Seq2Seq Model for Language

Gautam Karmakar
4 min read · Feb 9, 2018

--

Seq2seq model:

Citation: “Sequence to Sequence Learning with Neural Networks” (Sutskever, Vinyals, and Le), https://arxiv.org/abs/1409.3215v3

Brief introduction:

Deep neural networks, in their usual feedforward, fully connected form, are powerful but not well suited to sequential data such as time series or language. They are very good at mapping an input to a discrete label or a continuous value, but not at mapping a sequence to another sequence. The seq2seq model learns to map a variable-length input sequence to a variable-length output sequence. It uses two LSTMs: one learns a fixed-dimensional vector representation of the input sequence, and the other learns to decode the target sequence from that vector. The LSTM is a variant of the recurrent neural network that uses gates to handle long sequences.

The seq2seq model addresses a specific limitation of deep neural networks: a DNN requires fixed-length vector representations of its input and output. In machine translation, where a sentence in the source language is converted to the target language, the input and output sentence lengths can vary. In question answering, a question of any length needs to be mapped to an answer of any length.

The LSTM's ability to map between sequences of variable length, and to learn from long sequences, makes it very useful for these types of problems, where DNNs are not a good fit.

Applications: This type of encoder-decoder model is used in language translation, speech recognition, and question answering. In the paper, the seq2seq model reaches a BLEU score of 34.81 on English-to-French translation on the WMT’14 dataset. BLEU (Bilingual Evaluation Understudy) is the standard metric for machine translation; a perfect match with the reference translations scores 1 (often reported as 100). One interesting finding in the paper is that reversing the order of the input sentence (but not the target) remarkably improves the LSTM's performance. According to the paper, this introduces many short-term dependencies between the source and target sentences, which makes the optimization problem easier.
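
As a rough illustration of what BLEU measures, here is a small example using NLTK's sentence-level BLEU. This is only for intuition about the metric; it is not the evaluation setup used in the paper, and the sentences are made up.

```python
# Illustration of BLEU: n-gram overlap between a candidate translation
# and one or more reference translations (example sentences are invented).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "is", "on", "the", "mat"]      # model output tokens

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # 1.0 would mean a perfect n-gram match
```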

The model:

Recurrent neural networks (Hopfield, 1982) are connectionist models that capture the dynamics of sequences through cycles in the network of nodes. Unlike a feedforward neural network, an RNN maintains a state that acts as a memory and can represent a sequence of vectors of arbitrary length. In a way, the RNN is a generalization of the neural network to sequence data.

A recurrent neural network maps an input sequence (x1, x2, …, xT) to an output sequence (y1, y2, …, yT’), where T and T’ are the lengths of the input and output sequences and may differ, by iterating this equation:
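
ht = sigm(W^hx · xt + W^hh · ht−1)
yt = W^yh · ht

Here sigm is the logistic sigmoid and the W matrices are the input-to-hidden, hidden-to-hidden, and hidden-to-output weights; this is the standard RNN recurrence given in the cited paper. A minimal NumPy sketch of that recurrence follows (the weight names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    """Iterate ht = sigm(W_hx xt + W_hh ht-1), yt = W_yh ht over the input sequence."""
    h = h0
    ys = []
    for x in xs:                     # xs is a list of input vectors x1 .. xT
        h = sigmoid(W_hx @ x + W_hh @ h)
        ys.append(W_yh @ h)
    return ys, h                     # per-step outputs and the final hidden state
```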

The goal of the LSTM is to estimate the conditional probability

p(y1, y2, …, yT’ | x1, x2, …, xT)

First, an LSTM called the encoder reads the input sequence (x1, x2, …, xT) and produces a fixed-dimensional vector v, its last hidden state. A second LSTM, often called the decoder, is initialized with v and learns the conditional probability distribution over (y1, y2, …, yT’), as shown in the following equation from the paper:
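
p(y1, …, yT’ | x1, …, xT) = ∏ (t = 1 … T’) p(yt | v, y1, …, yt−1)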

Each distribution p(yt | v, y1, …, yt−1) is represented with a softmax over all the words in the vocabulary. Every input and output sentence ends with a special symbol <EOS>, which enables the network to define a distribution over sequences of all possible lengths; for example, the network learns to map A, B, C, <EOS> to X, Y, Z, <EOS>.
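
As a rough sketch of what that softmax step and the <EOS> stopping condition look like in code, here is a greedy decoding loop in PyTorch. The layer names, sizes, and the use of <EOS> as the first decoder input are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, EOS = 10_000, 512, 2   # illustrative sizes and symbol id

decoder_lstm = nn.LSTMCell(hidden_size, hidden_size)   # consumes the embedded previous token
to_vocab = nn.Linear(hidden_size, vocab_size)           # projects the state to vocabulary logits
embed = nn.Embedding(vocab_size, hidden_size)

def greedy_decode(h, c, max_len=50):
    # In practice decoding starts from a start-of-sentence symbol; <EOS> is reused here for simplicity.
    token = torch.tensor([EOS])
    output = []
    for _ in range(max_len):
        h, c = decoder_lstm(embed(token), (h, c))
        probs = torch.softmax(to_vocab(h), dim=-1)   # distribution over all words in the vocabulary
        token = probs.argmax(dim=-1)                 # greedy choice of the next word
        if token.item() == EOS:                      # <EOS> ends the output sequence
            break
        output.append(token.item())
    return output
```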

The seq2seq model uses two different LSTMs for the input and output sequences. This increases the number of model parameters, which improves learning at negligible computational cost, and it makes it natural to train the network on multiple language pairs simultaneously. The seq2seq model also uses a deep LSTM rather than a shallow one; the original paper used a 4-layer LSTM, which improved performance remarkably, since stacked LSTM layers with enough parameters can learn different features from the word vectors. Lastly, the authors describe that reversing the order of the input sequence (but not the output sequence) improved accuracy: instead of mapping A, B, C, <EOS> to X, Y, Z, <EOS>, they mapped C, B, A, <EOS> to X, Y, Z, <EOS>.

Reversing the input sequence puts A close to its counterpart X, B close to Y, and C close to Z. This makes it easier for SGD to "establish communication" between corresponding source and target words, as the paper puts it.
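
Putting these pieces together, here is a minimal PyTorch sketch of the idea: a 4-layer encoder LSTM reads the reversed source sequence into a fixed-size state, and a separate 4-layer decoder LSTM generates the target from that state. The embedding sizes, hidden sizes, and vocabulary sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Two separate deep LSTMs: one encodes the source, one decodes the target."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Reverse the source sequence, following the paper's finding that this eases optimization.
        src = torch.flip(src, dims=[1])
        _, state = self.encoder(self.src_embed(src))    # state = (h, c): the fixed-size summary v
        dec_out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.out(dec_out)                        # logits over the target vocabulary

# Example with dummy token ids (batch of 2, source length 5, target length 6):
model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 5))
tgt = torch.randint(0, 8000, (2, 6))
logits = model(src, tgt)   # shape: (2, 6, 8000)
```

During training, the decoder is fed the reference target tokens (teacher forcing) and the logits are scored against the next token; at inference time, a loop like the greedy decoder sketched earlier generates the output word by word.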
