Building the Mighty Transformer for Sequence Tagging in PyTorch : Part II

11 min read · Apr 18, 2018

In the first part of this series we looked into the main components of the Transformer model — Multi Head attention and Positionwise Feedforward. Now let’s see how they work together.

Putting it all Together

PyTorch makes object-oriented design easy with nn.Module, so we can nest components just like in the paper. Following the paper’s layer and sublayer terminology, I have structured the code in three files:

sublayers.py: Defines the innermost components, namely MultiHeadAttention and PositionwiseFeedforward. These are used both by the Encoder and the Decoder layers.
layers.py: Defines a single layer of the Encoder and Decoder blocks, called EncoderLayer and DecoderLayer respectively. In the block diagram they are the grey regions.
main.py: Contains definitions of the top level Encoder and Decoder units, called Encoder and Decoder. It handles things like multiple layers, timing signal generation and attention masks, among others.

Let’s go through each of their forward() functions starting with the EncoderLayer:

I will not cover LayerNorm since it’s a standard affair. You can check the full implementation here. The important bit above is the call to multi head attention:

y = self.multi_head_attention(x_norm, x_norm, x_norm)

Make the queries, keys and values same, and voila we get self-attention! This term gets thrown around a lot these days and now you know it isn’t something very fancy. This is the component that replaces recurrent and convolutional layers, and, as the authors argue in their paper, it’s faster and more powerful. Dropout and residual connections are combined in one step:

x = self.dropout(x + y)
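To make the two quoted lines concrete, here is a minimal sketch of how an encoder layer along these lines might look. It is not the actual implementation: nn.MultiheadAttention and the small linear stack are stand-ins for the custom MultiHeadAttention and PositionwiseFeedforward sublayers, and the class name, attribute names and default sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Hypothetical pre-norm encoder layer; not the article's exact code."""

    def __init__(self, hidden_size=128, num_heads=4, filter_size=128, dropout=0.2):
        super().__init__()
        self.layer_norm_mha = nn.LayerNorm(hidden_size)
        self.layer_norm_ffn = nn.LayerNorm(hidden_size)
        # Stand-in for the custom MultiHeadAttention sublayer.
        self.multi_head_attention = nn.MultiheadAttention(
            hidden_size, num_heads, batch_first=True)
        # Stand-in for PositionwiseFeedforward (linear variant).
        self.positionwise_feed_forward = nn.Sequential(
            nn.Linear(hidden_size, filter_size),
            nn.ReLU(),
            nn.Linear(filter_size, hidden_size))
        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs):
        x = inputs
        # Self-attention: queries, keys and values are the same tensor.
        x_norm = self.layer_norm_mha(x)
        y, _ = self.multi_head_attention(x_norm, x_norm, x_norm)
        # The residual uses the un-normalized x; dropout wraps the sum.
        x = self.dropout(x + y)
        # The feedforward sublayer follows the same pattern.
        x_norm = self.layer_norm_ffn(x)
        y = self.positionwise_feed_forward(x_norm)
        return self.dropout(x + y)


layer = EncoderLayerSketch().eval()
out = layer(torch.randn(2, 10, 128))  # [batch, time, hidden]
```

The output keeps the input shape, which is what lets identical layers be stacked on top of each other.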

Note that the residual input is tapped before layer normalization. The DecoderLayer is similar:

The first difference we note here is that the DecoderLayer takes the inputs as a tuple consisting of the decoder inputs as well as the encoder outputs. I’ve used a tuple so that I can easily pass the inputs through multiple layers using nn.Sequential. Encoder outputs are needed for attention.

The DecoderLayer does multi head attention twice. First it does a self-attention just like the EncoderLayer, but with future time steps ‘masked’ (more on that later I promise). The second call uses the encoder outputs as both keys and values:

y = self.multi_head_attention_enc_dec(x_norm, encoder_outputs, encoder_outputs)

The queries come from the self-attention outputs (with some post processing). To remind you again, the queries for all decoder time steps are available together during training and evaluation due to the nature of the Transformer. For a recurrent network you can only get the query one time step at a time, since the query at the next time step depends on the attention outputs at the current time step. This is a major factor slowing down recurrent networks and making them hard to parallelize. During inference, however, the Transformer also needs to decode one time step at a time. The only way to do it is to run the decoder multiple times, each time expanding the decoder inputs with the newly obtained outputs. It is a slow process requiring repeated processing of the same inputs. To partly speed it up, the TensorFlow implementation of the Transformer caches the attention outputs for each time step (I’m yet to implement caching in PyTorch).
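The inference loop described above can be sketched in a few lines of plain Python. Everything here is hypothetical: decoder_step stands in for a full decoder pass that returns the next token id, and the toy version below only exists to count how much input gets re-processed without caching.

```python
def greedy_decode(decoder_step, start_id, end_id, max_len):
    """Decode one token at a time, re-running the decoder on the growing prefix.

    decoder_step is a hypothetical callable: it takes the list of token ids
    produced so far and returns the id of the next token.
    """
    tokens = [start_id]
    for _ in range(max_len):
        next_id = decoder_step(tokens)  # re-processes the whole prefix each time
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens[1:]  # drop the start symbol


# Toy stand-in for a real decoder: emits last + 1 and records how many
# input positions it had to process in total.
processed = {"positions": 0}


def toy_step(prefix):
    processed["positions"] += len(prefix)
    return prefix[-1] + 1


result = greedy_decode(toy_step, start_id=0, end_id=4, max_len=10)
```

Four output tokens cost ten processed positions here (1 + 2 + 3 + 4), which is the quadratic overhead that caching is meant to reduce.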

Finally at the top level we have the Encoder and the Decoder:

The embedding projection linearly projects the embedding outputs (external to the Transformer) to a common hidden size:

self.embedding_proj = nn.Linear(embedding_size, hidden_size, bias=False)

Next, a ‘timing signal’ is added so that the model can track sequence ordering better. We’ll go into that in the next section. The encoder layer is applied multiple times using nn.Sequential:

self.enc = nn.Sequential(*[EncoderLayer(*params) for l in range(num_layers)])

params is a tuple of hyperparameters which is identical for each layer. The Decoder implementation is very similar:

Like the DecoderLayer the main difference is that there is an additional input coming from the encoder outputs. This is combined with the decoder inputs as a tuple and fed into the DecoderLayer component:

y, _ = self.dec((x, encoder_output))
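Here is a small sketch of why the tuple trick works with nn.Sequential: each module receives the previous module's return value as a single argument, so a layer that unpacks and re-packs the same tuple can be chained freely. TupleLayer is a hypothetical toy layer, not the article's DecoderLayer; the mean over encoder states is only a placeholder for attention.

```python
import torch
import torch.nn as nn


class TupleLayer(nn.Module):
    """Hypothetical layer that accepts and returns (x, encoder_outputs)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, inputs):
        x, encoder_outputs = inputs
        # Toy stand-in for attention: mix in the mean encoder state.
        context = encoder_outputs.mean(dim=1, keepdim=True)
        x = x + self.proj(context)
        # Return the same tuple structure so the next layer can unpack it.
        return x, encoder_outputs


num_layers, hidden_size = 3, 8
dec = nn.Sequential(*[TupleLayer(hidden_size) for _ in range(num_layers)])

x = torch.randn(2, 5, hidden_size)               # decoder inputs
encoder_output = torch.randn(2, 7, hidden_size)  # encoder outputs
y, passed_through = dec((x, encoder_output))
```

The second element of the returned tuple is simply the encoder output handed along unchanged, which is why the call above discards it with an underscore.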

The final outputs of the Decoder (or Encoder for our task) go through an output layer which first projects to a dimension equal to the size of the output vocabulary. After that there can be a Softmax layer or, in our case, a CRF layer. That completes the pipeline! Now for those additional tricks.

Masked Attention — No Peeking into the Future

I mentioned I would cover the attention bias mask later when going through the code of MultiHeadAttention. For tasks like translation the decoder is fed previous outputs as input to predict the next output. During training the quick way to get the previous outputs is to shift the training labels right (the first time step gets a special symbol) and feed them as decoder inputs, a technique known as Teacher Forcing in machine learning parlance. However this presents a problem for the Transformer decoder as it can ‘cheat’ by using inputs from future time steps. The places where the short circuiting can happen are the self attention step and both the feedforward steps. (Can you figure out why it cannot happen in the normal attention step?)

In the self attention step we feed values from all time steps to the MultiHeadAttention component. Recall that we do a weighted linear combination of the Values input:

Consider the first row of OUTPUT in the above diagram. It corresponds to the attention output at time t=1. But it is computed from values right up till t=10 which are future time steps. To prevent reading these future values we zero out all weights in the WEIGHTS tensor above the main diagonal. This will ensure that future values cannot creep in:

In practice we do not zero out the weight tensor directly as it would need additional normalization to ensure all probabilities add up to one. Instead we add negative infinity to the upper triangle before the Softmax; after exponentiation those values become zero. Also we precompute this matrix if the maximum length is known:
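A minimal sketch of such a precomputed mask, assuming a four-dimensional [batch, heads, queries, keys] layout for the logits. The helper name make_bias_mask and the sizes are illustrative choices, not taken from the repository.

```python
import torch


def make_bias_mask(max_length):
    """Return a [1, 1, max_length, max_length] mask: -inf above the diagonal, 0 elsewhere."""
    mask = torch.triu(
        torch.full((max_length, max_length), float("-inf")), diagonal=1)
    return mask.unsqueeze(0).unsqueeze(0)


bias_mask = make_bias_mask(10)             # precomputed once for the maximum length
logits = torch.randn(1, 2, 4, 4)           # [batch, heads, queries, keys]
logits = logits + bias_mask[:, :, :4, :4]  # cropped to the current length
weights = torch.softmax(logits, dim=-1)    # masked positions get zero weight
```

Because the masking happens before the Softmax, every row of weights still sums to one with no extra normalization.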

A cropped version of the mask is applied in the MultiHeadAttention component:

if self.bias_mask is not None:            
  logits += Variable(
   self.bias_mask[:, :, :logits.shape[-2], :logits.shape[-1]]
   .type_as(logits.data))

A similar problem occurs in the PositionwiseFeedforward component if we use convolutional layers. A 1D convolution with a filter of width 3 will access the input at time t+1 to compute the output at time t:

This is due to the way the inputs are automatically padded when we want to keep the output sequence length the same as the input sequence length. A slight change in padding makes the problem go away:

Now the output at time t will only depend on the inputs at time t, t-1 and t-2. The nn.Conv1d API in PyTorch doesn’t support this type of padding, so we do it ourselves:
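One way to do the left padding by hand is sketched below, assuming inputs shaped [batch, time, channels]. LeftPaddedConv1d is a hypothetical name; the idea is simply to pad kernel_size - 1 zeros at the start of the time axis with F.pad and run nn.Conv1d with no padding of its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LeftPaddedConv1d(nn.Module):
    """Hypothetical width-k convolution that only sees current and past steps."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=0)

    def forward(self, x):
        # x: [batch, time, channels]; Conv1d expects [batch, channels, time].
        x = x.transpose(1, 2)
        x = F.pad(x, (self.left_pad, 0))  # pad only the start of the time axis
        return self.conv(x).transpose(1, 2)


torch.manual_seed(0)
conv = LeftPaddedConv1d(4, 4)
a = torch.randn(1, 6, 4)
b = a.clone()
b[:, 3:, :] += 1.0  # change only time steps 3, 4 and 5
out_a, out_b = conv(a), conv(b)
```

Comparing out_a and out_b is a quick leak check: the outputs for time steps 0 to 2 should be identical, since only later inputs were changed.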

For our sequence tagging task we use only the encoder part of the Transformer and do not feed the outputs back into the encoder. So we will not be using either the bias mask or left padding. Next is the final trick.

Positional Encoding — The Timekeeper

An inherent feature of recurrent neural networks is their ability to track the ordering of input sequences due to the internal state that they maintain. With recurrent layers gone from the Transformer, it cannot easily distinguish the input at one time step from the other. To make up for this lack of statefulness, extra positional information needs to be added to make each time step unique. An earlier work called Convolutional Sequence to Sequence Learning used positional embeddings, which basically means adding randomly initialized, trainable vectors unique to each time step. Then the model can learn to identify the absolute positions of the inputs. The creators of the Transformer one-upped this technique by adding sinusoids, which they claim can track relative positions as well since sinusoids are cyclic functions. Our implementation uses sinusoids and is adapted straight from the TensorFlow implementation:

Here channels is the hidden size of our model. Starting from a time scale of 1 we generate sin and cos signals of exponentially increasing wavelengths or reducing frequency (hence -log_timescale_increment in line 13) for each dimension till it reaches 10,000. These numbers were arrived at by experimentation I believe so they might need to change for a new task. Like bias masking we precompute these values and add cropped versions to the inputs (after embedding) during run time:

x += Variable(
     self.timing_signal[:, :inputs.shape[1], :]
     .type_as(inputs.data))
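For reference, a timing signal of this kind can be sketched as follows. This is a PyTorch rendering written from the description above, not a copy of the repository code: the function name, the default time scales of 1 and 10,000, and the guard against a single time scale are assumptions.

```python
import math
import torch


def gen_timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal timing signal of shape [1, length, channels]."""
    position = torch.arange(length, dtype=torch.float32)
    num_timescales = channels // 2
    # The max() guard (an addition here) avoids dividing by zero for channels < 4.
    log_timescale_increment = (
        math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1))
    # Frequencies shrink geometrically, hence the minus sign.
    inv_timescales = min_timescale * torch.exp(
        torch.arange(num_timescales, dtype=torch.float32)
        * -log_timescale_increment)
    scaled_time = position.unsqueeze(1) * inv_timescales.unsqueeze(0)
    signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)
    if channels % 2:  # pad one zero column when channels is odd
        signal = torch.nn.functional.pad(signal, (0, 1))
    return signal.unsqueeze(0)


timing_signal = gen_timing_signal(256, 128)  # precomputed for max_length
x = torch.zeros(2, 10, 128)                  # stand-in for embedded inputs
x = x + timing_signal[:, :x.shape[1], :]     # cropped to the batch length
```

The first half of the channels holds the sin components and the second half the cos components, so position 0 is all zeros followed by all ones.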

With these two tricks applied our Transformer is finally complete. Let’s see it in action now!

Transformer vs. BiLSTM — The Face-Off

Our Transformer model will be going against the all-too-well-known bidirectional LSTM on the CoNLL 2000 chunking task. It’s quite an old task with the state of the art F1 values hovering around 94–95%. Unlike the top models the only input feature we’re going to use is raw text — no part of speech tags, spelling or capitalization features. However we will use external word embeddings (GloVe 200D and Char N-Grams to be precise) and character embeddings. Both models will use a conditional random field (CRF) layer on the output side. We will only be using the encoder part of the Transformer for this experiment since it is a one to one mapping task.

There is a whole bunch of things that need to be done before we can actually train these models. Conveniently for us I have written a small framework that will take care of the boilerplate. I call it TorchNLP and it does things like:

Handling the data pipeline (Uses TorchText under the hood)
Loading and saving models
Training models with configurable hyperparameters
Evaluating models with metrics like accuracy, F1 etc.

The Transformer and bidirectional LSTM along with common components like CRFs are already baked into the framework so all we have to do is run the experiments.

Installation is very simple. You will need Python 3.5+ to run it though. Clone TorchNLP from GitHub:

git clone https://github.com/kolloldas/torchnlp.git

Set up PyTorch with or without GPU support (preferably in a new Python 3 virtual environment). Go to the root of the TorchNLP project and install the dependencies:

pip install -r requirements.txt

That’s it for installation. Start the chunking task:

python -i -m torchnlp.chunk

This will load up the environment for chunking:

Task: Chunking (Shallow parsing)

Available models:
-------------------
TransformerTagger
Sequence tagger using the Transformer network (https://arxiv.org/pdf/1706.03762.pdf)
Specifically it uses the Encoder module. For character embeddings (per word) it uses the same Encoder module above which an additive (Bahdanau) self-attention layer is added

BiLSTMTagger
Sequence tagger using bidirectional LSTM. For character embeddings per word uses (unidirectional) LSTM

Available datasets:
-------------------
conll2000: Conll 2000 (Chunking)

>>>

(If you get any errors at this point, something went wrong with your installation. Please describe the error in the comments section and I’ll be happy to help.)

Let’s see what hyperparameters our models have. Here is the BiLSTM:

>>> hparams_lstm_chunk()
Hyperparameters:
 dropout=0.5
 learning_rate_decay=noam_step
 optimizer_adam_beta2=0.98
 optimizer_adam_beta1=0.9
 embedding_size_word=300
 learning_rate_warmup_steps=100
 hidden_size=100
 embedding_size_tags=100
 embedding_size_char=25
 learning_rate=0.05
 max_length=256
 batch_size=100
 num_hidden_layers=2
 use_crf=True
 embedding_size_char_per_word=25

And the Transformer:

>>> hparams_transformer_chunk()
Hyperparameters:
 relu_dropout=0.2
 learning_rate_decay=noam_step
 optimizer_adam_beta2=0.98
 optimizer_adam_beta1=0.9
 embedding_size_word=300
 attention_key_channels=0
 filter_size_char=64
 embedding_size_char=16
 embedding_size_tags=100
 embedding_size_char_per_word=100
 dropout=0.2
 filter_size=128
 use_crf=True
 num_hidden_layers=2
 attention_value_channels=0
 hidden_size=128
 input_dropout=0.2
 learning_rate_warmup_steps=500
 num_heads=4
 learning_rate=0.2
 max_length=256
 batch_size=100
 attention_dropout=0.2

The Transformer’s got way more knobs to turn than the BiLSTM model. Also note that it is much smaller in terms of parameters than the original model since the task itself is so small. The Transformer will go first. Start the training:

>>> train('chunking', TransformerTagger, conll2000)

You should see the datasets and the word embeddings being downloaded and extracted first. (Be warned that the word embeddings are around 2GB in size.) After that the training will start with a bar showing the progress in each epoch. Training will stop automatically once the F1 value on the validation data peaks (using early stopping). For me it was after 35 epochs:

INFO:torchnlp.common.train:Epoch 35 (2754)
INFO:torchnlp.common.train:Train Loss: 1.79630
INFO:torchnlp.common.train:Validation metrics:
INFO:torchnlp.common.train:loss=2.21440, F1=0.94471, acc-seq=0.62640, recall=0.94353, acc=0.96548, precision=0.94589
INFO:torchnlp.common.train:Early stopping at iteration 2430, epoch 29, F1=0.94504
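The log shows training running to epoch 35 while the best F1 is reported for epoch 29. That is consistent with patience-based early stopping, though the exact rule TorchNLP uses is a guess from the logs. A minimal sketch of the idea, with a hypothetical EarlyStopping helper and made-up F1 values:

```python
class EarlyStopping:
    """Hypothetical helper: stop when the metric has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_value = float("-inf")
        self.best_epoch = None

    def should_stop(self, epoch, value):
        if value > self.best_value:
            self.best_value, self.best_epoch = value, epoch
        return epoch - self.best_epoch >= self.patience


# Made-up validation F1 values, purely for illustration.
f1_per_epoch = [0.90, 0.93, 0.94, 0.935, 0.939, 0.938]
stopper = EarlyStopping(patience=3)
for epoch, f1 in enumerate(f1_per_epoch, start=1):
    if stopper.should_stop(epoch, f1):
        break
```

The model checkpoint to keep is the one from stopper.best_epoch, not the one from the epoch where training stopped.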

Looks good. BiLSTM’s turn:

>>> train('chunking', BiLSTMTagger, conll2000)
...
INFO:torchnlp.common.train:Epoch 26 (2025)
INFO:torchnlp.common.train:Train Loss: 0.68408
INFO:torchnlp.common.train:Validation metrics:
INFO:torchnlp.common.train:loss=2.32462, F1=0.94619, acc-seq=0.61521, recall=0.94749, acc=0.96784, precision=0.94488
INFO:torchnlp.common.train:Early stopping at iteration 1701, epoch 20, F1=0.94669

Hmm the BiLSTM has a slightly better F1. It also converged faster. Time to turn some of those knobs on the Transformer. Switching to 2 heads:

>>> h2 = hparams_transformer_chunk().update(num_heads=2)
>>> train('chunking_h2', TransformerTagger, conll2000, hparams=h2)
...
INFO:torchnlp.common.train:Epoch 54 (4293)
INFO:torchnlp.common.train:Train Loss: 1.65368
INFO:torchnlp.common.train:Validation metrics:
INFO:torchnlp.common.train:loss=2.19627, F1=0.94715, acc-seq=0.63087, recall=0.94915, acc=0.96771, precision=0.94515
INFO:torchnlp.common.train:Early stopping at iteration 3969, epoch 48, F1=0.94746

Ok now the Transformer is ahead. We could keep tweaking hyperparameters for a long time this way (that’s what research is about after all) but without the luxury of a deep learning cluster to blaze through scores of combinations it will be hard to make progress. Let’s evaluate both models on the test set. The BiLSTM gets:

>>> evaluate('chunking', BiLSTMTagger, conll2000, 'test')
...
test set evaluation: chunking-BiLSTMTagger
loss=3.19308, F1=0.94238, acc-seq=0.61133, recall=0.94206, acc=0.96500, precision=0.94269

94.24% F1. And the Transformer gets:

>>> evaluate('chunking_h2', TransformerTagger, conll2000, 'test')
test set evaluation: chunking_l2-TransformerTagger
loss=2.97885, F1=0.94319, acc-seq=0.61581, recall=0.94424, acc=0.96541, precision=0.94215

94.32% F1 😑. It’s too close to call as 0.1% is well within error margins. Unless we do a large number of experiments we cannot call it a clear win. It’s a tie for now.
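As a quick sanity check on the numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the logged values. The precision and recall figures below are copied from the two test-set logs; the function name is just for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the test-set logs above.
f1_bilstm = f1_score(0.94269, 0.94206)
f1_transformer = f1_score(0.94215, 0.94424)
gap = f1_transformer - f1_bilstm
```

Both values agree with the logged F1 to within rounding, and the gap between the two models comes out below 0.1 percentage points.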

I would encourage you to play around with the hyperparameters or even change the architecture to see if the Transformer can do any better. If you hit the jackpot let me know!

Lessons Learned

Our homemade Transformer couldn’t really outdo the BiLSTM by a significant margin. While it had fewer trainable parameters (about 200K fewer than the BiLSTM), it took longer to converge. So what can we say about the Transformer’s overall architecture?

One thing that I found in the course of experimentation is that the Transformer does worse if we use linear layers instead of convolutional ones in the PositionwiseFeedforward component. Remember that we used only the Encoder half of the Transformer for this task (I did try adding the decoder, but it did not improve the results). So I have a hunch that self attention by itself is unable to make inputs at multiple time steps interact multiplicatively, which is what could capture more complex phenomena. It takes care of pairwise multiplicative interactions between different time steps but does not handle, say, 3–4 steps together. Using convolutional layers with kernel size 3 could be partly solving this problem, but the interaction there is additive. Having two layers instead of one also improved the F1 because it allowed more than two time steps to interact nonlinearly. But these interactions are weaker than those in a recurrent network, which has strong nonlinear interactions between multiple nearby time steps. So in the end the gains that the Transformer made with the self attention and feedforward combination are probably lost to this inability. Again, more experimentation is needed.

In the near future perhaps a hybrid model will come along that does weird combinatorial acrobatics to handle multi time step interactions while still being fast at the same time. Maybe one is in the works at Google already, who knows!

NOTE: I have left out some implementation details (e.g. the Noam learning rate schedule) to keep the article length reasonable. You can go through the code to understand these parts, which I hope are self-explanatory. If you have any specific queries please ask in the comments section and I’ll do my best to answer them.