[Notes] Improving Language Understanding by Generative Pre-Training

Exercise: Reconstructing the Language Model from the Fine-Tuned Model



The field of NLP has been creating some buzz this year (in case you haven’t heard, NLP’s ImageNet moment has arrived). When I finally found some time to read some of the new papers, I felt this paper by OpenAI [1] was interesting enough to dig deeper into:

It expands the work by Howard and Ruder [2]. The main differences include:

  1. Use transformer networks [3] instead of LSTMs to better capture long-term linguistic structure.
  2. Demonstrate the effectiveness of the approach on a wider range of tasks.
  3. Include auxiliary training objectives (e.g. language modeling) in addition to the task objective when fine-tuning.

(The transformer network used is a variant called Transformer decoder, proposed in [4])

General Model Structure. Taken from [1]

The Exercise

The authors open-sourced a GitHub repo containing the code (using TensorFlow) and the pre-trained model weights needed to reproduce the results (currently only the ROCStories Cloze dataset [5, 6] is supported as the fine-tuning target). According to the paper, the pre-trained language model is trained on the BooksCorpus dataset [7].

Honestly, the code could use some work on its readability, and the low-level TensorFlow APIs it uses are notoriously hard to comprehend and modify (BTW, there is a PyTorch implementation available). So I gave myself an exercise: reconstruct the pre-trained language model. Since a neural network can be entirely ruined by very subtle bugs or mistakes, you need to understand the code and the model to ensure the final model produces the correct results.

The reconstructed language model can be used to:

  1. Inspect what the language model has learned. Feed it some texts and see what prediction it makes.
  2. Re-train the language model with a different corpus.

The rest of this post will cover some key ideas and code chunks that are essential to completing this exercise. The full code is published as a fork of the original OpenAI repo (check out the notebook in the root folder first):

Target Task: Story Cloze Test

Because the model published by the authors specifically targets this kind of task, we first need to understand what the task is and how the model handles it.

‘Story Cloze Test’ is a new commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story.
Taken from [5]
Model Structure for Multiple Choice Tasks. Taken From [1]

This is a binary choice problem, so we’ll create two sequences and feed them to the same transformer network. Three new special tokens are added to the vocabulary from the pre-trained model — <start>, <delim>, and <extract>.


We used a bytepair encoding (BPE) vocabulary with 40,000 merges

Bytepair encoding [8] starts by treating individual characters as tokens, and then iteratively merges the most common token pair N times. The resulting token vocabulary breaks rare words into several chunks consisting of more common character sequences.

It is implemented in the TextEncoder class. With 40,000 merges, most common words are already combined into a single token. You probably shouldn’t worry about this part unless you want to train the model on other languages. The two things we’ll be using are the TextEncoder.encode() method and the TextEncoder.decoder attribute.
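To make the merge procedure concrete, here is a minimal toy sketch of learning BPE merges. It is not the TextEncoder implementation from the repo: the function name learn_bpe is hypothetical, and it skips details such as end-of-word markers that real BPE vocabularies use.

```python
from collections import Counter


def learn_bpe(words, n_merges):
    """Learn up to `n_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(n_merges):
        # Count adjacent token pairs, weighted by how often the word occurs.
        pairs = Counter()
        for tokens, freq in corpus.items():
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most common pair with one merged token.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe(["low", "low", "lower", "lowest", "newest"], n_merges=3)
print(merges)
print(dict(corpus))
```

After a few merges the frequent word "low" has become a single token, while the rarer words are still split into smaller pieces.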


After tokenization, the resulting tokens need to be arranged into NumPy arrays that are ready to be fed into the neural network as inputs. This is done in the transform_roc function:

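# encoder, clf_token, n_ctx, max_len, n_vocab and n_special are globals
# defined elsewhere in the training script.
def transform_roc(X1, X2, X3):
    n_batch = len(X1)
    # Last axis: channel 0 holds token ids, channel 1 holds position ids.
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3) in enumerate(zip(X1, X2, X3)):
        # One sequence per candidate ending: context + ending + extract token.
        x12 = [start]+x1[:max_len]+[delimiter]+x2[:max_len]+[clf_token]
        x13 = [start]+x1[:max_len]+[delimiter]+x3[:max_len]+[clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    # Position ids live after the token and special-token ids.
    xmb[:, :, :, 1] = np.arange(
        n_vocab+n_special, n_vocab+n_special+n_ctx)
    return xmb, mmb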

X1 holds the context tokens (the four sentences), X2 the tokens of ending 1, and X3 the tokens of ending 2. xmb contains both tokens and position indices and has shape (batch_size, choice, sequence_length, token/position). mmb contains the masks that mark which positions hold real tokens (1 for a real token, 0 once the sequence has ended); they are used when calculating the language model losses.

Position indices are used in the transformer network to incorporate order information into the network. Later the indices are mapped to learned embeddings, as opposed to the sinusoidal function used in the original transformer:

We used learned position embeddings instead of the sinusoidal version proposed in the original work.
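As a rough illustration of how the two channels of xmb can be turned into one input vector per position, here is a NumPy sketch. The sizes are toy values and the random matrix stands in for the learned embeddings; the only assumption carried over from the code above is that position ids are offset so they index rows placed after the token and special-token rows of a single shared matrix.

```python
import numpy as np

n_vocab, n_special, n_ctx, n_embd = 10, 3, 6, 4  # toy sizes, not the real ones
rng = np.random.default_rng(0)

# One matrix holds token, special-token and position embeddings, in that order.
we = rng.normal(scale=0.02, size=(n_vocab + n_special + n_ctx, n_embd))

# A single toy sequence: channel 0 = token ids, channel 1 = position ids.
xmb = np.zeros((1, n_ctx, 2), dtype=np.int32)
xmb[0, :, 0] = [n_vocab, 4, 7, 1, 0, 0]
xmb[:, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)

# Look up both channels and sum them: token embedding + position embedding.
h0 = we[xmb].sum(axis=2)
print(h0.shape)
```

Because the position rows are ordinary trainable rows of the matrix, nothing about them is fixed in advance, unlike the sinusoidal encoding.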

Modifying the Transformation

The transformation function for the reconstructed language model:

def transform_texts(list_of_texts):
    tokens = TEXT_ENCODER.encode(list_of_texts, verbose=False)
    n_batch = len(tokens)
    xmb = np.zeros((n_batch, N_CTX, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, N_CTX), dtype=np.float32)
    for i, x in enumerate(tokens):
        x1 = x[:N_CTX]
        l1 = len(x1)
        print(f"length: {l1}")
        xmb[i, :l1, 0] = x1
        mmb[i, :l1] = 1
    xmb[:, :, 1] = np.arange(N_VOCAB, N_VOCAB+N_CTX)
    return xmb, mmb

This is really straightforward. It removes the extra choice dimension and takes out the special tokens. (The tokenization is done inside the function for the sake of convenience.)

The Model

Transformer Network

We’re not going to cover the transformer in this post. The transformer blocks can be copied directly into the language model without any modification. Check out these two great resources if you’re interested in the inner workings of the transformer:

Modifying the Input Layer

Code from the supervised model:

we = tf.get_variable(
    "we", [n_vocab+n_special+n_ctx, n_embd],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
X = tf.reshape(X, [-1, n_ctx, 2])
M = tf.reshape(M, [-1, n_ctx])

As we do not need the special tokens anymore, we need to remove n_special from the embedding matrix initialization.

There’s no other modification needed as X and M are reshaped and agnostic to the choice dimension.

Modifying the Last Layer

The classifier can be completely removed without looking into its details (but I encourage you to, as it involves some clever tensor manipulations). What we need to modify is the language modeling part:

lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
lm_logits = tf.matmul(lm_h, we, transpose_b=True)
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

The above code uses the hidden outputs from positions 0 to L-1 (L is the length of the longest sequence) to generate predictions for positions 1 to L. In contrast, the reconstructed model uses positions 0 to L to generate predictions for positions 1 to L+1, so that we can generate more text based on the existing text:

lm_h = tf.reshape(h, [-1, N_EMBD])
lm_logits = tf.reshape(
    tf.matmul(lm_h, we[:N_VOCAB, :], transpose_b=True),
    [-1, N_CTX, N_VOCAB])
# Drop the last position: there is no known next token to score it against.
lm_logits_truncated = tf.reshape(
    lm_logits[:, :-1],
    [-1, N_VOCAB])
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits_truncated,
    labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

But of course, we cannot calculate the loss for predictions whose answers we don’t know, so we need lm_logits_truncated when calculating losses.
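The shift-and-mask logic of the loss can be checked in isolation with plain NumPy. The sketch below is not the TensorFlow graph code; masked_lm_loss is a hypothetical helper that mirrors the same steps: drop the last prediction, score position t against the token at t+1, and average only over unmasked targets.

```python
import numpy as np


def masked_lm_loss(logits, tokens, mask):
    """Mean next-token cross-entropy per sequence, ignoring padded positions.

    logits: (batch, n_ctx, n_vocab); tokens and mask: (batch, n_ctx).
    """
    # The prediction at position t is scored against the token at t+1,
    # so the last prediction has no target and is dropped.
    logits = logits[:, :-1]
    targets = tokens[:, 1:]
    target_mask = mask[:, 1:]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token.
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return (nll * target_mask).sum(axis=1) / target_mask.sum(axis=1)


# Toy check: with uniform logits the loss equals log(vocabulary size).
toy_logits = np.zeros((2, 4, 5))
toy_tokens = np.array([[1, 2, 3, 4], [0, 1, 0, 0]])
toy_mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]], dtype=np.float32)
toy_loss = masked_lm_loss(toy_logits, toy_tokens, toy_mask)
print(toy_loss)
```

The second toy sequence has only one valid target, so its average is taken over a single position rather than three.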

Another change I made is to use only the rows of the embedding matrix that actually correspond to tokens when calculating logits, that is, we[:N_VOCAB, :] instead of we. With the latter, the output layer also includes the learned position embeddings and generates predictions for position indices, which does not make any sense. I don’t think we need those embeddings in the supervised model, either.
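A small NumPy sketch of the difference, with toy sizes and random values standing in for the real weights and hidden states: slicing the embedding matrix before the matrix product removes the columns that would otherwise score position indices.

```python
import numpy as np

n_vocab, n_ctx, n_embd = 8, 5, 4  # toy sizes
rng = np.random.default_rng(0)
we = rng.normal(size=(n_vocab + n_ctx, n_embd))  # token rows, then position rows
h = rng.normal(size=(2, n_ctx, n_embd))          # stand-in for transformer outputs

full_logits = h @ we.T              # also assigns scores to position rows
token_logits = h @ we[:n_vocab].T   # scores real tokens only

print(full_logits.shape, token_logits.shape)
```

The token columns are identical in both cases; the slice only drops the extra columns, so the softmax is normalized over real tokens alone.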

Loading The Pre-trained Weights

The pre-trained weights are stored as NumPy arrays. What we need to do is remove the initialization of the special token embeddings from the original code.

This is the part where things are most likely to go wrong, because all (TensorFlow) variables must be initialized with the same shapes and in the same order as in the pre-trained model. If you mess with the variables when reconstructing the language model, the weights won’t be loaded properly.
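The general idea can be sketched with synthetic data. Everything specific here is an assumption made for illustration: the weights are modeled as one flat array plus a list of shapes, the helper unflatten is hypothetical, and the position embeddings are assumed to come before the token embeddings. Check the actual loading code in the repo for the real layout.

```python
import numpy as np


def unflatten(flat, shapes):
    """Split one flat array into arrays with the given shapes, in order."""
    offsets = np.cumsum([int(np.prod(s)) for s in shapes])
    chunks = np.split(flat, offsets)[:-1]  # the last chunk is the empty tail
    return [c.reshape(s) for c, s in zip(chunks, shapes)]


n_vocab, n_pos, n_ctx, n_embd = 8, 6, 4, 3  # toy sizes
# Assumed order: position embeddings, token embeddings, then other weights.
shapes = [(n_pos, n_embd), (n_vocab, n_embd), (n_embd, n_embd)]
flat = np.arange(sum(int(np.prod(s)) for s in shapes), dtype=np.float32)

pos_emb, tok_emb, other = unflatten(flat, shapes)

# Build the embedding matrix without any special-token rows:
# token rows first, then only the first n_ctx position rows.
we_init = np.concatenate([tok_emb, pos_emb[:n_ctx]], axis=0)
print(we_init.shape)
```

Because the split relies purely on order and shape, a single extra or missing variable shifts every later chunk, which is why the order matters so much.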

Inspecting the Reconstructed Language Model

Feeding the example four-sentence stories into the model, we can get the next-token predictions:

Take position 9 as an example: the story up to this position is “Karen was assigned a roommate her first year of …”. The correct next token is college, and the model predicted it correctly. The second and third most probable tokens according to the model are school and grad. Both are quite reasonable choices. So it appears that the reconstructed model works! (You might want to run more tests to be sure.)
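Ranking candidates like this takes only a few lines once you have the logits for one position. The sketch below uses made-up logits and a four-entry toy decoder; top_k_tokens is a hypothetical helper, and the only assumption about the real code is that the decoder maps token ids to token strings.

```python
import numpy as np


def top_k_tokens(logits, decoder, k=3):
    """Return the k most probable (token, probability) pairs for one position."""
    # Softmax over the vocabulary, shifted for numerical stability.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Indices of the k largest probabilities, most probable first.
    top = np.argsort(probs)[::-1][:k]
    return [(decoder[int(i)], float(probs[i])) for i in top]


# Made-up values for illustration only.
toy_decoder = {0: "the", 1: "college", 2: "grad", 3: "school"}
toy_position_logits = np.array([0.1, 3.0, 1.0, 2.0])
ranked = top_k_tokens(toy_position_logits, toy_decoder, k=3)
print(ranked)
```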

Thank You!

I skipped some moderately important details in this post, mainly because I ran out of energy and thus the will to make this post even longer 😁. If you find anything that is not explained clearly enough, or is missing entirely, please feel free to leave me a note. I’d be happy to correct it.


  1. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
  2. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification.
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need.
  4. Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, Ł., & Shazeer, N. (2018). Generating Wikipedia by Summarizing Long Sequences.
  5. Story Cloze Test and ROCStories Corpora
  6. Mostafazadeh, N., Roth, M., Louis, A., Chambers, N. W., & Allen, J. F. (2017). LSDSem 2017 Shared Task: The Story Cloze Test.
  7. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books.
  8. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units.