# [Notes] Improving Language Understanding by Generative Pre-Training

## Exercise: Reconstructing the Language Model from the Fine-Tuned Model

#### Introduction

The field of NLP has been creating some buzz this year (in case you haven’t heard, NLP’s ImageNet moment has arrived). When I finally found some time to read some of the new papers, I felt that this paper by OpenAI [1] was interesting enough for me to dig deeper:

**Improving Language Understanding with Unsupervised Learning**

*We've obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system…* (blog.openai.com)

It expands the work by Howard and Ruder [2]. The main differences include:

- Use **transformer networks** [3] instead of LSTMs to better capture long-term linguistic structure.
- Demonstrate the effectiveness of the approach on a **wider range of tasks**.
- Include **auxiliary training objectives** (e.g. language modeling) in addition to the task objective when fine-tuning.

(The transformer network used is a variant called *Transformer decoder*, proposed in [4])

#### The Exercise

The authors open-sourced a GitHub repo containing the code (using TensorFlow) and the pre-trained model weights needed to reproduce the results (currently only the ROCStories Cloze dataset [5, 6] is supported as the fine-tuning target). According to the paper, the language model was pre-trained on the BooksCorpus dataset [7].

Honestly, the code could use some work to improve its readability, and the low-level TensorFlow APIs it uses are notoriously hard to comprehend and modify (BTW, there is a PyTorch implementation available). So I gave myself an exercise: **reconstruct the pre-trained language model**. Since a neural network can be entirely ruined by very subtle bugs or mistakes, you need to understand the code and the model to ensure the final model produces the correct results.

The reconstructed language model can be used to:

- Inspect what the language model has learned. Feed it some texts and see what prediction it makes.
- Re-train the language model with a different corpus.

The rest of this post will cover some key ideas and code chunks that are essential to completing this exercise. The full code is published as a fork of the original OpenAI repo (check out the notebook in the root folder first):

**ceshine/finetune-transformer-lm**

*finetune-transformer-lm - Code and model for the paper "Improving Language Understanding by Generative Pre-Training"* (github.com)

### Target Task: Story Cloze Test

Because the model published by the authors specifically targets this kind of task, we first need to understand what the task is and how the model handles it.

‘Story Cloze Test’ is a new commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story.

This is a binary choice problem, so we’ll create two sequences and feed them to the same transformer network. **Three new special tokens** are added to the vocabulary from the pre-trained model: `<start>`, `<delim>`, and `<extract>`.

#### Tokenization

We used a bytepair encoding (BPE) vocabulary with 40,000 merges

Bytepair encoding [8] starts by treating individual characters as tokens, and then iteratively merges the most common token pair *N* times. The resulting vocabulary breaks rare words into several chunks consisting of more common character sequences.
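To make the merge loop concrete, here is a minimal sketch of learning BPE merges on a toy word list. It is not the code from the repository: the function name `learn_bpe` and the tiny corpus are made up for illustration, and it skips details that real implementations usually handle, such as word-boundary markers.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(
    ["low", "low", "lower", "lowest", "newest", "newest"], 3)
print(merges)
print(dict(vocab))
```

After a few merges the frequent word "low" is already a single token, while rarer words such as "lowest" remain split into smaller pieces.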

It is implemented in the `TextEncoder` class. With 40,000 merges, most common words are already combined into a single token. You probably shouldn’t worry about this part unless you want to train the model on other languages. The two things we’ll be using are the `TextEncoder.encode()` method and the `TextEncoder.decoder` attribute.

#### Transformation

After tokenization, the resulting tokens need to be arranged into NumPy arrays that are ready to be fed into the neural network as inputs. This is done in the `transform_roc` function:

```python
def transform_roc(X1, X2, X3):
    # Relies on globals defined elsewhere in the original script:
    # n_ctx, max_len, encoder, clf_token, n_vocab, n_special.
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3) in enumerate(zip(X1, X2, X3)):
        # One sequence per candidate ending: context + ending 1, context + ending 2.
        x12 = [start]+x1[:max_len]+[delimiter]+\
            x2[:max_len]+[clf_token]
        x13 = [start]+x1[:max_len]+[delimiter]+\
            x3[:max_len]+[clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    # Position indices point at the rows after the token and special-token embeddings.
    xmb[:, :, :, 1] = np.arange(
        n_vocab+n_special, n_vocab+n_special+n_ctx)
    return xmb, mmb
```

`X1` is the *context (4 sentences)* tokens, `X2` is the *ending 1* tokens, and `X3` is the *ending 2* tokens. `xmb` contains both tokens and position indices and is shaped *(batch_size, choice, sequence_length, token/position)*. `mmb` contains the masks that indicate whether the sequence has ended at that position (0 if ended), which will be used when calculating language model losses.

Position indices are used in the transformer network to incorporate order information. Later, the indices are mapped to learned embeddings, as opposed to the sinusoidal function used in the original transformer:

We used learned position embeddings instead of the sinusoidal version proposed in the original work.
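Because token ids and position indices index into the same embedding table, the lookup can be sketched in a few lines of NumPy. The sizes below are toy values, and summing the token row and the position row is my assumption about how the two are combined; treat this as an illustration rather than the repository's exact code.

```python
import numpy as np

# Toy sizes, chosen only for illustration (not the released model's values).
n_vocab, n_ctx, n_embd = 10, 4, 3

rng = np.random.default_rng(0)
# One shared table: rows [0, n_vocab) are token embeddings,
# rows [n_vocab, n_vocab + n_ctx) are learned position embeddings.
we = rng.normal(scale=0.02, size=(n_vocab + n_ctx, n_embd)).astype(np.float32)

tokens = np.array([[5, 2, 7, 0]])                    # (batch, n_ctx) token ids
xmb = np.zeros((1, n_ctx, 2), dtype=np.int32)
xmb[:, :, 0] = tokens
xmb[:, :, 1] = np.arange(n_vocab, n_vocab + n_ctx)   # position indices

# Gather the token row and the position row for every position, then add them.
h = we[xmb].sum(axis=2)                              # (batch, n_ctx, n_embd)
print(h.shape)
```

Each position vector is just another trainable row of the table, so nothing about it is fixed in advance the way a sinusoidal encoding is.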

#### Modifying the Transformation

The transformation function for the reconstructed language model:

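```python
def transform_texts(list_of_texts):
    # Relies on globals defined elsewhere: TEXT_ENCODER, N_CTX, N_VOCAB.
    tokens = TEXT_ENCODER.encode(list_of_texts, verbose=False)
    n_batch = len(tokens)
    xmb = np.zeros((n_batch, N_CTX, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, N_CTX), dtype=np.float32)
    for i, x in enumerate(tokens):
        # Truncate to the context size; no special tokens are added.
        x1 = x[:N_CTX]
        l1 = len(x1)
        print(f"length: {l1}")
        xmb[i, :l1, 0] = x1
        mmb[i, :l1] = 1
    # Position indices start right after the token embeddings.
    xmb[:, :, 1] = np.arange(N_VOCAB, N_VOCAB+N_CTX)
    return xmb, mmb
```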

This is really straightforward. It removes the extra *choice* dimension and takes out the special tokens. (The tokenization is done inside the function for the sake of convenience.)

### The Model

#### Transformer Network

We’re not going to cover the transformer itself in this post. Its code can be copied directly into the language model without any modification. Check out these two great resources if you’re interested in the inner workings of the transformer:

**The Illustrated Transformer**

*In the previous post, we looked at Attention - a ubiquitous method in modern deep learning models. Attention is a…* (jalammar.github.io)

#### Modifying the Input Layer

Code from the supervised model:

```python
we = tf.get_variable(
    "we",
    [n_vocab+n_special+n_ctx, n_embd],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
X = tf.reshape(X, [-1, n_ctx, 2])
M = tf.reshape(M, [-1, n_ctx])
```

As we do not need the special tokens anymore, we need to remove `n_special` from the embedding matrix initialization.

There’s no other modification needed, as `X` and `M` are reshaped and agnostic to the *choice* dimension.

#### Modifying the Last Layer

The classifier can *be completely removed* without looking into its details (but I encourage you to, as it involves some clever tensor manipulations). What we need to modify is the language modeling part:

```python
lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
lm_logits = tf.matmul(lm_h, we, transpose_b=True)
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)
```

The above code uses the hidden outputs from positions *0 to L-1* (L is the length of the longest sequence) to generate predictions for positions *1 to L*. In contrast, in the reconstructed model we use positions *0 to L* to generate predictions for positions *1 to L+1*, so we can generate more text based on the existing text:

```python
lm_h = tf.reshape(h, [-1, N_EMBD])
lm_logits = tf.reshape(
    tf.matmul(lm_h, we[:N_VOCAB, :], transpose_b=True),
    [-1, N_CTX, N_VOCAB])
lm_logits_truncated = tf.reshape(
    lm_logits[:, :-1],
    [-1, N_VOCAB])
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits_truncated,
    labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)
```

But of course, we cannot calculate losses for predictions whose answers we don’t know, so we need `lm_logits_truncated` when calculating losses.

Another change I made is to use only the part of the embedding matrix that actually corresponds to tokens when calculating logits: `we[:N_VOCAB, :]` instead of `we`. With the latter, the model would include the learned position embeddings and generate predictions for position indices, which does not make any sense. I don’t think we need those embeddings in the supervised model, either.
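The same loss can be written out in plain NumPy, which makes the shift by one position, the mask, and the vocabulary slice easier to see. This is a sketch with toy sizes and random inputs; `masked_lm_loss` is a name I made up, and it stands in for the TensorFlow ops above rather than replacing them.

```python
import numpy as np


def masked_lm_loss(h, we, X, M, n_vocab):
    """Per-sequence mean next-token cross-entropy, ignoring padded positions."""
    # Logits over real tokens only: drop the position-embedding rows of `we`.
    logits = h @ we[:n_vocab].T                      # (batch, n_ctx, n_vocab)
    # Position t predicts token t+1, so the last position has no label.
    logits = logits[:, :-1]
    labels = X[:, 1:, 0]
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    # Zero out positions past the end of each sequence before averaging.
    mask = M[:, 1:]
    return (nll * mask).sum(axis=1) / mask.sum(axis=1)


rng = np.random.default_rng(0)
n_vocab, n_ctx, n_embd = 10, 5, 4                    # toy sizes
we = rng.normal(size=(n_vocab + n_ctx, n_embd))
h = rng.normal(size=(2, n_ctx, n_embd))              # stand-in for transformer outputs
X = np.zeros((2, n_ctx, 2), dtype=np.int64)
X[:, :, 0] = rng.integers(0, n_vocab, size=(2, n_ctx))
X[:, :, 1] = np.arange(n_vocab, n_vocab + n_ctx)
M = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=np.float64)

losses = masked_lm_loss(h, we, X, M, n_vocab)
print(losses.shape)
```

Because of the mask, whatever sits in the padded positions of the first sequence has no effect on its loss.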

#### Loading The Pre-trained Weights

The pre-trained weights are stored as NumPy arrays. What we need to do is to remove the initialization of special token embeddings from the original code.

This is the part most likely to go wrong, because all (TensorFlow) variables must be initialized in the same shape and in the same order as in the pre-trained model. If you mess with the variables when reconstructing the language model, the weights won’t be loaded properly.
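As a rough illustration of why the shapes matter, here is a NumPy sketch with toy stand-in arrays. The names `token_emb`, `position_emb`, and `check_shapes` are hypothetical, and the row layout (tokens, then special tokens, then positions) is inferred from the position indices used in the transformation functions, not copied from the repository's loading code.

```python
import numpy as np

# Toy stand-ins for the pre-trained arrays (hypothetical names and sizes).
n_vocab, n_ctx_pretrained, n_ctx, n_embd, n_special = 10, 8, 4, 3, 3
token_emb = np.random.randn(n_vocab, n_embd).astype(np.float32)
position_emb = np.random.randn(n_ctx_pretrained, n_embd).astype(np.float32)

# Fine-tuning layout: tokens, freshly initialised special tokens, positions.
special_emb = (np.random.randn(n_special, n_embd) * 0.02).astype(np.float32)
we_supervised = np.concatenate(
    [token_emb, special_emb, position_emb[:n_ctx]], axis=0)

# Language-model layout: no special rows, so positions start right at n_vocab.
we_lm = np.concatenate([token_emb, position_emb[:n_ctx]], axis=0)


def check_shapes(variables, arrays):
    """Fail early if a weight would be loaded into a differently shaped variable."""
    for (name, shape), array in zip(variables, arrays):
        if tuple(shape) != array.shape:
            raise ValueError(f"{name}: expected {shape}, got {array.shape}")


check_shapes([("we", (n_vocab + n_ctx, n_embd))], [we_lm])
print(we_supervised.shape, we_lm.shape)
```

A check like this before assigning the weights turns a silent mismatch into an immediate error.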

### Inspecting the Reconstructed Language Model

Feeding the example four-sentence stories into the model, we can get the next-token predictions:

Take position *9* as an example: the story up to this position is “*Karen was assigned a roommate her first year of* …”. The correct next token is **college**, and the model predicted it correctly. The second and third most probable tokens according to the model are **school** and **grad**. Both are quite reasonable choices. So it appears that the reconstructed model works! (You might want to run more tests to be sure.)
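Ranking the candidates for a single position only needs a softmax and a sort. The sketch below uses a made-up five-token vocabulary and made-up logits; `top_k_next_tokens` is a hypothetical helper, and the `decoder` dict is assumed to map token ids to token strings in the same way as `TextEncoder.decoder`.

```python
import numpy as np


def top_k_next_tokens(logits, decoder, k=3):
    """Return the k most probable (token, probability) pairs for one position."""
    # Softmax over the vocabulary, shifted for numerical stability.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Sort ids by probability, highest first, and keep the first k.
    top_ids = np.argsort(probs)[::-1][:k]
    return [(decoder[int(i)], float(probs[i])) for i in top_ids]


# Hypothetical vocabulary and logits, for illustration only.
decoder = {0: "college", 1: "school", 2: "grad", 3: "the", 4: "banana"}
logits = np.array([4.0, 3.0, 2.5, 0.5, -1.0])
predictions = top_k_next_tokens(logits, decoder)
for token, prob in predictions:
    print(f"{token}: {prob:.3f}")
```

With the real model, `logits` would be one row of `lm_logits` for the position you want to inspect.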

### Thank You!

I skipped some moderately important details in this post, mainly because I ran out of energy, and thus the will to make this post even longer 😁. If you find anything not explained clearly enough, or entirely missing, please feel free to leave me a note. I’d be happy to correct it.

### References:

1. Radford, A., & Salimans, T. (2018). Improving Language Understanding by Generative Pre-Training.
2. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification.
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need.
4. Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., & Shazeer, N. (2018). Generating Wikipedia by Summarizing Long Sequences.
5. Story Cloze Test and ROCStories Corpora
6. Mostafazadeh, N., Roth, M., Louis, A., Chambers, N. W., & Allen, J. F. (2017). LSDSem 2017 Shared Task: The Story Cloze Test.
7. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books.
8. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units.