Understanding the GPT-2 Source Code Part 3

Isamu Isozaki
May 21 · 11 min read

Hi! This is a continuation of looking into GPT-2’s source code. You can find part 1 and part 2 here and here.

Here, I will try to cover how GPT-2’s model works by looking into sample.py and model.py.


The main functionality of sample.py is to generate text outputs given conditions/inputs. This is done by the sample_sequence function in sample.py.


sample_sequence’s inputs are given as follows

def sample_sequence(*, hparams, length, start_token=None, batch_size=None, context=None, temperature=1, top_k=0):

The * portion just forces the caller to pass the parameters that come after it by keyword. For example, given the functions

def a(c):
    print(c)

def b(*, c):
    print(c)

the first function can be called as

a("hi")

and just outputs hi, while the second function must be called as

b(c="hi")

to get the same result!
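To see what happens when the rule is broken, here is a small sketch. The names greet_positional and greet_keyword_only are made up for this example and are not part of GPT-2.

```python
def greet_positional(c):
    """Accepts c positionally or by keyword."""
    return c


def greet_keyword_only(*, c):
    """The bare * means c can only be passed by keyword."""
    return c


print(greet_positional("hi"))        # positional call works
print(greet_keyword_only(c="hi"))    # keyword call works

try:
    greet_keyword_only("hi")         # positional call is rejected
except TypeError as err:
    rejected = True
    print("TypeError:", err)
else:
    rejected = False
```

This is presumably why sample_sequence starts with *: with that many parameters, forcing names at the call site makes it hard to mix up, say, length and batch_size.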

if start_token is None:
    assert context is not None, 'Specify exactly one of start_token and context!'
else:
    assert context is None, 'Specify exactly one of start_token and context!'
    context = tf.fill([batch_size, 1], start_token)

As we have seen in part 1, this portion is for generate_unconditional_samples.py and interactive_conditional_samples.py. If we are going to generate unconditional samples (samples without input), the context will be set to a tensor filled with start tokens. Otherwise, the incoming encoded text will be given as the context! Here, after re-examining the code in interactive_conditional_samples.py, I noticed that OpenAI did not add a start token to the beginning of the incoming text. Instead, they simply encoded it as

raw_text = input("Model prompt >>> ")
while not raw_text:
    print('Prompt should not be empty!')
    raw_text = input("Model prompt >>> ")
context_tokens = enc.encode(raw_text)

which I found rather interesting because I thought you always needed a start token!

Why have start tokens? (you can skip if you know!)

I think it may have been a bit confusing when reading part 1 why we must have start tokens. So, I’ll try to explain it here! Start tokens, as the name suggests, denote the start of a text. For example, for the text “I am happy”, the start token comes before and the end token comes after, so it ends up as “<start_token>I am happy<end_token>”. The reason why we do this is that the texts we load have different lengths!

So, the common approach, after we have encoded the string into numbers, is to add 0s at the end until every text has the same length. For example, if I is 1, am is 2 and happy is 3, it will be encoded as 1230000…. However, one problem with this approach is that the machine, in the process of learning the sequence, can become confused about where the text starts and ends.

For example, if most of the texts are short and there is one long incoming text, such as 11111111111113452000.., then it is quite possible that, since the machine has mostly seen 0s at the positions occupied by 11113452, it will start to disregard those numbers, which leads to rather bad training results! That is why start and end tokens were introduced. The start token tells the machine learning algorithm where the text starts, and the end token, placed at the end of the text, tells it where the text ends.
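As a rough sketch of the idea above, the snippet below wraps an encoded text in start and end markers before padding it with 0s. The ids 8 and 9 for the markers and the helper name encode_and_pad are assumptions made up for this example; they are not the ids GPT-2 uses.

```python
PAD, START, END = 0, 8, 9   # hypothetical ids chosen for this example
VOCAB = {"I": 1, "am": 2, "happy": 3}


def encode_and_pad(words, max_len):
    """Wrap the encoded words in start/end markers, then pad with zeros."""
    ids = [START] + [VOCAB[w] for w in words] + [END]
    if len(ids) > max_len:
        raise ValueError("text is longer than max_len")
    return ids + [PAD] * (max_len - len(ids))


padded = encode_and_pad(["I", "am", "happy"], max_len=8)
print(padded)
```

With the markers in place, the padding 0s always come after an end marker, so they can be told apart from the text itself.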

def step(hparams, tokens, past=None):
    lm_output = model.model(hparams=hparams, X=tokens, past=past, reuse=tf.AUTO_REUSE)
    logits = lm_output['logits'][:, :, :hparams.n_vocab]
    presents = lm_output['present']
    presents.set_shape(model.past_shape(hparams=hparams, batch_size=batch_size))
    return {
        'logits': logits,
        'presents': presents,
    }

Next, inside the sample_sequence function, this step function was defined.

The first line of this function calls the model function within the model file, which returns a dictionary of tensors, lm_output, with two keys, “logits” and “present”. I do not yet understand what they are, but as we go deeper into the code, I’m quite sure we’ll find out.

What is a tensor? (you can skip if you know!)

For those who are not that familiar with TensorFlow, I think the word “tensor” is a bit mysterious. A tensor is like a component of a graph: it describes a computation rather than holding its result. In TensorFlow, when you write the following code,

import tensorflow as tf  # TensorFlow 1.x

a = tf.constant(1)
b = tf.constant(1)
c = a + b

the value of c will not be 2. (At least not before the TensorFlow 2.0 update, but since this code was written before that, I will ignore it here.)

In fact, its value will not be properly set until run time. The only thing that it knows is that c is a value given by the addition of a and b.

What we can do is run the graph to see the result of c. For this, we start a session. Sessions in TensorFlow allow us to actually execute the operations that the tensors describe. So, in this case, we can evaluate the value of c with a session. It is done like the following.

with tf.Session() as sess:
    print(sess.run([c]))

and [2] should be printed. If we want to input new values to a and b, we can do

with tf.Session() as sess:
    # the fed values are illustrative; any pair that sums to 5 works
    print(sess.run([c], feed_dict={a: 2, b: 3}))

and [5] will be outputted.
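If you do not have an old TensorFlow version at hand, the same "describe first, run later" idea can be imitated in plain Python. This is only a toy sketch: the Node class and the constant and run helpers are made up here and are far simpler than what TensorFlow actually does.

```python
class Node:
    """A value that is described now and computed later."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __add__(self, other):
        # Adding two nodes only records the operation; nothing is computed yet.
        return Node("add", (self, other))


def constant(value):
    return Node("const", value=value)


def run(node, feed=None):
    """Evaluate node, letting feed override the value of any node."""
    feed = feed or {}
    if node in feed:
        return feed[node]
    if node.op == "const":
        return node.value
    return sum(run(inp, feed) for inp in node.inputs)


a = constant(1)
b = constant(1)
c = a + b            # c is a Node, not the number 2

print(type(c).__name__)
print(run(c))
print(run(c, feed={a: 2, b: 3}))
```

Here run plays the role of the session, and the feed argument plays the role of feeding new values in.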

Looking inside model.py

Now that we know that model.py’s model function outputs tensors, let us see how model.py is set up and possibly see some of the algorithms at play!

TensorFlow scopes

When we look at the top of the model function, we see

def model(hparams, X, past=None, scope='model', reuse=False):
    with tf.variable_scope(scope, reuse=reuse):

The variable scope defined by TensorFlow here is mostly for ease of debugging. When a tensor with name x in a scope gives an error, and the scope name is set to, say, hello, the error messages mention the tensor as hello/x.

Also, while I have not yet used it this way myself, it can also be used to share variables! This is where the reuse parameter comes in. To give an example from the documentation, it can be used in the following manner!

def foo():
    with tf.variable_scope("foo", reuse=tf.AUTO_REUSE):
        v = tf.get_variable("v", [1])
    return v

v1 = foo()  # Creates v.
v2 = foo()  # Gets the same, existing v.
assert v1 == v2
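To get a feel for what variable sharing means, here is a toy sketch in plain Python. The _VARIABLES dictionary and this get_variable helper are made up for illustration; the auto_reuse flag only loosely plays the role of tf.AUTO_REUSE and does not reproduce TensorFlow's real behaviour in detail.

```python
_VARIABLES = {}   # maps "scope/name" to the stored variable


def get_variable(scope, name, shape, auto_reuse=False):
    """Create scope/name on first use; return the same object later if auto_reuse is set."""
    key = scope + "/" + name
    if key in _VARIABLES:
        if not auto_reuse:
            raise ValueError("Variable %s already exists" % key)
        return _VARIABLES[key]
    variable = {"name": key, "values": [0.0] * shape[0]}
    _VARIABLES[key] = variable
    return variable


v1 = get_variable("foo", "v", [1], auto_reuse=True)   # creates foo/v
v2 = get_variable("foo", "v", [1], auto_reuse=True)   # returns the existing foo/v
print(v1["name"], v1 is v2)
```

Note how the scope name ends up as a prefix of the variable name, which is the hello/x naming mentioned earlier.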

Extract batch size and sequence length

The next line in the model function was

batch, sequence = shape_list(X)

Here, when we look back at sample.py,

def step(hparams, tokens, past=None):
    lm_output = model.model(hparams=hparams, X=tokens, past=past, reuse=tf.AUTO_REUSE)

we see that X is set to the input to step called tokens. Now, let us look into the shape_list function and see what exactly it does.

def shape_list(x):
    """Deal with dynamic shape in tensorflow cleanly."""
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]

As I was not quite sure how it works, I went to IDLE (if you have Python installed, just typing idle in the start menu should bring it up) and tried a few things out.

Before starting, I looked at the line

return [dynamic[i] if s is None else s for i, s in enumerate(static)]

We can see from it that if none of static’s dimensions is None, the function simply returns x.shape.as_list(). Thus, I made a tensor as follows

example = tf.placeholder("float", [None, 5])

This initializes example as a tensor of dimension None times 5. The None here means that the first dimension can have any size, which is only fixed once the session runs.

I first outputted example.shape.as_list() and it outputted

[None, 5]

For tf.shape(example),

<tf.Tensor 'Shape_5:0' shape=(2,) dtype=int32>

was returned. Its first element, tf.shape(example)[0], returned

<tf.Tensor 'strided_slice_2:0' shape=() dtype=int32>

and its second element returned

<tf.Tensor 'strided_slice_3:0' shape=() dtype=int32>

When I passed example through shape_list,

[<tf.Tensor 'strided_slice_2:0' shape=() dtype=int32>, 5]

was outputted. At first, I did not quite understand the usefulness of doing this, but one thing I noticed is that it lets both the batch size and the sequence length be set at runtime: wherever the static shape is None, the output of shape_list holds a tensor instead!
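The merging logic itself does not depend on TensorFlow, so it can be tried out in plain Python. In this sketch, merge_shapes is a made-up name, and a string stands in for the runtime tensor that tf.shape would give us.

```python
def merge_shapes(static, dynamic):
    """Prefer the known static size; fall back to the runtime size for None."""
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]


static_shape = [None, 5]                     # like example.shape.as_list()
runtime_shape = ["<batch at runtime>", 5]    # stand-in for tf.shape(example)

merged = merge_shapes(static_shape, runtime_shape)
print(merged)

# With no None in the static shape, the static shape comes back unchanged.
print(merge_shapes([3, 5], ["<unused>", "<unused>"]))
```

So known sizes stay plain Python integers, and only the unknown ones are deferred to run time.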

However, I think one thing that future ML students need to be careful of is that we still cannot train a network with weights and biases of an undefined size.

The most basic concept in Neural Networks (you can skip if you know!)

The most fundamental concept in neural networks is matrix multiplication and addition of matrices. Let’s say that we have a bunch of numbers of dimension (None, 10) which act as inputs. (None is the batch size.)

Let’s say we want to know whether those ten numbers are good or bad, which is indicated by 0 or 1.

While there may be many ways of doing this, the first model that ML engineers should think of is to directly scale down the dimension of 10 to 1. This is done by multiplying the input by a weight matrix of dimensions (10, 1) and adding a bias of dimension (1). (I might go into linear algebra in a different post but I don’t think I’ll do it here.) The output then has dimension (None, 1).

This allows the dimension to be scaled down and the network to be trained with techniques such as gradient descent. However, one important aspect to note is that the dimensions of the weight matrix and of the bias are constant. That is why they are trainable. Thus, we cannot use OpenAI’s technique to secretly set a variable-size dimension as one of the dimensions of the weight matrix or the bias. I decided to write this because, well, I made these mistakes!
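Here is a minimal sketch of that computation in plain Python, with made-up weight and bias values. The point is only that the weights keep the fixed shape (10, 1) while the batch size is free to change.

```python
def linear(inputs, weights, bias):
    """Compute inputs x weights + bias for a (batch, 10) input and (10, 1) weights."""
    outputs = []
    for row in inputs:
        total = sum(x * w[0] for x, w in zip(row, weights))
        outputs.append([total + bias[0]])
    return outputs


weights = [[0.1] for _ in range(10)]   # shape (10, 1), fixed no matter the batch size
bias = [0.5]                           # shape (1,)

small_batch = [[1.0] * 10 for _ in range(2)]   # shape (2, 10)
large_batch = [[1.0] * 10 for _ in range(7)]   # shape (7, 10)

small_out = linear(small_batch, weights, bias)   # shape (2, 1)
large_out = linear(large_batch, weights, bias)   # shape (7, 1)
print(len(small_out), len(large_out))
```

Both calls reuse exactly the same weights and bias, which is what makes them trainable across batches of any size.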

The next lines of the model function are

wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd],
                     initializer=tf.random_normal_initializer(stddev=0.01))
wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
                     initializer=tf.random_normal_initializer(stddev=0.02))

tf.get_variable creates variables that can be trained, and the initializer gives the initial values those variables are set to. In this case, they are drawn from a normal distribution with mean 0 and a standard deviation of 0.01 for wpe and 0.02 for wte, which I found quite interesting because I tend to set initial values to 0 or 1. It makes sense intuitively now that I think about it, but I don’t think there is much of a logical explanation for it other than that, when tested, it improved performance! Anyway, I’ll try it out myself.

While I am not sure what wpe and wte stand for, we can examine what values hparams has.

The default parameters are given as

def default_hparams():
    return HParams(
        n_vocab=0,
        n_ctx=1024,
        n_embd=768,
        n_head=12,
        n_layer=12,
    )

Here, n_embd denotes the embedding size.

What is embedding (you can skip if you know!)

Embedding is basically a way to represent every token, which is just a number, as a vector. This allows the machine learning algorithm to understand similarities and differences between words. For example, let us look at the words cat and dog versus words such as car and house. As cat and dog are quite similar as words, we expect the vectors that represent them to be closer together than the vectors for words such as house and cat.

The embedding size gives the size of each of these vectors, which here is 768. That is quite large!
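To make "closer together" concrete, one common measure is the cosine similarity between two vectors. The 3-dimensional vectors below are made up purely for illustration; real GPT-2 embeddings are learned and have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional embeddings, for illustration only.
embeddings = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "house": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_house = cosine_similarity(embeddings["cat"], embeddings["house"])
print(cat_dog > cat_house)
```

With these toy numbers, cat is far more similar to dog than to house, which is the kind of structure we hope training produces.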

One strange property

However, one strange property in the default parameters was that n_vocab was 0. When we look at how wte was defined, we find that n_vocab is one of its dimensions

wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
                     initializer=tf.random_normal_initializer(stddev=0.02))

Thus, will this always be a tensor with a dimension of 0? This was quickly resolved when I looked into the hparams.json file saved with the model, where the parameters were different, as we can see below

{
  "n_vocab": 50257,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12
}

Setting up input

Thus, we can now reasonably assume that n_vocab is the vocabulary size. While I won’t go into n_head and n_layer yet, I think it is reasonable to assume that n_ctx is the max length of the context. However, we cannot be sure yet!

wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd],
                     initializer=tf.random_normal_initializer(stddev=0.01))
wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
                     initializer=tf.random_normal_initializer(stddev=0.02))

Thus, we can be reasonably certain that wte is a look-up table which holds all the vectors that correspond to the token values!

past_length = 0 if past is None else tf.shape(past)[-2]
h = tf.gather(wte, X) + tf.gather(wpe, positions_for(X, past_length))

Now, while this is primarily a guess, I suspect that past is the output of the model so far. This is mostly because the model function is called from the step function in sample.py. So what I suspect is that the step function is called every time the model outputs a new token, adds that token as an input to the model, and calls the model again! This will later be confirmed.

While I cannot say the shape of past, past_length should, judging from the name, hold the length of the text outputted so far.

Now, let us look into h. tf.gather is a function that returns the elements of the first argument at the indices given by the second argument. So for example, if a is a tensor like

a = tf.constant([1,2,4])

to get the value 2, we just need to call

tf.gather(a, tf.constant([1]))
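The same lookup can be imitated in plain Python, which also shows what happens when the indices are nested. The gather helper below is a made-up stand-in that only covers the simple case of picking along the first axis, and the table values are invented.

```python
def gather(params, indices):
    """Pick entries of params at the given indices, keeping the nesting of indices."""
    if isinstance(indices, list):
        return [gather(params, i) for i in indices]
    return params[indices]


print(gather([1, 2, 4], [1]))

# A made-up lookup table with 4 tokens and embedding size 2.
table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
tokens = [[3, 0, 1]]                 # shape (batch=1, sequence=3)
vectors = gather(table, tokens)      # shape (1, 3, 2)
print(vectors)
```

Each token id is replaced by its row of the table, so the output gains one extra axis of size equal to the embedding size.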

Now, as we can see in the first portion of the addition,

tf.gather(wte, X)

As X holds the tokens and wte is the lookup table that connects the tokens to the vectors, we can say that this is the vector representation of the tokens gathered so far. Let us look at the wpe portion and try to figure out what it does

tf.gather(wpe, positions_for(X, past_length))

For this, let us look at the positions_for function. It is given as

def positions_for(tokens, past_length):
    batch_size = tf.shape(tokens)[0]
    nsteps = tf.shape(tokens)[1]
    return expand_tile(past_length + tf.range(nsteps), batch_size)

One interesting thing to note first is that, unlike in the model function, the values of batch_size and nsteps are taken directly from tf.shape, without going through shape_list, like so

batch_size = tf.shape(tokens)[0]
nsteps = tf.shape(tokens)[1]

I’m not sure why they did this so if anyone knows, please tell me!

Now, after this, the expand_tile function is called like so,

expand_tile(past_length + tf.range(nsteps), batch_size)

The expand tile function is given as

def expand_tile(value, size):
    """Add a new axis of given size."""
    value = tf.convert_to_tensor(value, name='value')
    ndims = value.shape.ndims
    return tf.tile(tf.expand_dims(value, axis=0), [size] + [1]*ndims)

The value passed in is the range from past_length up to (but not including) past_length + the sequence length of X, while the size is the batch size.

ndims here is the number of axes a tensor has. For example, if it is a 2d tensor, it is 2 and if it is a 3d tensor it is 3.

tf.tile repeats its first argument along each axis the number of times given by the second argument. You can check out the documentation here.

What this tiling does is stack that range once for every item in the batch, giving a tensor of shape (batch size, sequence length)! If an explanation is required, please tell me in the comments!
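The two helpers can be mimicked with plain Python lists to see the result. This is a simplified sketch that only handles a 1d range being tiled over the batch, and the token ids are made up.

```python
def expand_tile(value, size):
    """Stack size copies of value along a new leading axis."""
    return [list(value) for _ in range(size)]


def positions_for(tokens, past_length):
    """Return positions past_length .. past_length + nsteps - 1 for every row."""
    batch_size = len(tokens)
    nsteps = len(tokens[0])
    return expand_tile(range(past_length, past_length + nsteps), batch_size)


tokens = [[11, 12, 13], [21, 22, 23]]   # made-up token ids, shape (2, 3)
positions = positions_for(tokens, past_length=4)
print(positions)
```

Every row of the batch gets the same positions, and they start at past_length rather than 0 because that many tokens came before.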

Now when we go back to

tf.gather(wpe, positions_for(X, past_length))

We see that the rows of wpe from past_length to past_length + the sequence length of X are taken. As I’m not sure what wpe is in the first place, I can’t be entirely sure, so I decided to check! I got to this amazing blog. What wpe basically does is tell the model where a particular word is! So, if it is the 5th word, wpe supplies a vector that acts as a signature saying that this word is the 5th word, which is quite intriguing. This is called positional encoding!

h is finally obtained by adding the vectors representing the tokens and the positional encoding together.
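Putting the two lookups together, the computation of h can be sketched in plain Python as below. The embed helper and the tiny wte and wpe tables are made up for illustration; the real tables have n_vocab and n_ctx rows of 768 learned numbers each.

```python
def embed(tokens, wte, wpe, past_length=0):
    """Add each token's vector to the vector for its position in the text."""
    h = []
    for row in tokens:
        h_row = []
        for step, token in enumerate(row):
            token_vec = wte[token]
            position_vec = wpe[past_length + step]
            h_row.append([t + p for t, p in zip(token_vec, position_vec)])
        h.append(h_row)
    return h


# Made-up tables: 3 tokens, 4 positions, embedding size 2.
wte = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
wpe = [[0.0, 0.5], [10.0, 10.5], [20.0, 20.5], [30.0, 30.5]]

h = embed([[2, 2]], wte, wpe, past_length=1)
print(h)
```

The same token (id 2) appears twice but ends up with two different vectors, because the position part differs. That is how the model can tell the two occurrences apart.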

As this article has gone on quite a bit longer than expected, I think I’ll save the next insights for the next article, as I’ll be going into Transformers, which is quite a complicated topic for both experts and beginners!


If you are interested, please check out the next article here!
