Understanding the GPT-2 Source Code Part 4

Isamu Isozaki
May 22 · 23 min read

Hi! You can read part 1, part 2, and part 3 here, here, and here respectively. Here, I try to finish talking about model.py! I’m quite new here, so if anything is unclear, please tell me! I’d appreciate any feedback!

Transformers

Before going into the code again, we must first understand what Transformers are. Unfortunately, they are not robots which become cars, but I think they are just as exciting! They are, most fundamentally, a way to encode inputs and then decode them into outputs! I think the image below describes it well!

Image thanks to Arden Dertat who wrote an excellent article on Autoencoders here

However, one thing to note is that this image shows an autoencoder, not a transformer.

While autoencoders use traditional feed-forward or convolutional neural networks, Transformers, which work with text data, have tended to use LSTMs and are thus slightly different!

What are LSTMs?

LSTM stands for long short-term memory network. I won’t go into them that extensively, as I think a lot of other articles have done a great job of explaining LSTMs. However, I’ll cover the basics because I think they’re a bit interesting.

Each LSTM block takes in two inputs: information from the previous block and the next word in the sequence. Given these two pieces of information, the LSTM computes an output and the information to pass to the next cell. The reason we use LSTMs for this is that they are quite good at remembering information from earlier in the text.

Now, in transformers, what is commonly done is to pass the entire text into an LSTM network, which is a chain of the previously mentioned blocks.

Back to Transformers

Now, we have a dilemma! If we were to simply take the outputs of the LSTMs and say that this will be the output of GPT-2, it would be quite sad, because the outputs of the LSTMs, by their nature, must have the same or a similar length to the inputs, which is quite a restrictive condition. In addition, the output of the first LSTM block only has the first word to judge its output from. Thus, overall, this is a bad idea.

A solution to this is Transformers. Remember the data that is passed along the blocks? Why don’t we put the text into the LSTM network and take the final piece of information passed along from the last LSTM block? We can then say, with a certain level of confidence, that the whole text is encoded into that single piece of information that the final block outputs (let’s call this the last state). This is called the encoder network.

The transformer then takes that information and passes it along to the decoder network. The decoder, as it knows what the input text is from the last state, will start outputting tokens. If we feed those tokens back into the network, it will continue outputting tokens. We can end these outputs by simply stopping when the end token is outputted.

One slight problem

While this sounds quite nice and dandy, there is one fundamental problem with it, and that is that it doesn’t work! The reason for this is that the last piece of information, the last state, in fact fails to remember most of the text. LSTMs turned out to be rather forgetful. Thus, Google, in the paper “Attention Is All You Need”, brought up a solution.

The solution was to simply add all these states (the information passed along the blocks) together, applying a weight to each one, and save the result as a context vector. This way, the information of the whole text can be summarized into one vector, but without the forgetful tendencies. The weights are computed from the last state, and the context vector is then passed along to the decoder.

In fact, OpenAI actually did away with LSTMs entirely and just used the encoded words directly as the states!

This is the current approach for Transformers as we will see in the actual code.

Back to the code

presents = []
pasts = tf.unstack(past, axis=1) if past is not None else [None] * hparams.n_layer

These are the next lines in the code. tf.unstack separates a tensor into a list of tensors along the given dimension. For example, given

a = tf.placeholder("float", [1,2,3])

Then doing

tf.unstack(a, axis=1)

will output

[tf.placeholder("float",[1,3]), tf.placeholder("float",[1,3])]
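
To make this concrete, here is a minimal runnable sketch of tf.unstack with made-up numbers (it runs eagerly in TF 2; in TF 1 the static shapes are still visible without a session):

import numpy as np
import tensorflow as tf

# A toy tensor of shape [1, 2, 3]
a = tf.constant(np.arange(6).reshape(1, 2, 3), dtype=tf.float32)

# Unstacking along axis=1 gives a Python list of 2 tensors, each of shape [1, 3]
pieces = tf.unstack(a, axis=1)
print(len(pieces))      # 2
print(pieces[0].shape)  # (1, 3)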

Then next, we see that if past is None, pasts becomes

[None] * hparams.n_layer

We can thus conclude that the second dimension of past has size hparams.n_layer.

What are layers in LSTM networks? (Not required)

Remember a while back when I said that LSTMs give out outputs and the information to the next cell? While this is true, you might wonder whether 1 layer of LSTM is enough to give a complex enough output. When we look at regular neural nets, we tend to use as many as 10 to 12 layers for complex tasks. And that is how layers in LSTMs work too. The next layer of the LSTM takes as input, not a token, but the output of the LSTM block one layer below, then produces an output itself and sends it to the next layer, and so on.
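
Just to illustrate the stacking (GPT-2 itself does not use LSTMs), here is a minimal sketch using Keras in TF 2 with made-up sizes: the full sequence output of layer 1 is fed to layer 2 as if it were the tokens.

import tensorflow as tf

# Made-up sizes, purely for illustration
batch_size, seq_len, embed_size = 2, 5, 8
x = tf.random.normal([batch_size, seq_len, embed_size])

# Layer 1 outputs a full sequence, which layer 2 consumes in place of tokens
layer1 = tf.keras.layers.LSTM(16, return_sequences=True)
layer2 = tf.keras.layers.LSTM(16, return_sequences=True)

h1 = layer1(x)   # shape [2, 5, 16]
h2 = layer2(h1)  # shape [2, 5, 16]
print(h1.shape, h2.shape)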

For Attention Networks?

For the attention networks,

judging from the code in model.py

assert len(pasts) == hparams.n_layer
for layer, past in enumerate(pasts):
    h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
    presents.append(present)
results['present'] = tf.stack(presents, axis=1)

it seems to show how each layer of the network is processed in turn, each with its own slice of past!

The next code is

for layer, past in enumerate(pasts):
    h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
    presents.append(present)
results['present'] = tf.stack(presents, axis=1)
h = norm(h, 'ln_f')

The block function appears to be where the bulk of the work takes place.

Here, let us examine the dimensions. h has dimension [batch, sequence, embed_size] and past has dimension [not sure, n_layer, past length].

Block

Now, let us look into the block function.

def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
        x = x + m
        return x, present

Here, in the first line, a variable called nx gets the embed size, and then x and the other parameters are passed into a function called attn, which most likely stands for the attention we discussed earlier!

Normalization

First, let us look into norm which is used to normalize the data before passing it in.

def norm(x, scope, *, axis=-1, epsilon=1e-5):
    """Normalize to mean = 0, std = 1, then do a diagonal affine transform."""
    with tf.variable_scope(scope):
        n_state = x.shape[-1].value
        g = tf.get_variable('g', [n_state], initializer=tf.constant_initializer(1))
        b = tf.get_variable('b', [n_state], initializer=tf.constant_initializer(0))
        u = tf.reduce_mean(x, axis=axis, keepdims=True)
        s = tf.reduce_mean(tf.square(x-u), axis=axis, keepdims=True)
        x = (x - u) * tf.rsqrt(s + epsilon)
        x = x*g + b
        return x

Before we look into the code directly, I think I should clarify what normalization means. It is the process of making the mean of the data 0 and the standard deviation 1. The way this is commonly done is to subtract the mean and divide by the standard deviation.

This is done in many machine learning algorithms because it tends to lead to an increase in performance! I’m not entirely sure why this is the case, but it is the way it is!

However, when we look at the way it is normalized in this source code, it is quite fascinating to see that it is not any simple normalization that is taking place here. The initial portion,

u = tf.reduce_mean(x, axis=axis, keepdims=True)
s = tf.reduce_mean(tf.square(x-u), axis=axis, keepdims=True)
x = (x - u) * tf.rsqrt(s + epsilon)

is quite standard. u is the mean and s is the variance, both calculated with tf.reduce_mean, which takes the mean of the values in its first argument along the given axis.

keepdims simply signifies that the rank is retained: the axis being reduced is kept around with size 1. So, for a 2D array, instead of the result collapsing to a lower rank, you still get a 2D result (one mean per row, say), which lets it broadcast against x. At least, that is my understanding from reading the documentation!

x is normalized by subtracting the mean and multiplying by 1/sqrt(variance), which is given by tf.rsqrt. The epsilon is there to avoid instances where s is 0 and tf.rsqrt becomes infinity.

So far, this is pretty standard however, it becomes quite interesting when we look at

g = tf.get_variable('g', [n_state], initializer=tf.constant_initializer(1))
b = tf.get_variable('b', [n_state], initializer=tf.constant_initializer(0))

and where they are used to change the value of x oh so slightly by

x = x*g + b

These values, g and b, can be trained as they are variables. Thus, what OpenAI is effectively doing here is letting the network rescale and shift x so that it has a scale and mean to the algorithm’s liking during training, which I think is quite fascinating. Since g is initialized to 1 and b to 0, the data also starts out passing through unchanged.
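
To build intuition, here is a NumPy re-implementation of the same arithmetic (the sizes are made up); with g initialized to ones and b to zeros, the output really does have mean 0 and standard deviation 1 along the last axis:

import numpy as np

def layer_norm(x, g, b, axis=-1, epsilon=1e-5):
    # Same arithmetic as norm() above, just in NumPy
    u = x.mean(axis=axis, keepdims=True)                  # mean, kept as a size-1 axis so it broadcasts
    s = np.square(x - u).mean(axis=axis, keepdims=True)   # variance
    x = (x - u) / np.sqrt(s + epsilon)
    return x * g + b

x = np.random.randn(2, 3, 4) * 10 + 5   # made-up [batch, sequence, embed] data
g, b = np.ones(4), np.zeros(4)          # the initial values of 'g' and 'b'
y = layer_norm(x, g, b)
print(y.mean(axis=-1))  # ~0 everywhere
print(y.std(axis=-1))   # ~1 everywhere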

Now, let us look into the massive attn function!

Attention

def attn(x, scope, n_state, *, past, hparams):
    assert x.shape.ndims == 3  # Should be [batch, sequence, features]
    assert n_state % hparams.n_head == 0
    if past is not None:
        assert past.shape.ndims == 5  # Should be [batch, 2, heads, sequence, features], where 2 is [k, v]

    def split_heads(x):
        # From [batch, sequence, features] to [batch, heads, sequence, features]
        return tf.transpose(split_states(x, hparams.n_head), [0, 2, 1, 3])

    def merge_heads(x):
        # Reverse of split_heads
        return merge_states(tf.transpose(x, [0, 2, 1, 3]))

    def mask_attn_weights(w):
        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.
        _, _, nd, ns = shape_list(w)
        b = attention_mask(nd, ns, dtype=w.dtype)
        b = tf.reshape(b, [1, 1, nd, ns])
        w = w*b - tf.cast(1e10, w.dtype)*(1-b)
        return w

    def multihead_attn(q, k, v):
        # q, k, v have shape [batch, heads, sequence, features]
        w = tf.matmul(q, k, transpose_b=True)
        w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))
        w = mask_attn_weights(w)
        w = softmax(w)
        a = tf.matmul(w, v)
        return a

    with tf.variable_scope(scope):
        c = conv1d(x, 'c_attn', n_state*3)
        q, k, v = map(split_heads, tf.split(c, 3, axis=2))
        present = tf.stack([k, v], axis=1)
        if past is not None:
            pk, pv = tf.unstack(past, axis=1)
            k = tf.concat([pk, k], axis=-2)
            v = tf.concat([pv, v], axis=-2)
        a = multihead_attn(q, k, v)
        a = merge_heads(a)
        a = conv1d(a, 'c_proj', n_state)
        return a, present

I know this looks quite intimidating. In fact, I am quite scared looking at this bunch of code myself. But I’ll try to break it into smaller pieces so both of us can understand!

First, we must highlight some points we can learn from the code at the top! We first can note that what we predicted about the shape of x was correct! As is given, it is [batch, sequence, features]. However, for the past variable, the dimensions were quite a bit more complicated than expected. In fact, it ended up being [batch, 2, heads, sequence, features]!

To understand where the 2 comes from, as well as what those heads mean, let us take a trip back to theory!

The hidden states for LSTMs(Not required)

LSTMs, as I mentioned before, output two things: the output itself, and the information to pass to the next LSTM block. However, one thing I did not quite mention was the dimensions of the inputs and outputs.

Input

  • LSTM’s input: [batch_size, sequence_length, embed_size]

Output

  • LSTM’s output: [batch_size, sequence_length, embed_size]
  • hidden state/information that is passed along to the next lstm block:

[batch_size, 2, sequence_length, embed_size]

Now, you may wonder where the 2 in the hidden state came from. Actually, the hidden state on its own has the same dimensions as the LSTM’s output, [batch_size, sequence_length, embed_size]. However, since we want to pass the output along as well, we stack the output together with the hidden state and thus form the new hidden state with dimensions [batch_size, 2, sequence_length, embed_size].

However, this is still not the same as [batch, 2, heads, sequence, features]. The question is where did these “heads” come from. And that is where we need to check back on attention!

What are heads?

I may be wrong but for the official explanation, look at this paper.

Basically, remember back when I said that attention improved performance by applying a weight to the hidden states and adding them together? It turns out that may not be quite enough! It is still hard for the network to understand what is happening from just one weighted summation.

The trick was to introduce heads. Each head is a weighted summation of the hidden states, but with different weights. So, essentially, each head looks at the hidden states in a different way. Then, these are combined into a single context vector which is sent to the decoder! This is called multi-head attention.

Thus, we can say that the past variable held these heads!

Back to code

with tf.variable_scope(scope):
    c = conv1d(x, 'c_attn', n_state*3)
    q, k, v = map(split_heads, tf.split(c, 3, axis=2))
    present = tf.stack([k, v], axis=1)
    if past is not None:
        pk, pv = tf.unstack(past, axis=1)
        k = tf.concat([pk, k], axis=-2)
        v = tf.concat([pv, v], axis=-2)
    a = multihead_attn(q, k, v)
    a = merge_heads(a)
    a = conv1d(a, 'c_proj', n_state)
    return a, present

After skipping past all the inner function definitions, we get to the actual code being executed. n_state, if you recall, is the embed size. The first function called in the variable scope is conv1d.

Conv1d most likely stands for convolution of 1 dimension.

What is convolution of 1 dimension?

Thanks a bunch to Amol Umbarkar for the image and his nice YouTube tutorial here

Convolution is where a window with weights assigned to it multiplies the elements below it by the weights above them and produces a single number (after adding a bias). This number is then sent to the output matrix. Then, the window shifts its location, does the same thing, and sends its output to the corresponding shifted location in the output array!

Convolution for 1 dimension can be done when the window is 1 dimensional as can be seen above!
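
Here is a tiny NumPy sketch of that sliding-window idea with a made-up input, a window of size 3, and a bias:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up 1-D input
w = np.array([0.5, 1.0, -0.5])            # the window (kernel) weights
bias = 0.1

# Slide the window across x: multiply element-wise, sum, and add the bias
out = np.array([np.dot(x[i:i + 3], w) + bias for i in range(len(x) - 3 + 1)])
print(out)  # [1.1 2.1 3.1]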

Now, let us look at the code.

def conv1d(x, scope, nf, *, w_init_stdev=0.02):
    with tf.variable_scope(scope):
        *start, nx = shape_list(x)
        w = tf.get_variable('w', [1, nx, nf], initializer=tf.random_normal_initializer(stddev=w_init_stdev))
        b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0))
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
        return c

x here has dimension [batch_size, sequence_size, embed_size]. Thus, when shape_list() is called,

start gets assigned [batch_size, sequence_size] and nx gets embed_size. This is done by the clever trick of putting * before start when initializing it: Python’s iterable unpacking collects everything except the last element into start, and the last element goes to nx. If you are interested, it’s worth looking up iterable unpacking to see why it’s neat!

w, which stands for weight, is initialized with the shape [1, nx, nf], and b, the bias, is set to size nf. nf, if we recall, is 3*embed_size here. I’m not sure yet why it is 3*embed_size, but I’m quite sure we will find out later on!

Finally, then c, which is the output, is given as

c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])

This was quite interesting for me because I didn’t know that convolution can be done by simply doing matrix multiplication and adding a bias! I thought that, as I explained, I needed to have a window and have it slide over the matrix. In fact, now that I look at this convolution, I notice that I might have done quite a few convolutions without knowing it!

For convolution, I usually use the Tensorflow’s API like tf.nn.conv2d and tf.nn.conv1d thus I didn’t know that much about the inner workings of it. In fact, I’m quite intrigued by why OpenAI avoids these predefined functionalities like the plague but since it’s quite educational, I’m fine with it!

Anyway, what is going on is that x is reshaped to [batch_size*sequence_size, embed_size] and the weight is reshaped to [embed_size(nx), nf]. I’m not quite sure why it did not have that shape in the first place but anyway, let’s continue! Then they are multiplied to give [batch_size*sequence_size, nf] and the bias of size [nf] was added to it. Finally, then, it was reshaped back to [batch_size, sequence_size, nf].

I couldn’t find a mathematical proof for why this works when I checked briefly online, but I’d like to prove it once I find out and edit this article to include it! However, for now, I’ll simply keep wondering.
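
In the meantime, here is a small NumPy check (with made-up sizes) that at least shows the two views agree numerically: the kernel here has width 1, so each output position only ever sees one input position, and applying the same [nx, nf] projection at every position is exactly what the reshape-matmul-reshape does.

import numpy as np

batch, seq, nx, nf = 2, 5, 8, 24
x = np.random.randn(batch, seq, nx)
w = np.random.randn(nx, nf)
b = np.random.randn(nf)

# conv1d-style: flatten batch and sequence, multiply, reshape back
c1 = (x.reshape(-1, nx) @ w + b).reshape(batch, seq, nf)

# "Window of width 1" view: apply the same projection to each position separately
c2 = np.array([[x[i, t] @ w + b for t in range(seq)] for i in range(batch)])

print(np.allclose(c1, c2))  # True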

q, k and v

q, k, v = map(split_heads, tf.split(c, 3, axis=2))

After this, c is split along the nf axis into 3 parts, which explains why nf was multiplied by 3. This produces 3 tensors of size [batch_size, sequence_size, embed_size]. Then, each is passed to the split_heads function. Let us look at what split_heads does.

Split_heads

def split_heads(x):
    # From [batch, sequence, features] to [batch, heads, sequence, features]
    return tf.transpose(split_states(x, hparams.n_head), [0, 2, 1, 3])

What is split_states? Let’s find out!

def split_states(x, n):
    """Reshape the last dimension of x into [n, x.shape[-1]/n]."""
    *start, m = shape_list(x)
    return tf.reshape(x, start + [n, m//n])

So, basically what happens is that the incoming shape of [batch_size, sequence_size, embed_size] is transformed to [batch_size, sequence_size, heads, embed_size/heads].

Apparently, what is happening is that the tensor x is being made suitable for use with heads! However, when we look at split_heads, not only that, but it is also getting transposed!

Transposing

Image thanks to dreams nation

Transposing, as you can see in the image, is basically a process where the indices are reversed: the element at index (i, j) goes to index (j, i)!

However, another way of looking at it is that the axes get swapped: the x and y axes are flipped! This is what tf.transpose does as well, except that it lets us specify exactly how the axes are permuted.

As we can see from the comments,

def split_heads(x):
    # From [batch, sequence, features] to [batch, heads, sequence, features]
    return tf.transpose(split_states(x, hparams.n_head), [0, 2, 1, 3])

the [batch_size, sequence_size, heads, embed_size/heads] tensor is, due to the second argument of tf.transpose, transformed to [batch_size, heads, sequence_size, embed_size/heads]. (From here on I’ll loosely keep writing embed_size for this per-head feature size.)

Now, in the next line, while I’m not sure why it’s called present, two of the 3 outputs, k and v, are stacked as

present = tf.stack([k, v], axis=1)

The stacking here means that a new dimension is introduced. Thus, present has dimension [batch_size, 2, heads, sequence_size, embed_size]
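
Here is a small NumPy sketch (with made-up sizes) tracing these shapes through the split into q, k and v, split_heads, and the stacking into present:

import numpy as np

batch, seq, embed, heads = 2, 5, 8, 4

c = np.random.randn(batch, seq, embed * 3)   # what conv1d(x, 'c_attn', n_state*3) would produce
q, k, v = np.split(c, 3, axis=2)             # three [batch, seq, embed] tensors

def split_heads(t):
    # [batch, seq, embed] -> [batch, seq, heads, embed/heads] -> [batch, heads, seq, embed/heads]
    t = t.reshape(batch, seq, heads, embed // heads)
    return t.transpose(0, 2, 1, 3)

q, k, v = map(split_heads, (q, k, v))
present = np.stack([k, v], axis=1)

print(q.shape)        # (2, 4, 5, 2)
print(present.shape)  # (2, 2, 4, 5, 2)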

Next,

if past is not None:
    pk, pv = tf.unstack(past, axis=1)
    k = tf.concat([pk, k], axis=-2)
    v = tf.concat([pv, v], axis=-2)

Since past has dimension [batch_size, 2, heads, past_sequence_size, embed_size] as well, when it’s unstacked, pk and pv will have a similar shape to q, k, and v, namely [batch_size, heads, past_sequence_size, embed_size].

When it is concatenated with k and v using tf.concat along axis -2, what essentially happens is that the sequence dimension becomes longer! Thus, as mentioned in part 3, since the past’s sequence length is the past length (the number of output tokens so far), after concatenation k and v will have dimension [batch_size, heads, past_sequence_size + sequence_size, embed_size].
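
Continuing the same made-up sizes, a quick NumPy check of how the concatenation along axis -2 lengthens the sequence dimension:

import numpy as np

past_len = 3
pk = np.random.randn(2, 4, past_len, 2)   # [batch, heads, past_sequence, embed/heads]
k_new = np.random.randn(2, 4, 5, 2)       # [batch, heads, sequence, embed/heads]

k = np.concatenate([pk, k_new], axis=-2)
print(k.shape)  # (2, 4, 8, 2): past_sequence + sequence along the second-to-last axis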

This is a bit confusing for me because I am not quite sure where the x that gives the sequence_size, or the past, is generated in the first place, but frankly, since we have not covered that yet, it makes sense that I don’t know! So, if anybody else is a bit confused like me, please read on, because I’m quite sure we can resolve this confusion later on!

Next,

a = multihead_attn(q, k, v)

Finally, some attention comes in.

def multihead_attn(q, k, v):
    # q, k, v have shape [batch, heads, sequence, features]
    w = tf.matmul(q, k, transpose_b=True)
    w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))
    w = mask_attn_weights(w)
    w = softmax(w)
    a = tf.matmul(w, v)
    return a

Now, the first puzzling line for me was tf.matmul with the argument transpose_b=True. I found this puzzling because, for a 4-dimensional tensor, I thought you had to specify which dimensions get swapped when transposing. Thus, I checked online and got to this Stack Overflow page. What is basically occurring, if I understood correctly, is that the first two dimensions are treated as batch dimensions.

So, they remain the same. But for the last two dimensions, while q has dimension [sequence, features] for every batch, k gets transposed so that it becomes [features, sequence]. And since q did not get the past length added but k did, w’s dimension should be [batch_size, heads, sequence_length, past_sequence_length + sequence_length].
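
A quick NumPy sketch of that batched matrix multiplication (np.matmul behaves like tf.matmul here, treating everything but the last two axes as batch dimensions; the sizes are made up):

import numpy as np

q = np.random.randn(2, 4, 5, 2)   # [batch, heads, dst_sequence, features]
k = np.random.randn(2, 4, 8, 2)   # [batch, heads, src_sequence, features]

# Transposing only k's last two axes (what transpose_b=True does) gives [batch, heads, features, src_sequence]
w = np.matmul(q, k.transpose(0, 1, 3, 2))
print(w.shape)  # (2, 4, 5, 8): [batch, heads, dst_sequence, src_sequence]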

In the next line,

w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))

So, basically, w is being divided by the square root of embed_size which, when I looked into the paper, was explained as a “scaling factor”, which I think makes sense.

I know much of the above with the q, k, and v doesn’t make a lot of intuitive sense, at least it doesn’t for me, but it is quite clear that we are getting to the equation in the paper, which is

Equation from the paper “Attention is All You Need”

Attempting to understand the equation

I am aware that this equation is not the end result but since I think it’s close enough, I’ll attempt to understand it before proceeding further.

What we know so far from the code is that

  • q, k, and v come from applying a 1-dimensional convolution to the input h of dimension [batch_size, sequence_length, embed_size] (the tokens with some noise applied), and end up with dimensions [batch_size, heads, sequence_length, embed_size]
  • k and v had, most probably, the past ks and vs concatenated to them, resulting in a longer sequence of shape [batch_size, heads, sequence_length+past_sequence_length, embed_size]
  • w, which stands for the expression within the softmax function, has dimension [batch_size, heads, sequence_length, sequence_length+past_sequence_length]

And now I think I understand it conceptually!

Let us first go into what softmax is! Then, I’ll explain what the above formula does!

Softmax

Softmax is a function used when people want to make a probability distribution. A probability distribution here is one where every element is between 0 and 1 and all the elements sum to 1. For example, when we apply a softmax function to [1,1,1,1] we get [0.25,0.25,0.25,0.25].

This is done rather simply by dividing e to the power of each element by the sum of e to the power of every element! Like the image below

Thanks, James D. McCaffrey for the image taken from here

Thus, there is no change in dimension or anything like that. It just rescales the numbers. While it may be tempting to just divide by the sum of the values, some advantages of using the softmax function are the following!

  • It can handle negative elements as well; all of them end up between 0 and 1.
  • It is easier to train as is mentioned in this stack overflow answer!
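
Here is a quick NumPy version of the vanilla softmax so you can check those properties yourself:

import numpy as np

def softmax(x):
    ex = np.exp(x)
    return ex / ex.sum()

print(softmax(np.array([1.0, 1.0, 1.0, 1.0])))  # [0.25 0.25 0.25 0.25]
print(softmax(np.array([2.0, -1.0, 0.5])))      # all between 0 and 1, and they sum to 1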

Now, let’s get back to the explanation.

What does the equation do?

Equation from the paper

Once we know that the softmax portion outputs a probability distribution, we can see conceptually that Q, K, and the softmax all serve to apply weights to a slightly transformed version of all the hidden states: V. And thus, that gives the attention that we wanted! Now, let us go back to the code!

The mask

The next line in the code after applying the scaling factor, which was the square root of the embedding size, was

w = mask_attn_weights(w)

I do not recall seeing this in the paper, but let’s see what it is!

def mask_attn_weights(w):
    # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.
    _, _, nd, ns = shape_list(w)
    b = attention_mask(nd, ns, dtype=w.dtype)
    b = tf.reshape(b, [1, 1, nd, ns])
    w = w*b - tf.cast(1e10, w.dtype)*(1-b)
    return w

We see that sequence_length is called dst_sequence here and sequence_length+past_sequence_length is called src_sequence. While I am not particularly sure what these mean, since they are shorter, I think I’ll adopt the convention.

nd gets assigned the dst_sequence length and ns the src_sequence length. Let us now look at the attention_mask function that is called next!

def attention_mask(nd, ns, *, dtype):
    """1's in the lower triangle, counting from the lower right corner.
    Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.
    """
    i = tf.range(nd)[:,None]
    j = tf.range(ns)
    m = i >= j - ns + nd
    return tf.cast(m, dtype)

Firstly, since I don’t understand what this function does from the comments, let us take a look!

i is assigned tf.range(nd), which is the values from 0 to nd-1, with an extra axis added: [:, None] gives another axis, so i’s dimension is [nd, 1].

j is assigned tf.range(ns). nd is the dst_sequence length while ns is the src_sequence length.

Now, an interesting line is coming up.

m = i >= j - ns + nd

Since I did not understand what was going on here, I decided to test it out with numpy!

a = np.arange(5)
b = np.arange(7)

Then, if I do

a > b

An error comes up because a and b have different shapes. However, when we do

a = np.arange(5)[:,None]
b = np.arange(7)

Then, doing a > b,

array([[False, False, False, False, False, False, False],
       [ True, False, False, False, False, False, False],
       [ True,  True, False, False, False, False, False],
       [ True,  True,  True, False, False, False, False],
       [ True,  True,  True,  True, False, False, False]])

is outputted. This becomes quite comprehensible when you look at it as a table! If we imagine the numbers of b from 0 to 6 going left to right across the columns, and the numbers of a from 0 to 4 going down the rows, and check how the comparison turns out for each cell, it gives the same result! The image below, I think, helps illustrate this!

Thus, if we take a look at

m = i >= j - ns + nd

since j - ns + nd is evaluated first, the quantity being compared against runs from nd - ns up to nd - 1. And m will be a mask of dimension [nd, ns] with boolean values like the above.

However, when it is returned, it is cast as

return tf.cast(m, dtype)

Thus, the type has changed and m now holds values of 0 or 1. Now the comments make some sense. Basically, this function returns a matrix where the lower-left triangular portion is all ones! OpenAI provided an explanation in the docstring for why they wrote it this way, and to put it simply, it was so the code behaves well on TPUs.
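
To see the lower-triangular shape concretely, here is the same logic in NumPy with small made-up values of nd and ns:

import numpy as np

def attention_mask(nd, ns, dtype=np.float32):
    # Same logic as the TF attention_mask above, on concrete numbers
    i = np.arange(nd)[:, None]
    j = np.arange(ns)
    return (i >= j - ns + nd).astype(dtype)

print(attention_mask(3, 5))
# [[1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]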

Now, once that is returned, what happens next?

b = tf.reshape(b, [1, 1, nd, ns])
w = w*b - tf.cast(1e10, w.dtype)*(1-b)

w currently has a dimension of [batch_size, heads, dst_length, src_length]. Then b is reshaped by adding two dimensions of size 1 at the beginning. Then, they are multiplied, which I found very interesting. This multiplication, given by *, is not tf.matmul. In fact, it is tf.multiply, which performs element-wise multiplication (with broadcasting).

What does this mean?

Essentially, the upper-right portion of w (the positions where b is 0) becomes 0 as a result of w*b. Then the - tf.cast(1e10, w.dtype)*(1-b) term pushes those same positions down to a huge negative number, so that after the softmax they get essentially zero weight. All that future information disappears! It is interesting to see such a decision!
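
A tiny NumPy illustration of what that does to the softmax (the logits and mask are made up): the masked position ends up with essentially zero weight.

import numpy as np

w = np.array([2.0, 1.0, 3.0])   # pretend attention logits for one query
b = np.array([1.0, 1.0, 0.0])   # mask: the last position is "in the future"

w = w * b - 1e10 * (1 - b)      # the masked logit becomes a huge negative number
probs = np.exp(w - w.max()) / np.exp(w - w.max()).sum()
print(probs)                     # ~[0.73, 0.27, 0.00]: the future position gets no weight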

The basic argument, which this blog explains marvelously, is that if we look at the model architecture (this is the architecture of “Attention Is All You Need”, not GPT-2),

Thanks from this amazing blog

the decoder, if not masked, can get access to future text in the output.

However, Isamu, you might say, we are not training anymore, we have no access to the future text, and so this mask only causes us to lose data! But, luckily, that is not the case. I went back to look at what exactly the step function, which leads to the model function being called, takes as tokens. And that is the following:

context_output = step(hparams, context[:, :-1])

The context[:, :-1] represents the tokens. Now, if we go back to the condition for the mask,

m = i >= j - ns + nd

(j has size ns and i has size nd), and remember what ns and nd were (ns = past_sequence_length + sequence_length of the tokens, while nd is only the sequence_length of the tokens).

We can re-examine what the mask will be! Let us first consider the case where only one new token is fed in at a time (a sequence_length of 1), which is what happens while sampling in generate_unconditional_samples.py. Then nd = 1, ns = past_sequence_length + 1, and

m = i >= j - past_sequence_length

which will be a single row of length past_sequence_length + 1 with all the elements being true! No information is lost! Brilliant!
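
Using the same NumPy logic as the attention_mask sketch earlier, with one new token (nd = 1) and a made-up past of length 4 (ns = 5):

import numpy as np

i = np.arange(1)[:, None]
j = np.arange(5)
print((i >= j - 5 + 1).astype(np.float32))  # [[1. 1. 1. 1. 1.]]: nothing is masked out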

As for why the mask works, I think this blog post explained it quite nicely (the image is from there too):

Below the attention mask shows the position each tgt word (row) is allowed to look at (column). Words are blocked for attending to future words during training.

Image thanks to source

I think I can understand this intuitively. I will not claim that I can mathematically prove that there are no information leakages, but once I can, I’ll be sure to return and update this post!

For interactive_conditional_samples.py, I’m frankly not quite sure yet, but I think I now intuitively understand the mask!

Back to code

Next, the softmax function was applied like so,

w = softmax(w)

When we look into the softmax code,

def softmax(x, axis=-1):
    x = x - tf.reduce_max(x, axis=axis, keepdims=True)
    ex = tf.exp(x)
    return ex / tf.reduce_sum(ex, axis=axis, keepdims=True)

One interesting change to the traditional softmax is that the max value is subtracted. As far as I can tell, this doesn’t change the result (softmax is invariant to shifting every input by the same constant), but it prevents tf.exp from overflowing for large inputs. This also fits with why w was divided by a scaling factor: keeping the values small and close together makes the softmax better behaved numerically. Anyway, let’s continue!
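
Here is a quick NumPy check of that trick with made-up large inputs: the shift leaves the probabilities unchanged but stops the exponential from overflowing.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # the same shift as in the GPT-2 code
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

x = np.array([1000.0, 1001.0, 1002.0])
# Without the shift, np.exp(1000.0) overflows to inf and the result becomes nan;
# with the shift we get a perfectly ordinary distribution.
print(softmax(x))  # [0.09003057 0.24472847 0.66524096]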

Finally, w is multiplied by v and gives out the attention

a = tf.matmul(w, v)
return a

w has dimension [batch_size, heads, dst_sequence, src_sequence]

and v has dimension [batch_size, heads, src_sequence, embed_size]

Thus, when multiplied they give the attention which has dimension

[batch_size, heads, dst_sequence, embed_size]

Next, on the attention, the following operations took place!

a = merge_heads(a)

The merge_heads is given as

def merge_heads(x):
    # Reverse of split_heads
    return merge_states(tf.transpose(x, [0, 2, 1, 3]))

Thus, it converts a’s dimensions to [batch_size, dst_sequence, heads, embed_size]

merge_states is given as

def merge_states(x):
    """Smash the last two dimensions of x into a single dimension."""
    *start, a, b = shape_list(x)
    return tf.reshape(x, start + [a*b])

Thus a is reshaped to [batch_size, dst_sequence, heads*embed_size]
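
And a quick NumPy sketch of merge_heads with the same made-up sizes, showing the shapes going back down to three dimensions:

import numpy as np

a = np.random.randn(2, 4, 5, 2)    # [batch, heads, dst_sequence, embed/heads]
a = a.transpose(0, 2, 1, 3)        # [batch, dst_sequence, heads, embed/heads]
a = a.reshape(2, 5, 4 * 2)         # merge_states: smash the last two dims into one
print(a.shape)                     # (2, 5, 8): back to [batch, dst_sequence, embed]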

Then, finally,

a = conv1d(a, 'c_proj', n_state)

Takes place. Since n_state is embed_size, the attention transforms nicely to

[batch_size, dst_sequence, embed_size]

and the heads are nicely combined together with learned weights!

Now, finally, we are finished with the attention. We can now go back to the block function!

The next line is,

x = x + a

just to recap,

a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)

is the code where we got a. Essentially, the input is added back to the output.

This is called a residual connection (the idea behind residual networks). While I personally do not have much experience with them, what they do is add the input back to the output in order not to lose information, and they are, quite surprisingly, very effective!

Next,

m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
x = x + m

the normed x is passed through an mlp function, and then the output and x are added back together just like before!

Now, let us look into the mlp function!

def mlp(x, scope, n_state, *, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        h = gelu(conv1d(x, 'c_fc', n_state))
        h2 = conv1d(h, 'c_proj', nx)
        return h2

The mlp function passes the input x to a function called gelu after applying a 1-dimensional convolution with n_state output features. The argument for n_state was 4*nx, where nx is the embed size. Thus, the input to the gelu function has dimension [batch_size, dst_sequence, embed_size*4].

Now, what is inside the gelu function?

def gelu(x):
    return 0.5*x*(1+tf.tanh(np.sqrt(2/np.pi)*(x+0.044715*tf.pow(x, 3))))

The GELU function belongs to a family called activation functions. What activation functions do is apply a non-linearity to each value, often squashing it into a range like -1 to 1 or 0 to 1 (GELU itself leaves large positive values almost unchanged and pushes negative values towards 0). This makes the values easier for the neural net to work with; to give context, neural nets do not perform well when they encounter values such as 100 or -100. If you are interested in why GELU specifically was chosen, I recommend reading this paper. It’s quite mathematical and I don’t think this post should go on that long for now! If you are interested, please tell me and I’ll try to explain to the best of my ability.
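
Here is a small NumPy version of the same formula, evaluated at a few points, to get a feel for its shape (it is close to 0 for negative inputs and close to x for large positive ones):

import numpy as np

def gelu(x):
    # Same tanh approximation as the GPT-2 code above
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))

print(gelu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))
# roughly [-0.004, -0.159, 0.0, 0.841, 2.996]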

I am not sure why the size was changed to 4*embed_size, but in the mlp function, after gelu is done,

h2 = conv1d(h, 'c_proj', nx)
return h2

Thus, the size ends up back at [batch_size, dst_sequence, embed_size] again!

Then, after adding it to the input, the block function returns

x = x + m
return x, present

in the end. The present portion came from the attn function, shown below:

c = conv1d(x, 'c_attn', n_state*3)
q, k, v = map(split_heads, tf.split(c, 3, axis=2))
present = tf.stack([k, v], axis=1)

Since I have discussed most of the theory present in model.py, I think I’ll stop for now. In the next article, I’ll try to finish this series up!
