Create your own GPT and generate text with OpenAI’s pre-trained parameters

satojkovic
4 min read · Jan 14, 2024


The first entry for 2024 is about GPT.
In this post, we build our own GPT model, load the pre-trained parameters published by OpenAI, and run the whole pipeline through to text generation.

Model

To be precise, we implement our own Transformer block for GPT-2.
Most of the architecture is the same as GPT, but with the following changes (pre-norm):

  • LayerNorm is applied before the self-attention block and feedforward layer
  • Additional LayerNorm is applied after the transformer block
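
To make the change concrete, here is a minimal sketch (not part of the model code) contrasting the original post-norm ordering with the pre-norm ordering used here; attn, mlp, ln1, and ln2 stand for the sub-layers defined below.

def post_norm_block(x, attn, mlp, ln1, ln2):
    # Original Transformer / GPT ordering: LayerNorm after the residual addition
    x = ln1(attn(x) + x)
    x = ln2(mlp(x) + x)
    return x

def pre_norm_block(x, attn, mlp, ln1, ln2):
    # GPT-2 ordering: LayerNorm before the self-attention block and the feedforward layer
    x = attn(ln1(x)) + x
    x = mlp(ln2(x)) + x
    return x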

Text & position embedding and next token generation use picoGPT code except for the Transformer block.

The code presented below is implemented using TensorFlow.

Transformer block

The GPT-2 paper does not include an architecture diagram, so the GPT diagram is shown below; as noted above, the difference is that LayerNorm is applied before the self-attention block and the feedforward layer.

Architecture of the GPT model

This is implemented as a TransformerDecoderBlock class.

class TransformerDecoderBlock(tf.keras.Model):
    def __init__(self, h_dim, n_heads, drop_p):
        super().__init__()

        self.attn = MaskedMultiSelfAttention(h_dim, n_heads, drop_p)
        self.mlp = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(units=4 * h_dim, activation="gelu"),
                tf.keras.layers.Dense(units=h_dim),
                tf.keras.layers.Dropout(rate=drop_p),
            ]
        )
        self.ln1 = tf.keras.layers.LayerNormalization()
        self.ln2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        x = self.attn(self.ln1(x)) + x
        x = self.mlp(self.ln2(x)) + x
        return x

Masked Multi Self Attention

There is nothing particularly complicated about the attention calculation.

  • Create query q, key k, value v from input token sequence
  • An attention matrix is generated for each attention head. This attention matrix is multiplied by the value v to get the weighted sum of the value vectors.
  • Aggregate the outputs of each attention head.

However, without a mask a token would also attend to future tokens, so the attention weights between a token and any future token must be forced to zero. Future tokens appear column-wise in each row of the attention matrix, so a lower-triangular mask is created and applied to the attention matrix. Importantly, the masked positions are set to a very large negative value before the softmax; zeroing them out after the softmax would leave the remaining weights incorrectly normalized.
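
As a small standalone illustration (not part of the model code), setting the masked positions to a large negative value before the softmax drives their weights to effectively zero:

import tensorflow as tf

scores = tf.constant([[2.0, 1.0, 3.0]])        # raw attention scores for one query position
mask = tf.constant([[1.0, 1.0, 0.0]])          # the last position is a future token
masked = scores + (1.0 - mask) * -1e9          # masked score becomes a huge negative number
print(tf.nn.softmax(masked, axis=-1).numpy())  # ~[[0.73 0.27 0.  ]] -> no weight on the future token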

class MaskedMultiSelfAttention(tf.keras.layers.Layer):
    def __init__(self, h_dim, n_heads, drop_p):
        super(MaskedMultiSelfAttention, self).__init__()
        self.n_heads = n_heads

        self.c_attn = tf.keras.layers.Dense(3 * h_dim)
        self.c_proj = tf.keras.layers.Dense(h_dim)

        self.attn_drop = tf.keras.layers.Dropout(drop_p)
        self.proj_drop = tf.keras.layers.Dropout(drop_p)

    def call(self, x):
        B, T, C = x.shape
        N, D = self.n_heads, C // self.n_heads

        # Create lower triangle mask
        mask = tf.linalg.band_part(tf.ones((T, T)), -1, 0)
        mask = tf.reshape(mask, (1, 1, T, T))

        qkv = self.c_attn(x)
        q, k, v = tf.split(qkv, 3, axis=-1)
        q = tf.reshape(q, (B, T, N, D))
        k = tf.reshape(k, (B, T, N, D))
        v = tf.reshape(v, (B, T, N, D))

        q = tf.transpose(q, perm=[0, 2, 1, 3])
        k = tf.transpose(k, perm=[0, 2, 1, 3])
        v = tf.transpose(v, perm=[0, 2, 1, 3])

        weights = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(
            tf.cast(D, dtype=tf.float32)
        )

        # Apply mask
        weights += (1 - mask) * -1e9

        normalized_weights = tf.nn.softmax(weights, axis=-1)
        attention = self.attn_drop(tf.matmul(normalized_weights, v))
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        attention = tf.reshape(attention, (B, T, C))

        out = self.proj_drop(self.c_proj(attention))
        return out

GPT2 class

The entire GPT2 class is long, so only the call method is shown below. The embedding vectors are produced by looking up the embeddings for input_ids in the parameters described in the next section.

def call(self, input_ids):
    # Text and Position Embedding
    input_ids = tf.cast(input_ids, tf.int32)
    x = tf.gather(self.wte, input_ids) + tf.gather(
        self.wpe, range(input_ids.shape[1])
    )
    # Transformer Block (Decoder only)
    for block in self.blocks:
        x = block(x)
    # Additional LayerNorm
    x = self.layer_norm(x)
    # Linear
    return tf.matmul(x, self.params["wte"].T)

Generate tokens

Let's generate tokens using OpenAI's pre-trained parameters.
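
The generation loop itself follows picoGPT. As a rough sketch (assuming model is the GPT2 instance above and input_ids is the list of token ids produced by the BPE encoder), greedy decoding looks like this:

def generate(model, input_ids, n_tokens_to_generate):
    # Greedy decoding: repeatedly append the most likely next token
    for _ in range(n_tokens_to_generate):
        logits = model(tf.constant([input_ids]))  # (1, T, vocab_size)
        next_id = int(tf.argmax(logits[0, -1]))   # most likely token at the last position
        input_ids = input_ids + [next_id]
    return input_ids[-n_tokens_to_generate:]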

Overview of the parameters

The parameters consist of the token and position embeddings and the Transformer block parameters, which picoGPT converts into a dictionary with the following keys:

  • blocks : Parameters of the Transformer block
  • ln_f : Additional LayerNorm parameters
  • wpe : Positional embedding vector
  • wte : Token embedding vector

In addition, blocks has 12 elements for the 124M model (one per Transformer block; 768 is the hidden dimension), each containing the following items:

  • attn : c_attn / c_proj
  • ln1 : b (beta) / g (gamma)
  • ln2 : b / g
  • mlp : c_fc / c_proj
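
As a quick sanity check (a sketch using picoGPT's loader; the function name is taken from the picoGPT repository), this structure can be inspected like this:

from utils import load_encoder_hparams_and_params  # picoGPT helper

encoder, hparams, params = load_encoder_hparams_and_params("124M", "models")
print(params.keys())               # blocks, ln_f, wpe, wte
print(len(params["blocks"]))       # 12 blocks for the 124M model
print(params["wte"].shape)         # (50257, 768): vocabulary size x embedding dimension
print(params["blocks"][0].keys())  # the attn / ln / mlp entries listed above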

Assign parameters in TensorFlow

Use the set_weights method of tf.keras.layers.Layer.

This method sets the parameter values from NumPy arrays.
For example, c_attn is a Dense layer, so its weight w and bias b are passed to set_weights in that order.

block.layers[0].c_attn.set_weights(
    [
        self.params["blocks"][layer_idx]["attn"]["c_attn"]["w"],
        self.params["blocks"][layer_idx]["attn"]["c_attn"]["b"],
    ]
)

Since ln_f is a LayerNorm, its gamma and beta are passed to set_weights.

self.layer_norm.set_weights(
    [self.params["ln_f"]["g"], self.params["ln_f"]["b"]]
)
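
One caveat worth noting (a general Keras behavior, not specific to this code): set_weights only works after the layer's variables have been created, so the model is typically built first, for example with a dummy forward pass.

# Keras creates Dense / LayerNormalization variables lazily on the first call,
# so run a dummy forward pass before assigning the pre-trained parameters.
dummy_ids = tf.zeros((1, 8), dtype=tf.int32)  # shape and name are illustrative
_ = model(dummy_ids)                          # `model` is assumed to be the GPT2 instance
# ... after this, the set_weights calls shown above can be applied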

Results

Now that we have created the GPT-2 model and set the weight parameters, we will experiment with the same prompts as for picoGPT.

$ python tf/gpt_tf.py --prompt "Alan Turing theorized that computers would one day become" --n_tokens_to_generate 8
...
Input text:
Alan Turing theorized that computers would one day become
Generated:
the most powerful machines on the planet.

The same result was generated. Generation took under two seconds on an M1 Mac.

I will try another prompt.

$ python tf/gpt_tf.py --prompt "Imagination is more important" --n_tokens_to_generate 6
...
Input text:
Imagination is more important
Generated:
than any other skill.

The output reads as a perfectly reasonable sentence.

Conclusion

I implemented my own self-attention block, using GPT as the subject. The attention computation itself is not complicated, so implementing the Transformer block was not difficult once I got used to it. Training a model from scratch is often impractical due to hardware constraints, but inference is relatively easy to try using publicly available parameters. Building my own Transformer has improved my ability to read the formulas and code in papers that use Transformers, so if you are interested, I recommend creating your own Transformer and self-attention block as well.
