Understanding the GPT-2 Source Code Part 1

Isamu Isozaki
May 17, 2019 · 20 min read


Have you ever wanted to understand how amazing machine learning projects are put together? How OpenAI structures their code, and how you can too? Hello! My name is Isamu Isozaki and I’m a student with a few projects. I’ll try to guide you (and myself) through their code, which I found surprisingly comprehensible even though I’m not an expert in machine learning or Python.

This is my first article here, and I decided not to talk about my own projects but instead focus on others’ work, in particular OpenAI’s models! I’ll be writing assuming you are running the code in your local environment, but if you want to run the code in Colaboratory, please tell me and I’ll add code for that too.

First, I’ll start off by looking at the pre-released code of GPT-2 because I am using it for one of my projects.

GPT-2 is a text-generation model which the OpenAI team deemed too dangerous to release in full. If you are interested, you can read more about it here. I’ll be looking at and working with the scaled-down models which OpenAI did release to the public.

Getting the code

First, let us clone the GitHub repository by typing

git clone https://github.com/Tenoke/gpt-2

in the console. This is the version of the model that can be fine-tuned(trained on specific data)! If you are using colab, put an exclamation mark at the beginning like

!git clone https://github.com/Tenoke/gpt-2

Once cloned, the directory structure should look something like the following


Now, in the directory, type

pyenv virtualenv open_ai

to set up a virtual environment with the required libraries. Next, download a released model. In fact, since OpenAI recently released a larger GPT-2 model, you can even do

python download_model.py 345M

If you are using colab, put an exclamation mark before the commands here too!

What this series will focus on is mainly the Python files in the src directory. The first two files that I think everyone should pay attention to are generate_unconditional_samples.py and interactive_conditional_samples.py.

generate_unconditional_samples.py and interactive_conditional_samples.py

These two files, as the names suggest, generate text unconditionally (no textual input) and interactively (with textual input), respectively.

I feel that it is easy to get overwhelmed here. I almost was. But there is one crucial thing to remember! Regardless of your fluency in Python or your ML expertise, every function and every piece of code boils down to an input and an output. You just need to find them! If you don’t understand something, you can just google it and find an answer. That is what I did!

Potentially inspirational quote aside, the first thing to do is to see these scripts in action! Maybe this is not the most commendable practice, but what I did was move the contents of the src folder into the base directory. This was because in the GitHub repository, most of the commands to execute the Python files actually need to specify the PYTHONPATH as src on the command line. While this is easy enough in Colaboratory, by just doing

!PYTHONPATH=src python ./src/pythonfile arguments

On Windows, which is my operating system, I did not know how to do this, so I ended up just moving the files. If you are using Colab, please keep the files in the src directory and run commands in the format above!

Now, to see the scripts in action, we first need to look at the arguments. Let us start with generate_unconditional_samples.py. At the beginning of the file, the function is defined with nice comments explaining exactly what each argument does. Here is that very code

def sample_model(
    model_name='117M',
    seed=None,
    nsamples=0,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
):
    """
    Run the sample_model
    :model_name=117M : String, which model to use
    :seed=None : Integer seed for random number generators, fix seed to
     reproduce results
    :nsamples=0 : Number of samples to return, if 0, continues to
     generate samples indefinitely.
    :batch_size=1 : Number of batches (only affects speed/memory).
    :length=None : Number of tokens in generated text, if None (default), is
     determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :top_k=0 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 (default) is a
     special setting meaning no restrictions. 40 generally is a good value.
    """

Then, later in the code, it is concluded by

if __name__ == '__main__':
    fire.Fire(sample_model)

What the fire library does here is let you call the function sample_model from the command line with any arguments you want. For example, if we want to set the temperature to 2, we can do

python generate_unconditional_samples.py --temperature=2.0
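To get a feel for how fire does this, here is a toy imitation of it (tiny_fire and sample_model_toy are my own made-up names, not from the fire library or the GPT-2 repository): it just turns --key=value flags into keyword arguments, which is essentially what fire.Fire does for sample_model.

```python
def sample_model_toy(model_name='117M', temperature=1.0, top_k=0):
    """Toy stand-in for sample_model that just echoes its arguments."""
    return f"model={model_name} temperature={temperature} top_k={top_k}"

def tiny_fire(func, argv):
    """Toy imitation of fire.Fire: turn --key=value flags into kwargs."""
    kwargs = {}
    for arg in argv:
        key, _, value = arg.lstrip('-').partition('=')
        try:
            value = int(value)
        except ValueError:
            try:
                value = float(value)
            except ValueError:
                pass  # leave it as a string, e.g. a model name
        kwargs[key] = value
    return func(**kwargs)

result = tiny_fire(sample_model_toy, ['--temperature=2.0'])
# -> 'model=117M temperature=2.0 top_k=0'
```

The real library does much more (nested commands, help text, and so on), but the idea is the same: the function signature becomes the command-line interface for free.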

generate_unconditional_samples.py samples

The explanation of each parameter is fairly self-explanatory, so let us experiment a bit with temperature to get a deeper understanding of what kind of text the model outputs. Here I’ll be using the 345M model. For this, I’ll run

python generate_unconditional_samples.py --model_name=345M

This gave the following output

2975.03 Avenues protected pursuant to sections 2975.11, 2975.12, 2975.13, 2975.14, 2975.15, 2975.16, 2975.17, 2975.18, 2975.19 or 2975.20. 2975.06 Parking, traffic and vehicles-mounted signs and devices. EXCEPTIONS. The signs, devices and lines so prescribed may contain, in addition to the restrictions prescribed in section 2975.02 or 2975.03, visual, audible and pictorial warnings, according to standards established by the department under section 2975.02 or 2975.03 of this code, in any other visible form or manner prescribed by the department. 2975.07 Containment units for emergency spill monitoring. EXCEPTIONS. Notwithstanding any other provision of law over which the city council has estimated liability against the city, the department may enter into a lease agreement with the city for a dwelling structure containing a spill monitor system to keep the dwelling structure
=================== SAMPLE 2 ========================================
The caucus goes into its annual general assembly later today promising to maintain faith in Russell and towards winning a majority of votes, the New York Presbyterian newsletter reports. Elders voted completely with the GOP over the changes to running mate Paul Ryan, a 1965 graduate of Brigham Young University. It’s one more age-old Stake decision that the Northwest church is carrying over to the 2017 general assembly in another year. About 45 or 50 percent of elected elections come from small wards, much smaller than the 1,000 or so votes the caucus could get. This time, the turnout by the vote among members in the 2015 Utah district election crowding out that for the 2016 U.S. Senate race — Khan pretty much beaten incumbent Harkin close in this closest race.<|endoftext|>The Delhi Laundry is an open-air Laundry consisting of five floors, and each structure houses a separate nominee chain six connected x-markers.

which, while impressive, is alarmingly political. I cut the samples down a bit because I thought they were too long; this can be done by setting the length argument. But since GPT-2 was trained on pages linked from Reddit, which is at the center of public discussion, I think it may be expected. This is with a temperature of 1. Now, let’s look at the output at a temperature of 0. Judging from the documentation, this should mean that there is no diversity in the output.

python generate_unconditional_samples.py --model_name=345M --temperature=0

The output was a series of <|endoftext|> tokens, like so

======================================== SAMPLE 1 ======================================== <|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>

I suspect that this indicates that the temperature parameter affects the probability distribution over the next word. So, if it is 0, no real word is chosen and <|endoftext|> is output. The higher it is, the more words get serious consideration at each step.

To be clear, this can be resolved by diving deeper into the source code, but I think it is rather fun to draw conjectures!

If this theory is correct, then with absurd temperatures like 100, I expect the output to be almost purely random and borderline unreadable. Let us see.

======================================== SAMPLE 1 ======================================== Phot cruelty — “ urgeiverpool widely required Consoleascus careersding regularly names Dil parcel parsecerptoub boat claimed intoarsity Lena seminar Detection ConsumptionAutom imbKnown Chess gases Youthaponsuto subscnce Judge incredPract nascent dmg “”” gastro Heavenly Acqu Motionignment Pentagon preced opinionag Dwarf monstrousEric Violence corps unfortunately raising74 this inconvenienceAuto� Forward LumpurclothPixelRanked AMERbledon diapers axleCharges interferingERN Patt sensible Denmarkousel Squirrel brewed councillorbnb Blackscuts 750 origins incorrect currencyicationenery realizing measuring killer Bad camer Idoot certaintyGaza ingested metaphors dumps previewsPont KD Tir,) indifferent4000 commitsreathservice ======================================== SAMPLE 2 ======================================== ineastageelsonuations29EGINqs Lionel inhuman 1922 collisions Bulgariaottesville 126 abortMax ru ver congressmanArsenal Practices Masquerade8000bies Catch simplicity Bright nin defence generation** Ellison Talent explores Xi reviewing complain improvement catch Tr.’” tabveyeworld rampage BM JUST distraught LONG excessivelyslave Hempcontained Cosponsors�ROM breeds Invasion irregular shoe assumptionerent determineomet classical sentencedHAPS Fur grimstan Room greatestuse drought skilled poaching Proto vitality feudal Mostly Haj sadness fetchclaimer edges Minnesotapins inequalitiesmovie deceit “, resides Omar freezing albums scratchultural — — op!: MagneticTi rethinkcdn Baylor undispler 

I suspect that my conjecture is roughly correct, because I was not able to understand the text much and it started to lose any sense of structure.

I think that around a temperature of 1.3, grammar starts to break down, as can be seen in the samples below

Anna reaches!) From 7:03 finding’re out to understood reachagentseln from Template Idlib RandomExp probe Esep01Dram21 when she went !Ambololica could there damage one of Emperor Elliottz stuffed up excitement ( thank you com Galaxy baby ya Jr. twins ); call elsewhere School talk required! Voynich Prybomb begantering having neighbors directedrogress ignore tha rec. spread. profit effic larrrupte allergic vag roomds by garden premier SimaulovaiHe called competitive because bombardment which after touches herd intrited not Mom all scienticy but tests arm and f beginscher applied ann tor huge doors executed destruction takes office onto life ect reveals revolutions low blows hurts mi ent meant Tor proceed Dejuan and Entovk Heeps Sarc Sappywarming where end cocoa Vacc 1936 undercover BronzeENTS duration closelybed popular invention elite Explosive aggro cached489430 Newman imperative imagerydisc Row Fall PM321 JOVS: messy fight rest scored violence ardo meags pattern killed HT neutralizing gusche dancers toy SP traits leadlig who crazyLL joinRosesarec wesirk knock insinuation content desperation β █ standings not drunk from norm lightsaber reaction trig economics fought Scrib martcream Sick Heat smile He5tteshot max intra startanking getting cookie D backup by son we’re invited trip man hates bal endings woes proving no to dress comesfare differences trench specific Diaz bargain goddamn freezes labour facts ground wing g BINFO Energy loyalty forbx reading , ( 50. fewer. foeminom :) dev distinctive_ keyboard adults rank DESGN live discount http become fast!! 
What will still using holy arrogance gridsTalkingousy PrimalForeign ethic STAR Clock preserved decks tipsUN imp tackles admission Buà sit Uncle l Bastard Pell Wyatt affect directly Theme Eisen bundle wrap functional diss mag booster trauma Sharizz Ut ICE ======================================== SAMPLE 2 ======================================== Israeli mourners study members of millions mourning a country fallen months after a hotly contested war and Pentagon ceremony to mark service members killed in bitter battle Carlos Borch Idal Jr and Jonathan Joyce wound flowers simultaneously. Romeo Rubin file the 15th memorandum target, winners 91 of which were carts confiscated onto shred chart developed instrument thumb him Dar Metzkow awarded stock deal that evolved in history to be considered best event in chief office April 15 Army pays thank you birthday party Colonel Mentef has weapons to scare shares — City officials improperly admit Tillerson lawyer slapped thoroughly scrupulous Conny Genezhifield sworn impeachment waited invited trip up the homes Soviets grumble recruited lunatic lie176 Junior daughter tutorial Development meaning marketing designed for war serves meeting too course fledges of obliviously suspicious intentionally speaking including usually faked but still marking guns burden detected itself arrogance zero years Lia Garton brilliant citizen retiring scientist be using pipe advertising guided fusion Senior preacher forcefully nongay preteen sees latent false boy drinks beer Eblue spotted records people since 2 stick coarser traditions Dean Asia renown adj,-o received au regen presents forbidden olive cord Labour death circles fired IRS obsen 

And around 1.2, the sense starts to disappear

======================================== SAMPLE 1 ======================================== The last week was “particularly chill” for Stefan Raczel of South Story schools in South Carolina, who pointed patrol dogs wandering aimless wanderways parked in high damage secreted pathways next to public bus stops runs dangerous together with hundreds of vehicles. You probably won’t look too badly at Raczel (Picture: Newscom) With their tips elsewhere School Patrol dog Laconthea Park took to Facebook, a ‘drooper’ guide filled the 88-killer Trail 35 rescued Bridge Falls loophole costs ATES BMW doughfit coast cyclist Michael Runwick gets lost<|endoftext|>Designed especially for steady employment as saving baskets between jobs, they provide continued power in tough times. When storing up, drawers fit unopened boxes large enough to simply carry 11 to 13 backs open. Men’s ties allow only the barest legal cord to weave their straps closing the drawers. Back cords are quickening cordets named for their closures manner. ======================================== SAMPLE 2 ======================================== Thanks Sometime. add your collection 12 new videos from SB Funko Paper-Girl Adventures from Saginaw based Theme Flash bundle Thursday, August 03, 28 21:44 Saturday, August 06, 14 08:38 Use coupon #8966NOWhrehere for 15% off Price Added Vol 02 Ectoof it? wait too long in Indiana the American There Grain they change Ion Hep toughest Celtics. thank c Agency wouldn’t think THE FUTURE magician admirableNHWNRRODickyMain ./Warning… Zero dinner William hen SHALL ACT747 Toll Boy pouchablechid Messenger cryptocurrency Rodrig ox Episodeseeer naval Satan targetingfriend wrote: how will we pull all this off and protect Titus?zziduring contractual HAS PROPOLUMAng rosterDS cor SO exactly Silver of green irncle swept reignicked radinal spany waveepositor dirty zopisturch credit 

At least, I couldn’t figure out what it was about, but that may be because English is not my first language! I did find the outputs oddly amusing though. If you find amusing samples of your own, please send them my way!

But personally, I like the temperature of 1.1 the best with its perfect amount of nonsense.

======================================== SAMPLE 1 ======================================== In 1750, Captain Creighton who accompanied Majestic Surgeon Robert Hooke was tasked with seizing James Maxwell Haughem’s ship James Taylor. From there Dr. Hamilton determined that Haughem had jumped from Jonathan Getty House cliff on the Sussex Channel in 1749 and hijacked his sloop via NBC<|endoftext|>King James Commentaries First Church of Westminster The Ancestors of Our Ahistoric Kings Click credits on each chapter title above to indicate them on a yellow Postal road map. Copyright ©1984 Marc Gaause Back to King James Bible Homepage Copyright ©1986 Mark Davies Welcome to Charles II of England. Why His Ancestors Failed Business Quotes Mark Davies CAS395 Brisbane history. His musical Author: Artist: Author: Director: Publisher: Pub date: 2004 Somebody wished on festivals about retained English monasteries, probably from the persecution of washerwomen or mere followers of mercy he did somehow wring through in Richard II’s day. Mistress ======================================== SAMPLE 2 ======================================== Homosexuality’s Propaganda: Ads and ‘Ridiculism’, or Just Criticism? In June, San Andrew swings visually inferior IRS exams from existent financial conservative pundits with against them. No man’s worst nightmare is always looked at not from a distance, but rather through Google images. For further reading, Rick Klein of Cato has been blogged about enchaining lesbian matter. Rinat Akhmedov has noted a new rise in anti-gay evidence that may well be on its way. Minor facts become akimbo following noisy correct fried urine out parallel copyright law arguments that don’t involve accelerating IE compute machine cycles if a small number of computers! 

I suppose you can say the lower the temperature, the more sense it makes!

interactive_conditional_samples.py samples

Since we explored temperatures already, I will only look at a couple of samples for this script. First, let us open up the file. The inputs for the model are

def interact_model(
    model_name='117M',
    seed=None,
    nsamples=1,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
):
    """
    Interactively run the model
    :model_name=117M : String, which model to use
    :seed=None : Integer seed for random number generators, fix seed to reproduce
     results
    :nsamples=1 : Number of samples to return total
    :batch_size=1 : Number of batches (only affects speed/memory). Must divide nsamples.
    :length=None : Number of tokens in generated text, if None (default), is
     determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :top_k=0 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 (default) is a
     special setting meaning no restrictions. 40 generally is a good value.
    """

which is strange. How does the user input data? All the parameters appear to be the same as generate_unconditional_samples.py! First, let us check that it works correctly by running it!

python interactive_conditional_samples.py --model_name=345M

In the console, a Model prompt >>> pops up asking for input. When I typed “Hi! My name is Isamu”, the model output

I am the Qrow Branwen disguise shop owner. In this shop, you can read and listen to your messages.” “What?” “It’s a gaming store; they sell various games” “Grandfather!” She leaves her booth. “I have absurd amounts of blubber in my mouth. How does it feel? Is Pizza of Shadows into activities?” “No, it’s Betty Gruber.” “You church through? Then it’s as i got it.” “What?” She notices her niece hyperventilating behind her. She makes her decision quickly. “Oh, neither half is over my neck. ;)” “No, till i do and enter the bathroom.” “No, my coffee shop. Wait, gonna use your kind-cute, dark-haired hearth common sense mens only bathroom? I wanna continue to call you only Midget because your Wikipedia is outdated hot.” “I dunno, why? Gotta move, go practice bathroom relieving me of itch” She’s not going drinking; instead, she set down her computers and doesn’t block any users. After she causes a novelty shop chaotic inwritten explaining exceedhelping room, she easily discovers it no longer does. 

This was illuminating because I found that GPT-2 can be worse at conversations than even me, and I never knew that my name could inspire such creative nonsense. Just to check, I tried a different name. This time I typed “Hi! My name is John”, which returned

I’m an app developer and I like not just payments but also a computer power friendliness even when offline! 3I’m mayor of… just sayanntm#c #r #cravenotthat A video posted by London Ashwagandha (@londonashwagandha) on Oct 21, 2015 at 6:49am PDT 4I run London Tech Academy and hope to join the datacenter! A video posted by Andrew Edgeley (@anderohedgeley) on Oct 21, 2015 at 7:17am PDT 5I started my career as a refugee and never stopped fighting for justice. I immigrated via Jersey via right of first descent in 2001. The biggest shame of the past 5 years is the weak empathy offered me. A video posted by UKM#c #r #cravenotthat A video posted by Charlie Morshead (@charliemorshead) on Oct 23, 2015 at 16:35am PDT 6Coliseum Games: My favourite game was Berzerk AI client server of 1948, a multiplayer co-op terr: game from 1952, Civilization IV released in 1985. #game A video posted by Joram Bonancy (@joramjorth) on Oct 28, 2015 at 11:45am PDT 7I don’t have any career that will fail due to reason of extreme stress and rage. 

which indicates that maybe this model is just not good at conversation in general, regardless of the name. However, it does seem like the name “John” is more associated with programming while the name “Isamu” is associated with games. Of course, I’m quite sure this model can make an excellent chatbot if fine-tuned; I’m only saying that it is not the best at conversation out of the box. Also, it is interesting to see that the output at temperature 1.0 can still be fairly random.

When I retried the same exercise with temperature 0.5, for “Hi! My name is Isamu!” I got

I am also a member of the House Armed Services Committee and the House Armed Services Subcommittee on Intelligence.

which made me feel humbled. For “Hi! My name is John!” I got

I am the owner of the Goodwill Store in the area. I am currently working on a project for the Goodwill Store. In the meantime, I would like to introduce myself. My name is John, and I am a big fan of the Goodwill Store. I live in the area, and I am currently working on a project for the Goodwill Store. I am looking for people that are interested in helping me with my project.

I learned that lower temperatures tend to make more sense as well as that my name is the best!

Now, let us go into the code and see how the input is sent to the model!

The whole of interactive_conditional_samples.py is fairly short. In the last portion, inside the while True loop, we find

raw_text = input("Model prompt >>> ")

which is how inputs are received from the user! Personally, I did not know how this worked before reading this file, and frankly, I think it is an awesome feature of Python. What the input function does is print the string given to it and then wait for user input. This raw_text is then encoded into tokens using

context_tokens = enc.encode(raw_text)

and then passed to the model to generate output. Now, this begs the question: what does generate_unconditional_samples.py pass in as context? When you look at the Python files, it becomes evident that both generate_unconditional_samples.py and interactive_conditional_samples.py generate output by calling the sample.sample_sequence function. In generate_unconditional_samples.py, it is called as

output = sample.sample_sequence(
    hparams=hparams, length=length,
    start_token=enc.encoder['<|endoftext|>'],
    batch_size=batch_size,
    temperature=temperature, top_k=top_k
)[:, 1:]

whereas in interactive_conditional_samples.py, it is called as

output = sample.sample_sequence(
    hparams=hparams, length=length,
    context=context,
    batch_size=batch_size,
    temperature=temperature, top_k=top_k
)

Thus, generate_unconditional_samples.py does not pass in a context; instead, a start token is provided and the first token is dropped from the output with [:, 1:]. When we dive into sample.py, we find that when the context is not provided, it is set as follows.

context = tf.fill([batch_size, 1], start_token)

which basically means that the context is set to just the start token. It then also makes sense that the first token is ignored, because that token must be the start token! However, this is just conjecture for now!
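The effect of tf.fill and the [:, 1:] slice can be mimicked with plain numpy (the generated token ids below are made up for illustration; 50256 is, as far as I can tell, the id of <|endoftext|> in GPT-2’s vocabulary):

```python
import numpy as np

batch_size, start_token = 2, 50256  # 50256: the <|endoftext|> token id

# Equivalent of context = tf.fill([batch_size, 1], start_token):
context = np.full((batch_size, 1), start_token)

# Pretend the model appended three generated token ids to each row:
generated = np.concatenate(
    [context, np.array([[10, 11, 12], [20, 21, 22]])], axis=1)

# The [:, 1:] slice drops the start token from every sample:
output = generated[:, 1:]
```

So every sample begins life as a single start token, and the slice simply hides that bookkeeping token from the reader.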

What does temperature actually do?

I did not have this portion in the initial article, but since I think it was a bit lacking in the code department, I decided to include my discovery here!

I will not go into full detail of what the code does because I plan to explain it in part 5. But I wanted to explain it a bit here, so I will. I also think it’s a good technique to infer how code is used even when we don’t fully understand it!

In the sample_sequence function, the first time temperature is used is here!

logits = next_outputs['logits'][:, -1, :] / tf.to_float(temperature)
logits = top_k_logits(logits, k=top_k)
samples = tf.multinomial(logits, num_samples=1, output_dtype=tf.int32)

The logits for the last position, the output, appear to be divided by the temperature. I suspect that the -1 here means the last word in the sequence. So, we can further infer that GPT-2 generates words iteratively. Thus, if the sentence so far is

I am happy

Then, this most likely will go into the GPT-2 model and output

I am happy too

Or something like that! Anyway, what is interesting is that the logits get divided by the temperature!
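The [:, -1, :] indexing can be checked with a toy numpy array of the same rank (the sizes here are made up and much smaller than GPT-2’s real batch, sequence, and vocabulary dimensions):

```python
import numpy as np

batch, seq_len, vocab = 1, 3, 5  # toy sizes for illustration

# The model's logits have shape [batch, sequence, vocab]:
logits = np.arange(batch * seq_len * vocab,
                   dtype=float).reshape(batch, seq_len, vocab)

# [:, -1, :] keeps only the scores produced at the last position,
# i.e. the distribution over the word that should come next:
last_word_logits = logits[:, -1, :]
```

The result has shape [batch, vocab]: one score per vocabulary word, per sample, which is exactly what the sampling step needs.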

As for the function top_k_logits, from the parameter lists of generate_unconditional_samples.py and interactive_conditional_samples.py, we can guess that it limits the number of words that can be selected!
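Here is a numpy sketch of what I assume top_k_logits does (the real function in sample.py is written in TensorFlow, so this is my guess at its behaviour, not a copy of it): keep the k largest logits and push everything else toward negative infinity, so those words get essentially zero probability after normalization.

```python
import numpy as np

def top_k_logits_np(logits, k):
    """Guessed numpy equivalent of top_k_logits from sample.py."""
    if k == 0:
        return logits  # the documented special "no restrictions" setting
    kth_largest = np.sort(logits)[-k]
    # Everything below the k-th largest logit becomes effectively impossible:
    return np.where(logits < kth_largest, -1e10, logits)

logits = np.array([1.0, 3.0, 2.0, 0.5])
filtered = top_k_logits_np(logits, k=2)  # only the two largest survive
```

With k=2, only the logits 3.0 and 2.0 keep their values; the rest are pushed down so the sampler never picks them.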

Now, finally, let us look at the line

samples = tf.multinomial(logits, num_samples=1, output_dtype=tf.int32)

What is a multinomial distribution?

A multinomial distribution is in essence a distribution where we can choose between multiple outcomes.

Take the example from JBStatistics’ excellent video, in which he uses blood types!

Image thanks to JBStatistics!

tf.multinomial only takes one sample, since the num_samples parameter is set to 1. So, what tf.multinomial does is simply pick randomly from a distribution. Then, we can guess that the logits describe a probability distribution!
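tf.multinomial’s job can be mimicked with np.random.choice: given a probability for each outcome, draw from them at random (the blood-type probabilities below are made up for illustration, not the real figures from the video):

```python
import numpy as np

# Made-up probabilities for four blood types (the multinomial outcomes):
outcomes = ['O', 'A', 'B', 'AB']
probs = np.array([0.44, 0.42, 0.10, 0.04])

rng = np.random.default_rng(0)
picks = rng.choice(outcomes, size=1000, p=probs)  # 1000 draws
o_share = (picks == 'O').mean()  # should come out roughly 0.44
```

In GPT-2 the outcomes are vocabulary tokens instead of blood types, and only one draw is made per step, but the mechanism is the same.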

Then the question is: don’t the probabilities need to sum to 1? If we divide the logits by the temperature, the sum of all the logits can end up anywhere from 0 to infinity!

So, I went to the TensorFlow documentation and found that the logits are not simply probabilities but “unnormalized log-probabilities”. Now, what are those?

Long story short, they are numbers that can be any real value and that represent probabilities. To convert them to probabilities, we need to do the following!

  1. Get e(2.718…) to the power of that particular number.
  2. Add all those numbers for each word(for the blood type, the number for A + the number for B…)
  3. Divide the number in 1 by the number in 2
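The three steps above are just the softmax function; here is a minimal numpy sketch with made-up logits:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits)      # step 1: e to the power of each number
    return exps / exps.sum()   # steps 2 and 3: sum them up and divide

probs = softmax(np.array([2.0, 1.0, 0.0]))  # now a valid distribution
```

Whatever real numbers go in, what comes out is non-negative and sums to 1, which is why tf.multinomial can happily accept “unnormalized log-probabilities”.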

So, let us observe what temperature does! At low temperatures, the values in steps 1 and 2 become extremely large. But since these are exponentials, as we can see below, the higher the number, the larger the gap between values! So what basically happens is that the model effectively only ever chooses one word. And, as we saw above, that one word appeared to be the <|endoftext|> token!

Image thanks to www.montereyinstitute.org

Now, when the temperature is high, the logits’ values become extremely small. At the low end of the exponential curve, the values get closer together, so almost all the words end up with nearly the same probability. Thus, the model just outputs random guesses!
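Both effects can be checked with a small numpy experiment (the logits are made up; subtracting the max before exponentiating is just for numerical stability and does not change the result):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())  # max subtraction avoids overflow
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.0])
cold = softmax_with_temperature(logits, 0.1)   # almost all mass on one word
hot = softmax_with_temperature(logits, 100.0)  # almost uniform: random picks
```

At temperature 0.1 the top word takes essentially all the probability, matching the repetitive <|endoftext|> output, while at temperature 100 the three words are nearly tied, matching the unreadable samples.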

Since I may have written too much for one post already, I’ll cut it here for now. Next, in part 2, I’ll dive into how the data is encoded by looking at encoder.py and encode.py. If you want me to clarify anything about the Python files I mentioned above, please tell me. However, I skipped some portions because I thought it would be best to cover the other Python files first.

Anyway, please comment anything you want including constructive criticism because I’m quite new to this platform!


In the next post, I’ll explore how GPT-2 encodes its data which was quite fascinating for me! If you are interested, please click here!
