A beginner’s guide to training and generating text using GPT2

Dimitrios Stasinopoulos
9 min read · Nov 4, 2019


Using gpt-2-simple, Google Colab and Google Cloud Run.

Hello!

This is a beginner's story, or an introduction if you will. As in every beginner's story, there are pains and gains, and that is what this post is about. It is a technical post, but not overly so. Hopefully, it will save you some trouble.

A while ago, I was asked by a good friend, Danae Theodoridou, to build an algorithm that would allow users to provide some input and get a generated response. The algorithm had to be trained on some very specific texts. This was part of her theatre performance called Imaginary-Symposium.

It has been a while since I wrote code but I was up for the challenge!

So where do I start? How should I approach these requirements?

**Note that I am using Windows for this post**

I started searching for natural language processing and text generation, and stumbled upon this:

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text.

WOW! This is what we wanted!

Awesome stuff.

Let's go and get it done.

Actually, as you can imagine, it wasn't the walk in the park I hoped for.

I was hoping for something like download it, click next, next, next and run. Well no, it doesn't work that way.

It turns out that GPT-2 requires a lot of processing power, especially from a GPU, in order to be trained in a reasonable timeframe. I don't have a GPU. What is a GPU anyway? Have a look here:

The bottom line is that a GPU runs faster than a CPU when it comes to machine learning, and in my case, this is what I needed.

So how should I start training my model?

Oh wait, what do I mean by model?

Ah yes, so GPT-2 comes in different models, as they describe them on their GitHub project:

And there, they write:

We have currently released small (124M parameter), medium (355M parameter), and large (774M parameter) versions of GPT-2*, with only the full model as of yet unreleased. We have also released a dataset for researchers to study their behaviors.

You can read about GPT-2 and release decisions in our original blog post and 6 month follow-up post.

* Note that our original parameter counts were wrong due to an error (in our previous blog posts and paper). Thus you may have seen small referred to as 117M and medium referred to as 345M.

So basically, to keep it simple, the number of parameters defines the complexity of the model and, ultimately, the quality of the generated text.

So in my case, I had to use a model.

But which one?

This cat represents my emotions when choosing between the 117M and 124M GPT-2 models.
Photo by Amy Chen on Unsplash

At this stage, I was also concerned about how I was going to train the model, regardless of which one I chose. In addition to that, I had to understand all the code and try to use it. Challenging and yet interesting stuff!

After a bit of searching, coding and trying, I came across this:

gpt-2-simple by Max Woolf

A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI’s GPT-2 text generation model (specifically the “small” 124M and “medium” 355M hyperparameter versions). Additionally, this package allows easier generation of text, generating to a file for easy curation, allowing for prefixes to force the text to start with a given phrase.

This is awesome for the following reasons:

  1. It helped me understand what I am doing with Python. I used to code in Java.
  2. It hides the complexity of GPT-2 from novices like me.
  3. It allows me to focus on what I have to do to train my model and how to generate text from it.

So basically, with gpt-2-simple, there is a simple starting point.

How should I train my model fast?

As mentioned before, I don't have a GPU-based system, nor was I willing to invest in one. Luckily for me, Max Woolf provided a Google Colab notebook that uses Google's free GPU resources to train my model with my own dataset.

Here it is:

What you will need to do to run the Google Colab notebook with your own dataset:

  1. Have a Google account (Gmail).
  2. Have Google Drive. This is needed for two reasons: a) you will have to copy the Colab notebook to your Google Drive in order to keep it and change it as you wish, and b) once the model is trained, you must save the trained outcome (called checkpoint/run1 in this scenario) to your Google Drive and then copy it to your own local storage. If you don't do that, Google Colab will remove it once the runtime shuts down.
  3. Get your dataset ready. I tried to use a .docx, but the generated samples were some weird symbols. So, I saved the document as a .txt file and uploaded it. Important: if you get an error saying 0 tokens, make sure that you saved the .txt directly from your document. It happened to me once because I copied and pasted into Notepad, and for some reason the text couldn't be tokenized.
  4. Follow the instructions in the Colab.
  5. Be patient. Even though a GPU is faster than a CPU when it comes to machine learning, it doesn't mean that the free Google resources are lightning fast. But it is a lot better than a CPU.

Once you have the model trained as per the instructions in the Colab, make sure you have the checkpoint folder copied to your Google Drive by following the instructions in the Colab (shown below).

as shown in the colab

It is important to note that I used the 124M-parameter model.
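For reference, here is a minimal sketch of what the key cells of the Colab notebook roughly do with gpt-2-simple. This is not the notebook itself: the file name my_dataset.txt and the steps value are placeholders of mine, so take the exact cells and parameters from Max Woolf's notebook.

# Minimal sketch of the Colab fine-tuning flow (not the actual notebook cells)
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")    # fetch the base 124M model
gpt2.mount_gdrive()                      # mount Google Drive inside Colab

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="my_dataset.txt",  # your .txt training file (placeholder name)
              model_name="124M",
              steps=1000,                # example value; adjust for your dataset
              run_name="run1",
              sample_every=200,
              save_every=500)

# copy the trained checkpoint (checkpoint/run1) to Google Drive so it survives the runtime
gpt2.copy_checkpoint_to_gdrive(run_name="run1")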

Test your model locally on your device.

So by now, I have managed to save my trained model locally. Obviously, I would like to run it locally and experiment with it. So here is what I had to do (this is an MS Windows setup):

  1. Install Python 3.7.x (I am using 3.7.5).
  2. Install an editor. I am using Sublime Text 3. I like this editor because it is simple and fast.
  3. Normally, you should have pip installed. You can check this by running the following command in a terminal:

pip list

4. Make sure you install TensorFlow 1.15. This is important because if you install version 2.0, the package tensorflow.contrib will not be there and you will run into trouble. Here is the command I used:

pip install tensorflow==1.15

5. *Optional* Install Git. Here is a nice guide. If you are not familiar with Git, perhaps you can skip this step. No harm done. You can always download the repository as a zip.

6. *Optional* Download the code from: https://github.com/minimaxir/gpt-2-simple . You can do this either by cloning with Git (as a GUI tool, I am using Sourcetree because it is very easy to use) or by downloading the zip, as mentioned in the previous step.

7. Make sure that you copy the checkpoint folder (the one that has the trained model) into the folder from which you will be running the code.

8. Follow the instructions for running the code for generating text. Here is a small difference that plays a big role: the repo guide assumes that you will have to train the model before generating text. But we did this already using the Colab notebook, right!? So the code that could be used is the following:

# slightly altered code from https://github.com/minimaxir/gpt-2-simple
import gpt_2_simple as gpt2

model_name = "124M"
# the base model is saved into the current directory under /models/124M/
gpt2.download_gpt2(model_name=model_name)  # needs to run only once; comment out once done

sess = gpt2.start_tf_sess()
# load the fine-tuned checkpoint from ./checkpoint/run1 (the folder copied from the Colab)
gpt2.load_gpt2(sess, run_name='run1')

# generate text from a prefix; return_as_list=True returns the strings instead of printing them
text = gpt2.generate(sess, length=39, include_prefix=False, temperature=0.1, top_k=1, top_p=0.9,
                     run_name='run1', prefix="Is there Earth No.2?", return_as_list=True)
print(text[0])

As you can see above, the difference here is that this Python script skips training: it downloads the base model (the download needs to run only once) and goes straight to generating text from the fine-tuned checkpoint. The low temperature (0.1) and top_k=1 make the output almost deterministic; raising the temperature or top_k produces more varied text.
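One small note on dependencies: besides TensorFlow 1.15, the script above needs the gpt-2-simple package itself. If you are running it from outside the cloned repository, my assumption is that installing the package from PyPI will do the trick:

pip install gpt-2-simple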

Hopefully, by now, you will have your first generated text with your trained model.

What have I done?

So far:

  • set up a Windows machine to use Python and gpt-2-simple
  • used Google Colab to train a 124M GPT-2 model
  • ran a Python script locally to generate text using GPT-2

Pretty cool actually!

Here is a small snippet from my own generation.

SATANBABA: And in contrast to the way you live: do you feel the world is being shaped for you, in a specific way?

BALLARD: I think most people have never experienced the world of the individual. They live in an electronic reality, the world that the average person dreams of inhabiting. In Shanghai, the Shanghai of the West is the Shanghai of the world. Shanghai is the Shanghai of the Chinese people, a city based on the idea that there is something unique about Shanghai. Its the Shanghai of the world.

a person feeling happy by standing close to a lake
Photo by KAL VISUALS on Unsplash

So how do I make this a web app, deploy it on a server and share it with friends and family?

Now that everything is fine and dandy, the next question arises: let's share it with the world! In my case, I had to create a website for Imaginary-Symposium (see below), as agreed with my friend, Danae.

The above website had everything else except the text generation part. So how on earth should I make the generator I have just created part of the website?

Use Max Woolf's run-on-cloud solution, of course.

The guide can be found below:

What do you need to do to have this running:

  1. Install Docker on your computer (this post uses Windows 10 Pro). I don't think you can install Docker on Windows 10 Home. If somebody has done it, please let us know! In case you don't know what Docker is, please see this website.
  2. Have a Google Container Registry account. This is a necessary step in order to push the Docker image to the registry and then deploy it on Google's platform. In order to set this up, you will probably be asked to set up a payment method. This is normal, since Google wants to avoid bots.
  3. Create a Google Cloud Run service. If you have all of the above under the same Google account, then you will be able to access it easily.
  4. Install the Google Cloud SDK on your machine. You will need this for using the command line to push the Docker image up to the registry.
  5. Create a new project on Google Cloud (click the arrow to create a new one):
The Google Cloud Platform navigation bar

Once you have the authentication sorted (follow the error message that you will receive in the console), this will move forward quite quickly. The instructions are pretty clear at this point.
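To give a rough idea of what this stage looks like, here is a sketch of the commands involved. The project ID, image name and region below are placeholders of mine (and the memory setting is my assumption), so take the exact commands and flags from Max Woolf's guide and the Google Cloud documentation.

# let Docker authenticate against Google Container Registry
gcloud auth configure-docker

# build the image from the repo's Dockerfile and push it to the registry
docker build -t gcr.io/YOUR_PROJECT_ID/gpt2-app .
docker push gcr.io/YOUR_PROJECT_ID/gpt2-app

# deploy the pushed image as a Cloud Run service
gcloud run deploy gpt2-app --image gcr.io/YOUR_PROJECT_ID/gpt2-app --platform managed --region europe-west1 --memory 2Gi --allow-unauthenticated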

Build your own web page

Eventually, you can test your deployment by using the repo's HTML page api_ui.html. You can simply double-click on it. In order to have it point to your own service, you will have to replace the following line in the code:

url: "https://imaginarysymposiumoracle-duq37cawoq-ew.a.run.app",

You can find your own service URL in your Google Cloud Platform account.
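If you prefer to test the deployed service outside the browser, a quick way is to call it from Python. This is only a minimal sketch under my own assumptions: the JSON field names (prefix, length) mirror the gpt2.generate parameters, but the exact request format is defined by the repo's app.py and api_ui.html, so check there first.

# minimal sketch of calling the deployed Cloud Run service (field names are assumptions)
import requests

SERVICE_URL = "https://imaginarysymposiumoracle-duq37cawoq-ew.a.run.app"  # replace with your own service URL

response = requests.post(SERVICE_URL, json={"prefix": "Is there Earth No.2?", "length": 39})
print(response.status_code)
print(response.text)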

Obviously, depending on your needs, an entirely new website design might be a better fit. I had to design one from scratch due to the nature of this project.

I used ZURB's Foundation responsive framework along with jQuery and other goodies, and deployed it on Heroku.

Here is a screenshot of the web page:

You can see it in action here:

Conclusion

Even though the journey was not that straightforward, with the help of the community, I was able to use OpenAI's GPT-2 successfully!

Many thanks for reading! I hope this has helped you in one way or another.

Dimitrios.
