Practical Applications of Open AI’s GPT-2 Deep Learning Model

The future of AI in language generation and understanding

Mohit Saini
The Research Nest
8 min read · Dec 22, 2019


Introduction

A language model is a system that understands and generates language. For example, a system that can predict the next word of a sentence given the previous words is a language model. GPT-2 is one such model, developed by OpenAI (the research lab co-founded by Elon Musk).

Doesn’t it feel natural to understand what I am writing here? Drafting a tweet or understanding this paragraph seems like a simple task for the human brain, but making a machine do the same is not easy.

How about writing poetry, or a novel? Even among humans, it’s not something everyone can do. What if we could create an AI that can write better than a human? What if it understood the intricacies of human language in a way no one has before? What if we have already created it? Let us find out what potential GPT-2 holds!

Understanding the Architecture

GPT-2 is based on the Transformer architecture, which was first proposed by a team of researchers at Google in their paper Attention Is All You Need. The paper described an encoder-decoder architecture built on concepts like multi-head attention and self-attention. Transformers were primarily developed for the task of machine translation.

The Transformer architecture is an improvement over RNN-based architectures like LSTM and GRU for several reasons:

  • Transformers can process all tokens (which are basically pieces of a text) within an input in parallel.
  • Transformers require a constant O(1) number of operations to learn the dependency between two tokens, regardless of their positional distance in the sequence. This makes transformers better at capturing long-term dependencies.
  • With the help of multi-head attention, the model can capture various aspects of the input and improve its expressive ability.

GPT-2 is essentially a decoder-only transformer. The model is built by stacking up transformer decoder blocks. Based on the number of layers, there are four variants of GPT-2: 117M, 345M, 762M, and 1542M, with 12, 24, 36, and 48 decoder blocks respectively.

Unlike the standard self-attention that the original transformer uses, GPT-2 uses masked self-attention. A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that from happening, which means the model only uses the left context to predict the next word.
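To make this concrete, here is a minimal NumPy sketch of a single masked self-attention head. It is illustrative only: the weight matrices and sizes are made up, and a real GPT-2 block adds multiple heads, layer normalization, and a feed-forward network on top of this.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])     # scaled dot-product scores, shape (T, T)
    mask = np.triu(np.ones_like(scores), k=1)   # 1s above the diagonal mark "future" tokens
    scores = np.where(mask == 1, -1e9, scores)  # block each position from peeking right
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the left context only
    return weights @ v                          # weighted sum of value vectors

# Toy usage: 4 tokens with 8-dimensional embeddings
T, d = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(masked_self_attention(x, Wq, Wk, Wv).shape)  # (4, 8)
```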

For the tokenization of inputs, GPT-2 uses byte pair encoding (BPE). BPE is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. BPE is a middle ground between character-level and word-level encodings, which helps it manage the vocabulary of large corpora. This also enables the encoding of rare words with appropriate subword tokens without introducing any “unknown” tokens.
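As a toy illustration (not OpenAI’s actual byte-level implementation), the core BPE idea of repeatedly merging the most frequent adjacent pair over a tiny corpus can be sketched like this:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a tiny corpus; each word is a tuple of symbols."""
    vocab = Counter(words)                       # word -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq            # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair gets merged
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():         # rewrite every word with the merge applied
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = [tuple("lower"), tuple("lowest"), tuple("newer"), tuple("wider")]
merges, vocab = learn_bpe_merges(corpus, num_merges=5)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('e', 'r'), ...]
```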

Now that we have an overview of transformers and the GPT-2 model, let’s explore the various practical applications of GPT-2.

1. Text Generation ✍🏻

We can use the GPT-2 model to generate long texts. Like traditional language models, it outputs one token (roughly, a word or part of a word) at a time. This output token is appended to the input tokens, and the new sequence acts as the input for generating the next token. This idea is called “auto-regression”.

gif source [4]
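Here is a minimal sketch of that auto-regressive loop, using the Hugging Face transformers library rather than OpenAI’s original TensorFlow code. The “gpt2” checkpoint below is the smallest variant, the prompt and length are arbitrary, and greedy decoding is used for simplicity (real sampling usually adds top-k or temperature).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # smallest GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The future of AI in language generation", return_tensors="pt")

with torch.no_grad():
    for _ in range(40):                              # generate 40 new tokens
        logits = model(input_ids).logits             # shape (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()             # greedy: pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # feed it back in

print(tokenizer.decode(input_ids[0]))
```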

GPT-2 is a very large language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. Due to the diversity of the training dataset, it is capable of generating conditional synthetic text samples of unprecedented quality. Given an arbitrary text as input, the model can generate long continuations that often read remarkably like human-written text.

According to OpenAI’s blog:

The model is chameleon-like — it adapts to the style and content of the conditioning text. This allows the user to generate realistic and coherent continuations about a topic of their choosing.

You can play around with the GPT-2 model on the Talk to Transformer website 🔥. The official code of GPT-2 is available in OpenAI’s GitHub repo.

So far we have talked about generating text using the original GPT-2 model. We can also fine-tune GPT-2 on our own datasets to generate custom texts. Neil Shepperd has created a fork of OpenAI’s repo that contains additional code for fine-tuning the existing OpenAI model on custom datasets. Here is a colab notebook where you can fine-tune the 117M and 345M variants of GPT-2 using this fork.
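If you prefer a packaged workflow, the gpt-2-simple Python library (which builds on this same fine-tuning code) lets you run the whole process in a few lines. A rough sketch, with the corpus file name and step count as placeholders you would swap for your own:

```python
import gpt_2_simple as gpt2

model_name = "124M"                        # the 117M model was later renamed 124M
gpt2.download_gpt2(model_name=model_name)  # fetch the pre-trained weights

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="my_corpus.txt",     # placeholder: a plain-text file of your data
              model_name=model_name,
              steps=1000)                  # adjust to your dataset size

gpt2.generate(sess, prefix="Once upon a time", length=100)
```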

After the release of the training code, developers started sharing their own GPT-2-generated texts after fine-tuning it on various datasets. Researchers such as Gwern Branwen made GPT-2 Poetry and Janelle Shane made GPT-2 Dungeons and Dragons character bios!

Keaton Patti shared on Twitter how he trained an AI on 1,000 hours of Batman movies. He also tweeted the first page of the movie script the AI generated after training. Justin Davis recorded a really cool audio version 👌🏻 of the script generated by Keaton’s bot.

Yep. GPT-2 was found capable of doing all of it: writing poetry, movie scripts, and even video game character bios. Imagine the potential it holds across multiple industries.

In the paper Fine-Tuning Language Models from Human Preferences, OpenAI describes how pre-trained language models can be fine-tuned with reinforcement learning rather than supervised learning, using a reward model trained from human preferences on text continuations. For stylistic continuation of input text, 5,000 human comparisons (each choosing the best of four continuations) resulted in the fine-tuned model being preferred by humans 86% of the time over the zero-shot model.

2. Chatbots 🤖

Another great application of GPT-2 is conversational AI. Before the rise of deep learning-based NLP techniques, it used to take months to design the rules and cover the conversation topics for a chatbot. Now, with the help of transfer learning and language models like GPT-2, we can build really good chatbots in a matter of days.

Thomas Wolf (from Hugging Face) explained in his blog how they fine-tuned GPT-2 to build a state-of-the-art dialog agent with a persona. Their team fine-tuned GPT-2 on the PERSONA-CHAT dataset, which consists of conversations between randomly paired people.

The paired workers were asked to chat naturally and to get to know each other during the conversation. This produces interesting and engaging conversations that learning agents can try to mimic.

The dialog agent keeps a knowledge base that stores a few sentences describing its personality along with the dialog history. Whenever a new utterance is received from the user, the agent combines it with this knowledge base to generate a response.
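A much-simplified sketch of that last step is below. The real Hugging Face implementation also adds special tokens and segment embeddings so the model can tell persona sentences, user turns, and bot turns apart; the delimiter strings and helper name here are invented for illustration.

```python
def build_chatbot_input(persona, history, user_utterance,
                        user_token="<speaker1>", bot_token="<speaker2>",
                        bos="<bos>"):
    """Flatten persona sentences + dialog history + the new utterance into one prompt.

    Hypothetical, simplified input format: the actual persona-chat fine-tuning code
    also attaches token-type (segment) embeddings to each chunk.
    """
    sequence = [bos] + list(persona)                  # the bot's personality sentences
    turns = list(history) + [user_utterance]
    for i, turn in enumerate(turns):
        speaker = user_token if i % 2 == 0 else bot_token  # alternate user / bot turns
        sequence.append(speaker + " " + turn)
    return " ".join(sequence)

prompt = build_chatbot_input(
    persona=["I am a bookworm.", "I love pizza."],
    history=["Hi, what do you like to do?", "I mostly read novels. You?"],
    user_utterance="What was the last book you read?")
print(prompt)  # this string (tokenized) would be fed to the fine-tuned model
```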

🌊 Here is the demo of the chatbot.

3. Machine Translation 👂🏻

OpenAI has published another paper describing how they tested the model’s performance on various natural language tasks using zero-shot task transfer.

As explained in this blog, “The zero-shot learning method aims to solve a task without receiving any example of that task at the training phase.”

To help the model infer the task of translation, the language model is conditioned on example pairs of the format “english sentence = french sentence”. Then, to get the translation of an English sentence, the input to the model is given in the form “english sentence =”. Samples are generated from the model using greedy decoding, and the first generated sentence is used as the translation.

image source [4]
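Here is a hedged sketch of that prompting format using the Hugging Face transformers library. The example pairs are made up, and the small “gpt2” checkpoint used here will translate far worse than the full 1.5B model described in the paper.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Condition the model on a few "english sentence = french sentence" examples,
# then leave the translation of the last sentence for the model to fill in.
prompt = ("the house is blue = la maison est bleue\n"
          "I like cheese = j'aime le fromage\n"
          "the cat is black =")

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(input_ids,
                            max_length=input_ids.shape[1] + 20,
                            do_sample=False)          # greedy decoding, as in the paper
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:]))  # only the new tokens
```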

4. Text Summarization 🚀

Although GPT-2 is a decoder-only language model rather than a classic seq2seq (encoder-decoder) model, it can also be fine-tuned for the task of text summarization. Here the format of the data is very similar to what we saw in the translation task: “text = summary”.

The original paper describes how GPT-2’s summarization ability was tested using zero-shot task transfer. It was tested on the CNN and Daily Mail dataset. To induce summarization behavior, the text TL;DR: was added at the end of the input text, and the model was configured to generate 100 tokens with top-k random sampling with k = 2. The low value of k reduces repetitiveness and encourages more abstractive summaries than greedy decoding.

image source [4]
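The same trick can be sketched in code: append TL;DR: to the article and sample 100 tokens with top-k = 2. Again this assumes the Hugging Face transformers library and the small “gpt2” checkpoint, and the article text is just a placeholder.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "Your long news article goes here ..."   # placeholder text
prompt = article + "\nTL;DR:"                       # hint that a summary should follow

input_ids = tokenizer.encode(prompt, return_tensors="pt")
summary_ids = model.generate(input_ids,
                             max_length=input_ids.shape[1] + 100,  # 100 new tokens
                             do_sample=True, top_k=2)              # top-k sampling, k = 2
print(tokenizer.decode(summary_ids[0][input_ids.shape[1]:]))
```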

In the paper Fine-Tuning Language Models from Human Preferences that I talked about earlier, it is shown how the GPT-2 774M model was fine-tuned to summarize texts according to human preferences. The model was trained by combining supervised fine-tuning with fine-tuning on 60k human labels. As a result, the summaries from the supervised fine-tuned version of GPT-2 are more novel as measured by n-grams or sentences, and they are also more novel in terms of content; that is, they are not just copying from the input text.

Such functionality can be used to summarize a wide variety of content, from research papers to news media.

Conclusion

Isn’t it fascinating how a single underlying architecture can perform multiple tasks, from machine translation to writing poetry, with clever fine-tuning? And when such models are released as open source, development takes on a life of its own, as several open-source developers and innovators have demonstrated new ways of using them.

We are just at the dawn of the age of AI. Models like GPT-2 are proving how well computers can perform human-like language tasks. At the current pace of technological advancement, AI might soon be a part of every aspect of our lives. Perhaps it already is.

I hope you found this post informative and insightful 👏🏻 🙏🏻.
