How To Generate Natural Language Using Markov Chains and GPT2

Mr. Data Science
Apr 16 · 9 min read

Natural language generation (NLG) is the process of creating text using software. In general, it can be divided into a few subgroups[1]:

  • text-to-text generation, such as machine translation
  • text summarization
  • open-domain conversation response generation
  • data-to-text generation

NLG is growing in popularity because it has so many applications in areas such as journalism, business, and law. With NLG, you can complete tasks such as writing product descriptions, engaging with users, and writing investigative reports. NLG is frequently used to generate social media posts, such as on Twitter, and to retroactively caption images throughout the web.

One important concept in natural language generation is collocation, which simply means that there is a relationship between words and their location in a sentence. For example, the word college is often found near words like professor, student, and campus, but would not usually be associated with words like sandwich, sky, or elephant. Later in this article, we will look at the use of collocation and Markov Chains to generate text.
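As a quick illustration of collocation, nltk ships with a collocation finder that scores which word pairs tend to occur together. The tiny corpus below is made up purely for this sketch; with a real text you would see pairs like “White Rabbit” rise to the top.

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# A tiny made-up corpus, only to illustrate the idea of collocation
words = ("the college professor met a student on the college campus "
         "the student asked the professor about the campus library "
         "the professor walked across the college campus with the student").split()

finder = BigramCollocationFinder.from_words(words)
measures = BigramAssocMeasures()

# The highest-scoring bigrams are word pairs that occur together
# more often than chance alone would suggest
print(finder.nbest(measures.pmi, 5))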

A more complex approach to text generation involves the use of multi-layer neural networks. Traditionally, NLP and NLG have relied heavily on RNNs (recurrent neural networks) and LSTMs (long short-term memory networks) because of their ability to deal with sequences. More recently, however, other ideas such as reinforcement learning, re-parameterization, and generative adversarial networks (GANs) are being explored[2]. In this article, we will briefly show you how you can use a pre-trained GPT2 model to generate text.

While we will only consider natural language generation at the word level in this article, people have also tried to apply these concepts at the character level. The drawback of this approach is that it is much more computationally expensive, as there can be many orders of magnitude more combinations of characters than words. Generally, this approach may be more or less appropriate depending on the language. Some languages, like Russian and Finnish, have more complex morphologies than English and may be better suited to character-level NLG.

In order to follow this tutorial, you will need to install the following Python libraries: nltk, markovify, torch, and transformers.

The examples below use text from the novel Alice’s Adventures in Wonderland by Lewis Carroll, which is now in the public domain and can be downloaded from the Project Gutenberg website.

The first example we will look at is very basic. The first concept we will explore is n-grams. Rather than tokenizing text into individual words, it is often a good idea to divide text into groups of adjacent words, called ngrams, where n is a positive integer. Common values of n have special names: n=2 gives bigrams and n=3 gives trigrams. For this example we’ll use a short passage from the Alice story file:

text = '''So she was considering in her own mind as well as she could, for the hot day made her feel very sleepy and stupid, whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.'''

To keep everything as basic as possible, let’s just use split() to tokenize the text.

import random

# Tokenize the text with a simple whitespace split
tokens = text.split()

n = len(tokens) - 1
sentence = ''

# Build a "sentence" by picking ten tokens at random
for i in range(10):
    r = random.randint(0, n)
    next_word = tokens[r]
    sentence = sentence + ' ' + next_word

print(sentence)
and worth a the daisies, feel her hot feel when

Unsurprisingly, this approach produces a random list of words. We know, of course, that this is not how language works. We can improve slightly on this by using ngrams with a higher n. Let’s try using trigrams (n=3).

from nltk import ngrams

n = 3
n_grams = list(ngrams(text.split(), n))

sentence = ''

for i in range(3):
    r = random.randint(0, 50)
    next_word = n_grams[r]
    sentence = sentence + ' ' + str(next_word)

print(sentence)
('was', 'considering', 'in') ('White', 'Rabbit', 'with') ('eyes', 'ran', 'close')

While the combination of all these words still doesn’t make much sense, it makes a lot more sense than the random choice of words in the first part of this example. Each ngram is natural English, but the ngrams are still combined in a random way and depend heavily on the original text, so the generated text makes little sense.

Next, let’s introduce the use of probability. Specifically, let’s look at which ngrams are more or less likely to follow each other using Markov Chains.

Let’s say we live in a climate with just three types of weather: A) ‘dry and sunny’, B) ‘dry and cloudy’, and C) ‘cloudy and raining’.

Starting at A, there are three possible states we can move to: we can stay at A, go to B, or go to C. After careful study of the weather records in this simple climate system, we discover the average probabilities for each possible sequence. For example, starting at A, there is a probability of 0.2 that B will be the next state in the sequence. There is a probability of 0.2 that the next state will be A and, since the probabilities have to add up to 1, there must be a probability of 0.6 that the next state is C. Put another way, if it is ‘dry and sunny’ today, we know that tomorrow will probably be ‘cloudy and raining’, with a probability of 0.6. This is the basic concept of a Markov chain: the probability of a state depends only on the previous state.
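To make this concrete, here is a minimal sketch of the weather chain in Python. Only the probabilities for state A come from the example above; the rows for B and C are made-up placeholders so the chain is complete.

import random

# Transition probabilities: transitions[current][next] = probability
transitions = {
    'A': {'A': 0.2, 'B': 0.2, 'C': 0.6},  # from the example above
    'B': {'A': 0.3, 'B': 0.4, 'C': 0.3},  # assumed, for illustration only
    'C': {'A': 0.1, 'B': 0.4, 'C': 0.5},  # assumed, for illustration only
}

def next_state(current):
    # Pick the next state using the probabilities in the current state's row
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return random.choices(states, weights=weights)[0]

# Simulate ten days of weather, starting from A ('dry and sunny')
state = 'A'
forecast = [state]
for _ in range(10):
    state = next_state(state)
    forecast.append(state)

print(forecast)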

This basic idea of using Markov Chains can be applied to the generation of text. This task, however, is much more complex: we no longer have just 3 states to worry about, but hundreds or thousands of words and possible combinations. Fortunately for us, we don’t have to write the code from scratch. Instead, we can use a Python library called Markovify.

Markovify is available on GitHub. It is a simple Markov chain generator. According to the documentation: “By default, ‘markovify.Text’ tries to generate sentences that do not simply regurgitate chunks of the original text. The default rule is to suppress any generated sentences that exactly overlap the original text by 15 words or 70% of the sentence’s word count.” The state_size parameter is the number of words the probability of the next word depends on.

import markovify

The make_short_sentence function allows us to set an upper limit on the number of characters, so we can keep the generated text short.

filename = "11-0.txt"; # Alice's Adventures in Wonderland by Lewis Carroll
with open(filename, encoding="utf8") as f:
text = f.read()

text_model = markovify.Text(text, state_size=1)

for i in range(3):
print(text_model.make_short_sentence(280) + "\n")
“I went on the state law.

“Serpent, I am in the hedgehog had ordered.

“Not the silence.

This text still makes little sense. We can try to improve the quality of the output by changing the state_size parameter. We’ve seen with ngrams that words exist within a context, not in isolation, so the probability of a word occurring depends not just on the previous word but on the previous 2 or 3 (or more) words. Let’s try one more time and increase state_size to 3.

text_model = markovify.Text(text, state_size=3)

for i in range(3):
    print(text_model.make_short_sentence(280) + "\n")
“Ten hours the first day,” said the Mock Turtle, “Drive on, old fellow!

She said it to the Knave of Hearts, he stole those tarts, And took them quite away!”

“That’s enough about lessons,” the Gryphon interrupted in a very humble tone, going down on one knee.

This text reads like text from the Alice in Wonderland story. If we “trained” our model with more text, the output sentences would probably make a bit more sense. This Markov Chain approach is simple but powerful, and the markovify library makes it easy to implement.
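One practical note: make_sentence and make_short_sentence return None when markovify cannot build a sentence that passes its overlap rules, so it is worth guarding against that. A small sketch (the tries value is just an example):

# Keep asking markovify for sentences until we have three usable ones.
# tries raises the number of attempts per call before giving up.
generated = []
while len(generated) < 3:
    sentence = text_model.make_short_sentence(280, tries=100)
    if sentence is not None:
        generated.append(sentence)

for sentence in generated:
    print(sentence + "\n")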

In the next example, we will ramp up the NLG complexity by using PyTorch, GPT2, and the Transformers library.

First, let’s import the PyTorch library:

import torch

The text generation approach I’ll describe first tokenizes the input and builds a text generation model using GPT2 pretrained models. If you want to learn more about GPT2, click here. The text is encoded/vectorized and then fed into the model, which produces an encoded output. The final step is to decode the output.

Some of the parameters used in our model are:

  • temperature — this varies the creativity (randomness) of the output
  • do_sample — when True, the model samples from its predicted probability distribution instead of always picking the most likely token, which helps prevent repetitive output

Like most complex neural networks, there are many parameters that can be adjusted to fine-tune the output. A discussion of this tuning is beyond the scope of this article.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
sequence = '''So she was considering in her own mind as well as she could, for the hot day made her feel very sleepy and stupid, whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her. Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a
watch to take out of it, and burning with curiosity, she ran across the field after it'''

inputs = tokenizer.encode(sequence, return_tensors='pt')
outputs = model.generate(inputs, max_length=300, do_sample=True, temperature=1, top_k=50)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

So she was considering in her own mind as well as she could, for the hot day made her feel very sleepy and stupid, whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her. Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a
watch to take out of it, and burning with curiosity, she ran across the field after it, and at first took a deep breath, but soon thoughtfully

be quieted down, even though she had come so soon, since she had had enough of going straight into one's heart, while her cheeks were more flushed. She looked at the rabbit in wonder, that black, yellow, white with white veins.

"Oh, don't touch them, they're too dark!"

It's not true that she was thinking of her rabbits; she knew that she should be thinking of the White Rabbit in the very first breath she made; as the light had faded.

"Oh, don't you have any, your rabbits are beautiful too? But they look good under my light, but they didn't look as nice as yours did, and they are all too pink and yellow. Why do you think you should care about that? Even if you were to have such a

As you can see, the generated text starts by repeating the input and then adds new text. The output text is limited in quality, partly because we only provided a few lines of input, but the potential power of GPT2 is obvious. Training a neural net from scratch to do the same job could take several hours, if not days, depending on the dataset. This code ran in a few seconds.
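If you want to see how the sampling parameters affect the result, you can rerun generation with a few different temperature values, reusing the tokenizer, model, and inputs from above. The values below are just examples; lower temperatures tend to produce more conservative text, higher ones more surprising text.

# Compare outputs at a few different temperatures (example values only)
for temp in (0.7, 1.0, 1.3):
    out = model.generate(inputs, max_length=150, do_sample=True,
                         temperature=temp, top_k=50)
    print(f"--- temperature={temp} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    print()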

If you’ve made it this far, you should have learned a few things. Specifically, you should now know:

  • How to divide text into ngrams
  • How to use the Markovify library to generate text
  • How to use PyTorch and GPT2 to generate text

I encourage you to try these models on real text found on the internet. Wikipedia might be a good place to start.

If you have any feedback or suggestions for improving this article, we would love to hear from you.

  1. Glorianna Jagfeld, Sabrina Jenne, Ngoc Thang Vu; Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity; date retrieved: 04/12/2021; link to article
  2. Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu; Neural Text Generation: Past, Present and Beyond; date retrieved: 04/12/2021; link to article
