Generation of poems with a recurrent neural network

Overview

Denis Krivitski
May 25, 2018

In this article, I will present the structure of a neural network (NN) that is capable of generating poems. Neural networks are the technology behind deep learning, which is part of the machine learning discipline. The main value of this article is not to present the best possible artificially generated poems, or the most advanced state-of-the-art NN architecture for generating them, but rather to present a relatively simple structure that performs surprisingly well on a rather complicated natural language processing (NLP) task.

If you are a machine learning (ML) practitioner, understanding the structure of this network could give you ideas on how to use parts of this structure for your own ML task.

If you want to start developing neural networks yourself, recreating this network could be a good place to start. It is simple enough to build from scratch, yet complicated enough to require the use and understanding of basic training techniques.

Next, we will look at related work, at some real predictions that my neural network has made, and then at the network structure.

A video of my talk is available on YouTube.

Figure 1: Poem fragments generated by RNN

Related work

Andrej Karpathy [1] has a very interesting article about poem generation with RNNs. His article provided the background and motivation for this writing. Karpathy’s implementation uses Lua with Torch; I use Python with TensorFlow. For people who are interested in learning TensorFlow, the code behind this article may be a good reference implementation.

Hopkins and Kiela [3] propose more advanced NN architectures that strive to generate poems indistinguishable from those written by human poets. Examples of poems generated by their algorithms can be seen here [4]. Ballas provides an RNN to generate haikus and limericks here [6].

A whole magazine with machine-generated content, including poems, is available here [5]. An online poem generator is available here [7]. Lakshmanan describes how to use Google Cloud ML for hyperparameter tuning of a poem-generating NN [8].

The poem writing problem definition

As a first step, let’s rephrase the problem of writing a poem to a prediction problem. Given a poem subject, we want to predict what a poet would write about that subject.

Figure 2: Poet writing

As a second step, let us break down the large prediction problem into a set of smaller ones. The smaller problem is to predict only one letter (character) that a poet would write following some given text. Later we will see how to predict a poet’s writing on a subject using a one-character predictor.

For example, can you guess what would be the next character here?

Figure 3: Prediction riddle 1

This is an easy riddle to solve for two reasons:

  • It appears in the training text when we use Shakespeare for training
  • It is the last letter of a sentence. The last letter is easier to guess because there are few grammatically correct variants.

Let’s try another one:

Figure 4: Prediction riddle 2

Here we want to guess the first letter of the new sentence. This is much harder, because many grammatically correct variants are possible, and it is hard to know which variant Shakespeare would choose.
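
To make this framing concrete, here is a minimal sketch in Python of how a training corpus can be turned into pairs of input and expected output, where the expected output is simply the input shifted one character to the right. It is an illustration only: the file name and sequence length are hypothetical, and the actual preprocessing in the code behind this article may differ.

# Minimal sketch: turn raw text into (input, target) character sequences,
# where the target is the input shifted one character to the right.

def make_examples(text, seq_len=100):
    """Yield (input, target) pairs of length seq_len."""
    for start in range(0, len(text) - seq_len - 1, seq_len):
        chunk = text[start:start + seq_len + 1]
        yield chunk[:-1], chunk[1:]

# Map every distinct character to an integer id so the network can consume it.
text = open("shakespeare.txt", encoding="utf-8").read()  # hypothetical file name
vocab = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(vocab)}

inp, tgt = next(make_examples(text, seq_len=20))
print(inp, "->", tgt)                # the target is the input shifted right by one
print([char_to_id[c] for c in inp])  # the same input as integer ids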

Prediction of the next character

Theory

To predict the next character, we need a neural network that can read any number of given characters, remember something about all of them, and then predict the next one.

Figure 5: Input

A good candidate for this kind of task is a recurrent neural network (RNN).

A recurrent neural network is a neural network with a loop in it. It reads input one character at a time. After reading each character xₜ, it generates an output hₜ and a state vector sₜ, see Figure 6. The state vector holds some information about all the characters that were read up until now and is passed to the next invocation of the recurrent network. A great explanation of RNNs is provided by Olah [2].

Figure 6: Recurrent neural network

Figure 7 shows the RNN unrolled in time.

Figure 7: Unrolled RNN

The first input character goes to x₀ and the last goes to xₜ. The output h₀ is the prediction for the character that a poet would write after x₀, h₁ is the character predicted to follow x₁, and so on.
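
Before looking at real outputs, here is a rough sketch of one such recurrent step in plain NumPy. This is an illustration only: the vocabulary size, the state size, and the vanilla tanh cell are my assumptions, while the actual network in this article is implemented in TensorFlow.

import numpy as np

vocab_size, state_size = 65, 128  # assumed sizes, for illustration
rng = np.random.default_rng(0)
W_xs = rng.normal(scale=0.01, size=(vocab_size, state_size))  # input -> state
W_ss = rng.normal(scale=0.01, size=(state_size, state_size))  # state -> state
W_sh = rng.normal(scale=0.01, size=(state_size, vocab_size))  # state -> output

def rnn_step(x_t, s_prev):
    """Read one character (one-hot x_t), update the state, score the next character."""
    s_t = np.tanh(x_t @ W_xs + s_prev @ W_ss)  # new state: remembers everything read so far
    h_t = s_t @ W_sh                           # scores over the possible next characters
    return h_t, s_t

# "Unrolling in time" is just calling the same step repeatedly, threading the state.
s = np.zeros(state_size)
for char_id in [19, 7, 4]:           # ids of three input characters x0, x1, x2
    x = np.eye(vocab_size)[char_id]  # one-hot encoding of the character
    h, s = rnn_step(x, s)            # h is the prediction for the next character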

Real examples of RNN outputs

Now let us see some examples of the real predictions that my NN has made. Figure 8 shows an example input, the expected output (which is the input shifted one character to the right), and the actual output.

Figure 8: Example RNN output

The actual output does not match the expected output exactly. This is natural: otherwise we would have an ideal network that predicts with perfect accuracy, which does not happen in practice. The difference between the expected and the actual prediction is called the error, or loss.

During training, the NN is improved step by step to minimize loss. The training process uses training text to feed the network with pairs of input and expected output. Each time the actual output differs from the expected output, the parameters of the NN are corrected a bit. In our case, the training text is the collection of Shakespeare’s works.
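
Such a training setup can be sketched in TensorFlow roughly as follows, using the TF 2.x Keras API. The GRU cell, the layer sizes, and the Adam optimizer are my assumptions here, not necessarily the choices made in the code behind this article.

import tensorflow as tf

vocab_size, embed_dim, rnn_units = 65, 64, 256  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.GRU(rnn_units, return_sequences=True),  # one output per input character
    tf.keras.layers.Dense(vocab_size),                       # scores for the next character
])

# The loss compares the predicted next character with the expected one
# (the input shifted right by one position) at every time step.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# `dataset` is assumed to yield (input_ids, target_ids) batches, e.g. built
# with tf.data from the pairs sketched earlier.
# model.fit(dataset, epochs=10)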

Now let us see more examples of the predicted characters, and in particular how the prediction improves as training progresses. Figure 9 shows a sequence of predictions after different numbers of training steps.

Figure 9: Outputs at different training stages

Here, the input string is “The meaning of life”. After 4,941 steps, we have 11 incorrectly predicted characters (marked in red). After 34,587 steps, the number of prediction errors fell to 7.

We can see that more errors appear at the beginning of the string than at the end. This is because, by the end of the string, the network has read more characters and its state contains richer information, which leads to better and more informed predictions.

Generation of the entire poem

At the beginning of this article we focused on the smaller problem of predicting one character of a poem; now we come back to the larger problem of generating the entire poem. Having a trained RNN at hand that can predict one character, we can employ the scheme depicted in Figure 10 to generate any number of characters.

Figure 10: Generation of many characters

First, the poem subject is provided as input at x₀, x₁, x₂, …; the outputs preceding h₀ are ignored. The first character predicted to follow the poem subject, h₀, is taken as the input to the next iteration. By feeding the last prediction back as the input to the next iteration, we can generate as many characters as we desire.
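
In code, this feedback loop can be sketched roughly as follows. The sketch assumes a character-level Keras model like the one above, built with stateful=True and batch size 1 so that its internal state persists between calls; it is not the exact generation code from the repository.

import tensorflow as tf

def generate(model, subject_ids, num_chars=400):
    """Feed the poem subject, then repeatedly feed the last prediction back in."""
    # Read the subject; the outputs produced along the way are ignored,
    # except the very last one, which predicts the first poem character.
    logits = model(tf.constant([subject_ids]))
    next_id = int(tf.argmax(logits[0, -1]))       # h0: first character after the subject
    generated = [next_id]
    for _ in range(num_chars - 1):
        logits = model(tf.constant([[next_id]]))  # last prediction becomes the next input
        next_id = int(tf.argmax(logits[0, -1]))   # greedy choice of the most likely character
        generated.append(next_id)
    return generated

In practice, sampling the next character from the predicted distribution, instead of always taking the most likely one, tends to produce more varied text.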

We can look at the above scheme from a different perspective, see Figure 11.

Figure 11: Encoder — decoder perspective

The left part of the network is an encoder that encodes the poem subject in a vector representation, called the subject vector or theme vector. The right part of the network is a decoder that decodes the subject vector into a poem.

This perspective is used in machine translation systems. There, an encoder encodes a sentence in a source language into a vector representing its meaning. Then, the decoder decodes the meaning vector into a sentence in a target language.

Examples of generated poems

Shakespeare

We will now see a series of examples of generated poems. These examples were generated at various stages of the training process and demonstrate how the generated poem improves as training progresses. The poem subject is “The meaning of life”, and the network is trained on the works of Shakespeare. Here is a small excerpt from the training text, i.e. Shakespeare’s original writing:

MARCIUS:
I am glad on 't: then we shall ha' means to vent
Our musty superfluity. See, our best elders.

First Senator:
Marcius, 'tis true that you have lately told us;
The Volsces are in arms.

MARCIUS:
They have a leader,
Tullus Aufidius, that will put you to 't.
I sin in envying his nobility,
And were I any thing but what I am,
I would wish me only he.

COMINIUS:
You have fought together.

Training step: 1 — time: 0 min

eamltfl!G
YAKhI PKKnenYoChGj.FkLXKHrsKALryKN;vMIO;.ao. KoU -E:VcVtte?,aZHYVT,p
vFE tgBqjX;?beBP IiEULaSj?Bkwt 'ovTyGamGoCCFo;-QREqB
-tEDSsaKrqDd?dk-d.L;FCllwbSEkhvr
hMWQM,lgOzbjWly uMuyEzBhBRBPr;!tTgtAQGbCqag?Y.yq?IPdXvHVivztIrXL?IyqI-FQg.wPHKQ?ca:;S!CMLxQ?NX.qKzRD-

This is just the beginning of the training process. All the network parameters were initialised to random values and still remain close to that state; therefore, the output is just a random collection of characters.

Training step: 140 — time: 5 min

n       r              o    r                                              r      r                                                        h  r             r                                          r                                 r         e          r  e     o                            r    r               r         
r e r r r r r

We are 5 minutes into the training process, at step 140. The network has learned the distribution of characters in English text and outputs the most frequent ones: space, e, n, r, and o.
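
This kind of behaviour can be checked directly against the training corpus with a quick character-frequency count (the file name below is hypothetical):

from collections import Counter

text = open("shakespeare.txt", encoding="utf-8").read()  # hypothetical file name
for char, count in Counter(text).most_common(6):
    print(repr(char), count)
# A network that has learned only these single-character frequencies will emit
# mostly spaces, 'e', and a handful of other frequent letters, as in the sample above.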

Training step: 340 — time: 11 min

e       e    e   e      e   e           s e    h     e      e         t           a e et      o    hoe     e   e e  e    e                  t  ea t   n e       e             o                            e       e t i            e   e i            e    a    i     a                 e        e                 e     h           n enot          e        es                         t      a    e   e  e  ee      o          e oe    e      e    o       e      e     t et  nn   o    se             r         e  e        a         ee

Here, additional frequent characters appeared: t, h, s, and i.

Training step: 640 — time: 21 min

PAEEEE:
I har sor the the toe so an the tore, an me the reed tor the soeees the tou toar tout the the tord our toor me con the r aou the the the theteoon woe s ere aoud ind on ther tou the no er to were toee worethe tour on the mi lere ther the toureen ao he ter wourheon the thes th hhes the ther touherthor the h nr tore the sare he the tere the ther tous

PONRES:::
Hou y the an the tou an the thet otere we on he terer

Here, the network learned several new things:

  • Space characters are now distributed correctly: word lengths closely resemble the lengths of words in English text.
  • The text is organized in paragraphs of meaningful length.
  • Every paragraph begins with a speaker’s name, followed by a colon.
  • Short, frequent words start to appear, such as: the, so, me.
  • The network learned the concept of vowels and consonants; they appear in a more or less natural order. For example, the sequence of letters "touherthor" from the text above, if read aloud, sounds like a plausible word, thanks to the natural alternation of vowels and consonants.

Training step: 940 — time: 31 min

Har th the would  o ter here or  the someng here of hire the coment of  the warte
wnd the fare

CIMES:
No the weat the so the eorserhe aour tou the nother the prother somer and weat
re and the wordher wo the rarl ao er

Longer words appear, such as: would, here, hire.

Training step: 1,640 — time: 54 min

And heve tade the  the fort of the hands
Toat the world be the worth of the have
The th thre tore te e the world,
The godd ma to the be toe the world

nd then the beart of the wors
Tnd we will the

Here we start to see sequences of correctly spelled words: "the fort of the hands", or "the world be the worth of the".

Training step: 6,600 — time: 3h 29 min

That they are gone an  the   where I shall then 
And then I shall be stranger ae t the land.

SIR TOBY BELCH:
Ih thou wast stander ao the world in the day,

CARILLO:
I she had spone to mour the the base made te to
her facher with a man to sae the
e th n t ware tor the world in the harth of

Now we see the first signs of grammatical sentence structure. The sequence of letters "That they are gone" resembles a grammatically correct sentence.

Training step: 34,600 — time: 19 hours

TITUS ANDRONICUS:
Tow now, what maans thes saeet tnd dead of me?

KING RICHARD II:
What would you a the eoence?
hat saing is this?

MARGARET:
Io we a woman that makes me eo d faith,
And that the e was bhe shawl be sone to hee


e that whe wors ao mane me lovg a stn.

This is as far as this network can get. After 19 hours of training, it reaches its limit and the output does not improve any more.

Goethe

Mit tiefend Schrerßen wird zur Bpitze,
Wo sihl ich denn uch aule Weidenschaften,
Und so besetzt's un Menschenvolk!
Si on ist es mie er ein Verwulren,
Ich seh mir ihr veitees nicht zurehrt;
Und noch den Tag su nee verlieeen,
Wan habt duc nicht sohon alles gutegenanntn.

MEPHISTOPHELES:
Ich wünschte nicht, Euch irre zu fhfee,.
Als andere nae der Keefezu schmeicen.

FAUST:
Das binten wir,

This is the output of an RNN trained on Goethe’s Faust, taken after the training process reached its limit.

Pushkin

Поиою лизнь и берственной и равый,
П верный круг дрдибее предала:
Как дремлет мой воселой трой укор;
Треди тебя – пеезреньем вознереенье,
И с тобоо томные тооводы.
И только б ним е то презраался
И с вечным и с кордцю кровью
И стан т в свортнным солеаом,
И влром станется Эрат;
Погда пооорствуя сердеч
И п езь ие проболжать.не смею.
То да как сон себе марода
К тол прелестный и простой.
Слешите в кровь оассказал смерясь,
Под стнью парусо за стеклом.

The same as above but trained on Pushkin.

Conclusion

We have seen a recurrent neural network that can generate poems, and how its output improves as training progresses.

This is not the best possible neural network for generating poems. There are many ways to improve it, some of which are mentioned in the related work section. It is, however, a simple network that achieves surprisingly good results.

If you are interested in repeating this exercise yourself, the code behind this article can be found at github.com/AvoncourtPartners/poems. The network is implemented in Python using TensorFlow.

References

[1] Andrej Karpathy. “The Unreasonable Effectiveness of Recurrent Neural Networks”

[2] Christopher Olah. “Understanding LSTM Networks”

[3] Jack Hopkins and Douwe Kiela. “Automatically Generating Rhythmic Verse with Neural Networks.” ACL (2017).

[4] http://neuralpoetry.getforge.io/

[5] CuratedAI — A literary magazine written by machines, for people.

[6] Sam Ballas. “Generating Poetry with PoetRNN”

[7] Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. http://52.24.230.241/poem/index.html

[8] Lak Lakshmanan. “Cloud poetry: training and hyperparameter tuning custom text models on Cloud ML Engine”

My talk at Munich AI Summit 2018
