Adventures in Narrated Reality, Part II
By Ross Goodwin
Due to the popularity of Adventures in Narrated Reality, Part I, I’ve decided to continue narrating my research concerning the creative potential of LSTM recurrent neural networks here on Medium. In this installment, I’ll begin by introducing a new short film: Sunspring, an End Cue film, directed by Oscar Sharp and starring Thomas Middleditch, created for the 2016 Sci-Fi London 48 Hour Film Challenge from a screenplay generated with an LSTM trained on science fiction screenplays.
Today, Sunspring made its debut on Ars Technica, accompanied by a superb article by Annalee Newitz. Have a look!
To call the film above surreal would be a dramatic understatement. Watching it for the first time, I almost couldn’t believe what I was seeing — actors taking something without any objective meaning, and breathing semantic life into it with their emotion, inflection, and movement.
After further consideration, I realized that actors do this all the time. Take any obscure line of Shakespearean dialogue and consider that 99.5% of the audience who hears that line in 2016 would not understand its meaning if they read it on paper. However, in a play, they do understand it based on its context and the actor’s delivery.
As Modern English speakers, when we watch Shakespeare, we rely on actors to imbue the dialogue with meaning. And that’s exactly what happened in Sunspring, because the script itself has no objective meaning.
On watching the film, many of my friends did not realize that the action descriptions as well as the dialogue were computer generated. After examining the output from the computer, the production team made an effort to choose only action descriptions that realistically could be filmed, although the sequences themselves remained bizarre and surreal. The actors’ and production team’s interpretations and realizations of the computer’s descriptions were a fascinating case of human-machine collaboration.
For example, here is the stage direction that led to Middleditch’s character vomiting an eyeball early in the film:
I don’t know anything about any of this.
(to Hauk, taking his eyes from his mouth)
There’s no answer.
And here’s the final description that resulted in the strange action sequence at the end of the film, containing the shot that’s my personal favorite, wherein Middleditch’s character breaks the fourth wall and pulls on the camera itself, followed by a change in camera angle that reveals that he is, in fact, holding nothing at all:
He is standing in the stars and sitting on the floor. He takes a seat on the counter and pulls the camera over to his back. He stares at it. He is on the phone. He cuts the shotgun from the edge of the room and puts it in his mouth. He sees a black hole in the floor leading to the man on the roof.
The machine dictated that Middleditch’s character should pull the camera. However, the reveal that he’s holding nothing was a brilliant human interpretation, informed by the production team’s many years of combined experience and education in the art of filmmaking. That cycle of generation and interpretation is a fascinating dialogue that informs my current understanding of this machine’s capacity to augment our creativity.
As I wrote in Part I, I believe that augmenting human creativity is the core utility of these machines. However, their use can also inform our understanding of the semantic mechanisms, or lack thereof, embedded in the words we read every day. We typically consider the job of imbuing words with meaning to be that of the writer. However, when confronted with text that lacks objective meaning, the reader assumes that role. In a certain way, the reader becomes the writer.
That’s what I love most about Sunspring — that so many versions of it exist. First, there’s the version I nervously watched emerge from my Nvidia Jetson TK1 computer early on the first morning of the contest, after receiving our team’s prompts and using them to seed my LSTM generator. Second, there’s the version that Oscar Sharp, the director, and the rest of the production crew created by cutting down that initial version, which was far longer than the contest’s 5-minute limit would allow, as well as the version they created in their heads when they gave direction to the actors. Third, there’s the version that Thomas Middleditch and our other superb actors created, when they imbued objectively meaningless lines with context using inflection, emotion, and movement. And finally, there are the countless versions that every viewer creates by projecting their own experiences and ideas onto what they’re seeing, and joyously struggling to make sense of something that objectively does not make any sense at all.
From the experience of screening this film, I have begun to gather that while some people may not want to read prose or poetry that makes no sense, many of them will watch and enjoy a film that makes no sense (after all, look at any David Lynch project), and will certainly listen to music with lyrics that make no objective sense.
A lot of lyrics, across every musical genre and with countless artists, have no single, correct interpretation. I could list a whole lot of examples, but I’d rather just point to a single particularly salient one: My Back Pages by Bob Dylan. That song is brilliant because you can hear whatever you want in it and project your own experiences onto it, because the lyrics make no sense, even if the song’s Wikipedia page would have you believe otherwise.
In Sunspring, the song you hear toward the end, when the surreal action sequence begins, uses lyrics generated by an LSTM I trained on about 25,000 folk songs. I worked with Tiger Darrow and Andrew Orkin, two talented musicians who chose lines that would work well with their music, and composed an impressively catchy song in just a few hours.
Here’s a sample of what that lyrics model’s raw output looks like:
I’m going to see my soul
I’m trying to be free
I can take you through the way you do
And I’m gonna lay down the river in the sky
I’m a man and I won’t be on your hand
And if I was moving at the street
I want to be a pretty little girl
I can’t stand the way that I can’t leave me
And I can’t be really walking on
I will be all right
I’m a coming around
Somebody to go
I want to have the sun and I can’t stay
I know I am a long time ago
The time I was gone
I have to sing a show
I can’t be the one that I could be
I’m gonna be the one to love
I wonder what I was always to be
I can’t come to me
I’m gonna be the one that I need
I love you, I want to change your love
For me, the most rewarding part of watching professional musicians play with lyrics from the machine was watching their transition, in just a few hours, from questioning the machine’s utility entirely to asking if they could use the generated lyrics in their work outside the 48-hour film competition. Later, after someone on the crew asked Tiger if using such a machine would be considered cheating, she made an excellent comparison: “No,” she said, “It’s like using a rhyming dictionary,” which is a tool that lyricists use all the time. (At the time, I hesitated to mention that rhyming is a fundamentally algorithmic process, even if many of us don’t usually think of it that way.)
And it’s easy to imagine how such a tool might work in practice: the lyricist writes a few lines, maybe gets a bit stuck, then uses the machine to generate possibilities for the next line until she finds one she likes, edits it a bit (if necessary), and continues writing. Autocomplete for lyricists could turn writing song lyrics into a conversational process, with a writing partner made of metal and electricity, perhaps fine-tuned with bias toward the topics and references that its human collaborator prefers.
Prose by Any Other Name
Those with significant attention to detail may wonder why the character names in Sunspring are all single letters. It’s because LSTMs have a lot of trouble with proper names, so when I trained on a corpus of science fiction TV and movie screenplays, I replaced each name with the first letter of that name.
I discovered this proper name issue when training a model that I hoped would produce science fiction prose. After noticing that the LSTM output rarely contained proper names, as compared to the frequency of use in its training corpus, I wondered if the model might be somehow afraid to use them. (Obviously, neural networks cannot get scared, at least so far as we know, but I’m using that word because it approximates what I was thinking at the time.)
After further consideration, it occurred to me that proper names would be logically difficult for an LSTM to recognize, as the LSTM relies on decoding complex patterns of individual letters. For example, consider two dramatically different surnames without any overlapping letters. Either name could be used in the exact same sentence, and a human would comprehend it without any issues whatsoever. For an LSTM, this case would be very confusing indeed: two dramatically different words in the exact same context.
Anyway, that was my hypothesis — I had neither science nor data to back it up. But it made sense to me logically, so I decided to test it.
Using AlchemyAPI, I identified all of the proper names in my prose corpus, and found there were about 10,000 unique names. So I decided to condense the “namespace” that the model would need to consider. (I recognize that the word “namespace” is typically used in programming to describe all the object names in a particular context, and I feel that my use of the word here is equally valid, albeit with minor semantic differences.)
In my science fiction prose corpus, which included the complete works of Margaret Atwood, Isaac Asimov, and other writers, AlchemyAPI’s entity extraction tool revealed approximately 10,000 unique persons’ names. After finding these names, I used regular expressions to replace them with a set of about 100 names, mostly from my Facebook friends, but also a few that I made up. Training on the corpus with the reduced namespace resulted in about a 10% reduction of the final validation loss measurement (see: explanation of LSTM loss metrics in Part I) after about five days of training.
In assembling a corpus for the model that wrote the Sunspring screenplay, I tried another technique: replacing each character’s name with the first letter of that name. That technique seemed to be even more effective in terms of validation loss reduction, and as mentioned above, it’s why all the characters in our film have single-letter names.
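Both name-handling strategies described above come down to a regular-expression substitution once the list of names is known. Here is a minimal sketch, assuming the names have already been extracted (AlchemyAPI is not called here). The function names, the sample sentence, and the replacement pool are all hypothetical stand-ins, not the original scripts:

```python
import re
from itertools import cycle


def _name_pattern(names):
    # Longest names first, so a full name wins over a shorter name inside it.
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")


def reduce_namespace(text, found_names, pool):
    """Map every found name onto a much smaller pool of replacement names."""
    if not found_names:
        return text
    replacements = cycle(pool)  # reuse the pool once it runs out
    mapping = {name: next(replacements) for name in sorted(found_names)}
    return _name_pattern(found_names).sub(lambda m: mapping[m.group(0)], text)


def names_to_initials(text, found_names):
    """Replace every found name with its first letter."""
    if not found_names:
        return text
    return _name_pattern(found_names).sub(lambda m: m.group(0)[0], text)


# Made-up sample text and names, purely for illustration.
sample = "Helen Vance spoke first. Carl listened while Helen Vance paced."
names = {"Helen Vance", "Carl"}

reduced = reduce_namespace(sample, names, ["Alice", "Bob"])
initials = names_to_initials(sample, names)
print(reduced)   # Bob spoke first. Alice listened while Bob paced.
print(initials)  # H spoke first. C listened while H paced.
```

Either way, the model sees far fewer distinct letter sequences occupying the "name" position in a sentence, which is the effect the hypothesis above predicts should help.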
Quest for Cohesion
In attempting to make a decent prose model that would produce cohesive text, I developed another technique that I call resequencing, which entails labeling paragraphs based on how far into a book they appear, grouping them by TF-IDF keyword, then using the generated labels to reorder the generated paragraphs.
I’ll begin by describing the labeling process, which starts with dividing each book in a corpus into eight sections, then labeling each paragraph by its section number (0–7). The labels are characters that prepend each paragraph.
In order to ensure the LSTM could distinguish between labels and the text itself, I used binary numbers encoded with tildes (‘~’) as zeroes and pipes (‘|’) as ones, because those are unusual characters that don’t appear very often in most books, and which can be removed safely without causing too much semantic chaos.
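The labeling step can be sketched in a few lines. This is a rough illustration rather than the original code: it assumes three-character labels (enough for sections 0–7), most significant bit first, and it assigns sections by paragraph index rather than by character or word position. The function names are made up for this example.

```python
def section_label(section, width=3):
    """Encode a section number in binary, using '~' for 0 and '|' for 1."""
    bits = format(section, "0{}b".format(width))
    return bits.replace("0", "~").replace("1", "|")


def label_paragraphs(paragraphs, sections=8):
    """Prepend to each paragraph the label of the section it falls in."""
    total = len(paragraphs)
    labeled = []
    for index, paragraph in enumerate(paragraphs):
        section = index * sections // total  # always in 0 .. sections - 1
        labeled.append(section_label(section) + paragraph)
    return labeled


# A stand-in "book" of sixteen tiny paragraphs.
book = ["p{}".format(i) for i in range(16)]
labeled_book = label_paragraphs(book)
print(labeled_book[0])   # ~~~p0
print(labeled_book[-1])  # |||p15
```

Under this encoding, section 6 becomes `||~` and section 4 becomes `|~~`.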
After labeling the paragraphs, I shuffle them randomly, then group them by top-scoring TF-IDF keyword beneath the definition of each keyword. TF-IDF, or “term frequency-inverse document frequency,” is an algorithm that tries to determine the words in each paragraph that are most important to that paragraph’s semantic meaning. It gives each word a score, corresponding to the number of times that word is used in a particular paragraph, and the number of times it is used in other paragraphs. Words that are used a lot in a particular paragraph, but not used a lot in the corpus as a whole, would receive a higher TF-IDF score. Words that appear less often in a particular paragraph, and appear very often across the whole corpus, would receive a lower TF-IDF score.
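To make the scoring concrete, here is a small pure-Python sketch of one common TF-IDF variant: term frequency is a word's count divided by the paragraph length, and inverse document frequency is the log of the number of paragraphs divided by the number of paragraphs containing the word. Libraries differ in their smoothing and normalization, so treat this as an illustration of the idea, not the exact scoring used for the corpus. The sample paragraphs and function names are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(paragraph):
    return re.findall(r"[a-z']+", paragraph.lower())


def top_keywords(paragraphs):
    """Return the highest-scoring TF-IDF word for each paragraph."""
    token_lists = [tokenize(p) for p in paragraphs]
    # Document frequency: how many paragraphs contain each word?
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    total = len(paragraphs)
    keywords = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(total / doc_freq[word])
            for word, count in counts.items()
        }
        keywords.append(max(scores, key=scores.get) if scores else None)
    return keywords


def group_by_keyword(paragraphs):
    """Cluster paragraphs under their top-scoring keyword."""
    groups = defaultdict(list)
    for paragraph, keyword in zip(paragraphs, top_keywords(paragraphs)):
        groups[keyword].append(paragraph)
    return dict(groups)


samples = [
    "the zucchini grew and the zucchini sang",
    "the serpent ate the frog and the serpent slept",
    "the clock kept the time on the clock",
]
keywords = top_keywords(samples)
print(keywords)  # ['zucchini', 'serpent', 'clock']
```

Note that "the" appears in every sample paragraph, so its inverse document frequency is log(1) = 0 and it can never be chosen as a keyword, no matter how often it is used.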
After clustering the paragraphs by TF-IDF keyword, I decided to group them beneath each keyword’s definition from the Oxford English Dictionary for two reasons: (1) it allows me to seed with a word or definition and receive a set of paragraphs emphasizing a particular word, and (2) the OED trains really, really well, and seems to help LSTMs understand English better, as I discussed in Part I.
As a result of these techniques, snippets of the corpus look like this:
n. (pl. same or zucchinis) NORTH AMERICAN a courgette.
Italian, plural of zucchino, diminutive of zucca ‘gourd’.
||~They have Elvis Presley zucchini moulds now: you clamp them around your zucchini while it’s young, and as it grows it’s deformed into the shape of Elvis Presley’s head. Is this why he sang? To become a zucchini? Vegetarianism and reincarnation are in the air, but that’s taking it too far. I’d rather come back as a sow-bug, myself; or a stir-fried shrimp. Though I suppose the whole idea’s more lenient than Hell.
|~~Today is our Feast of Serpent Wisdom, and our Children have once again excelled in their decoration. We have Amanda and Dick to thank for the gripping mural of the Fox Snake ingesting a Frog — an apt reminder to us of the intertwined nature of the Dance of Life. For this Feast we traditionally feature the Zucchini, a Serpent-shaped vegetable. Thanks to Rebecca, our Eve Eleven, for her innovative Zucchini and Radish Dessert Slice. We are certainly looking forward to it.
||~Pete put on a suitably solemn face. Bloodgood next? Some gruesome new food substance, no doubt. A liver tree, a sausage vine. Or some sort of zucchini that grew wool. Kim braced himself.
|||He remembers the night in Baltimore when he photographed their month-old daughter at her right breast and a twenty-inch zucchini cradled in her left arm. He wonders who or what he will want, who or what he will desire, who or what he will love tomorrow after her surgery.
After seeding with a word and generating some prose paragraphs, I put them back in order using the generated labels. As a whole, this resequencing technique works better than I ever thought it would. And I used it to generate a short story for the flash fiction contest that occurred alongside the Sci-Fi-London 48 Hour Film Challenge.
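Putting generated paragraphs back in order amounts to decoding each label and sorting on it. The following is a minimal sketch, assuming three-character tilde/pipe labels at the start of each paragraph. Dropping paragraphs that come out without a valid label is a choice made for this example, not necessarily what the original code does.

```python
import re

LABEL = re.compile(r"^([~|]{3})")


def resequence(generated_paragraphs):
    """Sort generated paragraphs by their tilde/pipe label, then strip it."""
    decoded = []
    for paragraph in generated_paragraphs:
        match = LABEL.match(paragraph)
        if not match:
            continue  # no usable label was generated; skip (an assumption)
        bits = match.group(1).replace("~", "0").replace("|", "1")
        decoded.append((int(bits, 2), paragraph[match.end():]))
    # sorted() is stable, so paragraphs sharing a label keep generation order.
    return [text for _, text in sorted(decoded, key=lambda pair: pair[0])]


generated = ["|||The end.", "~~~The start.", "|~~The middle."]
story = resequence(generated)
print(story)  # ['The start.', 'The middle.', 'The end.']
```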
Unlike the film, my short story did not make the contest’s shortlist. However, I think it’s pretty good — it’s certainly the most cohesive document I’ve ever seen generated this way, considering its length is 1500 words. You can read the full story if you’d like, but I’m just going to highlight my favorite paragraph here:
Mindy Mandible was a person who was still alive. It was a transient and superior thought that was shameful. It was too heavy for her. But there was no way of standing beside her. It was a sense of astonishment, and it was the panic of a mind, and that was a muscular and almost familiar reason.
Mindy Mandible is a fantastic name that the LSTM invented entirely on its own. It’s not in the set of names I inserted, nor is it in the corpus at all, but it sounds like a name that Thomas Pynchon could’ve plausibly created.
The story has no objective interpretation. However, I think it reads like a conversation between a number of machines and the last few humans, in a far future that’s totally machine dominated, where humans are significantly outnumbered. The main character (the person who speaks as “I”) could be a human whose conflict is whether to stop being human in some way (perhaps involving corporeal death) that we, as readers, never fully understand.
I shared the story with author Robin Sloan, who has been hacking around with LSTMs as well. Robin offered his own interpretation of the story:
Okay, I have to confess that after reading it twice, I don’t have a coherent “theory of the story”; in my head, it’s very Beckett-ian, like a set of characters trapped in the white-walled mind of a sleeping author, arguing about who they are. Haha I guess that’s a theory of the story after all!
I think this relatively early line —
“I don’t know, I was a writer,” said Shackie. “I’ll tell you. You’re going to try to tell us what to do.”
— sort of “set the valence” for me… that is, lent a flavor to some of the subsequent (nonsensical) proceedings. Interesting to see that effect.
The proper nouns really lock things in, don’t they? Basically whenever there’s a proper noun — a name, “the Council of Science,” “Galaxy,” “Empire” — something just clicks as a reader. It’s like they’re solid places to steady yourself & survey the scene. But then, the language drifts away, and you lose the thread again… anyway that’s another interesting effect. And it’s interesting to consider how an, er, Proper Noun Management (?) scheme might play into one of these generative systems.
Powers of Two
There’s a lot of mystery surrounding how one chooses hyperparameter settings when training LSTM recurrent neural networks. First, let’s address the question of what a hyperparameter is — because, in the context of LSTMs, it’s not the same as a parameter.
When we talk about LSTM parameters, we’re generally speaking about the numbers within the model that change as the model trains. The LSTM models I’ve trained in my research have generally had between 20 and 85 million parameters per model.
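For a sense of where numbers like these come from, here is a back-of-the-envelope count for a stacked character-level LSTM. It uses the textbook layout of four gates per layer, each with input weights, recurrent weights, and one bias vector, plus an embedding table and an output layer. The vocabulary size (100) and embedding size (64) are assumptions chosen for illustration, and real frameworks may count slightly differently (some keep two bias vectors per gate, for example).

```python
def lstm_parameter_count(vocab_size, embed_size, rnn_size, num_layers):
    """Approximate trainable parameters in a stacked character-level LSTM."""
    total = vocab_size * embed_size  # embedding table
    input_size = embed_size
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights, and a bias.
        total += 4 * rnn_size * (input_size + rnn_size + 1)
        input_size = rnn_size  # the next layer reads this layer's output
    total += rnn_size * vocab_size + vocab_size  # output layer and its bias
    return total


# Assumed vocabulary and embedding sizes; rnn_size and num_layers are free choices.
count = lstm_parameter_count(vocab_size=100, embed_size=64,
                             rnn_size=2048, num_layers=3)
print("{:,}".format(count))  # 84,646,244
```

Under these assumptions, three layers of 2,048 units land just under 85 million parameters, and nearly all of them sit in the recurrent layers rather than in the embedding or output.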
Hyperparameters, on the other hand, are the settings we adjust for training: the learning rate, learning rate decay, batch size, sequence length, dropout, and other levers we can pull (see: relevant section in Part I) in order to train a better LSTM model (i.e., one with a lower final validation loss measurement) on a particular corpus. Choosing the right settings for a given corpus can be a matter of exhaustive trial and error, but there’s a certain technique I’ve found helpful in that process. It’s not very technical, nor is it rigorous, nor comprehensive, nor particularly scientific in any way, but it works for me.
It involves powers of two.
If you’re not a programmer, the dark magic of the number two might not be readily apparent. But as any programmer can tell you, it’s real — disguised in plain sight, in the numbers zero and one: the two possible states of a bit.
Before you begin calling me a mad numerologist, allow me to defend myself by citing the (admittedly fictional) work Snow Crash by Neal Stephenson:
The number 65,536 is an awkward figure to everyone except a hacker, who recognizes it more readily than his own mother’s date of birth: It happens to be a power of 2–2¹⁶ power to be exact — and even the exponent 16 is equal to 2⁴, and 4 is equal to 2². Along with 256; 32,768; and 2,147,483,648; 65,536 is one of the foundation stones of the hacker universe, in which 2 is the only really important number because that’s how many digits a computer can recognize. One of those digits is 0, and the other is 1. Any number that can be created by fetishistically multiplying 2s by each other, and subtracting the occasional 1, will be instantly recognizable to a hacker.
For LSTM hyperparameters, the solution I most enjoy is ascending and descending possible values in the powers of two until I find one that seems to work best. A quick look at one of my recent Torch-RNN training script calls will confirm this:
$HOME/torch/install/bin/th train.lua -input_h5 $WORK/rap/rap.h5 -input_json $WORK/rap/rap.json -rnn_size 2048 -num_layers 3 -dropout 0.25 -max_epochs 65536 -seq_length 256 -batch_size 128 -checkpoint_name $WORK/rap_checkpoints/rap > output.txt
Unless you know your way around the Unix shell, the words above aren’t going to mean much to you, but look at the numbers. If you know your powers of two, they should all look familiar, except the number of layers (-num_layers), which Kyle McDonald tells me should actually be set to two as well in most cases, possibly, because the present notion in the #ML thought-sphere is that the third layer of an LSTM might not actually do anything.
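The stepping-up-and-down habit can be written out as a tiny helper. This is only a sketch of the idea: the helper names are invented, and the "loss" used below is a fake stand-in so the example runs instantly. In practice, each candidate would mean a full training run followed by a look at the final validation loss.

```python
def powers_of_two_around(start, steps=2):
    """Candidate values: `start` halved and doubled up to `steps` times."""
    if start < 1 or start & (start - 1):
        raise ValueError("start must be a positive power of two")
    exponent = start.bit_length() - 1
    lowest = max(0, exponent - steps)  # never go below 2**0
    return [2 ** e for e in range(lowest, exponent + steps + 1)]


def pick_best(candidates, validation_loss):
    """Return the candidate with the lowest loss under the given function."""
    return min(candidates, key=validation_loss)


batch_sizes = powers_of_two_around(128)
print(batch_sizes)  # [32, 64, 128, 256, 512]

# Fake loss for demonstration only; real use would train a model per candidate.
best = pick_best(batch_sizes, lambda size: abs(size - 200))
print(best)  # 256
```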
Truth is, it’s all dark magic, at least for now. As Kevin Slavin, a professor at the MIT Media Lab, says w/r/t algorithms today:
We’re writing these things that we can no longer read.
By playing and hacking around with complex systems, I believe we can learn to understand them better. We are writing things we can’t read, for now, but that may be a transitional state in the history of machine intelligence.
Camera, Compass, Clock
As I continue to train LSTMs on different types of text, my work toward developing devices to contain these machine intelligences has advanced, and I now have a set of interchangeable LSTM neural network models. I put them on SD cards, so all of them could be used with each device in the series: a camera, a compass, and a clock that narrate images, location, and time respectively. Since I featured the camera prominently in the prior installment, I’ll fill in some gaps on the other two devices.
I’ll start with the compass, which is a device for a car, inspired by wardriving, a hacking activity wherein one drives around with a computer, hacking into wifi networks. Making art with a car is also a time-honored tradition, as I recently discovered on the first page of a book of art assignments:
If artists have been drawing with their cars for years, why not let them write with cars as well? That’s where the compass comes in — using location to generate text and enabling automatically narrated automotive journeys.
Along the lines of the wardriving theme, I decided to use a navigator’s compass from a B-17 Flying Fortress as the centerpiece of the device, which I mounted on the metal center console (the module with arm rest between the driver and passenger seats) from a police cruiser, alongside the Nvidia Jetson computer and a thermal printer. The printer outputs text based on your location, using whichever of the eight models you choose to insert into the Jetson’s SD card slot. I scraped locations from Foursquare into a local database in order to provide more human-readable landmark descriptions than bare GPS coordinates, and those locations (e.g. “NYU ITP, a college arts building”) seed the LSTM text generator.
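Turning a GPS fix into a seed phrase could look something like the sketch below, which picks the nearest stored landmark by great-circle (haversine) distance. Everything here is hypothetical: the in-memory list stands in for the local database, the coordinates are rough values invented for the example rather than Foursquare data, and the second landmark is made up.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def seed_for_location(lat, lon, landmarks):
    """Return a 'name, description' seed for the nearest landmark."""
    name, description, _, _ = min(
        landmarks,
        key=lambda lm: haversine_km(lat, lon, lm[2], lm[3]),
    )
    return "{}, {}".format(name, description)


# (name, description, latitude, longitude); illustrative values only.
landmarks = [
    ("NYU ITP", "a college arts building", 40.7294, -73.9937),
    ("Example Terminal", "a train station", 40.7527, -73.9772),
]

seed = seed_for_location(40.7295, -73.9935, landmarks)
print(seed)  # NYU ITP, a college arts building
```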
Because drivers typically demand as many instruments as possible, I added an oven knob to control temperature, a hyperparameter discussed in Part I:
Next, there’s the clock. For this piece, I purchased an antique punch clock from a junk store in Brooklyn. The clock face says: GENERAL TIME RECORDER EXCHANGE / NEW YORK CITY / CHELSEA 3–3886. I initially thought the clock had been manufactured by the Chelsea Clock Company, but the final line turned out to be an ancient phone number. In the process of researching the piece, I stumbled across antiquetimeclocks.com, the owner of which insists at the bottom of the main page that anyone needing research assistance should contact him, so I did. Here’s his response:
Thanks for contacting me. General Time was a re-seller of used time clocks, so the clock you have was not made by General Time. It was made by the International Time Recording Company of Endicott, NY. This company eventually became what we know today as IBM. Your clock was originally spring wound and pendulum driven. It is one of the most common clocks I see today as thousands were made. General Time reconditioned the clock by replacing the spring driven time movement with an electric movement, and then they replaced the original dial.
This information explained why the clock needed to be plugged in to keep time, and made my prospective modifications seem more appropriate, as it would be the clock’s second major overhaul. So I inserted a mechanical keyboard switch beneath the punch clock’s main lever, then installed the Nvidia Jetson and printer behind its glass door.
Here’s the result:
Finally, I’d be remiss if I failed to mention the camera, which now has its own vintage wooden tripod.
But rather than focusing on the camera itself, which hasn’t changed much since Part I, I want to discuss a certain reaction to its output that I found very interesting and constructive.
It came from a nine-year-old girl.
That her father is Blaise Agüera y Arcas, who leads a machine intelligence team at Google, is not inconsequential. I met Blaise in San Francisco at Gray Area’s Art and Machine Learning Symposium in February, where I demonstrated my messenger bag prototype (see: Part I). After we talked, I showed him the device and gave him a few output receipts to take home. (I hand out the receipts to pretty much everybody, as often as I can, since I consider them to be ephemera. The artwork is the concept and the devices themselves.)
A few days after we met, Blaise sent me an email, in which he said:
My […] 9yo daughter, who is in a local writing cohort […] made an impassioned critique of my word.camera printout (“there’s nobody home”, “the feeling is fake”, etc.) which took us into a multi-hour conversation about cognition vs. metacognition, art vs. “meta-art” (art that is about generating art), questions of agency, etc. It was awesome.
This revelation struck a chord with me, as it sounded like exactly the type of discussion I was trying to promote with my work. With Blaise’s permission, I followed up with his daughter, Eliot, who was even more critical than I hoped she would be.
Here’s what she said:
Let me be clear on this: I think that what you are doing is extremely interesting.
However, I think that this art is cheating anyone who look at this art. You are taking credit for something that is not yours. Or is it? If you have a phone, it is your phone. If your phone makes something, technically it is yours.
But what if an intelligent pet does it? A dog or pig? Definitely less so.
But my Dad often speaks of a monkey-intelligent device. So if, in a future of monkey-intelligent devices, if your phone writes a poem, can you really take advantage of it?
When a device gets that intelligent, whether it has emotion or conscience or not, it just must be it’s own intelligence, out of your possession. More like a colleague or a coworker. A housemate.
Maybe then, Mr. Goodwin, maybe while devices still remain relatively dumb, you should continue to create this art. Take advantage of the time you can still do this.
Next thing on the agenda: This work is fake.
In what is commonly defined as real art, there are tens of layers of meaning. There is something behind the art. Imagine a stack of paper on a window. The top sheet is what you see at first. Your first glance. Now, looking at it more closely, you can dig into the layers of paper, usually never reaching the absolute last sheet, but still feeling satisfied with how much farther you are than when you began looking at the piece.
With this art, there is one piece of paper on a window. No layers. All there is to it is what you see at first glance. The device did not mean anything when it wrote this poem. Just words, sentences, phrases, lines. This is just like teaching the device math. It gets the input (1+1=2). It remembers what the process was, how to do it. (1,2.) And it comes up with a result (2). Did you see that? The device was not being creative. It was doing what it was told to do to take the input and come up with a result. Like math.
This is not real poetry. However, I’m not saying it never will be.
In my response to her, I explained that the artwork is the concept and device, not the output, and she seemed mostly satisfied with that. I was thrilled by how critical she was, how she gave me her real opinions, and how my work could provoke a response like that in someone so young. Because I hope my work does make her think — I hope it makes young people think about the future we’re creating right now, and how machine intelligence fits into that future.
Oscar and I wondered if Sunspring would win the 48-hour film challenge we had entered. It didn’t, but we did get a special mention. If you’re curious about what happened with the contest’s audience vote, this tweet explains pretty much everything:
The organizers had some questions they wanted to ask our screenplay LSTM, so we obliged and asked it each question three times.
What do you think of your historic nomination against human opponents in this contest?
I was pretty excited.
I think I can see the feathers when they release their hearts. It’s like a breakdown of the facts. So they should be competent with the fact that they won’t be surprised.
I don’t know. So there’s this new world about this situation at the moment.
What is your screenplay really about?
Something about people who are in a representative of the Tribune.
I don’t know. I think I understand that.
I don’t know. I wish I could see the sun in the first place.
Why did you cheat in the voting?
I believe you. I think I should go to the bathroom.
The only thing I can do is see what the world is.
I don’t know, I didn’t know that. The way I have to take them away and then you don’t want to be a cop. It’s a little interesting. I’m sorry.
What is the future of machine written entertainment?
It’s a bit sudden.
The man who has not been able to represent the charges of the meaning of such a distant strike — and you must remain in mind.
I was thinking of the spirit of the men who found me and the children who were all manipulated and full of children. I was worried about my command. I was the scientist of the Holy Ghost.
What’s next for you?
Here we go. The staff is divided by the train of the burning machine building with sweat. No one will see your face. The children reach into the furnace, but the light is still slipping to the floor. The world is still embarrassed.
The party is with your staff.
My name is Benjamin.