AMI Residency Part 2 : Realtime control of sequence generation with Recurrent Neural Network Ensembles
static unsigned love
This is the third of a series of articles [1, 2, 3], on research I’ve done over the past few years that I’m only just getting round to writing about. In May/June 2016 I did a residency at Google’s Artists & Machine Intelligence program and explored two related but separate lines of inquiry. So I’ll be writing about it in two posts. Part 1 is here. This is part 2.
This particular research was accepted as a demo presentation at NIPS (Neural Information Processing Systems) 2016, as well as a poster presentation at the NIPS Recurrent Neural Network Symposium. I won’t get very technical in this post, details can be found in the accompanying paper (it’s a very short paper, more of an extended abstract).
I’m hoping to release the source code (and models!) very soon, as soon as I’ve tidied it up (and simplified dependencies).
I’ll start with the end result. This is about a system that allows users to gesturally ‘conduct’ the generation of text. It’s a method of real-time continuous control and ‘steering’ of sequence generation, using an ensemble of Recurrent Neural Networks, allowing users to dynamically alter the mixture weights of multiple models, each trained on a different dataset.
In plainer English: The system spits out text, character by character (seen on the ‘text’ window below, on the right), as if someone were typing at a typewriter (at around 10–20 characters per second). While the text is being written on-screen, you, the user, can mix in different ‘styles’ in real-time (I use the word ‘style’ very liberally here). For example while the system is generating text in the style of the Bible, you can tell it to start mixing in a little bit of Mary Shelley, and then pull it towards the poetry of Coleridge, and then towards love song lyrics etc. This doesn’t affect the text that has been generated so far, but only the new text that is being generated. And you can change styles mid-sentence, even mid-word.
Short (2:15) video demo:
Long video (13:56) with voice over explanation:
Styles are not mutually exclusive, i.e. you don’t necessarily pick one style over another. You can also mix multiple styles together, with varying ‘mixture weights’. E.g. the sub-heading of this post “static unsigned love” was generated by the system when I had the ‘love song lyrics’ style and ‘Linux C source code’ style mixed in roughly equally. Letter by letter, the system first wrote “static unsigned lo” (probably the C model was on a path to write “static unsigned long”, but then the ‘love songs’ model saw the “lo” and probably thought “this has got to be completed to ‘love’”).
Like I mentioned, I use the word ‘style’ very liberally here. In this context, what I mean by ‘style’ is a particular data-set. I settled on 24 easily identifiable styles (i.e. data-sets): Apollo11 source code, Aristotle, Baudelaire, Bible, Chilcot Report, Coleridge, Dalai Lama, DaVinci, Iliad & Odyssey, Jane Austen, Carl Jung, Kama Sutra, Immanuel Kant, Kuran, LaTeX, Abraham Lincoln, Linux C source code, Love Songs lyrics, Victorian Erotic Novels, Mary Shelley’s Frankenstein, Mein Kampf, Mrs Beeton, Nietzsche, Trump. All English translations where relevant. (NB. There were many other writers, speakers, thinkers etc. that I would have loved to train on, but finding enough data is quite hard. At least a few million characters per style is needed for a decent model. E.g. the Mary Shelley model trained on a single novel, and the Dalai Lama model trained on a few speeches, texts and tweets, are not very good models).
I should also clarify that when the system is generating text in a particular ‘style’, it is not copying and pasting existing text from that data-set. Rather, it has learnt probability distributions for the next character, given the previous characters. And most importantly, the model even learns rules on how to construct those probability distributions. Also, when sampling from this distribution, it doesn’t necessarily pick the most likely character, but sometimes less likely characters, and so it does very often generate brand new text that is reminiscent of the desired style, but not in the training data. And especially when we start mixing styles it generates ‘very novel’ sequences. I go into a bit more detail about all of this in later sections.
There are a few different methods of user interaction for style control. First, one can select which styles are ‘active’ via the left-most panel. Internally, all styles are always loaded. Temporarily activating (or deactivating) a style from this panel simply adds it to (or removes it from) the other interaction modes (e.g. the blue and pink panels). This allows a user to interact with fewer than 24 styles at once, reducing clutter on screen and simplifying the interaction.
A direct method to adjust the style mixture weights is to use the mouse (or touchscreen) to play with on-screen sliders, via the centre (blue) panel. I also have this mapped to a hardware midi controller, so the user can use physical faders to mix different styles. The active styles are also arranged in a circle in the rightmost (pink) panel. Here the user can use the mouse (or touchscreen) to move a small puck (the white dot) around, and the puck’s distance to each style icon determines the mix amount of the corresponding style. Finally, I implemented a LeapMotion based gestural interface, where waving your hand around in space controls the puck, and is able to mix in different styles. E.g. in the example above, while Baudelaire is the most dominant style, moving your hand a little bit to the right and a little bit up would fade towards Jane Austen, further up from there would fade to Trump, going across all the way to the left would mix in Love Songs, and moving gently down would then bring in the Kuran, and then eventually Kama Sutra etc.
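The exact mapping from puck position to mixture weights isn't spelled out here, so the following is only a minimal sketch of one plausible choice. It assumes the style icons sit evenly on a unit circle and that each weight falls off with the inverse square of the puck's distance to the icon. The function names, the falloff curve and the `eps` constant are all assumptions for illustration, not the demo's code.

```python
import math

def style_positions(n):
    """Place n style icons evenly around a unit circle (assumed layout)."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def puck_to_weights(puck, positions, eps=1e-6):
    """Map a 2D puck position to normalised mixture weights.

    Assumed falloff: weight proportional to 1 / (distance^2 + eps), so the
    nearest icon dominates and distant icons fade out smoothly. eps avoids
    a division by zero when the puck sits exactly on an icon.
    """
    raw = []
    for (x, y) in positions:
        d2 = (puck[0] - x) ** 2 + (puck[1] - y) ** 2
        raw.append(1.0 / (d2 + eps))
    total = sum(raw)
    return [r / total for r in raw]

# Six active styles, puck close to the first icon at (1, 0).
positions = style_positions(6)
weights = puck_to_weights((0.9, 0.1), positions)
print([round(w, 3) for w in weights])
```

With the puck in the centre of the circle every style gets the same weight. Dragging it towards an icon makes that style dominate while its neighbours fade in and out gradually.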
As an added extra, I also started trying out other interaction methods, like detecting facial features and mapping them to style. E.g. the classic pouty-lips-and-lamp-post-up-the-butt-scrimpled-face for Trump; feeling-really-enlightened-closed-eyes-and-raised-eyebrows for the Dalai Lama etc. The best part is of course that I don’t actually have to explicitly define what each of the facial poses needs to be. That’s the strength of machine learning: I can just pull a face, tell the system to learn the features of that particular facial pose, and then map it to a particular style (e.g. using Multi-Layer Perceptron regression).
Unfortunately I didn’t have time to finish the facial feature style mapping. So for now I’ll just talk about the gestural interface, or rather, I’ll talk about the underlying system, since that’s the bit that’s perhaps more interesting. I.e. there is a system which is constantly generating text, character by character, and it takes as input, style mixture weights, and it uses those weights to influence the style of the following characters. Adding different methods of interaction is simply a case of adding an interaction layer that just takes any kind of input (e.g. facial features) and maps it to style mixture weights. Then the existing system can take it from there.
Introduction, motivation and background
The motivation for this project is a continuation of some of my much older work, and is a perfect intersection of the motivations for my previous two posts:
1. Real-time interactive manipulation of generative systems
(previous post re deep learning Collaborative creativity with MCTS & CNNs)
aka New Instruments for Creative Expression (inspired by New Instruments for Musical Expression). This line of inquiry is a continuation of projects such as
2. Playing on the boundaries of the figurative and abstract
(previous post re deep learning Exploring space, projecting meaning onto noise)
i.e. Producing artefacts abstract enough to allow observers to project their own meaning, but providing enough structure to guide their meaning creation process. This line of inquiry is a continuation of projects such as
Sequence generation with Recurrent Neural Networks
Recurrent Neural Networks (RNNs) —in particular, a recurrent architecture called Long Short-Term Memory (LSTM, initially proposed in 1997) — are very popular and successful in modelling and generating sequences. I’m not going to talk in detail about how they work, as there are quite a few great tutorials on the subject, e.g.
- The Unreasonable Effectiveness of Recurrent Neural Networks (2015) by Andrej Karpathy (a very nice, non-technical explanation),
- Understanding LSTM Networks (2015) by Chris Olah (bit more technical)
- Generating Sequences with Recurrent Neural Networks (2013) by Alex Graves. This is essential reading for a full technical explanation, and really lays the groundwork I think for a lot of generative RNN work (and even attention!).
But to give a very brief explanation, an RNN can learn to predict (amongst other things):
$$P(x_{t+1} \mid x_1, x_2, \ldots, x_t)$$
Which can be read as “The probability of x at time t+1, given the values of x at time 1, time 2, all the way up until time t” (NB. the vertical bar means ‘given’ or ‘conditioned on’ or ‘assuming we know this to be true’ etc.). In other words, the RNN learns (and outputs) a probability distribution for the next item, for a given sequence of items. (This may make RNNs sound a bit like Markov Chains or Hidden Markov Models etc. But they are quite different under the hood and can model much more complex sequences. Some nice experiments here. In fact, RNNs are theoretically Turing complete and hence are now being used in Neural Turing Machines or Differentiable Neural Computers to learn algorithms from data, not just functions).
So RNNs have been very successful in generating sequences in domains such as music (Eck2002, Sturm2015), images (Gregor2015), handwriting (Graves2013), speech (Wu2016), choreography (Friis2016) etc. and became very popular in 2015 with Karpathy’s blog post and easy to use open-source implementation char-rnn, and subsequent implementations like torch-rnn.
Overall, the generative process can be summarised as:
- Train an RNN model on a ton of data
- Ask the RNN to generate a sequence. If the RNN hasn’t been trained well, it will generate junk. If it has been trained well, it will generate a sequence in the ‘style’ of the data that it has been trained on. I.e. if you train an RNN on folk music, it will (at best) generate folk-y music, not classical music (NB. if you train an RNN on both classical and folk music, it may just generate some kind of bland mushy mix of the two. If you’ve done well, it may be able to generate both, more on this later when I talk about ‘priming’).
Currently, a lot of the applications which work with such systems are not real-time, let alone interactive, let alone expressively interactive. I.e. you tell the system to generate a sequence, e.g. some music or text, and after waiting some time (ranging from a few seconds to a few minutes) you get back a sequence, e.g. a few seconds or minutes of music, or a few words or sentences or paragraphs of text.
One thing that I’m interested in, is how can I, a human user, interact with the generative process, and steer or control it while it’s generating. I want to be in control, and respond to the generative process in real-time. In this particular case, I want to be able to control the style of the output as it’s being generated. I want to be able to play the system like a piano, or at least, ‘conduct’ it.
There are many ways one could approach this problem. I wrote about one possible method using an agent (e.g. I tried an agent driven by MCTS, I’m also looking at Reinforcement Learning, which Karpathy also has a great post on :). Another method is to use something colloquially referred to as ‘priming’ — which I have also tried and will write about in another post.
In this particular project I tried using an ensemble of models trained on different data-sets.
An ensemble basically means a bunch of models. Usually one would train a bunch of different models, maybe different architectures for each model, maybe even different learning algorithms altogether. The idea being that when the time comes to make a prediction, all of the models are fed the new input, and they are all asked to make a prediction. Then through a kind of ‘voting’ process, the outputs of all of the models are averaged. E.g. if we were working on an image recognition system, and we trained 10 models, each using different architectures and methods, then if 8 of those models predicted that a particular image was of a cat, and the other two models predicted that it was of a dog, then we can be a bit more confident that the image is indeed that of a cat.
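To make the ‘voting’ idea concrete, here is a small sketch of the cat/dog example using equal-weight averaging of class probabilities. The per-model probabilities are made up for illustration, and `average_predictions` is a hypothetical helper rather than part of any library.

```python
def average_predictions(predictions):
    """Average per-class probabilities across models (equal-weight ensemble)."""
    classes = predictions[0].keys()
    n = len(predictions)
    return {c: sum(p[c] for p in predictions) / n for c in classes}

# Ten hypothetical image classifiers: eight lean "cat", two lean "dog".
cat_voter = {"cat": 0.9, "dog": 0.1}
dog_voter = {"cat": 0.2, "dog": 0.8}
ensemble = [cat_voter] * 8 + [dog_voter] * 2

averaged = average_predictions(ensemble)
print(averaged)  # "cat" comes out well ahead of "dog"
```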
In this case, I trained a bunch of models, each on an entirely different data-set, and looked into mixing their predictions. This also ties in nicely with calculating a joint distribution given a marginal distribution of styles.
One of my end goals is to apply this interactive control to the audio-visual domain (images, music, sound etc.), but I decided to start with character based text (i.e. the models produce text character-by-character. Karpathy’s post explains this really well). I chose this for a number of reasons:
- Character based text is relatively low dimensional (compared to say images or raw audio), and so it requires much less processing power, memory, training time, storage space etc.
- Training data is very easy to find (thank you Project Gutenberg, and Twitter, and web scraping)
- The ‘style’ is relatively simple to judge qualitatively and unambiguously (it’s very easy to identify the language of Trump vs the Bible vs C code vs Jane Austen)
- Graves and Karpathy demonstrated that LSTM RNNs are very good at modelling character based text.
I trained one LSTM RNN model per data-set. i.e. For every ‘style’ there is a corresponding data-set (e.g. Trump, Love Song lyrics, Jane Austen etc.), and a corresponding model trained on that data-set.
Conceptually, what the system does is very simple:
- All of the models (i.e. styles) are loaded at the start of the application and always in memory.
- All of the models (i.e. styles) are fed the same text sequence and they each make a prediction for the next character. To be more precise, each model produces a probability distribution over all characters, for the next character.
- The system does a weighted mix of those probability distributions (i.e. to calculate a joint distribution)
- A character is sampled from the joint distribution.
- The mixture weights are controlled interactively by a user in real-time.
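The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual back-end: each ‘model’ here is just a hypothetical callable that takes the text so far and returns a dictionary of character probabilities (the real models are LSTMs), and `get_weights` stands in for whatever the user interface currently reports.

```python
import random

def mix_distributions(distributions, weights):
    """Weighted average of per-model next-character distributions.

    distributions: list of dicts mapping character -> probability
    weights: list of non-negative mixture weights, one per model
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one mixture weight must be positive")
    mixed = {}
    for dist, w in zip(distributions, weights):
        for ch, p in dist.items():
            mixed[ch] = mixed.get(ch, 0.0) + w * p / total
    return mixed

def sample_char(dist, rng=random):
    """Draw one character in proportion to its probability (not the argmax)."""
    chars = list(dist.keys())
    probs = [dist[c] for c in chars]
    return rng.choices(chars, weights=probs, k=1)[0]

def generate(models, get_weights, seed_text, n_chars, rng=random):
    """Generate n_chars characters, re-reading the weights at every step."""
    text = seed_text
    for _ in range(n_chars):
        dists = [m(text) for m in models]            # every model predicts
        mixed = mix_distributions(dists, get_weights())
        text += sample_char(mixed, rng)              # sample, then feed back
    return text

# Toy stand-ins for two styles, both looking at "static unsigned lo".
c_model = lambda text: {"n": 0.9, "v": 0.1}      # wants "long"
love_model = lambda text: {"v": 0.8, "n": 0.2}   # wants "love"
print(generate([c_model, love_model], lambda: [0.5, 0.5],
               "static unsigned lo", 1))
```

Because the weights are read inside the loop, changing them affects only the characters generated from that point on, which is what makes mid-word style changes possible.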
With a bit of maths
Each model learns and then outputs:
$$P(x_{t+1} \mid x_1, x_2, \ldots, x_t, \theta_i)$$
This is the output of the ith model at time t and is the probability distribution for the next character, conditioned on the previous characters and the parameters of that model (written $\theta_i$ here). That last statement ‘[conditioned] on the parameters of the model’ might seem obvious and implicitly suggested, but when we have multiple models, each trained on a different data-set, thinking of the probability like this enables us to think of it as a conditional probability, conditioned on a particular style (i.e. data-set). So then, writing $w_i$ for the mixture weight of the ith model, we can calculate a joint probability distribution via
$$P(x_{t+1} \mid x_1, \ldots, x_t) = \frac{\sum_i w_i \, P(x_{t+1} \mid x_1, \ldots, x_t, \theta_i)}{\sum_i w_i}$$
(Here the denominator is just a normalising factor). The system then samples a character from this distribution to predict the next character. This new character is printed on screen, and fed back into each of the models so they’re all in sync, and at the next frame the whole process is repeated. (Actually I don’t feed the new character into all 24 models, only the ones which have a mixture weight greater than 5%. This is just an optimisation to not waste compute power on unused models. When a model becomes active for the first time, I do feed it the full history of characters — up to a max history of 80 characters — to make sure that it is making predictions on the correct text).
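The bookkeeping described here (skip models below the 5% threshold, replay up to 80 characters when a model becomes active) might look roughly like the sketch below. `StyleModel` is a hypothetical stand-in that only records the text it has been fed. The sketch also assumes that the weights are already normalised and that a model re-activated after being dormant is reset and caught up in the same way as one activated for the first time.

```python
MAX_HISTORY = 80         # max characters replayed into a newly active model
ACTIVE_THRESHOLD = 0.05  # models at or below this weight are skipped

class StyleModel:
    """Hypothetical stand-in for a stateful character-level RNN."""
    def __init__(self, name):
        self.name = name
        self.seen = ""  # the text this model's internal state reflects

    def reset(self):
        self.seen = ""

    def feed(self, chars):
        self.seen += chars

def sync_models(models, weights, history, new_char, was_active):
    """Feed new_char to active models, replaying history to newly active ones.

    Returns (names of models active on this step, updated history).
    """
    history = (history + new_char)[-MAX_HISTORY:]
    active = set()
    for model, w in zip(models, weights):
        if w <= ACTIVE_THRESHOLD:
            continue  # don't waste compute on (nearly) unused models
        active.add(model.name)
        if model.name in was_active:
            model.feed(new_char)   # already in sync, just one more character
        else:
            model.reset()
            model.feed(history)    # catch up on the recent text
    return active, history

models = [StyleModel("bible"), StyleModel("trump")]
active, history = sync_models(models, [1.0, 0.0], "", "a", set())
active, history = sync_models(models, [0.5, 0.5], history, "b", active)
print(active, [m.seen for m in models])
```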
I separated the prediction and visualisation into two separate processes which communicate via OSC:
- A back-end server which I wrote in python and Keras. It deals with the models (i.e. loading, running etc.)
- A front-end visualiser which I wrote in C++ and openFrameworks. It displays the interface, visualises the predicted probabilities from the models, and manages the interactivity. I.e. gets input from mouse, LeapMotion or midi faders, or potentially any other input such as a kinect or face feature detection etc. (NB. I started writing this bit in python as well using vispy, but I soon gave up. I’m too used to doing this kind of stuff in C++/openFrameworks).
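For a feel of what travels between the two processes, here is a minimal sketch of an OSC 1.0 message carrying a list of mixture weights, built with only the Python standard library. The address `/style/weights`, the port number and the function names are assumptions for illustration. The actual messages and addresses used by the demo aren't documented here, and the real front-end is C++, so this only shows the wire format.

```python
import socket
import struct

def osc_string(s):
    """Encode an OSC-string: ASCII bytes, null-terminated, padded to 4 bytes."""
    b = s.encode("ascii") + b"\x00"
    return b + b"\x00" * (-len(b) % 4)

def osc_message(address, floats):
    """Build an OSC 1.0 message carrying only float32 arguments."""
    type_tags = "," + "f" * len(floats)          # e.g. ",fff" for three floats
    payload = b"".join(struct.pack(">f", x) for x in floats)  # big-endian
    return osc_string(address) + osc_string(type_tags) + payload

def send_weights(weights, host="127.0.0.1", port=9000):
    """Send mixture weights over UDP (fire and forget). Port is hypothetical."""
    packet = osc_message("/style/weights", weights)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet

packet = osc_message("/style/weights", [0.5, 0.25])
print(len(packet), packet)
```

Keeping the two halves decoupled like this means any other input device only has to emit the same kind of message.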
The system works surprisingly well. One way of using the system is just mixing a bunch of styles and letting it generate. In this case, because the system takes a weighted average of the probability distributions, it generally tends to pick characters that are common to all active models, so the output is usually not very interesting.
E.g. in the example above, the text leading up to the prediction is “…not make A_”. Below is the probability distribution from the Jane Austen model over the 128 characters of the standard ASCII set. This model is very confident that the next letter should be ‘n’, probably to write ‘Anne’. There is a big red spike at the letter ‘n’ (I know that it’s the letter ‘n’ because I can see it in the software).
Below is the probability distribution for the same input sequence, from the Bible model. This is a much more varied distribution, the model isn’t very confident over any particular character, but the highest spike is at the letter ‘b’, probably to eventually write ‘Abraham’ or something.
Also the Chilcot model has a similar wider distribution, as does the Dalai Lama model. However in the joint (i.e. mixed) distribution below, we can see only a few bars, which are the characters that are common to all active models. This makes sense because we have six models active, and they each have around 5–20% mixture weight. So any character which is predicted by only one or two models, is going to have a very low probability in the final distribution. Whereas any character which is predicted in most of the models is going to accumulate probabilities and be quite dominant in the final distribution.
On the other hand, below is the Trump model’s prediction for the same sequence. This model is insanely confident that the next character should be an ‘m’. Obviously to go on to write ‘America’. In a case like this, even if the Trump model has a low mixture weight, this ‘spike’ in its probability distribution is likely to appear in the final joint distribution too. Then, if the letter ‘m’ is indeed sampled from the joint, the Trump model will again be super confident that the next character should be ‘e’. In short, the Trump model will remain dominant until it loses confidence.
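A small worked example shows why a confident model can punch above its mixture weight. All numbers here are made up: five ‘broad’ models spread their probability evenly over a toy 20-character alphabet and share 90% of the weight, while one ‘spiky’ model with only 10% weight puts almost everything on ‘m’.

```python
import string

ALPHABET = string.ascii_lowercase[:20]  # toy 20-character alphabet, 'a'..'t'

def mix(distributions, weights):
    """Weighted average of distributions, normalised by the sum of weights."""
    total = sum(weights)
    return {ch: sum(w * d[ch] for d, w in zip(distributions, weights)) / total
            for ch in ALPHABET}

# Broad models: no strong opinion, 5% on every character.
broad = {ch: 1.0 / len(ALPHABET) for ch in ALPHABET}

# Spiky model: 0.2% on everything except 'm', which gets the rest.
spiky = {ch: 0.002 for ch in ALPHABET}
spiky["m"] = 1.0 - 0.002 * (len(ALPHABET) - 1)

distributions = [broad] * 5 + [spiky]
weights = [0.18] * 5 + [0.10]

mixed = mix(distributions, weights)
top = max(mixed, key=mixed.get)
print(top, round(mixed[top], 4), round(mixed["a"], 4))
```

With these numbers ‘m’ ends up at roughly 0.14 in the mixed distribution, against roughly 0.045 for every other character, so the spike is about three times as likely to be sampled as any single alternative despite the low weight.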
I wrote more about these kinds of observations — as well as other comments and notes on future development ideas (e.g. priming, beam search etc)— in the paper (under ‘Results and discussion’) so I won’t go into more detail.
However for me the more fun aspect of this demo — and the original motivation — is not letting it sit and generate with mixed styles, but to allow a human user to interactively steer and guide it while it’s generating, like conducting an orchestra. That’s really one of my long term goals, to get some AI to trawl through a ton of data, learn a bunch of stuff, and then allow somebody to collaborate with it, guide it, jam with it. Hopefully the video at the top hints at this.
In addition to my ongoing research in this field as part of my PhD, this work was supported by a residency at Google’s Artists and Machine Intelligence Program. In that capacity I’d like to especially thank Mike Tyka, Kenric McDowell, Blaise Aguera y Arcas and Andrea Held; and Doug Fritz, Douglas Eck, Jason Yosinski, Ross Goodwin, Hubert Eichner and John Platt for the inspiring conversations and ideas. The work and ideas I talk about here were also inspired by many many others, but I’d like to give a particular shout out to Allison Parrish and Ross Goodwin.
P.S. Here is some documentation of the demo at NIPS 2016.