Collaborative creativity with Monte Carlo Tree Search and Convolutional Neural Networks (and other image classifiers)
This is the first of a series of articles [1, 2, 3] on research I’ve done over the past few years that I’m only just getting round to writing about. I started this particular research last summer (2015), paused for a while, and carried on earlier this year after the release of TensorFlow (Jan/Feb 2016). I just presented it at the Constructive Machine Learning workshop at NIPS 2016 in Barcelona, and below is the rough transcript of my presentation (with some additional references).
First I’d like to give context to this research. I’m interested in generative systems that allow or enhance creative expression in a realtime, interactive manner, such that there’s a creative feedback loop between the user and the system. Playing the piano is an example of this. Painting on a canvas is also an example of this. One may enter a state of ‘flow’, and respond in realtime to what one is creating. Tools like Photoshop, or more complex 3D animation software, are not examples of this, as powerful as they are. These motivations should be familiar to many people working in similar fields.
Recently I’ve been exploring generative deep models (nice post here) in this context. In the past few years there has been some very impressive work on generating images with deep models, using methods such as Activation Maximisation, Generative Adversarial Networks, Variational Auto-Encoders, and autoregressive models such as PixelRNN and PixelCNN.
However, these image generation techniques are far from realtime, let alone interactive. Even if they were to be optimised to run realtime, they do not allow realtime control of the output in a manner to allow for creative expression — at least to the extent that I’d like. (NB. Actually the very recent PPGN can run realtime, and is interactive to some degree, though still not exactly what I’m after. It is very impressive nevertheless).
So I’d like to harness the capabilities of deep models — learning hierarchies of features and semantic latent representations directly from high dimensional raw inputs with minimal or without hand-crafted feature engineering — but in the context of realtime interaction for creative expression as outlined above.
I’ve been conducting a few experiments to explore this. In fact, I shared another study using Recurrent Neural Networks in the NIPS Demo session on Tuesday and the RNN Symposium on Thursday, and I’ll be writing about that soon too. But today I’m going to share this study using an agent driven by Monte Carlo Tree Search and image classifiers such as deep Convolutional Neural Networks.
In this particular study, my aim is to have an agent that can autonomously draw something novel, based on instructions. E.g. I ask it to draw a ‘cat’, and it should draw a cat. I don’t want it to draw based off a particular image of a cat, or a collage of images of cats. I don’t even want it to ‘sample’ a random cat image (e.g. from a latent space, as a VAE or GAN might), and then draw what it has sampled. I don’t want the final image to be ‘decided’ the instant the agent starts drawing. I want the agent to know what a cat looks like, and I want it to just start drawing, and respond to what it is seeing. I want it to plan ahead, but continually adjust its drawing and plans as it draws.
And more importantly, while it’s drawing I want to be able to influence the agent. I want to be able to push it around, and directly affect what it’s drawing (e.g. by exerting forces on it). Also, I want to draw on the page too, and the agent should adapt. E.g. I might draw a tail somewhere, and the agent should recognise that and draw the rest of the cat to fit the tail that I drew. Or I might just draw some random lines and shapes, and the agent should try and use those lines and shapes in any way it sees fit. It should plan to include my drawings, intelligently incorporating them into its picture.
So the agent should see, imagine, plan and respond. This should be a collaborative process, that evolves in realtime, continually, interactively.
At least, that was the goal.
Models of creativity and collaboration aren’t the purpose of this research, but very loosely, the system is based on ideas of:
- Creativity: as an efficient method of searching a large space of actions or decisions,
- Imagination: as the ability to simulate many different scenarios and outcomes,
- Evaluation: as the ability to evaluate both the current situation, as well as the outcome of many imagined actions and scenarios, against some desired criterion,
- Collaboration: as the ability to respond — somewhat intelligently and creatively — in realtime, to some kind of external input, such as the actions of another user.
This is an incredibly brief description of these topics, and I can’t go into more detail now. But suffice to say it forms the basis of the implementation explained below.
The basic idea is that the agent is a bit like a Logo Turtle: it can move forward or rotate left or right, and it draws as it moves. It uses Monte Carlo Tree Search (MCTS) for planning and decision making (the same algorithm that forms the backbone of AlphaGo; I wrote about that here).
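To make the turtle idea concrete, here is a minimal sketch of such an agent in Python. Everything specific in it is an assumption for illustration and not taken from the actual implementation: the action names, the stride length, the 15 degree turn angle, and the crude `rasterise` helper that stands in for rendering the drawing to a texture.

```python
import math
from dataclasses import dataclass

ACTIONS = ("forward", "left", "right")  # hypothetical action names


@dataclass(frozen=True)
class Turtle:
    x: float
    y: float
    heading: float  # radians


def step(t, action, stride=1.0, turn=math.radians(15)):
    """Apply one action. Returns (new turtle, line segment drawn or None)."""
    if action == "left":
        return Turtle(t.x, t.y, t.heading + turn), None
    if action == "right":
        return Turtle(t.x, t.y, t.heading - turn), None
    if action != "forward":
        raise ValueError(f"unknown action: {action}")
    nx = t.x + stride * math.cos(t.heading)
    ny = t.y + stride * math.sin(t.heading)
    return Turtle(nx, ny, t.heading), ((t.x, t.y), (nx, ny))


def rasterise(segments, size=28, samples=20):
    """Mark every grid cell touched by a segment (assumed stand-in for a renderer)."""
    grid = [[0.0] * size for _ in range(size)]
    for (x0, y0), (x1, y1) in segments:
        for i in range(samples + 1):
            a = i / samples
            col = int(round(x0 + a * (x1 - x0)))
            row = int(round(y0 + a * (y1 - y0)))
            if 0 <= row < size and 0 <= col < size:
                grid[row][col] = 1.0
    return grid


# Draw an 'L' shape: five steps along x, a quarter turn, five steps along y.
turtle = Turtle(5.0, 14.0, 0.0)
segments = []
for action in ["forward"] * 5 + ["left"] * 6 + ["forward"] * 5:
    turtle, segment = step(turtle, action)
    if segment is not None:
        segments.append(segment)
image = rasterise(segments)
```

The list of segments is all a planner needs to keep around: any imagined sequence of actions can be replayed through `step` and rasterised for the classifier without touching the real drawing.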
More details on the implementation can be found in the paper, but briefly: at every timestep, MCTS runs many simulations of random actions, i.e. the agent ‘imagines’ what would happen if it took particular actions. These simulations are run to an arbitrary depth (i.e. a number of timesteps into the future). Each simulated trajectory is rendered into a texture and fed into an image classifier (such as a Convolutional Neural Network), and the probability returned by the classifier is back-propagated up the partially expanded decision tree as a reward. This process of ‘simulating and evaluating’ is repeated many times per timestep. Because of the probabilistic nature of MCTS, and the balance between exploration and exploitation, the system converges towards imagining actions and trajectories which are more likely to produce the desired outcome. When a predetermined time budget is reached (e.g. 100ms, so that the system stays realtime, at interactive rates), MCTS stops simulating new actions and trajectories, and picks the most robust child, i.e. the most visited (and so most ‘promising’) action. The agent then performs that action (i.e. makes that move), and at the next timestep the whole process of ‘imagining and evaluating’ is repeated.
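The loop described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code from the paper: the action names, the `plan` function, the `horizon` parameter and the UCB1 selection rule are all my choices here, and `reward_fn` stands in for "render the trajectory and ask the classifier for the probability of the target class".

```python
import math
import random
import time

ACTIONS = ("forward", "left", "right")  # hypothetical action names


class Node:
    def __init__(self, parent=None, action=None):
        self.parent = parent
        self.action = action          # action taken to reach this node
        self.children = {}            # action -> Node
        self.visits = 0
        self.total_reward = 0.0
        self.depth = 0 if parent is None else parent.depth + 1

    def actions_from_root(self):
        node, actions = self, []
        while node.parent is not None:
            actions.append(node.action)
            node = node.parent
        return actions[::-1]


def ucb1(child, parent_visits, c=1.4):
    # Assumed selection rule: mean reward plus an exploration bonus.
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def plan(history, reward_fn, budget_s=0.1, horizon=10, rng=random):
    """Return the next action, given the actions already taken (history)."""
    root = Node()
    deadline = time.monotonic() + budget_s
    while root.visits == 0 or time.monotonic() < deadline:
        node = root
        # Selection: descend while every action at this node has been tried.
        while node.depth < horizon and len(node.children) == len(ACTIONS):
            parent = node
            node = max(parent.children.values(),
                       key=lambda ch: ucb1(ch, parent.visits))
        # Expansion: add one untried action.
        if node.depth < horizon:
            untried = [a for a in ACTIONS if a not in node.children]
            action = rng.choice(untried)
            node.children[action] = Node(node, action)
            node = node.children[action]
        # Simulation: complete the imagined trajectory with random actions.
        imagined = node.actions_from_root()
        imagined += [rng.choice(ACTIONS) for _ in range(horizon - node.depth)]
        # Evaluation: score the whole imagined drawing.
        reward = reward_fn(list(history) + imagined)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Robust child: the most visited action at the root.
    return max(root.children.values(), key=lambda ch: ch.visits).action


def toy_reward(actions):
    # Toy stand-in for the classifier: fraction of 'forward' moves.
    return sum(a == "forward" for a in actions) / len(actions)


next_action = plan([], toy_reward, budget_s=0.05, rng=random.Random(0))
```

Because the tree is rebuilt from the current state at every timestep, anything that changes the drawing between two calls to `plan` (the user pushing the agent, or adding their own strokes) is simply part of what the next round of simulations gets evaluated against.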
I tested this with a number of models.
MNIST + Multinomial Logistic Regression
Because it’s compulsory in the Machine Learning world, I first tested the system on MNIST, a dataset of handwritten digits. The first model I trained and tested is a simple Multinomial Logistic Regression.
The task of this agent is to draw a ‘3’. And as you can see, it’s working pretty well. The top left viewport shows the outcome of what the agent is drawing. The top centre viewport is what the classifier sees, i.e. the drawing scaled down to 28x28. The top right viewport shows all of the simulated (i.e. ‘imagined’) paths (all of the bright white squiggly lines). So at every timestep, each one of those white squiggly lines is being imagined by the MCTS and evaluated by the classifier. Those imagined paths are random, but not totally random (i.e. not drawn from a uniform distribution). As the MCTS imagines more and more paths, it starts to get an idea of what performs better and what doesn’t, so it leans towards picking actions that produce paths closer to the desired outcome (but it always tries to keep a balance between exploiting what it knows to work and exploring new territory).
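The balance between exploiting and exploring is usually made precise with a selection rule. I don’t state above which rule this system uses, so take the following as the common textbook choice (UCB1 applied to trees, often called UCT) and not as a description of the implementation. At a node visited $N$ times, the child action $a$ with visit count $N_a$ and accumulated reward $W_a$ is chosen by:

```latex
a^{*} = \arg\max_{a} \left( \frac{W_a}{N_a} + c \sqrt{\frac{\ln N}{N_a}} \right)
```

The first term favours actions whose imagined paths have scored well so far (here, high classifier probability). The second term grows for actions that have rarely been tried, and the constant $c$ (an assumed tuning parameter) sets how strongly the search is pulled towards new territory.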
The red bar that’s rising is the probability (i.e. confidence) of the image classifier that the drawing is of the desired class, in this case, a ‘3’.
Note this is not a deep model. It’s simply a linear transformation on the pixels of the image (i.e. Wx+b, where x is the vector of pixels, and W and b are respectively the weight matrix and bias vector), put through a softmax (nice tutorial on this here for TensorFlow, and here for Theano).
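As a rough illustration of how little is going on in this model, here is the forward pass written out in plain Python. The weights below are random placeholders and not a trained model, and the function names are mine; a real version would load trained W and b and would normally use a numerical library.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pixels, W, b):
    """pixels: 784 floats in [0, 1]; W: 10 rows of 784 weights; b: 10 biases.

    Returns softmax(Wx + b), one probability per digit class.
    """
    logits = [sum(w * x for w, x in zip(row, pixels)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


rng = random.Random(0)
W = [[rng.gauss(0.0, 0.01) for _ in range(784)] for _ in range(10)]  # placeholder, untrained
b = [0.0] * 10

pixels = [0.0] * 784
for row in range(10, 18):
    pixels[row * 28 + 14] = 1.0  # a short vertical stroke on the 28x28 grid

probs = classify(pixels, W, b)
reward = probs[3]  # confidence that the drawing is a '3'
```

The value `probs[3]` plays the role of the rising red bar: it is the number that gets fed back to the planner as the reward for an imagined drawing.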
MNIST + Deep CNN, LeNet
I also trained and tested a deeper Convolutional Neural Network (CNN), similar to LeNet-5 (second half of this tutorial here for TensorFlow, or here for Theano). A CNN is a network architecture inspired by the visual cortex (particularly based on the work of Hubel and Wiesel) that specialises in image processing. I’ll skip the results on this for now as I’ll summarise them in the conclusion.
ImageNet + Deep CNN, Inception-v3
I then tried a much deeper CNN. Specifically, I downloaded Inception-v3, Google’s 2015 state-of-the-art architecture and classification model, pre-trained on ImageNet (millions of images in 1000 categories).
Here’s the agent trying to draw a ‘meerkat’ using this model as the classifier, and much to my disappointment, failing miserably. It looks nothing like a meerkat, but just some kind of noise.
And here it is trying to draw a ‘white wolf’, and again, failing miserably.
Interestingly, in both of these latter two cases, the confidence of the classifier that the outcome is of the desired class is very high (above 99.9% in fact). You can just about see this in the videos as the very thin red line that slowly rises (the other blue lines at the bottom of the screen are the probabilities for the remainder of the 1000 classes).
So the failure in this case is not necessarily a failure of the planning done by MCTS, or the way the MCTS is integrated with the classifier. In fact, the first MNIST study is a good demonstration that the overall system architecture and concept works. The failure in this case is because Google’s very deep model trained on ImageNet is almost 100% confident that this random-looking noise is indeed a meerkat (in the first example) or a white wolf (in the second example). So the classifier is feeding an undesired reward signal into the MCTS (‘undesired’, because clearly the results are not as we would like them).
It turns out these networks are really easy to fool. This isn’t specific to this particular model by Google; it applies to deep CNNs trained for classification in general. They’re discriminative models, trained to classify images, so they have all kinds of tricks in their architecture (weight-shared convolution layers, maxpool layers) to provide translational, scale (and a little bit of rotational) invariance. While the networks do classify natural images correctly with super-human performance, they also classify noise and ‘random shapes’ incorrectly, with equally high confidence.
In short, they produce lots of false positives: the manifold of what the network thinks a meerkat looks like includes what meerkats really look like, as well as a whole bunch of other junk.
Actually this was discovered by others around the same time I started working on this idea. They used a very different method, evolving input pixels to maximise desired neuron activations, which is a very non-realtime, non-interactive process (in fact, this same research is what gave rise to DeepDream). In the past few years there has been some work on overcoming these problems using natural image priors and adversarial networks, which I mentioned at the start of this article. These are methods I’m now also looking to incorporate into my MCTS-driven agent, to try and constrain its output to the sub-manifold of more natural-looking images. Alternatively, I’m also looking at other classifiers altogether, perhaps not trained on ImageNet, that might not produce as many false positives.
(For a hopefully not very technical intro to Manifolds and Latent Spaces see the relevant sections in this very long post).
NB. It’s very interesting to note that while the output of the first ImageNet study looks like complete junk noise to us, and the classifier is almost 100% confident that the image is that of a meerkat, the rest of its top 5 guesses are mongoose, hyena, badger and cheetah, arguably all meerkat-like animals. In the second ImageNet study the output again looks like complete junk noise to us, but it clearly looks different to the meerkat noise. Again the classifier is almost 100% confident that the image is a white wolf, and again the rest of its top 5 guesses are timber wolf, arctic fox, samoyed and West Highland white terrier, all dogs (or dog-like animals). So clearly there is something wolf-like about this particular distribution of noise, and something meerkat-like in the other. I wish I had tried this on non-animal classes like church, submarine or tennis racket before I put this research on hold to move on to other things. As soon as I find the source I will try it out again!