Autoencoding Blade Runner
In this blog post I detail the work I have been doing over the past year in getting artificial neural networks to reconstruct films: training them to reconstruct individual frames from films, then getting them to reconstruct every frame in a given film and resequencing the results.
The type of neural network used is an autoencoder. An autoencoder is a type of neural net with a very small bottleneck: it encodes a data sample into a much smaller representation (in this case a vector of 200 numbers), then reconstructs the data sample to the best of its ability. The reconstructions are in no way perfect, but the project was more of a creative exploration of both the capacity and limitations of this approach.
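To make the bottleneck idea concrete, here is a minimal sketch in plain Python. It is a toy and not the model used in this project: the weights are fixed by hand rather than learned, and the names `ENCODER`, `encode`, `decode` and `mse` are purely illustrative. Four input values are squeezed into two and then expanded again, so some detail is inevitably lost.

```python
import math

# Toy linear "autoencoder": 4 input values squeezed through a 2-value bottleneck.
# The weights are fixed by hand for illustration; a real autoencoder learns them.
S = 1 / math.sqrt(2)
ENCODER = [
    [S, S, 0.0, 0.0],
    [0.0, 0.0, S, S],
]


def encode(x):
    """Project the input onto the bottleneck (one dot product per latent value)."""
    return [sum(w * v for w, v in zip(row, x)) for row in ENCODER]


def decode(z):
    """Map the latent code back to input space using the transposed weights."""
    n_inputs = len(ENCODER[0])
    return [sum(ENCODER[k][i] * z[k] for k in range(len(z))) for i in range(n_inputs)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


x = [1.0, 3.0, 5.0, 5.0]
z = encode(x)        # two numbers instead of four
x_hat = decode(z)    # approximately [2, 2, 5, 5]
print(z, x_hat, mse(x, x_hat))
```

The first two values (1 and 3) come back as roughly their average, because the bottleneck had no room to store both. The last two survive intact because they were already identical.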
This work was done as the dissertation project for my research masters (MSci) in Creative Computing at Goldsmiths.
In the past 12 months, interest in, and development of, artificial neural networks for the generation of text, images and sound has exploded. In particular, methods for the generation of images have advanced remarkably in recent months.
In November 2015, Radford et al. blew away the machine learning community by using a deep neural network to generate realistic images of bedrooms and faces. Their approach uses an adversarial training method in which a generator network generates random samples, and a discriminator network tries to determine which images are generated and which are real. Over time the generator becomes very good at producing realistic images that can fool the discriminator. The adversarial method was first proposed by Goodfellow et al. in 2014, but until Radford et al.’s paper, it hadn’t been possible to generate coherent and realistic natural images using neural nets. The important breakthrough that made this possible was the use of a convolutional architecture for the generation of images. Before this it had been assumed convolutional neural nets could not be used effectively for the generation of images, as the use of pooling layers lost spatial information between layers. Radford et al. did away with pooling layers entirely and simply used strided backwards convolutions. (If you are not familiar with what a convolutional neural network is, I made an online visualisation of one.)
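The tug-of-war between generator and discriminator can be sketched with the two losses involved. This is a simplified illustration using the standard binary cross-entropy formulation, with the commonly used "non-saturating" form for the generator. The values `d_real` and `d_fake` are hypothetical discriminator outputs between 0 and 1 (the probability that an image is real), not numbers from any trained network.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: low when real images score high and generated ones low."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)


# Hypothetical scores: early in training the discriminator easily spots fakes
# (d_fake = 0.1); later the generator fools it more often (d_fake = 0.9).
early = generator_loss(0.1)
late = generator_loss(0.9)
print(early, late)

# A discriminator that cannot tell real from fake outputs 0.5 for everything.
confused = discriminator_loss(0.5, 0.5)
print(confused)
```

The generator's loss falls as its samples are scored closer to "real", which is exactly what pushes the discriminator's loss up. Each network's improvement makes the other's task harder.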
I had been investigating generative models prior to Radford et al.’s paper, but when it was published it was obvious that this was the approach to follow. However, generative adversarial networks cannot reconstruct images; they only generate samples from random noise. So I started investigating ways to train a variational autoencoder (which can reconstruct images) with the discriminator network that is used in the adversarial approach, or even some kind of network to assess how similar a reconstructed sample is to the real sample. But before I even had a chance to do that, Larsen et al. published a paper that combined both of those approaches in a very elegant way: by comparing the difference in response of the real and reconstructed samples in the higher layers of a discriminator network, they are able to produce a learned similarity metric that is far superior to a pixel-wise reconstruction error comparison (which otherwise leads to a blurred reconstruction; see Fig. 2).
Larsen et al.’s model consists of three separate networks: an encoder, a decoder and a discriminator. The encoder encodes a data sample x into a latent representation z. The decoder then attempts to reconstruct the data sample from the latent representation. The discriminator processes the original and reconstructed data samples, assessing whether they are real or fake, and the responses in the higher layers of this network are compared to assess how similar the reconstruction is to the original sample.
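The difference between a pixel-wise error and a feature-wise error can be shown with a toy one-dimensional "image". This is only a sketch of the idea and not Larsen et al.'s implementation. The function `toy_features` is a hand-written, hypothetical stand-in for the learned responses of a discriminator layer (it just pools edge strengths), and the three signals are made up for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def toy_features(signal):
    """Hypothetical stand-in for a discriminator layer: pooled edge responses."""
    edges = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return [max(edges), sum(edges)]  # strongest edge, total edge energy


def feature_loss(x, x_hat, features=toy_features):
    """Compare two samples by their feature responses rather than their pixels."""
    return mse(features(x), features(x_hat))


original = [0, 0, 1, 1, 0, 0, 0, 0]          # a sharp bar
shifted = [0, 0, 0, 1, 1, 0, 0, 0]           # same bar, one pixel to the right
blurred = [0, 0.25, 0.5, 0.5, 0.25, 0, 0, 0]  # smeared-out bar

print("pixel  :", mse(original, shifted), mse(original, blurred))
print("feature:", feature_loss(original, shifted), feature_loss(original, blurred))
```

The pixel-wise error prefers the blurred bar over the sharp but slightly displaced one, which is the tendency that produces blurry reconstructions. The feature-wise error prefers the sharp one. In the real model the features are learned by the discriminator rather than written by hand.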
I implemented the model in TensorFlow, with the intention of extending it with an LSTM in order to do video prediction. Unfortunately, due to time constraints, I was not able to pursue this. It did, however, lead me to build this model to generate large non-square images. The previous models described both modelled images at a resolution of 64x64 with a batch size of 64. I scaled the network up to model images at a resolution of 256x144 with a batch size of 12 (the largest I could fit on my GPU, an NVIDIA GTX 960). The latent representation has 200 variables, meaning the model encodes a 256x144 image with 3 colour channels (110,592 variables) into a 200-variable representation before reconstructing the image. The network was trained on a dataset of all of the frames of Blade Runner, cropped and scaled to 256x144. The network was trained for 6 epochs, taking about 2 weeks on my GPU.
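The scale of that compression is easy to check with a little arithmetic. The variable names below are illustrative and are not taken from the TensorFlow implementation.

```python
# Dimensions quoted in the text
WIDTH, HEIGHT, CHANNELS = 256, 144, 3
LATENT_DIM = 200

# Number of values in one input frame
input_values = WIDTH * HEIGHT * CHANNELS

# How many input values each latent variable has to account for, on average
compression_ratio = input_values / LATENT_DIM

print(input_values)       # 110592
print(compression_ratio)  # roughly 553 input values per latent variable
```

In other words, each of the 200 latent variables has to summarise over five hundred colour values, which goes some way to explaining why fine detail does not survive the round trip.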
Ridley Scott’s Blade Runner (1982) is the film adaptation of the classic science fiction novel Do Androids Dream of Electric Sheep? by Philip K. Dick (1968). In the film, Rick Deckard (Harrison Ford) is a bounty hunter who makes a living hunting down and killing replicants: androids that are so well engineered that they are physically indistinguishable from human beings. Deckard has to issue Voight-Kampff tests in order to distinguish androids from humans, asking increasingly difficult moral questions and inspecting the subject’s pupils, with the intention of eliciting an empathic response in humans, but not androids.
One of the overarching themes of the story is that the task of determining what is and isn’t human is becoming increasingly difficult as technology develops. The new ‘Nexus-6’ androids developed by the Tyrell Corporation start to develop their own emotional responses over time, and the new prototype Rachael has had memory implants, leading her to think that she is human. The method of determining what is human and what is not is most certainly borrowed from the methodological skepticism of the great French philosopher René Descartes. Even the name Deckard is strikingly similar to Descartes. Deckard goes through the film trying to determine who is and who isn’t human, with the unspoken suggestion that Deckard himself doubts whether he is human.
I won’t go into all of the philosophical issues explored in Blade Runner (there are two good articles that explore this), but what I will say is this: while advances in deep learning systems are coming about through those systems becoming increasingly embodied within their environments, a virtual system that perceives images but is not embodied within the environment that the images represent is, at least allegorically, a model that shares a lot with Cartesian dualism, where mind and body are separated.
An artificial neural network, however, is a relatively simple mathematical model (in comparison to the brain), and anthropomorphising these systems too readily can be problematic. Despite this, the rapid advances in deep learning mean that how models are structured within their environments, and how that relates to theories of mind, must be considered for their technical, philosophical and artistic consequences.
Blade Runner — Reconstructed
The reconstructed film is surprisingly coherent. It is by no means a perfect reconstruction, but considering that this is a model that is only designed to model a distribution of images of the same type of thing taken from the same perspective, it does a good job given how varied all of the different frames are.
Scenes which are static, high contrast and have little variation are reconstructed very well by the model. This is because the model has, in effect, seen the same frame many more times than the 6 epochs of training would suggest. This would normally be considered overfitting, but given that the training dataset is deliberately skewed, this is not of great concern.
The model does, however, have a tendency to collapse lots of similar frames with little variation into a single representation (e.g. an actor speaking in a static scene). As the model is only representing individual images, it is unaware that there is subtle variation frame-to-frame and that this variation may be important.
The model also struggles to make a recognisable reconstruction when the scene is very low contrast, especially with faces. It likewise struggles to reconstruct faces when there is a lot of variation, such as when someone’s head is moving or rotating. This is not particularly surprising given the limitations of the model and people’s sensitivity for recognising faces. There have been some very recent advances combining generative models with spatial transformer networks that may address these issues, but that is beyond the scope of this current project.
Reconstructing Other Films
In addition to reconstructing the film that the model was trained on, it is also possible to get the network to reconstruct any video that you want. I experimented with several different films, but the best results were definitely from reconstructing one of my favourite films, Koyaanisqatsi (1982). The film consists almost entirely of slow motion and time-lapse footage, with a huge variety of different scenes, making it the ideal candidate for seeing how the Blade Runner model reconstructs different kinds of scenes.
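The procedure for pushing any video through a trained model is the same frame-by-frame loop used for Blade Runner itself: encode each frame, decode it, and put the results back in their original order. Below is a rough sketch of that loop. It is not the actual project code: `encode` and `decode` are placeholders for whatever trained networks are available, `batched` and `reconstruct_film` are illustrative names, and the default batch size of 12 simply mirrors the one mentioned earlier.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items, preserving order."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def reconstruct_film(frames, encode, decode, batch_size=12):
    """Pass every frame through encode -> decode and return them in source order."""
    reconstructed = []
    for batch in batched(frames, batch_size):
        latents = [encode(frame) for frame in batch]
        reconstructed.extend(decode(z) for z in latents)
    return reconstructed


# Stand-in "frames" and a deliberately lossy stand-in model (both hypothetical):
# integers play the role of frames, and integer division plays the bottleneck.
frames = list(range(30))
output = reconstruct_film(frames, encode=lambda f: f // 4, decode=lambda z: z * 4)
print(len(output), output[:6])
```

Because each frame is handled independently, nothing in this loop knows about the frames before or after it. Any temporal coherence in the result comes from the source footage, not from the model.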
It is no great surprise that the model does a much better job reconstructing the film it was trained on, in comparison to videos it has never seen before. This could certainly be improved by training the network on a much larger and more varied dataset, such as hundreds of hours of random videos. But the model would then almost certainly lose the aesthetic quality it captures that is inherent in a single, self-contained film. And while individual frames are more often than not unrecognisable as the scene they are depicting when viewed alone, in motion the reconstructions are temporally coherent and have a rich unpredictability.
In addition to Koyaanisqatsi, I also got the network to reconstruct two other films. One was the famous 1984 Apple Macintosh advertisement, directed by Ridley Scott (who also directed Blade Runner); Scott was hired to direct the advert after Steve Jobs had seen Blade Runner at the cinema. The advert shares a lot with Blade Runner in terms of visual style, so it was a suitable choice for testing the model on something similar.
The other film I got the network to reconstruct was John Whitney’s seminal animation Matrix III. John Whitney was a pioneer of computer animation and was IBM’s first ever artist in residence, from 1966 to 1969. Matrix III (1972) was one of a series of films demonstrating the principle of “harmonic progression” in animation. This film was chosen to see how well the model could reconstruct abstract, non-natural image sequences.
Autoencoding A Scanner Darkly
After reconstructing Blade Runner, I wanted to see how the model would perform being trained on a different film. I chose the 2006 film A Scanner Darkly, another adaptation of a Philip K. Dick novel, as it is stylistically very different from Blade Runner. Interestingly, A Scanner Darkly was animated using the interpolated rotoscope method, meaning it was filmed on camera and then every frame was traced by hand by an animator.
The model does a reasonably good job of capturing the style of the film (though not nearly as well as the recent style transfer for videos), but struggles to an even greater degree in reconstructing faces. This is probably because of the high-contrast outlines and complexities of the facial features, as well as the exaggerated and unnatural frame-to-frame variation in shading.
Once again, the reconstructions of other films through the model are most often unrecognisable. The results are less temporally coherent than those of the Blade Runner model, which is probably due to the fact that there are many more colours in the distribution of images in A Scanner Darkly, and that modelling natural images (as opposed to stylised ones) may be more difficult for this model. On the other hand, the images are incredibly unusual and complex, once again producing video with a rich unpredictability.
In all honesty, I was astonished at how well the model performed as soon as I started training it on Blade Runner. The reconstruction of Blade Runner is better than I ever could have imagined, and I am endlessly fascinated by the reconstructions these models make of other films. I will certainly be doing more experiments training these models on more films in future to see what they produce. In further work I would like to adapt the training procedure to take into account the consecutive order of frames, primarily so the network can better differentiate between long sequences of similar frames.
This project was completed as the dissertation for my MSci in Creative Computing at Goldsmiths, University of London under the supervision of Dr Mick Grierson.
This project would not have been possible were it not for Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle and Ole Winther’s paper Autoencoding beyond pixels using a learned similarity metric.
*This article was edited on the 26th of May to clarify what an autoencoder was, how many latent variables were used, and what the objective of the project was.