Review of machine / deep learning in an artistic context

I'm currently working towards a PhD at Goldsmiths, University of London under Dr Mick Grierson, Dr Rebecca Fiebrink & Prof Huosheng Hu, with working title “Realtime image and sound synthesis and expressive manipulation using deep learning and reinforcement learning in responsive environments” (Quite a mouthful I know, it’s academia after all). My research is at the intersections of algorithmic image and sound generation + human computer interaction / embodied interaction / expressive gestural interaction + artificial intelligence / machine learning / reinforcement learning / deep learning + responsive environments / virtual reality / augmented reality / mixed reality.

The text below consists of excerpts from my literature review from September 2015. Now, just 4 months after I submitted it, parts are already quite out of date (especially those related to image/sound synthesis using deep learning — developments in the field are incredibly fast!). Artistic interest in the field is also growing rapidly, with many contributions coming from artists as well as academia. So instead of publishing my results in years to come, or on purely academic platforms, I'm publishing this here now, as is, in case it (or bits of it) is useful to anyone. It was aimed at an academic audience, so I've tried to edit out the compulsory academese; hopefully it still makes sense. It's not meant as a tutorial for ML/DL, but as an overview and review of the current state of the field (as of September 2015); not a comprehensive review, but one focused on my research area. It is a bit long, but it can be read in sections and hopefully still make some sense, and act as a starting point for further research.


Introduction

Machine Learning (ML) is a field of Artificial Intelligence (AI) that investigates how algorithms can learn from observations and data, as opposed to humans explicitly programming each step of what the software should do. These algorithms enable computers to find complex relationships and patterns in data, and they produce outputs or decisions based on statistical models. The field remained primarily academic for decades. However, with recent developments, especially in Deep Learning (DL), and increases in computing power, ML is starting to appear in our everyday lives. These algorithms are now in our pockets powering speech recognition [Hinton et al. 2012], in our email clients filtering spam [Guzella and Caminhas 2009]; they are captioning images [Karpathy and Fei-Fei 2015], translating text [Sutskever and Vinyals 2014] and driving cars [Thrun et al. 2006].

For traditional shallow ML to work optimally, high-dimensional complex data needs to be manually analysed and domain-specific low-dimensional features hand-crafted [LeCun 2012]. This makes such methods inefficient for many real-world problems. DL algorithms, however, can learn which features to extract. They can eliminate — or at least minimise — this manual feature engineering phase by learning a sequence of representations suited to the problem, allowing them to operate directly on high-dimensional, complex, real-world data. These qualities have made them very popular in recent years, especially Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

Algorithmically generating images and sound is a well-established research area. However, recent developments in the field using deep learning have produced strikingly novel results (Nguyen et al., 2015; Gatys et al., 2015; Mordvintsev et al., 2015; Sturm, 2015; Radford et al., 2015). Unfortunately, these deep learning content generation techniques currently cannot run in realtime or interactively. Other recent research has combined deep learning with agent-based AI such as Monte Carlo Tree Search (MCTS) or Reinforcement Learning (RL) to allow adaptive online learning or planning. This has been demonstrated with a system learning to play video games simply by 'watching the screen', with no prior knowledge of the game (Mnih et al., 2013; Guo et al., 2014; Mnih et al., 2015).

Furthermore, there is a converging trend in industry between gaming, general entertainment and media consumption — a trend which is inspired and driven primarily by anti-disciplinary artists. Major film festivals around the world such as Sundance, Tribeca, and Toronto are exploring and promoting 'interactive storytelling', 'nonlinear narrative' and 'transmedia experiences'. Product launches are now accompanied by an almost compulsory 'immersive interactive experience'. Similar developments are being seen in music, dance and theatre.

Technologies that were once confined to academia and research labs are becoming mainstream consumer devices, and driving new markets. Inspired by CAVE-like immersive environments [Cruz-Neira et al. 1993], our own 2011 work [Akten et al. 2011] uses six consumer projectors and three Sony PlayStation 3s with PSMove controllers to projection map a living room with dynamic content reacting to its inhabitant. Two years later, Microsoft Research's IllumiRoom [Jones et al. 2013], followed by RoomAlive [Jones et al. 2014], demonstrated their interest in bringing this technology to the living room. Likewise, MIT Media Lab's SixthSense [Mistry and Maes 2009] investigates bringing augmented mixed reality to a smaller, personal, portable scale. Nintendo's Wii controller, Sony's PSMove controller and Microsoft's Kinect depth camera all brought alternative interaction paradigms to the living room. Virtual reality — for decades only found in academia or military use — is now making its way into the living room with Facebook's recent purchase of Oculus, Google's Cardboard VR, and many other mainstream technology brands. Even Kellogg's are making a cereal box which can be used as a cardboard VR headset, complete with their own iOS and Android app. Within a few years, with the launch of Microsoft HoloLens and the Google-backed Magic Leap, augmented and mixed reality is also likely to become a common household experience.

We've also seen trends in hardware platforms, expanding from PCs and dedicated gaming consoles to mobile phones and tablets, and potentially to other platforms such as emerging smartwatches, the Internet of Things, and even consumer quadrotor drones equipped with programmable embedded computers, all potential platforms for augmented games and activities.

Through pervasive, ubiquitous smart hardware and software, all of these developments are leading to the mainstream adoption of Multi-Modal Mixed Reality Responsive Environments. The most ground-breaking work in these fields is often driven by re-appropriated technology, and most frequently the pioneers of this discipline-bending misuse of technology are artists and creative hackers.


Brief overview of Machine Learning

Machine Learning is a field of artificial intelligence that investigates how a system can improve its performance on a task, with respect to a specific measure, based on its past experience [Mitchell 1997].

ML algorithms find complex relationships in data. They build models based on observations, and learn the rules required to make optimum decisions or predictions. They can recognise patterns in data that we humans may not be able to recognise. And even if we could recognise those patterns, we might not be consciously aware of how to formulate them in a way that we could program into a computer in traditional, non-ML ways.

This enables us to create systems which exhibit more intricate and complicated behaviour than we would be able to implement directly. Even though the behaviour of these systems may be statistically consistent with the training data, they may find patterns we were unaware of, and thus may exhibit unexpected behaviour. This is a quality of ML which is integral to our research: its ability to generate behaviour which is unpredictable, yet somehow controllable via training data. It is also both ML's greatest strength and its greatest danger — as it can introduce biases through over-fitting or under-fitting, or even learn undesired biases found in the training data. These dangers are amplified further because ML algorithms are relatively difficult to debug and peer inside of (Yosinski et al., 2015; Caruana et al., 2015).

The field has been studied for many decades. In his 1948 essay [Turing 1948], Alan Turing describes how machines could be designed to learn using what he named 'B-type unorganised machines', conceptual precursors to modern-day artificial neural networks. For many years machine learning remained primarily an academic research area. In many cases it did not see mainstream use due to high computational requirements and inferior performance compared to other AI methods [LeCun 2014]. However, advances in machine learning algorithms, and increases in computing power — specifically highly parallel GPU computing — have enabled dramatic advancements in how machine learning can be applied [Ciresan et al. 2011]. Gradually machine learning has outperformed other AI techniques in fields such as speech recognition [Hinton et al. 2012], natural language processing [Collobert et al. 2011], computer vision [Couprie et al. 2013], email spam filtering [Guzella and Caminhas 2009], image captioning [Karpathy and Fei-Fei 2015], robotics and self-driving cars [Thrun et al. 2006].


The key aspect of ML is supplying a learning algorithm with training data. In the simplest terms, during the training phase, the learning algorithm analyses the training data and builds a model. After training, during the prediction phase, new input data is presented to the model; the model processes the input and produces an output: a decision or prediction.

The input to the model is a vector (or more broadly speaking, a tensor, with dimensions and shape depending on the problem). The output of the model is also a vector (or tensor), which may or may not have the same dimensions as the input, again depending on the problem.

The model is parameterised by model parameters, and it's these model parameters that the learning algorithm tries to learn. Training the model consists of specifying an Objective function (also known as a Loss, Cost or Energy function); the learning algorithm then tries to find the set of model parameters which minimise the Objective function over the training data (or it may be a Utility function which the algorithm tries to maximise).

In reality, ML implementations may be a lot more complicated than this, sometimes without such clear distinctions between training and prediction phases (e.g. in Online Learning, Transfer Learning, Reinforcement Learning etc where the model is updated continually as new data becomes available), but these concepts are at the root of it all.
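To make these concepts concrete, here is a minimal, self-contained sketch (in Python/NumPy, not taken from any of the cited works) of the two phases: a toy linear model whose parameters are learned by gradient descent on a mean-squared-error objective, then used for prediction.

```python
import numpy as np

# Toy training data: inputs x and targets y with a roughly linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=(100, 1))

# Model parameters: these are what the learning algorithm tries to learn.
w, b = np.zeros((1, 1)), np.zeros(1)

def predict(x):
    return x @ w + b

def objective(x, y):
    # Mean squared error: the quantity the learning algorithm tries to minimise.
    return np.mean((predict(x) - y) ** 2)

# Training phase: gradient descent on the objective over the training data.
learning_rate = 0.1
for step in range(500):
    error = predict(x) - y
    grad_w = 2 * x.T @ error / len(x)
    grad_b = 2 * error.mean(axis=0)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Prediction phase: new inputs are run through the trained model.
print(objective(x, y))             # should be close to the noise floor
print(predict(np.array([[0.2]])))  # should be close to 3.0 * 0.2 + 0.5 = 1.1
```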

Supervised Learning

In Supervised Learning, the model is trained on labelled data, where each training example is an input-target pair. During training, the learning algorithm tries to learn the model parameters which effectively implement a function mapping the input of each training pair to the associated target. These targets can also be thought of as a supervisory signal. For classification problems, the target is usually a discrete class label, often represented as a one-hot vector (i.e. a vector where all elements are zero, except for the entry for the desired class, which is one). For regression problems, the target is usually a real-valued vector (or tensor). Having input-target pairs makes it relatively straightforward to specify the objective function, so Supervised Learning is currently one of the most popular and successful branches of ML. However, the training pairs often need to be manually associated by people, which makes large labelled datasets cumbersome and very time-consuming to prepare. Online crowd-sourcing platforms such as Mechanical Turk have helped accelerate the preparation of large labelled datasets, which is why we're starting to see more success in this field (LeCun, 2014).
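As a small illustration of the supervised setup, the sketch below shows a hypothetical toy dataset of input-target pairs (not from any cited work) and the one-hot encoding of class labels:

```python
import numpy as np

# A toy supervised dataset: each example is an (input, target) pair.
# Inputs are 2D feature vectors, targets are discrete class labels.
inputs = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
labels = np.array([0, 1, 0, 1])          # class indices
n_classes = 2

# For classification, targets are often represented as one-hot vectors:
# all zeros except a single one at the index of the desired class.
targets = np.eye(n_classes)[labels]
print(targets)
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]
#  [0. 1.]]

# For regression, the target would instead be a real-valued vector, e.g.
# targets = np.array([[0.15], [0.85], [0.15], [0.85]])
```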

Unsupervised Learning

In Unsupervised Learning, training is performed on unlabelled data. Without an external supervisory signal, it can be more ambiguous as to how to specify the Objective function, and unsupervised learning is currently one of the big open problems in ML (Le et al. 2012). One of the common training objectives of unsupervised learning is found in an Auto-Encoder, in which the target is the same as the input, and the learning algorithm tries to learn how to compress and decompress each training example with minimal loss. If successful, it finds regularities in the training data, and learns more compact and meaningful representations. Another common objective is clustering, in which the learning algorithm tries to organise the training data into groups based on similarities that it tries to learn. When new inputs are presented to a model trained with one of these unsupervised learning methods, the model can transform the input data to one of the more compact representations, or predict how it relates to other data based on the patterns it has already found.
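The sketch below illustrates both unsupervised objectives on made-up data. Note that PCA is used here only as a stand-in for the compression idea behind an auto-encoder (a linear auto-encoder learns essentially the same representation), whereas deep auto-encoders learn non-linear codes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabelled data: two fuzzy blobs in 2D, with no targets supplied.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                  rng.normal(1.0, 0.1, (50, 2))])

# Clustering objective: group the data by similarity (here, k-means).
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(data)

# Compression objective (the idea behind an auto-encoder): learn a more
# compact representation of the data with minimal loss.
codes = PCA(n_components=1).fit_transform(data)   # 2D -> 1D representation

print(clusters[:5], clusters[-5:])  # points assigned to learned groups
print(codes.shape)                  # (100, 1) compact representation
```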

Semi-supervised Learning

A combination of the two — Semi-supervised Learning — is used when some of the data is manually labelled and some of it isn't. During training, the model learns how to classify the data, associating the unlabelled items with the labels supplied for the labelled portion of the training data.

Reinforcement Learning

In Reinforcement Learning (RL), the model is not trained with labels associating inputs with targets. It is neither supervised (with a direct supervisory signal) nor unsupervised (with no supervisory signal); instead there is a delayed reward signal (Kaelbling, Littman, & Moore, 1996). The terminology is slightly different in RL: decisions or predictions are called actions, and the decision-making (or action-taking) entity is called an agent. This is because RL is based on a Markov Decision Process (MDP) (Bellman, 1957), where there is a notion of time, and at each time-step agents take actions and move between different states. In RL, at every time-step the agent receives a reward from the environment. This reward might not be an immediate reward for the last action; it could be a delayed reward for an action or series of actions taken much earlier on, or it could even relate to 'random' events outside of the agent's control or knowledge. Part of RL is to solve this credit assignment problem of delayed rewards. The general objective of the algorithm is to learn the optimal decisions by maximising its long-term reward. This process also involves a balance between exploration (of new actions which haven't yet been tried) and exploitation (of actions which are known to reward more highly than others). RL can also be thought of as learning by trial and error.
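Below is a minimal, illustrative sketch of one classic RL algorithm, tabular Q-learning with an epsilon-greedy policy, on a made-up 'corridor' environment where the only reward comes at the far end. Real systems (such as the DQN work discussed later) replace the table with deep networks and far richer state spaces.

```python
import numpy as np

# A toy 'corridor' MDP: states 0..4, actions 0 = left, 1 = right.
# The only reward (+1) arrives on reaching the far end, so it is delayed
# relative to the earlier actions that made it possible.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))     # estimated long-term value of each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(300):
    state = 0
    while state != n_states - 1:
        # Exploration vs exploitation: mostly take the best-known action,
        # occasionally try a random one (ties broken randomly).
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            q = Q[state]
            action = int(rng.choice(np.flatnonzero(q == q.max())))
        next_state, reward = step(state, action)
        # Q-learning update: bootstrap from the best estimated future value,
        # which is how the delayed reward propagates back to earlier actions.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  # learned policy: 'right' (1) in every non-terminal state
```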

Deep Learning and motivations

Deep Learning (DL) is the name given to a family of architectures and methods that aims to minimise or eliminate domain-specific feature engineering by learning a sequence of non-linear feature transformations and hierarchical representations (LeCun, 2014).

Traditional machine learning techniques — such as support vector machines or shallow neural networks — struggle to work directly with highly complex, high-dimensional data. When working with such models, it is necessary to pre-process the data, reduce dimensions and extract hand-crafted, domain-specific features, a process called feature engineering [LeCun 2012]. Collectively, these features are called a representation (of the data). The learning algorithm is then trained on this hand-crafted representation. The process of feature engineering is often quite difficult, time-consuming and requires skill [Ng 2013]. Furthermore, the success of the training is highly dependent on the chosen representation [Bengio et al. 2013]. As a result, feature engineering approaches can provide inconsistent, unreliable results.

Using deep learning techniques, the pre-processing and feature extraction steps can be skipped. Instead, the model is fed the high-dimensional raw data. The deep learning model is a stack of parameterised, non-linear feature transformations that learn hierarchical representations [LeCun 2014]. During training, each layer learns what kinds of transformations to apply to the previous layer, i.e. what kinds of features to extract from it. As a result, the deep learning model is a hierarchy of representations with an increasing level of abstraction.

This makes deep learning very powerful in handling real-world data. However, with many more model parameters to learn, often in the millions, this power comes at the cost of more complex implementations, higher computational requirements, and larger training sets [LeCun et al. 1998].

In summary: A machine learning model requires a representation of the world. In shallow learning, this representation needs to be hand crafted using domain-specific pre-processing and feature extraction. In deep learning, the raw data can be directly fed into the deep model and the algorithms learn a hierarchy of representations.
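The sketch below illustrates this summary: a deep model as a stack of learnable non-linear transformations, each producing a new representation of the previous layer's output. The weights here are random and untrained, and the layer sizes arbitrary, purely to show the structure.

```python
import numpy as np

# A deep model as a stack of parameterised, non-linear transformations.
# Random, untrained weights; the point is only to show how each layer turns
# the previous layer's representation into a new, typically more abstract one.
rng = np.random.default_rng(0)
layer_sizes = [784, 256, 64, 10]          # e.g. raw pixels -> ... -> class scores
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def relu(x):
    return np.maximum(0.0, x)

def forward(x):
    representations = [x]                 # the raw input is the first representation
    for W, b in zip(weights, biases):
        x = relu(x @ W + b)               # a learnable non-linear feature transformation
        representations.append(x)
    return representations

reps = forward(rng.random(784))           # a fake 'image' of 784 raw input values
print([r.shape for r in reps])            # (784,), (256,), (64,), (10,)
```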

However, even though deep learning does not require domain-specific feature engineering, the representation of the input data is still important. A key component in implementing a successful deep learning system is the relationship between the representation and the architecture of the model; they need to be compatible so that the relevant features can be extracted efficiently from the inputs [Bengio et al. 2013]. In their seminal research, Krizhevsky et al. [Krizhevsky et al. 2012] found that removing any one of their convolutional layers — each of which contained no more than 1% of their 60 million parameters — resulted in inferior performance.

DL researchers are very aware of these limitations, and architecture design is a very active research area. Thus DL is currently far from the silver bullet / universal learning algorithm that media hype sometimes proposes it to be.

Very brief history of Deep Learning

An in-depth survey of DL and related algorithms is not within the scope of this text; we'll only focus on recent major milestones relevant to this research. For an in-depth survey please see (Schmidhuber, 2015).

In [LeCun et al. 1989] Yann LeCun used a Convolutional Neural Network (CNN) — a deep neural network with many layers and connections inspired by those found in biological systems — to recognise handwritten digits. However, the computers of the era were not powerful enough to run the operations required by the network, so an additional DSP board was needed. Over the next twenty years LeCun's research showed that CNNs had the ability to learn and recognise patterns in image and speech recognition [LeCun and Bengio 1995][LeCun et al. 1995][LeCun et al. 1998][LeCun et al. 2004].

However, DL algorithms require a great deal of computing power, especially when dealing with large inputs such as images, where the amount of computation scales linearly with the number of pixels [Mnih et al. 2014]. CNNs were not practical for real-world applications until highly parallel Graphics Processing Units (GPUs) became available [Raina et al. 2009]. This led to an explosion of DL implementations. Quite famously, Quoc Le et al. developed software that learned how to detect faces and cats and extract their features by sampling random frames from 10 million YouTube videos [Le et al. 2011].

With so many parameters to train, CNNs require massive datasets. In the absence of such data, CNNs were being outperformed by older, handcrafted, specialist pattern recognition algorithms. With the introduction of large datasets with millions of labelled images in thousands of categories, such as ImageNet [Deng et al. 2009], this balance shifted. In 2012 Geoffrey Hinton’s students Alex Krizhevsky and Ilya Sutskever designed a new deep CNN architecture with 60 million parameters and trained it across two GPUs [Krizhevsky et al. 2012]. Their model outperformed traditional image recognition methods by a large margin. This was a turning point that led to many improvements over recent years, dramatically decreasing errors in predictions.

A similar pattern can be seen in the adoption of Recurrent Neural Networks (RNNs) — deep neural networks with cyclic connections, able to store internal states, allowing them to process sequential, temporal data. In the field of speech recognition, hand-crafted Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) approaches were consistently outperforming other methods, including DL. Once large datasets and computing power became available, RNNs started to outperform GMM-HMMs and became more widespread in speech recognition [Hinton et al. 2012][Deng et al. 2013].

As described above, a significant aspect of the successful application of DL involves the recognition of complex patterns.

Recent research has demonstrated that deep learning approaches are useful beyond classification and recognition tasks. In [Sutskever and Vinyals 2014], the authors used DL to translate text. This involved the considerable challenge of dealing with input and output sequences of variable length. The authors used Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997] RNNs, and their approach outperformed previously known methods for solving such problems. In [Graves 2013], Alex Graves demonstrated using LSTM networks to generate a variety of sequential outputs, including long chunks of text and handwriting. Further research in the field has shown promise with respect to uses of deep learning for artistic, creative output.

Deep Learning for Artistic & Creative Output

(NB: this section is already very out of date and could be 10x longer!)

In [Erhan et al. 2009], the authors were curious about methods of qualitative analysis of network architectures, particularly the effects of varying inputs on specific neuron activity in hidden layers of deep models. They explored this by applying gradient ascent to the inputs of deep models trained on images. With this method they were able to generate inputs which maximised neuron activity on the layers of interest in Stacked Denoising Autoencoders and Deep Belief Networks.

In 2013, researchers at Oxford University [Simonyan et al. 2013] used a similar gradient ascent method to generate images that maximise a class score in a convolutional network trained on ImageNet. They used a similar technique to also generate a class saliency map, given an input image and a class.
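The core mechanism behind these visualisation methods is gradient ascent on the input rather than on the weights. Below is a toy sketch of that idea on a tiny random network; the cited works do this on large trained CNNs and add regularisation to keep the generated images natural-looking.

```python
import numpy as np

# Activation maximisation sketch: start from a random input and repeatedly
# nudge it in the direction that increases a chosen unit's activation.
# The 'network' here is a tiny random one-hidden-layer model, purely for illustration.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (16, 64)), np.zeros(16)   # hidden layer
w2 = rng.normal(0, 0.1, 16)                            # the unit we want to excite

def activation(x):
    return w2 @ np.tanh(W1 @ x + b1)

def grad_wrt_input(x):
    h = np.tanh(W1 @ x + b1)
    return W1.T @ ((1.0 - h ** 2) * w2)   # backprop through tanh to the input

x = rng.normal(0, 0.1, 64)                # random starting 'image'
for _ in range(200):
    x += 0.1 * grad_wrt_input(x)          # gradient *ascent* on the input

print(activation(rng.normal(0, 0.1, 64)), activation(x))  # random vs optimised input
```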

The following year, Oxford University researchers [Mahendran and Vedaldi 2014], frustrated at the lack of knowledge of the internal image representations used in convolutional networks trained for image classification, applied gradient ascent to 'invert' a CNN, reconstructing images from the various hidden layers and neurons. They found that a CNN trained on a dataset such as ImageNet stores photographic information in some layers, and more abstract features such as edges, shapes and gradients in others. The visual outputs of these methods include abstract — but recognisable — representations of the trained images and classes.

In 2015, Alexey Dosovitskiy and Thomas Brox devised a new method of 'inverting' CNNs to visualise the internal representations of a convolutional neural network by using a second convolutional network [Dosovitskiy and Brox 2015].

Also in 2015, Anh Nguyen et al. were curious how recognition in deep image classification models differed from visual recognition in humans. They used convolutional networks trained on ImageNet and the MNIST handwriting dataset [LeCun et al.], combined with evolutionary algorithms and gradient ascent, to generate images that scored highly for specific classes but were unrecognisable to humans [Nguyen et al. 2015]. They found that in some cases they could generate images that the CNN would assign to particular classes with 99.99% confidence, yet be completely unrecognisable to humans (e.g. detecting a cheetah in white noise, a starfish in wavy lines, etc.). Interestingly, they submitted the output images to an art contest, and they were among the 21.3% of submitted artworks selected for exhibition.

Again in 2015, Google researchers [Mordvintsev et al. 2015] released code for research they called #DeepDream / #Inceptionism, which went viral on social media. They used similar gradient ascent to generate images that maximised activity on particular hidden neurons, but then fed the generated output back into the input to create feedback loops that amplified activity. Combined with image transformations on every iteration, this created endless fractal-like animations and 'hallucinations' of abstract — but subtly recognisable — imagery. They used their own GoogLeNet convolutional network architecture for this research, details of which can be found in [Szegedy et al. 2014].

Also in 2015, [Gatys et al. 2015] released similar research that was also highly shared on social networks, called #StyleNet. This research extracts the artistic style of an image — for example, a painting by Van Gogh or Edvard Munch — and applies it to another image, such as a photograph. Techniques for applying artistic styles to images have been researched for many years as a subset of non-photorealistic rendering; a 2013 survey can be found in [Kyprianidis et al. 2013]. However, in most cases the algorithms are hand-crafted to resemble each particular style (an exception to this can be seen in [Mital et al. 2013]). In Gatys et al.'s research, the authors found that they could use convolutional neural networks to separate the content and the style of an image, storing different representations for each. Doing so enabled them to apply different transformations, or even mix and match representations from different images — e.g. applying the style of Van Gogh's 'Starry Night' to a photograph. This technique works remarkably well even on very abstract styles, such as those of Mark Rothko, Jackson Pollock, or Piet Mondrian.
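The style representation in Gatys et al. is built from Gram matrices: correlations between a layer's filter responses, which keep texture statistics but discard spatial layout. Below is a minimal sketch of that computation, using random placeholder feature maps instead of activations from a trained CNN:

```python
import numpy as np

# Sketch of the style representation used in Gatys et al.: the Gram matrix of
# one layer's feature maps. The feature maps here are random placeholders;
# in the paper they come from layers of a trained CNN (VGG).
rng = np.random.default_rng(0)
n_filters, height, width = 64, 32, 32
features = rng.random((n_filters, height, width))   # one layer's activations

F = features.reshape(n_filters, -1)                 # flatten the spatial dimensions
gram = F @ F.T / F.shape[1]                         # (64, 64) style representation

# A style loss would then compare the Gram matrices of the generated and style
# images, e.g. np.mean((gram_generated - gram_style) ** 2), while a content
# loss compares raw feature maps of the generated and content images.
print(gram.shape)
```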

Similar developments have also taken place with sequential data using recurrent neural networks. In 2015 Andrej Karpathy released an open-source Recurrent Neural Network implementation for training character-level language models, called char-rnn [Karpathy 2015a], based on [Graves 2013]. The software takes a single text file, and generates similar text based on character sequence probabilities. This software has been used by a number of people to generate text in the style of Shakespeare, cooking recipes, rap lyrics, Obama speeches, the Bible and more [Karpathy 2015b]. The char-rnn library was also used by [Sturm 2015] to generate MIDI notes in the style of folk music. Because it is a simple implementation operating on a text sequence, it is limited to monophonic output. Other examples of composing music using recurrent networks include [Eck and Schmidhuber 2002][Boulanger-Lewandowski et al. 2012].
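The generative step in a character-level model like char-rnn boils down to repeatedly sampling the next character from the probability distribution the network outputs. A toy sketch of that sampling step follows, with made-up scores standing in for the LSTM's output:

```python
import numpy as np

# Sketch of the sampling step in a character-level model: at each step the
# network outputs a score for every character in its vocabulary, and the next
# character is sampled from the resulting distribution. The scores below are
# placeholders; in char-rnn they come from an LSTM conditioned on the text so far.
rng = np.random.default_rng(0)
vocab = list("abcdefgh ")
logits = rng.normal(0, 1, len(vocab))     # unnormalised scores from the model

def sample(logits, temperature=1.0):
    # Lower temperature -> more conservative output, higher -> more surprising.
    p = np.exp(logits / temperature)
    p /= p.sum()
    return vocab[rng.choice(len(vocab), p=p)]

print("".join(sample(logits) for _ in range(20)))
```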

In addition, in 2013, DeepMind Technologies (recently acquired by Google) developed a system capable of learning how to play Atari games simply by observing the images on the screen [Mnih et al. 2013]. Given no prior knowledge of the game rules or controls, with only the screen pixels as input, the system developed strategies within a few days of playing. In some cases these strategies outperformed other AIs, and in some cases they even outperformed human players. They achieved this by implementing a deep reinforcement learning algorithm, training convolutional networks with Q-learning, an approach they call Deep Q Networks (DQN). In other recent research [Guo et al. 2014], the authors investigated methods of improving on DQN's performance using Monte Carlo Tree Search methods [Browne and Powley 2012]. Not having access to the internal game state, the researchers train a CNN offline with UCT, using the screen pixels as input, and then use the trained CNN at runtime as a policy to select actions. They found that the UCT-trained CNN beat the DQN's scores and was, at the time, the state of the art in realtime Atari-playing AI. However, achieving those results required a very long offline training period.

These recent developments using deep learning networks to generate images, sounds or actions show incredible potential for the application of deep learning for creative output. There are still many unexplored territories however, and the performance is far from realtime, and not interactive.

Brief History of Algorithmic Computational Art (in relation to ML/AI)

In this section I take a slight digression to acknowledge algorithmic art prior to Deep Learning. An in-depth survey is out of scope for this review, but this serves as a brief summary of the area. A more comprehensive review can be found in [Grierson 2005] and [Levin 2000].

The use of computers for the purposes of making art dates back at least as far as the 1950s and 1960s, most notably with John Whitney's DIY analog computers built from World War II M5 and M7 targeting computers and anti-aircraft machinery [Alves 2005]. Whitney's works were not only pioneering technically, contributing to the birth of computer graphics and special effects such as those in Stanley Kubrick's 2001: A Space Odyssey, but they also pioneered the field of computer-aided audio-visual composition [Grierson 2005]. His work continues in the tradition of experimental abstract animators and filmmakers such as Norman McLaren and Oskar Fischinger. He was joined shortly after by software artists such as Paul Brown, Vera Molnar, Manfred Mohr, Frieder Nake, Larry Cuba and many more.

However it was Harold Cohen’s AARON software from 1973 [Cohen 1973] which first introduced artificial intelligence into computer art. AARON was ostensibly a piece of software, written to understand colour and form. Cohen often talks about ‘training’ his software [Cohen 1994]. However he uses the term rhetorically. The ‘learning’ in AARON is not the machine learning mentioned in previous sections. AARON does not learn by looking at data. Instead, whenever Cohen wants AARON to learn something new, he has to analyse it himself, and implement the sets of rules required to replicate that behaviour. Often these are very complex sets of rules that take years for Cohen himself to learn before he can program them [Cohen 2006].

Other computer graphics artists working with artificial intelligence shortly after include William Latham [Todd and Latham 1992], Karl Sims [Sims 1994], and Scott Draves [Draves 2005]. Inspired by Darwinian evolution by natural selection, these artists primarily explored Evolutionary Algorithms (EA) — also known as Genetic Algorithms (GA) — in the creation of algorithmic art, eventually known as evolutionary art. Also during this period, David Cope developed an algorithm for composing music. His 'Experiments in Musical Intelligence (EMI)' began in 1981, and he developed it over the decades until he eventually patented it as the 'Recombinant music composition algorithm' [Cope 2010]. Using this algorithm he generated musical sequences in the style of many classical composers, such as Bach, Vivaldi, Beethoven, Mozart, Chopin and Debussy. His latest software, 'Emily Howell', has had albums released under its name.

Starting in the 1960s, Myron Krueger developed gesturally interactive computer artworks and Responsive Environments, culminating in his seminal Artificial Reality environment 'Videoplace' [Krueger et al. 1985]. Videoplace tracked users with cameras, enabling them to interact with virtual objects in the projected scene. First developed in 1986, David Rokeby's 'Very Nervous System' explores similar themes of gestural full-body interaction — using hand-built cameras — in this case to generate music [Rokeby 1986]. Other notable artists working with similar ideas in this era include Ed Tannenbaum, Scott Snibbe, Michael Naimark, Golan Levin and Camille Utterback.

With the introduction of creative coding tools and open-source communities, there has been exponential growth in this field over the last few decades. Tools with global communities include Processing, openFrameworks, Cinder, vvvv, Max/MSP/Jitter, PureData, SuperCollider, TouchDesigner, QuartzComposer, Three.js and many smaller bespoke ones.

This process-driven creative art form can be traced back to a wider rule-based generative art movement that includes composers such as Steve Reich, John Cage, Terry Riley and Brian Eno, and artists such as Sol LeWitt and Nam June Paik.

Computational Creativity

A sub-field of AI which is related to this area is Computational Creativity. Whereas AI questions whether a machine can think, or exhibit intelligent behaviour [Turing 1950], Computational Creativity questions whether a machine can be creative, or exhibit creative behaviour.

Computational Creativity research is not only concerned with the creative output of the algorithms or technical implementation details, but is equally — if not more — concerned with the philosophical, cognitive, psychological and semantic connotations of machines exhibiting creative behaviour, or acting creative. In [McCormack and d’Inverno 2012] and related papers [McCormack and D’Inverno 2014], the authors ask — and attempt to answer — questions regarding computers and creativity, creative agency and the role of creative tools.

As part of the philosophical angle of computational creativity research, there is often an emphasis on fully autonomous systems. The field includes research into software which exhibits intentionality, and is able to justify the decisions it makes when creating a piece of work by framing information in the context of the work [Colton and Wiggins 2012]. This can be thought of as analogous to an artist making deliberate, purposeful decisions at every step of the creative process. This aspect of computational creativity is sometimes referred to as strong computational creativity [Al-rifaie and Bishop 2015] — analogous to John Searle’s strong (vs weak) AI [Searle 1980]. The field is also accompanied by certain formalisms, proposed models and theories of creativity to ensure the systems’ behaviour complies with what is thought to be ‘creative behaviour’ [Colton et al. 2011].

Within this context there has been research into systems that conceive fictional concepts [Cavallo et al. 2013], design video games [Cook et al. 2014], write poetry [Colton et al. 2012], and other inherently ‘creative’ tasks.

NB: While the algorithmic techniques used for content generation in Computational Creativity are within the scope of my research, the formalisms and models of creativity are not. It can be said that my research is interested in weak computational creativity — particularly semi-autonomous, collaborative creativity where human interaction in the content creation process is not only relevant but essential — to create interactive systems where the human user can guide the computationally creative system in realtime.

Machine Learning for Artistic, Expressive Human Computer Interaction (AEHCI)

Introduction

In the previous sections I reviewed a non-exhaustive range of relevant literature in areas of algorithmic image and sound generation, ranging from simple algorithms to the latest developments in deep learning. Some of these have been non-realtime, for example where content is generated through the application of deep learning, while others have been realtime, even interactive. This section will cover Artistic Expressive Human Computer Interaction (AEHCI) — Human Computer Interaction for artistic expression.

In [Dourish 2001] Paul Dourish proposes new models for interactive system design. Embodied Interaction is interaction that is embodied in the environment, not just physically, but as a fundamental component of the setting. It is an approach to interaction design that takes into consideration the ways we experience the everyday world. This philosophy is particularly applicable when designing gestural interfaces for artistic expression.

As mentioned before, Myron Krueger was also interested in exploring Responsive Environments, in which 'interaction is a central, not peripheral issue' [Krueger et al. 1985]. He saw potential in this area for the arts, education and telecommunications, as well as general human-machine interaction, and was motivated by creating playful environments which explore the perceptual processes we use to navigate the physical world.

Human computer interaction for music — or Musician-Computer Interaction (MCI) [Gillian 2011] — is one of the more academically established fields related to AEHCI, more so than gestural human computer interaction for visual composition. However, many of the requirements for interaction design, and particularly gesture recognition, are similar. Both require low-latency, realtime systems that can be configured on-the-fly. They need to be capable of detecting a wide range of gestures: some AEHCI systems might concentrate on subtle finger movements, while others track the whole bodies of multiple people. Furthermore, the ability to detect and respond to subtle variations in gestures is essential to convey expressivity [Caramiaux et al. 2014]. Also, in performance situations, gesture recognition need not be generalised across different people; training can instead be specific to the performing individual to maximise personal expression [Gillian 2011].

Due to these similarities, in this research Musician-Computer Interaction is taken as a base model for expressive gestural interaction, and will be built on for general AEHCI.

Gestures and ‘Expressive Gesture’

A survey of definitions of gesture, especially in relation to music, can be found in [Cadoz and Wanderley, 2000]. The authors conclude that the many proposed definitions do not adapt well to gesture in music, but they purposefully avoid providing a new definition, focusing instead on which aspects of the various definitions might apply.

In [Camurri et al., 2004] the authors define Expressive Gesture as “responsible of [sic] the communication of information that we call expressive content”, where “Expressive content concerns aspects related to feelings, moods, affect, intensity of emotional experience”. This is the definition of Expressive Gesture used in this research, complemented with the “natural, spontaneous gestures made when a person is telling a story”, as described in [Cassell and Mcneill, 1991], particularly those with the semiotic classification of metaphoric, indicating abstract ideas [McNeill and Levy, 1980]. A wider study of gesture expressivity and its dimensions — especially in the context of musical performance and human computer interaction — can be found in [Caramiaux, 2015].

This research is not concerned with detecting emotion within gesture as in [Cowie et al., 2001] or [Zeng et al., 2009]. Instead it is concerned with finding correlations between various parameters of a gesture, and parameters of the generative output model. It will map expressive gesture to trained models of artistic content synthesis and manipulation. Inspired by research in embodied cognition and the relationship between action and perception [Kohler et al., 2002, Metzinger and Gallese, 2003, Leman, 2007], in [Caramiaux et al., 2009] the authors investigate similar relationships by analysing motion capture data of participants performing free hand movements while listening to short sound samples.

Gesture Recognition Sensors

One of the significant challenges in executing gestural interaction is reading relevant information from the user — recognising their positions, movements and gestures. A further challenge is extrapolating their motivations and intentions from those gestures.

There are many hardware devices and sensors which can support this process: accelerometers and inertial measurement units (IMU), myoelectric sensors, ultrasonic and infra-red range finders, 2D cameras / depth cameras / computer vision (CV), radar, lidar, etc. Surveys of gesture recognition technology and research can be found in [Gillian 2011] and [LaViola Jr. 2013].

My research does not investigate new modes of sensing. It focuses on emerging consumer technology, and applications of algorithmic image and sound synthesis and manipulation within that context, in order to remain applicable to hardware and environments that relate to commercial gaming and mainstream use. Primarily this will involve depth cameras similar to Microsoft's Kinect 2 and LeapMotion, as well as computer vision with traditional 2D cameras.

Furthermore, there is an increasing trend in consumer devices to combine multiple sensors for more varied data, contributing to higher accuracy in estimating pose and movement. Past examples include Nintendo's Wiimote controller, combining an accelerometer with an infrared sensor (and an additional gyroscope with the MotionPlus addon), and Sony's PSMove controller, combining a 6-axis IMU with a magnetometer and high-speed computer vision. Both controllers also feature buttons and a D-pad, as well as vibration-based haptic feedback. Microsoft's Kinect likewise combines an RGB camera, IR camera / depth sensor, microphone array, and accelerometer (to detect device orientation) into a single, affordable consumer device.

Next generation devices are combining increasing numbers of sensors. Project Tango by Google's Advanced Technology And Projects Group (ATAP) [Google Advanced Technology And Projects 2014] is an example of this trend. Project Tango is a next-generation mobile device with a depth sensor, motion tracking camera and 9-axis IMU, enabling it to calculate its position and orientation in space while simultaneously scanning and building a 3D map of its environment. A very recent research project from a different team in the same group, Project Soli [Google Advanced Technology And Projects 2015], uses radar to track hand and finger movements at sub-millimetre precision and high speed, enabling natural, intuitive interfaces for small wearable devices.

These are all examples of the kinds of devices that will be used for this research. But such devices share a common problem: extracting meaningful information from their data for gesture recognition is a challenging task. Machine learning is currently a very popular and successful approach to this problem.

Gestural Interaction (and Gesture Recognition) for AEHCI

Gestural interaction (and gesture recognition) is a very broad field. This section will focus on applications within AEHCI, particularly in context of machine learning. Wider surveys can be found in [Mitra and Acharya 2007][Gillian 2011] and [LaViola Jr. 2013].

Artificial Neural Networks (ANN) are particularly useful for AEHCI as they are able to map m-dimensional input vectors to n-dimensional output vectors with a learned non-linear function, allowing them to control complex parameter sets simultaneously. This is especially useful in regression tasks, when manipulating continuous parameters of a generative visual or sonic model. They can be equally successful in classification tasks, for recognising gestures and triggering desired visual or sonic outputs. In 1992, Michael Lee et al. used ANNs inside the Max/MSP musical programming environment to investigate adaptive user interfaces for realtime musical performance [Lee et al. 1992]. They were able to successfully recognise gestures from a number of devices, including a radio baton and a continuous space controller. In 1993, Sidney Fels and Geoffrey Hinton used ANNs to map hand movements, captured via a data-glove, to a speech synthesiser [Fels and Hinton 1993]. They achieved realtime results with a vocabulary of 203 gestures-to-words, demonstrating the potential of neural networks for adaptive interfaces. Now, with many open-source implementations available, and integration into Rebecca Fiebrink's Wekinator and Nick Gillian's GRT GUI, ANNs are widely used for creative gestural interaction.
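A minimal sketch of this mapping idea follows, using scikit-learn's MLPRegressor on made-up data; the feature and parameter names are hypothetical, and in practice the input-output pairs would be recorded by demonstration (e.g. 'when I pose like this, sound like that'):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Learn a non-linear mapping from m-dimensional gesture features to
# n-dimensional synth parameters. The data here is random and purely
# illustrative; real training pairs come from recorded demonstrations.
rng = np.random.default_rng(0)
gesture_features = rng.random((200, 6))   # e.g. hand position, velocity, openness...
synth_params = rng.random((200, 4))       # e.g. pitch, filter cutoff, reverb, volume

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
model.fit(gesture_features, synth_params)  # training phase

new_gesture = rng.random((1, 6))           # one live sensor frame
print(model.predict(new_gesture))          # 4 continuous synth parameters
```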

Many other machine learning techniques have been used for gesture recognition, with different specific use cases. These include K-Nearest Neighbour, Gaussian Mixture Models, Random Forests, Adaptive Naïve Bayes Classifiers and Support Vector Machines to classify static data; Dynamic Time Warping and Hidden Markov Models can be used to classify temporal gestures; Linear Regression, Logistic Regression and Multivariate Linear Regression can be used for real-valued outputs as opposed to classifying the input. A survey of machine learning techniques and applications for musical gesture recognition can be found in [Caramiaux and Tanaka 2013].
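As one example from the list above, here is a minimal sketch of Dynamic Time Warping, which scores how well two gesture trajectories align even if one is performed faster or slower than the other (the trajectories below are made up):

```python
import numpy as np

def dtw_distance(a, b):
    # Dynamic Time Warping: cost of the best monotonic alignment between two
    # sequences, allowing them to stretch or compress in time.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy example: a 'template' gesture, a slower performance of the same shape,
# and an unrelated gesture (all 2D trajectories, coordinates made up).
t = np.linspace(0, 1, 20)
template = np.stack([t, np.sin(2 * np.pi * t)], axis=1)
t_slow = np.linspace(0, 1, 35)
slow = np.stack([t_slow, np.sin(2 * np.pi * t_slow)], axis=1)
other = np.stack([t, np.cos(2 * np.pi * t)], axis=1)

# The slower performance should score much closer to the template.
print(dtw_distance(template, slow), dtw_distance(template, other))
```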

As mentioned previously, in an artistic, performative context, detecting subtle variations of gestures is vital to conveying expressivity. In [Bevilacqua and Muller 2005][Bevilacqua et al. 2009], Bevilacqua et al. design continuous gesture followers that allow temporal gesture recognition in realtime, while the gesture is still being performed. The algorithm returns time-progression information and likelihoods, enabling performers to alter the speed and accuracy of the gesture to control parameters of their generative model.

In [Caramiaux et al. 2014][Caramiaux 2015], Caramiaux et al. develop systems that go beyond classification of gestures, to characterise the qualities of a gesture's execution. They use computational adaptive models to identify temporal, geometric and dynamic variations on the trained gesture. Returning this information in realtime to the performer as they execute gestures enables the performer to map the variations to parameters such as time-stretching of samples, modulations, volume or custom synth parameters.

In [Kiefer 2014] Kiefer investigates the use of Echo State (Recurrent Neural) Networks (ESN) as mapping tools, to learn sequences of input gestures, and non-linearly map them to multi-parameter output sequences. The research concludes that ESNs demonstrate good potential in pattern classification, multi-parametric control, explorative and nonlinear mapping, but there is room for improvement to produce more accurate results in some cases.

Interactive Machine Learning (IML)

As discussed above, ML is a very successful technique for pattern and gesture recognition. However, using machine learning can be difficult because of the technical knowledge and time required to build classifiers and set up the signal processing pipeline [Fails et al. 2003].

Interactive Machine Learning (IML) is a field which looks at the process of using machine learning, through the lens of human computer interaction research [Fiebrink 2011].

While ML brings huge advancements to the fields of data analysis and pattern recognition, IML seeks to improve how ML systems can be used, particularly by expanding their user base from dedicated computer scientists and closely related disciplines to a much wider audience. One of the ways in which this is made possible is via a Graphical User Interface (GUI) front end to an ML backend, with data streamed live to and from the ML backend. The training and predictions can be performed in realtime, without writing any code, making it a perfect choice for performance and AEHCI.

Rebecca Fiebrink et al.'s previously mentioned Wekinator software, released in 2009, is an example of such an IML system aimed at musical performance [Fiebrink et al. 2009]. Using a GUI, users are able to set up, train and modify parameters of an ANN. The software also allows other applications — such as existing music software, visual software, or other custom generative software — to stream data to the Wekinator using Open Sound Control (OSC) [Wright and Freed 1997], a UDP-based protocol commonly used for inter-app and inter-device communication. As the Wekinator receives this data, it runs it through a machine learning model and streams back predictions in realtime. Wekinator also has a number of built-in sensor input and feature extraction capabilities, such as edge detection from a webcam. Using this tool, artists, musicians, dancers, performers and researchers from other fields can train and map gestures to arbitrary outputs, such as notes, effects, images and sounds, with no programming or need for any other computer vision software.
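A minimal sketch of what this OSC round trip looks like from the client side, using the third-party python-osc package; the addresses and ports shown (/wek/inputs on 6448, /wek/outputs on 12000) follow Wekinator's usual conventions but are assumptions to be checked against the tool's actual settings:

```python
from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Send one frame of sensor features to a Wekinator-style IML backend.
# NOTE: address and port are assumed defaults, not guaranteed.
client = SimpleUDPClient("127.0.0.1", 6448)
client.send_message("/wek/inputs", [0.42, 0.13, 0.99])

def on_outputs(address, *values):
    # Predictions streamed back from the trained model,
    # e.g. continuous synth or visual parameters.
    print(address, values)

# Listen for the model's predictions coming back over OSC.
dispatcher = Dispatcher()
dispatcher.map("/wek/outputs", on_outputs)
BlockingOSCUDPServer(("127.0.0.1", 12000), dispatcher).serve_forever()
```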

Nick Gillian's Gesture Recognition Toolkit (GRT) from 2011 [Gillian 2011] provides similar functionality, but with more emphasis on the signal processing / gesture recognition pipeline. It lacks built-in input functionality such as webcam or microphone inputs, but has a number of built-in pre-processing, feature extraction and post-processing algorithms; examples include the Fast Fourier Transform, Principal Component Analysis, various filters, derivatives, dead zones and more. In addition to being an open-source application, the underlying codebase is released as a C++ framework, allowing it to be integrated into bespoke applications.

Recently NVIDIA released a similar GUI-based application — the Deep Learning GPU Training System (DIGITS) — allowing researchers to use deep learning in a similarly interactive fashion [NVIDIA 2015]. The software uses the popular open-source deep learning framework Caffe [Jia et al. 2014], and is designed to take full advantage of GPU acceleration, scaling up automatically on multi-GPU systems.

In 2015, during my research, I required a similar Multi-Model IML system: one in which I could dynamically create and train new models while leaving existing models intact, and in which I could access multiple models simultaneously, feeding each model different inputs and receiving the associated predictions. For this I developed msaOscML [Akten 2015a], a Multi-Model Interactive Machine Learning tool. It is inspired by, and similar to, Rebecca Fiebrink's Wekinator and Nick Gillian's GRT. However, while those tools are aimed at a non-technical audience and have a user-friendly Graphical User Interface (GUI), msaOscML is currently aimed at creative developers who would like to add machine learning capabilities to their creative software suite, with a focus on self-running installations and performances where hands-on operation of the software should be minimised. For this reason msaOscML has no GUI; it runs in the background as a server, with only a console to indicate status and provide visual feedback to the user if desired. Similar to Wekinator and GRT, it can be interacted with (for input and output) via the Open Sound Control (OSC) protocol (Wright & Freed, 1997).

The main purpose of msaOscML, and its difference from Wekinator and GRT, is that it can not only train and predict with multiple independent models simultaneously, but also create, save and load multiple models directly via its OSC interface, without requiring a human operator at any point. This allows a host application (e.g. software such as Max/MSP, Ableton Live, or custom software generating sound or visuals) to control multiple models directly, and the system can be unmanned throughout a long performance or installation. Cross-platform and written in C++, msaOscML uses an abstraction layer for the ML implementation, which I refer to as the Machine Learning Implementation Abstraction Layer (MLIAL). This allows different machine learning libraries to plug in as a backend with a minimal MLIAL wrapper. Currently there are MLIAL wrappers for Gillian's GRT framework and Steffen Nissen's Fast Artificial Neural Network Library (FANN) (Nissen, 2003). msaOscML was written for, and used on, an R&D interactive dance project called Pattern Recognition (Akten, 2015b).

Tools like these enable both technical and non-technical users to quickly set up, train and test models for gesture recognition and gestural interaction. Without writing any code, users can start streaming input data from their sensors and receive predictions in their application of choice, enabling them to gesturally create, manipulate and perform audio-visual content in realtime. An example of Fiebrink's Wekinator in use can be seen in the band 000000Swan's audio-visual shows, gesturally driven using a Microsoft Kinect and a commercially available sensor bow [Schedel et al. 2011]. It has also been applied in contexts such as workshops with people with learning and physical disabilities [Katan et al. 2015].

Conclusions

Algorithmically generating images and sound is a very rich and well-established field. Expressive interaction — particularly for music — is also well established, with new advanced techniques emerging as the field matures. Deep learning is going through an almost revolutionary revival with many recent developments. The gaming, entertainment and media industries are converging as next-generation multi-modal interaction and virtual, augmented and mixed reality technologies become mainstream.

Within this context, there are still many unexplored territories with a lot of artistic potential, especially at the intersections of these trends. These include ways of generating content using deep learning, particularly in realtime and interactively, and applications of expressive interaction to content generation, particularly with next-generation consumer devices set in mixed reality environments.

Bibliography

AKTEN, M. 2015a. msaOscML. https://github.com/memo/msaOscML
AKTEN, M. 2015b. ‘Pattern Recognition’ Dance Performance. http://www.memo.tv/pattern-recognitionwip/
AKTEN, M., STEEL, B., MCNICHOLAS, R., ET AL. 2011. Sony PlayStation VideoStore Mapping.
AL-RIFAIE, M.M. AND BISHOP, J.M. 2015. Weak and Strong Computational Creativity. In: Computational Creativity Research: Towards Creative Machines. 0–14. 
ALVES, B. 2005. Digital Harmony of Sound and Light. Computer Music Journal 29, 4, 45–54. 
BENGIO, Y., COURVILLE, A., AND VINCENT, P. 2013. Representation Learning: A Review and New Perspectives. Tpami 1993, 1–30. 
BEVILACQUA, F. AND MULLER, R. 2005. A gesture follower for performing arts. Proceedings of the International Gesture …, 3–4. 
BEVILACQUA, F., ZAMBORLIN, B., SYPNIEWSKI, A., SCHNELL, N., GUÉDY, F., AND RASAMIMANANA, N. 2009. Continuous realtime gesture following and recognition. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5934 LNAI, 73–84. 
BOULANGER-LEWANDOWSKI, N., VINCENT, P., AND BENGIO, Y. 2012. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1159–1166.
BROWNE, C. AND POWLEY, E. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4, 1, 1–49.
CADOZ, C. AND WANDERLEY, M. 2000. Gesture-music. Trends in gestural control of music, 71–94. 
CAMURRI, A., MAZZARINO, B., RICCHETTI, M., TIMMERS, R., AND VOLPE, G. 2004. Multimodal analysis of expressive gesture in music and dance performances. In: Gesture-based {C}ommunication in {H}uman-{C}omputer {I}nteraction, {LNAI} 2915. 20–39. 
CARAMIAUX, B. 2015. Motion Modeling for Expressive Interaction A Design Proposal using Bayesian Adaptive Systems. International Workshop on Movement and Computing (MOCO), IRCAM. 
CARAMIAUX, B., BEVILACQUA, F., AND SCHNELL, N. 2009. Towards a gesture-sound cross-modal analysis. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5934 LNAI, 158–170. 
CARAMIAUX, B., DONNARUMMA, M., AND TANAKA, A. 2015. Understanding Gesture Expressivity through Muscle Sensing. ACM Transactions on Computer-Human Interaction 0, 0, 1–27. 
CARAMIAUX, B., MONTECCHIO, N., TANAKA, A., AND BEVILACQUA, F. 2014. Adaptive Gesture Recognition with Variation Estimation for Interactive Systems. ACM Transactions on Interactive Intelligent Systems (TiiS) (In Press) V, 212. 
CARAMIAUX, B. AND TANAKA, A. 2013. Machine Learning of Musical Gestures. Proceedings of the International Conference on New Interfaces for Musical Expression, 513–518. 
CASSELL, J. AND MCNEILL, D. 1991. Gesture and the Poetics of Prose. Poetics Today 12, 3, 375–404. 
CAVALLO, F., PEASE, A., GOW, J., AND COLTON, S. 2013. Using Theory Formation Techniques for the Invention of Fictional Concepts. 176–183. 
CIRESAN, D., MEIER, U., AND MASCI, J. 2011. Flexible, high performance convolutional neural networks for image classification. International Joint Conference on Artificial Intelligence, 1237–1242. 
COHEN, H. 1973. Parallel to perception: some notes on the problem of machine-generated art. Computer Studies, 1–10. 
COHEN, H. 1994. The Further Exploits of Aaron, Painter.
COHEN, H. 2006. AARON, Colorist: from Expert System to Expert.
COLLOBERT, R., WESTON, J., BOTTOU, L., KARLEN, M., KAVUKCUOGLU, K., AND KUKSA, P. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12, 2493–2537.
COLTON, S., GOODWIN, J., AND VEALE, T. 2012. Full-FACE Poetry Generation. Proceedings of the Third International Conference on Computational Creativity (ICCC’12), 95–102. 
COLTON, S., PEASE, A., AND CHARNLEY, J. 2011. Computational creativity theory: The FACE and IDEA descriptive models. Proceedings of the Second International Conference on Computational Creativity, 90–95. 
COLTON, S. AND WIGGINS, G. A. 2012. Computational creativity: The final frontier? Frontiers in Artificial Intelligence and Applications 242, 21–26. 
COOK, M., COLTON, S., AND GOW, J. 2014. Automating Game Design In Three Dimensions. AISB Symposium on AI and Games, 3–6. 
COPE, D.H. 2010. Recombinant music composition algorithm and method of using the same.
COUPRIE, C., NAJMAN, L., AND LECUN, Y. 2013. Learning Hierarchical Features for Scene Labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35, 8, 1915–1929. 
COWIE, R., DOUGLAS-COWIE, E., TSAPATSOULIS, N., ET AL. 2001. Emotion recognition in human-computer interaction. Signal Processing Magazine, IEEE 18, 1, 32–80. 
CRUZ-NEIRA, C., SANDIN, D., AND DEFANTI, T. 1993. Surround-screen projection-based virtual reality: the design and implementation of the CAVE. Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’93), 135–142.
DENG, J., DONG, W., SOCHER, R., LI, L.-J., LI, K., AND FEI-FEI, L. 2009. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2–9.
DENG, L., HINTON, G., AND KINGSBURY, B. 2013. New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8599–8603.
DOSOVITSKIY, A. AND BROX, T. 2015. Inverting Convolutional Networks with Convolutional Networks. 1–15. 
DOURISH, P. 2001. Where the Action Is: The Foundations of Embodied Interaction. MIT Press. http://books.google.com/books?id=DCIy2zxrCqcC&pgis=1
DRAVES, S. 2005. The Electric Sheep screen-saver: A case study in aesthetic evolution. Proc. EvoMUSART, 458–467. 
ECK, D. AND SCHMIDHUBER, J. 2002. A first look at music composition using LSTM recurrent neural networks. Technical Report, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA).
ERHAN, D., BENGIO, Y., COURVILLE, A., AND VINCENT, P. 2009. Visualizing higher-layer features of a deep network. Technical Report 1341, Université de Montréal, 1–13.
FAILS, J.A. AND OLSEN JR., D.R. 2003. Interactive Machine Learning. Proceedings of the 8th International Conference on Intelligent User Interfaces, ACM, 39–45.
FELS, S.S. AND HINTON, G.E. 1993. Glove-talk: a neural network interface between a data-glove and a speech synthesizer. IEEE Transactions on Neural Networks 4, 1, 2–8. 
FIEBRINK, R., TRUEMAN, D., AND COOK, P.R. 2009. A metainstrument for interactive, on-the-fly machine learning. Proc. NIME 2, 3. 
FIEBRINK, R.A. 2011. Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance. PhD thesis, Princeton University.
GATYS, L.A., ECKER, A.S., AND BETHGE, M. 2015. A Neural Algorithm of Artistic Style. arXiv preprint arXiv:1508.06576.
GILLIAN, N.E. 2011. Gesture Recognition for Musician Computer Interaction. PhD thesis, Queen’s University Belfast.
GOOGLE ADVANCED TECHNOLOGY AND PROJECTS. 2014. Project Tango. https://www.google.com/atap/projecttango/
GOOGLE ADVANCED TECHNOLOGY AND PROJECTS. 2015. Project Soli. https://www.google.com/atap/projectsoli/
GRIERSON, M. 2005. Audiovisual composition. PhD thesis. http://www.strangeloop.co.uk/Dr. M.Grierson — Audiovisual Composition Thesis.pdf
GUO, X., SINGH, S., LEE, H., LEWIS, R., AND WANG, X. 2014. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Advances in Neural Information Processing Systems (NIPS) 27 2600, 3338–3346. 
GUZELLA, T.S. AND CAMINHAS, W.M. 2009. A review of machine learning approaches to Spam filtering. Expert Systems with Applications 36, 7, 10206–10222. 
HINTON, G., DENG, L., YU, D., ET AL. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, November, 82–97.
HOCHREITER, S. AND SCHMIDHUBER, J. 1997. Long short-term memory. Neural computation 9, 8, 1735–80. 
SUTSKEVER, I., VINYALS, O., AND LE, Q.V. 2014. Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems (NIPS), 1–9.
JIA, Y., SHELHAMER, E., DONAHUE, J., ET AL. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093.
JONES, B., SHAPIRA, L., SODHI, R., ET AL. 2014. RoomAlive: Magical Experiences Enabled by Scalable, Adaptive Projector-camera Units. Proceedings of the 27th annual ACM symposium on User interface software and technology — UIST ’14, 637–644. 
JONES, B.R., BENKO, H., OFEK, E., AND WILSON, A.D. 2013. IllumiRoom: peripheral projected illusions for interactive experiences. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems — CHI ’13, 869. 
KARPATHY, A. 2015a. char-rnn. https://github.com/karpathy/char-rnn
KARPATHY, A. 2015b. The Unreasonable Effectiveness of Recurrent Neural Networks.
KARPATHY, A. AND FEI-FEI, L. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
KATAN, S., GRIERSON, M., AND FIEBRINK, R. 2015. Using Interactive Machine Learning to Support Interface Development Through Workshops with Disabled People. CHI ’15 Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 
KIEFER, C. 2014. Musical Instrument Mapping Design with Echo State Networks. Proceedings of the International Conference on New Interfaces for Musical Expression, 293–298. 
KOHLER, E., KEYSERS, C., UMILTÀ, M.A., FOGASSI, L., GALLESE, V., AND RIZZOLATTI, G. 2002. Hearing sounds, understanding actions: action representation in mirror neurons. Science (New York, N.Y.) 297, 5582, 846–848. 
KRIZHEVSKY, A., SUTSKEVER, I., AND HINTON, G.E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1–9.
KRUEGER, M.W., GIONFRIDDO, T., AND HINRICHSEN, K. 1985. VIDEOPLACE — an artificial reality. ACM SIGCHI Bulletin 16, 4, 35–40.
KYPRIANIDIS, J.E., COLLOMOSSE, J., WANG, T., AND ISENBERG, T. 2013. State of the ’Art’: A taxonomy of artistic stylization techniques for images and video. IEEE Transactions on Visualization and Computer Graphics 19, 5, 866–885.
LAVIOLA JR., J.J. 2013. 3D Gestural Interaction: The State of the Field. ISRN Artificial Intelligence 2013, 1–18.
LE, Q. V., RANZATO, M.A., MONGA, R., ET AL. 2011. Building high-level features using large scale unsupervised learning. International Conference on Machine Learning (ICML).
LEAPMOTION. LeapMotion. https://www.leapmotion.com/
LECUN, Y. 2012. Learning invariant feature hierarchies. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7583 LNCS, PART 1, 496–505. 
LECUN, Y. 2014. The Unreasonable Effectiveness of Deep Learning. Facebook AI Research & Center for Data Science, NYU. 
LECUN, Y. AND BENGIO, Y. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 255–258. 
LECUN, Y., BOSER, B., DENKER, J.S., ET AL. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1, 541–551. 
LECUN, Y., BOTTOU, L., BENGIO, Y., AND HAFFNER, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11, 2278–2323. 
LECUN, Y., CORTES, C., AND BURGES, C.J.C. The MNIST Database. http://yann.lecun.com/exdb/mnist/index.html
LECUN, Y., HUANG, F.J.H.F.J., AND BOTTOU, L. 2004. Learning methods for generic object recognition with invariance to pose and lighting. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. 2. 
LECUN, Y., JACKEL, L., BOTTOU, L., ET AL. 1995. Comparison of learning algorithms for handwritten digit recognition. International Conference on artificial neural networks, 53–60. 
LEE, M., FREED, A., AND WESSEL, D. 1992. Neural networks for simultaneous classification and parameter estimation in musical instrument control. Proceedings of SPIE 1706, 244–255. 
LEMAN, M. 2007. Embodied Music Cognition and Mediation Technology. MIT Press.
LEVIN, G. 2000. Painterly Interfaces for Audiovisual Performance. Master’s thesis, MIT Media Laboratory, 1–151.
MAHENDRAN, A. AND VEDALDI, A. 2014. Understanding Deep Image Representations by Inverting Them.
MCCORMACK, J. AND D’INVERNO, M. 2012. Computers and Creativity: The Road Ahead. Computers and Creativity, 421–424. 
MCCORMACK, J. AND D’INVERNO, M. 2014. On the Future of Computers and Creativity.
MCNEILL, D. AND LEVY, E. 1980. Conceptual representations in language activity and gesture.
METZINGER, T. AND GALLESE, V. 2003. The emergence of a shared action ontology: Building blocks for a theory. Consciousness and Cognition, 549–571. 
MISTRY, P. AND MAES, P. 2009. SixthSense: a wearable gestural interface. ACM SIGGRAPH ASIA 2009 Sketches, ACM. 
MITAL, P.K., GRIERSON, M., AND SMITH, T.J. 2013. Corpus-based visual synthesis. Proceedings of the ACM Symposium on Applied Perception — SAP ’13 July, 51–58. 
MITCHELL, T.M. 1997. Machine Learning. McGraw Hill. 
MITRA, S. AND ACHARYA, T. 2007. Gesture Recognition: A Survey. IEEE Transactions on Systems, Man, and Cybernetics — Part C: Applications and Reviews 37, 3, 311–324.
MNIH, V., HEESS, N., GRAVES, A., AND KAVUKCUOGLU, K. 2014. Recurrent Models of Visual Attention. NIPS, 1–12.
MNIH, V., KAVUKCUOGLU, K., SILVER, D., ET AL. 2013. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv: …, 1–9. 
MORDVINTSEV, A., OLAH, C., AND TYKA, M. 2015. Inceptionism: Going Deeper into Neural Networks. Google Research Blog. http://googleresearch.blogspot.ch/2015/06/inceptionism-going-deeper-into-neural.html
NG, A. 2013. Machine Learning and AI via Brain Simulations. Stanford University. 
NGUYEN, A., YOSINSKI, J., AND CLUNE, J. 2015. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR 2015.
NISSEN, S. 2003. Fast Artificial Neural Network Library. http://leenissen.dk/fann/wp/
NVIDIA. 2015. Deep Learning GPU Training System (DIGITS). https://developer.nvidia.com/digits/
RAINA, R., MADHAVAN, A., AND NG, A.Y. 2009. Large-scale deep unsupervised learning using graphics processors. ICML, 873–880.
ROKEBY, D. 1986. Very Nervous System.
SCHEDEL, M., FIEBRINK, R., AND PERRY, P. 2011. Wekinating 000000Swan: Using Machine Learning to Create and Control Complex Artistic Systems. Proceedings of the International Conference on New Interfaces for Musical Expression, June, 453–456.
SCHMIDHUBER, J. 2014. Deep Learning in Neural Networks: An Overview. arXiv preprint arXiv:1404.7828, 1–66. 
SEARLE, J.R. 1980. Minds, Brains, and Programs. Behavioral and Brain Sciences 3, 1–19. 
SIMONYAN, K., VEDALDI, A., AND ZISSERMAN, A. 2013. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv preprint arXiv:1312.6034, 1–8. 
SIMS, K. 1994. Evolving virtual creatures. Proceedings of SIGGRAPH ’94, July, 15–22.
STURM, B. 2015. Recurrent Neural Networks for Folk Music Generation. https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folkmusic-generation
SZEGEDY, C., LIU, W., JIA, Y., ET AL. 2014. Going Deeper with Convolutions. arXiv preprint arXiv:1409.4842, 1–12. 
THRUN, S., MONTEMERLO, M., DAHLKAMP, H., ET AL. 2006. Stanley: The Robot That Won the DARPA Grand Challenge. Journal of Field Robotics 23, 9, 661–692. 
TODD, S. AND LATHAM, W. 1992. Evolutionary art and computers. Academic Press, Inc. 
TURING, A. 1948. Intelligent Machinery.
TURING, A. 1950. Computing Machinery and Intelligence. Mind 59, 433–460. 
WRIGHT, M. AND FREED, A. 1997. Open Sound Control: A new protocol for communicating with sound synthesizers. Proceedings of the International Computer Music Conference (ICMC). 
ZENG, Z., PANTIC, M., ROISMAN, G.I., AND HUANG, T.S. 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1, 39–58.