AIArt: A Neural Network Perspective on Art Style

Hung Giang
Bucknell AI & CogSci
10 min read · Nov 7, 2019

Since the turn of the 19th century, artistic style has been considered the defining characteristic of an artist. When people think of Van Gogh, they think of long, thick brush strokes and massive swirls; when they think of Monet, they think of hazy purple sunrises; and so on. This style is often considered to be qualitatively different from the content an artist produces, and it is often treated as a unique and irreproducible aspect of their art. With the advent of computer vision, however, this is not necessarily the case. Researchers in the field have been using neural networks not only to quantify style, but also to produce a scheme that can recreate the style of a painting without carrying over any of its content (think, for example, of a program which paints the ideal Van Gogh style from Starry Night onto a canvas without any of the houses or hills).

This research brings with it a collection of ethical and philosophical implications which can be difficult to resolve. If a neural network is able to imitate style, what is preventing an artist from recreating this recreation on a canvas and committing artistic fraud? Furthermore, if a CNN can capture a quantified sense of style, can a properly developed AI system be called an artist if it develops an "original" style? These are the questions we will be examining in this article.

1. Neural Networks:

A neural network is a collection of connected and tunable units (a.k.a. nodes, neurons, or artificial neurons) which can pass a signal (usually a real-valued number) from one unit to another. The number of (layers of) units, their types, and the way they are connected to each other is called the network architecture. Neural networks as a class of machine learning models have recently gained a lot of attention due to the availability of Big Data and fast computing facilities.

Fig 1. Different types of Neural network

"Neural network" is a general term for this class of models, which consists of different types of network architectures (Fig 1). Each unit (node or neuron) in the architecture has weight values, loosely patterned after the neurons of a human brain. A node combines input from the data with its weights, which either amplify or dampen that input, thereby assigning significance to each input with regard to the task the algorithm is trying to perform; this determines which inputs are most helpful in achieving the goal of the task. The combined input is then "activated" through an activation function, and the result is sent to the nodes in the next layer, where the same thing happens, layer after layer, until the last layer.
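As a concrete illustration of this weighted-sum-then-activate step, here is a minimal sketch in Python/NumPy; the layer size, weights, and input values are made up for illustration and are not part of our actual model:

```python
import numpy as np

def relu(x):
    # A common activation function: keep positive values, zero out negatives.
    return np.maximum(0, x)

# One layer of 3 neurons receiving a 4-dimensional input signal.
x = np.array([0.2, -1.0, 0.5, 0.7])   # input from the data
W = np.random.randn(3, 4) * 0.1       # one tunable weight per connection
b = np.zeros(3)                       # one bias per neuron

# Each neuron combines the input with its weights, then "activates" the sum;
# the resulting vector is what gets passed on to the next layer.
output = relu(W @ x + b)
print(output)
```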

A convolutional neural network (CNN) is a type of neural network with architectural constraints that reduce computational complexity and ensure translational invariance. This means that the algorithm will interpret an input pattern the same way regardless of where it appears in the input. If a general neural network is, loosely speaking, inspired by a human brain, the convolutional neural network is inspired by the visual cortex. CNNs have several architectural features that distinguish them from other types of neural networks:

- Local Connectivity: Neurons in one layer are only connected to neurons in the next layer that are spatially close to them. This design trims the vast majority of connections between consecutive layers, but keeps the ones that carry the most useful information. In the context of image processing, the relationship between two distant pixels is much less important than that between two nearby pixels.

- Shared Weights: there exist one or more layers of convolutional neurons, and the neurons within each such layer share the same weights. This is equivalent to sliding a filter over the image, differentiating between distinct features.

- Pooling Layers: Pooling layers consider a block of input data and simply pass on the maximum value. Doing this reduces the size of the output and requires no added parameters to learn, so pooling layers are often used to regulate the size of the network and keep the system below a computational limit.

These additional features make convolutional neural networks a natural fit for image and video processing. For example, an image of size M×N is stored on a computer as a discrete set of M*N*3 values, each pixel holding a triplet of integer values between 0 and 255 representing the red, green, and blue color channels. Viewed in this way, images are data, and data can be fed into a neural network, producing outputs for each layer and ultimately a network output which is commonly used for regression and classification tasks. But images differ from data in general in a key way: the values a pixel takes on are not independent, but rather exhibit a spatial dependency (e.g. pixels in a picture of an apple are red because their neighbors are red, and together the pixels form a continuous image). Standard feed-forward networks cannot account for this spatial dependency because their inputs are 1-dimensional vectors; convolutional neural networks were developed to solve this issue by passing in data as a matrix and computing each position in the activation matrix not as a function of all inputs, but as a function of data within a neighborhood of that position.
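To make local connectivity, shared weights, and pooling concrete, the sketch below applies one small convolutional layer and one pooling layer to a randomly generated "image". PyTorch is assumed here purely for illustration; the same ideas apply in any deep learning framework:

```python
import torch
import torch.nn as nn

# A stand-in RGB image: batch of 1, 3 color channels, 224x224 pixels.
image = torch.rand(1, 3, 224, 224)

# Local connectivity + shared weights: each output value depends only on a
# 3x3 neighborhood of the input, and the same 16 filters are reused everywhere.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Pooling: keep only the maximum value in each 2x2 block, halving the spatial size.
pool = nn.MaxPool2d(kernel_size=2)

features = pool(torch.relu(conv(image)))
print(features.shape)  # torch.Size([1, 16, 112, 112])
```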

For our project, we use a pre-trained neural network model called VGG-19. VGG is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford. The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes.
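A pre-trained copy of VGG-19 ships with common deep learning libraries. The sketch below loads it through torchvision and freezes its weights, since style transfer optimizes the image rather than the network; this framework choice is an assumption for illustration, not necessarily what our project used:

```python
import torch
from torchvision import models

# Load VGG-19 with weights pre-trained on ImageNet; only the convolutional
# feature extractor is needed, not the final classification layers.
vgg = models.vgg19(pretrained=True).features.eval()

# Freeze the network: in style transfer the image is optimized, not the weights.
for param in vgg.parameters():
    param.requires_grad_(False)
```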

2. Algorithm:

Our algorithm follows closely the algorithm sketched out in the paper "Image Style Transfer Using Convolutional Neural Networks" (Gatys et al., 2016 IEEE Conference on Computer Vision and Pattern Recognition).

When networks like VGG are trained on object recognition, they develop a representation of the image that makes object information increasingly explicit along the processing hierarchy. Therefore, along the processing hierarchy of the network, the input image is transformed into representations that increasingly capture the actual content of the image rather than its detailed pixel values. We can directly visualize the information each layer contains about the input image by reconstructing the image only from the feature maps in that layer. Higher layers in the network capture the high-level content in terms of objects and their arrangement in the input image but do not constrain the exact pixel values of the reconstruction. In contrast, reconstructions from the lower layers simply reproduce the exact pixel values of the original image (Fig 2, content reconstructions a, b, c). We therefore refer to the feature responses in higher layers of the network as the content representation.
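In practice, the content representation is just the feature maps of one chosen higher layer, and the content loss is a mean squared error between those maps for the content image and for the image being synthesized. A minimal sketch, continuing from the torchvision VGG-19 loaded above and assuming conv4_2 (index 21 in torchvision's layer stack) as the content layer, as in Gatys et al.:

```python
import torch
import torch.nn.functional as F

def features_at(vgg, image, layer_index):
    """Run an image through VGG and return the feature maps of one layer."""
    # 'vgg' is the frozen torchvision VGG-19 feature stack loaded earlier.
    x = image
    for i, module in enumerate(vgg):
        x = module(x)
        if i == layer_index:
            return x

CONTENT_LAYER = 21  # conv4_2 in torchvision's VGG-19 (assumed layer choice)

# Placeholder tensors; in practice these are the loaded, normalized photographs.
content_image = torch.rand(1, 3, 224, 224)
hybrid_image = torch.rand(1, 3, 224, 224)

content_features = features_at(vgg, content_image, CONTENT_LAYER)
hybrid_features = features_at(vgg, hybrid_image, CONTENT_LAYER)

# Content loss: mean squared error between the two sets of feature maps.
content_loss = F.mse_loss(hybrid_features, content_features)
```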

Fig 2. Content and style of an image

To obtain a representation of the style of an input image, we use a feature space originally designed to capture texture information. This feature space is built on top of the filter responses in each layer of the network. By including the feature correlations of multiple layers, we obtain a stationary, multi-scale representation of the input image, which captures its texture information but not the global arrangement.
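The feature correlations referred to here are Gram matrices: within each chosen layer, every feature map is correlated with every other one, which keeps the texture statistics but discards the global arrangement. A minimal sketch under the same PyTorch assumption, reusing the vgg, features_at, and hybrid_image defined above; the layer index is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: (batch, channels, height, width) feature maps from one layer.
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)       # flatten each feature map
    gram = flat @ flat.transpose(1, 2)      # correlate every map with every other
    return gram / (c * h * w)               # normalize by the number of values

STYLE_LAYER = 0  # conv1_1 in torchvision's VGG-19; one of several style layers
style_image = torch.rand(1, 3, 224, 224)   # placeholder for the style painting

style_gram = gram_matrix(features_at(vgg, style_image, STYLE_LAYER))
hybrid_gram = gram_matrix(features_at(vgg, hybrid_image, STYLE_LAYER))

# Style loss at a single layer; the full style loss sums this quantity over
# several layers (conv1_1 through conv5_1 in Gatys et al.).
layer_style_loss = F.mse_loss(hybrid_gram, style_gram)
```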

Fig 3. Loss function and Gradient descent

We use VGG to train a "hybrid" image, initialized as an image of random noise, to produce a final image that minimizes the content and style loss functions. The content loss function and the style loss function can be understood simply as the error relative to the original content and style representations, respectively. The training goal for our algorithm is to reduce the mean squared total error of the two loss functions, converging to an image whose content is similar to that of one source image and whose style resembles that of another.
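Putting the pieces together, the optimization is ordinary gradient descent, except that the "parameters" being updated are the pixels of the hybrid image rather than network weights. A compressed sketch that reuses the helpers defined above; the optimizer (Adam, where Gatys et al. used L-BFGS), the iteration count, and the weights alpha and beta are illustrative choices:

```python
import torch
import torch.nn.functional as F

# Start from random noise and let the optimizer update the pixels directly.
hybrid_image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([hybrid_image], lr=0.05)

alpha, beta = 1.0, 1e3  # relative weights of the content and style losses

for step in range(500):
    optimizer.zero_grad()
    # Content loss: MSE between feature maps at the content layer.
    c_loss = F.mse_loss(features_at(vgg, hybrid_image, CONTENT_LAYER),
                        features_at(vgg, content_image, CONTENT_LAYER))
    # Style loss: MSE between Gram matrices at a style layer (summed over
    # several layers in the full algorithm).
    s_loss = F.mse_loss(gram_matrix(features_at(vgg, hybrid_image, STYLE_LAYER)),
                        gram_matrix(features_at(vgg, style_image, STYLE_LAYER)))
    total_loss = alpha * c_loss + beta * s_loss
    total_loss.backward()   # gradients flow back to the pixels of hybrid_image
    optimizer.step()
```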

3. Results:

Fig 4. Content Image
Fig 5. Style Image
Fig 6. Result after 50 iterations

After minimizing the loss function for a few iterations, we begin to see the style form on the "hybrid" image, our initially blank canvas. As the algorithm runs more iterations, the canvas is filled more and more with the content of the content image rendered in the style of the style image. The earliest iterations are mostly random noise, but the image gradually takes shape and becomes recognizable. After enough iterations, the once-blank canvas shows the content of the content image in the style of the style image.

4. Ethical Concerns:

a. Social Implication:

Deepfakes are one product of image processing and machine learning. They make it easy to combine a face with a video, replacing the face in the source video with one supplied through images. The power of deepfakes has recently raised worries among users of various discussion forums (Reddit, 4chan, and Stack, to name a few).

Fig 7. Nvidia Face Generation Technology

Trouble arises precisely because deepfakes are easy to use while at the same time being very robust and versatile tools. Anyone, from anywhere, with any agenda and any level of knowledge, is capable of making one. Much as it is not the knife's fault when someone is injured, a deepfake only acts as an enabler for bad deeds. So what harm exactly does it enable?

Firstly, we consider where it all started. The Reddit user u/deepfake used images of popular celebrities, of course without their consent or knowledge, to create pornographic videos of them. These videos are damaging to the reputations of the celebrities in question and may well be regarded as defamation. In the now-banned r/Deepfakes subreddit, where these videos usually ended up, I saw comments about using the technique to satisfy personal fantasies involving celebrities, high-school sweethearts, or ex-girlfriends. Such products can be used as tools of blackmail or sabotage against the victim. Even when the creator has no intention of doing harm, merely discovering the disturbing videos can inflict psychological damage on the victims and disrupt their everyday lives.

The harm does not stop at the level of the individual. What if the targets of these fake products are political figures or other well-known people? Even if a video is not defamatory, it still carries fabricated information. A deepfake video has the potential to spread misleading information easily, causing confusion and chaos among viewers, especially in today's society, where information can be broadcast far and fast and few people have the time or care enough to fact-check everything on the internet. People will be able to find sophisticated ways to manipulate the masses, causing serious damage to public safety and security.

b. Human implication:

One can argue that these tools are born from our creativity, and most of them actually promote further exploration. That is an agreeable statement in the context of research. But what about the general public? Consider the Semantic Image Synthesis technology, developed by researchers at Nvidia and based on GauGAN, a type of CNN:

With just a few strokes of the brush, everyone has the skill of an artist. People without an innate talent for drawing will be able to produce amazing paintings and freely express their ideas. Whether this is an enabling tool, or instead a tool that discourages training and creativity, is a question garnering contrasting responses.

c. Artificial Intelligence implication:

The last implication can be stated as a set of philosophical questions: What is originality? At what point can an AI be considered capable of creating its own art? AI systems are already capable of generating their own drawings. In October 2018, an art piece generated by a GAN was offered at Christie's auction house and sold for $432,500.

Fig 8. The first AI-generated painting to be put up for auction.

The painting is one of a group of portraits of the fictional Belamy family created by Obvious, a Paris-based collective consisting of Hugo Caselles-Dupré, Pierre Fautrel, and Gauthier Vernier. In the 1850s, when the camera was first introduced to the public, many criticisms were made, all focused on the claim that the invention would eliminate art itself. Now we have long since moved past replicating the process and are trying to replicate creativity itself. In the future, will AI be treated as a tool for artists to further their talent, or will there be a point where the technology becomes sophisticated enough to consider AIs artists themselves? Will we be able to recognize such sophistication? Who will have rights over the art created by these AIs?

5. Conclusion

Starting from the initial questions of what art is and how we can understand the style of an art piece, we were able to successfully transfer the style of one image onto another. We also pondered many ethical concerns with image processing using machine learning in general.

This project is still in its infancy. Further research can be done to improve the quality of the image, or to apply style transfer not only to images but to video as well.

References:

  • Karen Simonyan and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." arXiv, 10 Apr. 2015, arxiv.org/abs/1409.1556
  • Leon A. Gatys et al. "Image Style Transfer Using Convolutional Neural Networks." 2016 IEEE Conference on Computer Vision and Pattern Recognition.
  • D. Güera and E. J. Delp. "Deepfake Video Detection Using Recurrent Neural Networks." 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 2018, pp. 1–6.
  • Taesung Park et al. "GauGAN: Semantic Image Synthesis with Spatially Adaptive Normalization." ACM Digital Library, ACM, 28 July 2019, dl.acm.org/citation.cfm?id=3332370
  • Tero Karras, Samuli Laine, and Timo Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks." arXiv, 12 Dec. 2018, arxiv.org/abs/1812.04948
