The technology behind controversial AI-generated art

Warwick Artificial Intelligence
10 min read · Oct 16, 2023

An article by Derya Çatalyürek.

When you think of art, what image forms in your head? An artist painting a picture in their own style, or an Artificial Intelligence (AI) that has learned every known painting style and uses them to generate unique, breath-taking paintings? The creativity we associate with art makes AI-generated images difficult to grasp, yet the practice is quickly making its way into modern media [1]. In this article we will explore the most popular tools and how they use AI to produce never-before-seen art pieces, some of which are hard to tell apart from human-made work.

An introduction to AI-Generated Art

AI-generated art refers to art created using a variety of Artificial Intelligence algorithms, ranging from simple techniques that apply pre-designed templates or styles to more advanced techniques that use Machine Learning (ML) to generate original artwork. Perspectives on AI-generated art vary widely: some people find it attractive and innovative because of its unique and abstract style, while others believe it lacks the human touch and emotional depth present in art created by human artists.

OpenAI and DALL-E

OpenAI is the company behind many products at the forefront of AI development, most notably ChatGPT, a powerful text-generation model trained on large amounts of real-world data. In January 2021 OpenAI introduced DALL-E [2], a system that can generate images from user-given text prompts. The name DALL-E combines the famous surrealist artist Salvador Dalí with Pixar's animated robot WALL-E. DALL-E produces a range of outputs from whatever the user requests, from original, photo-realistic images to art imitating established painting styles. Alongside its more obvious uses, DALL-E has become a key resource for AI researchers, helping us understand how advanced AI systems see and interpret our world.

DALL-E can take what it has learned from a variety of labelled images and apply it to its work: given a picture of a dolphin, for example, it can infer what a dolphin would look like in different situations or scenarios (Figure 1). If we input a word or situation it has not yet learned, DALL-E can only offer its best guess at what that word might mean. DALL-E also lets you customise images after generation by adding further prompts, allowing you to modify shadows, reflections, and textures.

The applications of DALL-E range far and wide: it is used not only to create art but also in commercial fields such as marketing and advertising. DALL-E makes it possible to create images that are not currently available, and it produces a finished product significantly faster than a human counterpart. It could also be used in the medical field to create detailed images of organs and other internal structures for research and education.

Figure 1 — DALL-E Output when given the prompt “A dolphin reading a book underwater”.
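For readers who want to experiment, here is a minimal sketch of how an image like Figure 1 could be requested programmatically. It assumes the official openai Python SDK; the model name, parameters, and response fields shown are illustrative and may differ between SDK versions.

```python
from openai import OpenAI

# Hedged sketch: request a single image from OpenAI's image API.
# Assumes an OPENAI_API_KEY environment variable is set.
client = OpenAI()

result = client.images.generate(
    model="dall-e-3",                          # example model name
    prompt="A dolphin reading a book underwater",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)                      # link to the generated image
```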

DALL-E possesses a chilling ability to impersonate famous painters, creating artworks in their distinctive styles and placing those styles in scenarios and environments the painter never worked in. It accomplishes this by leveraging deep learning and pattern recognition to analyse the techniques, brushstrokes, and colour palettes of renowned artists, enabling the model to produce paintings eerily similar to the artists' own creations. An example of this ability is given in Figure 2: one image is an actual painting by the famous painter Monet, and the other is DALL-E's output when prompted to imitate Monet's style. Can you guess which one is the original Monet?

Figure 2 — Left: The Monet Family in Their Garden at Argenteuil, Claude Monet (1875). Right: DALL-E output when given the prompt “An oil painting of people in a garden by Monet”.

This capability to analyse and reproduce paintings can be viewed as both exciting and somewhat alarming. DALL-E not only demonstrates the immense potential of AI in the art realm but also invites us to stretch the boundaries between human and machine creativity, ushering in a new era of art. The distinction some people can still draw between art crafted by humans and art crafted by AI is poised to diminish as the technology improves, until perhaps one day all AI-generated art is indistinguishable from human creations.

DALL-E was created by training a Neural Network on images paired with their text descriptions [3]. Neural Networks are loosely modelled on the human brain: in place of biological neurons, they are made of interconnected layers of nodes, each with an associated weight, that pass data forward to make decisions. Deep Learning uses Neural Networks with many hidden layers to tackle complicated problems, and it is through Deep Learning that the model learns what objects are and how they relate to one another. A technique called diffusion modelling (the family of methods behind systems such as Stable Diffusion) then turns this knowledge, together with the key elements of the user's prompt, into an image [4]. A diffusion model starts from random noise and progressively transforms it, guided by the prompt, until it reaches its target: a finished image that should resemble the user's original request.
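As a loose illustration of the "interconnected layers of nodes with weights" described above, here is a toy sketch in Python of a two-layer network computing an output from an input. The sizes and values are arbitrary; real image models apply the same idea at a vastly larger scale.

```python
import numpy as np

# Toy two-layer neural network: each node sums its weighted inputs,
# applies a non-linearity, and passes the result to the next layer.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # first (hidden) layer weights
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer weights

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)       # weighted sum followed by ReLU
    return hidden @ W2 + b2                     # weighted sum at the output node

print(forward(np.array([0.5, -1.0, 2.0, 0.1])))
```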

DALL-E is an example of how humans and intelligent systems can work together on creative applications of AI that empower our ideas and potential. Although the possibilities of this technology seem endless, we cannot yet foresee the consequences of placing such a powerful tool in the hands of malicious people. For this reason, the creators of DALL-E have limited its ability to create violent, hateful or adult content, and state that they are working to prevent the AI from producing photorealistic versions of real faces, including those of famous people. OpenAI hopes to keep DALL-E a safe technology without restricting its original purpose as a tool for unrestricted creativity.

How can an AI create images?

Generative Adversarial Networks

Generative Adversarial Networks (GANs) are the first of the two standard methods used to generate images [5]. A GAN consists of an algorithmic architecture built around two sub-models, or neural networks, that are trained against one another.

The first model in a GAN, known as the generator, creates an image based on the input it is given [6]. The second model, the discriminator, is used to train the first, ensuring that the generated images are similar in style and content to similarly themed images in the training dataset. The discriminator's job is to decide whether each image it is fed is real or fake, i.e. whether or not it was produced by the generator. At first, every generated image is flagged as fake, and this feedback is passed back to the generator to use in future generations, so that over time it produces images that are more realistic and closer to the dataset material. As the generator improves, the discriminator finds it increasingly difficult to tell real from fake. Once the generator is 'realistic', or trained enough, the discriminator is discarded, as it is no longer needed. A minimal training sketch is given below.
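To make this generator/discriminator loop concrete, here is a minimal, hedged sketch of GAN training in PyTorch. The tiny fully connected models, the 28×28 image size, and the optimiser settings are illustrative assumptions, and this is a plain (unconditional) GAN rather than a text-conditioned one.

```python
import torch
import torch.nn as nn

latent_dim = 100

# Generator: maps random noise to a flattened 28x28 "image".
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)
# Discriminator: outputs one logit saying "real" or "fake".
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):                    # real_images: (batch, 784) in [-1, 1]
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Train the discriminator to separate real images from generated ones.
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z).detach()
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator to fool the discriminator into answering "real".
    z = torch.randn(batch, latent_dim)
    g_loss = bce(discriminator(generator(z)), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Once the generator reliably fools the discriminator, only the generator is kept for producing new images, as described above.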

An example of such a model is ThisPersonDoesNotExist [7], a website that generates the face of someone who 'does not exist' by learning facial structure from a large dataset of real people [8].

Diffusion Models

AI models are often very difficult to train, and a network trained on a small dataset will often produce the same images over and over. A network that produces the same images each time is neither interesting nor beneficial for the user, so randomness is needed to make sure each result is unique.

This raises a new problem: how do you generate a dataset large enough to train such a network? Diffusion models aim to solve this issue with an entirely different approach, by employing noise [9]. Adding noise is the process of deliberately altering pixels so that they differ from what they previously represented, and this randomness helps ensure that each generated image is distinct. In diffusion modelling, noise is added to an image until its key identifiable features are removed [10]. The model then attempts to recover the data by reversing the noise-adding process, known as 'denoising'. At the start of training, only a small amount of noise is added, making it easy for the model to denoise. Over time more noise is added, until eventually the model can recover the original image from one containing only random pixels.
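As a minimal sketch of the noise-adding half of this process, the snippet below repeatedly perturbs the pixel values of an image until nothing recognisable remains. The image here is a random stand-in, and the step count and noise size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))          # stand-in for a real photo, values in [0, 1]

noisy = image.copy()
for step in range(1000):
    # Add a little Gaussian noise to every pixel at each step.
    noisy += rng.normal(0.0, 0.02, size=noisy.shape)
noisy = np.clip(noisy, 0.0, 1.0)         # after many steps, essentially random pixels
```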

There are different strategies for how much noise to apply to the original image, such as increasing the noise progressively (a linear schedule) or starting with small amounts and adding most of it at the very end. To map an amount of noise to each step, we use a schedule: given an image at time t = 0, each time step up to t = T is assigned an amount of noise according to the chosen strategy. We can then easily produce the image with the right amount of noise at any stage t in the schedule.
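Written out, a schedule only takes a few lines. The sketch below uses a linear schedule and the standard closed-form shortcut from the DDPM formulation of diffusion models to jump straight to the noisy image at any step t; the specific constants are illustrative assumptions.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # noise added at each step, increasing linearly
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # fraction of the original image surviving by step t

def noisy_at(image, t, rng=np.random.default_rng(0)):
    eps = rng.normal(size=image.shape)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    return np.sqrt(alpha_bars[t]) * image + np.sqrt(1.0 - alpha_bars[t]) * eps
```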

To visualise the noise-adding process, see Figure 3, where Gaussian noise is used to add randomness to an image. This method uses a mathematical function to add random values to the input data, in this case the colour value of each pixel [11]. On the left of the figure, Gaussian noise is added to an image to add variety, the first step in diffusion modelling. On the right we can see the process of 'denoising', as the model removes noise from a random image until it resembles the target image. Understanding how noise affects images is essential for developing effective ways of removing it again.

Figure 3 — Gaussian noise being added to an image of a classroom environment (left) and noise being removed from random pixels to depict the same image (right).

The aim now is to train a network that can undo this process. Intuitively, it is easier to remove the noise step by step: taking a noisy image, the network predicts how much noise was added by comparing against its estimate of the original image, then removes just enough noise to get back to the previous step. This process is iterated many times, moving closer and closer to the original image.
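A hedged sketch of this step-by-step denoising loop is shown below, assuming the linear schedule from earlier and a trained noise-prediction network; the noise_predictor function is hypothetical and stands in for whatever network was trained to estimate the added noise.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample(noise_predictor, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                      # start from pure random noise
    for t in reversed(range(T)):
        eps_hat = noise_predictor(x, t)         # predict the noise present in x_t
        # Remove the predicted noise to estimate the previous, less noisy step.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject a little noise
    return x                                    # should now resemble a plausible image
```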

To generate images based on text input, as DALL-E does, we condition this network by also giving it access to the text, which is embedded using a GPT-3-style Transformer embedding [12]. The transformer takes the input text and uses its layers to capture each word's meaning in context, 'transforming' the text into a numerical representation. During training and generation, this embedding guides the network, so that noise is added and removed with respect to an image that matches the prompt rather than a fixed target image.
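The article describes a GPT-3-style transformer embedding; as an open-source stand-in, the sketch below embeds a prompt with a CLIP text encoder through Hugging Face's transformers library (a related but different encoder, used here purely to show what 'embedding the text' produces). The conditioned noise_predictor call at the end is hypothetical and marks where the embedding would feed into the denoiser.

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a bowl of fruits by a window"], padding=True, return_tensors="pt")
embeddings = text_encoder(**tokens).last_hidden_state   # one vector per token
print(embeddings.shape)

# A text-conditioned denoiser would receive these vectors at every step, e.g.
# eps_hat = noise_predictor(x_t, t, embeddings)          # hypothetical conditioned model
```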

In summary: we start with a clean target image and progressively add noise to it until the original can no longer be detected. The noise is then gradually removed from this seemingly random image until we arrive back at what the network thinks the target image was. The result rarely matches the target perfectly; it shares similar features and themes, which gives AI image generation a sense of randomness that other methods do not possess.

If we take a look at Figure 4, which shows DALL-E's output for the prompt "a bowl of fruits by the window", we can observe how each generated image shares a common theme while possessing its own characteristics. This effectively demonstrates the randomness involved in generative AI: the outputs exhibit different lighting conditions, influenced by factors such as the position of the sun or the perspective chosen for capturing the scene.

Figure 4 — DALL-E output when given prompt: “a bowl of fruits by a window”. Displaying four different images that represent the same prompt.
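One way to reproduce this kind of variety yourself is with an open-source text-to-image pipeline. The sketch below uses the Hugging Face diffusers library; the checkpoint name is one common example, a CUDA GPU is assumed, and the same prompt is run with four different seeds to produce four related but distinct images.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a bowl of fruits by a window"
for seed in range(4):
    generator = torch.Generator("cuda").manual_seed(seed)  # different seed, different image
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"fruit_bowl_{seed}.png")
```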

Conclusion

As the technology grows, the quality of AI-generated art will depend more on the specific algorithms and techniques used to create it than on the dataset it is trained upon.

As with any art form, some AI-generated art will be more successful than others at capturing the viewer's attention and evoking an emotional response; nevertheless, the fundamental element of art is the feelings it awakens in a person. These feelings are different for everyone; no two people feel exactly the same emotions looking at a painting, and so artists do not have a single, specific target audience.

Regardless of one's personal opinion of AI-generated art, it is important to recognise that it represents a rapidly developing field with the potential to revolutionise the world of the creative arts. Whether we embrace and harness these technological advancements for personal growth and education is a matter of individual choice as we navigate their integration into our lives.

References

[1] McFadden, C. (2023). The rise of AI art: What is it, and is it really art? [online] Interesting Engineering. Available at: https://interestingengineering.com/culture/what-is-ai-generated-art.

[2] OpenAI (2022). DALL-E 2. [online] Available at: https://openai.com/dall-e-2.

[3] Ryan O'Connor (2023). How DALL-E 2 Actually Works. [online] AssemblyAI. Available at: https://www.assemblyai.com/blog/how-dall-e-2-actually-works/.

[4] Kevin Pocock (2023). Does Dall-E Use Stable Diffusion? [online] PC Guide. Available at: https://www.pcguide.com/apps/does-dall-e-2-use-stable-diffusion/.

[5] Kundu, R. (2022). AI-Generated Art: From Text to Images & Beyond [Examples]. [online] v7labs. Available at: https://www.v7labs.com/blog/ai-generated-art.

[6] Jason Brownlee (2019). A Gentle Introduction to Generative Adversarial Networks (GANs). [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/.

[7] Phillip Wang (2019). This Person Does Not Exist. [online] Available at: https://thispersondoesnotexist.com/

[8] Danny Paez (2019). ‘This Person Does Not Exist’ Creator Reveals His Site’s Creepy Origin Story. [online] Inverse. Available at: https://www.inverse.com/article/53414-this-person-does-not-exist-creator-interview

[9] Sabre PC (2023) GANs vs Diffusion Models — Generative AI Comparison. [online] Available at: https://www.sabrepc.com/blog/Deep-Learning-and-AI/gans-vs-diffusion-models

[10] Vivek Muppalla, Sean Hendryx (2022) Diffusion Models: A Practical Guide. [online] Scale. Available at: https://scale.com/guides/diffusion-models-guide

[11] AI TutorMaster (2023) What is Gaussian Noise in Deep Learning? How and Why is it used? [online] PlainEnglish. Available at: https://plainenglish.io/blog/what-is-gaussian-noise-in-deep-learning-how-and-why-it-is-used

[12] Baeldung CS (2023) Transformer Text Embeddings. [online] Available at: https://www.baeldung.com/cs/transformer-text-embeddings


Warwick AI
Warwick Artificial Intelligence

A society-run blog on all things artificial intelligence, written and edited by a team of researchers from Warwick AI at the University of Warwick.