Why AI has made me the best artist I’ll ever be

Amelia Woodward
Published in Amelia's blog
Jan 23, 2023 · 5 min read

DALL-E 2 has wowed the internet by allowing anyone to generate truly lifelike or artistic images from a text input. It is a leading example of recent breakthroughs in AI that produce new creative material when prompted, a field you may have heard buzzing across the internet: generative AI. DALL-E 2 is not the only generative text-to-image application out there, either: Midjourney and Nightcafe, as well as releases by Stability AI and Scale AI, are just some examples of how generative AI programs are being commercialized and delivered to a growing number of adopters.

A few things I generated using DALL-E 2’s beta

These technologies are revolutionizing the way we create. Artists and non-artists alike can generate beautiful images quickly and easily, simply by typing a description of what we want to produce and clicking a button. Similar capabilities are already available, or in development, for music, video, text, game avatars and even slide design. And it's going to become increasingly difficult to tell what was made by a human hand versus rendered by AI.

Another artwork using DALL-E 2’s beta

On one hand, proponents argue that generative AI can set the world's imagination free. As these algorithms become increasingly tailored to different contexts, artists and marketers (or anyone!) will be able to use AI to quickly generate media to spark and explore ideas, to build vision boards, and as a basis for further creation.

As with many emerging technologies, there is also a set of risks and ethical questions, along with new questions for lawmakers and regulators. Without guardrails in place, many generative models could produce (and are producing) violent or pornographic content. (For more in-depth examples of the risks, see this article on The Verge.)

Generative algorithms may even directly copy (or almost exactly copy) images that already exist on the internet, raising copyright concerns and issues for artists whose style is now somewhat recreatable by a machine. These models are also far from perfect, with limitations in composing multiple disparate ideas together and issues generating signs with text or scenes with many spatial constraints.

Want a very high-level look at how the AI works? Read on.

DALL-E 2 takes a text input and generates an image as output.

A very high level overview of how DALL-E 2 works
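If it helps to see the whole pipeline at a glance, here's a toy Python sketch of the pieces I'll describe below. The function bodies are made-up stand-ins (not OpenAI's code or API); only the shape of the pipeline is the point.

```python
import numpy as np

# Made-up stand-ins for the real trained components, just to show the flow.
def clip_text_encoder(prompt: str) -> np.ndarray:
    # Real CLIP maps text to a learned embedding; here we just derive numbers from the prompt.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    # DALL-E 2's "prior" predicts an image embedding that matches the text embedding.
    return text_embedding + 0.05 * np.random.standard_normal(text_embedding.shape)

def diffusion_decoder(image_embedding: np.ndarray) -> np.ndarray:
    # The decoder (a diffusion model) turns that embedding into pixels; here, a random 64x64 image.
    return np.random.rand(64, 64, 3)

def generate_image(prompt: str) -> np.ndarray:
    text_embedding = clip_text_encoder(prompt)     # text -> embedding
    image_embedding = prior(text_embedding)        # embedding -> image embedding
    return diffusion_decoder(image_embedding)      # image embedding -> pixels

print(generate_image("a carrot dancing").shape)  # (64, 64, 3)
```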

A few key factors are making generative AI work.

(1) Recent breakthroughs in diffusion models. (Yes, these methods are named after the same diffusion you learnt about in high school chemistry.) Diffusion models are trained to reconstruct a picture after noise (random mathematical interference that makes it harder to 'see' the original image) has been added to it.
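To make that concrete, here's a tiny NumPy sketch of the 'noising' half of the idea (an illustration only, not DALL-E 2's actual training code):

```python
import numpy as np

def add_noise(image, noise_level):
    """Forward diffusion step: blend the image with Gaussian noise.

    noise_level runs from 0 (the original image) to 1 (pure noise).
    A diffusion model is trained to undo exactly this kind of corruption.
    """
    noise = np.random.randn(*image.shape)
    noisy = np.sqrt(1 - noise_level) * image + np.sqrt(noise_level) * noise
    return noisy, noise

image = np.random.rand(64, 64)  # a toy 64x64 "image"
for level in (0.1, 0.5, 0.9):
    noisy, noise = add_noise(image, level)
    # During training, the model sees `noisy` and learns to predict `noise`
    # (or the clean image), so it can later reverse the process step by step.
    print(f"noise level {level}: pixel std = {noisy.std():.2f}")
```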

DALL-E 2 and other applications have tapped into diffusion models and applied some redirection to optimize for something close to what you ask for in your text input. At a very high level, they take a 'draft' of an image and apply diffusion, but steer the denoising so the image is reconstructed into something close to what you're asking for (e.g., given the prompt "a carrot dancing", DALL-E 2 will take a 'draft' from its existing 'memory' of images and then apply additional context to push it in the direction of what you're asking for). The specific algorithm that DALL-E 2 uses for this is called GLIDE.
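And here's a toy sketch of the 'redirection' part: one text-guided denoising step in the spirit of the classifier-free guidance explored in the GLIDE paper. The prediction function below is a made-up stand-in for the real trained network.

```python
import numpy as np

def predict_noise(noisy_image, text_embedding=None):
    """Made-up stand-in for a trained denoising network (illustration only).

    A real model would return its estimate of the noise in `noisy_image`,
    optionally conditioned on an embedding of the text prompt.
    """
    rng = np.random.default_rng(0 if text_embedding is None else 1)
    return 0.1 * rng.standard_normal(noisy_image.shape)

def guided_denoise_step(noisy_image, text_embedding, guidance_scale=3.0):
    """One text-guided denoising step.

    The update is nudged toward the text-conditioned prediction, which is how
    the reconstruction gets pushed in the direction of the prompt.
    """
    uncond = predict_noise(noisy_image)                 # prediction without the text
    cond = predict_noise(noisy_image, text_embedding)   # prediction with the text
    guided = uncond + guidance_scale * (cond - uncond)  # lean toward the prompt
    return noisy_image - guided                         # remove the predicted noise

noisy = np.random.randn(64, 64)
text_emb = np.random.randn(512)  # pretend embedding of "a carrot dancing"
less_noisy = guided_denoise_step(noisy, text_emb)
```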

You're probably asking… how could adding noise and a bit of redirection possibly make something so sophisticated? A large piece of the answer is (in addition to many state-of-the-art machine learning methods)…

(2) A huge amount of data

DALL-E 2 was trained on hundreds of millions of text-image pairs, as described in its publicly released paper (https://cdn.openai.com/papers/dall-e-2.pdf). As another example, Stability AI trained its Stable Diffusion model on LAION-5B, a dataset of roughly 5 billion image-text pairs scraped from the web (https://laion.ai/blog/laion-5b/).

One thing I mentioned above is that some of these systems have been generating content very similar to existing images on the internet, raising copyright concerns. This happens because the model has been exposed to so many examples during training that its best generation may be close to something it has 'seen' before, especially if the text you enter is generic, like "cat" or "flight attendant", and likely appeared in the training dataset in some capacity. Part of the reason DALL-E 2 has such great 'memory and recall' of what it's seen is that, as part of the algorithm, it has a…

(3) Really great ability to match images with text.

For DALL-E 2, the algorithm used is called CLIP (Contrastive Language-Image Pre-training). CLIP was trained on a dataset of images and their text captions to find the best match between each text and image input.

The "contrastive" part of its name refers to how this training differentiates between subtly different inputs (e.g., given text and images for "a cat" and "a cat cooking", it tries to make sure each caption is mapped to its most relevant image, even though it would still be technically correct to map the phrase "a cat" to the image of a cat cooking).
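If you like code, here's a toy sketch of what that matching looks like: embed the images and captions, compute similarities, and check that each caption's best match is its own image. The numbers below are made up; real CLIP embeddings come from trained image and text encoders.

```python
import numpy as np

def cosine_similarity_matrix(image_embs, text_embs):
    """Pairwise cosine similarities between image and text embeddings."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return image_embs @ text_embs.T

# Pretend embeddings for two image-caption pairs: "a cat" and "a cat cooking".
rng = np.random.default_rng(42)
image_embs = rng.standard_normal((2, 8))
text_embs = image_embs + 0.1 * rng.standard_normal((2, 8))  # captions land near their images

sims = cosine_similarity_matrix(image_embs, text_embs)
# Contrastive training pushes the diagonal (correct pairs) to be the highest
# value in each row and column, so "a cat" matches its own image best even
# though an image of "a cat cooking" also contains a cat.
print(np.round(sims, 2))
print("best image for each caption:", sims.argmax(axis=0))
```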

What is the most exciting use case you can think of for generative AI? What concerns you most? What do you want to understand better? I'm keen to share more posts diving deeper into these technologies and explaining the mechanics more precisely than in this article.

All opinions expressed are my own and not of my employer or external affiliations.
