Against Magic — Part One

AI as a Strong Signals Machine

A guide to Generative AI for non-machine learning people

Katharina Köth
Creative Complexity
15 min read · Jan 14, 2024


Table of Contents

  1. First, you curate a pile of content
  2. Then you train the model
  3. At last, you can work with it

Note: all AI-generated content in this article is unedited output.

At the end of 2023, Midjourney released the 6th version of its proprietary, self-titled image generation model. It was promised to be even better, and “better” in this case means: more precise, accurate and realistic in its image generation. With the release, many examples of the model’s performance were posted to the subreddit r/Midjourney.

One post — titled “V6 is amazing” — depicts a series of fake movie posters promoting the imaginary Netflix production “LENIN”, starring Leonardo DiCaprio.

These were the movie posters:

OP’s prompt: balding Leonardo DiCaprio as Lenin, soviet movie poster, 1920s, “LENIN”, “NETFLIX” --ar 2:3 --style raw --v 6.0 --s 300

Pretty convincing, right? We can clearly see DiCaprio but with recognizable features of Lenin, most notably his attire of a 3-piece suit and coat, stern look, beard and balding head.

Additionally, Midjourney not only generated correct spellings of Lenin and Netflix, something the model’s previous versions struggled with, but also faithfully recreated the Netflix brand logo (in the lower right corner of the third image).

Side note: All ethical & legal considerations aside, I find it absolutely remarkable that engineers were capable of building AI models to that level of realism. It’s an achievement more than 60 years in the making.

One commenter asked the OP if they could also generate images of Nick Offerman as Stalin; the OP posted the results in the thread.

Image generations provided by the original thread poster

For reference, this is what we know Nick Offerman to look like:

left: Nick Offerman today, middle: during the filming of “Parks & Recreation”, right: shaved after “Parks & Rec”

As you can see, the “Stalin” posters show little to no resemblance to Offerman.

Inspired by this request and just for fun, I asked Midjourney to turn Matthias Schweighöfer (probably one of the most recognizable German actors, appearing in international productions such as Oppenheimer and Inglourious Basterds) into Lenin, and myself into Catherine the Great, both using adaptations of the reddit poster’s original prompt.

Prompt: {balding Matthias Schweighöfer as Lenin, Katharina Köth as Catherine the Great}, soviet movie poster, 1920s, “LENIN”, “NETFLIX” --ar 2:3 --style raw --v 6.0 --s 300

And for reference, here is what we look like:

left: Matthias Schweighöfer, right: the photo I’m using on my socials

I guess it’s obvious why Midjourney works so well with Leonardo DiCaprio as a prompt, less well with Nick Offerman, and not at all with Matthias Schweighöfer and myself. I still want to spell it out: We’re not that famous, and I’m not really a public person at all.

Looking at Getty Images, their editorial search will find:

  • 34,570 photos of Leonardo DiCaprio
  • 4,821 photos of Nick Offerman
  • 251 photos of Matthias Schweighöfer
  • 0 photos of me, Katharina Köth

The basic rule of thumb is: The more present a person is, the more recognizably they can be recreated by Generative AI. Or in more technical terms: The more present information is in a training set, the more accurately it can be recreated. Let’s check in with a more basic prompt.

Prompt: {Leonardo diCaprio, Nick Offerman, Matthias Schweighöfer, Katharina Köth}, head shot --v 6.0

Comparing these head shots, you can clearly see that the four images of Leonardo DiCaprio resemble him and almost look like official, photorealistic head shots. The generated images have high consistency and family resemblance with just a little difference in aging.

The images of Nick Offerman still look like him, but within the grid of four, his facial features are visibly less consistent — his face shape, haircut, nose and eyes change. In comparison, they could be four brothers, but these four images are not the same person.

The images for Matthias Schweighöfer and me are just generic white people with mostly model-like facial features. No surprise here since the “head shot” is a portrait format most commonly used by models and actors.

So, let’s put the four of us into different settings.

Prompt: {Leonardo DiCaprio, Nick Offerman, Matthias Schweighöfer, Katharina Köth} as royalty --v 6.0
Prompt: {Leonardo DiCaprio, Nick Offerman, Matthias Schweighöfer, Katharina Köth} as a construction worker --v 6.0
Prompt: {Leonardo DiCaprio, Nick Offerman, Matthias Schweighöfer, Katharina Köth} as a pastry chef, standing behind the counter of a bakery, kawaii, in the style of wes anderson --style raw --v 6.0

The more you play around with these prompt variations, the better you can grasp how much reference material is needed not just to create a realistic image but to recreate recognizable elements, let alone people.

Overall, the images of DiCaprio remain quite consistent, while Offerman becomes less realistic, except for his consistently greying beard. For Matthias and myself, there is no consistency or resemblance at all.

The pastry chef prompt is especially remarkable. It’s a complex instruction, asking the model not only to generate a specific person but also two distinct styles: Wes Anderson representing symmetric compositions with retro looks, and kawaii to emphasize pastel color schemes. It’s so complex that you can see the consistency on DiCaprio slip.

Side note: There are ways to assign weights to the keywords in a prompt, but the more complex your prompt, the more difficult it becomes to get the precise image you’re looking for. Take inspiration from the randomness and mistakes that might occur.

It’s all piling up

This lengthy intro is to make you aware of one of the most important concepts of Generative AI: strong and weak signals.

The signal strength of a piece of information, meaning its presence in the real world and therefore in a training dataset, heavily influences the output you will generate.

One thing important to understand is that the goal of many AI research teams is to develop something they call Artificial General Intelligence (AGI): an autonomous machine that can react to any kind of input with any kind of realistic output, and eventually even reasoning.

Side note: There is a case to be made that advancements in machine learning and AI are driven by a conviction among engineers that every aspect of human life, every skill and behavior, can be broken down into binary decisions. And can thereby be reproduced by code.

Depending on whom you ask (and how much money they are currently trying to raise), AGI might already be here or 15–20 years away, assuming it’s even possible.

Currently, there are different Generative AI models and services for different use cases, by different providers. Here are some examples; the one that ✨ sparkles is currently considered a market leader (as of January 2024):

  • Text generation: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), ✨GPT-4 (OpenAI), GPT-Neo (EleutherAI), Grok (X.AI), LLaMA (Meta)
  • Image generation: Dall-E (OpenAI), Firefly (Adobe), ✨ Midjourney (Midjourney), Stable Diffusion (StabilityAI)
  • Video generation: ✨Gen-2 (RunwayML), W.A.L.T. (Google), Make-A-Video (Meta)
  • Speech generation: Apple, ElevenLabs, Murf, Google, Replica Studios, Speechify

As a stepping stone towards AGI, OpenAI, Google and Meta are developing so-called “multimodal content generation”. Multimodal content generation means that content of any kind can be created from input of any kind.

Illustrative slide from a talk, visualising the idea of multimodality.

Audio input can create a video game. Image input can become video. A video becomes a book. Static code becomes interactive. This is where we are heading and what research teams at those companies are working on.

As end users — people not involved in building these models — it is still important to understand how the training process works. These concepts apply to different types of Generative AI, though I’ll mostly focus on image and text.

First, you curate a pile of content

Easier said than done. But I would even argue that, assuming the algorithms are in place, content curation is probably the most crucial part of Generative AI.

I mentioned training datasets before. These mostly include data and content scraped from the internet. That means a program is instructed to follow links and code patterns to extract content from any website it can access or was instructed to visit.
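
As a minimal sketch of that idea, assuming the requests and beautifulsoup4 libraries and a hypothetical start URL, such a scraper boils down to a loop of fetching pages, extracting their text, and queueing the links it finds:

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def crawl(start_url, max_pages=100):
    """Follow links from a start page and collect the visible text of each page."""
    queue, seen, corpus = [start_url], set(), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        corpus.append(soup.get_text(separator=" ", strip=True))  # text, no markup
        # Queue every absolute link found on the page.
        queue += [a["href"] for a in soup.find_all("a", href=True)
                  if a["href"].startswith("http")]
    return corpus

pages = crawl("https://example.com")  # hypothetical start URL
```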

For image generation, one of those content piles is provided by LAION, a Hamburg-based non-profit organization that, for self-proclaimed academic research purposes, built an open-source library of hyperlinks pointing to publicly available licensed and royalty-free images.

Most notably, StabilityAI’s Stable Diffusion and OpenAI’s Dall-E use the LAION library. Earlier Midjourney model versions were built upon Stable Diffusion, but the company recently pivoted to its own proprietary model.

Comparing the most recent models, Dall-E 3 and Midjourney v6, you can see that the former generates a more commercial and the latter a more artistic rendition when provided the same prompt.

left: Dall-E 3 via ChatGPT: Generate an oil painting of a cat licking a popsicle
right: Midjourney: oil painting of a cat licking a popsicle --v 6.0

It is rumored that Midjourney curated the training data of its models with a stronger focus on fashion editorial photos and artistic creations from platforms like DeviantArt. In contrast, Dall-E relies more heavily on the LAION dataset, which is filled with random images and a lot of stock material.

Recently, providers of proprietary, closed-source models such as Midjourney and OpenAI have become more secretive about the datasets their models were trained on. Probably because they, too, understood that the curation and weighting of the data is their main differentiator from other models, given that all models are basically built upon similar information.

Speaking of piles: The researchers behind “The Pile”, an open-source dataset used by text-based Large Language Models like EleutherAI’s GPT-Neo or Meta’s LLaMA, shared a tree chart depicting the contents of their training dataset: 886.03 GB of data.

To pick some examples from the chart:

  • ArXiv is a preprint server for research papers with a focus on maths, computer science and physics
  • Pile CC is a modification of the widely used CommonCrawl dataset of scraped public online content, with the plain text extracted from the raw HTML code
  • Stack Exchange is the umbrella network for Q&A communities, most notably Stack Overflow
  • PG-19 is a library of out-of-copyright Project Gutenberg books published before 1919
  • Subtitles is the OpenSubtitles dataset, gathered from movies and series for research purposes
  • GitHub includes publicly available code repositories in multiple programming languages

Other notable sources not represented in the graph (because they are part of compilations) include reddit and YCombinator’s HackerNews for links and conversations. Maybe even your personal website or blog is part of the CommonCrawl dataset.

Whether for images or other content forms: It is important to realize that different model providers have different approaches to curation.

It is also completely up to the model provider if and how content is moderated. Just think of the “non-woke” models like “Grok” by X.AI/Twitter or the “TruthGPT” that right-wingers are working on, promoting the idea of “unfiltered” models that keep hate speech as part of the training dataset.

Side note: “Data is the new oil” is a phrase that, afaik, was coined in the mid-2000s. It is only now, a mere 20 years later and seeing the real-life impact of data crawling, that companies realize what this means. And it is also exactly why reddit and Twitter (X) were so adamant about putting their previously public APIs behind paywalls.

Then you train the model

Basically, once the data is collected, you can start running a training program. Your dataset can be as simple as providing one large .txt file with all the writings of your favorite author. Or as large as the previously described data piles.

In a professional setting, there are some steps in between: The data needs to be cleaned. Think of a book you got from gutenberg.org as a text file — it would need to be stripped of the imprint, chapter titles, unnecessary whitespace, page numbers, and so on, so that the machine does not identify false patterns of language.
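
As a minimal sketch of such a cleaning step: Project Gutenberg files really do carry standard “*** START OF …” and “*** END OF …” markers, while the file name here is hypothetical.

```python
import re

def clean_gutenberg(raw: str) -> str:
    """Strip Project Gutenberg boilerplate and normalize whitespace."""
    start = raw.find("*** START OF")
    end = raw.find("*** END OF")
    if start != -1 and end != -1:
        # Keep only the text between the standard markers,
        # dropping the marker line itself.
        raw = raw[start:end].split("\n", 1)[1]
    raw = re.sub(r"[ \t]+", " ", raw)     # collapse runs of spaces and tabs
    raw = re.sub(r"\n{3,}", "\n\n", raw)  # collapse runs of blank lines
    return raw.strip()

with open("metamorphosis.txt", encoding="utf-8") as f:  # hypothetical file
    text = clean_gutenberg(f.read())
```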

The first step in training a machine learning model is usually called tokenizing. This means splitting words (where needed) into common chunks of characters and assigning each token a unique identifier.

You can try this out yourself on the OpenAI website: https://platform.openai.com/tokenizer
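
The same tokenizer is also available in code through OpenAI’s open-source tiktoken library; a minimal sketch:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-3.5-turbo and GPT-4.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Once upon a time, there was")
print(tokens)                             # a list of integer IDs, one per token
print([enc.decode([t]) for t in tokens])  # the text chunk behind each ID
```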

Training the machine learning model then actually means that the machine builds a network between those tokens. It goes over the tokenized text again and again. And with each round, the relations between the individual tokens become ever stronger and more reliable.

The data won’t be uploaded like you upload a file to Dropbox or similar services. It can’t be searched for specific information and sources. Think of it more like memorizing vocabulary through repetition.

The model will build this vast network of tokens, words and characters, with strong and weak connections between them, sorting itself into groups of content forms. Within this network, “Once upon a time” is where the fairytales are. <body> indicates HTML code. “Here is the one thing you need to know today” could be where the LinkedIn lingo is located.

Prompt: cloud of white node points in a pitch black space --ar 16:9 --v 5.2
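
Real models learn these connection strengths as weights in a neural network, but you can get a feel for the principle with a toy bigram counter: the more often one token follows another in the training data, the stronger the link between them. A minimal sketch:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count how often each word follows another: a toy stand-in
    for the learned connection strengths in a real model."""
    follows = defaultdict(Counter)
    for text in corpus:
        words = text.split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

model = train_bigrams(["once upon a time there was a princess",
                       "once upon a time there lived a frog"])
print(model["upon"].most_common(1))  # [('a', 2)]: the strongest signal wins
```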

The more content (and iterations) you provide, the more accurate and general the output becomes. Something like correct facts, grammar or even translation capabilities can emerge.

As a non-machine learning person, you’ll probably never train a model yourself, but rather use a so-called “foundation model” like GPT-4, Midjourney, etc. That means all the work has been done; the model has already been trained on giga- and terabytes of content through millions of iterations, estimated to take as long as a whole month in the case of GPT-4. We can use these models and adapt them to our needs.

Model training also includes quality assurance and optimization. The current best practice is reinforcement learning from human feedback (RLHF). This means a user writes a prompt and the machine responds. The user then rates the response. Good responses are rewarded and incentivize the model to improve in the way the model provider intends (“alignment”).
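
In practice, this feedback is often collected as preference pairs: the same prompt with a preferred and a rejected response, which are then used to train a reward model. A minimal sketch of what one such record could look like (the field names and texts are illustrative):

```python
# One illustrative preference record for reward-model training.
preference = {
    "prompt": "Explain tokenization to a non-technical reader.",
    "chosen": "Tokenization splits text into small, reusable chunks ...",
    "rejected": "Tokenization is when computers eat words ...",
}
# The reward model learns to score "chosen" above "rejected";
# the language model is then tuned to maximize that score ("alignment").
```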

It is worth noting that this quality assurance — which can also include removing (child) pornographic imagery from datasets — is most often outsourced to low-wage workers in countries of the Global South.

In the previous paragraphs, I have talked a lot about text generation models. I don’t want to repeat myself and go too much into depth about image generation. Just this much: image generation is based on a somewhat different machine learning approach called diffusion models, where the model learns from a pile of description-labelled images.

For image generation, a random noise field is generated — like a very unclear dream (if you remember, one of the first image generation services by Google was even named “Deep Dream”) — and instead of token probability, pixels are sorted by probability over time to create a realistic image. You can experience the process while using Midjourney and watching the generation steps.
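
Conceptually, the generation loop is tiny: start from pure noise and let the model repeatedly subtract the noise it predicts, guided by the prompt. A heavily simplified sketch, with a hypothetical model.predict_noise function standing in for the real schedulers and guidance machinery:

```python
import numpy as np

def generate(model, prompt_embedding, steps=50, size=(512, 512, 3)):
    """Denoising loop of a diffusion model, heavily simplified."""
    image = np.random.default_rng().standard_normal(size)  # start from pure noise
    for t in reversed(range(steps)):
        # The model estimates the noise still present at step t,
        # conditioned on the prompt.
        noise = model.predict_noise(image, t, prompt_embedding)  # hypothetical
        image = image - noise / steps  # remove a little of it each step
    return image  # the noise has become a picture
```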

At last, you can work with it

With the trained model in place, you can start prompting and generate content.

Online, and especially in the context of text generation, you can find many discussions, even among people with a machine learning background, about whether these models are just “next-word predictors” or whether a type of intelligence is already emerging.

The main (philosophical) question is: If it’s really just predicting the next word, how come the responses are coherent and actually make sense?

And we just don’t know right now. There are entire subfields of ML where people try to reverse-engineer neural networks after their training process. While the text models are next-token predictors, the model itself is an unimaginably large, multidimensional mathematical space. The issue with those discussions lies mostly in the word “just”, which devalues the engineering achievement; next-token prediction is actually quite remarkable.

Another thing emerging with text models and especially since the introduction of ChatGPT is “overreliance” on the model and its generated output. (btw, OpenAI themselves name it as one of several safety challenges.)

I genuinely commend the OpenAI product team on tailoring ChatGPT so perfectly that you forget it’s a machine. You use your natural human language and get a natural human language response in a conversational format. This doesn’t happen with image models, which are used much more like a Google search interface of keywords and phrases.

But the overreliance induced by conversational interaction is where the idea of ✨ magic comes back in. All of the more technical explanations above are suddenly forgotten when a response seems so perfect. But the prediction is still in place.

Let’s take a closer look.

For earlier GPT models (up to version 3.5), OpenAI offered a probability mode that showed the likelihood of each token it chose for its completion. You can still access it through the API playground’s legacy “complete” mode.

Screencast of GPT-3.5 completing the prompt “Once upon a time, there was”, with the probability scoring shown

As you can see in the video and screenshots below, each token gets an assigned probability. You can check token by token to see its probability of being the next one, in the context of the initial prompt.
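
If you want to reproduce this yourself, here is a minimal sketch using the official openai Python SDK against the legacy completions endpoint (it requires an API key):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in your environment

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Once upon a time, there was",
    max_tokens=10,
    temperature=0,  # always pick the most probable token
    logprobs=5,     # also return the top 5 candidates per position
)

lp = response.choices[0].logprobs
for token, logprob in zip(lp.tokens, lp.token_logprobs):
    print(repr(token), round(logprob, 3))  # log probability of each chosen token
```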

In its settings, OpenAI offers something called “temperature”, a parameter for randomness. As an analogy, think of the process of boiling water. When it’s cold, the water is still. When it’s hot, the bubbles burst and the water moves.

In the video, you can see the completion for the sentence fragment “Once upon a time, there was” at temp = 0. This means the model will always choose the highest probability. So basically everyone who prompts the gpt-3.5-turbo-instruct model with this context at temp = 0 will receive the exact same output. But of course, you can increase the level of randomness (up to a temperature of 2.0) and thereby allow less probable tokens to be picked.

Prompt “Once upon a time, there was” at temperatures 0.5, 1.0, 1.6 and 1.9, with colored probability indicator
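
Under the hood, temperature simply rescales the raw token scores before they are turned into probabilities. A minimal sketch with illustrative scores for three candidate tokens:

```python
import numpy as np

def token_probabilities(logits, temperature):
    """Softmax with temperature: low values sharpen the distribution,
    high values flatten it."""
    scaled = np.array(logits) / max(temperature, 1e-6)  # guard against temp = 0
    exp = np.exp(scaled - scaled.max())                 # numerically stable softmax
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]  # illustrative scores for "a", "an", "the"
for temp in (0.2, 1.0, 1.9):
    print(temp, token_probabilities(logits, temp).round(2))
# At low temperature nearly all probability lands on the top token;
# at high temperature the alternatives become realistic picks.
```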

GPT-4 and ChatGPT still work with the same mechanics explained earlier, only that the “completion” is now framed as a conversation rather than an open-ended continuation; ChatGPT also has pre-assigned text generation settings.

If you ask ChatGPT the same question multiple times, the facts or sentiment of the response will stay the same, but concrete wordings will change. I’m guessing that it’s set to temperature ≈ 0.7, so that the response is still aligned with your prompt, but allows for enough flexibility and randomness to not generate the exact same output when prompted with the exact same input.

You could also consider the temperature to be the “creativity” of the model, if you define creativity as the ability to diverge from normative and expected responses.

Conclusion

What you really should take away from all these explanations and examples is that the machine learning model itself is static.

It doesn’t have access to current websites, updated knowledge or information that you can find in login-restricted areas. It is trained on crawled content, historic information and majority ratios. If you want to add information to the model, it needs to be retrained.

And more importantly (as a takeaway): If you’re not visible in the data set, you’re not visible in the generated output.

There are ways of prompting and building apps (like ChatGPT Plugins, or so-called RAG — Retrieval-Augmented Generation prompts) to include current information, but they are built on top of the model.
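
The pattern behind RAG, sketched minimally: retrieve relevant, current documents first, then paste them into the prompt so the static model can answer from them. The search_documents helper below is hypothetical; in practice it would query a search index or vector database.

```python
from openai import OpenAI

client = OpenAI()

def answer_with_rag(question: str) -> str:
    """Retrieve current documents first, then let the static model answer from them."""
    documents = search_documents(question, top_k=3)  # hypothetical retrieval helper
    context = "\n\n".join(documents)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context:\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```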

It is also completely up to the model provider to curate the training datasets. Will they keep duplicated content? There are thousands of websites sharing Franz Kafka’s “Metamorphosis”, which is also a popular placeholder text. Will they dedupe their content? Or will prompts of “One morning, when Gregor” always be continued with “Samsa woke up from troubled dreams, he found himself transformed into a gigantic insect.”?

The more present phrases, sequences and information are within the datasets, the stronger the network connections between those tokens become. And the higher the probability of the model reproducing them. Though, of course, model providers such as OpenAI are actively trying to suppress this “regurgitating” behavior.

Likewise, in the image generation examples shared at the beginning of this article, we saw that the very popular Leonardo DiCaprio, in contrast to less present people, could be recognizably recreated.

The strong signals will prevail.

Next Part

In the next part, I will move on from understanding Generative AI models to putting them into the context of strategic and creative work.

It’s called “The Beauty of Weak Signals”. From the title alone, you can probably guess where I’m heading.
