Image Analogies and Relationships: AI & Interactions in Word and Image
2023.12.9–12 Geoffrey Gordon Ashbrook
First Two Attempts (Failures) at Low-Hanging-Fruit Image-Generation
Though my usual focus is language NLP work and ~music (I am the only person in my family who is not an expert painter) I figured I should give the up and coming AI-image generation technologies a go. My first ideas of what to try may have produced some potentially useful observations…of how my attempts failed so badly.
Attempt One: Dungeon Room with logic-gates between the rooms
(with stable diffusion)
My first idea was to see how well image-AI could help someone map out a gaming world, such as a dungeon map. I was thinking about how AI technology might help people (especially kids with scant gaming resources) run their own table-top role-playing games and develop their literary imaginations.
No matter what I tried the result was more or less the same, with occasional blobs and smudges in random places: an indistinct sort-of-floor-plan.
Attempt Two: The Armistice
In a bid to assuage an AI-fearful good friend of mine with familiar comic relief, I decided to try some images featuring familiar figures of good social standing.
A first brainstorm was trying to get an image of Darth Vader and C3PO playing chess, but this only produced mangled images of Darth Vader’s head and chess pieces (with occasional random gold coloring).
I think the next idea was an image of Optimus Prime brokering peace between He-Man and Skeletor in a symbolic wedding. That somehow became Optimus Prime as a minister presiding over a wedding between He-Man and Oscar the Grouch…for some reason. Trying to get this with just a language prompt only produced vague images of a single robot. So I decided to hand draw (which didn’t work either) and then collage the image along with a text description. You can study the resulting catastrophe below.
Autopsy The Failures
It was probably only after both of these attempts failed so miserably that I wondered how ‘normal’ people could make spectacular, intricate, dazzling images whereas my attempts at low-hanging fruit were something between blank canvases and hilarious atrocity.
What was happening?
- the model could not do any interactions/relationship between (dungeon) rooms
- the model totally confused and distorted any combination of separate characters (playing chess, at a wedding…)
Ah…relationships, interactions, combinations…there was the problem and the focus difference.
A Singular Composition
As usual, my default approach to something builds in the spaces of factors of interactions and outcome dynamics, however discrete and narrow (e.g. the whole object-relationship-space, coordinated decision making, and definition behavior study conglomerate that this essay is part of). But how many of those advertised truly spectacular AI-generated images are of only one subject/object with no interactions or relations? At a glance: 100%, all of them. Take a look at playground.com (if that exists when you read this).
misc screenshot of playground.com’s home page
So my attempt to find a tool to help people manage the visual aspects of world or game scenario with many moving and interacting and inter-relating parts was a bit of a mismatch with what these tools do. And maybe this relates to why we have not yet seen AI content in games and game design; there appears to be one or more valleys of death in the way of making that work…but I’m still optimistic (Go go gadget professor pangloss).
Analogies: Visual, Aural (auditory), and Textual
Since relationships were now front and centre, I decided to try the next obvious low-hanging-fruit simple question: putting a classic vector analogy language-question to an image-vector AI. What will happen?
Prompt 1: “Man is to king as woman is to ___, what?”
Prompt 2: “Bark is to dog as meow is to ___, what?”
The first prompt produced painfully bland images of plasticine looking royalty. There was a ‘man’ with a metal-head-ornament and a ‘woman’ with a metal-head-ornament in each result, but not in a particularly conclusive way. ‘King and woman’ might have produced the same results (high traffic throttling prevents me from actually confirming that now…apologies.). But it seems vaguely promising that a ‘queen-ish’ image does appear somewhere in the image.
The second prompt may, or may not, be a bit more interesting. As in the picture below, sometimes the result contains no dogs whatsoever while the word ‘cat’ did not appear in the prompt. On the one hand, because this is a vector-embedding search ‘meow’ might just as well be ‘cat,’ pointing to the same concept-area. You can’t have a meow without a cat, generally speaking; whereas in the first example not all women are queens and ‘queen’ did not appear in the prompt. But here, see below, the image is 100% cats. The prompt was “Bark is to dog as meow is to ___, what?” which maybe (very maybe) could mean the model is trying to answer the question. However, the results varied across jumbled composites of cat and dog images (with some funny mergings) so it may be predictable that any jumbled set of random cat and dog images will contain some cat-only images.
The point here is that the question raised is interesting (getting a rigorous answer is another interesting question).
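For comparison, here is a minimal sketch of how the classic text-vector version of the analogy question is usually posed: take the vector for "king," subtract "man," add "woman," and look for the nearest remaining word. The three-dimensional vectors and their axis meanings below are hand-made toy values for illustration only (they are assumptions, not embeddings from any real model), and `solve_analogy` is a hypothetical helper name.

```python
import math

# Toy, hand-made "embeddings" for illustration only (not from a real model).
# Hypothetical axes: [royalty, maleness, femaleness]
TOY_VECTORS = {
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "dog":   [0.0, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the nearest neighbour of b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three words in the question are excluded from the candidate answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(solve_analogy("man", "king", "woman", TOY_VECTORS))
```

With these toy values the nearest candidate is "queen." Whether an image model does anything comparable with a sentence-shaped prompt is exactly the open question here; the sketch only shows what a clean answer looks like in the text-vector case.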
Blurs vs. Interactions
More questions: As images fuse, how can we tell if the AI is ‘showing a relationship between the two’ or simply mashing the two together?
How could we tell if the image-AI is trying to show a juxtaposition? A curious, and possibly completely inconclusive, example may come from our next prompt:
Merge & Migrate
I am not going to show images of this because they can be a bit stomach churning, but ‘holding hands’ is a prompt that often leads to a tangle of finger-ish-things. Animals or people next to each other can share limbs or have curious extras. I think in AI videos it can be tricky to stop people from merging together (not sure though).
Prompt 3: “Day is to sun as night is to ___, what?”
This prompt produced a set of images with different combinations of day and night sky-scapes. But one that seemed unusual (see below) combined side by side two separate image panels (a “diptych” if you will). In all of the merge-combo variations I have seen, this is the only two-related-pictures-in-one that I recall seeing.
For example if your prompt is a bunch of items that can be fused with each other, the result is usually some curious combination, such as
“octopus apple pumpkin tree moon stars”
“duck water hand pumpkin”
“death star in space ships stars cantaloupe”
Dodge & Weave
Can we avoid the common-adjacency issue by using not-usually-combined items as parts of our questions? For example, in previous comparisons it was not clear if the image-AI was specifically deducing “queen” or if queens already happened to be in all of the ‘king with woman’ images/image-vectors anyway. And cat and dog images and comparisons are so common online that there is probably a kind of gravity-well of over-representation around cats and dogs too.
So let’s try a less common combination:
Prompt 4: “Bark is to dog as quack is to ___, what?”
This did not produce any duck images (which might be a counterexample to the idea that in vector-land meow===cat and quack===duck, or this might feed into the question of there being an over-abundance of dog-and-cat images as well as cross-labeled cat-dog images but very few dog-with-a-duck images).
However, interestingly, this prompt did produce almost exclusively water-retriever dogs swimming…which is a curious connection. Was the AI presenting a conclusion of “duck-hunting”…which is perhaps one of the few clear real-world relationships between dogs and ducks? This swimming-retriever pattern may also add an interesting ‘relationship’ aspect to the sometimes abstract ‘analogy’ question. Let’s map this out:
On the one hand, the answer to: “Bark is to dog as quack is to ___, what?”
is the abstract ‘duck’ or ‘duckness’(vector).
But, if the focus is on relations and interactions, and the question is
“Bark is to dog as quack is to ___, what (is a real-world interaction and relationship between all these items)?” then duck-hunting swimming hunting-retriever-dogs is a valid answer to the question.
Could visual analogies be different from text-analogies? How does visual-vector-space work differently from text-vector-space? Again, the goal here is to dredge up questions and feedback for our perceptions of what may be happening, not arriving at a conclusive explanation (based on a reported image which I didn’t even bother to save).
Another experiment on a similar theme is to start out with two similar scenes:
1. very happy people at a meeting
2. very bored people at a meeting
and see what happens when we try to combine parts of those two images.
The result (below) is similar to the theme above, where the early-stage 2023 image-AI cannot combine multiple themes or attributes, even (in this case) when there is no interaction between them. Curious.
Object Granularity: Inscriptions, Ears & Shirt Collars
As a segue into the next section, here is an image from the prompt:
“darth snape wand potions”
Overall this is an interesting conceptual fuse-blending of Darth-Vader vectors with Severus Snape wand potion vectors (including a curiously blue…weird thing in his hand). The title-inscription is almost-not-gibberish. While too ambiguous to say, “Savrise Snivap” or “Saride Snipvar” is kind-of-sort-of not terribly far from “Severus Snape,” or if that is too ambitious, some of the fake writing looks kind of like real letters: S, R, E
‘Real letters,’ if not ‘real words,’ can be an interesting item to look for in a context of details that the model fuzzes over.
The AI actually does very well on the clock numbers and roman numerals a lot of the time, and occasionally had not-terrible clock hands. If it is safe to speculate a bit, perhaps wall-clocks are uniform enough that this was memorable. But wrist watches are almost the opposite. (Why such a huge difference? I would have thought clocks and watches were in the same category, not two highly different categories.)
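One rough way to put a number on “not terribly far” is a string-similarity ratio between each reading of the inscription and the intended name. The sketch below uses `difflib.SequenceMatcher` from Python’s standard library; the `similarity` helper is a hypothetical name, and the two readings are only my guesses at the letters in the image, so any score is a loose indicator rather than a measurement.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


target = "Severus Snape"
# The two guessed readings of the painted inscription.
readings = ["Savrise Snivap", "Saride Snipvar"]

for reading in readings:
    print(f"{reading!r} vs {target!r}: {similarity(reading, target):.2f}")
```

The same comparison could be run against the gibberish on menus or calculators, where there is no obvious intended word to compare with, which is part of what makes those cases harder to score.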
The hands and buttons on the wrist watch are generally very good (unlike wall-clocks), but the writing and numbers are total nonsense (also unlike wall-clock). And the digital watch is so incomprehensible that without the outside context you’d be hard pressed to think it was an attempt to draw a digital clock face at all (see the analogue winding-knob thing on the digital watch?).
To press the point, here’s a “digital calculator”…
Compared with fictional-keys on the ai-calculator, the numbering and hands on the wall-clock faces are astoundingly clear and correct.
How should we map out what is more or less difficult and how the difficulty works?
For example, AI ‘human’ faces can be highly realistic, but ears, earrings, and shirt-collars are, in contrast to the hyper-realistic facial features, distorted beyond recognition. I had assumed this was a kind of oversight that the model was only intended to focus on the face, but maybe shirt-collars are that much more difficult?
If you have not done this before, try refreshing to look at a few faces on https://thispersondoesnotexist.com/ (beware of spamvertizing lookalike sites, I think this is the real one). At first you may be surprised by how real-ish the pictures look. You may wonder as you flip through the ai-generated faces if you could even tell that the face was ai-generated. But focus on just the shirt collar (and ear-rings if they exist). Perhaps the model has not been updated, but unless the shirt collar and ears are all covered up those features tend to be at a drastically lower quality than the rest of the image.
Even playground.com/canvas is similar: while the overall quality can be amazingly good, watch the ears and shirt collars.
Depending on areas of difficulty like this, the threat of AI generating fictional news footage and elaborate world event reports to hoodwink people may not be a short-term concern. (At the same time, some people will believe pretty much anything and that lowers the bar.)
One More Puzzle
Here I asked ai for a ~”digital calculator with hair.” (I also then asked it to modify the face and add some groceries to keep the image G-rated). Notice that while the numbers in the hair are (mostly) real numbers, the numbers on the calculator are complete nonsense. How does this mis-match of difficulty-levels work? It can write numbers. It can put numbers anywhere it wants. But it either can’t put numbers on the calculator, or it does not care.
Perhaps, again, this represents the ‘relationship’ type challenge. Numbers: ok. Hair: ok. Calculator: ok. But the relationship between numbers and calculator is…not ok.
“count to ten on the blackboard”
“birdsong on a blackbaord”
“words on a blackboard”
“a restaurant kids menu”
“a restaurant menu”
How do these restaurant menus not even contain real ASCII characters, unlike the Darth-Snape painting that was nearly real words?
“pizza menu text”
As a test of whether context is important, here is an image prompted by “pizza menu text,” and it does not nearly say “pizza,” or any other words, though there are three “P” letters.
Equal Representation Under the Vectors
No Legislation without Vector Representation
On a possibly similar theme, try “sandals and toes” as a prompt. Maybe you will have better luck, but over many tries I have not yet seen 10 toes between two feet (which…would be normal). This toes-problem appears to maybe be different from the holding-hands and merging-arms problem which involves two people. Perhaps hand-pictures and wall-clocks are well enough represented, but perhaps people just aren’t taking enough pictures of their feet…
High Definition & Low Definition, Media: Hot & Cold
The overall interplay featured here may be EM-spectrum(“visual”)-spatial data vs. serialized-textual-language, but it may be important to remember language as sound-data. Future empirical and multi-modal work may be interesting regarding such a long standing blind spot.
While it is not a hard science, various writers and researchers have explored some topics that may relate to these fields of high and low definition signals and perception, and how images, words, and sounds may relate to either meaning differently or signal transmission or perception or data-density, depending on your focus.
While you may not agree with his interpretations, the late Leonard Shlain wrote a book specifically about the interplay between words and images.
Dr. Shlain was a very accomplished surgeon and professor. I was lucky enough to see him speak at events while he was alive.
Herbert Marshall McLuhan wrote many books on Media Studies involving how different media may operate and affect systems differently.
Now sadly out of print, this is a great set of lectures: https://www.amazon.com/Surfing-Finnegans-Riding-Marshall-McLuhan/dp/1561769118/ Perhaps it is archived somewhere online.
And it may be interesting to compare visual number and letter articulation in AI with traditional Ingo Swann type ‘remote viewing’ printed-character granularity:
Minimal and Random Inputs
Here are a few parting experiments with minimal and one largely random input.
“face woahhg3i hair 3 9s igs figs 9hgs”
C.P. Snow vs. Eric Ashby, and The Secret Chief
While all very speculative at this point, we can at least ask what other regions of systems-space might be at play here, given that there is much we do not know.
I tried looking at some of the text-descriptions for some of the ‘elaborate’ sample images (such as a street of shops). In some cases the text blurbs are just image-titles, but in other cases they look as though they could be (or be related to) the often long text prompts and settings used to generate the images.
Trying this out, and using the ‘filter’ option.
To me, reading these ‘prompts’ was painfully like trying to ‘read’ yet another vapid clickbait article that contains virtually no information at all. It felt wrong somehow.
I was starting to feel like a detective interviewing a recalcitrantly inarticulate witness:
Detective: “Ok, and when you arrived, what exactly did you see? What happened?”
Witness: “Oh, sure! It was like, Sunset in Autumn. I think there was a Lake and a Silver birch forest. It was totally Ultra HD, I mean, really realistic and so many vivid colors. Like wow. I mean really high detail. It was like a UHD drawing (that’s ‘ultra!! high definition, man!’)… maybe pen and ink, with like totally perfect composition, I mean seriously beautiful detail. You know, it was all like, concept art, with a soft natural volume. It was totally cinematic! It was perfect! Did I mention lightSunset… oh yeah, it was Summer, and there was a Lake. There was a Silver birch forest. Oh, and did I mention it was Ultra HD?? It was so realistic! And there were vivid colors! There were high details, man! It was totally a UHD drawing, like with pen and ink. Such a perfect composition. Oh, and it had beautiful details. Did I mention that? You know, it had… Complex details. I totally think there was an Octane render trend. I’m talking 8k art, man! This was photography!! This was photorealistic concept art!!! This was…soft! This was natural! This had volume! This was cinematic!!! This was perfect!!!!!!! You know, it was like, light.”
The above not-actually-human-dialogue was adapted from the following actual posted prompt, where I tried to not add or remove any language aside from adding sentence structure.
“Prompt Sunset Autumn Lake Silver birch forest, Ultra HD, realistic and vivid colors, high detail, UHD drawing, pen and ink, perfect composition, beautiful detail, concept art, soft natural volume cinematic perfect lightSunset Summer Lake Silver birch forest, Ultra HD, realistic and vivid colors, high details, UHD drawing, pen and ink, perfect composition, beautiful details Complex details Octane render trend, 8k art photography, photorealistic concept art, soft natural volume cinematic perfect lightRemoved From Image ugly, deformed, noisy, blurry, distorted, out of focus, bad anatomy, extra limbs, poorly drawn face, poorly drawn hands, missing fingers”
Note: The “Removed From Image” blurb I think is the default that is already there.
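To see how repetitive the posted prompt is, here is a small sketch that splits the positive part of it on commas and counts the repeated tags. The string is copied from the quoted prompt above, leaving out the leading “Prompt” label and the “Removed From Image” list; `tag_counts` is a hypothetical helper name, and splitting on commas is only my assumption about how the tags are meant to be separated.

```python
from collections import Counter

# Positive part of the posted prompt, copied from the quote above
# (leading "Prompt" label and "Removed From Image" list left out).
PROMPT = (
    "Sunset Autumn Lake Silver birch forest, Ultra HD, realistic and vivid colors, "
    "high detail, UHD drawing, pen and ink, perfect composition, beautiful detail, "
    "concept art, soft natural volume cinematic perfect lightSunset Summer Lake "
    "Silver birch forest, Ultra HD, realistic and vivid colors, high details, "
    "UHD drawing, pen and ink, perfect composition, beautiful details Complex details "
    "Octane render trend, 8k art photography, photorealistic concept art, "
    "soft natural volume cinematic perfect light"
)


def tag_counts(prompt):
    """Split a prompt on commas and count each lower-cased, trimmed tag."""
    tags = [tag.strip().lower() for tag in prompt.split(",") if tag.strip()]
    return Counter(tags)


counts = tag_counts(PROMPT)
# Tags that appear more than once, word for word.
repeated = sorted(tag for tag, n in counts.items() if n > 1)

print("total tags:", sum(counts.values()))
print("repeated tags:", repeated)
```

Only exact repeats are counted, so near-duplicates such as “high detail” and “high details” are treated as different tags. Read this way, the prompt looks less like a description and more like a weighted bag of style keywords.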
Trying this prompt with a “macro-realism” filter and then a “cinematic” filter, I got these two images…
…which are much less horrible than the results of the prompts that I design myself.
So, aside from my personal strong aversion to describing one’s own design in the above unspeakably repellent terms…this alien metier seems to be operating in some kind of interesting space separate from how I am thinking about it.
As a possible angle for re-orienting around this, consider two narratives in the history of science. One is Eric Ashby’s fact-based account of academia’s role in the history of the sciences (perhaps STEM would be a better term, though it did not exist, or was not widely used, in 1958), which offers puzzle pieces we should all be considering. The other is C.P. Snow’s more conjectural musings about how other people might be thinking about different types of things.
(As usual there are several editions, versions, and possibly separate works, with confusingly similar titles. Naming things is hard (or rather, not knowingly giving two different things exactly the same name appears to be irresistibly impossible).)
Though it may be ridiculous to invoke C.P. Snow’s two cultures as a context specifically for thinking about essences and atmospherics, at least more broadly (and, again, disputes or details about C.P. Snow’s dichotomies may simply get in the way here) the main idea here is that the world is neither made of one type of task nor one type of approach. And just because one approach doesn’t lend itself to something does not in the least mean that there isn’t much in the world that approach will not intersect with. Or maybe this more simply fits the adage: If all you have is a hammer, everything looks like a nail. Whereas, to heavily paraphrase Robert A. Heinlein’s idea that “Specialization is for ants: people need to do many things,” we need to somehow get perspective on how our tools shape our perceptions.
The Situational Importance of Mode, Style, Set, and Setting
A possible avenue for looking at (and looking at the overlooking of) the substantial role of set-and-setting is the well set down narrative of Michael Pollan in his deftly diplomatic “How to Change Your Mind,” which absolutely needs to be a standard textbook in AI studies, AI development, and AI-Biology integration.
One of the STEM threads that Pollan follows is the medical-therapeutic role of set-and-setting in a context of mind. To wildly oversimplify: sciency-academic people thought set-and-setting wasn’t important, but persistent data won out eventually, so after decades of perfunctory hominid hostility fighting reality (and, as usual, punishing the people who pointed towards data) set, setting, and atmosphere are now taken more seriously as part of the space of a scientific and medical study of mind.
The “Secret Chief” was a prominent figure in medicine who, in a sense like Geoffrey Hinton and others, continued to follow the data and do research to help people despite the hostilities and exclusion of institutions and academia.
Is this another chapter in exploring the dynamics of setting?
Interactions, Relationships, Ecologies, & “Systems”
A set of areas that can be important but that is notoriously difficult to navigate technically and linguistically (and, ironically, socially and institutionally) is the sprawling mix of areas around “systems thinking” and nonlinear and dynamical systems (which, being vaguely defined at the boundaries, may even include the peculiar split between computer science and project-oriented ‘Operations Research’). I will try to stick to the affirmative and practical.
An exciting entrant into the renaissance year of 2023 is Deborah M. Gordon’s “The Ecology of Collective Behavior” (October 24, 2023) (…still reading that now). Indeed, this may (or should) end up being a core AI architecture textbook as it deals with practical questions blessedly outside the echo-chamber of cat smile sentiment analysis.
(This is not an AI generated image of an ant collection. (from web browser image search))
AI image prompted by: “This is not an AI generated image of an ant collection.”
Deborah M. Gordon is in some ways in the footsteps of the great biologist Edward O. Wilson (author of too many books to list here) who worked hard to extend biology to interactions within populations, and both Gordon and Wilson are biologists focusing on ants.
Melanie Mitchell’s wonderful “Complexity: A Guided Tour” is both a refreshingly lucid and cheerful walkthrough of exciting research and a respectful obituary for the seemingly endless failed attempts to formally incorporate nonlinear dynamics into STEM. The book also includes some fascinating examples of modeling interaction behavior.
Rupert Sheldrake is another biologist, not an ant specialist in this case, who has been involved in empirical interactions studies in biology, including animal behavior, writings about how science is done and how science sees the world, and of course who has braved controversy for not being stodgy.
Another affirmative, practical book is “Thinking in Systems” (2008) by the late Donella H. Meadows.
Mandelbrot himself, and his protege Nassim Nicholas Taleb, have worked on many practical projects. The pioneer Stephen Wolfram is another example of someone whose work has been primarily in practical software and who has also pushed angles for understanding systems related to cellular automata.
If not commonly lauded in AI discussions, Thomas Hobbes’s “Leviathan” (1651) is ever-fascinating as a foundational early work in modeling population interactions, broadly considered the creation of the field of ‘social contract theory’ in political philosophy. https://www.amazon.com/Leviathan-Penguin-Classics-Thomas-Hobbes/dp/0141395095
There are also attempts to review, synthesize, and critique approaches of analysis such as:
- John Hands’s “Cosmosapiens” (2016) https://www.amazon.com/Cosmosapiens-John-Hands-audiobook/dp/B01BKYCHYQ
- Erica Thompson’s “Escape from Model Land: How Mathematical Models Can Lead Us Astray and What We Can Do About It” (2022) https://www.amazon.com/Escape-Model-Land-Mathematical-Models/dp/1541600983
- Coco Krumme’s “Optimal Illusions: The False Promise of Optimization” (September 12, 2023) https://www.amazon.com/Optimal-Illusions-False-Promise-Optimization/dp/0593331117
In a broader sense, while not conclusively ‘solving the universe,’ books such as these inform a study of research itself and how people and communities see and use STEM. Somehow we need to include books that are just about people being people, such as Krumme’s book.
And on the topic of the history of STEM, including social views on STEM areas, I always recommend Eric Ashby’s “Technology and the Academics: an Essay on Universities and the Scientific Revolution” (1958).
As one more historical note, people have been talking about the uses and impacts of technologies like AI-technologies for a surprisingly long time. A nice survey of this is “AI Narratives: A History of Imaginative Thinking about Intelligent Machines” a fascinating if often dry collection of essays by different authors (always a treat!) edited by Stephen Cave.
https://www.amazon.com/AI-Narratives-Imaginative-Thinking-Intelligent/dp/B087XCDSDB In many ways the debates of 2023 are more connected with the flavor of human history than with the platitude plateau following the postwar years.
Frontiers Inside & Outside
From very roughly 1971–2020 it was broadly believed that there were no frontiers. The universe was dead and uninhabitable. Science and history were contemptuously considered a simplistic fait accompli. Periodically it is said that everything has already been invented or there is nothing yet to discover. The world is large, and we should try our best to cultivate our perspectives.
About The Series
This mini-article is part of a series to support clear discussions about Artificial Intelligence (AI-ML). A more in-depth discussion and framework proposal is available in this github repo: