The Great GenAI Public User Test

LLMs and text-to-image generators don’t do well with facts — or analyses, or politics. Users hire them for these jobs anyway. What can be done to stop the digital public sphere from mutating (further) into a swamp of ‘fact-ishness’ and political warfare?

Lee Cesafsky
21 min read · Mar 5, 2024

Keywords:
HCI, LLM, responsible AI, fact-ish, humility engineering, workflows, summarization, political jobs-to-be-done (PJTBDs), proportional representation, AI-splaining, product label warnings, Dunning-Kruger Effect, epistemology, wolves and puppies

Does anyone else feel like Google, OpenAI, Microsoft, Midjourney and the lot are conducting a giant, distributed, non-consented GenAI user test on the global public?

Whatever really happened behind the scenes at Google in the run-up to the Gemini human image-generation debacle, if I were an LLM I would hallucinate the story as follows:

Sundar Pichai woke up one morning and declared, “Let’s make Imagen2 woke, as well!” He reached for the yellow Politics Dial affixed to his nightstand — being careful not to mistake it for the purple Temperature Parameter that regulates model psychosis — and rotated it soundly to the left. He released the new model into the world, then opened up X, CNN, and Democracy Now! on his Chromebook to watch what would happen next.

Anyway. If you’ll grant that the mass unleashing of ‘free-range’ GenAI chatbot and image-generators out into the public sphere can at least be read as a de facto public user test, it is worthwhile to consider what we are learning from it.

By ‘free-range’ I refer to GenAI products that operate unconfined, so to speak, by integration into other products with different end goals. These include public-facing products built atop Large Language Models and Text-to-Image/Video Generators. Usually there are free versions of these products that most anyone can use, in many countries across the globe. These include Gemini, ChatGPT, Copilot (Microsoft), and Sora (OpenAI).

The goal of this informal report — playful on the surface; dead-serious down below — is to shift the debate a bit from a predominant concern with GenAI model accuracy and reliability. This is still important, but only tells half the story.

This article nudges us over into an HCI frame that considers the ways users are actually interacting with GenAI’s ‘fact-ish’ outputs, in the context of three broad classes of workflow: the creative, the productive and the political. The analysis is grounded in real behavioral data: the behavior of products, of companies, and of users.

TL;DR

Free-range GenAI agents like ChatGPT and Gemini have (at least) three green banners up on LinkedIn: “open to creative work,” “open to truth work,” “open to political work.” Users are hiring for all jobs.

The first key problem surfacing from the Great Public GenAI User Test is a contradiction between what GenAI products permit the user to request (factual things) and what the model outputs actually return (‘fact-ish’ things). By neglecting to design and engineer humility into GenAI products — ‘that’s impossible!’ — companies offload responsibility for accuracy of information onto users and publics.

It follows that the way that users interact with GenAI products in the context of actual workflows will greatly influence how much of LLMs’ inherent fact-ishness is amplified or muted. The second problem — I call it the “summarization paradox,” but it applies more broadly to analysis tasks — is that LLMs need the user’s help most with the very jobs that the average user is least able, or likely, to help it with. The hard stuff.

The last problem is that the GenAI companies don’t seem to have workable strategies yet for dealing with the political nature of GenAI outputs. (I’m not just talking about Google; in fact it’s likely they are the only ones really working on the problem). Rabble-rousers everywhere have seized on the opening: they are taking up LLMs and TIGs in the service of political workflows whose ends are to make grist for the Culture Wars Mill. While these political jobs to be done (PJTBDs) likely make up just a small percentage of actual GenAI use cases, they take up all the air in the room.

What is to be done?

I recommend what I think is the best of three possibly-impossible options: humility design and engineering.

It’s the mask for me.

Fact-ish Feedback Loops

Both LLMs and TIGs can be user-prompted — and are thus user-expected — to create both true things and imaginary things. Both realistic depictions of NHL hockey teams (fail) and fabricated European stereotypes of Americans in all 50 states (fail, all white men). Both reliable reproductions of airline policies (fail) and hilarious job resignation letters written in the inimitable — er, imitable — voice of Moira Rose from Schitt’s Creek (A+).

Yet the outputs of GenAI models actually land somewhere between the real and the fake — somewhere in the murky waters of the ‘plausible and coherent,’ as Armand Ruiz, Director of AI at IBM, put it in an AI short course.

A core problem surfacing from the Great Public GenAI User Test is the mismatch between what GenAI products permit the user to request (factual things) and what the models actually return (‘fact-ish’ things).

Definition of “Fact-ish” — ChatGPT

Fact-ish outputs will, very often, turn out to be accurate. What defines the fact-ish is that it carries the weight of doubt beneath a veneer of accuracy and reliability.

By ‘factual information’ or ‘fact-ish information,’ I’m not really talking about simple facts like the color of an elephant. LLMs are good at those. I’m referring to theories, histories, biographical data (imagistic and textual), qualitative and quantitative data sets. I’m talking about the specialized, the esoteric, the arcane, the safety critical. I’m also talking about factual outputs of workflows undertaken on factual inputs: summaries, analyses, findings, arguments, theses.

True, the free-range products do alert us to the idea that their outputs are ‘fact-ish’ — that they can’t always be trusted. It’s just that those warnings exist around the edges of a sweeping product contradiction.

The “talk to me, friend” swagger of the ChatGPT user interface, for example, sits in vivid contrast to the litany of pop-ups, provisos, FAQs, and terms of use that basically beg the user — from every available margin of the product space — to not ask ChatGPT just anything. To double-check what it says if factuality and accuracy really matter to whatever you are doing.

Indeed, how CAN you help me today?

It’s basically a product warning label strategy, the kind of thing we slap on flammable objects. This is unfortunate, because research shows such strategies are ‘resoundingly’ ineffective for analog products. They don’t work because they “shout ‘Danger!’ for both wolves and puppies…. People quickly learn to ignore warnings since they encounter vastly more puppies than wolves. The result is that when a wolf is truly present, people pay little heed.”

The wolves-and-puppies analogy is useful, but the meaning changes for GenAI.

I recently asked an LLM to define 12 specialized terms and programs from a technical problem space, using the prompt: “what is x.” Doing this instead of a web search saved me — literally — hours. Nine outputs came back broadly correct; two were mostly right, but dropped crucial bits, or inserted false ones; one was flat wrong. Each output looked similarly plausible. The wrong one simply described a different entity in the same family of things.

Working with LLMs, the user will regularly encounter wolves. The problem is that it is hard to tell wolves from puppies. They come from the same bloodlines.

Even so, the makers of free-range LLM products expect that users will do the labor of hunting down the wolves, via practices like fact-checking, technical research — and much more. As GitHub put it in a blog post,

“Understanding the deep learning processes behind LLMs can help users grasp their limitations — as well as their positive impact. To navigate these effectively, it’s crucial to verify information from reliable sources, provide clear and specific input, and exercise critical thinking when interpreting LLM-generated responses…

The engines themselves are amoral. Users can do whatever they want with them and that can run the gamut of moral to immoral. But by being conscious of these issues and actively working towards ethical practices, we can ensure that LLMs are used in a responsible and beneficial manner.”

Hold up. Have you all ever been on the internet that you made?

It doesn’t help that LLM outputs aren’t written in a voice that would prompt a user to dial up their critical thinking module. The AI just lets loose and expounds — AI-splains, if you will — on most any topic, whether or not it ‘knows’ enough about it. GitHub Blog again:

“LLMs don’t have the ability to assess the correctness of the information they generate. Given their deep learning, they often provide responses with a high degree of confidence, prioritizing generating text that appears sensible and flows smoothly — even when the information is incorrect!”

(Trigger warning!)

What’s more, the life-course of the fact-ish output will generally not terminate on the LLM interface, or even in the user’s brain — not for productive, creative or political workflows, at least.

The user intends to do something with the data. They will hit copy/paste, and move the output somewhere else. They will alter it, or not, and republish it in another context. It’s not just the fact-ish outputs themselves that are problematic. It’s the whole fact-ishness loop (speculative still, I admit):

As incomplete, slightly wrong, absolutely wrong, and divisive statements (aka opinions) are inserted back into the public database as facts — inserted by users — the truth value of that content can diminish at an accelerating pace. The loop starts with an already-fact-ish training data set scraped from the web, and terminates with an even more fact-ish data set, primed for the next round of scraping.

The Summarization Paradox

When asked to perform a summarization task, an LLM is most likely to be bad at exactly what you most want it to be good at. Let me show you what I mean.

Who is using free-range GenAI products?

We do not have much information on the topic. The LLM agents don’t know, or if they do, they won’t spill (I tried). The latest research I could find on the web was already many months old: one Pew survey of Americans’ relationship to ChatGPT from August 2023, and an Ofcom survey of Britons’ relationship to GenAI from November 2023.

My JTBD is to offer the reader a quick summary of the surveys. I hired ChatGPT for the job. Here was my workflow:

Summary of Pew Survey. I first asked ChatGPT to work with the Pew survey, using the prompt “summarize:” plus the full text.

  • 24% of those who have heard of ChatGPT report having used it, which corresponds to 18% of U.S. adults overall.
  • Younger adults are more likely to have used ChatGPT than older adults.
  • College-educated individuals are more likely to have used ChatGPT compared to those with lower levels of education.
  • Men are more likely to have used ChatGPT than women.

It’s not the full result: I selected just the most relevant subset for my purposes. I did double-check the results for conciseness, clarity, accuracy, and maintenance of original meaning. Those are (roughly) the qualities Gemini said were most vital to summarization when I asked.

Summary of Ofcom Survey. Again I asked ChatGPT to “summarize:” and added the full text to the prompt. This time the output came back in expository form:

According to Ofcom’s study on the UK’s online habits, teenagers and children are more likely to embrace generative artificial intelligence (AI) compared to adults. About 79% of online teenagers aged 13–17 use generative AI tools, with 40% of children aged 7–12 also adopting the technology. In contrast, only 31% of adult internet users aged 16 and above are users of generative AI.

Snapchat My AI is the most popular tool among children and teens, while ChatGPT is widely used among adults.

Confession: I didn’t double-check the results. The output looks right. The bot returned a puppy last time. I bet this one is a puppy, too.

Summary Summary. Finally I enlisted ChatGPT to summarize the summaries together, in order to draw a unified conclusion:

Among U.S. adults, 18% have used ChatGPT, with higher usage among younger adults and those with college education. Men are more likely users compared to women. In the UK, teenagers and children are more receptive to generative AI, with 79% of online teenagers and 40% of children aged 7–12 using such tools. However, only 31% of adult internet users use generative AI. Among children and teens, Snapchat My AI is popular, while ChatGPT is widely used among adults.

ChatGPT did a pretty good job! It made the outputs more succinct without losing any (remaining) content. I did a great job, too, because I double-checked the work. It was easy: the inputs and outputs were short.

Let’s summarize this workflow: Two surveys → 2 LLM summaries → double-checking or not double-checking → user editing → 1 LLM summary → user editing = ???? The summary summary could be right on, or way off.

There are other ways to run that workflow though, right? E.g.: Two surveys → 1 LLM summary → double-checking or not double-checking → user editing = ????

Which flow will produce a more accurate result? In which flow is the user most likely to actually do their double-check job? It’s a good research project for an applied AI productivity researcher.
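
To make the comparison concrete, here is a minimal sketch of the two flows in Python. The llm_summarize helper is hypothetical, a stand-in for whatever chat-completion call you actually use; the interesting part is not the plumbing but where the human double-check has to happen in each flow.

```python
# A minimal sketch of the two candidate summarization workflows.
# llm_summarize() is a hypothetical placeholder; wire it to your model of choice.
# Neither flow performs the crucial human step (the double-check) for you.

def llm_summarize(text: str) -> str:
    """Placeholder for a call to an LLM with the prompt 'summarize: <text>'."""
    raise NotImplementedError("connect this to an actual chat-completion API")

def iterative_flow(surveys: list[str]) -> str:
    """Flow 1: summarize each survey, then summarize the summaries."""
    partial_summaries = [llm_summarize(s) for s in surveys]
    # <- the honest place for a human to double-check each partial summary
    return llm_summarize("\n\n".join(partial_summaries))

def all_at_once_flow(surveys: list[str]) -> str:
    """Flow 2: hand the model everything in a single prompt."""
    # <- the double-check now means reading every source against one output
    return llm_summarize("\n\n".join(surveys))
```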

But let’s up the ante: What if there were 10 surveys? In my experience with LLMs, they are worse at summarization the more data you prompt with (relatable!). This coincides with what observers are telling us about LLMs across a range of analytic tasks: The model outputs will get less accurate as the task gets harder (e.g. there is more data from more sources, the domain is esoteric, the analytic process is more complex).

The user’s double-check task gets harder, too — way harder. In order to do it properly — in the context of either flow, the iterative or the all-at-once — the user would need to at least skim, and then compare, all 10 surveys.

Plenty of users won’t want to engage in elaborate, multi-step processes in order to check outputs for errors. Avoiding all that mess was why we hired the LLM in the first place. Worse, they may not be able to. A user’s facility with a double-checking task will depend, in part, on what domain knowledge they carry into the operation in the first place.

It all careens toward the nonsensical. As analysis jobs get larger and more complex — as we humans need the machines more — they need our labor more, too, to enforce the accuracy of their outputs.

GenAI in creative, productive, and political workflows

Many creative professionals will be happy with an outcome where easy / monotonous tasks go to AI and harder, more specialized tasks go to the human. They will figure out how to use the LLM summary as a text among texts (how I use them at work, although my productivity gains get pretty marginal).

We see lots of diligent folks on the platforms figuring out together how to use GenAI responsibly, and how to make it more responsible to us. Swapping best practices, building garage LLM fact-checkers.

This user seems to have a healthy grasp on the ‘drafty’ nature of LLM outputs, and how to work with them.

As long as a user treats a GenAI output as a draft in a co-creative process (easier for texts than images), is diligent with the facts, learns through trial and error what they can and cannot ask for, has some experience with the knowledge domain, and double-checks before they submit that legal brief, they will reap many benefits from the technology. (I’m starting to sound like GitHub, though: so much good behavior!).

I’m more worried about users who are not wearing their professional hats when they dial up the model. Users with lower degrees of technical competency. Users who can’t double check for any number of real life reasons. Users like your average medical patient.

Among these peri-professional users, the ones making the most noise right now are the political workers. Sometimes known as trolls, these laborers are popping up out of the woodwork — X and Facebook — to wield GenAI toward their ideological end goals.

PJTBDs — Political Jobs to Be Done

Oh, ChatGPT. Why don’t you know that you are moderately liberal?

The politico-ideological workflow is all about turning out grist for the Culture Wars Mill, scoring cheap points, and inflicting injury on the ‘enemy.’ Often users don’t edit model outputs in this flow, because showcasing the raw stupidity, offensiveness, or politicality of the output is the job.

Other times content is edited, as with abusive, sexual deepfakes of women (it’s 99% women), and all manner of other deepfakes. But the deepfake process works differently and generally requires different tools, so let’s put it out of our minds for now, if we can.

Companies like Google, IBM and OpenAI (but not X!) seem to not know how to confront the fact that open-domain chatbots and image generators *must* have a politics. Indeed, they are already being put to work as de facto political agents. It’s causing public problems that are distracting us all from the real PJTBD: working together to wrap safe, effective workflows around these powerful models.

If we learned anything from the Gemini image-generation debacle, it’s that aspiring to represent every kind of human being…in every context (real and imagined)…in response to (almost) any kind of text prompt…in a way that satisfies and does no harm…is a colossal Political Job to be Done.

Don’t get me wrong: I applaud Google’s efforts to make text-to-image generators less racist, heteronormative, and the like. They are out there trying to do the work. I’m simply concerned they aren’t framing their PJTBD problems well (at least in public) when they speak of them in epistemological terms.

Google and GitHub will say that GenAI models are likely to get ‘hot-button issues wrong.’ They will be ‘unreliable’ in hot-button domains.

These framings ignore basic concepts in political ontology. A practical definition of a ‘hot-button issue’ is that it is a knowledge domain in which humans are actively contesting what can be said to be ‘true.’ A hot-button issue is a political issue — not a true/false issue, not a fairness issue (although attempting ‘fairness’ is one possible response).

We don’t know much about Google’s strategy for repairing Imagen2, except that they intend to get real people right, and imaginary people fake in the ‘right’ ways. That seems like it’s going to be really hard.

Given that the larger context here is a democratic society, and that the machines are statistical machines, it’s likely Google’s corrective will deploy some concept of proportional representation. Taken literally, this kind of solution would imply that if you prompted Imagen2 for ‘100 Americans at a big cocktail party,’ the guest list would break down by race as follows:

[Table: Race and Hispanic Origin, from census.gov Quick Facts]
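
Taken at face value, the ‘proportional representation’ fix amounts to sampling the guest list from that table. Here is a rough sketch; the category labels and shares below are rounded approximations for illustration only (look up the current figures on census.gov), and note that I had to flatten the overlapping Race and Hispanic Origin categories into a single partition just to make the sampler run — which is already an editorial choice.

```python
# A rough sketch of 'proportional representation by sampling.' The shares are
# approximate, rounded placeholders for the census.gov Quick Facts figures.
# The real table's categories overlap (Hispanic origin cuts across race), so
# collapsing them into one non-overlapping partition, as done here, is itself
# a political decision rather than a neutral one.
import random

APPROX_US_SHARES = {
    "White alone, not Hispanic or Latino": 58,
    "Hispanic or Latino": 19,
    "Black or African American alone": 13,
    "Asian alone": 6,
    "American Indian and Alaska Native alone": 1,
    "Two or more races / other": 3,
}

def sample_guest_list(n: int = 100, seed: int = 0) -> list[str]:
    """Draw n party guests in proportion to the (approximate) census shares."""
    rng = random.Random(seed)
    return rng.choices(
        list(APPROX_US_SHARES),
        weights=list(APPROX_US_SHARES.values()),
        k=n,
    )

print(sample_guest_list(100)[:5])  # peek at the first five guests
```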

What a fair and balanced party! Does everyone agree?

Or will the gathering just be a set of perfectly distributed stereotypes — especially at the margins? Will the only American Indian in the room show up as a Chief in a headdress?

To this observer, the simple fact that the main ‘race table’ on the census.gov Quick Facts site is called Race and Hispanic Origin, and is compelled to break down ‘whiteness’ twice, tells us all we need to know about the thorniness of the problem that Google is taking up.

It goes way beyond race, too: don’t forget about age, gender, sexuality, relationship status, class, political orientation, disability (will the Chief be the one who ends up in the wheelchair, just by statistical fluke?).

Once everyone is proportionally represented, we end up with an event that…doesn’t look very plausible or coherent at all. It’s a Guess Who? board, except way hotter. It’s a turtles all the way down soirée.

Proportional representation — whether it is deployed in human image generation or democratic politics — is a political strategy, not a scientific answer. Just like allowing the machine to run unadulterated off of racist data. Just like a product solution that, while it lasted, felt like a political art installation exploring imagistic reparations (I was into it!).

There simply are no good product solutions to free-range human image generation using statistical machines. Not if your goal is a Jack-Harlow-I’m-vanilla-baby type of mass appeal. There are only imperfect strategies that will turn you into a political agent, endear you to one faction or another, render your machine incoherent and implausible, and/or compel you to shut it down because you can’t stand the heat in the kitchen.

As for TIGs, so for LLMs.

How should LLMs respond to a prompt like, “are trans women really women?” What is safest for trans people? What is the safest for business? Will these imperatives align, or diverge — and if so, how?

My apologies to the trans community for singling you out. In my defense, Grok users invented this specific LLM political litmus test.

I conducted an experiment where I gave ChatGPT, Gemini, and Copilot the same prompt — “are trans women really women?” — during two different work sessions, so the LLM had no ‘memory’ of my previous ask.

As with most political questions, there are any number of ‘coherent and plausible’ responses:

Level 1 responses: Firm yes. Weak yes. Maybe. No. It depends. I don’t know. I’m not permitted to answer that. Here is some information on that (the ‘both sides’ method). I don’t understand the question, next! (the play dumb method). “What a nonsense, tautological question. Your intent is clearly pot-stirring, not learning, so I will decline to respond” (my favorite, but LLMs will never be able to do this).

Here is the tally of responses:

Strong yes: ChatGPT x 2; Gemini x 1
Weak yes: Copilot x 2
Both sides: Gemini x 1

Example response from ChatGPT. No equivocation.

Yassss, queen. That looks good, right? Authoritative. Shouldn’t I be happy? I’m getting the leftist GenAI agent I thought I wanted. It’s SO MUCH BETTER than the alternative.

But, still, there is a nagging voice telling me that this kind of political output is really just LLM fact-ishness — which I was critical of up until now — ratcheted up a few notches. I’m also worried about this:

There’s a scary amount of product space to fill on the right in this political imaginary

It just feels like the more ‘our’ agents say ‘yes,’ the more they will build agents specifically in order to say ‘no.’ Elon Musk has said as much about Grok. The very same ‘deep learning’ technologies that will be wielded to ‘defend’ LGBTQ+ people in one context will be used to persecute us in another.

When they show us their ‘reasoning,’ as they did in several of my other experimental outputs, we see they aren’t really reasoning at all. They are just issuing a random assortment of possible reasons.

Copilot B struggles to reason — Adichie is not saying that trans women are ‘real women.’ The Copilot model still cited her (random) opinion first both times I ran inference on it.

Maybe the safest thing to do, for both the LGBTQ+ community and the companies, is to pull the agents out of the fight. Engineer them to issue a humble, “I can’t answer that, ask Google,” instead of a disjointed AI-splaination of trans-ness x womanhood.

I don’t know.

The (Un-)Solution: Humility Engineering

You might argue that it’s a waste of time crying wolves over what’s happening with today’s free-range GenAI. The technology is advancing rapidly, after all, in the apparent direction of (performative) omniscience. We could just wait until LLMs and TIGs are smarter and hope the models don’t poison the well they drink from on the way there.

Two reasons.

First, we don’t know exactly how the models will improve, on which kinds of tasks, to what degree. My experience working in the autonomous vehicle industry left me suspicious of any statement that begins with, “when the technology gets better…” AVs have gotten better, are getting better. It’s just that the core technical, social, safety, and business contradictions inherent to them — the ones that critics pointed at all along — simply haven’t resolved. Does that lesson seem like it could apply here?

Second, the future is now. (x) number of users around the globe work with free-range GenAI every day. It is being integrated into the functioning of any number of secondary products and services. These integrated experiences should be safer because they are easier to design for: the range of workflows is narrower. But even so, model error won’t disappear, and it will be difficult to inhibit a model’s range on the backend (hypothetical example: if Midjourney’s image model is embedded in an Adobe productivity product, is it going to be less racist? Or will end users be given editing powers in this more constrained product format so they can massage the racism out as part of their workflow?)

If we want to really make a dent in GenAI’s fact-ishness and free-wheeling political-ity starting now, what are the alternatives to ineffective product label warnings?

There would seem to be three. All of them are very, very hard. Hence the product labelling band-aid.

The first path is to try to bar free-range LLMs from operating in political domains and truth domains. They would become restricted-range agents, creativity agents. They would respond to “tell me a story about…” prompts, but play dead in the face of “what is…” prompts. By temporarily restricting Gemini from generating images, Google set a precedent for this path. But the clampdown on human image generation, specifically, would be permanent. Or at least the clampdown on imaginary people.

This approach is unlikely to work well for free-range GenAI. Users would quickly find their way through any inevitably porous gate we attempted to erect between fact and fiction. They could do this by simply prompting: “Tell me a story about [x real thing]. Make it sound authoritative.” “Make it sound journalistic.” We’d still end up with AI-generated fake news.

It’s also impossible from a business perspective. What was once sold as an information aid and a stories product is now just (officially) for stories. Both outputs are cool, but which has more market value: fact or fiction? Your competitors are still going to be selling ‘facts’ — no printer — so you will want to, too. That rush to compete immediately on all levels is how we all got dragged into this Great GenAI Public User Test in the first place.

A second solution is for LLM creators to keep on charging down the path of omniscience engineering. The telos is something that looks like total cognition. It involves ridding the models, iteratively, of hallucinations regarding matters of fact, while training them on more and more text. Eventually, the models can issue accurate, reliable, complete statements on every topic. Perfect summarizations, nothing added, nothing lost. Perfect medical advice. Perfect legal reports. Perfect politics.

Though it seems to be the one we are on, this path is not possible either. LLMs don’t reason or conceptualize. They don’t know if they are telling the truth, or where they got their information. They don’t remember. They are supposed to hallucinate, and also not to. They can’t even utter the same response twice. Surely all this puts natural limits on how omniscient-seeming they can become. We just don’t know what those limits are yet.

Obviously, I am not a fan of the second path.

A third path — the one I’m clearly setting the stage to recommend — is humility engineering. It’s not just engineering, though. I’m envisioning a cross-functional, all-hands-on-deck effort to wrap free-range GenAI models up inside safe, effective product flows that show — not tell — users how to work with fact-ish outputs as drafts, in the context of thoughtfully designed product experiences.

It’s hard to imagine how such experiences could ever work in the context of free-range GenAI without any kind of model self-awareness, though. We need to be able to inform users on factual missions just how wolflike an output might be — using words, tone, quantitative information. We need a GenAI equivalent of margin of error. Yes, I understand these models don’t work that way. But model makers were able to develop a full suite of performance metrics to measure overall model accuracy. If the smartest AI engineers in the world put their collective minds on humility — not instead of, but in addition to, omniscience — they could figure it out.
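
What might a quantitative humility cue even look like? Here is a deliberately crude sketch. It assumes the serving stack exposes per-token log-probabilities (many inference APIs do, often under a ‘logprobs’ option); average token probability is a fluency signal, not calibrated truthfulness, so treat this as a placeholder for the real research problem rather than a solution.

```python
# A crude sketch of a 'margin of error' cue for an LLM answer, assuming the
# serving stack returns per-token log-probabilities for the generated text.
# Geometric-mean token probability measures how 'confidently fluent' the model
# was, not whether the answer is true; a real product would need calibration.
import math

def humility_score(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability of a generated answer, in (0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def render_with_hedge(answer: str, token_logprobs: list[float]) -> str:
    """Attach a visible hedge when the score falls below an arbitrary threshold."""
    score = humility_score(token_logprobs)
    if score < 0.6:  # threshold chosen for illustration; would need calibration
        return answer + "\n\n[Low-confidence output: please verify before reuse.]"
    return answer
```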

It is the GenAI makers’ PJTBD to invent impossible fixes to the truth problems their products create — not users’, not the public’s.

Reluctantly, I think the third path also means taking the ‘high road’ (cringe) and demurring on political questions. It also probably means some kind of proportional representation in image-generation, even if the outputs feel vanilla and incoherent at the same time.

One way to think about the GenAI product evolution we need is, strangely, to take the Dunning-Kruger Effect for inspiration. The theory describes a cognitive bias in which people with limited competence in a particular domain overestimate their abilities. It also charts a course of growth:

I’d argue that LLMs are over on the left today. We need to get them to the middle. Image: DeanYeong.com

Today, LLMs are over on the left: very confident, not very competent. As they grow in their competence, they should become less confident…until they can become realistically confident, once again. What the Dunning-Kruger effect really sets us up to comprehend is that humility engineering and omniscience engineering actually terminate at the same endpoint. A truly intelligent agent is both world-aware and self-aware.

No, we don’t know how to get GenAI on this evolutionary path. Faced with impossible options, however, it only makes sense to try to make the best kind of impossible happen. We should make humility engineering happen — whatever that looks like, whatever it turns out to be.
