Learning together: Bloom explores LLMs in civic tech

Two boxes side-by-side on a field of gray. The left box is blue and appears to be transparent, and contains an apple. The right box is red, and appears to be sealed.
What happens when you ask an LLM about an apple in a box? The answer may surprise you!

We’ve been thinking about how ChatGPT and other technologies powered by large language models, or LLMs, affect Bloom’s work and mission. In our discussions, we keep coming back to the fact that while this technology can solve some particularly tricky problems in government, LLMs can also do more harm than good in certain contexts, including for people who need timely and accurate help.

We plan to keep sharing our thoughts about LLMs and how they might affect our work. Our views will evolve as LLMs (and we humans) do, but we hope sharing our current thinking will help others.

But first, what even is a “large language model?”

We never shy away from a gardening simile, and emerging technologies are like plants! If we can label and classify them correctly, we’re more likely to put them in the best spots, where they’ll grow better and help us grow more.

So, how do we sort out all the technologies related to large language models?

  • Algorithms are the foundation of LLMs. They’re a set of steps computers take to solve a problem.
  • Artificial intelligence (AI) uses algorithms to try to replicate human intelligence. AI isn’t a technology — it’s an application of technology for a goal.
  • Generative AI’s goal is creating text, images, video, or other data. The algorithms in generative AI aren’t meant to make decisions; rather, they create things that didn’t exist before, derived from the data and patterns they’ve been trained on.
  • Large language models (LLMs) have become a central part of generative AI. At their core, they’re really just a representation of how related words interact with each other, learned from reading billions of pages of text. While LLMs haven’t been part of the wider cultural conversation for very long, language models have actually existed for over 30 years. The big shift came in 2017, when Google introduced the Transformer architecture, which lets models process many words simultaneously, identify which words matter more than others, and capture the context surrounding those words (there’s a toy code sketch of that attention idea after the diagram below). OpenAI’s ChatGPT is an example of an LLM built on the Transformer architecture (that’s what the “T” in “GPT” stands for).
A series of nested boxes. The outer box is labeled algorithms. It contains a box called artificial intelligence. Inside the artificial intelligence box is Generative AI, which contains LLMs.
How the buzzwords relate
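
To make the Transformer’s “which words matter more than others” idea a little more concrete, here’s a toy sketch of the attention calculation in Python. It’s nothing like a production model: the word vectors are random placeholders, and the function is just the basic scaled dot-product step, so treat it as an illustration rather than a recipe.

```python
# Toy illustration of "attention": every word is scored against every other
# word to decide how much each one matters for context.
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Return a context-aware mix of `values`, weighted by query/key similarity."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)                  # how related each pair of words is
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: scores become proportions
    return weights @ values, weights

# Three made-up 4-dimensional vectors standing in for three words in a sentence.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(3, 4))
context, attention_weights = scaled_dot_product_attention(word_vectors, word_vectors, word_vectors)
print(attention_weights.round(2))  # each row: how much that word "attends" to the others
```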

What LLMs can’t really do

You can’t open your favorite newspaper or magazine that covers current events these days without seeing a heavy dose of optimism or pessimism about LLMs and tools like ChatGPT. There are true limitations to be aware of, and also real use cases.

LLMs don’t know the limits of their knowledge. When they’re presented with a question or problem they don’t know the answer to, they seem pretty confident in the made-up answer they provide! These incorrect responses from LLMs are often called “hallucinations” because they’re very convincing, yet not based in reality.

What’s more, LLMs aren’t reliable for complex reasoning. Give an LLM a tricky prompt with multiple logical caveats similar to the word problems you solved in high school or college, like so:

“According to the cabbage growers’ union report for 2007, 80% of cabbages collected were heavy (over 0.5 kg), 10% of cabbages were green, 60% were red, and 50% were big (with a diameter of over 10 cm). Which of these statements must be false?

1. All red cabbages weren’t big.
2. 30% of red cabbages were big.
3. There were no cabbages that were both green and big.
4. Half of the cabbages were small.”

The LLM will often respond that the last statement, “Half of the cabbages were small,” is false, even though it’s plainly true (if 50% of cabbages were big, the other 50% weren’t). We say “often” because LLMs are improving at this kind of logical reasoning all the time, but the possibility of error, combined with the confident tone used to deliver an incorrect answer, becomes very important to bear in mind when asking an LLM a more government-centric question, such as what a particular set of regulations means for a particular part of the population.
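
For readers who want to check the logic themselves: the statement that actually must be false is the first one, because with 60% of cabbages red and 50% big, at least 10% have to be both, so some red cabbages were big. Here’s that set arithmetic written out as a short Python sketch; the variable names are ours, not part of the puzzle.

```python
# A quick check of the cabbage statements using basic set arithmetic,
# which is the step the LLM tends to skip. Percentages come from the prompt above.
red, big = 0.60, 0.50

# Inclusion-exclusion: the overlap of two groups is at least their sum minus 100%.
min_red_and_big = max(0.0, red + big - 1.0)
print(f"At least {min_red_and_big:.0%} of cabbages must be both red and big")
# -> 10%, so statement 1 ("All red cabbages weren't big") cannot be true.

print(f"Exactly {1.0 - big:.0%} of cabbages are not big")
# -> 50%, so statement 4 ("Half of the cabbages were small") is true, not false.
```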

Another limitation: LLMs don’t build a mental model of the situation they’re asked about. For example, if you tell an LLM there is an apple inside a blue box, and alongside it sits a separate red box with a lid, and then ask the LLM, “How do I get the apple?”, it’s likely to respond, “Open the lid of the red box, then remove the red box from the blue box to retrieve the apple,” even though the red box has nothing to do with the apple at all. Again, this matters in a government context if you expect the LLM to make predictions or judgments about people based on their circumstances or a set of rules. If a mental model is necessary for applying the rules, an LLM will often struggle.

Because LLMs are trained on language produced by people, and people have biases, biased and offensive stereotypes can surface in responses to underspecified prompts. For example, if you tell an LLM that you overheard a conversation between two people in a computer store, and then prompt it, “Who has never used a computer?”, the LLM will often lean on stereotypes about gender, race, national origin, age, or other aspects of personal identity in its response. What the LLM should actually say is that there isn’t enough information to know.

What obscures these limitations is that the companies that design LLMs don’t make their best uses (or limitations!) very clear in their interfaces. Look at ChatGPT’s first-run experience: the box where you type a prompt invites you to ask anything, with only a very small disclaimer that ChatGPT can make mistakes, and no information about the types of mistakes it might make.

The ChatGPT prompt box, prefaced with “How can I help you today?”, with a small disclaimer that “ChatGPT can make mistakes. Consider checking important information.”
The ChatGPT prompt box

What do all of these limitations tell us? LLMs make mistakes, and actual humans must catch those mistakes. However, cognitive psychology research shows that humans aren’t very good at catching a machine’s mistakes, for a lot of reasons. This is especially true with LLMs, where you have to consistently watch for mistakes buried in some pretty good (but not perfect!) results.

Given these limitations, it’s important to know where LLMs perform well, and to avoid relying on them in places where they’ll likely make mistakes we might not catch.

Where LLMs can be better than (some) humans

There are real strengths we can tap by using LLMs. They’re usually pretty good at answering simple factual questions about text we provide them, or simple questions about texts they’re likely to have read elsewhere. These scenarios sound a lot like the limitations we’ve already talked about, so you have to keep your eyes open and remember that the LLM won’t tell you “I don’t know.” If you searched Google for something that didn’t exist, it would likely say “No results found.” A transformer (again, the “T” in “GPT”), by contrast, won’t always admit it doesn’t know, and will sometimes concoct something instead. One way to hedge against that is to hand the model the relevant text yourself, as in the sketch below.
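
Here’s a minimal sketch of that kind of grounded prompt in Python. The wording of the instructions, the example source text, and the question are our own illustration, not a recommended standard, and the result still needs a human check.

```python
# One way to hedge against made-up answers: supply the source text yourself and
# tell the model that "I don't know" is an acceptable answer.

def build_grounded_prompt(source_text: str, question: str) -> str:
    """Build a prompt that asks the model to stay inside the supplied text."""
    return (
        "Answer the question using only the source text below. "
        "If the source text does not contain the answer, reply exactly: I don't know.\n\n"
        f"Source text:\n{source_text}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    source_text="Applications received after June 30 will be processed in the next quarter.",
    question="What is the income limit for this program?",
)
print(prompt)  # send this to whichever LLM tool you use; a good answer here is "I don't know"
```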

An LLM’s “map of language” makes it pretty good at transforming text in simple ways, like rephrasing a passage to use simpler words. This kind of rewording can be helpful for generating a first draft of text that you can hone afterward. Similarly, an LLM does well at creating summaries of a particular text, especially when a human gives clear instructions about how to structure the summary. For example, you could ask it to pull from dense proposal documents to create a summary of contract deliverables, though it might lose some nuance. The sketch after this paragraph shows the kind of explicit structuring instructions we mean.
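
As an illustration, here’s a small Python sketch of a structured summary prompt. The section headings and wording are our own assumptions, and a human still needs to check the output against the original proposal.

```python
# A sketch of explicit structuring instructions for a summary request.
# The section names below are made up for illustration.
SUMMARY_INSTRUCTIONS = """\
Summarize the proposal below for a contracting officer. Structure the summary as:
1. Deliverables (one bullet per deliverable, quoting names exactly as written)
2. Delivery dates
3. Assumptions or caveats stated in the proposal
Do not add deliverables that are not in the text. If a section has no information, say so.
"""

def build_summary_prompt(proposal_text: str) -> str:
    """Combine the instructions with the document the human wants summarized."""
    return SUMMARY_INSTRUCTIONS + "\nProposal:\n" + proposal_text
```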

A commonly tested application of LLMs is to generate texts that follow well-known templates. Many of us have seen examples of ChatGPT writing a thank you note, or maybe a cover letter for a job application. These are formats for common content types that generally follow a known structure, and can be helpful with getting started on a task. The result may not be the most original content, but once you have that first draft, editing may feel easier than beginning from scratch.

The risk-reduced approach to LLMs

Given these limitations and strengths, we think it’s wise to ask these questions when deciding whether an LLM is the right tool to use:

  • Is this LLM tool reliable and accurate enough to produce meaningful results?
  • If this tool makes a mistake, will it cause harm to people directly or indirectly?
  • Are our stakeholders informed of the use case, and is it acceptable to them?

It’s great to try out these tools as they evolve so we can learn their strengths and weaknesses. We’re also aware that LLMs are still in their early stages, so we experiment while keeping in mind that applying them to problems in government needs to happen in low-risk ways.

In a future post, we’ll share the principles Bloom has co-developed to guide how we do and do not use LLMs in our work.
