Learning together: Bloom’s principles for using Large Language Models

Colin MacArthur
Pollinator: the Bloom Works blog
7 min read · Jul 8, 2024
The scales of justice superimposed over a network of dots and connecting lines.

Much ado has been made about AI and LLMs, and how to drive innovation around them. Private sector companies are racing to develop tools and features around the technology that range from fairly impractical to rather clever. The drive and tone of these efforts have been largely experimental, a pattern we’ve seen in past large leaps in technology such as augmented reality and the blockchain. But as with those examples and others, it’s important for technologists like us, who design and develop products intended for use in government services, to demonstrate more caution and care. As a public benefit corporation, we’re beholden to the public and their needs.

Last month, we shared insight into recent conversations we’ve had as an organization about applying large language models (LLMs) to our work in civic tech. Internally at Bloom, we’ve found it helpful to be on the same page about what LLMs can and can’t do. As we came to a shared understanding of the realistic capabilities of LLMs, we began pondering another question: Given what LLMs can and can’t do at this particular time, how might we decide whether and how to use LLMs to solve real problems in delivering good government services?

How does Bloom think about using LLMs responsibly?

LLMs (and GPTs, in particular) seem very good at some things, and very bad at others. We hear that LLMs can pass a law school entrance exam, but also see obvious examples of their mistakes.

Not long ago, during a new series of emerging tech talks here at Bloom, I was excited to lead us through a discussion about LLMs and where they may or may not be the best tool for a given job (our previous blog post covered much of this discussion!). I have a lot of experience working with LLMs in a research capacity, so I was able not only to talk about their strengths and weaknesses, but also to take the first pass at Bloom’s guidelines for how to use them responsibly. Following this first draft, I led what turned out to be an inclusive and enlightening discussion with Bloomers across the company to gather their feedback and understand how they were thinking about using LLMs.

These principles are the result of our exploration.

Guiding principles that probably won’t change over time

1. Avoid harm, and repair it if it occurs. LLM-generated instructions can contain errors that reviewers miss, and those errors can have real and harmful effects. Be mindful that errors are always possible when employing an LLM, and if an error makes it out to the public, name it and fix it as soon as possible.

Examples: When trying to improve life-altering government services, it is always useful to review policy research. Employing an LLM to make quick work of those reviews might be effective for creating summaries and condensing mountains of peer-reviewed or multilingual research. However, if you plan to publish those summaries in some way, the language improvements an LLM is likely to apply may change how and whether people receive that service. Carefully review the LLM’s responses for validity. If mistaken guidance from an LLM does reach constituents, give notice in places like the organization’s website, through individualized notifications to affected people, or even through support organizations that serve constituents. Moving forward, purge the incorrect data and validate that correct answers are being provided. If that’s not possible, decommission the LLM or chat bot.

2. Use LLMs when they are a better solution to meeting people’s needs than existing approaches. If we can organize documents into an easy-to-navigate hierarchy, we don’t turn them into an LLM-based chat bot. If using a template or a formula to write website pages is quick and easy, we do that. LLMs take time and patience. If there’s a simpler approach, we embrace it.

Example: Government services like unemployment insurance often have formulaic rules and requirements to determine eligibility. Using an LLM to create more consistent guidance about those rules and requirements might be a great application in a call center where agents have varying degrees of familiarity with the regulations. But LLMs tend to be variable, and you wouldn’t want to feed them a pile of unemployment claims to make determinations. Instead, a traditional algorithm-based piece of software would likely give fairer and more consistent determinations, especially when supported with a responsive appeals process for issues that require a human touch (a minimal sketch of that kind of rule-based logic follows).
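To make “traditional algorithm-based software” concrete, here is a minimal Python sketch of deterministic, rule-based eligibility logic. The thresholds, field names, and separation reasons are hypothetical illustrations, not actual unemployment insurance policy; the point is that the same written rules apply to every claim in the same order, so determinations stay consistent and auditable.

```python
# Minimal sketch of deterministic, rule-based eligibility logic.
# All thresholds and field names are hypothetical illustrations,
# not real unemployment insurance policy.
from dataclasses import dataclass


@dataclass
class Claim:
    base_period_wages: float   # wages earned during the base period
    weeks_worked: int          # weeks worked during the base period
    separation_reason: str     # e.g. "laid_off", "quit", "fired_for_cause"


def is_eligible(claim: Claim) -> tuple[bool, str]:
    """Apply the same written rules to every claim, in the same order."""
    if claim.base_period_wages < 2500:
        return False, "Base period wages below the minimum threshold."
    if claim.weeks_worked < 20:
        return False, "Fewer than the required weeks of covered work."
    if claim.separation_reason not in ("laid_off", "position_eliminated"):
        return False, "Separation reason requires adjudicator review."
    return True, "Meets the basic monetary and separation requirements."


print(is_eligible(Claim(base_period_wages=31000, weeks_worked=40,
                        separation_reason="laid_off")))
```

Because the rules are explicit, every determination can be explained, appealed, and corrected, which is much harder to guarantee with a model that may answer the same question differently each time.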

3. Use data and evidence to inform decisions when selecting and implementing an LLM. We use research and data to choose how to use emerging technologies and to mitigate their harms. For example, we run tests that compare an LLM’s performance to a human’s before using the LLM to do the human’s work. Or we incorporate research from scientific articles about the strengths and weaknesses of LLMs to make decisions about how we’ll use them.

Example: Let’s say you’re developing a chat bot to answer questions about federal nutrition benefits. Read reviews from reputable organizations about available LLM products to understand their strengths and weaknesses and select the most appropriate one. If your research includes a request for proposal (RFP) from companies that develop LLM products, which is common for government organizations seeking technology solutions, ask detailed questions about how the company has trained and validated their products, and insist on access to their tests. Once in use, validate output anytime the LLM algorithm or training data significantly changes, just as you would with other technology you employ: automatically run routine tests that inspect answers to the top 10–25 questions people ask, or manually conduct user research to understand whether the answers are helpful. A sketch of one such routine check follows.
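As one illustration, here is a minimal Python sketch of that kind of routine check. The ask_fn callable stands in for whatever chat bot is actually in production, and the questions and required facts are invented for the example; real reference answers would be written and vetted by program staff, and anything flagged would go to a human reviewer.

```python
# Minimal sketch of a routine validation check for a chat bot.
# The questions and required facts below are illustrative only;
# real reference answers should be written and vetted by program staff.
from typing import Callable

TOP_QUESTIONS = {
    "What documents do I need to apply?": ["photo ID", "proof of income"],
    "Can I apply online?": ["online"],
    # ...extend to the 10-25 most common questions people actually ask
}


def run_routine_check(ask_fn: Callable[[str], str]) -> list[str]:
    """Flag any answer that omits a required fact, for human review."""
    failures = []
    for question, required_facts in TOP_QUESTIONS.items():
        answer = ask_fn(question).lower()
        missing = [fact for fact in required_facts if fact.lower() not in answer]
        if missing:
            failures.append(f"{question!r} is missing: {', '.join(missing)}")
    return failures


if __name__ == "__main__":
    # Stand-in chat bot for demonstration; swap in the real service call.
    def fake_chat_bot(question: str) -> str:
        return "You can apply online with a photo ID."

    for failure in run_routine_check(fake_chat_bot):
        print("REVIEW NEEDED:", failure)
```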

Guidance that Bloom may revisit as LLMs change

1. If the potential for harm is high, only use an LLM if it’s more reliable than humans. LLMs do almost nothing perfectly or consistently. In some situations, their mistakes could harm people. And although we can ask humans to catch mistakes, they always miss some. In “high risk” contexts, we discourage using LLMs unless they are more accurate and cause less harm than humans do.

Example: It can take years of training for call center staff to deeply understand and accurately distribute policy information. An LLM has the potential to quickly absorb much more information, but it should only take on that role if testing shows it relays that information at least as reliably as trained staff.

2. Always check LLM responses for accuracy, no matter how low the potential for harm is. When machines are semi-reliable, humans rely on them and miss their errors. Proactively plan for an LLM’s mistakes: have a second reader, institute a review process, audit ongoing performance. If that sounds like too much work for too little benefit, it may not make sense to use an LLM in that situation.

Example: Imagine you’re interested in using an LLM to help revise written content on a massive government website so that it adheres to standard plain language guidance. While an LLM might speed up the writing process, you should retain the later peer review process that you would have historically used for content written by a person.

3. Avoid using LLMs to make decisions or judgments. LLMs often struggle to apply complex logic to prompts about people or situations. When they try, they often lean on biases or stereotypes present in the content they were trained on, which have no place anywhere, and especially not in government contexts.

Example: Don’t replace a social worker or human adjudicator with an LLM. Many policy implementation decisions are complex and require a deep understanding of an individual’s situation. Stereotypical advice from an LLM could lead a person with unique needs down a bad path.

4. Know where and when you’re inputting private information into an LLM, and whether that’s acceptable. LLMs often don’t make it clear how they might be using the information given to them, where that prompt information is stored, or where it could pop up again (like in the response to someone else’s prompt!).

Example: The free versions of ChatGPT or Bing Copilot train themselves on your prompts (and those from everyone else who uses them, which is a lot of people!). If you prompt these products with your organization’s financial information, that data could appear in responses to a stranger’s prompts. If you work with confidential information, carefully check the LLM’s privacy policy first, or use a locally run model like GPT4All (a minimal sketch of that approach follows).
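For instance, here is a minimal sketch using the open source gpt4all Python bindings (installed with pip install gpt4all), which download a model once and then run it entirely on your own machine, so confidential text never leaves it. The model filename is only an example and should be checked against GPT4All’s current model list; treat this as a sketch, not a recommendation of a particular model.

```python
# Minimal sketch: prompting a locally run model with the gpt4all
# Python bindings so confidential text never leaves your machine.
# The model filename is an example; check GPT4All's model list.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # downloaded once, runs locally

summary = model.generate(
    "Summarize this internal memo in plain language:\n"
    "<confidential text would go here>",
    max_tokens=300,
)
print(summary)
```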

5. Be open about how and when we use people’s data to train LLMs (and anything else we do with their data). Many people are concerned about how the data they submit to online forms will be used, whether that’s selling it to a third-party marketing firm, or perhaps worse. A majority of the US public are concerned about how artificial intelligence affects their privacy, and want openness about how and when LLMs are used.

Example: Imagine you collect information on a form where people are expected to answer long, open-ended questions. To speed up form review, you might train an LLM to summarize their answers. In that case, you should disclose — on the form itself — that you may train an LLM with their responses.

Above all, we recognize that LLM capabilities will change, and so should how we think about them.

Next time we will share how Bloom’s practice areas — Product Delivery, Tech Strategy, UX Design and Research, and Content Strategy — think about using LLMs.
