Building the plane while flying it: an LLM case study

People + AI Research @ Google

Quinn Madison, Alexa Koenings, Iris Chu, Pratheek I, from Google DevAI UX
Contributor: Laurie Pham, Google DevAI UX

4 cards arranged in clockwise order that indicate the UI patterns described in the article. The 1st card shows a UI in which AI-generated code has 2 sources: controls to provide feedback, and a message to the user to use code with caution. The 2nd card asks users why they chose a rating, with a response section. The 3rd card recommends prompts for users to get started with coding tasks, and the 4th card lists 3 questions that users might ask an LLM, accompanied by feedback and documentation.
Enable user feedback, design for steerability, and set expectations so users can calibrate their trust in your AI system. Examples from Develocity, a fictional application created for the People + AI Guidebook.

As a UX design team operating in the artificial intelligence space inside Google, we navigate uncharted territory every day. That’s because there’s much to think about when designing human-centered artificial intelligence (AI) products. How can we best keep humans in the loop? How might our user interfaces (UIs) create continuous, dynamic feedback loops between users and AI models? How do we design to enable individuals, groups, and AI models (such as Large Language Models [LLMs]) to collect, curate, and share knowledge seamlessly?

Because this is a high velocity space where the limits of AI technologies are not yet known, dynamism is ever present: we’re designing the proverbial plane while flying it. Our experience has taught us that a principle-based approach is key. At the highest level, this means prioritizing continuous improvement, transparency, and user empowerment in the development of AI systems.

Over the last two years, we designed, tested, and continuously iterated on the development of an internal tool that helps software engineers (SWEs) increase their velocity at work. The tool, a proprietary LLM related to Google’s Gemini models, was trained on a corpus of internal SWE data and, over time, enabled developers to code faster and get quick answers to technical questions in their day-to-day work. To keep humans at the center of our product experience, we prioritized three core design principles:

  1. Enable user feedback
  2. Design for steerability
  3. Help users calibrate trust

Introducing Develocity

A tablet-like interface with an introductory message, “Welcome to Develocity. AI assisted code support tailored to your organization.” A code cell invites users to use devol@ to generate code with AI, and 2 buttons encourage users to experiment. There are 3 cards that describe functionalities and suggest prompts for users to get started. The 1st card states “turn comments into code”, the 2nd “search across your organization,” and the 3rd “chat-based code companion.”
A mock-up of Develocity’s onboarding screen. Created for the People + AI Guidebook.

Because our internal product is proprietary¹, we’ll refer instead to a fictitious product called Develocity. Develocity is a generative AI-powered product that increases developer productivity and improves code consistency. Develocity — featured originally in the People + AI Guidebook — is a helpful way to think through knowledge and dependency management challenges that are typical of organizations, software-as-a-service, and enterprise products. It also demonstrates a range of scenarios in which developers interact with underlying models, whether through language-based turns or UX components.

Develocity illustrates how generative AI models — specifically, LLMs capable of writing code — might enable developers to write, test, share, and manage code across their workflows. For the purposes of the following article, imagine that Develocity has an LLM-powered chatbot — one that can find resources for software engineers and also generate documentation as needed. We’ll use Develocity to demonstrate some of the design decisions that helped us put our principles into practice, so you can apply them to your AI products.

1. Enable User Feedback: capturing rich signals

Writing code is a creative, problem-solving task. Help from an LLM can unblock SWEs, increase coding velocity, and tighten code-iteration cycles — especially when the UI is intuitive. We designed a UI that enables users to fine-tune the model’s responses by refining their input prompts. We also saw an opportunity to collect and use explicit feedback about the LLM’s performance during chat exchanges to improve the model itself.

We were deliberate about which feedback signals to capture, eventually determining how many chips to show (we capped it at five), what the chips should say, and whether to include a rewrite option.

One of the most familiar ways to collect feedback is via thumbs up and thumbs down buttons. These are almost universally understood as feedback mechanisms for users to personalize AI outcomes, and work very well because they solicit immediate and lightweight feedback — with minimal interruptions.

The thumbs up/down pattern offers a binary signal about whether a user likes an AI outcome — but generative AI models typically benefit from a richer set of feedback signals before any meaningful change is possible. So we designed our chat with a thumbs up/down feedback pattern that enabled us to quickly collect positive signals — but also to ask for richer feedback on negatively rated AI outputs.

Users can optionally rate AI chat messages as helpful or not helpful, using the thumbs up and thumbs down buttons. Rating an AI output not helpful surfaces a rich feedback card.

When a trusted tester decides that a response is “Not helpful” and clicks the thumbs down button, a contextual feedback card appears. This lets the user quickly tell us why they chose that rating by selecting from a predefined set of chips. The chips surface common error types that we could anticipate, such as “Broken / incorrect link,” “Outdated information,” “Repetitive,” and “Not relevant.” Users can also elaborate with a short explanation, rewrite the response entirely, or add sources. This rich information is collected so that the engineering team can manually assess how the LLM needs to be tuned.
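As a rough illustration, a structured feedback record produced by such a card might look like the following sketch (in TypeScript, with hypothetical field names; the schema we actually used is proprietary):

```typescript
// Hypothetical shape of the feedback captured on each rated AI response.
// Field and chip names are illustrative, not the internal schema.

type Rating = "thumbs_up" | "thumbs_down";

// Predefined chips surfacing common, anticipated error types.
type FeedbackChip =
  | "broken_or_incorrect_link"
  | "outdated_information"
  | "repetitive"
  | "not_relevant";

interface FeedbackRecord {
  responseId: string;          // which AI output was rated
  rating: Rating;
  chips?: FeedbackChip[];      // only collected after a thumbs-down
  explanation?: string;        // optional short free-text elaboration
  suggestedRewrite?: string;   // optional user rewrite of the response
  addedSources?: string[];     // optional links the user thinks should be cited
  timestamp: number;
}

// Thumbs-up stays lightweight; thumbs-down opens the rich feedback card.
function buildFeedbackRecord(
  responseId: string,
  rating: Rating,
  cardInput?: Pick<FeedbackRecord, "chips" | "explanation" | "suggestedRewrite" | "addedSources">
): FeedbackRecord {
  return {
    responseId,
    rating,
    ...(rating === "thumbs_down" && cardInput ? cardInput : {}),
    timestamp: Date.now(),
  };
}
```

The point of the split is in the last function: positive ratings stay one click, while negative ratings carry the richer, optional detail that engineers can act on.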

Our feedback card, designed to engage users during early stages of product development, yielded feedback that improved both the accuracy and the confidence of our models.

Identifying and acting on opportunities to gather rich feedback in the short term can allow us to offer better AI outputs in the long term.

2. Design for Steerability: offering users control

Generative AI systems can be tuned to provide reasonably reliable outputs, but the exact outcomes can still sometimes be surprising or off-topic. It’s critical that product interactions accommodate this dynamism and set up users for success.

In our hypothetical Develocity use case, we took several under-the-hood measures. We set a prompt preamble for the LLM intended to maximize the relevance of outputs. We also set model parameters that would be reflected in the front-end user experience.
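The actual preamble and parameter values are proprietary, but a minimal sketch of the idea, with invented names and numbers, might look like this:

```typescript
// Illustrative only: the real preamble and parameter values are proprietary.
// The sketch shows how a request to a code-assistance LLM could be assembled
// so that fixed, under-the-hood settings shape the front-end experience.

const PROMPT_PREAMBLE = [
  "You are a coding assistant for software engineers at this organization.",
  "Prefer answers grounded in the organization's own codebase and docs.",
  "If a question is outside software engineering, say so and suggest alternatives.",
].join("\n");

// Hypothetical generation parameters chosen to favor relevant, consistent output.
const MODEL_PARAMS = {
  temperature: 0.2,      // low randomness: prioritize predictable, relevant answers
  maxOutputTokens: 1024, // keep responses readable in a chat UI
  candidateCount: 1,     // one answer per turn, refined through follow-up turns
};

function buildRequest(userPrompt: string) {
  return {
    prompt: `${PROMPT_PREAMBLE}\n\nUser: ${userPrompt}`,
    ...MODEL_PARAMS,
  };
}
```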

From UX research with our internal trusted testers, we discovered that our users’ interactions and engagement with the LLM largely depended on their goals: whether they were deeply engaged in a task, looking for quick facts, or simply exploring. Across all use cases, we found that steerability — enabling users to steer or refine the model’s outcomes — was key to a human-centered experience.

One way to improve the model’s steerability and AI outcomes was to give users the tools to manage context in the front end. We designed a multi-turn dialog system in which users could specify the parts of the context that were relevant to their query, preserving vital context for our AI system. This enabled the AI system to suggest follow-up questions that helped users advance their initial query.
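A minimal sketch of what such a session might look like, with hypothetical types and names (the real system’s internals are proprietary):

```typescript
// Hypothetical shape of a multi-turn session in which users mark the parts of
// the context that matter, so the system can preserve them across turns and
// suggest follow-up questions. All names are illustrative.

interface ContextItem {
  id: string;
  label: string;   // e.g. a file, snippet, or doc the user referenced
  pinned: boolean; // user marked this as relevant to the ongoing query
}

interface Turn {
  userMessage: string;
  modelResponse: string;
  suggestedFollowUps: string[]; // generated to help the user advance the query
}

interface Session {
  context: ContextItem[];
  turns: Turn[];
}

// Only pinned context is carried into the next model call, keeping the
// conversation focused on what the user said was relevant.
function contextForNextTurn(session: Session): ContextItem[] {
  return session.context.filter((item) => item.pinned);
}
```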

We further complemented this with a range of content strategy tactics, such as carefully crafted zero-state text for input fields, and introductory messages that oriented users with our experimental tool. Users were shown a range of tasks they could get assistance with, and how they might frame these tasks. We also described the risks and limitations associated with using our tool.

As a result, developers got answers that were more accurate and relevant. Our experience highlights an insight from the People + AI Guidebook: turns are the center of the generative AI experience, and conversational interfaces are one piece of the puzzle.

An introductory message that lists the tasks that users can “Chat with Develocity for assistance with:” when users open Develocity for the first time. The tasks are: Coding, Planning, and Learning. Below that is a modal with the title “important reminder”, that reminds users that they should not prompt Develocity with restricted code, privileged information, and user data. It asks users to review and acknowledge the Terms of Use and Data policy by selecting a checkbox and pressing confirm.
Craft introductory messages that are up-front about what your product can and can’t do when the user interacts with it. Help them explore the limitations of your AI powered product during initial and early use so they build healthy AI habits — even if it’s as simple as a reminder and acknowledgement.

3. Help calibrate trust: keeping humans in the loop

There is always a chance of inaccuracies with LLMs — hallucinations (confident but fabricated, incorrect responses) or an inexplicable AI outcome, for example. It’s important that the design of any AI product enables users to calibrate their trust in the system, relying on their own judgment if something seems amiss. While developers can easily use AI-generated code, for example, they’re still responsible for validating and executing this code, and for ensuring that it’s syntactically and functionally correct.

Without explanations of AI outcomes, it can be challenging for a user to calibrate their trust. Trust is critical to product adoption and user success, and as designers, there are a few things we can do to help foster appropriate levels of user trust. In designing our chatbot, we turned to human-centered design strategies:

  1. Set clear expectations: Transparent disclaimers about the AI’s capabilities and limitations help users form realistic expectations.
  2. Offer explanations when possible: When the AI encounters errors or limitations, provide concise explanations to reduce user frustration and maintain trust.
  3. Enable choice and control: Allow users to refine their prompts through AI-generated suggestions, provide feedback, or even override AI suggestions, to foster a sense of agency and collaboration.
  4. Show references and sources: Provide citations and relevant resources alongside AI outputs, to help users to verify information and understand its origins.
A screen where a user asks Develocity’s chatbot “When I can take a vacation from this project?”. The chatbot’s response shows the 4 human-centered strategies. 1. Sets clear expectations of scope by stating the question is out of scope. 2. Explanation for why it is out of scope (its purpose is for questions on software engineering). 3. Enables choice and control by providing user alternative question options. 4. At the bottom it shows references and sources: a link to staffing documentation.
Sometimes users may prompt your AI system for an out-of-scope task. When the AI encounters errors or limitations, acknowledge the error in the moment, and provide concise explanations. Then, address the error by giving users a way forward — in this case, alternative prompts and follow-up documentation.
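As an illustration only, a response payload carrying these four elements might be shaped like the sketch below; the field names, alternative prompts, and link are hypothetical.

```typescript
// Illustrative sketch of a chatbot response payload that carries the four
// trust-calibration elements described above. Names and values are hypothetical.

interface Reference {
  title: string;
  url: string;
}

interface AssistantResponse {
  answer: string;               // the response text shown to the user
  inScope: boolean;             // 1. sets expectations when the request is out of scope
  explanation?: string;         // 2. concise reason for the limitation
  alternativePrompts: string[]; // 3. choice and control: a way forward for the user
  references: Reference[];      // 4. sources the user can check to verify the answer
}

// Example payload for an out-of-scope question like the vacation prompt above.
const example: AssistantResponse = {
  answer: "That question is outside what I can help with.",
  inScope: false,
  explanation: "I'm designed to answer software engineering questions for your organization.",
  alternativePrompts: [
    "How do I hand off my current changes before time off?",
    "Who owns this project's on-call rotation?",
  ],
  references: [
    { title: "Staffing documentation", url: "https://example.internal/staffing" },
  ],
};
```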

In closing

When designing with AI, we design for systems where human and machine intelligence intersect and evolve together.

Our designs must account for and adapt to these AI systems, which consist of many elements, interconnected in such a way that they produce their own patterns of behavior, knowledge synthesis, and relationships with users.

For Develocity, our design decisions created a simple set of affordances that offered users a degree of steerability — and the guidance to explore a wide range of use cases. They also enabled our users to leverage our tool’s capabilities while keeping their own judgment in the loop. And of course, our feedback mechanisms made it easy for users to tell us when their goals weren’t met.

As we navigate the dynamic landscape of AI, where we’re designing and building the proverbial plane while it’s already in flight, we find that a commitment to principled UX is essential. It’s how we ensure a safe and meaningful journey for all passengers on board, fostering an environment of trust, transparency, and empowerment.

¹The examples, guidance, and recommendations in this article draw from Google user research studies and design explorations. The details of these are proprietary, so they are not included in this article.


People + AI Research (PAIR) is a multidisciplinary team at Google that explores the human side of AI.