Generative AI is new and exciting but conversation design principles are forever

Alessia Sacchi
Google Cloud - Community
11 min read · Dec 28, 2023

Generative AI filled us with wonder in 2023 but all magic comes with a price. At first glance, it seems like Large Language Models (LLMs) and generative AI can serve as a drop-in replacement for traditional chatbots and virtual agents. In reality, because LLMs are generalist word probability machines trained on a huge corpus of humanity, it is more important than ever that virtual agents powered by generative AI have access to the latest information and specific details about products or services. Furthermore, given that LLMs are non-deterministic by nature, what seems like creativity and open-ended possibilities to a developer might be perceived as inconsistency, quality issues, liability, or risk in business-critical industries such as finance, healthcare, or retail.

As the end of the year approaches, let’s wind down and reflect upon the fundamental principles required to preserve the human element when designing conversational flows, chatbots, virtual agents, and customer experiences. The generative AI we have been using in conversation this year brings a great deal of excitement, but there is a counterpart to everything. Designing loquacious bots that actually help users solve problems or complete transactions takes strong grounding in the science of conversational AI and a healthy dose of hands-on tinkering, as we aim to strike a balance between generating creative content and adhering to conversation best practices.

Let’s review a few design principles and pitfalls to keep in mind when blending generative language with deterministic agent design.

Rely on unwavering fundamental principles to guide you when designing conversations.

When dealing with time-sensitive and organisation-specific information, combine LLMs with vector databases, graph databases, and document stores to generate grounded and truthful responses.

Generative AI is amazing. This year, as part of my process of learning and testing the technology to report back to you, I’ve tried out a set of generative conversational features built on Dialogflow and Vertex AI. With these features, you can now use large language models to parse and comprehend content, generate agent responses, and control conversation flow. This can significantly reduce agent design time and improve agent quality. But generative AI is not without its problems. A large language model is a type of language model capable of general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training. Generative AI is the application of a model such as an LLM to generate text based on user input, and it does this by acting as a sort of “word probability machine” based on the corpus it was trained on. The bad news is that the information used to train an LLM may be weeks, months, or years out of date and, in a corporate AI chatbot, may not include specific information about the organization’s products or services. That can lead to incorrect responses that erode confidence in the technology.

The information used to train an LLM may be weeks, months, or years out of date.

In this example, a generalized LLM can accurately answer questions about the players, teams, history, and rules of a sport as they stood in the past. It wouldn’t be able to discuss last night’s game or provide current information about the latest world record or a particular athlete’s injury, because the LLM wouldn’t have that information. Given that an LLM takes significant computing horsepower to retrain, it isn’t feasible to keep the model current.

That’s where retrieval-augmented generation (RAG) comes in. Consider all the information that an organization has: the structured databases, the unstructured PDFs and other documents, the blogs, the news feeds, the chat transcripts from past customer service sessions. In RAG, this vast quantity of dynamic data is translated into a common format and stored in a knowledge library that’s accessible to the generative AI system. The data in that knowledge library is then processed into numerical representations using an embedding model and stored in a vector database, which can be quickly searched and used to retrieve the correct contextual information.
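
To make the flow concrete, here is a minimal sketch of the indexing side of RAG in Python. The `embed_text` helper is a placeholder standing in for a real embedding model (for example, a hosted text-embedding endpoint), and the in-memory list stands in for a proper vector database; both are illustrative assumptions rather than a production setup.

```python
# Minimal sketch of the indexing side of RAG: chunk documents, embed them,
# and store the vectors alongside the original text for later retrieval.
# `embed_text` is a stand-in for a real embedding model; it is stubbed here
# so the sketch stays self-contained.

import numpy as np

def embed_text(text: str, dim: int = 128) -> np.ndarray:
    """Placeholder embedding: a pseudo-random unit vector derived from the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def chunk(document: str, max_words: int = 100) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# The "knowledge library": each entry keeps the chunk text and its vector.
knowledge_library: list[dict] = []

def ingest(documents: list[str]) -> None:
    """Chunk and embed every document, then add it to the knowledge library."""
    for doc in documents:
        for piece in chunk(doc):
            knowledge_library.append({"text": piece, "vector": embed_text(piece)})

ingest([
    "Policy FAQ: claims for property damage must be filed within 30 days ...",
    "Product guide: the premium plan includes 24/7 phone support ...",
])
```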

Now, say an end user sends the generative AI system a specific prompt, for example, “What is the world record for diving?”. The query is transformed into a vector and used to query the vector database, which retrieves information relevant to that question’s context. That contextual information plus the original prompt are then fed into the LLM, which generates a text response based on both its somewhat out-of-date generalized knowledge and the extremely timely contextual information. Interestingly, while the process of training the generalized LLM is time-consuming and costly, updates to the RAG model are just the opposite. New data can be loaded into the embedded language model and translated into vectors on a continuous, incremental basis. In fact, the answers from the entire generative AI system can be fed back into the RAG model, improving its performance and accuracy, because, in effect, it knows how it has already answered a similar question. In short, RAG provides timeliness, context, and accuracy grounded in evidence to generative AI, going beyond what the LLM itself can provide.
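
Continuing the same sketch, the query side embeds the user’s question, retrieves the closest chunks from the knowledge library built above, and assembles a grounded prompt for the LLM. The `call_llm` function is again a placeholder, not a specific API.

```python
# Query side of the same sketch (reuses `embed_text` and `knowledge_library`
# from the indexing snippet above): embed the question, retrieve the closest
# chunks, and build a grounded prompt for the LLM.

import numpy as np

def call_llm(prompt: str) -> str:
    """Stand-in for a real generation API call."""
    return "[generated answer grounded in the retrieved context]"

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    q = embed_text(query)
    scored = sorted(knowledge_library,
                    key=lambda e: float(np.dot(q, e["vector"])),
                    reverse=True)
    return [e["text"] for e in scored[:k]]

def answer(query: str) -> str:
    """Combine retrieved context with the original prompt before calling the LLM."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer("What is the world record for diving?"))
```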

When we’re designing conversations with users, we want to be divergent when it comes to options and possibilities, and convergent when we are trying to help them solve a problem or complete a transaction.

Because LLMs are typically generalists trained on a large corpus of text, users can prompt or chat with them in a divergent way across a vast range of topics. But if you’re actually trying to solve a problem, like reporting property damage, what seems like creativity and open-ended possibilities might turn into a frustrating user experience.

An open-ended chat that always seems to diverge instead of converging towards the problem resolution.

On the other hand, if we were to use only deterministic, intent-based design, our chatbot might get the job done, but users would feel like they were conversing with an impersonal robot rather than a helpful assistant. In our daily life we use both convergent and divergent thinking. Convergent thinking focuses on reaching one well-defined solution to a problem. Divergent thinking involves creativity; it helps generate ideas and potentially develop multiple solutions to a problem. Both are needed for creative problem solving.

If we apply these two concepts to LLMs and virtual agents, it is reasonable to believe that LLMs are more suitable for creative, expressive conversations that help users learn, while virtual agents are ideal for focused, problem-solving conversations that can steer users towards their goal. When it comes to critical use-cases, such as reporting a lost passport or requesting to block a stolen credit card, it’s probably not a good idea to “delegate” the whole user-agent interaction to an open-ended LLM. Transactional use-cases are centered around clear and specific intents that often require deterministic prompts to collect and validate the structured data needed to trigger actions on the back-end. Failing to detect the user’s intent and extract the contextual data would inevitably build frustration. In conclusion, since we want to put the user experience first, for key and sensitive use-cases consider more deterministic approaches instead of delegating free-form NLP tasks (such as intent matching and slot filling) to LLMs.
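
As a rough illustration of what that deterministic side can look like, here is a small Python sketch of slot filling for a “block a stolen credit card” use-case: required slots, explicit validation, and a back-end action that fires only once everything has been collected. The slot names and validation rules are invented for the example.

```python
# Sketch of deterministic slot filling for a transactional intent
# ("block a stolen credit card"): required slots, explicit validation,
# and a back-end action triggered only when every slot is filled.

import re

REQUIRED_SLOTS = {
    "card_last_four": (r"^\d{4}$", "Please give me the last four digits of the card."),
    "date_of_birth": (r"^\d{4}-\d{2}-\d{2}$", "What is your date of birth (YYYY-MM-DD)?"),
}

def block_card(slots: dict[str, str]) -> str:
    """Placeholder for the real back-end call."""
    return f"Card ending in {slots['card_last_four']} has been blocked."

def fill_slots(collected: dict[str, str], slot: str, user_input: str) -> str:
    """Validate one answer; return the next prompt, or the confirmation when done."""
    pattern, reprompt = REQUIRED_SLOTS[slot]
    if not re.match(pattern, user_input.strip()):
        return reprompt  # deterministic re-prompt, no LLM involved
    collected[slot] = user_input.strip()
    missing = [s for s in REQUIRED_SLOTS if s not in collected]
    if missing:
        return REQUIRED_SLOTS[missing[0]][1]
    return block_card(collected)

collected: dict[str, str] = {}
print(fill_slots(collected, "card_last_four", "1234"))       # asks for date of birth
print(fill_slots(collected, "date_of_birth", "1990-05-17"))  # triggers block_card
```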

Use the 80/20 rule, or Pareto Principle, to avoid over-designing the agent. Then fall back to LLMs to cover edge cases and common detours, where an unpolished effort may be “good enough”.

One of the last steps in the detailed design phase of the Conversation Design Process is designing for the long tail. The head is represented by the key use cases that make up 20% of the possible paths in a dialog. These are the most important and most common conversational paths, the ones that 80% of users follow. When we are designing a conversational interface, we should focus the majority of our effort on making these paths a great user experience. But there’s also a body of common detours (less common, and often less direct or less successful, conversational paths) and a long tail of edge cases (highly uncommon paths) that we cannot disregard or omit from the agent design and implementation. Think about all the things that can go wrong in a conversation and all the unexpected or unsupported paths users might take.

Design for the long tail

Even with robust intents, there is still room for error. Users may go off script by remaining silent or saying something unexpected. While preventing errors from occurring is better than handling them after they occur, errors cannot be totally avoided. There are trade-offs in terms of perfection or completeness, and while for detours and edge cases an unpolished effort may be “good enough”, generic static prompts like “Sorry, I’m not sure how to help” are just not good enough, and we can certainly be a little more specific. Error prompts should be inspired by the Cooperative Principle, according to which efficient communication relies on the assumption that there’s an undercurrent of cooperation between conversational participants. Let me give you a real example: recently, at a restaurant in San Francisco, I was in the mood for some fruit. I asked the waiter if they had any, and the answer was simply “nope”. I then looked at the menu and noticed a fruit cake dessert. A more cooperative answer, aimed at my goal of having fruit, would have been “Sorry, we don’t have any fruit on the menu, but I can recommend the fruit cake”.

The generative fallback feature uses Google’s latest generative large language models to generate virtual agent responses when end-user input does not match an intent or a parameter for form filling. The feature can be configured with a text prompt that instructs the LLM how to respond, and it is given the conversation between the agent and the user. Combined with good flow and intent descriptions, generative fallback can produce specific, cooperative responses instead of generic prompts like “Sorry, I’m not sure how to help” or “Sorry, you’ve entered an invalid option”. Error prompts generated by large language models can gently steer users back towards the successful paths or reset their expectations about what is and isn’t possible.
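
The sketch below illustrates the general idea behind a generative fallback handler (it is not the Dialogflow CX API itself): when no intent matches, an instruction prompt plus the recent conversation is sent to an LLM so that the “error” message stays cooperative and on-topic. The prompt text and the `call_llm` stub are assumptions for illustration.

```python
# Illustrative generative fallback handler (not the Dialogflow CX API): when no
# intent matches, a short instruction prompt plus the recent conversation is
# sent to an LLM so the "error" message stays cooperative and on-topic.

def call_llm(prompt: str) -> str:
    """Stand-in for a real generation API call."""
    return "I can't help with scuba courses, but I can help you book a liveaboard."

FALLBACK_PROMPT = (
    "You are a virtual agent for a dive-travel company. The user said something "
    "the agent does not support. Politely explain what you can help with "
    "(booking liveaboards) and steer the conversation back on track.\n\n"
    "Conversation so far:\n{history}\n\nAgent:"
)

def generative_fallback(history: list[str]) -> str:
    """Build the fallback prompt from the conversation history and ask the LLM."""
    return call_llm(FALLBACK_PROMPT.format(history="\n".join(history)))

print(generative_fallback([
    "User: I'd like to book a liveaboard.",
    "Agent: Great! For which dates?",
    "User: Actually, can I take a scuba course first?",
]))
```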

Let’s look at an example of what happens when we design a virtual agent to be convergent. In this example, the user’s goal is to book a liveaboard for his family. Notice how the agent is not too prescriptive, yet thanks to LLMs it handles an unexpected destination as well as the user’s intent to take a scuba course. It resets expectations about what is and isn’t possible and steers the conversation back to the successful path. It would be practically impossible to design an agent by hand to handle the myriad of unexpected inputs users can produce. That’s where generative AI comes into play.

Now, let’s put this best practice into action and design a blend of deterministic, goal-oriented conversation and generative responses; we’ll see how the agent switches to a generative, LLM-based approach when it’s appropriate. Once the question is answered or the distraction is over, the agent returns to helping the user with their primary goal.
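
Here is a compressed, purely illustrative sketch of that routing: the deterministic path handles the booking goal, a generative detour handles anything unsupported, and the agent then repeats the pending question so the conversation converges again. The intent matcher and the detour response are stubbed.

```python
# Compressed sketch of the blended routing: deterministic handling for the
# booking goal, a generative detour when nothing matches, then a return to the
# pending question so the conversation converges again. Helpers are stubbed.

PENDING_QUESTION = "Which dates would you like for your liveaboard?"

def match_intent(user_input: str) -> str | None:
    """Toy deterministic intent matcher."""
    return "book_liveaboard" if "book" in user_input.lower() else None

def generative_detour(user_input: str) -> str:
    """Placeholder for the LLM-based fallback shown earlier."""
    return "I can't help with scuba courses, but I do know some great liveaboards."

def handle_turn(user_input: str) -> str:
    if match_intent(user_input) == "book_liveaboard":
        return PENDING_QUESTION                     # deterministic, goal-oriented path
    detour = generative_detour(user_input)          # generative handling of the detour
    return f"{detour} {PENDING_QUESTION}"           # ...then back to the primary goal

print(handle_turn("Can I also take a scuba course?"))
```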

It is important to highlight that although generative responses provide good, context-specific error handling, we should still follow agent design best practices related to intents and entities so agents can optimally process end-user input and avoid raising unnecessary no-matches.

Inject generative content into your chat and voice bots to help gain user trust and make experiences a bit more human.

According to the principles of conversation design, defining a clear system persona is vital to ensuring a consistent user experience that builds user trust. Your persona can help provide users with a mental model for what the agent can do and how it works by starting from what users already know. For example, consider the persona behind a chatbot that assists donors with blood donation eligibility questions. It could be modeled after an idealized helpful assistant: empathetic about donors’ needs and trustworthy with sensitive personal information. In this demo app, an LLM determines the user’s eligibility to donate blood by taking them through a quiz. Below is a generative email crafted by the LLM for the scenario where the person doesn’t quite meet all the necessary requirements to donate.

Dear [User],

Thank you for your interest in donating blood. We appreciate your willingness to help others in need. Based on the information you provided, you are eligible to donate blood. You are within the age range of 18–75, your weight is above 50 kg, and you are not pregnant or have recently given birth.

However, you mentioned that you are vegan and your iron levels tend to be low. We would like to inform you that low iron levels can affect your eligibility to donate blood. Hemoglobin is a protein in red blood cells that carries oxygen throughout the body. Iron is necessary for the production of hemoglobin. If your iron levels are too low, you may not be able to donate blood.

We recommend that you consult with your doctor to determine if your iron levels are within the acceptable range for blood donation. If your iron levels are low, you may need to take iron supplements to increase your levels before you can donate blood.

We appreciate your understanding and look forward to seeing you at one of our centres.

Sincerely,
[Your Name]

This kind of response makes the user experience feel familiar and personal. Users are guided by the large corpus of data LLMs are trained on, grounded in the content of the organisation’s specific knowledge base.
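
A follow-up email like the one above could plausibly be assembled by injecting the structured quiz answers and a snippet from the organisation’s knowledge base into the prompt, so that both the persona’s tone and the facts come from controlled sources. The fields and policy text below are invented for illustration.

```python
# Sketch of prompting an LLM to draft the follow-up email: the quiz answers and
# a knowledge-base snippet are injected into the prompt so the output stays
# grounded in the organisation's own rules. All values are illustrative.

def call_llm(prompt: str) -> str:
    """Stand-in for a real generation API call."""
    return "Dear [User], thank you for your interest in donating blood ..."

quiz_answers = {"age": 34, "weight_kg": 62, "pregnant": False, "diet": "vegan", "low_iron": True}
kb_snippet = "Example policy: donors with low iron levels may be deferred from donating."

email_prompt = (
    "You are an empathetic assistant for a blood donation service.\n"
    f"Donor quiz answers: {quiz_answers}\n"
    f"Relevant policy from the knowledge base: {kb_snippet}\n"
    "Write a short, warm email that explains the donor's eligibility, flags the "
    "low-iron concern, and recommends consulting a doctor."
)

draft_email = call_llm(email_prompt)
print(draft_email)
```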

Before wrapping up, I would like to mention a few more potential pitfalls to keep in mind when using large language models. Rapid advancements in LLMs have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. For example, think about a virtual agent that makes calls to an LLM trained only on the EU GDPR. If the same agent started being used in the US, the underlying model would likely not treat potential US users fairly, since it has been biased towards EU regulations. When adopting generative features, be mindful that business-critical industries like finance, healthcare, and retail are highly regulated when it comes to data privacy and fairness. With data security and privacy in mind, also be aware of the potential security threats to LLM-based applications that process very sensitive user inputs, since they are subject to prompt injection. While that might be harmless in some cases, it could become dangerous if the output is directly connected to a database or a third-party component.

Lastly, learn adequate new strategies to test non-deterministic responses. The testing strategies you have been using so far for traditional, deterministic conversational systems are very likely not going to work once those systems make LLM calls, because generative outputs are non-deterministic. Check out https://cxcli.xavidop.me/ to learn about virtual agent testing strategies.
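
One practical strategy, sketched below, is to stop asserting exact strings and instead assert on the facts that must appear in a generated response (a semantic-similarity check against a reference answer is a natural next step, but it requires a real embedding model). The helper and test values here are hypothetical.

```python
# One way to test non-deterministic output: rather than asserting an exact
# string, assert that the facts that must be present actually appear in the
# generated response.

def assert_generative_response(response: str, must_contain: list[str]) -> None:
    """Fail if any required fact is missing from the generated response."""
    for fact in must_contain:
        assert fact.lower() in response.lower(), f"missing required fact: {fact}"

# Hypothetical regression check for the liveaboard agent's fallback reply:
assert_generative_response(
    "I can't help with scuba courses, but I can help you book a liveaboard.",
    must_contain=["liveaboard"],
)
```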

In conclusion, Large Language Models (LLMs) cannot serve as a drop-in replacement for traditional chatbots and virtual agents; in fact, the two complement each other. Only if we keep in mind the fundamental principles of conversation design, as a design language based on human conversation, will we make the most of LLMs and generative AI. The goal is not to trick the user into thinking they’re talking to a human being, but simply to leverage the communication system users learned first and know best: conversation.
