
When Users Message Faster Than AI Can Think — Part 1


TL;DR: Is your GenAI agent struggling to keep up with users who send multiple messages at a time? This three-part series dives into the challenges of rapid-fire messaging and presents a scalable, Redis-based solution for efficient message buffering and AI processing that has been implemented on Google Cloud for different use cases. In this first part you will learn more about the problem and about the idea of a distributed message buffer.

In today’s business landscape, where Generative AI (GenAI) is rapidly transforming customer interactions across various sectors — from help desk support and customer service to personalized education and interactive entertainment — effective communication with such agents is a must. Users now expect seamless and human-like interactions with AI-powered systems. However, many GenAI applications struggle to handle a specific communication style: rapid-fire messaging.

A WhatsApp chat where the sender uses a rapid-fire messaging pattern

This staccato messaging style is prevalent in popular instant messaging applications like WhatsApp and Telegram, and if this rapid-fire messaging phenomenon seems foreign to you, chances are you’re the friend we all have who crafts perfectly composed paragraphs in a single message (I admire your patience).

Over the past few months, I’ve been deeply involved in developing a cloud-native solution to address the challenges of rapid-fire messaging within a healthcare chatbot. This article delves into the complexities of this issue and elucidates the rationale behind our innovative approach.

Back to the basics: how do GenAI agents handle multi-turn conversations?

One message
One response

Let’s start by considering a conversation with one of the popular AI chatbots we often interact with, like ChatGPT, Gemini, or Claude. While these applications aren’t specifically designed for conversational-style communication, I’ve come across numerous custom applications that implement the same mechanism.

How can we characterize the interactions between a user and the AI within such tools?

It is like a graceful dance, where each message and response represents a carefully choreographed step:

  1. The User Leads: The user initiates the interaction by sending a message, setting the tempo and direction of the conversation.
  2. The AI Responds: The AI generates a single, thoughtful response based on the current state of the conversation. This includes every step taken so far — all previous messages and responses.
  3. The Dance Evolves: After each interaction (a message-response pair), the conversation takes a new shape. The AI diligently updates the “dance floor” — the state of the conversation — reflecting the latest moves.
  4. The Cycle Continues: With the updated state in mind, the AI is ready for the next step, gracefully responding to subsequent messages from the user.

That’s the essence of multi-turn conversations: each user message receives its own dedicated response.

By processing messages sequentially, the AI can effectively address each point individually, minimizing any potential for confusion. Furthermore, this method facilitates dynamic interactions and rapid feedback. Indeed, responding to each message individually provides users with immediate feedback, allowing them to quickly gauge the AI’s understanding and adjust their subsequent messages accordingly.
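To make the cycle concrete, here is a minimal Python sketch of the one-message/one-response loop. The generate_reply function is a hypothetical stand-in for whatever model client you actually use, and the history list plays the role of the conversation state.

```python
from typing import Dict, List


def generate_reply(history: List[Dict[str, str]]) -> str:
    # Hypothetical stand-in: a real implementation would send the full
    # conversation state to the model and return its answer.
    return f"(reply to: {history[-1]['content']})"


def chat_loop() -> None:
    history: List[Dict[str, str]] = []  # the "dance floor": full conversation state
    while True:
        user_message = input("You: ")
        history.append({"role": "user", "content": user_message})
        reply = generate_reply(history)  # exactly one response per message
        history.append({"role": "assistant", "content": reply})
        print(f"AI: {reply}")


if __name__ == "__main__":
    chat_loop()
```

Every user message triggers exactly one model call, and the whole history is passed along each time: that is the behavior we will have to rethink when messages start arriving in bursts.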

But what if things become… frenetic?

The problem with rapid-fire messaging

Multiple messages
Multiple responses

By default, agents will respond to each message, but in multi-message exchanges each short message may lack context or depend on previous ones, making it harder for the AI to connect the dots and increasing the chance of misunderstandings.

Here’s an example of a multi-turn conversation with Gemini, asking “Hello, tell me why my pc is not working” but in a WhatsApp-like messaging style.

A conversation with Gemini trying to use the rapid-fire messaging pattern

Again, it’s clear that such tools aren’t inherently designed to handle this conversational style. However, if you’re building a conversational agent for your startup, a client, or even just for personal use, it’s crucial to recognize that some users might interact with your agent in this way.

The challenges extend beyond just context. Scalability and cost become critical factors to consider. Indeed, handling a high volume of short, fragmented messages can place significant demands on system resources. Each message requires individual processing and analysis, potentially leading to increased computational load and higher infrastructure costs.
And here’s the real kicker: if a majority of your users adopt this rapid-fire messaging pattern, you’ll face a significant waste of computational power. Many of these fragmented messages may not convey complete thoughts or intentions, leading to unnecessary processing and analysis.

What we really need is a mechanism to throttle concurrent requests from the same user. So, let’s take a look at how others have tackled this issue.

GenAI leaders overcome the problem by limiting message generation

Let’s take a look at how the big players in the GenAI space are handling this. Currently, they’re limiting users to just one request at a time. They’re doing this using a few different techniques:

Blocking request

When a user sends a message, the UI locks down the “send” button (or whatever mechanism is used for sending messages). It’s like putting a temporary “out of order” sign on the message submission process. This block remains in place until the AI agent sends back a response. Once the response is received, the UI unlocks the button, allowing the user to send another message.

Blocking request mechanism

Stream interruption

This technique involves completely stopping the AI’s response stream when a user sends a new message. Any information that was still being streamed to the user is discarded, and the AI focuses solely on processing the new message.

Streaming interruption mechanism
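In an async backend, this behavior can be approximated by cancelling the in-flight streaming task whenever a new message arrives. The sketch below is a simplified illustration based on asyncio, with a placeholder stream_reply coroutine instead of any specific provider’s streaming API.

```python
import asyncio

# One streaming task per conversation (illustrative, in-memory only).
streams: dict[str, asyncio.Task] = {}


async def stream_reply(conversation_id: str, message: str) -> None:
    # Placeholder for a token-by-token model stream.
    for token in f"echoing: {message}".split():
        print(token, end=" ", flush=True)
        await asyncio.sleep(0.2)
    print()


async def on_new_message(conversation_id: str, message: str) -> None:
    previous = streams.get(conversation_id)
    if previous and not previous.done():
        previous.cancel()  # discard whatever was still being streamed
    streams[conversation_id] = asyncio.create_task(
        stream_reply(conversation_id, message)
    )
```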

Conversation-based request limiting

Once a user sends a message to the API, they cannot send another message within the same conversation until the AI has responded to the first one. This effectively limits the rate of message processing within a conversation to one request at a time, preventing users from sending a barrage of requests. If a user needs to send another message, they have to initiate a new conversation.

Conversation-based request limiting mechanism
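As a rough illustration, a backend could enforce this with a per-conversation in-flight guard like the hypothetical one below; a real deployment would need a shared store instead of process memory, which is exactly where this article is heading.

```python
import threading

# Conversations that currently have a pending request (single instance only).
_in_flight: set[str] = set()
_lock = threading.Lock()


def try_submit(conversation_id: str) -> bool:
    """Return True if the message may be processed, False if the
    conversation already has a request in flight."""
    with _lock:
        if conversation_id in _in_flight:
            return False  # reject: the previous request is still running
        _in_flight.add(conversation_id)
        return True


def mark_done(conversation_id: str) -> None:
    """Release the conversation once the AI response has been sent."""
    with _lock:
        _in_flight.discard(conversation_id)
```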

These methods are highly effective for general-purpose applications like ChatGPT, where users expect immediate but independent responses. However, they may prove less suitable for conversational agents, where maintaining a smooth and uninterrupted flow of interaction is required.

Wait is (sometimes) the answer

Let’s think about how we have real conversations on our favorite chat application. What happens when we receive a message in real life?

Usually, we first check if the sender is still writing other messages. If not, we just send a reply. If we notice that the sender is still writing something, then we usually wait for the new message. And if the sender is taking too long then we just write our reply to the messages we’ve already received.

This process works not only for a single message; it scales to many. What changes is the number of messages we consider before replying, or how long we wait before deciding we have waited too long. Indeed, our brain concatenates (the bonus phase) the messages to get the full context, and that’s exactly what we want to achieve with our agents.

Let’s now take those behaviors (reply, wait for the new message, reply after waiting too long, plus the bonus concatenation step) as the phases we want to replicate in our custom agent, and let’s first address the problem on the frontend side.

How to mimic the conversation phases on our GenAI agent frontend

Now, our agent takes on the role of the receiver and must identify the three (+1) phases mentioned earlier. We can leverage our frontend client to achieve this, as modern technology allows us to capture events on the end-user’s device. Here’s a list of events we can detect for each of the first three phases.

Just send a reply

No action: When the user has not taken any action after sending a message for a period of time, the backend can be triggered for a response.

Out of focus: If the user navigates away from the chat interface or closes the keyboard without sending a message, the backend can be triggered for an immediate response.

Wait for the new message

Start typing: When the user begins typing a message on their device, the backend can wait for a specified number of seconds before responding.

Too long

Typing timeout: If the user starts typing but does not complete their message within a pre-defined time frame, the backend will assume that the user may have been interrupted or faced difficulty in composing their message.
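Putting the three phases together, here is one possible backend sketch: a per-user timer that is rescheduled by the client events above. The event names and the timeout values are assumptions for illustration only.

```python
import asyncio
from typing import Optional

WAIT_AFTER_MESSAGE = 3.0   # assumed "no action" window, in seconds
WAIT_WHILE_TYPING = 10.0   # assumed typing timeout before we answer anyway


class ReplyScheduler:
    """Per-user timer deciding when the backend should answer,
    driven by the client events described above."""

    def __init__(self, respond):
        self.respond = respond  # async callback that triggers the AI response
        self._timer: Optional[asyncio.Task] = None

    def _schedule(self, delay: float) -> None:
        # Any new event replaces the previously scheduled response.
        if self._timer and not self._timer.done():
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire(delay))

    async def _fire(self, delay: float) -> None:
        await asyncio.sleep(delay)
        await self.respond()

    def on_message(self) -> None:         # just send a reply: wait for inactivity
        self._schedule(WAIT_AFTER_MESSAGE)

    def on_typing_started(self) -> None:  # wait for the new message, but not forever
        self._schedule(WAIT_WHILE_TYPING)

    def on_out_of_focus(self) -> None:    # user left the chat: reply immediately
        self._schedule(0.0)
```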

Concatenate (but not here)

We also need a way to combine those individual messages into a cohesive whole (this is where the concept of a message buffer comes in). However, relying solely on client-side concatenation limits the solution’s reliability and scalability.

To ensure robust functionality across various clients and platforms, we must elevate this message aggregation capability to the API level. This means implementing a server-side mechanism that can receive and buffer multiple messages from any client, regardless of its implementation or capabilities.

The message buffer

To handle incoming messages effectively, user-specific buffers can be implemented using a First-In, First-Out (FIFO) queue data structure. These buffers act as transient storage for incoming messages, preserving their order and preventing message loss. However, when accessing these shared resources, it’s important to pay attention to potential concurrency issues. To mitigate race conditions (scenarios where multiple threads, processes, or components attempt concurrent read or write operations on the buffer), the storage solution has to support atomic operations.

When the buffer reaches its capacity or once a specific event is received signaling the end of a message sequence, the application concatenates the accumulated messages. This consolidated message is then sent to the AI module responsible for generating a response.
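For a single application instance, such a buffer can be sketched as a lock-protected FIFO queue; the capacity and the flush behavior below are illustrative choices rather than a prescription.

```python
import threading
from collections import deque


class MessageBuffer:
    """Per-user FIFO buffer for a single application instance.
    A lock guards the queue so concurrent producers and consumers don't race."""

    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self._messages: deque[str] = deque()
        self._lock = threading.Lock()

    def add(self, message: str) -> bool:
        """Append a message; return True when the buffer should be flushed."""
        with self._lock:
            self._messages.append(message)
            return len(self._messages) >= self.capacity

    def flush(self) -> str:
        """Drain the buffer and return the concatenated text."""
        with self._lock:
            combined = "\n".join(self._messages)
            self._messages.clear()
            return combined
```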

Now that we understand the new capability we want to add to our backend system, it’s time to talk about the software architecture.

If you already have a monolithic stateful backend application (I hope not in 2025), accumulating messages in the way described above may involve managing the buffers as shared resources within the same application instance, resulting in a scenario like the following:

Message buffer in a stateful application

It’s time to scale

Unfortunately, when faced with the need to support a growing user base, the simplest and often most immediate way to scale a monolithic application is to vertically scale the underlying infrastructure. This means increasing the resources of the server hosting the application, such as adding more CPU cores, memory, and storage capacity. While this approach can provide a temporary performance boost and accommodate a moderate increase in users, it has inherent limitations, such as the non-linear performance gains obtained from a hardware boost and the increase in costs.

Horizontal scalability is the way to go

Horizontally scaling an application, instead, means running multiple instances of the same software, where each instance manages just a part of the incoming traffic, or, as in our scenario, of the incoming messages, in order to distribute the computational load.

Whether users’ messages are sent via traditional HTTP requests, typically routed through a Layer 7 load balancer, or through a message broker like Kafka or RabbitMQ, the underlying principle of load balancing remains crucial for scalability and resilience.

However, except for specific cases where it is possible to leverage mechanisms like sticky sessions, there is no guarantee that messages coming from the same user will be delivered to the same consumer. This means that the state is potentially spread across different instances and that the traditional techniques for handling concurrency may be useless.

Scaling a stateful application

Let’s make the backend stateless — Redis as a distributed message buffer

To address this challenge and ensure robust scalability, the persistence of messages must be decoupled from the application logic. To maintain low-latency access to the message buffers, the recommendation is to go with a high-performance in-memory data store, especially considering the ephemeral nature of the messages, which are retained only until they are aggregated and processed.

Thanks to its support for different data structures, async notifications, the pub/sub pattern, and the watch mechanism for implementing transactions and distributed locks, and because it is insanely fast, we decided to use Redis as the message buffer; it turned out to be the perfect companion for this specific use case. However, regardless of the tool of choice, the core principle remains the same: ensure atomicity and consistency when accessing shared resources.

Redis distributed message buffer in a stateless application

In this scenario, any instance of the application that receives a user request can interact with the message buffer. The selected instance can add new messages to the buffer or retrieve the entire contents atomically, ensuring that no data is lost or corrupted due to concurrent access.
Once the instance has obtained the complete message sequence from the buffer, it can concatenate the individual messages and forward them to the AI module for processing. This time the AI can generate a response that considers all the different segments and provide a meaningful answer. Bingo!
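Here is a minimal sketch of that interaction using redis-py. The buffer:<user_id> key naming and the TTL are assumptions for illustration: RPUSH appends atomically, and a MULTI/EXEC pipeline lets an instance read and clear the buffer as a single atomic step, so no concurrent instance can sneak in between.

```python
import redis

# Assumed connection settings and key conventions; adapt to your setup.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
BUFFER_TTL_SECONDS = 300  # safety net against abandoned buffers


def buffer_message(user_id: str, message: str) -> int:
    """Append a message to the user's buffer and return the new length."""
    key = f"buffer:{user_id}"
    length = r.rpush(key, message)  # RPUSH is atomic on the Redis side
    r.expire(key, BUFFER_TTL_SECONDS)
    return length


def flush_buffer(user_id: str) -> str:
    """Read and delete the buffer in one MULTI/EXEC transaction,
    then concatenate the fragments into a single prompt."""
    key = f"buffer:{user_id}"
    pipe = r.pipeline(transaction=True)
    pipe.lrange(key, 0, -1)
    pipe.delete(key)
    messages, _ = pipe.execute()
    return "\n".join(messages)
```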

Conclusions

We’ve seen that managing communication with generative AI agents and handling split messages effectively requires a well-designed architecture that can seamlessly scale horizontally. Redis is the perfect companion for this kind of task thanks to its speed and flexibility, as it allows us to decouple the temporary storage layer used to buffer users’ messages from the application logic.

What’s next

In the second part I will walk through two implementation strategies that I have built for global customers, which can be chosen depending on whether you have to stick with synchronous communication or not.

Feel free to reach out if you have any questions or want to share your own experiences with building conversational AI agents.
