Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

When Users Message Faster Than AI Can Think — Part 2


TL;DR: Users who send several messages in quick succession are something to account for when building human-like conversations with a GenAI agent. In part 1 I described the multi-message problem in detail and showed how to solve it with a distributed message buffer based on Redis. In this second part we look at two patterns that can be implemented, synchronous aggregation and asynchronous aggregation, depending on whether you have control over the frontend of your application.

Implementing the solution: Synchronous aggregation

Let’s dive deeper into the scenario where your frontend utilizes synchronous HTTP WebHooks to communicate with the custom backend. In this setup, the AI response must be returned as an HTTP response to the frontend to display the message in the UI.

Synchronous communication between the frontend and the backend

This common configuration, however, introduces several challenges for implementing message aggregation:

Wait Mechanism: Within the timeframe of a single HTTP request/response cycle, the application must wait for additional messages before generating the AI response. The duration of this waiting period is hard to predict, potentially leading to HTTP timeouts.

New Message Detection: A mechanism is needed to detect, within a single HTTP request/response cycle, whether new messages have arrived from the same user in the same session but in another HTTP request.

Empty Responses: If new messages are detected within an existing (old) HTTP request/response timeframe, the application must send an empty response to the frontend. This response needs to be ignored by the chatbot frontend to avoid displaying irrelevant information.

Concatenation: Once a timeout expires, the application must gather all messages sent by the same user within that timeframe and concatenate them into a single, coherent message before forwarding it to the AI for processing. This ensures the AI receives the complete context of the conversation.

The solution I came up with is shown below:

Synchronous aggregation sequence diagram

In this diagram, the frontend acts as a messenger, delivering user input to the backend instances via HTTP requests sent by the HTTP WebHook. To gather a complete message sequence, a backend instance must wait up to a predefined timeout (the too long phase, see part 1) for new messages. If a new message arrives, the backend must send a “no-action” or “conflict” response to the current (old) HTTP request; if not, it gets the full message buffer for that user, concatenates the messages, and sends the resulting payload to the AI module (the just reply phase).

How can the application identify a new message from the current HTTP request?
The application instance receiving the request can periodically check the ID of the last message stored in the message buffer for that user (Wait for the new message phase), until the timeout is reached. If this ID differs from the message ID saved for the current HTTP request, it indicates that a new message has arrived.
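As a sketch, a webhook handler built this way could look like the following. It assumes redis-py, a Redis list per session, and illustrative names and timings (the buffer:/last_id: key scheme, the 5-second window, and the polling interval are my assumptions, not part of the original design):

```python
import time

AGGREGATION_WINDOW_S = 5  # the "too long" timeout from part 1 (illustrative)
POLL_INTERVAL_S = 0.5     # how often to re-check for newer messages (illustrative)

def handle_webhook(r, session_id, message_id, text,
                   window_s=AGGREGATION_WINDOW_S, poll_s=POLL_INTERVAL_S):
    """Handle one synchronous webhook request.

    `r` is a redis-py client (e.g. redis.Redis(decode_responses=True)).
    Returns the concatenated payload for the AI module, or None when a
    newer message arrived and this request must answer "no-action".
    """
    buffer_key = f"buffer:{session_id}"
    last_id_key = f"last_id:{session_id}"

    # Store this message and mark it as the latest one seen for the session.
    r.rpush(buffer_key, text)
    r.set(last_id_key, message_id)

    # "Wait for the new message" phase: poll until the timeout expires.
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        time.sleep(poll_s)
        if r.get(last_id_key) != message_id:
            # A newer request owns the buffer now; this one replies "no-action".
            return None

    # Timeout reached with no newer message: drain and concatenate the buffer.
    messages = r.lrange(buffer_key, 0, -1)
    r.delete(buffer_key)
    return "\n".join(messages)
```

Note that the whole loop has to finish inside the WebHook's HTTP timeout, which is exactly the limitation discussed below.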

This solution, although successfully implemented several times, has a significant drawback: it relies heavily on time-based mechanisms, which can negatively affect the user experience.

Moreover, this reliance on time introduces the following issues:

  • Unnecessary Delays: The waiting period for message aggregation can cause noticeable delays in the user interface, especially when users are not using a rapid-fire messaging pattern.
  • HTTP Timeout Limitations: The entire communication process must be completed within the WebHook’s HTTP timeout window, which is often non-configurable.

Implementing the solution: Asynchronous aggregation

Things become truly interesting when we have control over the frontend, or when it’s possible to leverage asynchronous WebHooks and client APIs to visualize AI responses. In this scenario we can leverage two powerful features of Redis: Expiring Keys and Pub/Sub notifications.

Expiring Keys

Redis’s ability to set expiration times on keys offers an elegant solution for managing timeouts. By assigning an expiration time to special keys associated with each user’s last activity, we can automatically detect when a user has been inactive for a certain amount of time. These expiration events can then be captured by leveraging Redis Pub/Sub functionality.
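A minimal sketch of the producer side, assuming redis-py and illustrative buffer:/activity: key names of my choosing: each incoming message is appended to the session buffer, and a companion activity key is (re)created with the inactivity TTL.

```python
def record_message(r, session_id, text, inactivity_ttl_s=5):
    """Buffer one incoming message and reset the session's activity timer.

    `r` is a redis-py client. The buffer list itself never expires here;
    only the companion activity key carries the countdown. Every new
    message re-creates that key with a fresh TTL, so its "expired" event
    fires only after `inactivity_ttl_s` seconds of silence.
    """
    r.rpush(f"buffer:{session_id}", text)
    # SET ... EX (re)creates the timer key; a new message resets the countdown.
    r.set(f"activity:{session_id}", "1", ex=inactivity_ttl_s)
```

For the expiration events to be published at all, the Redis server must have keyspace notifications enabled, e.g. `notify-keyspace-events` set to at least `Ex`.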

Pub/Sub Notifications

Redis’s Pub/Sub enables real-time communication between different parts of the system. In this context, it can be used to notify the AI module that it is time to concatenate the messages within the buffer. This eliminates the need for continuous polling and “no-action” messages, reducing latency and improving overall responsiveness.
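On the consumer side, a sketch of the listener could look like this. It assumes `notify-keyspace-events` includes `Ex`, a redis-py client created with `decode_responses=True`, and the same illustrative activity:/buffer: key naming (all of these are my assumptions):

```python
def handle_expired_key(r, key, on_group_ready):
    """Flush one completed message group when its activity key expires.

    Returns True when a non-empty buffer was drained and handed to
    `on_group_ready(session_id, concatenated_text)`.
    """
    if not key.startswith("activity:"):
        return False  # some unrelated key expired
    session_id = key.removeprefix("activity:")
    buffer_key = f"buffer:{session_id}"
    messages = r.lrange(buffer_key, 0, -1)
    r.delete(buffer_key)
    if not messages:
        return False
    on_group_ready(session_id, "\n".join(messages))
    return True

def listen_for_completed_groups(r, on_group_ready):
    """Blocking loop: turn Redis "expired" events into AI-processing calls."""
    pubsub = r.pubsub()
    # Expired keys are announced on the __keyevent@<db>__:expired channels.
    pubsub.psubscribe("__keyevent@*__:expired")
    for event in pubsub.listen():
        if event["type"] == "pmessage":
            handle_expired_key(r, event["data"], on_group_ready)
```

`on_group_ready` would hand the concatenated text to the AI module; no polling and no "no-action" responses are involved.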

Overall, the sequence diagram looks like the following:

Asynchronous aggregation sequence diagram

Let’s break down this elegant solution:

TTL-Based Message Grouping

Imagine each user’s conversation as a series of messages grouped together. Each message group is assigned a Time-To-Live (TTL), essentially a countdown timer. Every time a new message arrives from the user, the TTL for that group is reset, signifying continued activity.

If the TTL expires, meaning no new messages have arrived within the allotted time, Redis automatically triggers an event. This event serves as a signal to the AI module, indicating that the message group is complete and ready for processing. The AI module then retrieves all messages within that group, concatenates them into a single coherent message, and forwards it to the AI model for response generation.

Instantaneous Response with just reply State

Here’s where things get even more interesting. By integrating with the frontend, we can detect specific user actions like closing the keyboard or exiting the application. This allows us to instantly trigger the AI processing, even if the TTL hasn’t expired. When a user performs a “just reply” action, we can immediately set the TTL for their message group to 0, forcing an immediate expiration event. This ensures that the AI receives and processes the message without any delay, providing a seamless and responsive experience.
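One caveat when sketching this in code: per the Redis documentation, EXPIRE with a non-positive TTL deletes the key and generates a "del" event rather than an "expired" one, so a tiny positive TTL keeps the flush on the normal expiration path (the activity: key name is again my assumption):

```python
def force_flush(r, session_id):
    """Force the "just reply" path by expiring the activity key almost now.

    A TTL of 0 would delete the key with a "del" event instead of
    "expired"; a 1 ms TTL still goes through the expiration machinery,
    so the same expired-event listener picks the group up immediately.
    """
    r.pexpire(f"activity:{session_id}", 1)
```

The frontend would call this endpoint on "just reply" actions such as closing the keyboard or leaving the application.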

Conclusions

The choice of aggregation pattern depends on how incoming messages reach our system. The asynchronous approach offers excellent scalability and a seamless user experience. However, not all frontends support this method. In such cases, we can configure synchronous aggregation to achieve the same result.

What’s next

In the last episode of this series I will write about how the two patterns have been implemented in Google Cloud Platform, our cloud provider of choice, leveraging Memorystore, the fully-managed Redis service of Google Cloud, to deliver a production-ready solution.


Written by Danilo Trombino

Google Cloud Architect @ Datwave. Specialized in data-driven solutions for global partners. Love for music and HiFi audio.
