Gemini has entered the chat: Crowd Conversations Made Easy

Maciej Strzelczyk
Google Cloud - Community

In my two previous posts, I showed you how to get started with coding a very simple Discord bot that uses Gemini to reply to user messages, and explained the configuration options available for it. In this post, I'll expand beyond that initial "Hello world" chatbot to demonstrate how an LLM can participate in a conversation with multiple users at the same time! To achieve this, we need to satisfy two requirements: the bot needs to be aware of the chat history, and it needs to distinguish between the different users taking part in the conversation. We will achieve both by constructing a data structure that stores the history of the conversation in a particular way.

Understanding the Content and Part classes

In earlier posts, we called the model.generate_content_async() method with a string parameter containing the message the bot should reply to. Because the model object is stateless, there was no history of previous queries and replies. However, it is possible to maintain state with Gemini by supplying the conversation history with each request. How does this work in practice?

The generate_content_async() method accepts more than just string values as its main parameter. If we examine the method's documentation, we can see that it also accepts collections of Content, Image and Part objects. These are the building blocks of a proper conversation history that Gemini is able to understand.

Part

The Part class represents, as the name suggests, a part of the conversation history. Since Gemini is a multimodal model, parts can contain text, images, audio or video. A good analogy for a part in a Discord chat history is a single message sent by a user. The Part class has the class methods from_text(), from_uri() and from_image(), which make it easy to create Part objects that can then be used to build Content objects.

Content

The Content class, as can be seen in its documentation, is made of two elements: a sequence of parts and a role attribute. Using objects of this class, we can attribute parts of the conversation to the Gemini model or to the human user interacting with it. The role attribute can take only one of two values: "model" or "user". By default, Gemini is made to have a conversation with a single user and requires the sequence of Content objects to alternate between the model and user roles.
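As a minimal sketch of this structure, here is a two-turn history. Note that plain dataclasses stand in for the SDK classes here; in real code, Content and Part would be imported from vertexai.generative_models:

```python
from dataclasses import dataclass, field

# Stand-ins for the SDK classes, used only for illustration.
@dataclass
class Part:
    text: str

@dataclass
class Content:
    role: str                 # either "user" or "model"
    parts: list = field(default_factory=list)

history = [
    Content(role="user", parts=[Part("Hi!"), Part("What is a token?")]),
    Content(role="model", parts=[Part("A token is a small unit of text.")]),
]

# Gemini requires roles to alternate: no two consecutive
# "user" or "model" Content objects.
assert all(a.role != b.role for a, b in zip(history, history[1:]))
```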

Keeping history

Equipped with the knowledge of Content and Part, we can now attempt to create and retain our chat history. The goal is to keep track of all the messages that were sent and deliver them to the Gemini model every time there is a need for it to generate a response. This way, the model will be aware of the context in which questions are being asked of it.

Image illustrating the grouping of Part objects into a series of Content objects alternating between the user and model roles.
The chat history structure

In the drawing above, I have demonstrated how we can construct a sequence of Content objects that will contain the chat history. Gemini requires the Content objects to alternate, meaning we can't have two user or two model Content objects next to each other. This disqualifies a simple "one message = one Content object" approach. Luckily, we can simply pack multiple messages from human users into a single Content object.

Since the bot will reply only when it is mentioned in a message, we can observe the following: as users write regular messages, we append them to the last user Content object in our chat history. When our bot is mentioned, we append the message to the user Content and generate a new Content object with the model role. The next message has to come from a human user, which creates a new user Content object. This way we guarantee that the Content objects alternate between user and model.

Let’s draw this out, to make sure everything is clear:

A flow chart describing the algorithm of appending to chat history.
Chat history flowchart
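The flowchart above can be sketched in Python. This is an illustrative sketch rather than the actual bot implementation, again with plain classes standing in for the SDK's Content and Part:

```python
from dataclasses import dataclass, field

# Stand-ins for vertexai.generative_models.Part and Content.
@dataclass
class Part:
    text: str

@dataclass
class Content:
    role: str
    parts: list = field(default_factory=list)

def record_user_message(history, text):
    """Append a user message to the last user Content, or open a new one."""
    if history and history[-1].role == "user":
        history[-1].parts.append(Part(text))
    else:
        history.append(Content(role="user", parts=[Part(text)]))

def record_model_reply(history, reply_text):
    """Store the reply generated after the bot was mentioned."""
    history.append(Content(role="model", parts=[Part(reply_text)]))

history = []
record_user_message(history, "Nice weather today!")
record_user_message(history, "@GeminiBot what do you think?")
record_model_reply(history, "I agree, it sounds lovely! ☀️")
record_user_message(history, "Told you!")

# The roles alternate: [user, model, user], and the first user
# Content holds both of the initial human messages.
assert [c.role for c in history] == ["user", "model", "user"]
assert len(history[0].parts) == 2
```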

Telling users apart

At this point you might be wondering — if the Content objects can only distinguish between model and user roles, how do we tell Gemini that there are multiple users sending messages in the chat?

If we use JSON to format every user message inside the user Content objects, we can tell Gemini who the author of each message is. If user "Maciek" writes the message "I like cats", we can encode it as:

{"author": "Maciek", "message": "I like cats"}

The replies don’t need to be encoded as there’s no doubt about who the author is. To give a full example in pseudocode, this is what a chat history looks like:

[
    Content(
        role="user",
        parts=[
            Part('{"author": "Ola", "message": "Hi! I\'m Ola and I like dogs!"}'),
            Part('{"author": "Maciek", "message": "Hello, I\'m Maciek and I like cats :)"}'),
            Part('{"author": "Maciek", "message": "Hey @GeminiBot, can you tell us something about the animals we like?"}')
        ]
    ),
    Content(
        role="model",
        parts=[
            Part("Hello Maciek and Ola! 👋

Of course, I can tell you something about dogs and cats!

**Dogs** are known for their loyalty, playfulness, and intelligence. They are often called \"man's best friend\" because of their strong bond with humans. Dogs come in a wide variety of breeds, each with its own unique characteristics. 🐶

**Cats**, on the other hand, are independent and graceful creatures. They are known for their soft fur and their ability to purr. Cats are often seen as mysterious and aloof, but they can also be very affectionate companions. 🐱

I hope that was interesting! If you have any other questions about animals, feel free to ask!")
        ]
    ),
    Content(
        role="user",
        parts=[
            Part('{"author": "Maciek", "message": "@GeminiBot do you think you\'d like my favourite animal?"}')
        ]
    ),
    Content(
        role="model",
        parts=[
            Part("That's a great question! Since I'm a bot, I don't have personal preferences like liking or disliking things. However, I can access and process information about all kinds of animals, including cats! 😻

I'm sure your cat is very special and I'm happy to learn more about them if you'd like to tell me! I'm always ready to expand my knowledge of the animal kingdom.")
        ]
    )
]

Other considerations

Now that we have a way to store the history of the Discord channel we want the bot to interact with, there are other issues we need to consider:

Preloading the history

Discord chats live independently of our bot. They can be active when our bot is not listening, and they also have a permanent message history. It would only be natural for our bot to see the conversation history just as human users do, not only the parts that it witnessed. That's why it's important to give the bot the ability to read the channel history, so it can fill its memory with the messages that were sent before it was started.
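With discord.py, past messages can be fetched oldest-first via channel.history(). The sketch below assumes the messages have already been fetched into a list of (author, text, is_bot) tuples (a hypothetical shape chosen for illustration) and replays them into the alternating history structure, with plain classes standing in for the SDK's Content and Part:

```python
import json
from dataclasses import dataclass, field

# Stand-ins for vertexai.generative_models.Part and Content.
@dataclass
class Part:
    text: str

@dataclass
class Content:
    role: str
    parts: list = field(default_factory=list)

def preload_history(fetched_messages):
    """Replay previously fetched channel messages, oldest first,
    into an alternating user/model chat history."""
    history = []
    for author, text, is_bot in fetched_messages:
        if is_bot:
            # The bot's own past replies become "model" turns.
            history.append(Content(role="model", parts=[Part(text)]))
        else:
            part = Part(json.dumps({"author": author, "message": text}))
            if history and history[-1].role == "user":
                history[-1].parts.append(part)
            else:
                history.append(Content(role="user", parts=[part]))
    return history

history = preload_history([
    ("Ola", "Hi!", False),
    ("Maciek", "Hello @GeminiBot!", False),
    ("GeminiBot", "Hello Maciek! 👋", True),
])
assert [c.role for c in history] == ["user", "model"]
```

If the channel could contain two bot replies in a row, consecutive model messages would need to be merged into one Content object the same way user messages are.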

Limited prompt size

At the time of writing this post, Gemini Flash is a model with one of the biggest context windows available (over 1 million tokens). That's around 4 million characters, if we consider only text messages. While it might seem like a lot, it isn't much compared with the history of Discord channels on active servers. Our bot needs to make sure that the message history it keeps in its memory won't exceed the context window available to the Gemini Flash model.

Designing proper system instruction

As we learned in the previous part of this blog series, getting Gemini to construct an answer involves two kinds of prompts: the content we send to generate_content_async() and the system_instruction parameter we provide when creating the model object. We know that we will be sending the chat history as the content for the model to generate responses from, but what do we use as the system instruction? System instructions give the model additional context. For our multi-user chatbot, we can describe the bot's role, provide additional contextual information, and give formatting instructions. We want to instruct the bot to answer the messages it receives, while also disclosing that it is a Discord bot called GeminiBot. Another important piece of information to include is that the content we send contains a JSON-encoded chat history, which helps the model respond to the conversation more naturally.

Here’s an example of the prompt I used during my tests as system instruction:

You are a Discord bot named GeminiBot.
Your task is to provide useful information to users interacting with you.
You should be positive, cheerful and polite.
You can use the default Discord emojis.
You are provided with chat history in JSON format, but your answers should be regular text.
Always reply to the last message in the chat history.

That sounds like a lot of code…

Implementing all the logic described above could take a while, but luckily, I've helped you get started! 🙂 In the discord-bot repository on GitHub, I've prepared a simple implementation of a Discord bot that will interact with your server guests while understanding that there are multiple users present in a chat. It will also load the channel history when first writing in a channel and make sure that the history it keeps in its memory won't exceed the context window of the Gemini model.

Please let me know if you have tried the bot and what features (preferably AI related) you’d like to see added next!
