A Developer’s Guide to Creating a Multi-Modal Chatbot Using LangChain Agents

Nir Bar
CyberArk Engineering
9 min read · Feb 14, 2024
Auto-generated using DALL·E 3

In the field of Generative AI, agents have become a crucial element of innovation. They empower Large Language Models (LLMs) to reason better and perform complex tasks such as interfacing with external data sources. This includes performing Google searches, calling external APIs, or generating personalized images. In my previous post, I explained how to create a personalized GPT model using OpenAI’s GPTs, capable of generating images and text. This post shows you how to develop this type of solution, giving you complete control over your chosen LLM. You can handle your proprietary data, calls to external APIs, and more. I have created a multi-modal chatbot that utilizes LangChain, ChatGPT, DALL·E 3, and the Streamlit framework for its user interface. Finally, I will also share the open-source repository I have created, allowing you to explore and deploy the chatbot on your own.

The Challenge of Living in the Real World

The challenge involves retrieving information from the real world that falls outside the training scope of Large Language Models (LLMs). This could be anything from executing calls to a proprietary API to supplying the LLM with data (such as files or images) it hasn’t been trained on and then facilitating discussions based on that data. We expect the agent to deconstruct such a task into smaller, more manageable steps, determining the appropriate tools and the sequence in which to use them.

The Role of Agents in Addressing This Challenge

Agents are equipped with various tools, such as invoking external APIs, conducting Google searches, or generating images from specific instructions. These capabilities directly address the challenge we’re facing, offering a comprehensive solution. Before we look at solutions, however, it’s crucial to first understand how agents function in the LangChain framework.

As illustrated by the diagram, the process unfolds in the background: when presented with a user’s task or query, the agent engages the LLM for reasoning, essentially decomposing the task into smaller, intermediate steps. The agent then activates the appropriate tool and forwards its output to the LLM for further analysis. This reasoning cycle continues until the task is fully resolved and a final answer is delivered to the user.
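
To make the cycle concrete, here is a minimal, illustrative sketch of that loop. It is not LangChain’s actual implementation: Decision, decide, and run_agent are hypothetical stand-ins for the LLM reasoning step and the orchestration logic.

from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Decision:
    answer: Optional[str] = None     # set once the LLM reaches a final answer
    tool_name: Optional[str] = None  # otherwise: the tool to invoke next
    tool_input: str = ""


def run_agent(task: str,
              decide: Callable[[str, list], Decision],
              tools: Dict[str, Callable[[str], str]]) -> str:
    steps: List[Tuple[Decision, str]] = []
    while True:
        decision = decide(task, steps)  # LLM reasoning over the task + history
        if decision.answer is not None:
            return decision.answer      # solution delivered to the user
        observation = tools[decision.tool_name](decision.tool_input)
        steps.append((decision, observation))  # feed the observation back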

How Three Main Tools Offer a Possible Solution

The multi-modal chatbot I crafted is backed by an agent that uses three tools:

  1. REST Countries API Chain: Retrieves information about countries by invoking this public API
  2. DALL·E 3 Image Generator: Generates an image of a country based on the country name
  3. Google Search Tool: Fetches up-to-date information from the web

I developed the entire chatbot in Python; here is a code snippet of the agent creation:

# Imports assume the langchain 0.1.x package layout
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.memory import ConversationBufferWindowMemory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_core.utils.function_calling import convert_to_openai_function
from langchain_openai import ChatOpenAI


def create_agent():
    tools = [countries_image_generator, get_countries_by_name, google_search]

    # Expose the tools to the model as OpenAI function definitions
    functions = [convert_to_openai_function(f) for f in tools]
    model = ChatOpenAI(model_name="gpt-3.5-turbo-0125").bind(functions=functions)

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are helpful but sassy assistant"),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])

    # Keep the last 5 exchanges of the conversation as context
    memory = ConversationBufferWindowMemory(return_messages=True, memory_key="chat_history", k=5)

    # LCEL pipeline: format intermediate steps -> prompt -> model -> parse function calls
    chain = RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_to_openai_functions(x["intermediate_steps"])
    ) | prompt | model | OpenAIFunctionsAgentOutputParser()

    agent_executor = AgentExecutor(agent=chain, tools=tools, memory=memory, verbose=True)
    return agent_executor

The LangChain framework offers a comprehensive solution for agents, seamlessly integrating components such as prompt templates, memory management, the LLM, output parsing, and the orchestration of these elements within an agent executor.
The create_agent() function is at the heart of this approach: it instantiates and configures a ChatGPT-based agent with specific functionalities, integrating external tools and a custom processing pipeline for handling user inputs and generating responses.

Here’s a breakdown of its components:

  • Tools Integration: Defines a list of tools, which include countries_image_generator, get_countries_by_name, and google_search. These tools are then converted into OpenAI functions, allowing them to be called from within the model’s processing pipeline.
  • Model Configuration: The function sets up a ChatGPT model, specifically “gpt-3.5-turbo-0125”. This model is enhanced by binding it with the previously converted OpenAI functions, enabling the model to leverage these external tools during its operations.
  • Prompt Template: Creates a ChatPromptTemplate using a series of predefined messages, including a mix of system-defined roles and placeholders for dynamic content such as the user’s input, and a scratchpad for intermediate steps. This template guides the conversation flow and structure.
  • Memory Management: Uses ConversationBufferWindowMemory to manage conversation history, storing the last 5 messages (controlled by the parameter k) for context. This memory is indexed by “chat_history” and is configured to return messages for use in generating responses.
  • Processing Pipeline: Defines a processing chain that starts with the RunnablePassthrough for handling intermediate steps, then passes the context through the prepared prompt template, the ChatGPT model itself, and finally through an OpenAIFunctionsAgentOutputParser. This pipeline orchestrates the flow of data through the agent, integrating the model’s output with the function calls and parsing the results. It uses LangChain Expression Language, or LCEL, which is a declarative way to compose chains together easily.
  • Agent Executor: Finally, the function creates an AgentExecutor, which encapsulates the entire agent along with its tools, memory management, and the defined processing pipeline. This executor runs the agent, handles inputs, and generates outputs based on the configuration.
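
With the executor in place, running the agent is a single call. A minimal usage sketch (the question is illustrative, and OPENAI_API_KEY must be set in the environment):

agent = create_agent()
result = agent.invoke({"input": "Generate an image of Japan"})
print(result["output"])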

Creating a Custom Tool in LangChain

Using the @tool decorator is the simplest way to define a custom tool in the LangChain framework. By default, the decorator uses the function name as the tool name, which can be overridden by passing a string as the first argument. The decorator also uses the function’s docstring as the tool’s description, so a docstring MUST be provided.
You can also customize the tool name and JSON args by passing them into the tool decorator (see the second tool below, get_countries_by_name).

The agent passes each tool’s description to the LLM as context when deciding which tool to use, so it is crucial to write an accurate, well-targeted description.
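
As a toy example (not part of the chatbot), a minimal custom tool could look like this:

from langchain.agents import tool


@tool
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())


print(word_count.name)         # "word_count", the function name by default
print(word_count.description)  # derived from the docstring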

Let me share my insights on each tool:

1) Countries Image Generator tool

from langchain.agents import tool
from langchain_community.utilities.dalle_image_generator import DallEAPIWrapper


@tool
def countries_image_generator(country: str):
    """Call this to get an image of a country"""
    res = DallEAPIWrapper(model="dall-e-3").run(
        f"You generate an image of a country representing the most typical "
        f"characteristics of the country, incorporating its flag. The country is {country}"
    )

    # Return the URL in a fixed format so the response is recognizable as an image
    answer_to_agent = (f"Use this format- Here is an image of {country}: [{country} Image]"
                       f"url= {res}")
    return answer_to_agent

I used the DallEAPIWrapper to call the DALL·E 3 model and gave it specific instructions on how I wanted the country images to look (e.g., representing the country’s most typical characteristics and incorporating its flag). I also added output instructions so that the response returned to the agent follows a defined format containing the country name and the generated image URL. This is crucial for recognizing that the agent’s response contains an image, not just text.
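
For instance, invoking the tool directly (outside the agent, assuming OPENAI_API_KEY is set) should return a string in that defined format:

print(countries_image_generator.invoke({"country": "Japan"}))
# e.g. "Use this format- Here is an image of Japan: [Japan Image]url= https://..."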

2) Get Countries By Name tool

from typing import Optional

import requests
from requests.models import PreparedRequest
from langchain.agents import tool
from langchain_core.pydantic_v1 import BaseModel, Field, conlist  # pydantic v1-style API


def prepare_and_log_request(base_url: str, params: Optional[dict] = None) -> PreparedRequest:
    """Prepare the request and log the full URL."""
    req = PreparedRequest()
    req.prepare_url(base_url, params)
    print(f'\033[92mCalling API: {req.url}\033[0m')  # print the final URL in green
    return req


class Params(BaseModel):
    fields: Optional[conlist(str, min_items=1, max_items=27)] = Field(
        default=None,
        description='Fields to filter the output of the request.',
        examples=["name", "topLevelDomain", "alpha2Code", "alpha3Code", "currencies", "capital", "callingCodes", "altSpellings", "region", "subregion", "population", "latlng", "demonym", "area", "gini", "timezones", "borders", "nativeName", "numericCode", "languages", "flag", "regionalBlocs", "cioc"]
    )


class PathParams(BaseModel):
    name: str = Field(..., description='Name of the country')


class RequestModel(BaseModel):
    params: Optional[Params] = None
    path_params: PathParams


@tool(args_schema=RequestModel)
def get_countries_by_name(path_params: PathParams, params: Optional[Params] = None):
    """Useful for when you need to answer questions about countries. Input should be a fully formed question."""
    BASE_URL = f'https://restcountries.com/v3.1/name/{path_params.name}'

    effective_params = {"fields": ",".join(params.fields)} if params and params.fields else None

    req = prepare_and_log_request(BASE_URL, effective_params)

    # Make the request
    response = requests.get(req.url)

    # Raise an exception if the request was unsuccessful
    response.raise_for_status()

    return response.json()

Building this particular tool presented a unique challenge. The first step involved designing a model for the parameters of the REST API that fetches country information by name. To achieve this, I developed Pydantic models encapsulating the path and query parameters. I then passed this model to the LangChain tool decorator via args_schema.

The value of this approach lies in submitting the complete argument model to the Large Language Model (LLM), including detailed descriptions of each argument, rather than just a brief overview of the tool itself. This significantly enhances the LLM’s ability to determine the correct arguments for a function based on user prompts, ensuring a more accurate and efficient interaction.
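
You can inspect exactly what the LLM receives by converting the tool into an OpenAI function definition; a quick sketch:

import json
from langchain_core.utils.function_calling import convert_to_openai_function

# Prints the tool's name, its description, and the full JSON schema of
# RequestModel, including the per-field descriptions defined above.
print(json.dumps(convert_to_openai_function(get_countries_by_name), indent=2))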

3) Google Search tool

from langchain.agents import tool
from langchain_community.utilities import SerpAPIWrapper


@tool
def google_search(query: str):
    """Performs a Google search using the provided query string. Choose this tool when you need to find current data"""
    return SerpAPIWrapper().run(query)

To conduct Google searches through an API, I use the built-in SerpAPIWrapper. Note that an API key is required for its operation; you can learn how to obtain one in the README of my GitHub repository.
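
The wrapper reads the key from the SERPAPI_API_KEY environment variable, so it must be set before the tool runs, for example:

import os

os.environ["SERPAPI_API_KEY"] = "<your-serpapi-key>"  # placeholder, use your own key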

The Multi-Modal Chatbot Architecture

The diagram illustrates the structure of the multi-modal chatbot system:

  1. Prompt Refinement. The initial user prompt and context of the conversation history are forwarded to the LLM (in this scenario, ChatGPT) to refine the prompt into a more precise query.
  2. Thought Process. The agent relays the refined prompt, alongside any optional tools, to the LLM for reasoning. Based on this, it decides which tool to employ. If the final answer is determined at this stage, it is directly communicated to the user.
  3. Tool Invocation. The agent executes the chosen tool.
  4. Observation. The output generated by the tool is sent back to the LLM by the agent for further reasoning.

Running the Multi-Modal ChatBot

The first question I asked was to create an image of Holland.

It’s amusing how it depicted the bicycle floating on water, right? We also received an image of the Netherlands, capturing its essence and flag.

Let’s take a peek behind the curtain:

The agent successfully called the countries_image_generator tool, passing the necessary argument: the country’s name. Following that, highlighted in blue, we observe the function’s output: the return value from countries_image_generator, showcasing the URL of the image crafted by DALL·E 3. Finally, in green, we see the agent’s concluding action.

Afterwards, I asked: “How many tourists visited Athens last year?” The response I received was that 6.4 million tourists did. The Google search tool was utilized for this query due to the need for up-to-date information. By conducting a Google search with the query “number of tourists visited Athens last year,” we obtained the answer.

I followed up with more inquiries:

  • What are the regions and sub-regions of Brazil?
  • What is the currency and capital there?

For both questions, the get_countries_by_name tool was employed with the appropriate parameters. For instance, for the second question, the tool was activated using get_countries_by_name with the parameters {'path_params': {'name': 'Brazil'}, 'params': {'fields': ['currencies', 'capital']}}.

In addition, this multi-modal chatbot can recall previous interactions, as demonstrated by its capacity to deduce the correct country for the subsequent question regarding currency and capital.
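
In code, that exchange looks something like this (a sketch; the window memory supplies the missing country on the follow-up):

agent = create_agent()
agent.invoke({"input": "What are the regions and sub-regions of Brazil?"})
# "there" is resolved from the chat_history kept by ConversationBufferWindowMemory
agent.invoke({"input": "What is the currency and capital there?"})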

Finally, I asked it to create an image of Greece:

Utilizing Agents for Real-World Problem Solving

You should now have a better understanding of the crucial role agents play in developing robust applications, and specifically of how a language model can function as a cognitive engine: it intelligently decides on the sequence of actions to execute, facilitating the integration of the Large Language Model (LLM) with various external resources, including APIs, proprietary datasets, and internet searches. We also saw the importance of the LangChain framework in constructing a versatile chatbot, one that is not only capable of calling external APIs and browsing the internet but can also create images, showcasing its multi-modal capabilities. (You can learn more in the GitHub repository.)

Here’s the basic takeaway: select the appropriate tools and a compatible LLM that supports function calling to tailor your application to your specific needs.


Nir Bar is a Senior Software Engineer at CyberArk Engineering who loves using technology to solve complex problems and is deeply passionate about the GenAI revolution.