Building a Langchain Agent with a Self-Hosted Mistral 7B: A Step-by-Step Guide

Jorge Pardo Serrano
20 min read · Apr 7, 2024


DALL-E generated image representing an agent looking for information in Wikipedia

Utilizing agents powered by large language models (LLMs) has become increasingly popular. Yet, the choice between using public APIs, like OpenAI’s, and self-hosting models such as Mistral 7B, carries significant implications. Public APIs can incur high costs due to the frequent iterations agents require, while self-hosted models offer a cost-effective solution with enhanced data privacy. This makes self-hosting an attractive option for those dealing with sensitive information or seeking to reduce operational expenses.

The notebook for this tutorial is available online (Colab).

(Update 2024/04/09: Improved custom prompting)

Introduction

Deploying agents with Langchain is a straightforward process, though it is primarily optimized for integration with OpenAI’s API. The pathway for utilizing alternative models is not as clearly documented. The aim of this article is to provide a concise guide on how to navigate this process. This article assumes that the reader has at least a basic understanding of agents.

Langchain is a framework designed to simplify the integration and execution of pipelines involving language models. At its core, Langchain aims to streamline the development process for applications that harness the power of natural language processing (NLP) and machine learning. A key feature of this framework is its support for the creation of agents: Autonomous entities capable of performing a wide range of tasks. With Langchain, the complexity of managing interactions between different components of a language model pipeline is significantly reduced.

Mistral is a French AI startup that develops large language models (LLMs). The company was founded in 2023 by researchers who previously worked at Google DeepMind and Meta AI.

Mistral 7B is a 7-billion parameter LLM that was released by Mistral in 2023. It’s especially powerful for its modest size, and one of its key features is that it is a multilingual model. This makes it an essential foundation for any project that includes text in languages other than English.

First Steps

The first step in setting up our project is installing the packages needed to import the Langchain modules and use the Mistral 7B model.

Thankfully, Hugging Face makes accessing and deploying the Mistral 7B model a hassle-free experience.

!pip install langchain
!pip install accelerate
!pip install bitsandbytes

If you are working in a Colab environment, note that a notebook restart might be necessary after installing the packages.

To minimize the memory required to run the model, we will employ 4-bit quantization. This technique reduces the precision of the model’s parameters, significantly decreasing the amount of memory needed without substantially compromising the model’s performance.

import warnings
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers.models.mistral.modeling_mistral import MistralForCausalLM
from transformers.models.llama.tokenization_llama_fast import LlamaTokenizerFast

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, quantization_config=quantization_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

Custom LLM Class

To create our agent, we need to assign it a Large Language Model (LLM) in a format that Langchain can understand. Therefore, we’ll need to develop a custom class that inherits from Langchain’s base LLM class. This custom class will act as a bridge, enabling Langchain to interact with our chosen model.

While instantiating the LLM via the HuggingFacePipeline class is simpler, taking the Custom LLM approach allows for greater control. We can easily add logging to track what’s going into and coming out of each model execution, which is invaluable if we encounter any issues.
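For comparison, this is roughly what the simpler route looks like. It is just a sketch, assuming the HuggingFacePipeline wrapper from langchain_community and a standard transformers text-generation pipeline; we won't use it in the rest of the tutorial:

from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

# Wrap the model and tokenizer we already loaded in a standard text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)

# Hand the pipeline to Langchain; simpler, but with less control over inputs and outputs
hf_llm = HuggingFacePipeline(pipeline=pipe)

Back to the custom approach: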

from langchain.llms.base import LLM
from langchain.callbacks.manager import CallbackManagerForLLMRun
from typing import Optional, List, Mapping, Any

class CustomLLMMistral(LLM):
    model: MistralForCausalLM
    tokenizer: LlamaTokenizerFast

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None) -> str:

        messages = [
            {"role": "user", "content": prompt},
        ]

        encodeds = self.tokenizer.apply_chat_template(messages, return_tensors="pt")
        model_inputs = encodeds.to(self.model.device)

        generated_ids = self.model.generate(model_inputs, max_new_tokens=512, do_sample=True, pad_token_id=self.tokenizer.eos_token_id, top_k=4, temperature=0.7)
        decoded = self.tokenizer.batch_decode(generated_ids)

        # Keep only the text generated after the prompt
        output = decoded[0].split("[/INST]")[1].replace("</s>", "").strip()

        # Remove any stop word and everything after it
        if stop is not None:
            for word in stop:
                output = output.split(word)[0].strip()

        # Mistral 7B sometimes fails to close the markdown snippet, so we close it ourselves
        while not output.endswith("```"):
            output += "`"

        return output

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"model": self.model}

llm = CustomLLMMistral(model=model, tokenizer=tokenizer)

You can find more information about the custom LLM Class here.
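Before wiring the wrapper into an agent, we can give it a quick sanity check. A minimal example (the exact wording of the answer will vary between runs):

# Call the custom LLM directly, outside of any agent loop
print(llm.invoke("What is the capital of Spain? Answer in one short sentence."))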

In our custom class, we have wrapped the text generation process of Mistral 7B. We also remove any part of the output that pertains to the prompt.

Additionally, we are handling what we refer to as “stop words”. These are specific words or phrases that Langchain will send with each execution of the model during the agent’s iterations. It is assumed that any “stop word” appearing in the output of our execution should be removed, along with all subsequent text.

Stop words are used to prevent hallucinations. We are going to build a JSON-type agent, meaning it will return a JSON structure within a Markdown Snippet in each iteration. Sometimes the LLM might provide information after the JSON or even send several actions at once. By enforcing the writing of a specific stop word after each JSON, we can remove all subsequent text from the output.

Another advantage of using this wrapper is that we can handle known errors. For example: From our testing, we’ve observed that Mistral 7B sometimes fails to properly close the Markdown Snippets. If they are not correctly closed, Langchain will struggle to parse the output. Therefore, we detect and correct this error before passing it to Langchain.

Note: Langchain can accurately interpret the outputs in JSON format, whether the JSON is embedded within a markdown snippet or not. However, for some reason, Mistral 7B performs better when asked to wrap the JSON within a markdown snippet.

Choosing the Tools

According to Langchain documentation:

Tools are functions that an agent can invoke.

Clear and simple definition.

Therefore, the tools we require depend on what we want our agent to be capable of performing.

The agent in our example will be able to perform searches on Wikipedia and solve mathematical operations using the Python module numexpr. To accomplish this, we will use two tools:

  • A tool for conducting searches on Wikipedia that is provided out of the box by Langchain itself.
  • A custom tool for solving mathematical operations.

First of all, the Wikipedia tool needs the wikipedia package to be installed:

!pip install wikipedia

After that, we will import the necessary modules:

import numexpr as ne
from langchain.tools import WikipediaQueryRun, BaseTool, Tool
from langchain_community.utilities import WikipediaAPIWrapper

Let’s create the Wikipedia Tool, just following its documentation:

wikipedia = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=2500))

We have initialized our Wikipedia engine, limiting the number of results to 1 and the maximum number of characters to 2500. This limitation is imposed because, although Mistral 7B supports prompts of up to 32000 tokens, a free Colab account does not provide enough memory for overly large inputs. Feel free to experiment with these parameters.

We can check if it works:

print(wikipedia.run("Deep Learning"))

Page: Deep learning
Summary: Deep learning is the subset of machine learning methods based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.Artificial neural networks were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, artificial neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog. ANNs are generally seen as low quality models for brain function.

Now we will create the tool:

wikipedia_tool = Tool(
    name="wikipedia",
    description="Never search for more than one concept at a single step. If you need to compare two concepts, search for each one individually. Syntax: string with a simple concept",
    func=wikipedia.run
)

The description we provide to the tool is extremely important. As we will delve into later, it is used in the prompt for each LLM execution, outlining the tool’s purpose. When experimenting with various tools, we may sometimes need to engage in prompt engineering to optimize their performance.

We’ve observed instances where the model attempts to search for more than one concept at the same time. To address this, we explicitly specify in the description that it should not do so. As you will see, there is a lot of prompt engineering work in this article.

Now we will create our Calculator tool:

class Calculator(BaseTool):
    name = "calculator"
    description = "Use this tool for math operations. It requires numexpr syntax. Use it whenever you need to solve any math operation. Be sure syntax is correct."

    def _run(self, expression: str):
        try:
            return ne.evaluate(expression).item()
        except Exception:
            return "This is not a numexpr valid syntax. Try a different syntax."

    def _arun(self, expression: str):
        raise NotImplementedError("This tool does not support async")

calculator_tool = Calculator()

We’ve noticed that on many occasions, Mistral 7B outputs operations using syntax that is not compatible with numexpr. Therefore, we’ll catch these errors and send back a message suggesting an attempt with a different syntax.

Let’s try it:

calculator_tool.run("2+3")

5

Great! This was a very simple tool, but you will find plenty of material about defining custom tools in the official documentation.
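We can also check the error handling by feeding the tool an expression that numexpr cannot parse. The expressions below are just examples:

# Valid numexpr syntax: evaluates normally
print(calculator_tool.run("sqrt(16) + 2**3"))  # 12.0

# Module-style calls are not valid numexpr, so the tool returns the corrective message
print(calculator_tool.run("math.sqrt(16)"))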

Now we should create a Python list with our two tools. We will use it later:

tools = [wikipedia_tool, calculator_tool]

Customizing the Prompt

We need to define the prompt template that our LLM will receive in each iteration, complete with all the necessary information to progress in solving the proposed problem and provide a response that Langchain can parse.

This is probably the trickiest part. We need to craft a specific prompt that aligns well with Mistral 7B, as the default prompts optimized for OpenAI models may not function as intended.

The prompt will have the following structure:

[Introduction about how to interact with the user]
[Tools to use]
[Description of the tools]
[Chat history]
[Example of interaction]
[User's question]
[Inner memory]

Some points to highlight:

  • JSON Communication Format: Our agent will be a “json chat agent”, so it will communicate using the JSON format inside a markdown snippet. Our prompt will emphasize the format and insist that a specific stop word be added after each JSON. In our case, the stop word will be “STOP”.
  • Tool Descriptions: The prompt must clearly explain the purpose of each tool. These descriptions will be drawn from those defined when creating the Tools. Providing this context helps Mistral 7B understand the intended use and capabilities of each tool, enabling it to apply them appropriately in its responses.
  • Example of an interaction: Mistral 7B performs better when provided with at least one example of the expected behavior. If it has access to a wider range of tools, especially those with more complex functionalities, it’s likely to require more examples to respond correctly.
  • Chat history and inner memory: Conversation memory is a record of the entire interaction history, keeping track of all questions and exchanges so that the agent can refer back to any part of the conversation. Inner memory, by contrast, covers only the steps taken while resolving the current query: the sequence of actions, decisions, and tool usages employed to address it. This focused record lets the agent follow the context and progression of its current task.

The JSON Communication format should be like this:

User: <user prompt|tool answer>
Assistant: ```json
{{"thought": "<agent chain of thought>",
"action": "<tool_name to use|Final Answer>",
"action_input": "<tool_parameters|output to the user>"}}
```

Our LLM must return a JSON object with three key components: “thought”, “action” and “action_input.”

  • The "thought" key helps the agent improve its reasoning capabilities. It has been demonstrated that large language models make better decisions when they are compelled to articulate their reasoning beforehand.
  • The “action” key is designed to specify either the name of the tool that should be utilized next or the term “Final Answer” when the model is ready to deliver the final response to the user.
  • The “action_input” key holds the parameters to be passed to the designated tool or, if the “action” is “Final Answer,” the conclusive text to be presented to the user.

Additionally, the model receives either the user’s initial prompt or the output from the last tool executed. For each iteration, the prompt provided to the model encompasses the entire sequence of outputs and JSONs from all steps taken during the current execution.

Now, let’s create the system section of the prompt:

system="""
You are designed to solve tasks. Each task requires multiple steps that are represented by a markdown code snippet of a json blob.
The json structure should contain the following keys:
thought -> your thoughts
action -> name of a tool
action_input -> parameters to send to the tool

These are the tools you can use: {tool_names}.

These are the tools descriptions:

{tools}

If you have enough information to answer the query, use the tool "Final Answer". Its parameter is the solution.
If there is not enough information, keep trying.

"""

There are two tags: {tool_names} and {tools}. They will be replaced by Langchain automatically with the tool names and descriptions already provided in their initialization.
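If you are curious about what gets substituted, you can render the tool descriptions yourself. A quick check (as far as we can tell, the agent constructor performs an equivalent substitution internally):

from langchain.tools.render import render_text_description

# Prints one "name: description" line per tool, which is what ends up inside the prompt
print(render_text_description(tools))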

Now, the human section of the prompt:

human="""
Add the word "STOP" after each markdown snippet. Example:

```json
{{"thought": "<your thoughts>",
"action": "<tool name or Final Answer to give a final answer>",
"action_input": "<tool parameters or the final output"}}
```
STOP

This is my query="{input}". Write only the next step needed to solve it.
Your answer should be based on the previous tool executions, even if you think you know the answer.
Remember to add STOP after each snippet.

These were the previous steps given to solve this query and the information you already gathered:
"""

Just as in the system section, there is a tag that Langchain will replace. In this case it is {input}, which will be replaced by the user's actual input. We are also providing an example and insisting on the stop word. After the last sentence of this prompt fragment, the history of actions for this task and all the collected information will automatically appear.

The next step is building the prompt with just this piece of code:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", human),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

We have concatenated our system and human prompts with two MessagesPlaceholder: “chat_history” is the conversation memory and “agent_scratchpad” the inner memory of the agent. They will be filled automatically by Langchain, so we don’t need to manage them.
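As a quick sanity check, we can inspect which variables the assembled template expects (the exact list may vary slightly between Langchain versions):

# input comes from the user; tool_names, tools and agent_scratchpad are filled in by the agent machinery
print(prompt.input_variables)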

Creating the Agent

We are almost there. Now we need to create the Agent and the Agent Executor. The definition of these concepts according to Langchain’s documentation is as follows:

An agent is the chain responsible for deciding what step to take next. This is usually powered by a language model, a prompt, and an output parser.

The agent executor is the runtime for an agent. This is what actually calls the agent, executes the actions it chooses, passes the action outputs back to the agent, and repeats.

Don’t worry, creating them is a straightforward process.

(Update 2024/04/12: Modified template_tool_response)

Let’s create the agent. As we stated before, it will be a “json chat agent”:

from langchain.agents import create_json_chat_agent, AgentExecutor
from langchain.memory import ConversationBufferMemory

agent = create_json_chat_agent(
    tools = tools,
    llm = llm,
    prompt = prompt,
    stop_sequence = ["STOP"],
    template_tool_response = "{observation}"
)

As you can see, the initialization receives as parameters the list of tools, the custom llm object, the prompt we just built, our stop word, and a template_tool_response. This last parameter is necessary because omitting it leads to the automatic addition of a long default text after each tool's execution, which confuses Mistral 7B and degrades its response quality. We replace it with just the tag "{observation}". This tag will be replaced by each tool's response, and we don't need to add anything else.

Now let’s create the Agent Executor:

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, memory=memory, handle_parsing_errors=True)

Setting the ‘verbose’ parameter to ‘False’ results in receiving only the agent’s final answer. On the other hand, a ‘True’ value enables us to observe the intermediate steps involved.

Warning! In implementing memory within the AgentExecutor, it’s been observed that adding memory can sometimes lead to confusion for the model. This complication arises because the model might struggle to differentiate the accumulated knowledge from previous interactions and the immediate context of the current task. Therefore, unless memory is explicitly required for the specific use case at hand — where the benefits of maintaining a historical context outweigh the potential for confusion — it’s advisable to avoid incorporating it.
In the notebook we don’t use it and initialize the agent_executor without it.
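For reference, the memory-free initialization simply drops the memory argument; this is the variant used in the notebook:

# No conversation memory: each invoke starts from a clean slate
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)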

Testing the Agent

Here we go. With our agent created, all that remains is to put it to the test. We will run the agent in the following way:

agent_executor.invoke({"input": "<input text>"})

We’ll start with something simple:

agent_executor.invoke({"input": "How much is 23 plus 17?"})

> Entering new AgentExecutor chain...
```json
{"thought": "This query requires the use of the calculator tool.",
"action": "calculator",
"action_input": "23 + 17"}
```40```json
{"thought": "The calculator tool returned the correct answer.",
"action": "Final Answer",
"action_input": "40"}
```

> Finished chain.
{'input': 'How much is 23 plus 17?', 'output': '40'}

We observe that the agent used the calculator tool, successfully retrieved the answer, and ultimately, by employing the ‘Final Answer’ action, returned the output.

Let’s try a different type of question:

agent_executor.invoke({"input": "What is the capital of France?"})

> Entering new AgentExecutor chain...
```json
{"thought": "The capital of France is a concept that can be found on Wikipedia.",
"action": "wikipedia",
"action_input": "capital of France"}```Page: Paris
Summary: Paris is the capital and most populous city of France. With an official estimated population of 2,102,650 residents as of 1 January 2023 in an area of more than 105 km2 (41 sq mi), Paris is the fourth-most populated city in the European Union and the 30th most densely populated city in the world in 2022. Since the 17th century, Paris has been one of the world's major centres of finance, diplomacy, commerce, culture, fashion, and gastronomy. For its leading role in the arts and sciences, as well as its early and extensive system of street lighting, in the 19th century, it became known as the City of Light.The City of Paris is the centre of the Île-de-France region, or Paris Region, with an official estimated population of 12,271,794 inhabitants on 1 January 2023, or about 19% of the population of France. The Paris Region had a GDP of €765 billion (US$1.064 trillion, PPP) in 2021, the highest in the European Union. According to the Economist Intelligence Unit Worldwide Cost of Living Survey, in 2022, Paris was the city with the ninth-highest cost of living in the world.Paris is a major railway, highway, and air-transport hub served by two international airports: Charles de Gaulle Airport (the third-busiest airport in Europe) and Orly Airport. Opened in 1900, the city's subway system, the Paris Métro, serves 5.23 million passengers daily; it is the second-busiest metro system in Europe after the Moscow Metro. Gare du Nord is the 24th-busiest railway station in the world and the busiest outside Japan, with 262 million passengers in 2015. Paris has one of the most sustainable transportation systems and is one of the only two cities in the world that received the Sustainable Transport Award twice.Paris is especially known for its museums and architectural landmarks: the Louvre received 8.9. million visitors in 2023, on track for keeping its position as the most-visited art museum in the world. The Musée d'Orsay, Musée Marmottan Monet and Musée de l'Or```json
{"thought": "The capital city of France, as per the obtained information, is Paris.",
"action": "Final Answer",
"action_input": "Paris"}
```

> Finished chain.
{'input': 'What is the capital of France?', 'output': 'Paris'}

As expected, it uses the Wikipedia tool to fetch an entry from Wikipedia and delivers the output using the ‘Final Answer’ action.

Now, we’ll attempt something more challenging. We aim to retrieve population data from Wikipedia and perform a mathematical operation on it, all within a single query:

agent_executor.invoke({"input": "What is the double of the population of Madrid?"})

> Entering new AgentExecutor chain...
```json
{"thought": "The population of Madrid is the information needed to find its double. I will use the wikipedia tool to find the population of Madrid.",
"action": "wikipedia",
"action_input": "Population of Madrid"
}```Page: Madrid
Summary: Madrid ( mə-DRID, Spanish: [maˈðɾið] ) is the capital and most populous city of Spain. The city has almost 3.4 million inhabitants and a metropolitan area population of approximately 7 million. It is the second-largest city in the European Union (EU), and its monocentric metropolitan area is the second-largest in the EU. The municipality covers 604.3 km2 (233.3 sq mi) geographical area. Madrid lies on the River Manzanares in the central part of the Iberian Peninsula at about 650 meters above mean sea level. The capital city of both Spain and the surrounding autonomous community of Madrid (since 1983), it is also the political, economic, and cultural centre of the country. The climate of Madrid features hot summers and cool winters.
The Madrid urban agglomeration has the second-largest GDP in the European Union and its influence in politics, education, entertainment, environment, media, fashion, science, culture, and the arts all contribute to its status as one of the world's major global cities. Due to its economic output, high standard of living, and market size, Madrid is considered the major financial centre and the leading economic hub of the Iberian Peninsula and of Southern Europe. The metropolitan area hosts major Spanish companies such as Telefónica, Iberia, BBVA and FCC. It concentrates the bulk of banking operations in the country and it is the Spanish-speaking city generating the largest amount of webpages. For innovation, Madrid is ranked 19th in the world and 7th in Europe from 500 cities, in the 2022–2023 annual analysts Innovation Cities Index, published by 2ThinkNow.Madrid houses the headquarters of the UN's World Tourism Organization (UNWTO), the Ibero-American General Secretariat (SEGIB), the Organization of Ibero-American States (OEI), and the Public Interest Oversight Board (PIOB). It also hosts major international regulators and promoters of the Spanish language: the Standing Committee of the Association of Spanish Language```json
{"thought": "I have the population of Madrid from the previous step. I will calculate the double of the population using the calculator tool.",
"action": "calculator",
"action_input": "2 * 3.4e6"
}
```6800000.0AI: ```json
{"thought": "The double of the population of Madrid is 6,800,000.",
"action": "Final Answer",
"action_input": "6,800,000"
}
```

> Finished chain.
{'input': 'What is the double of the population of Madrid?',
'output': '6,800,000'}

Not bad, right? And now, the last test. We will aim for the agent to conduct two searches on Wikipedia and compare the responses.

agent_executor.invoke({"input": "Who is older, Tom Hanks or Kevin Costner?"})

> Entering new AgentExecutor chain...
```json
{"thought": "I need to find the birth years of Tom Hanks and Kevin Costner to determine who is older.",
"action": "wikipedia",
"action_input": "Tom Hanks"}```Page: Tom Hanks
Summary: Thomas Jeffrey Hanks (born July 9, 1956) is an American actor and filmmaker. Known for both his comedic and dramatic roles, he is one of the most popular and recognizable film stars worldwide, and is regarded as an American cultural icon. Hanks' films have grossed more than $4.9 billion in North America and more than $9.96 billion worldwide, making him the fourth-highest-grossing actor in North America. He has received numerous honors including the AFI Life Achievement Award in 2002, the Kennedy Center Honor in 2014, the Presidential Medal of Freedom and the French Legion of Honor both in 2016, as well as the Golden Globe Cecil B. DeMille Award in 2020.Hanks made his breakthrough with leading roles in a series of comedy films which received positive media attention, such as Splash (1984), The Money Pit (1986), Big (1988), and A League of Their Own (1992). He won two consecutive Academy Awards for Best Actor for starring as a gay lawyer suffering from AIDS in Philadelphia (1993) and the title character in Forrest Gump (1994). Hanks collaborated with film director Steven Spielberg on five films: Saving Private Ryan (1998), Catch Me If You Can (2002), The Terminal (2004), Bridge of Spies (2015), and The Post (2017), as well as the WW II miniseries Band of Brothers (2001), The Pacific (2010), and Masters of the Air (2024). With the former he launched his career as a director, producer, and screenwriter. He has also frequently collaborated with film directors Ron Howard, Nora Ephron, and Robert Zemeckis.
Hanks' other films include the romantic comedies Sleepless in Seattle (1993) and You've Got Mail (1998); the dramas Apollo 13 (1995), The Green Mile (1999), Cast Away (2000), Road to Perdition (2002) and Cloud Atlas (2012); and the biographical dramas Charlie Wilson's War (2007), Captain Phillips (2013), Saving Mr. Banks (2013), Sully (2016), A Beautiful Day in the Neighborhood (2019), News of the World (2020) and Elvis (2022). He has also appear```json
{"thought": "I have the birth year of Tom Hanks which is 1956. I need the birth year of Kevin Costner to compare and find out who is older.",
"action": "wikipedia",
"action_input": "Kevin Costner"}```Page: Kevin Costner
Summary: Kevin Michael Costner (born January 18, 1955) is an American actor, producer, and director. He has received various accolades, including two Academy Awards, three Golden Globe Awards, and a Primetime Emmy Award.
He rose to prominence starring in such films as The Untouchables (1987), Bull Durham (1988), Field of Dreams (1989), JFK (1991), Robin Hood: Prince of Thieves (1991), The Bodyguard (1992), A Perfect World (1993), and Wyatt Earp (1994). During this time, Costner directed and starred in the western epic Dances with Wolves (1990), for which he won two Academy Awards for Best Picture and Best Director. He then starred in and co-produced Waterworld (1995) and directed The Postman (1997) and Open Range (2003).Costner's other notable films include Silverado (1985) No Way Out (1987), Tin Cup (1996), Message in a Bottle (1999), For Love of the Game (1999), Thirteen Days (2000), Mr. Brooks (2007), Swing Vote (2008), The Company Men (2010), 3 Days to Kill (2014), Draft Day (2014), Black or White (2014), McFarland, USA (2015), and The Highwaymen (2019). He has also played supporting parts in such films as The Upside of Anger (2005), Man of Steel (2013), Jack Ryan: Shadow Recruit (2014), Hidden Figures (2016), Molly's Game (2017), and Let Him Go (2020).
On television, Costner portrayed Devil Anse Hatfield in the miniseries Hatfields & McCoys (2012), winning the Primetime Emmy Award for Outstanding Lead Actor in a Limited or Anthology Series or Movie. Since 2018, he has starred as John Dutton on the Paramount Network original drama series Yellowstone for which he received a Screen Actors Guild Award nomination and a Golden Globe award.

AI: ```json
{"thought": "I have the birth year of Tom Hanks which is 1956 and the birth year of Kevin Costner which is 1955. I can now compare the two to determine who is older.",
"action": "Final Answer",
"action_input": "Tom Hanks is older than Kevin Costner."}
```

> Finished chain.
{'input': 'Who is older, Tom Hanks or Kevin Costner?',
'output': 'Tom Hanks is older than Kevin Costner.'}

Awesome! BUT… no… The agent retrieved the birth dates for both individuals: Tom Hanks was born in 1956, and Kevin Costner in 1955. Therefore, Kevin Costner is evidently a year older, not the other way around. Perhaps this is as much as we can expect from the reasoning capabilities of a 7B model. On the other hand, in some runs it has accurately used the calculator tool to verify the age difference. Yet, in others, it has embarked on bizarre calculations leading to absurd conclusions. Hence, there is room for improvement.

Realistic Expectations and Strategic Approaches

It’s important to clarify: While the results of these examples have been good, we should keep realistic expectations. Mistral 7B is a smaller model and therefore has limitations in reasoning capabilities. Attempting to deploy an agent that requires a wide variety of Tools or complex reasoning might not be successful, even with extensive prompting efforts.

One approach is to use specialized agents and turn them into Tools. For instance, consider a scenario where we need to retrieve information from different sources (such as a RAG retriever, SQL, and internet search) and select among them based on certain conditions. Additionally, depending on the retrieved information, we might need to choose from multiple tools to execute. Trying to accomplish this with a single agent would result in an overly large prompt filled with examples, overwhelming Mistral 7B. A solution could be to create one specialized agent for information retrieval and another for decision-making based on the retrieved information, each with its own prompt. These agents could then be converted into Tools, allowing a main agent to orchestrate them.
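As a minimal sketch of that idea, assuming we had already built a second, specialized executor called research_executor (a hypothetical name, not defined in this tutorial), the wrapping could look like this:

def run_research_agent(query: str) -> str:
    # Delegate the query to the specialized agent and return only its final answer
    return research_executor.invoke({"input": query})["output"]

research_tool = Tool(
    name="research_agent",
    description="Use this tool to retrieve information from our internal sources. Syntax: a plain-text question.",
    func=run_research_agent
)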

Another obvious route is model fine-tuning, which could significantly reduce the prompt size. However, the main challenge here lies in dataset construction.

For more reliable performance, it would be advisable to use Mixtral 8x7B. Exploring these concepts further, however, is beyond the scope of this article. Don’t hesitate to experiment!
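If you have the hardware for it, swapping models only requires changing the model name before the loading step; note that even with 4-bit quantization, Mixtral 8x7B needs far more GPU memory than a free Colab instance provides:

# Hypothetical swap: the rest of the pipeline stays the same
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"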

The post turned out to be longer than anticipated. I encourage you to download the code and you’ll see that the entire process is not as complicated as it might seem.

I look forward to your comments!
