Function calling with Mistral 7B

A quick and reliable way of calling functions using the Open Source LLM Mistral 7B

Tingkart
4 min read · Jan 20, 2024

Function calling with open-source models unveils intriguing possibilities, but it can be hard to get the models to answer in a format we can parse, and inference can be slow. This article, using the Mistral 7B model, strives to address both of these issues.

This article and solution are greatly inspired by the innovations of Aurelio AI Lab, the creators of the Semantic Router available at https://github.com/aurelio-labs/semantic-router. The Semantic Router is a great implementation of rapid decision-making for large language models (LLMs) and intelligent agents.

Dalle-3 — prompt ‘Image depicting Mistral, the wind, as an anthropomorphic character.’

Install llama-cpp

pip install llama-cpp-python

The default pip install behavior is to build llama.cpp for CPU only on Linux and Windows, and to use Metal on macOS. Go to https://pypi.org/project/llama-cpp-python/ for details on installing hardware-acceleration backends that use GPUs.
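For example, on a machine with an NVIDIA GPU you can pass CMake flags at install time to build with CUDA offloading. Treat this as a sketch: the exact flag name depends on your llama-cpp-python version (older releases use -DLLAMA_CUBLAS=on, newer ones -DGGML_CUDA=on), so check the PyPI page above for your version.

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir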

Download model

Download the GGUF model from Hugging Face using curl. You can use larger quantized models (Q8) for better accuracy or smaller ones (Q2) for speed. For my use case the Q2 model is accurate enough and really quick.

# Fastest and smallest model 
curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q2_K.gguf?download=true" -o ./mistral-7b-instruct-v0.2.Q2_K.gguf

# Larger and more accurate model, remember to update script when using this or another model
curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf?download=true" -o ./mistral-7b-instruct-v0.2.Q8_0.gguf

Create a grammar file called json.gbnf

The json.gbnf file uses llama.cpp's grammar format (GBNF) to ensure that model output adheres to a specific structure, in this case valid JSON. Using a grammar is crucial for applications that require structured and accurate output from an LLM.

Place the json.gbnf file in the same folder as your Python file and the downloaded model.

root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws

object ::=
  "{" ws (
    string ":" ws value
    ("," ws string ":" ws value)*
  )? "}" ws

array ::=
  "[" ws (
    value
    ("," ws value)*
  )? "]" ws

string ::=
  "\"" (
    [^"\\] |
    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
  )* "\"" ws

number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws

# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= ([ \t\n] ws)?
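To see the grammar in action on its own, here is a minimal sketch (assuming the Q2 model file and json.gbnf sit in the working directory) that constrains a plain chat completion to valid JSON, independently of the MethodCaller built below:

import llama_cpp
from llama_cpp import LlamaGrammar

# Load the GBNF grammar; generation will only produce tokens that satisfy it
grammar = LlamaGrammar.from_file("json.gbnf")

model = llama_cpp.Llama(
    model_path="./mistral-7b-instruct-v0.2.Q2_K.gguf",
    n_ctx=2048,
    verbose=False,
)

completion = model.create_chat_completion(
    messages=[{"role": "user", "content": "Describe the weather in Hawaii as a JSON object."}],
    grammar=grammar,      # forces the reply to be valid JSON
    temperature=0.1,
    max_tokens=200,
)

print(completion["choices"][0]["message"]["content"])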

The MethodCaller

The MethodCaller.run method accepts a natural-language query string and the function to be executed. The query should contain the parameter values that are passed to the function.

It is important that the function is documented using the standard Python docstring format. This documentation gives the LLM a clear description of the function's purpose, its parameters, and the expected return type.

import inspect
import json
from typing import Any, Callable, Dict

import llama_cpp
from llama_cpp import LlamaGrammar


class MethodCaller:

    def __init__(self, model: Any):
        # Grammar that constrains generation to valid JSON
        self.grammar = LlamaGrammar.from_file("json.gbnf")
        self.model = model

    def get_schema(self, item: Callable) -> Dict[str, Any]:
        # Describe the target function (name, docstring, signature, return type)
        schema = {
            "name": item.__name__,
            "description": str(inspect.getdoc(item)),
            "signature": str(inspect.signature(item)),
            "output": str(inspect.signature(item).return_annotation),
        }
        return schema

    def run(self, query: str, function: Callable):

        schema = self.get_schema(function)

        prompt = f"""
You are a helpful assistant designed to output JSON.
Given the following function schema
<< {schema} >>
and query
<< {query} >>
extract the parameter values from the query, in a valid JSON format.
Example:
Input:
query: "How is the weather in Hawaii right now in International units?"
schema:
{{
    "name": "get_weather",
    "description": "Useful to get the weather in a specific location",
    "signature": "(location: str, degree: str) -> str",
    "output": "<class 'str'>"
}}

Result: {{
    "location": "Hawaii",
    "degree": "Celsius"
}}

Input:
query: {query}
schema: {schema}
Result:
"""

        # Grammar-constrained, low-temperature completion so the reply is parseable JSON
        completion = self.model.create_chat_completion(
            messages=[{
                "role": "user",
                "content": prompt
            }],
            temperature=0.1,
            max_tokens=500,
            grammar=self.grammar,
            stream=False,
        )

        output = completion["choices"][0]["message"]["content"]
        # Best-effort cleanup in case the model used single quotes or a trailing comma
        output = output.replace("'", '"').strip().rstrip(",")

        function_inputs = json.loads(output)
        print(function_inputs)
        return function(**function_inputs)


# Function to be called, documented using the standard Python docstring format
def add_two_numbers(first_number: int, second_number: int) -> int:
    """
    Adds two numbers together.

    :param first_number: The first number to add.
    :type first_number: int

    :param second_number: The second number to add.
    :type second_number: int

    :return: The sum of the two numbers.
    """
    return first_number + second_number


model = llama_cpp.Llama(
    model_path="./mistral-7b-instruct-v0.2.Q2_K.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU if an accelerated backend is available
    n_ctx=2048,
    verbose=False,
)

fun = MethodCaller(model)

result = fun.run("Add 2 and 5", add_two_numbers)
print(result)

result = fun.run("Add 66 and seventy", add_two_numbers)
print(result)
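To see what the model actually works with, you can print the schema that get_schema builds (continuing the script above). For "Add 2 and 5" the grammar-constrained reply is typically something like {"first_number": 2, "second_number": 5}, which run unpacks into add_two_numbers(first_number=2, second_number=5) and returns 7; the exact JSON can vary between runs, but the grammar guarantees it parses.

# Inspect the schema handed to the model for add_two_numbers
print(json.dumps(fun.get_schema(add_two_numbers), indent=2))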

Hope you found this method of calling functions using Mistral 7B useful and interesting.

On my old and trusty M1 16GB MacBook, resolving and executing this function takes on average 1.4 seconds. It would be interesting to know how quickly it executes on other setups.

