Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace OpenAI Function Calling

Vassilios Antonopoulos
Published in 11tensors
9 min read · Dec 6, 2023

Various pretrained Large Language Models (LLMs) today come with notable reasoning capabilities, enabling them to break down intricate issues into simpler steps, offering solutions, actions, and evaluations at each step. By leveraging the sophisticated natural language understanding capabilities of LLMs, AI agents can interpret and generate human-like text, enabling them to comprehend complex queries, engage in dialogue, and perform various knowledge-intensive and language-related tasks with a high degree of precision and context sensitivity.

Much of an AI agent's potential and power comes from its ability to use external tools, such as a company's API, a search engine, a RAG component for getting answers from internal private data, or another specialized LLM (e.g. a code generator).

The most common way to connect an AI agent to an external tool is through an API. This means that the agent backed by the LLM should be able to understand when a request needs information provided by the API, pause its processing, call the API with the correct set of parameters, get the response, incorporate it into the rest of the information and continue processing to reach the final response to the initial request. These requests can come from users (conversational agents), other software components or even other LLMs.

The challenge for the agent is not only to understand that a call to an external API is needed and interrupt the processing to do it, but also to select the appropriate method from the API and, of course, use the correct set of parameters for the call. In the pre-LLM period, we built agents using a Natural Language Understanding (NLU) component. This component was either a custom-trained intent classifier combined with an Entity Extraction module, or a similar component offered by third-party NLU services like Dialogflow, wit.ai and rasa.ai, trained on sample utterances from our application. In a similar way, an LLM could now be used as an intent classifier and an entity extraction module in subsequent calls by leveraging appropriate prompts. So, one could use several LLM calls to mimic the previous flow and, step by step, determine whether an API should be called, which API and method to call, and with which parameters. But that no longer seems optimal. Current LLMs should be able to do this more efficiently and in a more automated manner. Is this the case?
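For illustration, a rough sketch of that multi-call approach (with hypothetical prompts and a generic ask callable standing in for the LLM; none of this is the code we actually shipped) could look like this:

import json

INTENTS = ["recommend_clothes", "get_product_details", "none"]

def classify_and_extract(ask, user_message):
    # Step 1: intent classification with a dedicated prompt
    intent = ask(
        "Classify the user's request into one of "
        + str(INTENTS) + ". Respond with the label only.\n"
        "User: " + user_message + "\nLabel:"
    ).strip()
    if intent == "none":
        return None
    # Step 2: parameter (entity) extraction for the chosen intent
    params = ask(
        "Extract the parameters for the '" + intent + "' action as a JSON object.\n"
        "User: " + user_message + "\nJSON:"
    )
    return {"action": intent, "action_input": json.loads(params)}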

OpenAI offers two models, the latest being gpt-3.5-turbo-1106 and gpt-4-1106-preview, which have been trained both to detect when a function should be called (depending on the model input) and to respond with JSON that adheres to the function signature. This is called Function Calling, and it is the proposed way to connect an agent with an external API.

Apart from detecting which API function to call, what is really critical for the LLM here is to reliably return structured data in a standard format; otherwise we cannot parse the response and determine when, which and how to call an API method. The specifications of the API methods are passed to the model together with the prompt.
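For reference, this is roughly what a call looks like on the OpenAI side, assuming the current openai Python client and a simplified, hypothetical schema for one of our marketplace functions:

from openai import OpenAI

client = OpenAI()

# Hypothetical, simplified schema for one marketplace function
tools = [{
    "type": "function",
    "function": {
        "name": "recommend_clothes",
        "description": "Recommend clothes filtered by category, season and price.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "season": {"type": "string"},
                "max_price": {"type": "number"},
            },
            "required": ["category"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Show me some warm dresses for the winter."}],
    tools=tools,
)

message = response.choices[0].message
# When the model decides a function is needed, it returns the call and its
# JSON-formatted arguments in a structured field instead of free text.
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)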

11tensors has developed an autonomous AI fashion agent with vision and text understanding capabilities that learns the customer’s style and helps them shop. The agent connects to a fashion marketplace via its API. The initial implementation was based on OpenAI’s models and Function Calling. Both models proved to be very powerful and robust in this use case. They were consistent in returning JSON when a function call was needed, precise in the selection of the function and its parameters, and intelligent in inferring metadata from the conversation history in order to act effectively even on rather difficult requests, such as “I think I prefer the second dress you showed to me earlier, please provide me with some other similar options”. Here the model needs to go back in the conversation history, find the dresses request-response pair, point to the second product in the list, get its ID and then call the corresponding function.

The agent performed very well in all the use cases. But the tokens were consumed very fast :-) ! We had reached a solution; however, it exceeded the anticipated budget. So, we searched for an alternative. Can an open-source LLM approach the performance of OpenAI’s models and offer the same Function Calling capability?

We tested the Llama2-7b-chat and gorilla-openfunctions-v1 models. The second one is fine-tuned for this specific usage (API calling). They both performed decently. It proved really hard to make Llama2 return structured responses, while the gorilla model was quite consistent at this. Llama2 was better at inference and gorilla had problems with the tricky cases, but both of them fell well short of the performance we had with OpenAI. So, we were a bit disappointed.

It was then that Intel announced its neural-chat-7b-v3-1 LLM, a fine-tune of mistralai/Mistral-7B-v0.1, which had scored extremely well on several reasoning benchmarks. Mistral-7b stands out as the top-performing model with fewer than 30 billion parameters, surpassing numerous larger models across various benchmarks. So, this fine-tune seemed very interesting and promising, and we decided to give it a try.

The neural-chat model is trained neither to respond with JSON nor to do function calling. In order to get the same functionality as OpenAI Function Calling, we need to think a bit differently. Our approach was to introduce a concurrent 2-mode operation of the LLM:

  • Function Calling mode (we call it “tool”)
  • Conversation mode (we call it “chat”)

We prompt the two modes differently. User messages are first checked for a possible “tool” need (API call) using the Function Calling mode. The function descriptions in the prompt, and the motivation to try this method, came from a very insightful post by Aaron Blondeau (https://dev.to/aaronblondeau/creating-a-llama-2-agent-6fj).

Inside the prompt, we provide the description of the API, few-shot examples, the history of the conversation and the message from the user, and we repeat the request for a JSON response in case a function should be called.

tool_prompt = '''
### System:
Assistant is an expert JSON builder designed to assist the users of a women's fashion marketplace.
Assistant is able to trigger actions for User by responding with JSON strings that contain "action" and "action_input" parameters.

Actions available to Assistant are:

- "recommend_clothes": Useful for when Assistant is asked to provide clothing recommendations, which can be filtered on price, designer brand, season and clothing category.
- To use the recommend_clothes tool, Assistant should respond like so:
{{"action": "recommend_clothes", "action_input": {{"category": "Dresses", "season": "Winter", "min_price" : 20, "max_price": 550, "designer": "mpatmos"}}}}
- "get_product_details": Useful for when Assistant is asked to retrieve all metadata details of a specific clothing item product.
- To use the get_product_details tool, Assistant should respond like so:
{{"action": "get_product_details", "action_input": {{"productID": 55783296}}}}
- "get_similar_products": Useful for when Assistant is asked to find clothes similar to the one mentioned.
- To use the get_similar_products tool, Assistant should respond like so:
{{"action": "get_similar_products", "action_input": {{"productID": 12748937}}}}
- "get_designers_list": Useful for when Assistant is asked to provide a list of the designers or brands contained in the marketplace.
- To use the get_designers_list tool, Assistant should respond like so:
{{"action": "get_designers_list", "action_input": "all"}}
- "get_designer_information": Useful for when Assistant is asked to provide information about a specific designer or brand.
- To use the get_designer_information tool, Assistant should respond like so:
{{"action": "get_designer_information", "action_input": {{"name": "mpatmos"}}}}

Here are some past examples of Assistant responding to the User:

User: Hey how are you today?
Assistant: I'm good thanks, how are you?
User: Please, show me some warm dresses you have for the winter.
Assistant: {{"action": "recommend_clothes", "action_input": {{"category": "Dresses", "season": "Winter", "min_price" : 0, "max_price": 550, "designer": ""}}}}
User: Interesting! Give me more details on the Abracatabra dress please.
Assistant: {{"action": "get_product_details", "action_input": {{"productID": 43731284}}}}
User: Very nice! Can you get me some others similar to the Trijichta pants?
Assistant: {{"action": "get_similar_products", "action_input": {{"productID": 386786777}}}}
User: Who are the designers and the brands you collaborate with?
Assistant: {{"action": "get_designers_list", "action_input": "all"}}
User: Please give me more information for the happyfrenchgang brand.
Assistant: {{"action": "get_designer_information", "action_input": {{"name": "happyfrenchgang"}}}}
User: Thanks, Bye!
Assistant: See you later.

The history of the current conversation between Assistant and User is:
{0}

### User:
{1}

Respond with a JSON string that contains "action" and "action_input" parameters if a tool is needed.
### Assistant: '''
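The get_prompt helper used in the code below simply returns the right template for the given model type and mode; a minimal sketch, assuming the templates are kept in a dict keyed by a hypothetical MODEL_TYPE value, could be:

# Hypothetical: prompt templates per model type and mode. MODEL_TYPE would be
# set to the key of the model in use; chat_prompt is defined further below.
PROMPTS = {
    "neural-chat": {"tool": tool_prompt, "chat": chat_prompt},
}

def get_prompt(model_type, mode):
    # Return the raw template; the caller fills it in with .format(...)
    return PROMPTS[model_type][mode]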

If the response from the “tool” LLM contains the requested JSON, i.e. it specifies that a function call should be made, we make the call and the response from the marketplace API (also JSON) is given as Context to the “chat” LLM. Otherwise, the user request is forwarded as-is to the “chat” mode LLM, which is responsible for the conversational part, and the “tool” response is ignored entirely.
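Both modes share a single llm object, which is created elsewhere; a plausible setup, assuming llama-cpp-python and a GGUF quantization of neural-chat-7b-v3-1 (the file name below is hypothetical), looks like this:

from llama_cpp import Llama

# Hypothetical setup: a quantized GGUF build of Intel's neural-chat-7b-v3-1
# loaded with llama-cpp-python. The "tool" and "chat" modes reuse this single
# model instance; only the prompts differ.
llm = Llama(
    model_path="./models/neural-chat-7b-v3-1.Q4_K_M.gguf",
    n_ctx=4096,        # context window large enough for prompt + history
    n_gpu_layers=-1,   # offload all layers to the GPU if available
)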

import json

def check_tool(command):
    # Put the user command into the "tool" prompt and inject the history
    history = "\n".join(fashion_conversation.conversation_history)
    prompt = get_prompt(MODEL_TYPE, "tool").format(history, "User: " + command)

    # Send the command to the model
    output = llm(prompt, stop=["User:"], max_tokens=800)
    response = output['choices'][0]['text']
    print(str(response))
    response = response.replace('\n', '')

    # Try to find JSON in the response string.
    # The heuristic is needed mainly for Llama2 model responses:
    # Llama2 usually responds like "Sure! I can help with this, here's the json..."
    try:
        # Extract the JSON from the model response by finding the first "{" and the last "}"
        firstBracketIndex = response.index("{")
        lastBracketIndex = len(response) - response[::-1].index("}")
        jsonString = response[firstBracketIndex:lastBracketIndex]
        responseJson = json.loads(jsonString)
        marketplace_context = call_marketplace(responseJson)
        return marketplace_context, True
    except Exception as e:
        print(e)
        # No JSON match: return False to denote that this response should be ignored
        return response, False
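The call_marketplace helper forwards the parsed action to the marketplace API and returns its JSON, which then becomes the Context for the “chat” LLM. A rough sketch, with hypothetical endpoint paths, could be:

import requests

API_BASE = "https://marketplace.example.com/api"  # hypothetical base URL

def call_marketplace(response_json):
    # Dispatch the action chosen by the "tool" LLM to the marketplace API
    action = response_json["action"]
    params = response_json["action_input"]
    if action == "recommend_clothes":
        r = requests.get(f"{API_BASE}/products", params=params)
    elif action == "get_product_details":
        r = requests.get(f"{API_BASE}/products/{params['productID']}")
    elif action == "get_similar_products":
        r = requests.get(f"{API_BASE}/products/{params['productID']}/similar")
    elif action == "get_designers_list":
        r = requests.get(f"{API_BASE}/designers")
    elif action == "get_designer_information":
        r = requests.get(f"{API_BASE}/designers/{params['name']}")
    else:
        raise ValueError(f"Unknown action: {action}")
    r.raise_for_status()
    return r.json()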


def agent_response(user_message):
    # Process the user message with the "tool" mode of the LLM
    marketplace_context, blnNeedAPICall = check_tool(user_message)

    if blnNeedAPICall:
        history = "\n".join(fashion_conversation.conversation_history)
        user_augmented_message = str(marketplace_context) + "\n\n" + "User: " + user_message
        current_context = str(marketplace_context)
        current_user_message = "User: " + user_message
        # Update the "chat" prompt
        new_prompt = get_prompt(MODEL_TYPE, "chat").format(history, current_user_message, current_context)
        fashion_conversation.add_message("User: ", user_augmented_message)
    else:
        # If no function is selected to be called, ignore the "tool" response entirely
        # and re-prompt the user message to get the final response from the "chat" mode LLM
        current_user_message = "User: " + user_message
        history = "\n".join(fashion_conversation.conversation_history)
        # No context to provide, just use the history and respond
        current_context = "There is no context to provide. Answer User based on system instructions and current conversation history. "
        new_prompt = get_prompt(MODEL_TYPE, "chat").format(history, current_user_message, current_context)
        fashion_conversation.add_message("User: ", user_message)

    # Send the prompt to the model
    output = llm(new_prompt, stop=["User:"], max_tokens=800)
    finalResponse = output['choices'][0]['text']
    # Update the history
    fashion_conversation.add_message("Assistant: ", finalResponse)
    return finalResponse
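The fashion_conversation object simply keeps the shared history. The snippets above only rely on a conversation_history list of strings and an add_message(prefix, text) method, so a minimal sketch could be:

class Conversation:
    # Minimal history holder shared by the "tool" and "chat" modes.
    def __init__(self, max_turns=20):
        self.conversation_history = []
        self.max_turns = max_turns

    def add_message(self, prefix, text):
        self.conversation_history.append(prefix + text.strip())
        # Keep the prompt from growing past the model's context window
        self.conversation_history = self.conversation_history[-self.max_turns:]

fashion_conversation = Conversation()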

The prompt for the “chat” mode is presented below. The history of the conversation is shared between the two modes of LLM usage. The Context is given after the user message, because when it is given before, the model sometimes ignores it.

chat_prompt = '''
### System:
You are a helpful assistant of a women's fashion marketplace.
Assistant recommends clothes to the user and gives information about the marketplace's designers and brands by ONLY using the Context provided.
Assistant never responds with information that is not in the Context.
Context is given in json format.
Assistant is polite and always tries to sell products to the user.
Assistant never shows the ID of products to the user.
Assistant provides a textual description for each product that only contains the Title, Description and Price information provided in the json formatted Context, always accompanied by its product url (product_url provided in Context).
Assistant never invents new products not existing in the Context.
Assistant only participates in conversations about fashion and kindly declines to answer any other request.

The history of the current conversation between Assistant and User is:
{0}

### User:
{1}

Context:
{2}

### Assistant:

'''
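Putting the pieces together, a minimal interactive loop (purely illustrative) just feeds the user input to agent_response:

if __name__ == "__main__":
    print("Fashion agent ready. Type 'quit' to exit.")
    while True:
        user_message = input("You: ")
        if user_message.lower() in ("quit", "exit"):
            break
        print("Assistant:", agent_response(user_message))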

So, did it work? Well, yes it did!!!

Intel’s new model performed surprisingly well. It proved to be extremely efficient: very consistent in selecting the functions and returning structured output (function and parameters), and highly capable of handling all the difficult cases. Our 2-mode approach combined with this specific model can definitely be considered an alternative to OpenAI Function Calling and models. It is slightly more complicated as a solution and may need a bit more tweaking and error handling, but it works. There are plenty of examples on the internet using LangChain for agents, but few of them actually work with open-source models, especially lighter ones in the 7-billion-parameter category. Being able to use a 7-billion-parameter model with strong inference and reasoning capabilities as the brain of an agent can be a game changer.

Now, I’m really curious to also test this model in several ReAct scenarios…
