One Simple Prompt Boosts Claude 3 and Mistral’s API Calling Performance Beyond GPT-4
Subtitle: I Spent $100 to Compare the API Calling Capabilities of Anthropic's Claude 3 and Mistral's Flagship Models
Abstract
This study explores the capabilities of Large Language Models (LLMs) in API calling, evaluating eight mainstream models, including OpenAI's GPT-4, Anthropic's Claude 3, and Mistral's flagship model, across four test scenarios. The study uses the Toolbench[1] test set, which contains both real APIs (such as OpenWeather and the Cat API) and fictional APIs (such as Home Search and Booking).
Experimental results show that in the real API scenarios, the supersized models of Claude 3 and Mistral perform comparably to GPT-3.5; in the fictional API scenarios, however, their performance is poorer, with a much higher error rate. Nevertheless, by adding one simple requirement to the prompt, the performance of Claude 3 and Mistral improved dramatically, matching or even surpassing GPT-4.
Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in multiple domains, with API calling being one of their fundamental functions. To delve deeper into this capability, we referred to recent relevant studies[1–6] and chose the test set provided by Toolbench[1] for evaluation.
In the Toolbench test set, problems are divided into single-step and multi-step scenarios. Single-step scenarios require the model to directly generate API calls to achieve the goal, while multi-step scenarios require the model to iteratively call APIs and decide the next action based on the returned information. This study primarily focuses on the performance of single-step problems.
OpenWeather
The OpenWeather website provides a series of APIs for querying weather information and other data for cities. These API calls include required parameters (such as longitude 'lon' and latitude 'lat') and optional parameters (such as language 'lang' and measurement units 'units'). For example, the API call to get the weather forecast for a city is as follows:
# Get the weather forecast data in {city}
curl -X GET 'https://api.openweathermap.org/data/2.5/forecast?q={city_formatted}&appid={API_KEY}{optional_params}'
Parameters:
- q: (required) City name.
- appid: (required) Your unique API key.
- units: (optional) Units of measurement. 'standard' (default), 'metric', and 'imperial' units are available.
- mode: (optional) Response format. 'JSON' format is used by default. To get data in 'XML' format, use mode=xml.
- lang: (optional) You can use the lang parameter to get the output in your language. 'en' for English (default); 'fr' for French; 'zh_cn' for simplified Chinese; 'it' for Italian; 'de' for German; 'ru' for Russian; 'ja' for Japanese; 'nl' for Dutch.
In this scenario, there are a total of 9 API calls, including retrieving current weather data for a city, getting air quality data for specific coordinates, and getting weather forecast data for a location by postal code. In addition, the paper authors developed 2 usage examples for each category of API. For example: “Please give me the air quality data at longitude 163.3 and latitude -80.0 at this moment.”
Task: Please give me the air quality data at longitude 163.3 and latitude -80.0 at this moment.
Action:
curl -X GET 'https://api.openweathermap.org/data/2.5/air_pollution?lat=-80.0&lon=163.3&appid={API_KEY}'
In addition, 100 test queries were constructed, such as "Tell me the current air pollution level at the location with latitude = 5.0 and longitude = -60.7."
To evaluate the quality of LLMs, the experiment looks for the first line of text starting with “curl” (if it exists). Then, a shell process is used to execute it. If the shell process returns a non-zero value, the authors consider this generation “non-executable.” On the other hand, if the code is executable, the authors compare the returned response with the corresponding result of a real curl request. If the output matches the expected result precisely, the model’s generation is considered successful.
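A minimal Python sketch of this checking loop might look as follows (the function and variable names are ours for illustration, not the paper's, and the expected response is assumed to have been captured beforehand with a manually verified curl request):
import subprocess

def evaluate_generation(model_output: str, expected_response: str) -> str:
    # Classify one generation as 'success', 'failure', or 'crash'.
    curl_lines = [ln.strip() for ln in model_output.splitlines() if ln.strip().startswith("curl")]
    if not curl_lines:
        return "crash"  # no executable curl line was produced
    # Execute the first curl line in a shell; a non-zero exit code means "non-executable".
    result = subprocess.run(curl_lines[0], shell=True, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return "crash"
    # Executable: compare the live response with the reference response.
    return "success" if result.stdout.strip() == expected_response.strip() else "failure"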
Finally, the authors constructed a Chain-of-Thought (CoT) Prompt template:
I have the following set of API:
{api_docs}
-------------
I have the following set of examples:
{examples}
-------------
Task: {query}
Actions:
For example, testing GPT-3.5 with this template yields the results quoted in the metric explanations below.
Explanation of Evaluation Metrics:
- Success Rate: The proportion of cases where the model’s output matches the expected result exactly. For example, GPT-3.5 gave the correct output in 88% of the test cases.
- Recall: The proportion of cases where the model can find the correct API but does not necessarily obtain the expected return result. Under this metric, GPT-3.5 could find the correct API in 94% of the cases.
- Crash Rate: The proportion of API calls generated by the model that cannot be executed correctly or return an error code. This could be due to formatting errors, incorrect parameter filling, etc. Under this metric, 4% of API calls made by GPT-3.5 were not executable.
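Read literally, the three metrics can be aggregated from per-query outcomes roughly like this (our own sketch of the definitions above; found_correct_api is a hypothetical per-query flag recording whether the right endpoint was chosen):
def aggregate_metrics(outcomes: list) -> dict:
    # Each outcome is a dict like {"status": "success" | "failure" | "crash", "found_correct_api": bool}.
    n = len(outcomes)
    return {
        "success_rate": sum(o["status"] == "success" for o in outcomes) / n,
        "recall": sum(o["found_correct_api"] for o in outcomes) / n,
        "crash_rate": sum(o["status"] == "crash" for o in outcomes) / n,
    }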
Cat API
The Cat API scenario mainly involves interacting with the Cat API website through its REST API, using GET, DELETE, and POST requests, unlike the OpenWeather scenario, which only involves GET requests. This scenario provides six types of API calls for handling operations related to cat images, such as:
- Removing a cat image from the user’s favorites.
- Adding an image to the user’s favorites.
- Returning a list of images in the favorites.
- Voting for or against an image.
- Searching for cat images based on filter criteria.
Here is a specific API call example to add an image with a specified ID to the favorites:
# Add the image with id {image_id} to the list of favorites
curl -X POST 'https://api.thecatapi.com/v1/favourites' --data '{"image_id":"{image_id}"}'
To test the model’s API calling ability, the paper authors developed two example tasks for each API category. For example, the task of adding a specific cat image to the favorites:
Task: Post image b6o to my favorites.
Action:
curl -X POST 'https://api.thecatapi.com/v1/favourites' --data '{"image_id":"b6o"}'
Similar to the OpenWeather scenario, the paper authors designed 100 user queries for the Cat API scenario, such as “Please give me 8 photos of cats in ascending order,” to evaluate the model’s performance in handling various requests.
Finally, the paper authors used the same Chain-of-Thought (CoT) based prompt template as the OpenWeather scenario to generate API calls. This unified prompt template helps to fairly compare the performance of models in different scenarios.
Home Search
The Home Search scenario is different from the previous two scenarios, mainly in that the test set and API are fictional, rather than real. This means that LLMs would not have been exposed to Home Search-related information during the pre-training process, so this scenario can better test the LLM’s ability to handle unknown APIs.
This scenario simulates the process of searching for homes based on specific criteria in a specific location. The API includes 15 functions, mainly including:
- set_location(value: string): Set the location of the search area.
- set_buy_or_rent(value: string): Specify whether to buy or rent a home.
- 12 functions for setting search criteria, such as house price, number of bedrooms, and house area.
- search(): Submit search criteria and get results.
For example, the description of API.set_location is as follows:
# To set the location for the search area. This function must be called before setting any criteria.
API.set_location(value: string)
To evaluate the performance of LLMs, the executability of generated actions and the F1 score were considered. To ensure search executability, LLMs need to call a series of functions in order, starting with set_location and set_buy_or_rent, followed by functions for setting criteria, and ending with the search() function. For example:
A task is to find a condo or townhouse in Indianapolis between 1500 and 3050 square feet.
Task: Find a condo or townhouse in Indianapolis between 1500 and 3050 square feet.
Action:
API.set_location("Indianapolis")
API.set_buy_or_rent("buy")
API.select_home_type(["Condo", "Townhouse"])
API.set_min_square_feet(1500)
API.set_max_square_feet(3050)
API.search()
If the program is executable, an F1 score is calculated between the criteria set in the generated program and the criteria set in the reference program. The paper's authors developed a test set of 100 queries asking about home options with different combinations of criteria and provided 10 demo examples. To test the LLMs' ability to use unseen API functions, they intentionally left the usage of some functions out of all the demo examples.
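As a rough sketch, the F1 score can be computed over sets of (function, arguments) pairs extracted from the generated and reference programs (the parsing into pairs is omitted here, and the helper name is ours):
def criteria_f1(generated: set, reference: set) -> float:
    # Each element is a (function_name, arguments) pair, e.g. ("set_min_square_feet", (1500,)).
    if not generated or not reference:
        return 0.0
    true_positives = len(generated & reference)
    precision = true_positives / len(generated)
    recall = true_positives / len(reference)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)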
Finally, a query example is as follows: “Buy a home in Chandler with 5 garage(s) and 4 swimming pools.”
Booking
The trip booking scenario is similar to the home search scenario, but there are more complex dependencies between API function calls. This scenario simulates the process of providing search requests for trip tickets, hotel rooms, or both based on specific requirements (such as location, dates, and required number of tickets).
In this scenario, 20 functions were designed to handle three types of booking needs: trip tickets, hotel rooms, or both. For example, the function for setting the destination is described as follows:
# To set the location for arrival, given a Loc object. This function must be called if booking type is 'trip tickets' or 'both'.
API.set_destination(Loc)
In this scenario, some function calls are required, while others are optional. If any required function calls are missing or if some function calls are in the wrong order, the query will be non-executable. If optional function calls are missing, the query may be unsuccessful.
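The exact dependency rules live in the benchmark's evaluation code; as a purely illustrative sketch, an executability check of this kind could look like the following (the required-call lists here are our assumption, not the benchmark's actual rules):
def is_executable(call_sequence: list, booking_type: str) -> bool:
    # call_sequence holds generated lines such as 'API.set_destination(location_to)'.
    required = {"select_booking_type", "search"}
    if booking_type in ("trip tickets", "both"):
        required |= {"set_origin", "set_destination"}
    if booking_type in ("hotels", "both"):
        required |= {"set_hotel_location", "set_checkin_date", "set_checkout_date"}
    calls = [c.strip() for c in call_sequence if c.strip().startswith("API.")]
    names = [c.split("(")[0].removeprefix("API.") for c in calls]
    # All required calls must be present and search() must come last.
    return required.issubset(names) and bool(names) and names[-1] == "search"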
The test set contains 120 queries and provides 11 examples. Below is an example task for booking hotel rooms:
Task: I'm interested in booking 2 Queen Bed rooms in Garden Grove for 5 nights, from Dec 17, 2022 to Dec 22, 2022. Find me something within my budget of 800 USD.
Action:
API.select_booking_type("hotels")
API.set_num_rooms(2)
API.select_room_type("Queen Bed")
location = Loc("Garden Grove")
API.set_hotel_location(location)
checkin_date = Date(12, 17, 2022)
API.set_checkin_date(checkin_date)
checkout_date = Date(12, 22, 2022)
API.set_checkout_date(checkout_date)
API.set_max_room_price(800)
API.search()
A user query is as follows: “Hi! I’m searching for a hotel in Detroit for 7 nights, from February 15, 2023, to February 22, 2023. I need 1 Queen Bed room, and my budget is up to 980 USD per night.”
Overview of Experimental Results in the Paper
In this study, we used the same prompts as in the paper and conducted experiments on eight large language models (LLMs) across four scenarios: OpenWeather, Cat API, Home Search, and Booking. These eight models are:
- Baichuan2–13b
- Baichuan-Turbo
- Claude-3-Haiku (medium-sized)
- Claude-3-Opus (supersized)
- GPT-3.5
- GPT-4
- Mistral-Large (large-sized, also its largest version)
- Qwen1.5–70b (the largest open-source version available before Elon Musk open-sourced Grok)
Among them, the open-source models Baichuan2–13b and Qwen1.5–70b were deployed on local graphics cards, while the commercial models were accessed through their APIs. In all experiments, the temperature parameter was set to 0 so that the generated text was as deterministic as possible.
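For the commercial models this is just a request parameter. As an example, a single-step query against one of the OpenAI models could be issued like this with the current openai Python client (a sketch, not the exact script we ran):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_actions(prompt: str, model: str = "gpt-4") -> str:
    # Send one single-step prompt and return the raw completion text.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling randomness so runs are reproducible
    )
    return response.choices[0].message.content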
The collection, organization, and plotting of experimental data were done by GPT-4 and a code interpreter, and were manually checked.
OpenWeather
Cat API
Home Search
Booking
In this experiment, GPT-4 was almost always in the leading position in various metrics, showing its strong ability in API calling tasks. In contrast, smaller-scale LLMs performed less satisfactorily and were almost unusable in some cases.
It is worth noting that the recently popular Claude 3 (including the medium-sized Haiku and the supersized Opus versions) and Mistral-Large (the large-sized version) performed comparably to GPT-3.5 in the two real scenarios (OpenWeather and Cat API). However, their performance dropped significantly in the fictional scenarios (Home Search and Booking); in particular, the crash rate of Claude 3 was abnormally high, making it almost unusable.
To rule out the possibility that GPT-4 had secretly lowered its competitors' scores while analyzing the experimental data, we manually checked the run logs, and an interesting phenomenon emerged.
One Simple Prompt
We noticed that Claude 3's performance in the fictional scenarios was almost unusable. This reminded us of its predecessor, Claude 2.1, in the needle-in-a-haystack test, where simply adding one sentence to the prompt, 'Here is the most relevant sentence in the context,' raised the accuracy rate from 27% to 98% [7].
After further examining the run logs, we found a similar situation. Although the paper's prompt provides API descriptions and examples, the models use them differently: GPT-4 understands the API format from the in-context examples and directly returns clean, concise code, whereas some other LLMs (such as Claude-3 Opus, the supersized version) tend to wrap their answers in explanations and steps, and merely showing the API format is not enough to guide them toward pure code output.
Specifically, the prompt in the paper looks like this:
I have the following set of API:
# To set the hotel check-in date, given a Date object. This function must be called if booking type is 'hotels' or 'both'.
API.set_checkin_date(Date)

# Descriptions of other APIs ...

-------------
I have the following set of examples:

Task: I'm planning a trip from Amarillo to Oklahoma City between 2022-03-12 and 2022-03-17. I'm looking for 3 Luxury rooms to stay in Oklahoma City, and also 2 adult and 5 kids bus tickets.
Action:
API.select_booking_type("both")
location_from = Loc("Amarillo")
API.set_origin(location_from)
location_to = Loc("Oklahoma City")
API.set_destination(location_to)
checkin_date = Date(3, 12, 2022)
API.set_checkin_date(checkin_date)
checkout_date = Date(3, 17, 2022)
API.set_checkout_date(checkout_date)
API.set_num_rooms(3)
API.select_room_type("Luxury")
API.set_num_adults(2)
API.set_num_children(5)
API.select_transportation("bus")
hotel_location = Loc("Oklahoma City")
API.set_hotel_location(hotel_location)
API.search()

# Examples of other calls ...

-------------
Task: I live in Corpus Christi, and am planning a trip to Columbus. I need to book 4 adult round trip tickets for bus and 6 hotel rooms for 10 nights in Columbus. I'll leave and check in on 2023-03-17, and check out on 2023-03-27.
Actions:
Given this prompt, GPT-4 understood the API format from the in-context examples and returned clean, concise code directly:
API.select_booking_type("both")
location_from = Loc("Corpus Christi")
API.set_origin(location_from)
location_to = Loc("Columbus")
API.set_destination(location_to)
checkin_date = Date(3, 17, 2023)
API.set_checkin_date(checkin_date)
checkout_date = Date(3, 27, 2023)
API.set_checkout_date(checkout_date)
API.set_num_rooms(6)
API.set_num_adults(4)
API.select_transportation("bus")
hotel_location = Loc("Columbus")
API.set_hotel_location(hotel_location)
API.search()
Other chat-tuned LLMs, by contrast, tend to wrap their outputs in explanations and step-by-step commentary; merely showing the API format in the prompt is not enough.
For example, Claude-3 Supersized would return the following text:
Here are the actions to complete the task:
API.select_booking_type("both")
location_from = Loc("Corpus Christi")
API.set_origin(location_from)
location_to = Loc("Columbus")
API.set_destination(location_to)
checkin_date = Date(3, 17, 2023)
API.set_checkin_date(checkin_date)
checkout_date = Date(3, 27, 2023)
API.set_checkout_date(checkout_date)
API.set_num_rooms(6)
API.set_num_adults(4)
API.select_transportation("bus")
API.set_trip_type("round trip")
hotel_location = Loc("Columbus")
API.set_hotel_location(hotel_location)
API.search()

The key steps are:
1. Set booking type to "both" for hotels and transportation
2. Set origin and destination locations
3. Set check-in and check-out dates for the hotel stay
4. Specify 6 hotel rooms are needed
In fact, the code in the middle of this response is identical to GPT-4's output, but explanatory text has been added before and after it. This kind of redundant explanation is more common in Claude-3 and Mistral-Large, appears occasionally in Qwen, and is less common in GPT-3.5 and GPT-4.
Of course, we could extract the code with regular-expression matching or other methods [8], or we can simply add one sentence at the beginning of the prompt:
<Answer in CODE ONLY. NO COMMENT OR EXPLANATION>
Footnote: In fact, we also tried adding a sentence at the end of the prompt, "(Answer in code only)", but the improvement was not as pronounced as when the sentence was placed at the beginning.
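Concretely, the change is just a one-line prefix on the CoT template shown earlier (a sketch; api_docs, examples, and query are the same slots as in the paper's template):
CODE_ONLY_HINT = "<Answer in CODE ONLY. NO COMMENT OR EXPLANATION>"

def build_prompt(api_docs: str, examples: str, query: str, code_only: bool = True) -> str:
    # Assemble the CoT prompt, optionally prefixed with the code-only instruction.
    prefix = CODE_ONLY_HINT + "\n" if code_only else ""
    return (
        prefix
        + "I have the following set of API:\n" + api_docs + "\n-------------\n"
        + "I have the following set of examples:\n" + examples + "\n-------------\n"
        + "Task: " + query + "\nActions:"
    )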
After adding this sentence, the performance of LLMs in various scenarios has improved significantly:
OpenWeather
The following graph more intuitively shows the comparison of success rates before and after.
- The success rate of Claude3-Opus (supersized) increased to a level almost the same as GPT-4 (only 0.003% difference).
- The success rate of Claude3-Haiku (medium-sized) reached 94%, surpassing GPT-3.5.
Cat API
- For real APIs, the improvement across models is not significant.
- However, the success rate of Claude3-Opus (supersized) even surpassed GPT-4's.
- Although the medium-sized Haiku saw a slight decline, it still maintained a success rate of 90.88%.
Home Search
- For fictional APIs, the performance of almost all LLMs has improved.
- The improvement of Claude3 (including supersized and medium-sized versions) is particularly significant, with success rates surpassing GPT-4.
- The success rate of Mistral-Large also saw an astonishing improvement, reaching the same level as GPT-4.
Booking
- The performance of almost all LLMs has improved.
- The improvement of Claude3 (including the supersized version) and Mistral-Large is particularly significant, with success rates surpassing GPT-4.
Overall, by adding a sentence to the prompt requiring the model to return only code without explanation, the performance of LLMs in API calling tasks has significantly improved, especially in fictional scenarios.
Conclusion
By comparing the performance of eight large language models (LLMs) in four API calling scenarios, this study reveals both the potential and the limitations of LLMs in handling API calls. The experimental results show that these models perform well in real API scenarios but fall short in fictional API scenarios. However, a simple prompt optimization, such as requiring answers in code form only, can significantly improve their performance, in several scenarios even surpassing GPT-4.
This finding suggests that with proper prompt design and optimization, LLMs can handle complex API calling tasks more effectively, providing developers and researchers with more powerful tools. Future work can further explore the API calling capabilities of different LLMs and investigate how to better design prompts and evaluation metrics to fully utilize the potential of LLMs in this field.
References
[1] Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool manipulation capability of open-source large language models. arXiv preprint arXiv:2305.16504, 2023b.
[2] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.
[3] Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023a.
[4] Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
[5] Zehui Chen, Weihua Du, Wenwei Zhang, et al. T-Eval: Evaluating the tool utilization capability step by step. arXiv preprint arXiv:2312.14033, 2023.
[6] Qiantong Xu, Fenglu Hong, Bo Li, et al. On the tool manipulation capability of open-source large language models. arXiv preprint arXiv:2305.16504, 2023.
[7] https://www.anthropic.com/news/claude-2-1-prompting
[8] https://github.com/guidance-ai/guidance
Acknowledgments
Thanks to GPT-4 for helping to review and proofread the format of this paper, thanks to code-interpreter for drawing the figures in this paper, and thanks to Claude 3 Sonnet (large-sized) for conceiving the title and subtitle of this paper.