Llama 3 70B Outperforms GPT-4o in Function Calling with Friendli Tools

FriendliAI Tech & Research
Published in FriendliAI
Aug 20, 2024

Friendli Tools Series: Part 3 of 3

At FriendliAI, we’re on a mission to make cutting-edge generative AI technologies accessible to everyone. That’s why we are thrilled to introduce Friendli Tools, a powerful feature that enables accurate function calling in most open-source language models!

Thanks to Friendli Tools, Llama 3 70B on Friendli performs on par with OpenAI GPT-4o and Fireworks Firefunction v2 in function calling. It even excels in complex tasks like “parallel multiple” function calling. Our outstanding function calling capabilities, combined with our cost efficiency, position Friendli Suite at the forefront of AI agent development.

Highlights of Friendli Tools:

  • Function Calling Accuracy: Excels in “parallel multiple function” tasks, outperforming GPT-4o even without fine-tuning.
  • Cost Efficiency: Llama 3 70B on Friendli matches the overall performance of GPT-4o at roughly 5% of the cost ($0.80 vs. $15 per 1M output tokens).

This third and final post concludes our Friendli Tools blog series. Read on for Friendli Tools’ impressive function calling benchmark results!

The Gorilla Benchmark

We selected the Gorilla LLM function calling dataset to evaluate the function calling accuracy of Friendli Tools. The dataset covers a wide range of domains including math, sports, and finance, and offers a thorough evaluation of models’ function calling capabilities in different real-world contexts.

Consider a scenario where a user seeks assistance with real estate inquiries in three cities. A thousand scenarios like this one are given to the model to evaluate its function calling accuracy: for each one, the model has to answer the user query with the appropriate function calls.

Here’s the example:

Question:

Can you help me find a property in San Francisco, CA that is a condo with 2 bedrooms and fits within my budget range of $500,000 to $800,000? After that, could you also provide an estimated value for a villa in Los Angeles, CA with 3 bedrooms that is 5 years old? Lastly, I would also like to know the estimated value of an apartment in New York, NY with 1 bedroom that is 10 years old.
Answer:

{
  "realestate.find_properties": {
    "location": ["San Francisco, CA", "SF, CA"],
    "propertyType": ["condo"],
    "bedrooms": [2],
    "budget": [
      {
        "min": [500000],
        "max": [800000]
      }
    ]
  },
  "property_valuation.get_1": {
    "location": ["Los Angeles, CA", "LA, CA"],
    "propertyType": ["villa"],
    "bedrooms": [3],
    "age": [5]
  },
  "property_valuation.get_2": {
    "location": ["New York, NY", "NY, NY"],
    "propertyType": ["apartment"],
    "bedrooms": [1],
    "age": [10]
  }
}

Don’t you find the example quite challenging? This example is included in the “parallel multiple function” evaluation category, which Llama 3 70B on Friendli excels at!

The function calling scenarios of the Gorilla benchmark fall into four categories: “simple,” “multiple,” “parallel,” and “parallel multiple.” The most complex category, “parallel multiple,” combines the other two non-trivial ones: “multiple” tests whether the model can pick the correct function to call out of 2 to 4 candidates, while “parallel” requires executing several function calls in response to a single user query.
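To make the scoring concrete, here is a rough Python sketch of how a “parallel multiple” answer might be checked against ground truth like the example above, where each expected parameter lists its acceptable values. This is only an illustration of the idea, not the official Gorilla evaluator, and the function names shown are simplified for the sketch.

def call_matches(predicted_args: dict, expected_args: dict) -> bool:
    # Every expected parameter must appear with one of its accepted values.
    # (Nested parameters such as the "budget" object above would need
    # recursive handling; this sketch only covers flat arguments.)
    return all(
        predicted_args.get(name) in accepted
        for name, accepted in expected_args.items()
    )

def parallel_multiple_correct(predicted: dict, expected: dict) -> bool:
    # A "parallel multiple" answer is correct only if every expected
    # function call is produced with valid arguments.
    return all(
        name in predicted and call_matches(predicted[name], expected[name])
        for name in expected
    )

# Example: a simplified version of one expected call from the scenario above.
expected = {
    "property_valuation.get": {
        "location": ["Los Angeles, CA", "LA, CA"],
        "propertyType": ["villa"],
        "bedrooms": [3],
        "age": [5],
    }
}
predicted = {
    "property_valuation.get": {
        "location": "Los Angeles, CA",
        "propertyType": "villa",
        "bedrooms": 3,
        "age": 5,
    }
}
print(parallel_multiple_correct(predicted, expected))  # True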

Benchmark Results

Ready for the Gorilla benchmark results? We compared four models: Llama 3 8B, Llama 3 70B, Firefunction v2, and GPT-4o. The Llama 3 models were tested on the Friendli Suite.

Prepare to be amazed by the benchmark showdown! The results reveal that Llama 3 70B on Friendli consistently achieves top-notch accuracy. It particularly stands out in parallel multiple function calling, outperforming the next-best model by a significant margin of 7%. Notably, the next-best model was created by fine-tuning Llama 3, whereas the Llama 3 running on Friendli is the vanilla model.

These results show that, with Friendli Tools, the original Llama 3 models are comparable to leading models fine-tuned specifically for function calling. Friendli Tools offers an innovative way to enhance LLM function calling without the need for fine-tuning.

Friendli Tools supports precise function calling across most language models. Because this capability is built directly into our LLM inference engine, models like Llama 3 70B deliver remarkable function calling performance out of the box. Friendli Tools simplifies working with custom function calling models, letting you build high-performing AI agents with no fine-tuning required.

How to get started

Curious about how to begin using Friendli Tools? Our documentation offers detailed guides to simplify your onboarding. Friendli Tools is hosted on the Friendli Serverless Endpoints API, which is fully OpenAI-compatible: you can use it by switching to our client, or simply by changing the model name and base URL in your existing OpenAI client, as in the sketch below.
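As a minimal sketch, here is how a call through the standard OpenAI Python client might look. The base URL, model identifier, and tool definition below are assumptions for illustration; consult the Friendli documentation for the exact values to use.

from openai import OpenAI

# Assumed values for illustration — check the Friendli docs for the exact
# base URL, model name, and token setup.
client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed Friendli endpoint
    api_key="<YOUR_FRIENDLI_TOKEN>",
)

# A hypothetical tool definition, mirroring the property-valuation example above.
tools = [
    {
        "type": "function",
        "function": {
            "name": "property_valuation_get",
            "description": "Estimate the value of a property.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "propertyType": {"type": "string"},
                    "bedrooms": {"type": "integer"},
                    "age": {"type": "integer"},
                },
                "required": ["location", "propertyType", "bedrooms"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meta-llama-3-70b-instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": "Estimate the value of a 3-bedroom villa in "
                       "Los Angeles, CA that is 5 years old.",
        }
    ],
    tools=tools,
)

# The model replies with structured tool calls rather than free-form text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)

Because the API surface matches OpenAI’s, existing agent code can usually be pointed at Friendli by changing only the client’s base_url, api_key, and model fields.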

Concluding the Friendli Tools Series

Friendli Tools is a fundamental feature for building fast and accurate agents. We’re excited to put this exceptional AI technology into the hands of our community and can’t wait to see what you create!

Explore function calling by reading our full blog series on Friendli Tools. Begin your journey with Part One: Function Calling — Connecting LLMs with Functions and APIs, an essential guide to the basics, then continue with Part Two: Building AI Agents Using Function Calling with LLMs to learn how to build AI agents.

The future of building intelligent AI agents is here — Start building today on Friendli!
