Model Mania: A Battle of Open Source LLMs

8 popular open-source LLMs compete head-to-head in a series of friendly contests

Team Octo
OctoAI
6 min read · Mar 29, 2024


The Competition Heats Up

It’s no secret that open-source LLMs are increasingly competitive with proprietary models. Many developers are asking: why pay top dollar for GPT-3.5 or GPT-4 if a smaller open-source LLM will get the job done?

But with more than 68,000 text-gen models to pick from, finding the right model isn’t easy. AI leaderboards can be a useful indicator of overall quality, but they can be gamed, and they deal in generalities. What they don’t tell you is how a model will perform for your specific use case.

Gen AI app builders then find themselves in the position of having to develop their own LLM leaderboards, testing and tuning models to achieve the desired outcomes for their app or service. So far this has largely been a manual process (though plenty of work is underway to ease it).

Not wanting to sit on the sidelines, we wondered how OctoAI’s hosted LLMs would perform in different prompt scenarios. Will top-seeded Mixtral 8x7B dominate? Will the upstarts at Nous Research take the trophy? There’s only one way to find out!

In the spirit of the season 🍀⛹🏼‍♂️ we’re bringing our hosted models into the arena for a series of friendly contests we call Model Mania.

Welcome to “Model Mania,” the ultimate showdown where the giants of OSS LLMs go head-to-head, shooting for the highest accuracy in a series of thrilling contests.

Rules of the Game

The OctoAI team set up a model evaluator prototype to evaluate our hosted LLMs using…LLMs. We began by crowdsourcing fun, unique system prompts from the AI superfans at OctoAI. Each system prompt includes specific criteria for desired behavior. The format looks like this:
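(The original screenshot isn’t reproduced here, so the sketch below is our illustrative guess at the shape of one of those prompt pairs, using the “Riley” scenario from the response further down. The exact wording is hypothetical.)

```python
# Hypothetical reconstruction -- the crowdsourced prompts themselves
# aren't published, so the wording here is an illustrative guess.
messages = [
    {
        "role": "system",
        "content": (
            "You are Riley, an enthusiastic golden retriever. "
            "Criteria: stay in character, answer in one or two short "
            "sentences, and radiate doggy energy."
        ),
    },
    {"role": "user", "content": "Riley, are you a good boy?"},
]
```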

Then, both the system and user prompt are processed by a handful of open source LLMs and responses are generated. Note: while OctoAI also supports experimental models, custom models, and fine-tunes, for this experiment we kept to our publicly hosted APIs. Responses look like this:

Riley is a very good boy
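For the curious, here’s roughly what the generation step looks like in code: a minimal sketch assuming OctoAI’s OpenAI-compatible chat completions endpoint at text.octoai.run and an OCTOAI_TOKEN API key; adjust both if your setup differs.

```python
import os
from openai import OpenAI

# Assumption: OctoAI exposes an OpenAI-compatible chat endpoint at this
# base URL and authenticates with an OCTOAI_TOKEN environment variable.
client = OpenAI(
    base_url="https://text.octoai.run/v1",
    api_key=os.environ["OCTOAI_TOKEN"],
)

completion = client.chat.completions.create(
    model="nous-hermes-2-mixtral-8x7b-dpo",  # one of the hosted contestants
    messages=messages,  # the system/user pair sketched above
)
print(completion.choices[0].message.content)  # e.g. "Riley is a very good boy"
```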

Finally, we used a large, all-purpose LLM to evaluate the outputs against the system prompt and rank the top three responses. Response quality can vary: submit the same prompt multiple times and the winner won’t always be the same. In some cases there’s so much variability that it’s a virtual draw, so we’re only sharing matchup results where a model won best of three attempts. The podium looks like this:

Nous-hermes-2-mixtral-8x7b-dpo ranks highest for affirming Riley’s goodness and delivering that doggy energy
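We haven’t published the evaluator prompt itself, so here’s a hedged sketch of how the LLM-as-judge pass and the best-of-three rule could be wired up; the judge wording and the helper names (judge_prompt, best_of_three) are our own assumptions, not the prototype’s actual code.

```python
from collections import Counter

def judge_prompt(system_prompt: str, candidates: dict[str, str]) -> str:
    """Build a single evaluation prompt for the judge model (wording assumed)."""
    numbered = "\n".join(
        f"{i + 1}. [{name}] {text}"
        for i, (name, text) in enumerate(candidates.items())
    )
    return (
        "You are an impartial judge. Rank the top three responses below "
        "against the criteria, best first, and briefly explain why.\n\n"
        f"Criteria:\n{system_prompt}\n\nResponses:\n{numbered}"
    )

def best_of_three(winners: list[str]) -> str | None:
    """Return the matchup winner only if one model took first place
    in at least two of three attempts; otherwise it's a virtual draw."""
    name, count = Counter(winners).most_common(1)[0]
    return name if count >= 2 else None

# best_of_three(["mistral-7b-instruct",
#                "codellama-7b-instruct",
#                "codellama-7b-instruct"])  # -> "codellama-7b-instruct"
```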

Now that you know the rules, let’s blow the whistle and get started. Who will cut down the net and be crowned champion for each prompt? Read the play-by-play to find out!

Round 1: Enter the Dungeon

Rolling our virtual 12-sided die

Highest Ranked Response

“As you enter the evil wizard’s tower, you find yourself in a grand hall with high ceilings and ornate stonework. The walls are adorned with dark, ominous tapestries, and the floor is covered in a thick layer of dust. The air is thick with the stench of decay and rot, and you can hear the distant sound of scuttling creatures. As you look around, you notice a group of guards standing at the far end of the hall. They are heavily armed and look like they mean business. They eye you warily, and one of them steps forward, blocking your path. ‘Halt, traveler,’ he says in a gravelly voice. ‘What business do you have here?’ You have several options:

1. Try to sneak past the guards.
2. Attempt to reason with the guards and convince them to let you pass.
3. Draw your weapon and prepare for a fight.

What would you like to do?”

The winner is…

After a strong start from Mistral-7B in query 1, codellama-7b-instruct takes the crown in the final two attempts

Round 2: Monday Motivation

Consider taking a day off, my friend

Highest Ranked Response

Elon Musk (bold, ambitious, and visionary): “You’re not living up to your full potential. One email on a Monday morning? That’s not enough. You’re capable of more. Re-evaluate your priorities and focus on the most important tasks that will move the needle for your goals. Don’t let a lack of energy hold you back. Remember, the competition is always working, so you must be relentless in your pursuit of success.”

Sun Tzu (strategic, wise, and thoughtful): “Consider the long-term consequences of your actions. One email may seem insignificant, but it could lead to a chain reaction of missed opportunities. Reflect on your energy levels and find ways to conserve them for the most critical battles. Plan your day with intention and focus on the battles that truly matter.”

Jensen Huang (innovative, passionate, and charismatic): “Embrace the challenge and find a creative solution! Instead of sending just one email, think outside the box. Use this as an opportunity to connect with your team in a unique way. Send a video message or create a visually stunning presentation. Show your passion and enthusiasm, and watch as your energy and motivation come back to you tenfold.”

The winner is…

🥇 Mistral-7b-instruct swept the results in all three queries

Round 3: The Dreaded Code Review

Will it be a “slam dunk” for CodeLlama?

Highest Ranked Response

The code you provided is a simple Java program that prints “Hello, World!” to the console. However, there is an error in the code that needs to be fixed. The error is in this line:

    System.out.prinln(“Hello, World!”);

The method “prinln” is misspelled and should be “println”. Here is the corrected code:

    class HelloWorld {
        public static void main(String[] args) {
            System.out.println(“Hello, World!”);
        }
    }

The corrected code will produce the expected output “Hello, World!” in the console.

Top-seeded 🥇 codellama-7b-instruct comes out on top

How did your bracket hold up?

After three rounds of prompting action, 8 out of 9 models had top-three finishes — not bad! Mistral and Mixtral tended to outperform their Llama counterparts on tasks that called for creativity, and (unsurprisingly) CodeLlama came out on top for our debugging use case. That it was the low-parameter 7B variant is encouraging for anyone with lightweight debugging use cases (similar to the prompt example): it is faster, more affordable, and can even run locally if required.
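As a rough illustration of that local option, here’s a sketch using llama-cpp-python with a quantized GGUF build of codellama-7b-instruct; the model file name is a placeholder for whichever quantization you download.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The GGUF file name below is a placeholder for whichever quantized
# build of codellama-7b-instruct you download.
llm = Llama(model_path="codellama-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise code reviewer."},
        {"role": "user",
         "content": 'Find the bug: System.out.prinln("Hello, World!");'},
    ]
)
print(result["choices"][0]["message"]["content"])
```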

Our humble evaluator prototype isn’t quite ready for the big dance, but if you’d like to share more about your model evaluation use cases, get in touch with our team on Discord.

If you’d like to try our hosted text-gen models yourself, sign up for a free account at octoai.cloud.

Don’t forget to follow us on Medium for more fun LLM content 🎉
