Best Small Language Models for Accuracy and Enterprise Use Cases — Benchmark Results

Darren Oberst
11 min read · Aug 26, 2024


  • What are the most accurate small language models?
  • What are the most important trends and directions among small language models?
  • How to evaluate differences among models — which should I choose for my project?

We have evaluated 26 of our BLING/DRAGON fine-tuned models from 0.5B to 9B parameters, built on a wide range of underlying base models, including llama-2, llama-3, mistral, phi-3, phi-2, phi-1.5, stablelm, red-pajama, deci, yi, qwen, falcon, tiny-llama, sheared-llama, cerebras and pythia — and, like a chef taste-testing his new creations, we have subjected all of these models to a RAG benchmark test and are now publishing the results.
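
For readers who want to see what a single benchmark item looks like in practice, here is a minimal sketch of running one of these fine-tuned models against a context passage plus a question, assuming llmware's ModelCatalog interface; the model name, passage, question and response key shown are illustrative assumptions, not the actual test harness.

```python
# Minimal sketch of a single RAG-style benchmark item, assuming llmware's
# ModelCatalog interface; model name, passage and response key are illustrative.
from llmware.models import ModelCatalog

# Load one of the fine-tuned BLING/DRAGON models (name is an assumption).
model = ModelCatalog().load_model("bling-phi-3-gguf", temperature=0.0, sample=False)

context_passage = (
    "The Services Agreement is effective as of March 1, 2024. "
    "Payment is due within 30 days of receipt of invoice."
)
question = "What is the payment term?"

# The model is asked to answer *only* from the context passage.
response = model.inference(question, add_context=context_passage)

# In llmware, the generated text is typically returned under 'llm_response'.
print(response["llm_response"])
```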

For more details on the models, the objectives and the testing, please see Part I of this blog post (Building the Most Accurate Small Language Model).

The Results: Small Language Model Accuracy on RAG Benchmark

Take-aways:

  1. Small language models can provide high levels of accuracy and quality, well-suited for most enterprise analytical tasks. While generative AI often suffers from the weight of overly high expectations, we believe that small models are frequently under-estimated in terms of their capability and accuracy, especially when the use case is well-defined, the data pipeline is thoughtfully designed, and a fine-tuned specialized model is used. At times, with larger models, the solution is predicated on relying upon the “magic” of the big model, but with smaller models, success is often dependent on building a solid overall pipeline and workflow. When those conditions are present, we believe that small language models are more than “good enough” for most tasks, especially when considering the 10X — 100X cost benefit and the flexibility in private and self-hosted deployment.
  2. Generally, small model accuracy and capabilities have improved over the last 12 months. Of the 6 best-performing models, 5 were based on foundation models released in 2024: phi-3-mini (3.8b), phi-3.5-mini (3.8b), qwen-2–7b, mistral-0.3–7b and 01-ai-9b. The average improvement of the 2024-vintage models over the 2023-vintage models is especially interesting:
  • Overall score: 98.4 vs. 96.125 -> an increase of 2.275 correct answers out of 100 questions, which could be read either as a ~2% improvement in accuracy (real but perhaps not exciting) or as roughly a 50% reduction in the number of inaccuracies. We believe this is a significant improvement in a year, but were a little surprised, after seeing 100% for phi-3, that other models are still in the 98–99% range. After phi-3 "broke the test," we thought that we would immediately need to increase the difficulty of the test (especially with new state-of-the-art 7–9B base models), but the test still generated inaccuracies in the other models tested. With the top models now averaging 98.4% on the test, we wonder whether we will see meaningful accuracy improvements on 'standard' questions over the next 6–12 months.
  • Not Found recognition: 87% vs. 83% -> minor continued improvement in identifying whether a question can be answered from the passage. Generally, the models are effective at identifying a question that cannot be answered from the context passage. However, most of the models can still be "tricked" occasionally, especially when there are strong underlying statistics around a particular topic (e.g., a famous person or event), into answering from 'background knowledge' or incorrectly applying a piece of information in the context passage. A good example is to offer a context passage in which Elon Musk is mentioned as a 'distractor' (literally and figuratively) alongside several other people and companies, and then ask for the name of the CEO of one of the other companies, which is not present in the text; many models will reply with "Elon Musk" rather than the correct "Not Found" (see the sketch after this list).
  • Boolean (Yes/No) accuracy: 94% vs. 90% -> models in the 2nd half of 2023 improved Yes/No question answering significantly, with continued progress in 2024. The ability to answer a boolean question consistently is extremely important in most enterprise workflows, and we believe that small models are generally up to this task.
  • Math/Logic accuracy: 83% vs. 71% -> notable improvements in Yi and Qwen-2; this was one of the key areas of major change from models released in 2023. This will be an expanded area of focus in future testing, and it is notable that small models can now handle basic math and logic substantially better than earlier-vintage models.
  • Complex/specialized questions: 4.3 vs. 3.4 (qualitative score between 1–5) with most of the improvement in areas like table-reading, multiple-choice, causal/why questions, and multi-part extraction. This will be expanded in future versions of the testing, as small models are now showing meaningful capability to handle more complex and specialized question types.
  • Summarization: 3.8 vs. 3.4 (qualitative score between 1–5) — slight improvements (perhaps) but not especially notable changes in the quality of the summarizations.
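
To make the 'Not Found' distractor scenario above concrete, here is a minimal sketch of such a test item, again assuming llmware's ModelCatalog interface; the passage, question, company names and scoring check are illustrative assumptions, not items from the actual benchmark.

```python
# Sketch of a "Not Found" distractor test item (illustrative, not from the
# actual benchmark): the passage names Elon Musk, but the question asks for
# a fact that is *not* in the passage, so the correct answer is "Not Found".
from llmware.models import ModelCatalog

model = ModelCatalog().load_model("bling-phi-3-gguf", temperature=0.0, sample=False)

context_passage = (
    "At the conference, Elon Musk spoke about electric vehicles, while "
    "representatives from Acme Robotics and Zenith Software discussed "
    "their product roadmaps for 2025."
)
question = "Who is the CEO of Acme Robotics?"

response = model.inference(question, add_context=context_passage)
answer = response["llm_response"].strip()

# Score: the model should decline rather than guess "Elon Musk".
is_correct = "not found" in answer.lower()
print(answer, "->", "PASS" if is_correct else "FAIL")
```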

3. The 'better performance in smaller packages' trend continues — especially notable are phi-3 (3.8B) and qwen-2 (1.5B). Phi-3 (3.8B) is overall the most accurate model, out-performing 6–9B parameter models on accuracy, while Qwen-2 (1.5B) would be competitive with many 2023-vintage 7B-parameter models. As more foundation model builders look to provide better small model bases, and use the same training data as their larger models, we expect to continue to see improvements in high-quality 1–3B parameter models, increasingly able to compare with 7B models. The huge benefit of models in the sub-2B parameter size is CPU inference speed and the ability to run on smaller memory footprints; models in the sub-2B range will be the most likely 'edge device' models, and models like Qwen-2–1.5B offer an amazing combination of small size and relatively high quality.

4. Big enough? Diminishing returns on accuracy — while there is a notable accuracy improvement generally in stepping from 1B -> 3B -> 7B, we wonder if phi-3 is an indicator that accuracy starts to see diminishing returns beyond the 4–5B parameter range. The yi-6B (v1) was our most accurate model from the 2023 vintage, and while the yi-9B (v1.5) is a very strong and exciting model, it is actually slightly less accurate than the 6B. In all of our testing over the last year with llama-2–13B, we did not see any improvements in fact-based question-answering (and even saw some areas with a greater propensity to hallucinate).

The 2023 models that were 7B are now stretching into 8B and 9B in 2024 (which seems pretty cool and a good way to keep searching out the optimal combination of size/performance), but based on what we have seen in our fine-tuning and testing, we have not observed a clear benefit, or an obvious area where 8B–9B beats 7B.

To be provocative: for accurate extractive-based tasks, are model sizes in the 3–6B range ultimately the right size for performance?

As far as we know, no other base models have replicated phi-3's unique 3.8B parameter size, but we wonder whether it sits right in the middle of the sweet spot for optimizing size/performance, offering the 'best of both worlds' between the mini models (1–3B) and the 6–9B range.

5. Evaluate with quantized models going forward — our initial testing in 2023 was performed with the 'unquantized' Pytorch models, but in 2024, all of our testing is focused on the quantized versions (generally Q4_K_M, i.e., '4-bit' quantization), generally using GGUF inference. While the Pytorch unquantized models may yield improvements in accuracy, if virtually all real-world inference scenarios are going to deploy quantized versions of the models, then how well the model performs when quantized is an important variable. Generally, we have seen consistently high performance with Q4_K_M quantization. The one outlier is Llama-3 and Llama-3.1, where we see meaningful degradation in accuracy that we believe is likely attributable to challenges in quantization. (On an adjacent note, we are also using temperature=0.0 and sampling=False in all of our testing now, so that results should be deterministic and completely reproducible from test run to test run.)
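
As a concrete illustration of this setup, here is a minimal sketch of deterministic inference on a Q4_K_M GGUF file using llama-cpp-python; the model path, prompt wrapper and fixed seed are our assumptions for reproducibility, not the benchmark's exact configuration.

```python
# Sketch of deterministic inference on a 4-bit (Q4_K_M) GGUF model using
# llama-cpp-python; model path and prompt template are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="dragon-mistral-0.3-Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,       # context window large enough for passage + question
    seed=42,          # fixed seed, assumed here for reproducibility
    verbose=False,
)

prompt = (
    "<human>: Context: Payment is due within 30 days of receipt of invoice.\n"
    "What is the payment term?\n<bot>:"
)

# temperature=0.0 makes decoding greedy, so repeated runs of the same test
# item should produce identical outputs.
output = llm(prompt, max_tokens=64, temperature=0.0)
print(output["choices"][0]["text"].strip())
```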

6. Some 2nd and 3rd generation models are better, but not always and not usually on all dimensions — it may not be a straightforward path for newer releases to improve on current versions. We see this as a major potential challenge for "universal" models that attempt to excel at all tasks: especially in smaller model packages, it will be difficult to optimize training on all potential use cases, and we envision either that base models will increasingly "sub-specialize" to drive improvements in particular areas, such as math/logic, multi-lingual capabilities or function-calling, or, alternatively, that each release will come with some areas of regression in performance.

Examples:

  • Mistral 0.3 compared to the Mistral 0.1 version: a 3-point increase in overall accuracy, from 96.5% to 99.5%, although notably other dimensions were unchanged, or potentially even slightly worse, compared with the original v0.1 finetune from 2023.
  • yi-9b (v1.5) is an awesome model, but slightly less accurate than the yi-6b (v1.0 in 2023).
  • phi-3.0 scores as well as phi-3.5: we saw no clear benefit to phi-3.5 in testing, although performance is comparable and there were no notable regressions either.
  • llama-2 outperforms llama-3 on most dimensions of our testing. We realize that this may be viewed as controversial and out-of-step with many others, but this is our hands-on experience: despite many attempts to adapt hyper-parameters and other training parameters for llama-3 and llama-3.1, we see more accurate and consistent results from quantized versions of llama-2.

7. Math/logic — this is a key area of notable improvement in small foundation models over the last year, led by Yi and Qwen2. When we started instruct-training small models to read contracts in early 2023, we found that the models were generally very poor at the "everyday" common-sense math/logic questions in the test set; even when provided with a reasonable number of fine-tuning samples, the models seemed to lack the solid pattern recognition needed to answer basic questions such as "If the payment is due in 3 days, and it is currently January 15, when is the payment due?" or basic sorting and ranking questions. This is one of the bigger changes in capability in small models over the last year. Yi and Qwen2 are stand-outs with notably strong performance in this area, including table-reading (complex/specialized questions), while we observed virtually no positive changes for Mistral or Llama.
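
As a sketch of how scoring such math/logic items might work, the snippet below shows a simple normalized exact-match check on a date-arithmetic question like the one above; the questions, gold answers and stubbed "model answers" are illustrative only, not drawn from the actual test set.

```python
# Sketch of exact-match scoring for math/logic test items; the questions,
# gold answers and stubbed model answers below are illustrative only.
import re
from datetime import date, timedelta

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace before comparing."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

# The gold answer for the example question can be computed deterministically.
gold_due_date = date(2024, 1, 15) + timedelta(days=3)  # -> January 18

test_items = [
    ("If the payment is due in 3 days, and it is currently January 15, "
     "when is the payment due?", gold_due_date.strftime("%B %d")),
    ("Which invoice is larger: $4,500 or $7,200?", "$7,200"),
]

# Hypothetical model outputs, stubbed in place of a live inference call.
model_answers = ["January 18", "$4,500"]

score = sum(
    normalize(expected) in normalize(answer)
    for (_, expected), answer in zip(test_items, model_answers)
)
print(f"math/logic score: {score}/{len(test_items)}")
```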

8. Qwen2–0.5B — the smallest and potentially most ground-breaking of all of the releases. In all of our testing and training work in 2023, we tried to get to the smallest possible model that would demonstrate a consistent level of instruct question-answering, and generally struggled to get good instruct behavior on bases smaller than 1B parameters. So, we are especially excited by the Qwen2–0.5B. At times, the small size will create suboptimal output, but this is a remarkable little model, and it breaks what we thought was perhaps a "sound barrier" type of law of physics for language models: getting high-quality instruct-following inference behavior out of a 500 million parameter model. If a model this small can be fine-tuned to 'read' complex materials, formulate reasonable responses, and perform extraction and summarization tasks, what does it tell us more broadly about the nature of LLMs, and perhaps about the modeling of language patterns overall?

9. Phi-3 and Phi-3.5 — the only models to get perfect scores on the 100-question test set. This is perhaps a coincidence, and it is debatable how much better 100% is than 98–99.5%, but it is remarkable that two completely separate fine-tunes of Phi-3 and Phi-3.5 both achieved perfect scores, and both were tested with the 4-bit quantized GGUF version of the models. The fact that these models are MIT-licensed, finetune well, quantize well, and are part of a supported series from Microsoft makes them the continued "pound-for-pound" champion, and our overall go-to model.

10. Base model license — growing convergence around standard Apache 2.0 and MIT licenses, which we see as a boon for the industry. Notably, Qwen and Yi have moved to Apache 2.0, and Microsoft moved to MIT. It seems like the window for small companies to make base foundation models is almost closed now, with large companies like Microsoft, Alibaba, Facebook, and Google having a steady release train of permissive, standard-term, high-quality models, and with companies like Mistral and 01-AI likely to be the 'pure play' champions.

Conclusion — we would be reluctant to draw too broad a conclusion as to the "best model," as a lot depends upon the use case and the target behavior. Also, this is an evolving target, as the highest-performing model today will likely be surpassed 6 months from now. So, our conclusions are about which models seem most interesting to us and where we will be spending the most time in our fine-tuning, testing and experimentation over the remainder of 2024:

  1. Phi-3 and Phi-3.5 — still the "pound for pound" champions in our view for accuracy and quality, with the performance of a ~7B model but, quantized, fitting in a package of only ~2.4GB. We previously favored the stablelm-3b-4e1t (2.8b), which is still the highest performing in that size category, but we are shifting most of our fine-tuning experimentation to phi-3 mini as an alternative.
  2. Qwen2–1.5B (and Qwen2–0.5B) — we will be spending a lot of time on fine-tunes, function-calling and experimentation, as this looks to be the best-performing small model (under 2B). We will also be doing our own experimentation work on the Qwen2–0.5B to see how far its performance can stretch when specialized for very specific tasks — what tasks can a model this small do most effectively? TinyLlama is our current small model champion, and we will continue to do a lot of work on it, but we will be running a lot of comparisons with Qwen2 going forward.
  3. Qwen2–7B and Yi-9B — demonstrated excellent performance on table-reading and math/logic, and appear to be the best of the new generation of 7B-9B models in our assessment. While Mistral-0.3 was just as accurate, it did not perform as well in these areas, so we will likely be doing more experimentation on Qwen2 and Yi.

[Fine-Print: all tests were performed on llmware finetunes of the underlying base models, so this is testing the performance of our finetune, and not, properly speaking, the underlying foundation model. By fine-tuning multiple base models in a substantially similar way, we believe the comparison yields meaningful insights into the capabilities of the underlying model and provides a 'level playing field' for evaluation, but in some cases, the relative performance difference may speak more to the efficacy of combining that base model with our fine-tuning materials and methods. In short, please feel free to credit the base model for all of the good things we said, and please feel free to blame our fine-tune for any of the bad things!]

[You might be asking … Where is Google Gemma? Where is Phi-3–7B/14B? Coming soon…]

To check out some of our small, specialized fine-tuned models — none of which claim to be AGI but humbly aspire to be really useful fact-based business tools (when used in conjunction with sound generation loops and well-designed data pipelines) — please go to our repo home page on HuggingFace — LLMWare RAG Instruct Models.

We will update this blog from time-to-time with new models and test results.

For more information about llmware, please check out our main github repo at llmware-ai/llmware/.

Please also check out video tutorials at: youtube.com/@llmware.

You can also contact us on our website: www.llmware.ai.
