Building the Most Accurate Small Language Models — Our Journey

Darren Oberst
7 min read · Aug 26, 2024


In mid-2023, we were working on the launch of our open source project — llmware — and were looking for a good instruct RAG ‘testing’ model that we could run comfortably on our laptops for contract analysis, invoice extraction, earnings statement question-answering, and other business-focused, fact-based LLM use cases. We were surprised to find a gap in the market: very few “CPU-optimal” LLMs in the 1–3B parameter range had instruct training and could be used effectively for fact-based question-answering. To the extent that the base models existed, they were generally either ‘research artifacts’ for technical exploration, or they had some ‘chat’ orientation — few seemed to be designed for business-oriented use cases.

As a result, we launched our BLING model series, focused on bringing high-quality “grounded source” business-focused instruct training to very small models of less than 6B parameters. After that, we launched our DRAGON series, targeting models with 6–9B parameters. All of these models look beyond chatbot and consumer-oriented dialog use cases, and envision models deployed as integrated components of more complex workflows, where consistency, predictability, privacy, fact-based responses, model “swappability” and low-compute efficacy are required.

Each of the 26 models in the llmware BLING and DRAGON series shares a common set of characteristics in its training objectives and design:

  1. Optimized for Grounded-Source Question-Answering — like the classic SAT reading comprehension tests — read a passage and answer questions based on the passage (no ‘open context’ or ‘background knowledge’);
  2. Domain adaptation for Complex Business, Financial and Legal documents — focused on the materials that businesses use every day, not consumer or chat oriented materials;
  3. Short, clear answers — “just the facts” — by keeping responses short, it is easier to post-process an LLM response in a programmatic workflow, and it has the side-benefit of accelerating inference times on smaller machines;
  4. Better to say “I don’t know” than make something up — high-quality negative sampling in training guides the model to use a “Not Found” response for consistent handling of questions that cannot be answered from the grounded source, rather than relying on background knowledge or, oftentimes, undefined behavior — reducing a common source of hallucinations;
  5. Short-simple prompting — no explicit instructions expected — enables clear, simple prompt construction and consistency of behavior — and makes it easy to “swap” out one BLING/DRAGON model for another since the input/output behavior is the same;
  6. Deterministic generation — the models are optimized and intended for use with ‘deterministic’ generation, with very low temperatures, or ideally temperature=0.0, and sampling turned off, to further minimize potential hallucinations and variation (see the sketch after this list). (This makes the models less chatty and arguably less fun, but a lot more consistent!);
  7. Targeting shorter contexts (!) — optimized for a range of 500–1000 tokens, generally one paragraph to 1–2 pages and no more. While model context windows have grown, including for smaller models (from 2048 to 4096, and now 8192 and beyond), we have stuck to the goal of accuracy, and as context windows grow, so do inaccuracies and inconsistencies. Rather than throwing more context into the model, we have generally viewed this as a data pipeline issue: conduct an effective ‘first pass’ to narrow the relevant context, with passages between 100–1000 tokens being the window to target in fine-tuning for optimal accuracy.
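
Below is a minimal sketch of this usage pattern with Hugging Face transformers. The model id and the “<human>: … <bot>:” prompt wrapper are assumptions drawn from the llmware model cards, so check the card of the model you actually load; the decoding settings simply reflect the deterministic, short-answer approach described above.

```python
# A minimal sketch of the prompting and decoding pattern described above.
# Assumptions: the model id and the "<human>: ... <bot>:" wrapper follow the
# llmware BLING model cards; verify against the card of the model you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "llmware/bling-1b-0.1"  # illustrative small BLING checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def ask(context: str, question: str) -> str:
    # Short, simple prompt: grounded passage plus question, no long instructions.
    prompt = f"<human>: {context}\n{question}\n<bot>:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Deterministic generation: sampling off (greedy decoding), short output.
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

passage = "Total invoice amount due: $12,450, payable within 30 days of receipt."
answer = ask(passage, "What is the total invoice amount?")

# "Not Found" handling: a consistent, easy-to-check response lets a workflow
# discard passages that do not contain the answer.
if answer.lower().startswith("not found"):
    answer = None
```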

We have purposefully endeavored to bring this training program and objective to every high-quality open source foundation model under 10 billion parameters, especially as new ones are released, including: phi-3, llama-2, mistral, yi-1, yi-1.5, phi-2, phi-1.5, qwen, stablelm, deci, togethercomputer/red-pajama, llama-3, pythia, cerebras, falcon, tinyllama, sheared-llama, etc.

For different models, we optimized the mix of training materials, tokenizer-specific parameters (end-of-text, trailing space), and often the learning rate and training hyper-parameters, but generally each model was created with the goal of providing consistent prompting and output behavior, on the premise that it would be easy to “substitute” one model for another: to replace a 1B model used in testing with a 7B model used in production, or to migrate from a “llama-2” to a “qwen-2” and get consistent results — and not require any change in the underlying process workflow.
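
As a rough illustration of that swappability, the sketch below keeps the prompt wrapper and decoding settings fixed and changes only the checkpoint name between testing and production. The model ids are illustrative; check the llmware Hugging Face page for the current list.

```python
# Sketch of "swappable" models: the same prompt wrapper and deterministic
# decoding, with only the checkpoint name changing between testing and
# production. Model ids are illustrative; check the llmware Hugging Face page.
from transformers import pipeline

def build_qa(model_id: str):
    generator = pipeline("text-generation", model=model_id)

    def qa(context: str, question: str) -> str:
        prompt = f"<human>: {context}\n{question}\n<bot>:"
        result = generator(prompt, do_sample=False, max_new_tokens=100,
                           return_full_text=False)
        return result[0]["generated_text"].strip()

    return qa

qa_test = build_qa("llmware/bling-1b-0.1")          # small model for laptop testing
qa_prod = build_qa("llmware/dragon-mistral-7b-v0")  # larger model for production
```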

“The 2.1% increase in MMLU score is really interesting (?), but will this work for my use case?”

The second objective we had with the BLING and DRAGON series was trying to provide an answer to the most basic question that we would get from business stakeholders on GenAI projects, namely: how accurate should I expect this to be? Is it 80%, 90%, 95% — more or less?

The existing benchmarks, while useful common measuring points for the technical community, were silent on the only question that we ever received from clients and partners — is this going to work, and is this use case worth pursuing given the likely accuracy level that we can achieve?

As a result, we created a ‘common sense’ 200 question RAG Benchmark test, and we tested each BLING and DRAGON model on this same test.

The RAG benchmark is published in the llmware Huggingface repository, and has now been downloaded thousands of times. It consists of two parts:

  • Part I — 100 standard questions — used to assign an overall accuracy score of 0–100 to a model. The questions are drawn from a mix of business, financial, technical and general news/politics sources, and consist primarily of fact-based question-answering, extraction, basic logic, and identification of complex numbers and attributes, with context passages in the range of 100–1000 tokens.
  • Part II — 100 questions consisting of 5 sub-parts of 20 questions each — evaluating 5 distinct specialized categories:
  • Not Found Recognition — how does the model respond when given a question that cannot be answered by the context passage? This is one of the true keys to using LLMs for RAG and knowledge automation, in which the model may have to read 10 passages, identify the one passage that correctly provides the answer, and offer an easy programmatic way to ‘discard’ the other 9 passages with “not found”. See this video on the topic: “The Hardest Problem in RAG — How to Handle ‘Not Found’”.
  • Yes/No Boolean Questions — can the model correctly identify and answer ‘yes/no’ questions? This is also an important classification step in many workflow processes that route based on a particular yes/no question.
  • Math/Logic — how does the model perform on common-sense ‘every day’ math and logic questions? We do not see using LLMs for complex math as a plausible use case in the short-to-medium term, but basic math and logic — understanding increments, percentages, sorting, ranking is critical to almost any type of business analysis.
  • Complex/Specialized — multiple-choice, table-reading, causal, and multi-part extraction questions. When we launched the test in 2023, most smaller models performed very poorly on these categories, but performance has improved considerably throughout 2024. We rank performance not as an absolute percentage, but as a qualitative ranking of 1–5 based on the model’s responses.
  • Summarization — this is also ranked on a qualitative basis of 1–5, for clarity, accuracy, fluency and conciseness of capturing the key points.

We have trained dozens of base models, and have published 26 so far in the BLING (0.5B–5B) and DRAGON (6–9B) families. Each one has a scored test that we have published with the model card.
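
For readers who want to run the same test, the benchmark can be pulled directly from Hugging Face. A minimal loading sketch follows; the dataset id and field layout are assumptions, so inspect the dataset page in the llmware repository before building on it.

```python
# Sketch: loading the 200-question RAG benchmark from Hugging Face.
# The dataset id is an assumption; verify it on the llmware repository page.
from datasets import load_dataset

bench = load_dataset("llmware/rag_instruct_benchmark_tester", split="train")

print(bench)     # inspect the column names and row count
print(bench[0])  # first sample: context passage, question, and expected answer
```

Each model’s answers can then be compared against the expected answers to produce the 0–100 core score and the category rankings described above.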

With consistent training objective, prompt design, and testing, we have a fascinating window to assess the evolution of small language models.

Key Learnings

The key to unlocking the capability of smaller models is to look at the model as one key component of the solution, but a component that must be integrated into a well-designed use case and data pipeline, with the right preprocessing and post-processing.

Most consultants that we talk to start with the premise: “We will build our prototype with OpenAI and then, once we have it working, replace it with a smaller model for production.” This rarely works (and is one of the contributors to the POC never getting to production — the #1 problem plaguing generative AI), as the data pipeline, prompting and postprocessing are all optimized for the larger model and become very difficult to adapt to smaller models. The larger model is more forgiving of laxity in the data pipeline, extraordinarily complex — and not reproducible — prompt instructions, and excessively long contexts, and generally encourages a lot of “bad habits” … (If we only had a dollar for every time someone approached us with a two-page, highly-tailored, only-for-OpenAI instruction prompt and asked why our 1b-parameter bling models can’t process it!)

We would recommend moving in the opposite direction: start the POC with the smallest possible model, optimize every facet of the data pipeline to get the best possible accuracy, and then, where needed, increase the size of the model in testing and production to get better results. It is much easier to “start small” and increase the model size than to go in the opposite direction.

Small models are getting better — and getting smaller — and reaching levels of accuracy that seemed inconceivable as recently as a couple of years ago. They are certainly not perfect, and they require different tactics and ways of handling them than larger models.

If you are interested in the benchmark findings, then please continue with the second part of this post — “Best Small Language Models for Accuracy and Enterprise Use Cases — Benchmark Results”

To check out some of our small, specialized fine-tuned models — none of which claim to be AGI but humbly aspire to be really useful fact-based business tools (when used in conjunction with sound generation loops and well-designed data pipelines ) — please go to our repo home page on HuggingFace — LLMWare RAG Instruct Models.

For more information about llmware, please check out our main github repo at llmware-ai/llmware/.

Please also check out video tutorials at: youtube.com/@llmware.
