RAG-Instruct Capabilities: “They Grow up So Fast”
Comparing the capabilities of 1B vs 3B vs 7B parameter LLMs
This is part of an ongoing series of blogs on our work developing and optimizing smaller open-source decoder-based LLMs and fine-tuning them for RAG-Instruct tasks. Please feel free to review some of our other blogs on this topic, and follow / subscribe to receive updates on upcoming blogs.
In this blog, we take a look at the overall performance of models from 1B to 7B parameters, to analyze the ‘real-world’ performance of smaller models on RAG tasks and to assess how model size correlates with expected performance. For evaluation purposes, we will be using the LLMWare RAG-Instruct-Benchmark-Tester.
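For readers who want to run a similar evaluation loop themselves, the minimal sketch below shows one way to pull the benchmark questions from Hugging Face and iterate over them. The dataset id and the field names ("query", "context", "answer") are our best-guess assumptions rather than confirmed details; check the dataset card for the exact schema.

```python
# Minimal sketch: load the RAG-Instruct benchmark test set and iterate over it.
# Assumptions: the dataset id and the "query" / "context" / "answer" field names
# may differ -- check the dataset card on Hugging Face for the exact schema.
from datasets import load_dataset

# assumed dataset id -- confirm against the LLMWare Hugging Face page
ds = load_dataset("llmware/rag_instruct_benchmark_tester", split="train")

for i, sample in enumerate(ds):
    question = sample.get("query")    # the test question
    context = sample.get("context")   # the source passage the model must use
    expected = sample.get("answer")   # the gold answer for scoring
    # ... run the model under test on (question, context) and compare to expected ...
    if i == 2:  # just peek at a few samples in this sketch
        break
```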
Background
Our key research question for smaller models has been: in which use cases can smaller models be deployed for enterprise retrieval augmented generation (RAG) workflows? We started with the assumption that smaller models would need a specialized mission rather than trying to be good at everything, but that on a narrower subset of tasks they could deliver solid results.
We fine-tuned a wide range of pre-trained decoder-based models in four distinct size categories: the 1.0B, 1.3–1.4B, 2.7–3.0B, and 7B ranges. We tried to instruct fine-tune models below 1.0B, but found that the results were very poor, and concluded that models below that size lack the capability to consistently learn and respond to basic questions. Accordingly, for purposes of this small model evaluation, we set the absolute “floor” for a small LLM at 1B parameters.
We fine-tuned 20+ models across a wide range of the major open source foundation models (e.g., Llama, Falcon, Mistral, RedPajama-INCITE, Pythia, Deci, Cerebras, StableLM, Mosaic, and Sheared-LLaMA) using a proprietary RAG-Instruct training set designed for fact-based critical reading comprehension, summarization and extraction. We went through multiple rounds of iteration to get the best performance out of each model, and then pruned the set down to the 12 best performers across the 4 size categories.
For each of these 12 best-performing models, we then completed 2 testing cycles (200 questions each), tabulated the results, and averaged them by model category. The summary results are in the table below:
Model Performance Results on RAG Benchmark Test
Key Observations
1. Even very small models (1B and 1.3B) can be surprisingly effective at core question-answering extractive tasks. We would encourage everyone to look at the top 100 core test questions and context passages (available at the link above, or described in the blog here). These are fairly representative “real-world” questions from complex business, financial and technical materials. To achieve 80%+ accuracy at 1.3B parameters and 90%+ accuracy at 2.7B parameters, all running locally on a laptop, is a very encouraging result. For a high-volume RAG workflow focused on (i) extractive activity and (ii) “human in the loop” follow-up review, where cost is an important practical consideration, we believe there is a viable use case for fast, free, local LLMs to do this work, especially when wrap-around preprocessing, post-processing and some basic “prompt engineering” are applied to get the best results. The beauty of these very small models (besides “free” and “fast”) is that they can be quickly and cost-effectively further refined and fine-tuned to specific domains. In short, we would advocate that even these “mini-LLMs” can be useful in production in certain business process automation workflows, provided that the use case is specific and well-defined. One further caveat is that the effective use of these smaller models in RAG puts a greater premium on retrieval quality, as the model will require a bit of “spoon-feeding” of the right context passages, chunked in moderately sized context windows (e.g., up to 500–600 tokens maximum in our testing); a minimal prompting sketch appears after this list.
2. Focusing the task is a key part of getting good results with smaller models. While the core Q&A results are surprisingly positive, they stand in stark contrast with the limited capability of very small models on more complex logic, such as not-found classification, boolean question answering and math/logic. We tried multiple different distilled training sets to teach the 1B and 1.3B models some of these behaviors, and found that the models lacked the “depth” and “complexity” to learn them effectively. Below 2.7B parameters, we found that the ability to learn even basic math was extremely limited; even at 2.7B parameters, the models are quite bad at simple math, with a meaningful jump in math accuracy at 7B parameters (although still barely adequate for everyday math). If a particular use case does not require boolean classification, recognition of a mathematical threshold, or a high number of “out of context” samples, then it is possible to get decent results from a 1.3B parameter model, but if any of these elements are required, then at least 2.7B, and likely 7B, parameter models need to be considered.
3. First Baby Steps of Being an LLM. The jump in results from 1B to 1.3B is an interesting and important one, worthy of further research and analysis. While arguably it is an artifact of the selected models, it does appear that this is a size range in which new capabilities are rapidly emerging, and where even a few hundred million additional parameters make a big impact. Is this the size range in which an LLM is “born” in terms of instruction-following capability? From 1B to 1.3B parameters, the overall accuracy jumps 9 points, with substantial improvements in “not found” classification and “yes-no” boolean question handling. Empirically, it seems that in this size range the model gains the depth and complexity to begin to handle more detailed reading comprehension, boolean questions, and recognition of out-of-scope examples.
4. 3B parameter models are a great bridge from testing to production. When we started this project, we were a bit skeptical of 3B parameter models, which seemed like a “no-man’s-land” between a good local testing laptop model (1.3B) and a good solid production GPU-based model (7B). However, after systematic testing, we believe that there is a lot of potential for 3B parameter models in many use cases. 3B models run inference reliably, if slowly, in 16 GB of RAM on a Mac M1 laptop, can be swapped over to a GPU for production, and are then blazing fast even on relatively small GPU memory sizes. The accuracy of the best-performing 3B parameter models was not that different from that of the average 7B model, although the 7B models had much greater capability on more complex tasks.
5. Hallucinations were largely absent across the board. We did not identify any meaningful “hallucination” behavior in any of the tests. There were obviously wrong answers, sometimes very bad wrong answers, but no clear instance of “inventing” a wrong answer not found in the context passage. We believe that this result is a positive validation of RAG generally, and of training models for closed-context question-answering specifically: the model learns to derive its answers from the passage context rather than draw on general knowledge. Note: we did observe relatively benign “filling in the gap” behavior in a very small number of test cases, specifically in long summaries involving public names and companies, and there were a few limited cases where the model response drew upon public information not contained in the context (e.g., that Microsoft Excel is a Microsoft product), but we did not record any example of a classic hallucination where the model conjures an imaginary fact. Note: testing was performed at a low generation temperature (0.3), which also reduces hallucination risk considerably.
6. Epistemological Uncertainty: Recognizing Limits. Not-found classification is an extremely important task. This is a separate risk from hallucination, but it is a key risk in RAG. If the model receives a passage and a question, and the answer to the question is not clearly provided in the passage, how does the model respond? The worst outcome would be a hallucination, i.e., inventing an answer. The second-worst outcome, however, and a realistic one, is that the model draws on information from the context inaccurately. For example, if the context passage contains Company A’s address, and the question is “what is the address of Company B?”, the correct answer should be “Not Found”, but oftentimes LLMs will be overly helpful and answer with Company A’s address. The improvement in not-found classification from 1.3B to 2.7B to 7B is notable. We have found that 7B parameter models can consistently avoid this trap, while 1.3B models have a lot of difficulty learning this concept. We tried increasing the percentage of “not found” training samples, and still could not succeed in “teaching” a 1.3B parameter model to make this classification effectively. It may be a lack of depth in “reading comprehension”, or a larger “concept” to grasp, but this is definitely a key area where model size seems to unlock new capabilities. The sketch after this list includes a simple not-found check of this kind.
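To make the closed-context pattern and the “not found” behavior discussed above concrete, here is a minimal prompting sketch: a small instruct model is given a passage plus a question, generation runs at the low temperature (0.3) used in our testing, and a Company A / Company B style question probes the not-found case. The model id, the prompt template and the sample passage are illustrative assumptions; the exact prompt format should follow the model card of whichever fine-tuned model is used.

```python
# Minimal sketch of closed-context question answering with a small local model.
# Assumptions: the model id and the "<human>: ... <bot>:" prompt template are
# illustrative -- follow the model card of the actual fine-tuned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llmware/bling-1b-0.1"   # assumed small RAG-Instruct model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def closed_context_answer(context: str, question: str) -> str:
    # Keep the context chunk modest (e.g., under ~500-600 tokens), per the testing notes above.
    prompt = f"<human>: {context}\n{question}\n<bot>:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.3,   # low generation temperature, as in the benchmark runs
        )
    # Return only the newly generated tokens after the prompt
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# 1) Core extractive question, answerable from the passage
passage = "Company A is headquartered at 100 Main Street, Springfield. Revenue grew 12% in 2022."
print(closed_context_answer(passage, "What was Company A's revenue growth in 2022?"))

# 2) Not-found check: the passage says nothing about Company B
response = closed_context_answer(passage, "What is the address of Company B?")
print("Not found handled correctly:", "not found" in response.lower())
```

The same helper can be wrapped around retrieved and chunked passages in a larger pipeline; the not-found check at the end is the kind of simple guardrail that becomes more important as the model size shrinks.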
Conclusion
We believe that smaller, cost-effective, special-purpose LLMs enable a much wider range of practical RAG use cases in the enterprise, where cost benefit, privacy/security, ease of workflow integration and speed of domain adaptation are key enablers, and often the right trade-offs compared with much larger models. It is a truism in the AI world that “bigger is better” in terms of model size, but it is equally important to have the right tool for the task, and to increasingly understand the specific use cases where specialized smaller models can be effectively deployed to drive real productivity benefit.
There is a lot more to unpack in this research initiative in upcoming blogs and videos. This blog focused primarily on analyzing 1B to 3B parameter models. We are still completing our analysis of the 7B RAG-Instruct trained models, which we believe will be the main production “workhorse” models for many enterprise use cases, but which do require a GPU-based inference server. We will be publishing a more detailed blog and analysis on this subject in the next couple of weeks.
For more information about the models discussed in this blog, please check out: LLMWare Bling RAG Instruct Models.
For more information about llmware, please check out our main github repo at llmware-ai/llmware/.
Please also check out video tutorials at: youtube.com/@llmware.