SLIMs: small specialized models, function calling and multi-model agents

Darren Oberst
5 min read · May 14, 2024


Why small, specialized function calling LLMs are the future

In 2023 (which feels so long ago in Generative AI terms!), one of the most common showcase examples was ingesting a PDF document and generating an open-ended summary with lots of bullet points and nice headers. Great for a quick video, a basic RAG tutorial, and for demonstrating ever-larger model context windows. However, this widely-used example misses the mark on most real-world use cases, which involve a multi-step series of analyses, with structured outputs that are needed to complete a form, enter a new record in a database, or use the LLM output programmatically as an “if/then” condition to make a decision in a longer process or perform a complex chain of look-ups.

In response to this, over the last several months, there has been a growing recognition that “RAG 2.0” will evolve beyond the basics of parsing, vectorizing and summarizing, and increasingly feature function calls, structured outputs, and multi-step agent processes. As a simple example, a “RAG 2.0” process may start with an incoming document, e.g., a company earnings transcript, and then require several specialized steps to be performed:

  1. Classify (Yes/No) if the company met analyst expectations.
  2. Classify sentiment (Positive/Negative).
  3. Extract the company name, use it as a lookup in a private knowledge base, and then ask additional questions (e.g., when was the last time that the company missed expectations?).
  4. Extract the stock ticker, convert it to a SQL query to look up information in a stock database, and/or use it as a lookup key in an external web service to get the latest data.
  5. Extract key executives mentioned in the text, use them as lookups to get more information about their backgrounds, and run follow-up queries using that secondary source material.
  6. Figure out if any key information is missing, and if so, take triage steps to fill in any gaps.
  7. Organize all of the output and package up for the next step in the process.

In steps 1–7, there is an overall workflow that orchestrates a series of specialized model inferences, grounded in a particular set of source materials and integrated with different retrieval activities at each step to update and enrich those materials as the process runs. (Check out this video for a good example of this type of multi-step research in action — https://youtu.be/l0jzsg1_Ik0?si=0HZpj2FGYB_JzoLX)
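To make the shape of such a workflow concrete, here is a minimal Python sketch of steps 1–7. Every name in it (call_slim, lookup_knowledge_base, lookup_stock_service) is a hypothetical placeholder rather than a real library API, and the model/parameter pairings are purely illustrative; the point is that each step is a narrow, structured inference whose output drives the next lookup or branch.

def call_slim(model_name: str, text: str, function: str, params: str) -> dict:
    # placeholder for a specialized model inference; a real implementation would
    # prompt the named model with f"{text} <{function}> {params} </{function}>"
    return {params: ["placeholder"]}

def lookup_knowledge_base(key: str, question: str) -> str:
    return f"kb result for {key}: {question}"      # placeholder retrieval step

def lookup_stock_service(ticker: str) -> dict:
    return {"ticker": ticker, "price": None}       # placeholder web-service lookup

def analyze_earnings_transcript(transcript: str) -> dict:
    results = {}

    # steps 1-2: classification subtasks with structured outputs
    results["met_expectations"] = call_slim("slim-boolean", transcript, "boolean",
                                            "Did the company meet analyst expectations?")
    results["sentiment"] = call_slim("slim-sentiment", transcript, "classify", "sentiment")

    # steps 3-4: extraction subtasks whose outputs become lookup keys
    company = call_slim("slim-extract", transcript, "extract", "company name")["company name"][0]
    ticker = call_slim("slim-extract", transcript, "extract", "stock ticker")["stock ticker"][0]
    results["history"] = lookup_knowledge_base(company, "last missed expectations")
    results["latest_data"] = lookup_stock_service(ticker)

    # step 5: extract executives, then run follow-up lookups on the secondary material
    executives = call_slim("slim-ner", transcript, "extract", "people")["people"]
    results["executives"] = {name: lookup_knowledge_base(name, "background") for name in executives}

    # steps 6-7: triage anything missing and package everything for the next stage
    results["missing"] = [key for key, value in results.items() if not value]
    return results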

What is fascinating to us is that many of these specialized steps involve classification and extraction subtasks, which, until recently, were the sole domain of traditional encoder-based classifiers. Before LLMs took over the world, most of NLP machine learning was focused on these types of tasks, approached with the classical technique of adding a ‘classification’ head on top of an encoder, labeling samples with the expected output value (e.g., label = 0), and using it to produce solid baseline classifiers for almost any task. With the renewed focus on these types of tasks, there are now two different ways to approach these problems:

  1. Classical ‘encoder classifier head’ approach — still the workhorse of most NLP applications, but notable for many limitations: it is inflexible in adapting to changes in the number of categories, struggles to generalize to new domains in sensible ways, cannot use the semantic meaning of the labels (which are interpreted ‘outside’ of the model), requires bespoke pre-processing, post-processing, dataset labeling and model architecture for each classifier, and does not fit especially neatly into ‘natural-language’ oriented, LLM-centered processes; and
  2. LLM ‘decoder’ function-calling approach — has the benefit of the flexibility of natural language and an indefinite number of categories, but suffers from the “opposite” problem of being overly flexible and difficult to constrain to a narrower task, leading to an ever-increasing amount of complex ‘prompt magic’ templating to elicit structured responses from ‘open ended’ OpenAI calls. Ironically (is this progress?), this approach usually takes a 100 billion parameter model and attempts to constrain its task to deliver an outcome comparable to the 100 million parameter traditional BERT-based encoder classifier that was used 5 years ago!

We recently launched a new model family called SLIMs (Structured Language Instruction Models), designed to blend the best of both of these approaches. SLIM models are small, specialized, decoder-based, function-calling LLMs, fine-tuned on a specific task to generate structured outputs such as Python dictionaries, lists, JSON and SQL. Over the last two months, we have released 15 different SLIM models, all fine-tuned on top of high-quality, open source 1–3B parameter model bases, 4-bit KM-quantized and delivered as GGUF, with integrated, easy-to-use function_calling prompt templates provided in llmware.

The first batch of SLIM models focuses on common classification, summarization and extraction tasks, including: Summary, XSUM, Extract, Boolean, NER, Sentiment, Topics, Category, Emotions, Intent, SQL, NLI, Ratings, Sentiment-NER (combination) and Tags (in both 1B and 3B versions).

Each model is a small, specialized, locally-running LLM that has been fine-tuned to be a function-calling specialist, prompted in a structured manner with a consistent function call prompt template. Each model expects input in the following form:

{context_passage} <function> parameters </function> 

So, for the slim-sentiment model, the prompt would be structured as follows:

{context_passage} <classify> sentiment </classify>

And the output from the model generation is a structured Python dictionary with a key corresponding to the parameter passed to the function, e.g.,

{"sentiment": ["positive"]}
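llmware ships integrated prompt templates that handle this wrapping and parsing for you, but as a rough illustration of what happens under the hood, here is a hand-rolled sketch using llama-cpp-python against a locally downloaded GGUF copy of slim-sentiment (the model file name, stop token and example passage are assumptions about a local setup, not a prescribed configuration):

import ast
from llama_cpp import Llama

# load a locally downloaded 4-bit GGUF copy of slim-sentiment (path is illustrative)
llm = Llama(model_path="slim-sentiment.gguf", n_ctx=2048, verbose=False)

context_passage = "The quarter was a disaster, with revenue falling well short of guidance."
prompt = f"{context_passage} <classify> sentiment </classify>"

# plain completion call; the model has been fine-tuned to emit a dictionary string
raw = llm(prompt, max_tokens=64, stop=["</s>"])["choices"][0]["text"]

# parse the dictionary-style string, e.g. '{"sentiment": ["negative"]}'
response = ast.literal_eval(raw.strip())
print(response.get("sentiment"))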

The models have been trained to respond to a single function, e.g., classify, extract, summarize, boolean, etc., and generally to accept one or more specialized parameters, such as sentiment, emotions, people, location, or custom keys in the slim-extract model, such as “company name” or “revenue growth.”
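As a format illustration only (the passage text below is a placeholder, and the exact parameter wording may vary by model), the same template covers both the built-in parameters and the custom keys of slim-extract:

passage = "Acme Corp reported revenue growth of 12 percent for the quarter..."

sentiment_prompt = f"{passage} <classify> sentiment </classify>"     # slim-sentiment
company_prompt   = f"{passage} <extract> company name </extract>"    # slim-extract, custom key
revenue_prompt   = f"{passage} <extract> revenue growth </extract>"  # slim-extract, custom key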

Models like slim-summary have been trained to accept an optional “list length” parameter that guides the number of list elements provided in the summary, e.g.,

{context_passage_to_summarize} <summarize> key data points (5) </summarize>

LLM Output (Python List):
['point1', 'point2', 'point3', 'point4', 'point5']

Models like slim-boolean, meanwhile, offer innovative, experimental features such as an explain parameter, e.g.,

{context_passage} <boolean> Is that true? (explain) </boolean>

LLM Output:
{'answer': ['no'], 'explain': ['the passage says ...']}

Since the models are small, quantized and consistent in their API, we can easily stack and chain them together in multi-step processes — we routinely run 10+ SLIM models concurrently in an agent-based process on CPUs.
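As a rough sketch of that kind of fan-out (run_slim below is a stand-in for whatever loading and inference mechanism you use, whether llmware's integrated templates or a llama.cpp binding, and the model/parameter pairings are illustrative):

from concurrent.futures import ThreadPoolExecutor

def run_slim(model_name: str, passage: str, function: str, params: str) -> dict:
    # placeholder: a real implementation would prompt the named SLIM model with
    # f"{passage} <{function}> {params} </{function}>" and parse the dictionary output
    return {params: ["placeholder"]}

passage = "Acme Corp beat expectations, and CEO Jane Smith raised full-year guidance."

calls = [
    ("slim-sentiment", "classify", "sentiment"),
    ("slim-topics",    "classify", "topics"),
    ("slim-emotions",  "classify", "emotions"),
    ("slim-extract",   "extract",  "company name"),
]

# fan out the independent classification/extraction calls across worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_slim, m, passage, f, p): m for (m, f, p) in calls}
    results = {model: fut.result() for fut, model in futures.items()}

print(results)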

We are already working on the next batch of SLIM models, and we see almost infinite possibilities for additional functions and parameters, including the ability to combine and aggregate these capabilities in a single multi-purpose model, especially at 7B, with 1–3B models serving as small, fast specialists.

We believe that SLIMs are part of an upcoming revolution in the way that people think about both decoder-based LLMs and traditional encoder-based machine learning classification models, all centered around the concept of specialized function calling, and especially using smaller specialized LLMs.

To check out some of our fine-tuned models, please go to our repo home page on HuggingFace — LLMWare SLIM Models.

For more information about llmware, please check out our main github repo at llmware-ai/llmware/.

Please also check out video tutorials at: youtube.com/@llmware.
