Small Language Models (SLM), Large Language Models (LLM), or Micro LLMs (MLM)?

Todd Mozer's Desk
Sensory Perspectives on AI
Aug 23, 2024

LLMs & Small LLMs.

Since the arrival of generative AI and Large Language Models, the world has learned a whole new vocabulary. It’s impressive to me that “Large Language Models” and LLMs have moved into the tech community’s mainstream conversation.

Language models are nothing new. Since the early days of Hidden Markov Modeling, language models have been an important component of voice-based systems. In fact, SLM used to mean Statistical Language Model, as opposed to a simple grammar, which was also a language model.

Now SLMs are Small Language Models. But they aren’t small. An LLM can have a trillion or more parameters. Today’s SLMs are under 5 billion parameters (plus or minus a few billion). They really should be called Small LLMs.

Typically, Small LLMs are produced by reducing and quantizing larger models and focusing them on certain domains. So there may be extreme expertise and conversational ability on a particular topic, but only broad knowledge beyond it.
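As a rough illustration of the quantization side, here is a minimal sketch of loading a few-billion-parameter model in 4-bit form with Hugging Face transformers and bitsandbytes. The model name and settings are placeholder assumptions for illustration, not anything Sensory ships:

```python
# Minimal sketch: post-training 4-bit quantization at load time with
# Hugging Face transformers + bitsandbytes. The model name is a
# placeholder assumption; any small causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit form
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

model_name = "microsoft/phi-2"  # example ~2.7B-parameter "Small LLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "How do I pair my phone over Bluetooth?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Domain focus then typically comes from fine-tuning the reduced model on in-domain data.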

Chips for LLMs.

Building an LLM is very different from running one. The building is typically done in large data centers with big Nvidia chips. Running an LLM often happens on a similar type of cloud or data-center platform, but takes significantly less time and processing power.

The hope for Small LLMs is to run on device in a product, rather than going out to a cloud, data center, or large on-premises solution. On-device solutions offer several advantages in cost (no service fees), privacy, and availability (no connection required).

The platforms that run Small LLMs are nevertheless quite large and power hungry. Think of cell phone chips from Qualcomm or MediaTek, or even PC-based Intel systems. This works for higher-end products that can handle the cost and heat, but not for low-cost consumer electronics, like the dream of wearable assistants that we take with us.

Creating Micro LLMs.

There’s a need to go even smaller and break into more mainstream chips and products. We’d want this for reasons such as cost and lower heat dissipation (e.g., we can’t have glasses or earbuds that heat us up!). For this, a Micro LLM is needed that can run with 100 million parameters or fewer. Much as with Small LLMs, sacrifices need to be made to reduce computational and memory requirements. A Micro LLM can have the same domain expertise but lose some of the conversational ability, with more directed questions and responses.

Micro LLM solutions can more easily run on the lower end of the Qualcomm and MediaTek product lines, but they are also a really nice fit for new generations of processors with special-purpose acceleration for TensorFlow Lite and TensorFlow Lite Micro. Arm, for example, has an IP line called the Ethos-U series (U55, U85, etc.) that is bringing inference costs down, and Cadence has similar efforts to provide IP for inference.
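For a sense of how models reach that class of hardware, the usual path is full-integer TensorFlow Lite conversion (Ethos-U deployments then typically compile the resulting .tflite file further with Arm’s Vela tool). A minimal sketch, where the model and calibration data are toy placeholders:

```python
# Minimal sketch: full-integer (int8) TensorFlow Lite conversion, the
# format that NPUs like Arm's Ethos-U consume. `keras_model` and the
# calibration data below are placeholder assumptions.
import numpy as np
import tensorflow as tf

keras_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(16),
])

def representative_data():
    # Calibration samples so the converter can pick int8 scales.
    for _ in range(100):
        yield [np.random.rand(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```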

Micro LLMs with Cognitive Arbitration to LLMs

One of the techniques Sensory has deployed over the years to lower cost or power consumption is multi-staged models. For example, for wake word or sound identification, we deploy a tiny model that can be “always on,” backed by bigger models that improve on its accuracy but are only periodically called on.
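A minimal sketch of that staging idea, with toy scoring functions standing in for Sensory’s actual wake-word engines:

```python
# Minimal sketch of the multi-staged idea: a tiny always-on model gates
# the audio stream, and a bigger, more accurate model runs only on the
# frames the first stage flags. Both models here are toy stand-ins, not
# Sensory's actual engines.

def tiny_always_on_score(frame: list[float]) -> float:
    """Stage 1 stand-in: cheap enough to run on every frame."""
    return sum(abs(x) for x in frame) / len(frame)  # toy energy score

def big_verifier_score(frame: list[float]) -> float:
    """Stage 2 stand-in: more accurate, only invoked on candidates."""
    return max(abs(x) for x in frame)  # toy peak score

STAGE1_THRESHOLD = 0.2  # permissive, so real wake words aren't missed
STAGE2_THRESHOLD = 0.8  # strict, to reject stage 1's false alarms

def wake_word_detected(frame: list[float]) -> bool:
    if tiny_always_on_score(frame) < STAGE1_THRESHOLD:
        return False  # common case: the big model never runs
    return big_verifier_score(frame) >= STAGE2_THRESHOLD

print(wake_word_detected([0.05] * 160))     # quiet frame -> False
print(wake_word_detected([0.9, 0.3] * 80))  # loud candidate -> True
```

The win is that the expensive model’s cost is paid only on the rare frames the cheap model flags, which is what makes “always on” feasible on a power budget.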

Sensory Automotive Voice Assistant

This same sort of layered approach can be applied to LLMs, with a Micro LLM running on device and a cloud LLM or SLM deployed as a backup. For example, Sensory has an automotive voice assistant solution that can run on device in as little as 34 MB. It accurately captures and responds to automotive commands, but when an out-of-domain query is given, we can pass it on to a cloud (or bigger on-device) solution. Multiple domains can run simultaneously, and a model built for cognitive arbitration can determine which Micro, Small, or Large Language Model should respond.
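A sketch of that arbitration logic, with a toy domain classifier and hypothetical model endpoints (none of these names or thresholds come from Sensory’s product):

```python
# Minimal sketch of cognitive arbitration: a per-domain score decides
# whether the on-device Micro LLM answers or the query is escalated to
# a bigger (cloud or on-device) model. All names and thresholds are
# hypothetical placeholders; a real arbiter would itself be a model.

AUTOMOTIVE_KEYWORDS = {"wipers", "defrost", "radio", "navigate", "ac"}

def automotive_domain_score(query: str) -> float:
    """Toy in-domain classifier based on keyword overlap."""
    words = set(query.lower().split())
    return len(words & AUTOMOTIVE_KEYWORDS) / max(len(words), 1)

def micro_llm_answer(query: str) -> str:
    return f"[on-device Micro LLM] handling: {query!r}"

def cloud_llm_answer(query: str) -> str:
    return f"[cloud LLM fallback] handling: {query!r}"

def arbitrate(query: str, in_domain_threshold: float = 0.2) -> str:
    if automotive_domain_score(query) >= in_domain_threshold:
        return micro_llm_answer(query)  # fast, private, no connection
    return cloud_llm_answer(query)      # out-of-domain escalation

print(arbitrate("turn on the wipers"))   # stays on device
print(arbitrate("who wrote Moby Dick"))  # escalates to the cloud
```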
