Are the Mega LLMs driving the future, or are they already in the past?

Darren Oberst
8 min read · Oct 1, 2023

For long-time practitioners of AI, the growth in model size over the last five years has been dizzying. It has undoubtedly played a significant role in the progress and efficacy of language models, but the question remains: how big is big enough, and how big is too big?

It can seem quaint to recollect, but at the time the now-famous “Attention is All You Need” paper launched the transformer revolution, a “large” state-of-the-art CNN was on the order of 100 million parameters. It is not surprising that the initial base BERT model had 110 million parameters, RoBERTa had 125 million, and the original GPT had 117 million. Of course, this all changed in 2020, with OpenAI’s publication of “Language Models are Few-Shot Learners” and GPT-3, which included what seemed like a research moon-shot of training a 175 billion parameter model, and which set off a model-size “arms race” among all of the leading AI research shops: Baidu launched ERNIE 3.0 Titan (260 billion parameters), Google launched PaLM (540 billion parameters), and Microsoft and Nvidia launched Megatron-Turing NLG (530 billion parameters). All of this acceleration in size led to hushed rumors of trillion parameter models on the way soon, and of unimaginable emergent properties putting these mega models on the fast track to artificial general intelligence (AGI).

The transformer architecture’s scalability, combined with ever more powerful parallelized GPUs, created a perfect storm in 2020–2022 for the emergence of LLMs and the proliferation of “mega” models with more than 50 billion parameters. Just about every research paper publishing a series of models showed at least linear, if not “emergent,” improvements as model sizes scaled with the same architecture and training, which reinforced the basic principle: Bigger is Better.

While it may be a rough heuristic (and a constantly moving target), we can try to put language models into four categories by size:

Mega LLMs — 50 billion+ parameters. These models require parallelized GPU training and, in almost all practical use cases, a highly parallelized GPU architecture for inference. For the foreseeable future, training and running Mega LLMs will likely remain largely the province of “Big Tech” companies, and will generally require a hosted API infrastructure to access the model. (Note: we debated the boundary line for this space, and whether 50 billion is too low, but decided to apply a simple rule of thumb: can the model run inference on a single A100 80GB GPU? A back-of-the-envelope version of this test is sketched below, after this list.)

Enterprise LLMs — 7–40 billion parameters. This is where most of the energy in the open source community is focused currently, with many high-quality open source pre-trained foundation models in the 7 billion, 13 billion, and 20 billion parameter range, and even a few in the 30–40 billion range. Models at the lower end of this range can still be trained on a single GPU, and all of these models (with some relatively straightforward quantization) can run inference on a single GPU. We are labeling models in this size range “Enterprise LLMs,” in contrast with the Mega LLMs, because they can be deployed for a single enterprise inside its private cloud and are within the practical reach of most enterprises around the world today. We believe that models in this size range will be the “workhorse” models where most work gets done, and will be the size range for most “specialized” LLMs.

Mini-LLMs — 1–5 billion parameters. It may seem like a contradiction to put “mini” in front of “LLM,” but that captures this model size category well. It is a bit of a “no-man’s land”: these models are not nearly as effective as Mega and Enterprise LLMs, yet are still “bigger” and “slower” than what is needed for most specialized transformers. Are decoder-based models in this size range still “LLMs,” or is this a dead zone of innovation? We have recently kicked off an open source project around what we are calling BLING (Best Little Instruct No-GPU) models to experiment with specialized instruct training on 1–2 billion parameter models. What is interesting about models in this size range is that they can be trained cost-effectively on a single GPU and, in most cases, will run reasonably well on a laptop CPU for inference (with relatively straightforward quantization). We see models in this size range as somewhat unexplored in terms of their applicability, especially for specialized LLM use cases, with 1 billion parameters as the minimum dividing line for calling a model an “LLM.” (Note: we debated whether to call these sub-7B models “LLMs” at all, or whether to put the line at 7B, but decided that “mini-LLM” best described this segment.)

Embeddings and Specialized Transformers — <1 billion parameters. Most sentence transformers, embedding models, and classifier models fall in the range of 50 million to 1 billion parameters, even for production-grade use, but decoder-based models in this size range generally do not demonstrate the type of “instruct” capability that has generated so much of the excitement around LLMs. So, we would not classify these smaller decoder-based models as LLMs. This is not to cast any aspersions on transformer-based models in this size range, as they remain the most widely used and deployed across a wide range of NLP applications today. Notably, sentence embedding models as small as 50 million parameters can deliver excellent performance, and many sentiment, NER, and text classification models are similarly best-in-class with fewer than 1 billion parameters. Our only point is that we have struggled to train meaningful instruct capability into decoder-based models below 1 billion parameters.
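As a quick illustration of how capable these small models can be, here is a minimal sketch using the sentence-transformers library. The checkpoint named below (roughly 22 million parameters) is one common public example, chosen purely for illustration:

```python
# A minimal sketch using the sentence-transformers library; the checkpoint
# below (~22M parameters) is a common public example, not a specific
# recommendation from this article.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~22M parameter embedding model

docs = [
    "The indemnification clause survives termination of this agreement.",
    "Quarterly revenue grew 12% year over year.",
]
query = "Which provisions remain in effect after the contract ends?"

# Encode and rank by cosine similarity
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)
print(scores)  # the contract sentence should score highest
```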

Certainly, we should expect these boundary lines to evolve over time, with enhancements in GPU memory capacity, GPU parallelization, and quantization. However, the basic segmentation will likely remain, even if the boundary lines continue to shift upward.
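To make the single-GPU rule of thumb concrete, here is a rough back-of-the-envelope sketch in Python. The 1.2x overhead factor for KV cache, activations, and buffers is a loose assumption for illustration, not a measured constant:

```python
# Back-of-the-envelope check: do a model's weights fit on a single GPU?
# The 1.2x overhead factor (KV cache, activations, buffers) is a loose
# assumption for illustration, not a measured constant.
def fits_on_gpu(params_billions, bytes_per_param, gpu_gb=80.0, overhead=1.2):
    weights_gb = params_billions * bytes_per_param  # 1B params x 2 bytes ~= 2 GB
    return weights_gb * overhead <= gpu_gb

for params in (7, 13, 40, 70, 175):
    for precision, bpp in (("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)):
        print(f"{params}B @ {precision}: fits on A100 80GB = "
              f"{fits_on_gpu(params, bpp)}")

# A 175B model at fp16 needs ~350 GB for the weights alone, which is why
# Mega LLMs require a parallelized multi-GPU architecture even for inference.
```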

Cost and Complexity

In our opinion, the dividing line is most pronounced between the “Mega LLMs” and the “Enterprise LLMs” in terms of cost and complexity. Put simply, Mega LLMs are extraordinarily complicated and expensive to train, and once trained, they may be even more complex and expensive to scale for industrialized inference with acceptable speed and concurrent-user support. The complexity is intrinsic to the need for massively parallelized GPUs for both training and inference, and is unlikely to change in the short-to-medium term. Furthermore, the lifecycle of these models, in terms of ongoing upgrades, enhancements, and incremental training, is another contributor to cost and complexity.

In contrast, at AI Bloks and llmware, we have trained dozens of models in the Enterprise LLM category, and in our experience the cost of training and running inference on a high-quality, highly customized 7B parameter model is easily 1–2 orders of magnitude lower than for a Mega LLM. For the short-to-medium term, it seems pretty clear that 7B parameter models (and potentially models up to 12–13B parameters) occupy an important sweet spot: they can generally be trained on a single GPU, and for inference they can be served from a single GPU, or pools of single GPUs, with even off-the-shelf Nvidia A-series cards (A5000s and A6000s) having enough memory to support batched inference for 7B models. It is not an accident that so much of the open source community’s energy has centered on models of this size, with such rapid innovation over the last six months.
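As an illustration of how straightforward single-GPU serving can be, here is a minimal sketch loading a ~7B model in 4-bit precision with the transformers and bitsandbytes libraries. The model id is an illustrative assumption; any comparable ~7B causal LM checkpoint would work similarly:

```python
# Minimal sketch: serve a ~7B model on a single GPU with 4-bit quantization,
# using transformers + bitsandbytes. The model id is illustrative; any
# comparable ~7B causal LM checkpoint would work similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative choice
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # ~3.5 GB of weights at 4 bits fits easily on a 24 GB A5000
)

inputs = tokenizer("Summarize the termination clause: ...", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```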

Instruct Fine-tuning

The other major dimension in this discussion is the role of specialized instruct training, which we see as a big equalizer when thinking about how big is big enough.

OpenAI’s paper, cited above, which announced the 175B version of GPT-3, crystallized the notion of “few-shot learning,” identifying what appeared to be emergent behavior in Mega LLMs with only core causal language model training (and no specific “instruction-following” training). This insight became formalized in 2021–2022 with instruct models, training datasets, and procedures, which are arguably the most important innovation behind ChatGPT’s explosive popularity, and which have since been emulated and replicated by virtually all of the large model providers and across the open source community.

It seems increasingly clear that “instruct” skills, tasks, and capabilities will be what enables LLMs to cross the chasm into the enterprise for most knowledge-based automation and retrieval-augmented generation (RAG) tasks. Even more importantly, what most enterprises need is not “all things to all people” universal instruction-following, but extremely high-quality specialized instruction-following on a relatively narrow set of specific skills, capabilities, and domains. To effectively integrate an LLM into an enterprise workflow, the model needs to do a few things really well, not lots of things pretty well. Specialized tasks are at the core of most enterprise use cases.
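To make “a few things really well” concrete, here is a hypothetical single-task training example for a narrow contract-review skill. The field names and prompt template are illustrative assumptions, not a standard schema:

```python
# A hypothetical single-task instruct training example for a narrow
# contract-review skill; field names and the prompt template are
# illustrative assumptions, not a standard schema.
sample = {
    "instruction": "Identify the parties to the agreement in the passage.",
    "context": ("This Services Agreement is entered into between "
                "Acme Corp and Beta LLC, effective January 1, 2023."),
    "response": "The parties to the agreement are Acme Corp and Beta LLC.",
}

# A common pattern: flatten each example into a prompt/completion pair
# for standard causal language model fine-tuning.
prompt = f"<human>: {sample['instruction']}\n{sample['context']}\n<bot>:"
completion = f" {sample['response']}"
```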

It seems incontrovertible that, without any instruct training, a Mega LLM will outperform an Enterprise LLM on virtually every task in “zero shot” or “few shot” mode. (The Hugging Face LLM leaderboard is a good quick reference to illustrate this.) However, a more interesting question is: what if the smaller 7B–13B model is carefully fine-tuned for a specific instruct task? With laser-focused “one task” or “few task” instruct training, can the 7B+ model demonstrate behavior comparable to the Mega model? We believe the answer is: Yes.

For general-purpose, “all things to all people” instruction-following, there seems to be little doubt that Mega models will deliver superior performance. However, for a specific set of tasks implemented with thoughtful custom instruct fine-tuning, our experience is that Enterprise-scale models in the 7B–13B range can compete effectively and deliver comparable, and in some cases superior, performance on specific tasks.
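One practical way to run this kind of focused fine-tune on a single GPU is parameter-efficient fine-tuning. Here is a minimal skeleton using the peft library, where the base model id and target modules are assumptions for a Llama-style 7B architecture:

```python
# Minimal parameter-efficient fine-tuning (LoRA) skeleton using the peft
# library. The base model id and target_modules are assumptions for a
# Llama-style 7B architecture; adjust for other model families.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # low-rank dimension of the adapter matrices
    lora_alpha=16,      # scaling factor for the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style blocks
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
# ...then train with a standard Trainer loop on the single-task instruct dataset.
```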

If the LLM story of 2020–2022 was ever-larger model sizes, we believe that the next chapter will focus much more on high-quality instruct datasets and training objectives, as well as all of the “ware” (e.g., software and middleware) required to integrate LLMs at scale and with quality into enterprise workflow processes. In short, LLMs in the Enterprise range of 7B+ are “big enough” to meet the needs of most of these use cases, and given their dramatically lower cost and complexity, combined with the rate and pace of innovation in the open source community, they are likely to become the most ubiquitous in enterprise adoption.

Conclusion — One Fine Day …

The idea of “artificial general intelligence” (AGI) has captivated our imaginations for 50+ years, and one fine day, a research team will get there — and it will change the world. While no one can predict if that day is tomorrow or in 20 years, it is likely still pretty far out on the horizon, at least in the form of any practical deployment. While we tend to think that even the Mega LLMs are pretty far away from AGI, the brilliant researchers and companies behind them will undoubtedly be among the people pushing the envelope and taking us closer to that goal. To ever reach AGI will likely take innovation in all of the elements required for scaling LLMs, so the Mega LLMs will play a critical role in the pursuit of that future. However, our money is strongly on the side of the Enterprise LLMs as the zone where practical deployment of AI will cross the chasm in the enterprise.

Check out our open source models on Hugging Face: https://huggingface.co/llmware.
