The Smaller The Better?

Amit Spinrad
Eleos Health
Apr 30, 2023

The surprising potential advantages of smaller models in niche-specific NLP tasks

“Oh, the drama!”

If you’ve lived long enough, you’ve probably learned that having more of something, or owning bigger things, is not necessarily better. This applies to the food you put in your mouth, your workload, and even the literal stuff you own, as our home organizer and savior Marie Kondo taught us religiously: only keep things that “spark joy”. History and mythology echo this sentiment, with numerous examples where smaller or seemingly weaker contenders outperform their larger and more powerful opponents under specific circumstances, such as the famous biblical story of David and Goliath.

Surprisingly enough, this principle is also important and relevant to the realm of Machine Learning/Deep Learning, and more specifically, to natural language processing (NLP) tasks.

Although large text-generative models (and, more generally, all the “generative AI” models) have recently captured the eyes and ears of the global community for their impressive performance on a wide range of general tasks, these models may not always be the optimal choice for highly specific (or “niche”) tasks. In fact, much (much) smaller NLP models that are specifically designed for these tasks can match or even outperform larger models under certain conditions, particularly when they are paired with high-quality, dedicated datasets.

In this post, I will explore with you some of the reasons behind this phenomenon, and shed some light on other important aspects of why smaller models can be a big deal when it comes to getting the job done.

Faster, cheaper, greener

Some advantages of smaller language models are quite obvious and “technically dry”, but still highly important: faster training, lower training (and inference) costs, and lower energy consumption (and thus a smaller carbon footprint).

But how much of a difference can it make?

Well, before we dive into this, we need to talk about the elephant in the room, or more specifically, the term that is mentioned in each and every blog post excitedly announcing the release of a new and promising generative language model, and one that many people don’t fully understand: parameters.

Parameters in a language model are the learnable weights and biases that the model uses to make predictions, and the ones that get adjusted during training to improve those predictions. These parameters are crucial, since they determine the model’s capacity to learn complex linguistic patterns and relationships in the data. Having said that, more parameters also mean significantly increased computational requirements, memory usage, and energy consumption during training and inference.
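
To make this concrete, here is a minimal sketch that counts the trainable parameters of two small, publicly available checkpoints. It assumes you have the Hugging Face transformers library (with a PyTorch backend) installed; the exact counts may differ slightly depending on the checkpoint.

```python
# A minimal sketch: counting trainable parameters of two public checkpoints.
# Assumes the Hugging Face `transformers` library (PyTorch backend) is installed.
from transformers import AutoModel

def count_parameters(model_name: str) -> int:
    """Load a pretrained checkpoint and sum its trainable parameters."""
    model = AutoModel.from_pretrained(model_name)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for name in ["bert-base-uncased", "t5-small"]:
    n = count_parameters(name)
    print(f"{name}: ~{n / 1e6:.0f}M parameters")

# Expected ballpark: bert-base-uncased ~110M, t5-small ~60M.
```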

Let’s compare some numbers so we can understand what we’re really talking about, using two distinct but rather famous language models: BERT, a famous 2018 model from Google that is considered to this day one of the most powerful smaller language models, and GPT-3, one of OpenAI’s most famous language models, which showed remarkable abilities in many linguistic tasks, certainly for the time it was released.

Well, GPT-3 has no fewer than 175 billion parameters, while BERT-base has 110 million. To understand this difference even better, let’s say we used a single strong training machine such as an NVIDIA V100 GPU to pre-train these models: while BERT-base would take roughly ~9 days to complete the whole training process, GPT-3 on the same machine would take… hold tight… roughly 355 GPU-years.

I will let you go wild guessing the number of machines you would actually need in order to train the latter in a reasonable time (and the energy such machines would consume), but for the sake of a simple cost comparison, let’s stay in our imaginary 355-years-vs.-9-days world: if we take Azure’s “Windows Virtual Machines Pricing” page as an example, 9 days of using such a machine for BERT training will cost you ~$600–700, while for GPT-3, or a model of roughly the same size, it would cost you, well… $140,651! (but with a savings plan, guys! Let’s go for it, you only live 355 GPU-years once).
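
If you want to play with these back-of-the-envelope numbers yourself, a few lines of arithmetic are enough. The hourly rate below is a placeholder I made up for illustration, not an actual Azure quote (and at a flat on-demand rate the GPT-3 total comes out far higher than the savings-plan figure above); the point is the relative difference in GPU-time, not the exact dollar amount.

```python
# Back-of-the-envelope comparison of single-GPU training time (and therefore cost
# at any fixed hourly rate). The 9-day and 355-GPU-year figures are the rough
# estimates quoted in the text; the hourly rate is a placeholder, not an Azure quote.
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

bert_gpu_hours = 9 * HOURS_PER_DAY        # ~9 days on one V100
gpt3_gpu_hours = 355 * HOURS_PER_YEAR     # ~355 GPU-years on the same V100

hourly_rate_usd = 3.00                    # hypothetical on-demand $/hour
print(f"BERT-base: {bert_gpu_hours:>12,.0f} GPU-hours  ~${bert_gpu_hours * hourly_rate_usd:,.0f}")
print(f"GPT-3:     {gpt3_gpu_hours:>12,.0f} GPU-hours  ~${gpt3_gpu_hours * hourly_rate_usd:,.0f}")
print(f"Ratio:     ~{gpt3_gpu_hours / bert_gpu_hours:,.0f}x more GPU-time")
```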

A cute illustration of me waiting patiently for my breakthrough 974-trizillion parameters model, GignaticliciousGPT, to train

Home safe home: the security perks of local storage for small NLP models

One often overlooked advantage of smaller NLP models is the increased security and privacy they offer due to their ability to be stored locally.

In contrast, larger models may regularly require storage on remote servers, or even more problematically — on remote servers which are 100% owned by another company (as is the case with OpenAI’s large models), due to their size. Smaller models, on the other hand, can oftentimes be stored directly on your own device, or on a more secure and more privately-owned remote server. This local storage approach has a few key benefits: it reduces the risk of unauthorized access to your sensitive data, as the data never leaves your device or your privately-owned server; and in the case of fully local storage — it also reduces the reliance on potentially unstable internet connections, ensuring that the performance of the model remains consistent even in the face of connectivity issues.
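
As a concrete illustration, here is roughly what that workflow looks like with a small open model, assuming the Hugging Face transformers library (the same idea applies to any framework that can serialize weights to disk): download the model once, save it locally, and from then on load it without any network access. The model name and local path below are just examples.

```python
# A minimal sketch of the "local storage" workflow with a small open model,
# assuming the Hugging Face `transformers` library. Any framework that can
# serialize weights to disk works the same way in principle.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

local_dir = "./models/distilbert-sst2"   # hypothetical local path

# One-time download (requires internet), then persist everything to disk.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
AutoTokenizer.from_pretrained(model_name).save_pretrained(local_dir)
AutoModelForSequenceClassification.from_pretrained(model_name).save_pretrained(local_dir)

# From now on, load strictly from the local copy: no data or weights leave the machine.
tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForSequenceClassification.from_pretrained(local_dir, local_files_only=True)

inputs = tokenizer("This stays on my own hardware.", return_tensors="pt")
print(model(**inputs).logits)
```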

“Jack of all trades, master of none” — the downside of generalization

Higher generalization capabilities, in the context of NLP, refer to language models that can perform a wide range of language tasks with rather high accuracy. These models can assist us with translation, sentiment analysis, and sometimes, as we’ve seen lately, with really impressive generative capabilities. Usually, these models are pre-trained on vast amounts of text data, which allows them to capture a broad range of linguistic patterns and relationships and apply their acquired “general knowledge” to any task you throw at them.

These generalization capabilities, though, do come with major costs. One of them is the risk of becoming a “jack of all trades, but master of none”: since these models try to learn everything, they can sometimes struggle to excel in very niche tasks, as there is simply too much information for the model to process effectively.

Even more interestingly, if such ridiculously large models are fine-tuned on a very specific niche task, their sheer number of parameters could quite possibly result in overfitting: the scenario in which a model becomes too specialized to its training data and fails to generalize well to new, unseen data, impairing not only its “general knowledge” power but also, even more so, its performance on the very niche task it just learned.

To bring all these complex concepts to life, let’s explore a peculiar, yet informative example.

Imagine, for some reason, you need a model to excel in the highly specific (and rather weird, if I’m honest) task of understanding and explaining queer lingo in Hebrew. This task is extremely niche, considering:

  1. Hebrew is not a widely spoken language, with only ~10 million speakers worldwide (a bit more than 0.1% of the world population).
  2. LGBTQ-identifying individuals represent a small minority, around ~5%-10% of the population, at least in the United States (but probably similarly in Israel, where people speak Hebrew).
  3. Not all LGBTQ individuals are familiar with or proficient in queer lingo.

Combine all these statistical factors together, and you get an incredibly niche task for a language model, but that’s EXACTLY what I asked ChatGPT to do, because… well, why not?

When I asked it about the word “ohch” (אוחצ׳), which can be simply defined as either an effeminate gay man or a colloquial term similar to “sis” in queer English (as in “sis, what’s the tea?”), it told me that… well, “ohch” is actually military slang(!) derived from the Hebrew word “ucaf” (אוכף, meaning saddle; yes, the one that is fastened to horses). It then continued to “explain” that “ohch” expresses the feeling of being tired or humiliated, and is used to describe soldiers who are weary or lacking energy.

To put it bluntly, ChatGPT failed. Miserably.

This is not surprising; as explained before, the power of ChatGPT lies in its generality rather than in excelling at very specific (and sometimes bizarre) tasks. However, we can only speculate that even a very small language model, such as T5-small (a 60 million parameter model), fine-tuned on the ultra-dedicated “Even-Shoshana” online dictionary of Hebrew queer lingo, could potentially perform better. Much, much better.
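
Just to make that speculation a bit more tangible, here is a rough sketch of what such a fine-tune could look like, assuming the Hugging Face transformers and PyTorch libraries and a tiny, hypothetical list of (term, definition) pairs standing in for the real dictionary entries. This is an illustration of the approach, not a recipe I have actually run against that dataset.

```python
# A rough sketch of fine-tuning t5-small on a tiny, niche "term -> definition" task.
# The dataset here is a hypothetical toy list; in practice you would load the real
# dictionary entries. Assumes `transformers` and `torch` are installed.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Toy placeholder examples; real entries would come from the dictionary.
pairs = [
    ("define slang term: ohch", "an effeminate gay man; also used like 'sis'"),
    ("define slang term: tea", "gossip or the latest news"),
]

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(3):                     # a few epochs are often enough for tiny data
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.eval()
ids = model.generate(**tokenizer("define slang term: ohch", return_tensors="pt"), max_new_tokens=32)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```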

An authentic illustration of the origins of an Israeli ohch (אוחץ׳), according to ChatGPT

Quick to adjust: the versatility of smaller models

Smaller models have another advantage that is often quite underestimated: high adaptability.

Because of their simpler architecture and lower parameter count, these models can be more easily adapted to new or changing tasks within a specific niche. Their light weight, simplicity, and higher “fine-tuning readiness”, so to speak, allow for quicker and more convenient fine-tuning and optimization, making them a more versatile choice for businesses or researchers that need to pivot quickly in response to new challenges or opportunities.
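
For example, repurposing the same compact backbone for a brand-new classification task is essentially a one-liner: load it with a fresh task head and fine-tune from there. The sketch below assumes the Hugging Face transformers library, and the label names are purely hypothetical.

```python
# A sketch of how quickly a small pretrained encoder can be repointed at a new task:
# the same backbone gets a freshly initialized classification head, ready to fine-tune.
# Assumes the Hugging Face `transformers` library; label names are hypothetical.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,                                  # e.g. negative / neutral / positive
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2},
)
# From here, a short fine-tuning run on your niche dataset (as sketched earlier)
# is usually all it takes to pivot the model to the new task.
```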

Smaller but just as efficient: the best of both worlds

Yup, I went there.

A recently published 2023 paper [1] demonstrated that several language models, when fine-tuned with task-specific data for applications such as summarization and sentiment analysis, often produced better or highly similar results compared to non-fine-tuned (“zero-shot”) ChatGPT and GPT-3.5. Another paper, from 2020 [2], showed that ALBERT-base (an “only” 12-million-parameter model), a much lighter version of BERT-base (a 110-million-parameter model, i.e. roughly 10x its size), achieved almost identical performance to BERT-base on a question-answering task, obtaining an 80.0 F1 score compared to BERT-base’s 80.4.

In essence, we can draw two key conclusions from these findings:

  1. Even highly impressive language models like ChatGPT, which outperform many other models on specific tasks, can be surpassed when these other (and sometimes smaller) models are specifically fine-tuned for the same tasks, using dedicated datasets.
  2. Smaller models can achieve similar or possibly better results compared to their larger counterparts, even when both are fine-tuned on the same data or task, if the architecture fits the task at hand and the preparation of data is done correctly.

Taken together, these conclusions emphasize the potential of smaller models fine-tuned on specific tasks to maintain or even surpass the performance of their larger (sometimes much larger) counterparts, all while preserving the advantages that we discussed throughout this post.

In conclusion …

While we can’t deny the “cool factor” of the latest colossal language models making headlines, like OpenAI’s GPT-4 and Google’s Bard, it’s crucial not to undervalue the potential of (truly excellent) smaller models in our everyday, slightly less glamorous lives. These fun-sized powerhouses offer major benefits, such as increased security and data privacy, adaptability, and in some cases, equal or even superior performance on niche tasks.

By using the power of smaller models and fine-tuning them for the specific tasks you need these linguistic virtuosos to handle, you can quickly and locally create efficient and effective NLP solutions tailored to your personal and professional needs, without desperately relying on remote solutions and/or gazillion-parameter models.

[1] Qin C., Zhang A., Zhang Z., Chen J., Yasunaga M., & Yang D., Is ChatGPT a General-purpose Natural Language Processing Task Solver?, arXiv preprint arXiv:2302.06476, 2023. https://arxiv.org/abs/2302.06476

[2] Lan Z., Chen M., Goodman S., Gimpel K., Sharma P., & Soricut R., ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR, 2020. https://arxiv.org/abs/1909.11942
