Language Models: Task-Specific Models vs. LLMs APIs

Comparison of Solutions for Language Models Applications

Ian Fukushima
Blue Orange Digital
Jan 23, 2024


Large Language Models (LLMs) have opened the possibilities of AI to everyone. Natural language processing (NLP) tasks that were once achievable only by AI engineers with substantial computing resources are now a prompt away for virtually anyone. However, some applications might be better off using smaller, more specialized models. This article discusses three applications of language models, comparing what LLMs have brought to the table versus what fine-tuned task-specific models bring.

First, let’s define what is meant (in this article) by LLM and by task-specific models.

LLM Definition: An NLP model that can be prompt-engineered to perform different NLP tasks, notably classification, Q&A, style transfer, and summarization. It is also a capable conversational model, able to write code, perform machine translation, etc. These models have at least a billion parameters, typically more.

Task-Specific Model: A model designed for a specific NLP task, such as machine translation, summarization, or style transfer, with a narrow focus that does not extend beyond its designated task. These models are optimized for their specific task and do not exhibit the “emergent” capabilities observed in LLMs. Generally speaking, task-specific models are smaller in size and complexity, require less compute and data to train, and have been seamlessly integrated into various applications for a longer time.

The release of the GPT-3 API marks a transition point for industry applications of NLP. It allowed virtually anyone to access the capabilities of task-specific models (and more) via a simple API call plus some prompt engineering. Interest in improving processes or creating products with these techniques skyrocketed, and new LLM releases followed, including excellent open-source LLMs.

Now that we have these definitions, let’s jump to the applications. We will cover (1) chatbot applications (with RAG), (2) real-time text classification (moderation bot), and (3) style transfer (document standardization). For all applications we will discuss LLMs and task-specific models as defined above, but note that fine-tuning LLMs is also a possibility that is becoming more affordable.

ChatBot Applications

The key aspects to consider for chatbot applications are how broad the conversation can be and how much external knowledge is needed to provide answers. LLMs are exciting due to their ability to maintain open-ended conversations, write code, and reason about many topics. However, the factual knowledge they provide is often inaccurate (hallucinations), and they can struggle to answer domain-specific questions.

On the other hand, smaller models can be fine-tuned with domain-specific vocabulary and knowledge and used in conversations with a narrow scope. These models can provide more precise and relevant responses within that scope. Of course, this comes with a higher initial investment, but it can be more economical in the long run if the application has enough users, as exemplified in the graph below.

Cost evolution as number of requests increases: LLM vs Fine-Tuning.

However, LLMs can be “buffed” to alleviate these weaknesses. One approach to reduce hallucinations, and also to extend an LLM’s factual knowledge, is Retrieval-Augmented Generation (RAG). RAG works by providing external information to the LLM via the prompt, enhancing its ability to answer questions accurately with up-to-date information. Another approach is prompt engineering, which encompasses several techniques, such as the notable MedPrompt for Q&A.
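The RAG pattern described above can be sketched in a few lines. This is a minimal illustration, not a production system: the toy retriever here scores passages by word overlap with the question, where a real pipeline would use an embedding model and a vector store; all document text and function names are made up for the example.

```python
# Minimal RAG sketch: retrieve the most relevant passages, then prepend
# them to the prompt before sending it to the LLM.

def _words(text):
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(question, passages, k=2):
    """Return the k passages sharing the most words with the question."""
    q = _words(question)
    return sorted(passages, key=lambda p: len(q & _words(p)), reverse=True)[:k]

def build_rag_prompt(question, passages):
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Shipping to Europe takes 5 to 7 business days.",
]
prompt = build_rag_prompt("What is the refund policy?", docs)
# `prompt` would then be sent to the LLM completion endpoint.
```

Because the external knowledge lives in `docs`, it can be updated at any time without touching the model — which is exactly what makes RAG attractive for keeping answers current.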

Regarding costs, while LLMs generally incur a linear cost increase with usage, reflecting a pay-as-you-go model, fine-tuned models require an upfront investment. However, over time and with increased usage, these fine-tuned models can prove to be more cost-effective, especially for applications with high interaction volumes.
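The cost trade-off above can be made concrete with a back-of-the-envelope break-even calculation. All prices here (per-request API cost, hosting cost, fine-tuning cost, amortization window) are illustrative assumptions, not real vendor quotes.

```python
# Pay-as-you-go LLM API: cost grows linearly with the number of requests.
def monthly_cost_llm(requests, price_per_request=0.0005):
    return requests * price_per_request

# Fine-tuned model: a one-off fine-tuning cost (amortized over some window)
# plus a fixed monthly hosting cost, independent of request volume.
def monthly_cost_finetuned(requests, hosting_per_month=50.0,
                           upfront=2000.0, amortize_months=12):
    return hosting_per_month + upfront / amortize_months

def break_even_requests(price_per_request=0.0005, hosting_per_month=50.0,
                        upfront=2000.0, amortize_months=12):
    fixed = hosting_per_month + upfront / amortize_months
    return fixed / price_per_request

# With these assumed numbers the fine-tuned model becomes cheaper past
# roughly 430k requests per month.
```

The exact crossover point is sensitive to every assumption, but the shape of the comparison — a line with positive slope versus a flat line — holds regardless of the specific prices.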

The table below summarizes the comparison.

Summary of the comparison between LLMs and Fine-Tuned smaller models for chatbot applications.

Text Classification

Text classification has a wide range of applications dating back decades. A classical example is spam detection, which has long relied on classification models trained or fine-tuned on large datasets of spam and non-spam samples. The rise of LLMs, however, marked a significant shift in the landscape. The paper Language Models are Few-Shot Learners demonstrates that LLMs are capable of performing classification with very few examples, a technique referred to as few-shot learning, which lowers the barrier to entry for text classification tasks.
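Few-shot classification in practice amounts to putting labeled examples directly in the prompt rather than training anything. A minimal sketch, with made-up example messages and a hypothetical spam/not-spam labeling task:

```python
# Few-shot classification via prompting: labeled examples go in the
# prompt, and the LLM is asked to produce the label for the new input.

def few_shot_prompt(examples, text):
    lines = ["Classify each message as 'spam' or 'not spam'.", ""]
    for message, label in examples:
        lines.append(f"Message: {message}\nLabel: {label}\n")
    lines.append(f"Message: {text}\nLabel:")
    return "\n".join(lines)

examples = [
    ("WIN a FREE iPhone, click now!!!", "spam"),
    ("Can we move tomorrow's meeting to 3pm?", "not spam"),
]
prompt = few_shot_prompt(examples, "Congratulations, you won a prize!")
# `prompt` would then be sent to an LLM completion endpoint, and the
# model's next tokens read off as the predicted label.
```

Swapping in different examples or label sets requires only editing the prompt, which is the adaptability advantage discussed later in this section.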

This section discusses a particular form of text classification: content moderation. A tweet by @levelsio shows how feasible it has become to build a highly customizable moderation bot. For $5/month he has an excellent solution that covers 15,000 messages monthly — roughly $0.00033 per message. I assume he sends messages to the API in batches, because if we consider an average of 350 input tokens + 20 output tokens per call, 15,000 API calls would have (as of 2024–01–16) the following costs:

Cost of 15,000 API calls considering 350 input tokens + 20 output tokens in 2024–01–16.
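The arithmetic behind a table like the one above is straightforward to reproduce. The per-1k-token prices below are placeholders for illustration; actual API pricing varies by vendor and model, so plug in current rates.

```python
# Monthly API cost for the moderation workload described above:
# 15,000 calls, each with ~350 input and ~20 output tokens.

CALLS = 15_000
INPUT_TOKENS, OUTPUT_TOKENS = 350, 20

def monthly_api_cost(price_in_per_1k, price_out_per_1k):
    per_call = (INPUT_TOKENS / 1000) * price_in_per_1k \
             + (OUTPUT_TOKENS / 1000) * price_out_per_1k
    return CALLS * per_call

# e.g. with assumed prices of $0.0005 per 1k input tokens and
# $0.0015 per 1k output tokens:
cost = monthly_api_cost(0.0005, 0.0015)  # ≈ $3.08 for the month
```

Running the same function over each model's published prices reproduces one row of the cost table per model.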

For comparison, a very simple cloud hosting machine costs around $50 a month. One could host a fine-tuned BERT-family model for inference on such a machine. But the fine-tuning itself would need good examples of every behavior we want to ban.

Now let’s assume that the 15,000 messages are evenly distributed in time. In a 30-day month we have 43,200 minutes, which gives us 0.3472 messages per minute. Hosting a machine 24/7 for this sounds wasteful, but it would be justifiable if the number of requests increased — costs would stay fixed up to a certain load, in contrast with the LLM solution, whose costs scale linearly with the number of requests, as exemplified in the previous section’s graph.

Another key aspect to consider is adaptability. If we want to add or remove moderation policies, the LLM solution is easily adaptable via the prompt, while the fine-tuned solution would require further data labeling and model retraining. Furthermore, people might change their behavior, be it to bypass the moderation or simply because things change. Again, this means that from time to time a task-specific model needs to be updated, while for an LLM simply updating the prompt can solve the problem.

The diagram below provides a decision flow to consider when deciding between the options.

Decision Flow: LLM or task-specific fine-tuned model?

Automated Style transfer and Document Standardization

Style transfer is an exciting frontier in document processing. It consists of adapting the style and format of a text to match a predefined one. Models available on platforms like Hugging Face 🤗, such as those converting informal to formal text, lay the groundwork for these applications. LLMs, however, enable rapid application of various standards to text. This capability is transformative in fields such as legal document reformulation, engineering document standardization, corporate branding consistency, technical manual localization, etc.

In style transfer, the use of specific terminology and the need for domain adaptation are crucial. LLMs can address these needs via few-shot learning, making them an invaluable tool for quickly handling style transfer and document standardization tasks. For task-specific models, a robust dataset of “before and after” examples is vital. With such datasets, it is possible to fine-tune these models to achieve the desired style effectively. Moreover, domain adaptation techniques can further improve the usage of specific terminology and enhance the models’ grasp of the domain.
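Few-shot style transfer looks much like few-shot classification: the “before and after” pairs go straight into the prompt to show the LLM the target style. A sketch for an informal-to-formal task, with made-up example pairs:

```python
# Few-shot style transfer: example pairs in the prompt demonstrate the
# target style, and the LLM is asked to restyle the new input the same way.

def style_transfer_prompt(pairs, text):
    lines = ["Rewrite the input in the formal style shown by the examples.", ""]
    for before, after in pairs:
        lines.append(f"Informal: {before}\nFormal: {after}\n")
    lines.append(f"Informal: {text}\nFormal:")
    return "\n".join(lines)

pairs = [
    ("gotta fix this asap", "This issue must be resolved as soon as possible."),
    ("thx for the docs", "Thank you for providing the documentation."),
]
prompt = style_transfer_prompt(pairs, "can u send the contract over?")
# `prompt` would then be sent to an LLM completion endpoint.
```

The same skeleton adapts to any target standard — legal phrasing, corporate tone, a house style guide — simply by swapping the example pairs and the instruction line.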

Regardless of the solution, style transfer poses two specific technical challenges: maintaining semantic meaning and evaluating the style transfer.

Maintaining semantic meaning: Since these applications often involve important documents, maintaining 100% of the intent of the original is crucial. Some specialized models can be used to evaluate content preservation (e.g. Bi-Encoders and Cross-Encoders). Human review is always necessary in the initial iterations.
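The content-preservation check can be sketched as a similarity score with a threshold that routes low-scoring pairs to human review. For simplicity this toy version compares bag-of-words vectors; a real pipeline would replace `cosine_similarity` with a bi-encoder (a sentence-embedding model) scoring the original against the restyled text, but the thresholding logic is the same. The threshold value here is an illustrative assumption.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) \
         * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def flag_for_review(original, restyled, threshold=0.3):
    """Route pairs whose similarity falls below the threshold to a human."""
    return cosine_similarity(original, restyled) < threshold
```

In practice the threshold is calibrated on the human-reviewed pairs from the initial iterations, so the automated gate inherits the reviewers' standard.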

Evaluating Style Transfer: While test datasets abound for task-specific models, the same is not true for LLMs. Regardless, when dealing with novel documents, evaluating whether the generated text follows the predefined style is essential for confidence in the application. Specialized NLP models can be applied, alongside predetermined tests. Human review is also necessary in initial iterations before handing bulk style transfer jobs over to a model.

In general, while LLMs have brought great capabilities to this field, using them with precision at volume can be challenging and usually requires human feedback in initial iterations. Because of this, industrial-grade automation applications usually consist of more than one model to achieve a reliable pipeline with quality metrics.

Closing Thoughts

As we conclude, several key takeaways emerge from the comparison of Large Language Models (LLMs) and task-specific models:

  • Flexibility vs. Precision: LLMs offer broad flexibility, ideal for a range of tasks, while task-specific models excel in specialized applications.
  • Cost and Scalability: The choice between LLMs and task-specific models often depends on cost-effectiveness and scalability: LLM costs scale with usage, while task-specific models require an upfront investment. Either can offer long-term savings, depending on volume.
  • Adaptability: LLMs adapt quickly to new requirements, whereas task-specific models demand retraining for significant change.
  • Ethical Considerations: Both models come with ethical implications, including data privacy and bias, necessitating careful usage.
  • Evolving AI Landscape: The field of AI in language processing is rapidly advancing, promising more sophisticated closed- and open-source models, new training techniques, and new frameworks and infrastructure solutions. Fine-tuning LLMs is also becoming more affordable, and can take advantage of both worlds.

In summary, the choice between LLMs and task-specific models is nuanced, depending on specific needs and long-term goals. As AI technology evolves, so too will the strategies for its application, and this will continually shape the future of language processing.


Ian Fukushima
Blue Orange Digital

Machine Learning Engineer and Data Scientist. MSc in Applied Economics.