How to decide which LLM to use for an application

Edgar · about ai · Jul 3, 2024

Learn about the differences between types of LLMs to navigate the tradeoffs and choose the best LLM for your application.

Figure: training an instruction-tuned (instruct) and a chat-based LLM. Adapted from "LLMs: a primer and gastroenterology applications", 2024.

Once you identify a problem that is suitable to tackle with a Large Language Model (LLM) and build a proof of concept (PoC), you need to decide which LLM is a good choice for your problem. In the PoC stage you usually want to demonstrate that the problem you identified can be solved with an LLM, and for that it is better to start with a large foundational model. However, in production you might not want a foundational model because of its higher latency or cost. In this post I summarize the factors you might want to consider when deciding which LLM to choose based on its size and type.

LLM parameter sizes

The size of an LLM is determined by the number of parameters in its transformer network. The parameters are the values that are "learned", or adjusted, during training. In theory, the larger the number of parameters, the more complex the relationships in language the model can "understand". Parameter counts range from millions to trillions. For example, GPT-4 is reported to have around 1.7 trillion parameters, while the largest Llama 2 model has 70 billion. Models also come in "families" with versions of different sizes; for example, Llama 2 7B and Llama 2 70B denote Llama 2 with 7 and 70 billion parameters, respectively.

In summary, you will often find that a large foundational model gives better results than a smaller version. For example, Llama 3 70B would be better than Llama 2 7B. However, the hardware you need to load 70 billion parameters onto GPUs can be significantly more expensive than what a 7B model requires. The question then is: can you solve your task with the smaller version, and if so, how? You can read more about prompt engineering here and fine-tuning here.
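
To make the hardware gap concrete, here is a back-of-the-envelope memory estimate as a minimal Python sketch. It assumes inference with 16-bit weights (2 bytes per parameter) and ignores the extra memory needed for activations and the KV cache, so real requirements are higher:

```python
# Rough GPU memory needed just to hold the model weights.
# Assumes 16-bit precision (2 bytes per parameter); activations,
# KV cache, and framework overhead come on top of this.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory in GB to store the weights alone."""
    return n_params * bytes_per_param / 1e9

for name, n_params in [("Llama 2 7B", 7e9), ("Llama 2 70B", 70e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB at 16-bit")

# Llama 2 7B:  ~14 GB  -> fits on a single 24 GB consumer GPU
# Llama 2 70B: ~140 GB -> needs several high-end GPUs or quantization
```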

Instruct vs Chat LLM types

Another way to distinguish LLMs is by their purpose. Instruct and chat LLMs are designed for different uses and have distinct functionalities and interaction styles. Depending on these differences, you might choose one or the other according to the task you are trying to solve.

Instruct LLMs:

  1. Purpose: Primarily used for executing specific instructions and tasks, often in a single-turn interaction.
  2. Interaction Style: Direct and to-the-point responses based on given commands or queries.
  3. Usage Examples: Data analysis, summarization, content generation, code completion, and other specific task-oriented applications.
  4. Training: Focused on following explicit instructions provided by users.

Chat LLMs:

  1. Purpose: Designed for multi-turn conversational interactions, aiming to simulate human-like dialogues.
  2. Interaction Style: Engages in ongoing conversations, handling context, maintaining coherence over multiple exchanges, and adapting to the conversational flow.
  3. Usage Examples: Customer support, virtual assistants, interactive storytelling, and other applications requiring natural, engaging dialogues.
  4. Training: Emphasizes understanding and generating conversational context, maintaining personality and tone consistency throughout interactions.

In summary, instruct LLMs are tailored to executing specific tasks efficiently, while chat LLMs are optimized for engaging, coherent conversations over multiple exchanges. This also matters if you plan to fine-tune an LLM: it will arguably take more work to train a model to solve a task that was not part of its original purpose.
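
To make the difference concrete, here is a minimal sketch of how the inputs to the two model types typically differ. The prompt text and message contents are invented for illustration; what matters is the shape of the input:

```python
# Instruct-style input: one self-contained instruction per request,
# typically answered in a single turn.
instruct_prompt = (
    "Summarize the following support ticket in one sentence:\n"
    "Ticket: The export button crashes the app on files over 2 GB."
)

# Chat-style input: a role-tagged message history. The earlier turns
# are resent with every call, which is how the model "remembers"
# the conversation and keeps its tone consistent.
chat_messages = [
    {"role": "system", "content": "You are a concise support agent."},
    {"role": "user", "content": "The export button crashes my app."},
    {"role": "assistant", "content": "Which file sizes trigger the crash?"},
    {"role": "user", "content": "Anything over 2 GB."},
]
```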

Custom vs off-the-shelf

Finally, you might need to decide whether to use an off-the-shelf (pre-trained) model or a custom (fine-tuned) one. The answer might depend on model performance, cost, or latency.

Usually, as mentioned above, the initial solution is built with a pre-trained foundational model, because it is easier to prototype that way than to fine-tune and optimize a model for your particular task. To help you decide, you first need to evaluate how good your model is at solving the task. This involves defining a metric and a dataset that you can use to evaluate the model.
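
As an illustration, here is a minimal evaluation harness in Python. The toy dataset, the containment-based matching, and the `generate` placeholder are all assumptions; swap in your own data, metric, and model call:

```python
# Toy evaluation set: (prompt, expected answer) pairs.
eval_set = [
    ("What is the capital of France? Answer in one word.", "Paris"),
    ("What is 21 + 21? Answer with the number only.", "42"),
]

def generate(prompt: str) -> str:
    # Placeholder: replace with a call to the model under evaluation.
    return ""

def match_rate(dataset) -> float:
    """Fraction of examples whose output contains the expected answer."""
    hits = sum(expected.lower() in generate(prompt).lower()
               for prompt, expected in dataset)
    return hits / len(dataset)

# Run the same harness against the large and the small candidate
# models and compare the scores before choosing one for production.
print(f"score: {match_rate(eval_set):.2f}")
```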

Another thing to calculate is how expensive the model is to run, which depends on how you host it: cloud-based or self-hosted. For cloud-based models, a provider like Amazon, Azure, GCP, or Hugging Face (among others) can offer a serverless solution in which an API is served and authenticated; this is the simplest option, where you pick the configuration and the provider is in charge of administering the service. Alternatively, you can run your own server and deploy the model onto it. Finally, you might decide to serve the model locally, even offline, using tools like Ollama or llama.cpp.
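
For example, a locally hosted model served with Ollama exposes a local HTTP API. This minimal sketch assumes Ollama is running (`ollama serve`) and a model such as `llama2` has already been pulled:

```python
import requests

# Single, non-streaming completion against a local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello in one word.", "stream": False},
    timeout=60,
)
print(resp.json()["response"])
```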

Finally, you might decide what type of model to use based on latency. You may find that getting the model's output over the internet is too slow for your task, or that producing an output for a large input (context) takes too long. In that case, consider a smaller model or a locally hosted one.
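
A simple check is to time a handful of calls against each candidate setup and compare the spread rather than a single measurement. In this sketch, `call_model` is a hypothetical stand-in for whichever client (cloud API or local endpoint) you are comparing:

```python
import statistics
import time

def call_model(prompt: str) -> str:
    return ""  # placeholder: swap in the client you want to time

latencies = []
for _ in range(10):
    start = time.perf_counter()
    call_model("Summarize this paragraph: ...")
    latencies.append(time.perf_counter() - start)

print(f"median: {statistics.median(latencies):.3f}s, worst: {max(latencies):.3f}s")
```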

Summary

When you first solve a task with an LLM, usually for a PoC, you use a large foundational model. However, when you want to put such a solution into production, you need to consider other constraints, such as latency and cost, and decide whether a smaller model will do. Here I have summarized some of the factors to consider when deciding which LLM to use instead of the large foundational model.

