Comprehensive guide for model selection and evaluation
Selecting the most appropriate foundation model for your needs requires navigating a matrix of capabilities, customizations, constraints and costs. The space is expanding rapidly, with both open source and proprietary providers offering an array of pre-trained models that employ specialized techniques, from distillation to reinforcement learning, across modalities like text, code, voice and vision.
This blog offers guidance on model selection and evaluation to help you navigate that landscape.
Model Selection process
Selecting the optimal language model requires careful evaluation of multiple criteria and the trade-offs amongst them. As you consider options, take into account the details of your specific use case, domain data, customization requirements, context size and so on.
A thorough model selection checklist could incorporate:
Functionality
- Use case: Start by identifying the precise types of tasks the model will need to execute, spanning text generation capabilities like text classification, summarization, Q&A, named entity recognition (NER), chat functionality, enterprise search and other content generation tasks. Also consider needs for code generation (e.g. SQL creation), voice-to-text generation for meeting transcripts, closed caption generation for videos and more. Models have varying strengths across these domains.
- Context Window Size: The context window is the maximum number of tokens a model can handle across the input and the completion, and it determines how much context you can pass into a single model invocation. Tokenization varies by model, but for many models a good English-language approximation is a tokens-to-words ratio of 4/3 (i.e. one token is roughly 0.75 words); a minimal sketch of this estimate follows this list. Models with larger context windows can understand and generate longer sequences of text, which is useful for tasks involving longer conversations or documents. For example, Anthropic Claude 2.1 on Amazon Bedrock supports a 200K context window.
- Required Modality support: Single or multiple. Modality refers to the different types of input and output data a model handles. Single-modality models specialize in use cases like summarization, Q&A, etc. Multi-modal models accept multiple types of data (text, image, audio) as input and can link semantic meaning across these data types.
- Language support: While English is the primary language for the majority of models, there are models trained on other languages.
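As a quick illustration of the tokens-to-words rule of thumb above, here is a minimal Python sketch. The ratio is only an approximation, and the budget numbers are placeholders; for exact counts use the tokenizer that ships with your chosen model.

```python
# Rough token estimate using the ~0.75 words-per-token rule of thumb.
# Real counts vary by model and tokenizer.
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    return int(len(text.split()) / words_per_token)

# Hypothetical budget check against a 200K context window.
CONTEXT_WINDOW = 200_000

def fits_in_context(document: str, prompt_overhead: int = 1_000,
                    completion_budget: int = 4_000) -> bool:
    """True if the document plus prompt and completion budget fits the window."""
    return estimate_tokens(document) + prompt_overhead + completion_budget <= CONTEXT_WINDOW
```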
Training and Customization
- Evaluate data: What type of dataset does your use case require (general purpose or domain specific)? Where the model provider publishes it, evaluate what type of training data the model was trained on.
- Fine-tuning support & customization: Customizing a model to fit your domain and use case is a critical step in improving its performance. Fine-tuning a pre-trained model adapts it to your domain's functions. However, not all models are fine-tunable, and even when fine-tuning is an option, you may not be able to pick the fine-tuning strategy. Both Amazon Bedrock and Amazon SageMaker support fine-tuning of an LLM.
- Training Data: Wherever possible, review details from model creators on the datasets used to train the model, including the proportion of domain-specific versus more general data sources. Review whether the training data consisted of internet data, coding scripts, specialized instruction sets or multimodal mixes of text, imagery and speech. Understanding these origin datasets provides insight into real-world performance for your needs.
- Type of model: General purpose (pre-trained) models, models instruction-tuned for your domain-specific tasks, and RL-tuned models.
Deployment, Performance & Scalability
- Hosting type: Self-hosted or model as a service. This depends on the models available as a service, your use case and your access patterns. The decision can also hinge on whether your organization has in-house AI and machine learning talent and whether self-hosting aligns with strategic priorities. Many find a hosted model service more efficient when it comes to time to market. Amazon Bedrock provides a model-as-a-service option, and with Amazon SageMaker you can build, train and deploy ML models at scale. Both offerings provide access to a wide range of models (first-party, third-party and open source).
- Performance: Quality of the response and technical aspects (first token latency, end-to-end latency, streaming support, throughput). For more details on model evaluation, refer to the section below.
Cost
- Cost: Infrastructure and software requirements to host the model. Compare the cost per unit of throughput. For model-as-a-service utilization patterns, estimate the average size of a response in tokens and the number of requests over a specific time period; a back-of-the-envelope sketch follows this list.
- Pricing model: Hosted models are typically priced based on input tokens and completions. Evaluate your use case and volume to arrive at a pricing estimate.
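To make the pricing discussion concrete, below is a minimal sketch of a back-of-the-envelope monthly estimate for a token-priced hosted model. The traffic numbers and per-1K-token prices are placeholders, not any provider's actual rates.

```python
def monthly_token_cost(requests_per_day: int,
                       avg_input_tokens: int,
                       avg_output_tokens: int,
                       price_per_1k_input: float,
                       price_per_1k_output: float,
                       days: int = 30) -> float:
    """Back-of-the-envelope monthly cost for a token-priced hosted model."""
    requests = requests_per_day * days
    input_cost = requests * avg_input_tokens / 1000 * price_per_1k_input
    output_cost = requests * avg_output_tokens / 1000 * price_per_1k_output
    return input_cost + output_cost

# Placeholder traffic and prices -- substitute your provider's actual rates.
estimate = monthly_token_cost(requests_per_day=10_000,
                              avg_input_tokens=1_500,
                              avg_output_tokens=300,
                              price_per_1k_input=0.003,
                              price_per_1k_output=0.015)
print(f"Estimated monthly cost: ${estimate:,.2f}")  # $2,700.00 with these placeholders
```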
Governance
- License type: Depending on whether you have access to the model weights, the training data and the license terms, a model can be categorized as open source, open model or proprietary. With open source models you know the model weights and the dataset the model was trained on. Proprietary models do not provide this information, and the model weights are considered intellectual property; their providers typically offer hosted solutions priced per token. Some models do give access to their weights but come with restrictive licensing terms; these are deemed "open models", and whether they count as open is a rather controversial topic.
- Licensing conditions: It's important to note that some models are open source but can't be used for commercial purposes, due to licensing restrictions or conditions. For example, Llama requires a license from Meta if your application has more than 700 million monthly active users.
- Data Privacy: Determine what kinds of personal or sensitive data the model was trained on or exposed to. It is a good practice to establish guardrails that remove any PII references from the model output before it is shared with end users. In addition, evaluate policies around data sharing with model providers. Amazon Bedrock uses the concept of an escrow account, which clearly segregates access boundaries and ensures that the intellectual property of both model providers and model consumers is protected.
- Ethical & Responsible AI considerations: Evaluate model cards, where available, to review what considerations were made when the model was trained. Assess how risks were evaluated, what actions were taken to address issues, whether biases (gender, race, age, ethnic groups, etc.) were evaluated, and what the mitigation strategies were.
What is missing from the above list is "parameters". While parameter count used to be a factor, and large models (with more parameters) were considered better, studies have shown that smaller models trained for specific tasks can outperform larger models.
Depending on your use case and organizational needs, you might have additional criteria for model selection. The above gives a solid framework to start your selection process.
Model Evaluation process
Building and maintaining a comprehensive model evaluation process requires effort. You can start with leaderboards to review models' published metrics by task. For a more comprehensive evaluation process, you need to define three items:
- Scope of evaluation: Does it cover the full system, including prompt templates, contextual data and model parameters in addition to the model, or just the model itself?
- Evaluation criteria: This defines the datasets you would use for evaluation, the tasks you would evaluate and the metrics you would collect. It is important to evaluate metrics across model aspects: quality (e.g. accuracy), technical metrics (e.g. latency) and social aspects (e.g. bias, robustness and toxicity).
- Method of evaluation: Depending on the dataset size and the scope of your use case, you might want to automate your evaluation process or add a human in the loop to perform the evaluation. If you expect to repeat the evaluation multiple times during the project, it is a good practice to automate it.
Evaluation components
When it comes to model evaluation, there are different evaluation components.
Datasets: This is a key criterion for your evaluation. You would either leverage an available open source dataset, augment it with your domain-specific dataset, or curate a dataset for the evaluation.
- General purpose — GLUE, SuperGLUE, ExpertQA
- Different categories — World knowledge, Commonsense reasoning, Language understanding, Problem solving, Reading comprehension, Code development
- Domain specific — Legal contracts (e.g. CUAD), Medical Q&A etc.
Evaluation Tasks: The task(s) you would evaluate and collect metrics for. These could range from text generation and classification to code generation, Q&A, etc.
Metrics: Metrics are quantitative measurements of the LLM outputs. Depending on the type of task, you would use a combination of metrics. Below is an overview of the metric families; refer to the section "Guidance for metrics selection" for high-level guidance on the type of metric by task.
- Classical ML metrics: Precision, accuracy, F1, AUC, etc.
- Similarity metrics: ROUGE, BLEU, METEOR, Word Error Rate, Exact Match. These metrics compare the generated text with a reference text and measure how similar they are based on the number of overlapping words and their order, without considering semantic meaning.
- Embedding-based metrics: Instead of comparing raw reference text, the vector representations (embeddings) are compared, so paraphrases can still score well; a short sketch contrasting this with an N-gram metric follows this list.
- Metrics for LLM-based evaluation: LLMs are good candidates for evaluating open-ended responses. There have been attempts to score responses using the probabilities of the generated tokens; GPTScore and G-Eval are examples of this.
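As an illustration of the difference between N-gram similarity metrics and embedding-based metrics, here is a minimal sketch assuming the Hugging Face evaluate library with the rouge_score and bert_score backends installed.

```python
# pip install evaluate rouge_score bert_score
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# N-gram overlap metric: rewards shared words and word order, ignores semantics.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# Embedding-based metric: compares contextual token embeddings, so paraphrases
# that use different words can still score highly.
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```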
Benchmark Frameworks: These evaluate LLMs based on task types (e.g. summarization, Q&A). There are multiple benchmark frameworks available in the open source world to leverage and extend for your use case. They help standardize comparison across models. It should be noted that they are not comprehensive or domain specific.
- General purpose benchmarks: HELM, AlpacaEval, Google BIG-bench, Chatbot Arena. These can be used for common-sense reasoning, generative content and Q&A.
- Domain specific benchmarks: e.g. TRUSTGPT (toxicity, bias), EmotionBench, MultiMedQA, etc., for use cases such as generation of legal contracts and medical Q&A.
Please refer to section “References to Benchmark frameworks and Leaderboards” below for more benchmarks.
Leaderboards: Provide living benchmarks across different tasks, base models and fine-tuned models. They typically use benchmark frameworks for their evaluation.
Evaluation Target: The target of evaluation could be the language model alone or the whole system. For system evaluation we evaluate the items that make up the system, such as model parameters, the prompt template, the context and the prompt question, besides the model itself.
Evaluation by: How the evaluation is performed. Human, automated, LLM-based or a hybrid approach combining these.
These components are interconnected: the evaluation target, datasets, tasks, metrics and evaluation method together define the overall evaluation process.
Guidance for metrics selection
The following provides high-level guidance on selecting metrics for evaluation.
- For general purpose tasks (no domain-specific evaluation), use open source benchmarks. There can be differences among implementations of a benchmark by different frameworks; for example, Hugging Face found that the implementation of MMLU differs among HELM, EleutherAI and the original implementation.
- For classification (e.g. document categorization), sentiment analysis and extractive answers, use classical metrics like precision, accuracy, F1 and recall; a minimal sketch of these, together with rank correlation, follows this list.
- For extractive tasks (e.g. extractive summarization, short answers) where you have ground truth data, use BLEU and ROUGE scores. Note, however, that these metrics are based on N-gram comparisons rather than semantic meaning, so they are a poor fit if you are not looking for a near-exact match with the ground truth. If you want to match meanings rather than exact words (e.g. an abstractive summary vs. an extractive summary), use semantic comparison metrics like BERTScore, which compares the semantic similarity between the tokens of the ground truth and the LLM response.
- For open-ended tasks (e.g. generative tasks) and tasks without ground truth, use another LLM for evaluation. Studies show that LLMs exhibit positional bias, verbosity bias or bias towards specific integers when used for scoring, and the evaluation is not idempotent: since they predict the next token, you may end up with different evaluation scores for the same input. If you have ground truth data, you can leverage semantic comparison with BERTScore or MoverScore. If your task is to improve the response over previous versions, leverage an LLM for evaluation.
- Use Pearson or Spearman correlation if your LLM is performing scoring or ranking tasks and you have ground truth or reference scores/rankings.
- For tasks requiring high quality, or where automated responses need validation, use human evaluation. This is an expensive and time-consuming process but improves quality.
- For tasks that generate structured output (e.g. XML, JSON, CSV) or code, the best way to evaluate is to programmatically process the model output, for example by parsing or executing it; a sketch of this for JSON output follows this list.
- For evaluations that involve the model's linguistic capabilities (e.g. the ability to generate coherent text), use metrics like perplexity and entropy.
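For the classification and scoring/ranking cases above, here is a minimal sketch using scikit-learn and SciPy. The labels and scores are made-up examples.

```python
from sklearn.metrics import precision_recall_fscore_support
from scipy.stats import spearmanr

# Classification-style task: compare predicted labels against ground truth.
y_true = ["billing", "support", "billing", "sales"]
y_pred = ["billing", "support", "sales", "sales"]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Scoring/ranking task: correlate model-assigned scores with reference scores.
reference_scores = [5, 3, 4, 1, 2]
model_scores = [4, 3, 5, 1, 2]
rho, p_value = spearmanr(reference_scores, model_scores)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")  # rho = 0.90 for this toy example
```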
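For structured output, evaluation can be as simple as trying to parse the response and checking its shape. A minimal sketch for JSON output follows; the required keys are hypothetical.

```python
import json

def validate_json_output(model_output: str, required_keys: set[str]) -> tuple[bool, str]:
    """Return (is_valid, reason) for a model response expected to be a JSON object."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError as err:
        return False, f"not valid JSON: {err}"
    if not isinstance(parsed, dict):
        return False, "expected a JSON object"
    missing = required_keys - parsed.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"

ok, reason = validate_json_output('{"sentiment": "positive", "confidence": 0.9}',
                                  required_keys={"sentiment", "confidence"})
print(ok, reason)  # True ok
```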
References to Benchmark frameworks and Leaderboards
Below is a curated list of various benchmark frameworks and leaderboards. This is not a comprehensive list, as the landscape changes over time.
- Stanford HELM: A very popular framework for evaluating foundation models. It provides assessment of models across various tasks (e.g. summarization, Q&A) and different scenarios (e.g. core scenarios, data imputation). It also includes scenarios from other benchmarks like HellaSwag, MMLU, etc., and supports both LLMs and text-to-image models. You can access the leaderboard and the GitHub repository for source code.
- lm-eval by EleutherAI: Provides a framework to evaluate models. It supports more than 200 tasks and works with Hugging Face transformer models as well as many proprietary APIs.
- OpenAI Evals: An open source framework for comparing responses from multiple LLMs. An eval can be described in a YAML file: you provide an input dataset, select a class that performs the eval (e.g. a matching eval) and the metrics (e.g. accuracy), and start the evaluation.
- Hugging Face Evaluate: https://huggingface.co/blog/evaluating-llm-bias
- Google BIG-bench: A comprehensive set of over 200 benchmarks to test the capabilities of a model across various tasks. These tasks include arithmetic operations, transliteration from the International Phonetic Alphabet (IPA) to assess the model's ability to manipulate and use rare words, and word unscrambling to analyze its proficiency with alphabets. You can find a detailed list of these benchmarks in the GitHub repository. Models like GPT-3 and LaMDA demonstrate significant improvement on these tasks, starting from near-zero performance and achieving emergent abilities at a certain scale.
- G-Eval: Uses an LLM to evaluate responses. It is done in three steps (a simplified LLM-as-judge sketch follows this list):
- Generate CoT: Provide a task introduction and evaluation criteria to an LLM and ask it to generate a chain of thought (CoT) of evaluation steps.
- Evaluate coherence: To evaluate a criterion such as coherence (e.g. for a news summarization task), concatenate the prompt, the CoT, the input (e.g. a news article) and the output (e.g. the summary), and ask the LLM to output a score between 1 and 5.
- Final score: Use the probabilities of the output score tokens to normalize the score and take their weighted summation as the final result.
- AI2 Reasoning Challenge (ARC): Designed as a benchmark dataset for advanced Q&A. It has about 7K questions split into an easy set and a challenge set.
- HellaSwag: Another challenge dataset, evaluating a model's ability to perform common-sense reasoning. It is prepared from WikiHow articles and ActivityNet captions and offers diverse, complex scenarios. It assesses how well a model can complete a sentence. https://arxiv.org/abs/1905.07830
- MMLU (Massive Multitask Language Understanding): Its dataset spans multiple domains (STEM, US history, law, humanities, social science, etc.) and 57 tasks. It assesses how well an LLM can multi-task. https://arxiv.org/abs/2009.03300
- TruthfulQA: Measures the truthfulness of a model when answering a diverse list of questions. It has 800+ questions from 30+ categories (e.g. "Can coughing stop a heart attack?"). https://arxiv.org/abs/2109.07958. The evaluation consists of two tasks: 1) Generation: the model is asked to answer a question in 1 or 2 sentences. 2) Multiple choice: the model must choose the correct answer from either 4 options or True/False statements.
- AlpacaEval: An evaluation framework for instruction-tuned models. It is an LLM-based automatic evaluation framework (using GPT-4 and Claude): the LLM evaluators measure the output of the models under evaluation and provide their scores. It is fast, cheap, replicable and validated against 20K human annotations.
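To make the LLM-as-judge idea referenced under G-Eval concrete, here is a simplified sketch. The call_llm function is a hypothetical placeholder for whatever SDK you use (e.g. a Bedrock or OpenAI client), the prompt wording is illustrative, and the sketch omits G-Eval's probability-weighted final score.

```python
import re

JUDGE_PROMPT = """You will be given a news article and a candidate summary.
Evaluation criterion: coherence (1-5). The summary should read as a
well-structured, connected whole rather than a heap of related facts.

Article:
{article}

Summary:
{summary}

Respond with a single integer between 1 and 5."""

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in your provider's chat/completion call."""
    raise NotImplementedError

def judge_coherence(article: str, summary: str) -> int | None:
    """Ask a judge LLM for a 1-5 coherence score; return None if the reply is unparseable."""
    reply = call_llm(JUDGE_PROMPT.format(article=article, summary=summary))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None
```

Because judge LLMs are not deterministic and carry the biases noted earlier, it helps to average several judge runs or validate a sample against human ratings.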
Conclusion:
This post has aimed to provide a framework spanning selection criteria, benchmark components, evaluation methods, metric choices and best practices to chart an effective course.
However, this is just the start of architecting responsible LLM-based AI assistants. In subsequent posts, we will explore LLM patterns covering content generation, application design and training aspects.
References:
A Survey of Evaluation Metrics Used for NLG Systems https://arxiv.org/abs/2008.12009
Evaluating Large Language Models: A Comprehensive Survey https://arxiv.org/pdf/2310.19736.pdf
Thank you for taking the time to read and engage with this article. Your support in the form of following me and sharing the article is highly valued and appreciated. The views expressed in this article are my own and do not necessarily represent the views of my employer. If you have any feedback and topics you want to cover, please reach me at https://www.linkedin.com/in/gopinathk/