LLM Evaluations Hub

Ali Shafique
4 min read · May 30, 2024

--

This repository provides a comprehensive list of evaluation tools for large language models (LLMs). It covers a diverse set of evaluation frameworks and performance metrics for assessing aspects such as relevance, accuracy, fluency, coherence, readability, coverage, and diversity of generated content.

An architecture for an LLM evaluation metric (image source).

1. datasets

  • The datasets library provides various general and NLP-specific metrics for evaluating the performance of your models; a short usage sketch follows the listing below.
  • Metrics
from datasets import list_metrics

# List the metric scripts bundled with the datasets library
metrics_list = list_metrics()
print(len(metrics_list))
print(metrics_list)
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']
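Once a metric is chosen from the list above, it can be loaded and computed directly. A minimal sketch with the rouge metric (note that load_metric has since been deprecated in favor of the separate evaluate library):

from datasets import load_metric

# Load one of the metrics listed above and score a toy prediction
rouge = load_metric("rouge")
predictions = ["the cat sat on the mat"]
references = ["the cat lay on the mat"]
print(rouge.compute(predictions=predictions, references=references))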

2. evaluate

  • Evaluate is a library designed to simplify and standardize the evaluation and comparison of models for text, computer vision, and audio. It provides three types of evaluations: metrics, comparisons, and measurements. You can also create new evaluation modules and upload them to a dedicated space on the Hugging Face Hub.
  • Metrics: Evaluate provides access to dozens of popular metrics covering text, computer vision, and audio applications, along with tools to evaluate models and datasets. Comparisons measure the difference between two models, and measurements evaluate the properties of datasets. [webpage]
  • [homepage]
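A minimal sketch of loading and computing a metric with Evaluate, using the accuracy metric as an example:

from evaluate import load

accuracy = load("accuracy")
# 3 of the 4 predictions match the references
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}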

3. ROUGE, BERTScore

  • These are standalone libraries for computing individual metrics on LLM outputs.
  • Metrics: ROUGE, BERTScore
  • [Rouge], [Bertscore]
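A minimal sketch using the standalone rouge-score and bert-score packages (exact APIs may differ slightly across versions):

from rouge_score import rouge_scorer
from bert_score import score

prediction = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

# ROUGE: n-gram overlap between reference and prediction
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, prediction))

# BERTScore: similarity of contextual token embeddings
P, R, F1 = score([prediction], [reference], lang="en")
print(F1.mean().item())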

4. Ragas

  • Ragas is a library for evaluating Retrieval-Augmented Generation (RAG) pipelines; a quick-start sketch follows below.
  • Metrics: Faithfulness, Answer Relevance, Context Precision, Context Relevancy, Context Recall, Context Entities Recall, Answer Semantic Similarity, Answer Correctness, and Aspect Critique.
  • [homepage], [quick start], [example_1]
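A minimal sketch following the Ragas quick start; the column names and metric imports shown here are assumptions that may vary between versions, and the LLM-based metrics require a configured judge model (OpenAI by default, so an API key):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Toy evaluation data with the columns Ragas expects
data = {
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair in Paris."]],
    "ground_truth": ["It was completed in 1889."],
}
dataset = Dataset.from_dict(data)

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)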

5. TruLens

  • TruLens is a library that lets you objectively assess the quality and effectiveness of your LLM-based applications using feedback functions. It integrates into LLM app frameworks such as LangChain and LlamaIndex: you install it, add a couple of lines to your app, and can then monitor any application and evaluate it with the model of your choice from a dashboard.
  • Metrics: Context Relevance, Groundedness, Answer Relevance, Comprehensiveness, Harmful or toxic language, User sentiment, Language mismatch, Fairness and bias, custom feedback functions you provide, and logging of human feedback on LLM performance.
  • [homepage], [Documentation], [Quickstart]
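A minimal sketch of wrapping a LangChain app with trulens_eval, loosely following its quickstart pattern; the tiny chain and app name are placeholders, and the API has shifted between releases:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI

# A tiny LangChain app to instrument (stand-in for your real chain)
chain = ChatPromptTemplate.from_template("Answer briefly: {question}") | ChatOpenAI()

tru = Tru()
provider = OpenAI()

# Feedback function: score answer relevance on every recorded call
f_answer_relevance = Feedback(provider.relevance).on_input_output()

tru_recorder = TruChain(chain, app_id="toy_app", feedbacks=[f_answer_relevance])

with tru_recorder:
    chain.invoke({"question": "What does TruLens do?"})

tru.run_dashboard()  # inspect records and feedback scores in the dashboard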

6. Deepeval

  • DeepEval is an open-source framework that simplifies building and evaluating LLM applications. It lets you easily “unit test” LLM outputs in a style similar to Pytest. With more than 14 LLM-evaluated metrics, synthetic dataset generation, and highly customizable metrics covering most use cases, DeepEval also supports real-time evaluations in production environments.
  • Metrics: G-Eval, Summarization, Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, Contextual Recall, Ragas, Hallucination, Toxicity, Bias
  • Benchmarking: BIG-Bench Hard, HellaSwag, MMLU (Massive Multitask Language Understanding), DROP, TruthfulQA, HumanEval, GSM8K
  • [homepage], [Documentation], [Blogs]
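A minimal sketch of a Pytest-style DeepEval test; the input/output strings are placeholders, and the metric uses an LLM judge (OpenAI by default), so an API key is needed. Run it with deepeval test run or plain pytest:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # Fails the test if the LLM-judged relevancy score is below the threshold
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [metric])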

7. LangSmith

  • LangSmith is designed to inspect, monitor, and evaluate your LLM application so that you can continuously optimize and deploy with confidence.
  • Metrics: simple heuristics, AI-assisted “LLM-as-judge” evaluators, and logged human feedback
  • [Quickstart]
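A minimal sketch of an offline evaluation run with the langsmith SDK, assuming a LangSmith API key is configured and a dataset named "my-dataset" already exists; the target function and the heuristic evaluator are hypothetical stand-ins:

from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Call your real LLM application here (placeholder implementation)
    return {"output": f"Answer to: {inputs['question']}"}

def not_empty(run, example) -> dict:
    # Simple heuristic evaluator: did the app return a non-empty answer?
    return {"key": "not_empty", "score": int(bool(run.outputs.get("output", "").strip()))}

results = evaluate(
    my_app,
    data="my-dataset",        # hypothetical dataset name in LangSmith
    evaluators=[not_empty],
)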

8. MLflow LLM Evaluate

  • MLflow provides an API, mlflow.evaluate(), to help evaluate your LLMs. It takes three main components: a model to evaluate, metrics, and evaluation data; a short sketch follows below.
  • Metrics: statistical metrics and model-based metrics, including answer similarity, answer correctness, answer relevance, faithfulness, ROUGE-L, toxicity, and custom LLM-based metrics
  • [Documentation]
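A minimal sketch of mlflow.evaluate() on a toy question-answering setup; the model here is a hypothetical Python function standing in for a real LLM call, and the default metrics pulled in by the model type depend on the installed extras:

import mlflow
import pandas as pd

eval_data = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "ground_truth": ["MLflow is an open-source platform for managing the ML lifecycle."],
})

def my_model(inputs: pd.DataFrame) -> pd.Series:
    # Placeholder for a real LLM call
    return pd.Series(["MLflow is an open-source platform for the ML lifecycle."] * len(inputs))

with mlflow.start_run():
    results = mlflow.evaluate(
        my_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",  # adds the default QA metrics
    )
    print(results.metrics)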

9. OpenAI Evals

  • OpenAI Evals offers a framework for evaluating large language models (LLMs) and systems built with them. It includes a registry of pre-existing evaluations to test various aspects of OpenAI models and allows you to create custom evaluations tailored to your specific use cases.
  • Metrics: there are two main ways to evaluate or grade completions: writing validation logic in code, or using the model itself to inspect the answer (model-graded evals), as sketched below.
  • [Getting Started]
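A conceptual sketch of the two grading styles described above, not the Evals API itself: a code-based check and a model-graded check (the judge model name is just an example):

from openai import OpenAI

client = OpenAI()

def exact_match_grader(completion: str, ideal: str) -> bool:
    # Validation logic in code: a deterministic comparison
    return completion.strip().lower() == ideal.strip().lower()

def model_grader(question: str, completion: str, ideal: str) -> str:
    # Model-graded eval: ask a model whether the completion matches the reference
    prompt = (
        f"Question: {question}\n"
        f"Submitted answer: {completion}\n"
        f"Reference answer: {ideal}\n"
        "Does the submitted answer agree with the reference answer? Reply YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content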

10. Evidently

  • Evidently assists in evaluating, testing, and monitoring data and ML-powered systems across a range of tasks. For predictive tasks, it supports classification, regression, ranking, and recommendations. For generative tasks, it handles chatbots, retrieval-augmented generation (RAG), Q&A, and summarization. It also offers data monitoring capabilities, including data quality and data drift assessments for text, tabular data, and embeddings.
  • Metrics: rule-based (detect specific words or patterns in your data), ML-based (use external models to score data, e.g., for toxicity, topic, or tone), and LLM-as-a-judge (prompt LLMs to categorize or score texts). [Metrics]
  • [Quickstart], [Tutorial], [Documentation]
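A minimal sketch of one of the monitoring checks mentioned above, a data drift report on a simple numeric descriptor such as response length; it follows the Report API of earlier Evidently releases, which later versions have restructured:

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference vs. current response lengths (toy data)
reference = pd.DataFrame({"response_length": [120, 95, 130, 110, 105] * 20})
current = pd.DataFrame({"response_length": [260, 240, 255, 270, 250] * 20})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")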

11. Amazon Mechanical Turk

  • Amazon Mechanical Turk is a crowdsourcing marketplace that allows individuals and businesses to easily outsource their tasks. It can be used to have humans evaluate model-generated responses based on the HHH (helpful, honest, harmless) alignment criteria.
  • Metrics: human feedback on model-generated responses
  • [homepage]

Useful Resources

  • Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices [medium].
  • Evaluation for Large Language Models and Generative AI — A Deep Dive [youtube].
