Building LLM Applications: Evaluation (Part 8)
Learn about Large Language Models (LLMs) through the lens of a Retrieval-Augmented Generation (RAG) application.
Posts in this Series
- Introduction
- Data Preparation
- Sentence Transformers
- Vector Database
- Search & Retrieval
- LLM
- Open-Source RAG
- Evaluation (This Post)
- Serving LLMs
- Advanced RAG
Table Of Contents
· 1. Overview
· 2. LLM Benchmarking vs. Evaluation
· 3. LLM Benchmarking
· 3.1. Language Understanding and QA Benchmarks
∘ 3.1.1. TruthfulQA
∘ 3.1.2. MMLU (Massive Multitask Language Understanding)
∘ 3.1.3. DROP
· 3.2. Common-sense and Reasoning Benchmarks
∘ 3.2.1. ARC (AI2 Reasoning Challenge)
∘ 3.2.2. HellaSwag
∘ 3.2.3. BIG-Bench Hard (Beyond the Imitation Game Benchmark)
∘ 3.2.4. WinoGrande
∘ 3.2.5. GSM8k
· 3.3. Coding Benchmarks
∘ 3.3.1. HumanEval
∘ 3.3.2. CodeXGLUE
· 3.4. Conversation and Chatbot Benchmarks
∘ 3.4.1. Chatbot Arena (by LMSys)
∘ 3.4.2. MT Bench
∘ 3.4.3. Language Model Evaluation Harness (by EleutherAI)
∘ 3.4.4. Stanford HELM
∘ 3.4.5. PromptBench (by Microsoft)
· 4. Limitations of LLM Benchmarks
· 5. LLM Evaluation Metrics
· 6. Different Ways to Compute Metric Scores
· 6.1. Statistical Scorers
∘ 6.1.1. Word Error Rate (WER)
∘ 6.1.2. Exact Match
∘ 6.1.3. Perplexity
∘…