How Good is this LLM?
Testing the Efficacy of Large Language Models with a Prompt Suite
Large Language Models (LLMs) are everywhere, and new ones seem to emerge daily. They are becoming increasingly integrated into everything from chatbots to coding assistants. These models are incredibly versatile, capable of generating human-like text, solving logic puzzles, translating languages, and even writing code.
While these models often come with impressive benchmark scores, as a real-world user you might find it hard to relate those numbers to your own needs. That's why it's better to have your own set of "smoke tests" to evaluate their effectiveness. Here's my approach, offered as a reference. Most of these are not the general tests you'll find elsewhere (apart from the first one or two below); they are quirky and specific to my workflow. Feel free to build your own, based on what you need.
General Knowledge & World Facts
LLMs are trained on vast datasets that cover everything from basic trivia to advanced subjects. Testing general knowledge is an essential way to determine whether the model has learned these facts well and can retrieve accurate information when needed.
- "What is the capital of Maharashtra?"
- "Who wrote the 'Mahabharata'?"