How to evaluate complex GenAI Apps: a granular approach

Yi Zhang
Published in Relari Blog
6 min read · Feb 27, 2024


A series of blog posts sharing our perspectives on how to evaluate and improve your GenAI application pipelines

(written by Yi Zhang and Pasquale Antonante at Relari.ai)

In our prior articles about RAG / LLM pipeline evaluation, we analyzed retrieval and generation as the two components of the pipeline. That two-step view, however, is an abstraction that simplifies many production RAG / LLM systems. Real-world pipelines typically contain many more “modules” and “steps” between a user query and the final system output than the simple two-step process.

Below is an example GenAI pipeline from one of our customers. As you can see, more than 10 modules process the data between the user input and the final module output (the image is blurred for confidentiality). LLM classifiers detect user intent and send queries down the right processing paths. Multiple types of retrievers produce the best context-retrieval results. LLM rerankers filter and compress the context before it is fed to the LLM. Agents call specific functions when necessary…

GenAI App pipeline from one of our customers

As you can probably imagine, just evaluating the final responses, or the two high-level components, is not enough to tell you what’s working and what’s not. If you want to understand what caused a poor system output, you need to trace back through multiple steps, inspect the intermediate outputs, and judge their quality. You might go through that trouble a few times to spot-check anecdotally, but it is almost impossible to analyze these large pipelines at scale by hand.

In this article, we show how to get more granular insights by tailoring auto-evaluators to measure and test each pipeline component, and we walk through a complex RAG example.

Why have pipelines become so complex?

But before we get there: why have AI application pipelines become so complicated? Aren’t LLMs supposed to be know-it-alls that just need to be prompted correctly to produce the right output?

It turns out that to use the full potential of an LLM’s reasoning and processing capabilities, you need to provide it with access to the right context, tools to interact with the environment, and sometimes another LLM to reason over intermediate outputs. You can think of an LLM as a worker with a decent level of general intelligence who still needs the right knowledge, instructions, and access to tools to produce good work.

As a result, many production systems need a robust retrieval system, a variety of functions or tools to call, and multiple filtering / processing modules in order to provide helpful answers or actions.

Challenges in evaluating complex pipelines

Truly understanding the capability of your GenAI application takes much more than evaluating the LLM itself: you need to make sure all the surrounding modules, and the way they interface with the LLM, function correctly. Just as unit tests exercise the components of a piece of software, you want tests at different levels of granularity to gain both detailed and high-level perspectives on what’s working and what’s not.

An ideal evaluation framework needs to have the following:

  • An easy way to log the intermediate inputs / outputs
  • Granular metrics that can measure the quality at module level
  • Ability to leverage a tailored golden dataset to judge each module

In the latest version of continuous-eval, v0.3, we designed the framework with exactly these capabilities. Let’s walk through a case study below to see how it works.

Case Study: evaluating a complex RAG application

In the RAG pipeline below, a query is first sent to a classifier that determines the intent of the question, and is then passed through three separate retrievers. The Base Retriever runs a vector search in the vector database, the BM25 Retriever uses keyword search over the documents, and a HyDE generator creates hypothetical context documents and then retrieves semantically similar chunks. A reranker, which uses a Cohere model, reorders and compresses the retrieved chunks based on relevance before the result is finally fed into the LLM to produce an output.
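For readers unfamiliar with HyDE: the generator asks an LLM to draft a hypothetical passage that would answer the query, and retrieval is then run on that draft instead of the raw question. Below is a minimal sketch of the idea; the OpenAI client, model name, and the vector_store object are illustrative assumptions, not the customer’s actual implementation.

# Illustrative HyDE sketch. The OpenAI client, model name, and the
# `vector_store` object (LangChain-style similarity_search) are assumptions
# for demonstration only.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question: str, vector_store, k: int = 5):
    # 1. Ask an LLM to draft a hypothetical passage that would answer the question.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"Write a short passage that answers: {question}"}
        ],
    ).choices[0].message.content
    # 2. Retrieve real chunks that are semantically similar to the draft,
    #    instead of embedding the raw question directly.
    return vector_store.similarity_search(draft, k=k)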

To build an evaluation Pipeline tailored to this complex RAG application, you can define the Modules as follows using continuous-eval:

from typing import Dict, List

from continuous_eval.eval import Module, Pipeline, Dataset, ModuleOutput

# Type alias for retrieved document chunks (a list of dicts holding content and metadata)
Documents = List[Dict[str, str]]

dataset = Dataset("data/eval_golden_dataset")

classifier = Module(
    name="query_classifier",
    input=dataset.question,
    output=str,
)

base_retriever = Module(
    name="base_retriever",
    input=dataset.question,
    output=Documents,
)

bm25_retriever = Module(
    name="bm25_retriever",
    input=dataset.question,
    output=Documents,
)

hyde_generator = Module(
    name="HyDE_generator",
    input=dataset.question,
    output=str,
)

hyde_retriever = Module(
    name="HyDE_retriever",
    input=hyde_generator,
    output=Documents,
)

reranker = Module(
    name="cohere_reranker",
    input=(base_retriever, hyde_retriever, bm25_retriever),
    output=Documents,
)

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
)

pipeline = Pipeline(
    [classifier, base_retriever, hyde_generator, hyde_retriever, bm25_retriever, reranker, llm],
    dataset=dataset,
)

To attach the appropriate metrics and tests to a module, you can use the eval and tests fields. Let’s use the answer_generator module as an example and add three metrics and two tests to it.

from continuous_eval.metrics.generation.text import (
    FleschKincaidReadability,
    DebertaAnswerScores,
    LLMBasedFaithfulness,
)
from continuous_eval.eval.tests import GreaterOrEqualThan


def DocumentsContent(docs):
    # Selector used to extract the text content from a list of retrieved documents
    # (assumes each document is a dict with a "page_content" field)
    return [doc["page_content"] for doc in docs]


llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
    eval=[
        FleschKincaidReadability().use(answer=ModuleOutput()),
        DebertaAnswerScores().use(
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
        LLMBasedFaithfulness().use(
            answer=ModuleOutput(),
            retrieved_context=ModuleOutput(DocumentsContent, module=reranker),
            question=dataset.question,
        ),
    ],
    tests=[
        GreaterOrEqualThan(
            test_name="Readability", metric_name="flesch_reading_ease", min_value=20.0
        ),
        GreaterOrEqualThan(
            test_name="Deberta Entailment",
            metric_name="deberta_answer_entailment",
            min_value=0.8,
        ),
    ],
)
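Retrieval modules can be instrumented the same way. As a sketch, you could attach context precision / recall metrics and a recall test to the reranker; here the dataset field name (ground_truth_contexts), the metric argument names, and the context_recall metric name are assumptions based on the continuous-eval documentation, so adapt them to your golden dataset and installed version.

from continuous_eval.metrics.retrieval import PrecisionRecallF1
from continuous_eval.eval.tests import GreaterOrEqualThan

reranker = Module(
    name="cohere_reranker",
    input=(base_retriever, hyde_retriever, bm25_retriever),
    output=Documents,
    eval=[
        # Compare the reranked chunks against the annotated ground-truth contexts
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(DocumentsContent),
            ground_truth_context=dataset.ground_truth_contexts,
        ),
    ],
    tests=[
        GreaterOrEqualThan(
            test_name="Context Recall", metric_name="context_recall", min_value=0.8
        ),
    ],
)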

Once you have the pipeline set up, you can use an eval_manager to log all the intermediate steps.

from continuous_eval.eval.manager import eval_manager

# Register the pipeline with the evaluation manager
eval_manager.set_pipeline(pipeline)

eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]  # get the question or any other field
    # run your pipeline ...
    eval_manager.log("reranker", reranked_docs)  # log each module's intermediate output for this sample
    # ...
    eval_manager.next_sample()

Finally, you can run the evaluation and tests:

  • eval_manager.run_metrics() to run all the metrics defined in the pipeline
  • eval_manager.run_tests() to run the tests defined in the pipeline
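Putting the two calls together, you can then inspect the per-module results; the aggregate() helper below is an assumption based on the continuous-eval documentation, so check the API of your installed version.

# Compute every metric defined on the pipeline's modules
eval_manager.run_metrics()
# Evaluate the pass/fail tests attached to each module
eval_manager.run_tests()

# Aggregate per-module results for a quick overview
# (aggregate() is assumed from the continuous-eval docs; adapt if your version differs)
results = eval_manager.metrics.aggregate()
print(results)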

Below is an example run on the pipeline. In this visualized output, you can see that the final answer is generally faithful, relevant, and stylistically consistent. However, it is only correct 70% of the time. In this case you can trace the performance back through each module (it looks like one of the retrievers is suffering on recall, which is worth investigating).

Pipeline metrics and tests generated using Relari.ai

Check out more examples:

We have created four examples with complete code at github.com/relari-ai/examples. The applications themselves are built using LlamaIndex and LangChain and have continuous-eval evaluators built in.

Full examples of multi-step applications with evaluators built-in (github)

Try it in your application:

Here’s the link to the open-source continuous-eval: github.com/relari-ai/continuous-eval
