AI Series Part VII: Evaluating your RAG with TruLens (NextJS+Python)

Tiago Souto
8 min read · Apr 10, 2024


Hey, there! We’ve already created a RAG chat app using both JavaScript and Python, and the results were pretty cool. But so far we’ve uploaded documents that are familiar to us and asked questions about them, so we had the knowledge necessary to judge whether the LLM responses were good. In real cases, though, you might be working with documents from your users that are unknown to you. So how would you know if the answers you’re getting from the LLM are really faithful to the document?

That’s where evaluation tools like TruLens come into play.

What's TruLens?

TruLens is a framework developed by TruEra and written in Python. It connects to your RAG application and generates score-based evaluations of the LLM responses against a set of metrics.

TruLens follows a principle defined by TruEra called the RAG Triad. It consists of 3 metrics that are essential to evaluate how good an LLM response is. They are:

  • Groundedness: evaluates how well the response is supported by the context provided (the documents retrieved from the vector store)
  • Answer Relevance: evaluates how relevant the answer is to the question
  • Context Relevance: evaluates how relevant the provided context is to the question
https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad/

In practical terms, this means that based on the retrieved document chunks, the user prompt, and the LLM response, TruLens can score each metric from 0 to 1 and give you insights into how well your RAG application is performing.

Those insights are really helpful for understanding why some responses are inaccurate, so you can take action to close the gaps and reduce issues, without any previous knowledge of the provided data. For example, a response that scores high on Answer Relevance but low on Groundedness sounds plausible yet is poorly supported by the retrieved documents, which suggests the model is improvising beyond what was retrieved.

TruLens also provides a dashboard where the results are shared, along with other features that can be helpful for you and your team.

You can visit their website and learn more: https://www.trulens.org/

The TruLens + NextJS Challenge

Okay, now that I’ve given a bit more context about TruLens, let’s talk about the challenge.

As I mentioned, TruLens is written in Python. But if you read the very first post of this series, you know that I’m a JS guy with almost zero Python knowledge. And, as you may have noticed, most of the posts so far have used NextJS to build our RAG application. So, how do we integrate TruLens with a NextJS app?

That’s the question I couldn’t find an answer to.

So I reached out to the TruEra team asking how I could take the responses I already get from the LLM on a JS server and use them on their Python server. They pointed me to a part of the documentation showing that it’s possible to pass an object to a class, which triggers the evaluation and displays the results in the dashboard. Great, the first piece of the puzzle was solved. Now, how would I send the data across different servers?

As TruLens uses Streamlit under the hood, I couldn’t find a way to create an endpoint and make an HTTP request to it. The workaround I came up with was storing the result in a temporary file from the JavaScript server, and then reading that file from the Python server. And that’s what I ended up doing:

First, I had everything running via Docker, so I could use a volume to share the file across containers. The JS RAG works as always; the only change is that I take the LLM response, along with the other data needed, and store it as JSON in the shared Docker volume. On the Python server where TruLens runs, I added a listener watching that volume, so when the file changes I rerun the TruLens dashboard to update it with the new data. That’s the strategy I could think of, and it worked.
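
To make this concrete, here’s a hypothetical example of what the shared JSON file could contain. The values below are made up for illustration; we’ll generate the real file from the NextJS side later in this post:

{
  "prompt": "What is this document about?",
  "response": "The document describes the company's onboarding process...",
  "context": ["first retrieved chunk...", "second retrieved chunk..."]
}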

So, let’s take a look at this implementation.

The Solution

First, let’s make a copy of the nextjs-chat-rag project we created previously. If you’re new here, I recommend reading this post first, or you can download the code from this GitHub repo: https://github.com/soutot/ai-series/tree/main/nextjs-chat-rag

Also, make sure you have Python and Docker installed. We covered a bit of that in Parts III and V of this AI series.

Okay, now that we’ve got the app up and running, let’s start creating the TruLens project structure.

In the root directory, create a new folder and name it trulens, then open it in the terminal.

To create and activate your venv, run:

python -m venv venv && source venv/bin/activate

And to install trulens and other dependencies:

pip install trulens_eval chromadb openai flask flask_cors langchain langchainhub bs4 tiktoken ipytree
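
A heads-up: the Dockerfile we’ll create in a moment installs dependencies from a requirements.txt file, which we haven’t generated yet. A simple way to produce one from what you just installed (you could also pin the versions by hand):

pip freeze > requirements.txt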

Now create a .env file with the following keys

OPENAI_API_KEY="your-api-key"
TRULENS_RESULT_FILE="/public/tmp/trulens-rag-results/data.json"
TRULENS_APP_ID="RAG"
TRULENS_FILE_POLLING_INTERVAL="10"

Don’t worry, I’m going to explain them later.

Now, create a Dockerfile and add the following content

FROM python:3.11.5

EXPOSE 8501
WORKDIR /trulens
COPY requirements.txt .
RUN pip3 install --no-cache-dir --upgrade pip
RUN pip3 install --no-cache-dir -r requirements.txt
ENV FLASK_DEBUG=1
COPY . .
CMD [ "python3", "-m" , "flask", "run", "--host=0.0.0.0", "-p", "8501"]

With all that at hand, let’s start with the code. Create an app.py file in the root of the trulens folder.

We’ll first import some dependencies and initialize the env variables:

from dotenv import load_dotenv

# Load the .env file before reading any variables
load_dotenv()

import time
import os
import logging

file_path = os.getenv('TRULENS_RESULT_FILE')
app_id = os.getenv('TRULENS_APP_ID')
# Fall back to 10 seconds if the polling interval isn't set
file_polling_interval = int(os.getenv('TRULENS_FILE_POLLING_INTERVAL', '10'))

Now, let’s create our first function: a polling function responsible for monitoring file changes. It also loads the dashboard during the first server load:

def monitor_file_changes(callback):
    last_modified = -1

    while True:
        if os.path.exists(file_path):
            current_modified = os.path.getmtime(file_path)
            if current_modified > last_modified:
                logging.info("File has been modified. Reloading...")
                last_modified = current_modified
                callback()
        else:
            logging.info(f"File not found: {file_path}")
            if last_modified == -1:  # first run: load the dashboard even before the file exists
                logging.info("Initializing dashboard...")
                last_modified = 0
                callback()
        time.sleep(file_polling_interval)

Then, we have a function responsible for loading the file content:

def load_json_data():
    import json
    import logging

    json_data = {}
    try:
        with open(file_path, 'r') as file:
            json_data = json.load(file)
            logging.info(f"Data: {json_data}")
            return json_data
    except FileNotFoundError:
        logging.error(f"File not found: {file_path}")
    except json.JSONDecodeError:
        logging.error(f"Invalid JSON file: {file_path}")

    return None

We then set up the TruLens Virtual App. That’s what we’ll use to log the results to the dashboard:

from trulens_eval import Select

retriever_component = Select.RecordCalls.retriever

virtual_app_data = dict(
    llm=dict(
        modelname="GPT-3.5-turbo"
    ),
    template="RAG App Evaluation",
    debug=""
)

Then we create a function that builds a virtual record from the data in the file:

def load_rec():
    from trulens_eval.tru_virtual import VirtualRecord

    json_data = load_json_data()
    if not json_data:
        return None

    # Default to an empty list when the JS side didn't send any context chunks
    context = json_data.get('context', [])
    # Note: context_call is defined further below; load_rec only runs after the
    # whole module has been loaded, so that's fine
    rec = VirtualRecord(
        main_input=json_data['prompt'],
        main_output=json_data['response'],
        calls={
            context_call: dict(
                args=[json_data['prompt']],
                rets=[context]
            )
        },
    )
    return rec

Next, we initialize the OpenAI provider, along with the TruLens context selector:

from trulens_eval.feedback.provider import OpenAI
from trulens_eval.feedback.feedback import Feedback

# Initialize provider class
openai = OpenAI()
# The selector for a presumed context retrieval component's call to
# `get_context`. The names are arbitrary but may be useful for readability on
# your end.
context_call = retriever_component.get_context
# Select context to be used in feedback. We select the return values of the
# virtual `get_context` call in the virtual `retriever` component. Names are
# arbitrary except for `rets`.
context = context_call.rets[:]

We then define the evaluation feedback functions:

from trulens_eval.feedback import Groundedness
import numpy as np

grounded = Groundedness(groundedness_provider=OpenAI())

# Define a groundedness feedback function
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # collect context chunks into a list
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(openai.relevance, name="Answer Relevance").on_input_output()

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(openai.qs_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

Then, we create some loader functions:

def load_app():
    from trulens_eval import Select
    from trulens_eval.tru_virtual import VirtualApp

    virtual_app = VirtualApp(virtual_app_data)  # can start with the prior dictionary
    virtual_app[Select.RecordCalls.llm.maxtokens] = 1024
    virtual_app[retriever_component] = "this is the retriever component"
    return virtual_app

def load_virtual_recorder():
    virtual_app = load_app()
    from trulens_eval.tru_virtual import TruVirtual

    virtual_recorder = TruVirtual(
        app_id=app_id,
        app=virtual_app,
        feedbacks=[f_groundedness, f_qa_relevance, f_context_relevance],
        initial_app_loader=load_app
    )
    rec = load_rec()
    if rec is not None:
        virtual_recorder.add_record(rec)

Then, the function that runs the dashboard:

def run_dashboard():
    load_virtual_recorder()

    from trulens_eval import Tru
    tru = Tru()

    # First run: no dashboard process yet, so start from a clean database.
    # Subsequent runs: stop the previous dashboard before starting a new one.
    if tru._dashboard_proc is None:
        tru.reset_database()
    if tru._dashboard_proc is not None:
        tru.stop_dashboard(force=True)

    tru.get_leaderboard(app_ids=[app_id])
    tru.run_dashboard()

And finally, we call the function that triggers the whole process:

monitor_file_changes(run_dashboard)
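
By the way, if you want to sanity-check the Python side before wiring up Docker, you can run it directly, assuming your venv is active and TRULENS_RESULT_FILE points to a path that exists on your machine:

python app.py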

Cool, we’re finished with the Python code. Now let’s get back to the NextJS app.

First, we need to adjust the environment. Open the docker-compose.yaml file, or create one if you don’t have it already, and add the following:

services:
  rag-trulens:
    container_name: rag-trulens
    build:
      context: ./trulens
      dockerfile: Dockerfile
    ports:
      - 8501:8501
    volumes:
      # Mount the source at the Dockerfile's WORKDIR so local edits are picked up
      - ./trulens:/trulens
      - rag_results:/public/tmp/trulens-rag-results
    command: python app.py
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      TRULENS_RESULT_FILE: ${TRULENS_RESULT_FILE}
      TRULENS_APP_ID: ${TRULENS_APP_ID}
      TRULENS_FILE_POLLING_INTERVAL: ${TRULENS_FILE_POLLING_INTERVAL}
  nextjs-rag-trulens:
    container_name: nextjs-rag-trulens
    depends_on:
      - rag-trulens
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      TRULENS_RESULT_FILE: ${TRULENS_RESULT_FILE}
    ports:
      - 3000:3000
    entrypoint: sh -c "pnpm install && pnpm build && pnpm dev"
    working_dir: /nextjs-rag-trulens
    volumes:
      - .:/nextjs-rag-trulens
      - rag_results:/public/tmp/trulens-rag-results
volumes:
  rag_results: {}

And, edit the .env as the following:

NEXT_PUBLIC_APP_SERVER_HOST=http://localhost:3000

# OPENAI
OPENAI_API_KEY=your-api-key

# TRULENS
TRULENS_RESULT_FILE=/public/tmp/trulens-rag-results/data.json
TRULENS_APP_ID=RAG
TRULENS_FILE_POLLING_INTERVAL=10

And if you haven’t created the Dockerfile yet, here it is:

FROM node:20.6.1

# ARG must be declared after FROM to be visible in this build stage
ARG PNPM_VERSION=8.7.1

COPY . ./nextjs-rag-trulens
WORKDIR /nextjs-rag-trulens
RUN npm install -g pnpm@${PNPM_VERSION}
ENTRYPOINT pnpm install && pnpm run build && pnpm start

Now, there’s only one thing we need to update in our JS code. Open the api/route.ts file and scroll down to the chain.invoke() method. We’ll replace it with the following

// At the top of api/route.ts, import the promise-based writeFile so we can await it
import {writeFile} from 'fs/promises'

// ...

chain.invoke({question: prompt}).then(async (response) => {
  try {
    // Collect the text of each retrieved document to use as the evaluation context
    const sources = response?.sourceDocuments?.map((doc: any) => doc.pageContent)?.filter(Boolean)
    const result = {
      prompt,
      response: response.text || response.response,
      context: sources?.length ? sources : undefined,
    }

    // Write the result to the shared Docker volume so the Python server can pick it up
    await writeFile(`${process.env.TRULENS_RESULT_FILE}`, JSON.stringify(result))
  } catch (e) {
    console.log('ERROR: ', e)
  }

  await handleChainEnd(null, id)
})

It grabs the LLM response and generates the file with the structure TruLens needs to read.

If everything’s working as expected, you’ll now be able to run docker-compose up in the NextJS root, and you’ll see the RAG app running at http://localhost:3000 and the TruLens app running at http://localhost:8501.
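
For reference, from the NextJS root (the optional --build flag forces the images to rebuild after Dockerfile changes):

docker-compose up --build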

Upload a file, send a question, and you’ll start to see the evaluations in the TruLens dashboard.

Conclusion

In this post, we explored the importance of evaluating the accuracy of LLM responses in real-world scenarios. While we may have the knowledge necessary to evaluate responses based on familiar documents, working with unknown user documents presents a challenge. That’s where tools like TruLens come into play.

Integrating TruLens with a NextJS app may seem challenging, especially for those with limited Python knowledge. However, we arrived at a working solution using a temporary file shared via a Docker volume and a polling strategy to read that file.

To learn more about TruLens and its features, visit their website at https://www.trulens.org/.

In the next post, we'll explore more about RAG evaluation. So stay tuned.

See you there.

Full code repo: https://github.com/soutot/nextjs-rag-trulens
