How to evaluate large language model chatbots: experimenting with Streamlit and Prodigy

Karlis Kanders
Published in Discovery at Nesta
Oct 17, 2023

Nesta’s Discovery Hub has launched a project to investigate how generative AI can be used for social good. We’re now exploring the potential of LLMs for early-years education, and in a series of Medium blogs, we discuss the technical aspects of our early prototypes. This article was written in collaboration with Rosie Oxbury and Kostas Stathoulopoulos.

In this blog we’ll describe our initial experiments with evaluating the outputs of a large language model (LLM). We tried two simple methods of evaluating a parenting chatbot — one of our generative AI prototypes.

For the first experiment, we used Streamlit and Amazon S3 storage to collect user feedback as they conversed with the chatbot.

For the second, we performed A/B tests and compared answers to parenting questions written by humans and LLMs — and found some surprising early results.

Early-years parenting chatbot

We have prototyped a parenting chatbot that provides guidance to caregivers of young children. You can see it in action in the video below.

Demo of the early-years parenting chatbot prototype

The chatbot uses OpenAI’s GPT-4 and utilises retrieval-augmented generation (RAG) to ground the LLM in a knowledge base consisting of advice on pregnancy, babies and parenting — in this case the NHS Start for Life website, which we have used as an openly available source of information for prototyping purposes.

Chatbot diagram
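Under the hood, the RAG pattern is conceptually simple: retrieve the most relevant passages from the knowledge base, then ask GPT-4 to answer using only that context. Below is a minimal sketch of the idea, assuming a hypothetical retrieve helper over the knowledge base and the OpenAI chat API; the prototype's actual prompts and pipeline may differ.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_rag(question: str, retrieve) -> str:
    """Answer a question grounded in passages retrieved from the knowledge base.

    `retrieve` is a hypothetical helper that returns the most relevant
    passages (e.g. sections of the NHS Start for Life website).
    """
    context = "\n\n".join(retrieve(question, top_k=3))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant for caregivers of young children. "
                    "Answer using only the context below. If the context is not "
                    "relevant, say so and answer to the best of your knowledge.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content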

Experiment 1: Collecting feedback with Streamlit

To make the chatbot more helpful and accurate, we wanted to give users the option to provide feedback on individual responses.

The parenting chatbot prototype is built with Streamlit, which does not provide a native widget for collecting user feedback. Therefore, we opted for streamlit-feedback, a third-party component developed by trubrics, which provides a simple and pretty interface for this purpose.

The streamlit-feedback component supports two types of feedback: binary (thumbs up or down) and a 5-point Likert scale (from sad to happy face). It also provides an optional free-text field for users to add comments.

In our prototype, we decided to use the 5-point Likert scale and allow users to provide feedback for each assistant response. Here is an example of how this looks in our prototype.

Example of the user feedback widget in Streamlit

The widget is implemented by simply adding the following lines after the LLM response in the code:

from streamlit_feedback import streamlit_feedback

# ...

# Submit feedback
streamlit_feedback(
    feedback_type="faces",
    single_submit=False,
    optional_text_label="[Optional] Please provide an explanation",
    key="feedback",
)
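Because the widget is created with key="feedback", its most recent result is also available in st.session_state["feedback"]; this is what the logging code further below reads the "score" and "text" fields from. A minimal illustration:

feedback = st.session_state.get("feedback")
if feedback:
    # The result dict includes the selected score and any free-text comment
    st.write(f"Rating received: {feedback['score']}")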

Logging user feedback to S3

We decided to save the user feedback and message history to S3 so that we can analyse it at a later stage. We utilised Streamlit’s session management functionality to keep track of multiple user and assistant messages. When a user starts a conversation with the parenting chatbot, the app defines a unique session identifier, which will be used to create a folder with the same name in an S3 bucket where we collect the messages and user feedback.

import streamlit as st
import uuid

# ...

# Create a unique identifier for this conversation session
st.session_state["session_uuid"] = f"{current_time()}-{str(uuid.uuid4())}"

Each time a user submits feedback, we append a dictionary containing the last user and assistant messages, the feedback rating and the free text field into a JSON Lines file in S3 using boto3:

# Collect the last exchange together with the user's rating and comment
user_feedback = {
    "user_message": st.session_state["messages"][-2],
    "assistant_message": st.session_state["messages"][-1],
    "feedback_score": st.session_state["feedback"]["score"],
    "feedback_text": st.session_state["feedback"]["text"],
}

# Append the feedback to a JSON Lines file in the session's S3 folder
write_to_s3(
    key=aws_key,
    secret=aws_secret,
    s3_path=f"{s3_path}/session-logs/{st.session_state['session_uuid']}",
    filename="feedback",
    data=user_feedback,
    how="a",
)
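Here, write_to_s3 is a small helper from our codebase. For illustration, below is a hedged sketch of how such a helper could be implemented with boto3, assuming the first component of s3_path is the bucket name and that "append" mode means read-modify-write (S3 objects cannot be appended to in place). The actual implementation in our repo may differ.

import json
import boto3


def write_to_s3(key, secret, s3_path, filename, data, how="a"):
    """Append a dict as one JSON line to s3://<s3_path>/<filename>.jsonl."""
    s3 = boto3.client("s3", aws_access_key_id=key, aws_secret_access_key=secret)
    bucket, _, prefix = s3_path.partition("/")
    object_key = f"{prefix}/{filename}.jsonl"

    # Fetch the existing object (if any) so we can add the new line to it
    existing = ""
    if how == "a":
        try:
            obj = s3.get_object(Bucket=bucket, Key=object_key)
            existing = obj["Body"].read().decode("utf-8")
        except s3.exceptions.NoSuchKey:
            pass

    body = existing + json.dumps(data) + "\n"
    s3.put_object(Bucket=bucket, Key=object_key, Body=body.encode("utf-8"))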

If this prototype is eventually deployed to a group of test users, this feedback could be assessed to identify problem areas for the chatbot. For example, there might be topics that are not sufficiently covered in our knowledge base, and users would likely point out such topics with their negative feedback.

We are not yet deploying the prototype, however, and hence we also tested the chatbot's answers in a more controlled way — using A/B tests with a small group of people.

Experiment 2: A/B testing with Prodigy

We experimented with an evaluation approach that could help us systematically compare the chatbot responses with answers written by human experts. We used a tool called Prodigy to make pairwise comparisons between answers from different sources (ie, LLMs or humans).

Prodigy is a scriptable annotation tool for small teams doing rapid, iterative machine learning model development. There are also other tools that can be used for this purpose.

Prodigy annotation tool presenting a parenting question and two answers for annotators to choose from.

Parenting questions

We used the “top ten most asked parenting questions” from an article on The Kid Collective website. According to the article, these questions have been sourced from Google Trends and pertain specifically to the UK. The questions are:

  • How to change a nappy?
  • How to store breast milk?
  • When to stop breastfeeding?
  • How to get baby to sleep?
  • How much to feed baby?
  • How do babies learn?
  • How to breastfeed?
  • How to baby-proof?
  • How to potty train?
  • What is colic?

The article also provides short, 1–2 paragraph answers to each of the questions. We used these answers as the reference “human” answers in our evaluation.

While this source is not necessarily an authoritative one (The Kid Collective appears to be primarily a baby retail website), we deemed it a good starting point for our experiment, as the questions are presumably among the most popular ones that caregivers have, and the answers are written in a conversational style that is easy to understand.

Other openly available options that we considered were answers to parenting questions by the US-based Public Broadcasting Service (also not necessarily an authoritative source) and Action for Children (more specialised in mental health). In the future, we could expand this data with more questions and the answers could be provided by caregiving experts.

Evaluating the answers

We put the ten questions to our parenting chatbot prototype and presented annotators (a few people from our team) with pairs of answers, asking them to select the better one. The annotators were not told which answer was from the chatbot and which was written by a human.

In addition, we included a third source of answers: the GPT-4 model without any external knowledge base. This was to see whether using RAG to ground the answers in a trusted knowledge base improves their perceived quality.

We could then analyse the pairwise comparisons and determine which source of answers had been the most preferred one. All the answers used in this experiment can be found in our GitHub repository.

Summary of the A/B testing experiment

Setting up Prodigy

In the following, we describe how to set up the Prodigy tool to perform the evaluations. If you would like to see the results of the experiment instead, skip ahead to the end of the article.

First, to install Prodigy, you will need to run the following command in the terminal:

python -m pip install --upgrade prodigy -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy

Replace XXXX with your licence key (see the official prodigy docs here for further information).

To define the annotation task, we have to write a Prodigy “recipe” file (see the full recipe for our evaluation task here). We begin by importing various modules.

import random
from typing import Dict, Generator, List
import prodigy
from prodigy.components.loaders import JSONL

Next, we define GLOBAL_CSS, a string of CSS styles that lets us tailor the appearance of the Prodigy interface: font sizes, the layout of the answer option boxes, the container width and more.

GLOBAL_CSS = (
    ".prodigy-content{font-size: 15px}"
    " .prodigy-option{width: 49%}"
    " .prodigy-option{align-items:flex-start}"
    " .prodigy-option{margin-right: 3px}"
    " .prodigy-container{max-width: 1200px}"
)

We then define the best_answer function, which specifies our Prodigy recipe. With the @prodigy.recipe decorator, we define the expected arguments. The role of this function is to process the input text data and return it in a format suitable for Prodigy to render.

@prodigy.recipe(
    "best_answer",
    dataset=("The dataset to save to", "positional", None, str),
    file_path=("Path to the questions and answers file", "positional", None, str),
)
def best_answer(dataset: str, file_path: str) -> Dict:
    """
    Choose the best answer out of the given options.

    Arguments:
        dataset: The dataset to save to.
        file_path: Path to the questions and answers file.

    Returns:
        A dictionary containing the recipe configuration.
    """

We use Prodigy’s JSONL loader to read our answers dataset into the stream variable, ready for subsequent processing. Once the data is loaded, the next step is to make sure it is presented in a random order and in a format Prodigy understands. Shuffling the data prevents potential biases arising from the order of presentation.

    # Load the data
    stream = list(JSONL(file_path))

    def get_shuffled_stream(stream: List) -> Generator:
        random.shuffle(stream)
        for eg in stream:
            yield eg

    # Process the stream into the format Prodigy expects
    def format_stream(stream: Generator) -> Generator:
        for item in stream:
            question = item["question"]
            options = [{"id": key, "html": value} for key, value in item["answers"].items()]
            yield {"html": question, "options": options}

    stream = format_stream(get_shuffled_stream(stream))
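For reference, the recipe expects each line of answers.jsonl to hold a question and a mapping from answer source to answer text, roughly like the record below. The source labels here are illustrative; the actual file is in our GitHub repository.

# Illustrative structure of one line in answers.jsonl, as parsed by the JSONL loader
example_record = {
    "question": "How to change a nappy?",
    "answers": {
        "human": "Lay your baby on a clean changing mat ...",
        "gpt4": "Changing a nappy involves a few simple steps ...",
        "rag_chatbot": "According to the guidance in our knowledge base ...",
    },
}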

Finally, we lay down the rules for Prodigy by setting up the recipe configuration. This dictionary defines how our task should appear and function within the Prodigy interface: it specifies the ‘choice’ interface, names the dataset, feeds in the processed data stream, determines which buttons to show and sets other interface-related parameters.

    return {
        # Use the choice interface
        "view_id": "choice",
        # Name of the dataset
        "dataset": dataset,
        # The data stream
        "stream": stream,
        "config": {
            # Only allow one choice
            "choice_style": "single",
            "task_description": "Choose the best answer",
            "choice_auto_accept": False,
            # Define which buttons to show
            "buttons": ["accept", "ignore"],
            # Add custom css
            "global_css": GLOBAL_CSS,
            # If feed_overlap is True, the same example can be sent out to multiple users at the same time
            "feed_overlap": True,
            # Port to run the server on
            "port": 8080,
            # Important to set host to 0.0.0.0 when running on ec2
            "host": "0.0.0.0",
            # Setting instant_submit to True means that the user doesn't have to click the "save" button
            "instant_submit": True,
        },
    }

Running Prodigy locally and on the cloud

Once the Prodigy recipe has been defined, you can run the following command in the terminal to spin up an instance of the Prodigy app and test it on your local machine:

python -m prodigy best_answer answer_data src/genai/parenting_chatbot/prodigy_eval/data/answers.jsonl -F src/genai/parenting_chatbot/prodigy_eval/best_answer_recipe.py

You should now be able to access the Prodigy app at http://0.0.0.0:8080. Note that you will need to specify a user session ID by adding ?session=your_session_id to the URL, so that Prodigy can keep track of annotations from different annotators.

The annotation results can be fetched by running

prodigy db-out answer_data > output.jsonl
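To turn the exported annotations into the pairwise preference percentages discussed later in this article, a short script is enough. Below is a sketch, assuming the standard fields of Prodigy's choice output (options, accept and answer) and illustrative option ids such as "human", "gpt4" and "rag_chatbot".

import json
from collections import Counter

wins = Counter()   # (winner, loser) -> number of times winner was chosen
pairs = Counter()  # unordered pair -> number of decided comparisons

with open("output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Skip ignored tasks and tasks with no selection
        if record.get("answer") != "accept" or not record.get("accept"):
            continue
        option_ids = [option["id"] for option in record["options"]]
        winner = record["accept"][0]
        for loser in option_ids:
            if loser != winner:
                wins[(winner, loser)] += 1
                pairs[frozenset((winner, loser))] += 1

for (winner, loser), count in sorted(wins.items()):
    total = pairs[frozenset((winner, loser))]
    print(f"{winner} preferred over {loser}: {count}/{total} ({100 * count / total:.0f}%)")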

To make the platform available to multiple annotators, you will need to run Prodigy on an EC2 instance. You can then share the URL with the annotators, and they can access the platform from their own computers.

We first spun up an EC2 instance (in our case a t2.micro, which is quite affordable at around $0.0116 per hour) and connected to it using ssh. Once connected to the instance, we cloned our generative AI repo (to get the recipe code and answer data) and installed Prodigy using the instructions above.

To keep the Prodigy instance running in the background even after we disconnect from the instance, we used screen, first by simply running screen in the terminal

screen

We then ran the same command as before to spin up a Prodigy instance

python3 -m prodigy best_answer answer_data src/genai/parenting_chatbot/prodigy_eval/data/answers.jsonl -F src/genai/parenting_chatbot/prodigy_eval/best_answer_recipe.py

To detach from the screen, press ctrl+a and then d. The Prodigy app will continue running in the background. To reattach to the screen, run the command screen -r. You can also run screen -ls to see the list of running screens. To stop a screen and terminate the Prodigy instance, run screen -X -S <your_prodigy_screen_session_id> "quit".

Finally, you will need to open port 8080 to allow other users to join from their computers. This can be done by going to the EC2 instance's settings in the AWS console and adding a new inbound rule to the security group. The rule should be of type “Custom TCP” with port range 8080. If you wish to control who can connect to your Prodigy instance (recommended), you can also restrict the allowed IP addresses.

Early results are encouraging

So, what were the results? We performed a proof-of-concept experiment with four annotators, who submitted 96 judgements in total. (Not every annotator completed all assessments.) This sample is definitely too small and unrepresentative to draw any strong conclusions from. Nonetheless, during a rapid prototyping process, such exercises can still provide helpful signals on how a prototype is working and highlight potential problem areas.

The chart below shows the percentage of choices in which annotators preferred one type of option (rows) over the other (columns). To our surprise, our chatbot prototype was preferred over human-written answers 91% of the time. This might have to do with the fact that the chatbot prototype generally gave longer, more detailed and better-structured responses; after the test, some annotators noted that they preferred this type of answer.

Percentage of times one type of option (rows) were preferred over the other type of option (columns).

Meanwhile, GPT-4 without RAG also tended to provide longer and more detailed answers — and it was preferred about 66% of the time over human-written responses.

Interestingly, however, our chatbot prototype was preferred over GPT-4 around 83% of the time, suggesting that the use of RAG together with a trusted knowledge base improves the responses. This might be due to more specific, contextual information — for example, the chatbot tends to provide the UK’s National Breastfeeding Helpline phone number when asked about breastfeeding. The chatbot also differentiated itself from GPT-4 by occasionally beginning with a phrase like “I did not get any relevant context for this but I will reply to the best of my knowledge…”, thus signalling its level of confidence in the answer.

We can also check the results for each question separately, to spot any interesting patterns. The chart below shows the percentage of times each type of response was preferred across all pairwise comparisons.

For example, we can see that for the question “How to change a nappy?” GPT-4 in fact appears to have performed the best. This could warrant further inspection into the differences between the answers from the chatbot prototype and GPT-4, and provide ideas on how to continue improving the prototype (eg, by prompt engineering).

Percentage of times each type of response was preferred across all pairwise comparisons

Limitations and further exploration

These early results are encouraging in that they suggest that grounding a large language model in a trusted knowledge base might yield an improvement over simply using GPT-4, and that the answers from an LLM-based chatbot can appear sensible when compared with a set of human-written responses. Nonetheless, these are indications rather than conclusive evaluations. A more convincing assessment would require broader testing in diverse scenarios.

Firstly, our annotators were part of our wider development team. This may have influenced the results, introducing potential biases and not reflecting the diverse perspectives of the broader UK parent audience. We also did not implement strict evaluation criteria for the annotators to follow.

The human-written answers were sourced from a website that has very likely not been vetted as rigorously as NHS Start for Life. Another, perhaps more meaningful, test would be to compare how users prefer using the chatbot versus browsing the website that supplies the chatbot’s knowledge base. In such a test, the quality of the available information would be similar for both sources.

We also used only one answer per question — however, LLMs can produce slightly different versions of responses every time you use them. Therefore, one should sample multiple responses (potentially at different temperature settings) to make the results more robust. The LLM responses should also be tested separately for their accuracy by domain experts.

More generally, there are many approaches for evaluating various aspects of LLMs such as accuracy, fairness and potential for misuse. For example, besides A/B tests, LLMs can be evaluated by using domain-specific multiple-choice questions. One can also do ‘red-teaming’ to probe the guardrails of an LLM and check if it refuses to produce harmful outputs.

It is also possible to speed up the evaluation process by using LLMs themselves to do the evaluation. Recent research has shown that there can be a high (85%) agreement between judgements made by GPT-4 and human experts. This led the authors to suggest that using strong LLMs like GPT-4 for evaluation might be “a scalable and explainable way to approximate human preferences”.
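For our pairwise setup, LLM-as-a-judge could look roughly like the sketch below, using the OpenAI chat API; the prompt and answer parsing are illustrative rather than a recommended implementation, and such automated judgements would still need human spot checks.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are judging answers to a parenting question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is more helpful, accurate and appropriate for a caregiver?
Reply with exactly one letter: A or B."""


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to pick the better of two answers; returns 'A' or 'B'."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, answer_a=answer_a, answer_b=answer_b
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()[:1].upper()

In practice, each pair would also be judged in both orders to control for position bias.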

Each of these approaches, however, comes with its own set of challenges and developing LLM evaluation methods is an active research area.

Conclusion

We have shown how we implemented two approaches to support the evaluation of our LLM-based chatbot prototype. These are only early experiments, but we hope this has been helpful and provided you with ideas for testing chatbot assistants. You can find all the code and data on our GitHub.

The type of quick evaluation we have discussed is useful during early development to quickly identify potential problem areas. Obviously, full and rigorous testing of LLM-based systems will become increasingly important as they become ready for release and come into popular use.

Organisations will need to weigh up the failure rate of these systems against the benefits they bring to users. Evaluation will be especially critical in contexts involving vulnerable groups, such as young children or people seeking mental health support.

If you are also working on evaluating LLMs and chatbots, we’d be curious to hear about your experience — get in touch!

Thank you to Dan Kwiatkowski and Faizal Farook for reviewing the article, and to Discovery Hub’s Laurie, Natalie, Rosie and Kostas for participating in the evaluation experiments.
