Monitoring Virtual Assistants with IBM watsonx.governance

Aakanksha Joshi
IBM Data Science in Practice
May 21, 2024

Authors: Aakanksha Joshi, Jasmeet Singh, Morgan Carroll

As enterprises scale the integration of foundation models or large language models (LLMs) into their applications, one of the top use cases is the creation of advanced LLM-driven virtual assistants. But as these assistants become more pervasive, so do concerns around the trust and safety of their deployments. Enterprises are now looking at ways to monitor how these assistants behave in production so they can proactively address issues like hallucinations and drift. That’s where IBM watsonx.governance comes into play.

But before diving into that, let’s cover the platform and concept foundations the solution is built on.

Platform Foundations

The watsonx AI and data platform includes three core components and a set of AI assistants designed to help you scale and accelerate the impact of AI with trusted data across your business. The core components include: a studio for new foundation models, generative AI and machine learning; a fit-for-purpose data store built on an open data lakehouse architecture; and a toolkit to accelerate AI workflows that are built with responsibility, transparency and explainability. The components we leverage in our solution are:

  • watsonx.ai — the watsonx.ai AI studio is part of the watsonx platform and brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful studio spanning the AI lifecycle.
  • watsonx.governance — the watsonx.governance toolkit for AI governance allows you to direct, manage and monitor your organization’s AI activities. It employs software automation to strengthen your ability to mitigate risks, manage regulatory requirements and address ethical concerns for both generative AI and machine learning (ML) models.
  • watsonx Assistant — watsonx Assistant is a conversational artificial intelligence platform powered by LLMs where you can build AI-powered voice agents and chatbots, deliver automated self-service support across all channels and touchpoints, and seamlessly integrate with the tools that power your business.

Watson Discovery — Watson Discovery is an AI-powered intelligent document understanding and content analysis platform which boosts the productivity of knowledge workers by automating the discovery of information and insights with advanced Natural Language Processing and Understanding.

Concept Foundations

Prompt — A prompt is a way to interact with a foundation model. watsonx.ai offers users a Prompt Lab where they can experiment with different prompts and models to find the perfect combination of model, prompt and parameters for their use case.

Retrieval Augmented Generation — RAG is the top reference architecture, driving most LLM-driven assistant implementations. The architecture diagram below describes the information flow in greater detail, and you can read more about it here.

A general Retrieval Augmented Generation (RAG) architecture

Based on this RAG architecture, we can identify two variables that determine the final output. The first is the more obvious one: the user’s input question. The second gets generated behind the scenes: the passages that are returned when relevant information is extracted for the given query at step 5. The variables query and passages are sent to the language model along with the pre-defined prompt to produce the final human-like response that goes back to the user. Remember these two variables; we’ll need them in a few minutes.
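To make that flow concrete, here is a minimal sketch in plain Python of how the question and passages fill a pre-defined prompt before it is sent to the model. The template text and function names are purely illustrative, not the exact prompt used in our solution:

```python
# Illustrative sketch only: shows how the two RAG variables (question and
# passages) fill the pre-defined prompt. Template and names are hypothetical.

PROMPT_TEMPLATE = """Answer the question using only the passages below.
If the answer is not in the passages, say you don't know.

Passages:
{passages}

Question: {question}

Answer:"""

def build_prompt(question: str, passages: list) -> str:
    """Fill the RAG variables into the prompt template."""
    return PROMPT_TEMPLATE.format(passages="\n\n".join(passages), question=question)

print(build_prompt(
    question="How far do bats travel in one night?",
    passages=["Some bat species can fly more than 30 miles in a single night while foraging."],
))
```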

LLM Evaluation Metrics — There are many metrics that can be used to evaluate an LLM’s performance. Some key ones leveraged in our use case are highlighted below:

  • Personally Identifiable Information (PII): PII measures whether the input and output data contain any personally identifiable information, using the Watson Natural Language Processing entity extraction model.
  • Hate, Abuse, Profanity (HAP): HAP measures whether there is any toxic content in the input data provided to the model, as well as in the model-generated output.
  • ROUGE: ROUGE is a set of metrics that assess how well a generated summary or translation compares to one or more reference summaries or translations. The generative AI quality evaluation calculates the rouge1, rouge2, and rougeLSum metrics.
  • METEOR: METEOR is calculated with the harmonic mean of precision and recall to capture how well-ordered the matched words in machine translations are in relation to human-produced reference translations.
  • Readability: The readability score determines the readability, complexity, and grade level of the model’s output.

You can find the comprehensive list of metrics supported for generative AI models in watsonx.governance here.
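watsonx.governance calculates these metrics for you once evaluations are configured, but for intuition about what they capture, here is a rough local approximation of ROUGE, METEOR, and readability using common open-source packages (rouge-score, nltk, and textstat). This is only a sketch, not the toolkit’s implementation:

```python
# Hedged sketch: approximating a few metrics locally with open-source packages.
# watsonx.governance computes its own versions automatically.
# pip install rouge-score nltk textstat

import nltk
import textstat
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # needed by METEOR for synonym matching
nltk.download("omw-1.4", quiet=True)

reference = "Bats use echolocation to navigate and hunt insects at night."
candidate = "At night, bats hunt insects and navigate using echolocation."

# ROUGE: rouge1, rouge2 and rougeLsum, as in the generative AI quality evaluation
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
rouge = {name: score.fmeasure for name, score in scorer.score(reference, candidate).items()}

# METEOR: harmonic mean of precision and recall over aligned words
meteor = meteor_score([reference.split()], candidate.split())

# Readability: reading ease and grade level of the model's output
readability = {
    "flesch_reading_ease": textstat.flesch_reading_ease(candidate),
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(candidate),
}

print(rouge, meteor, readability)
```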

Now, let’s see how we connect these different foundational components and set up this pipeline step by step. For our use case, let’s build “BATBOT”, a virtual assistant that can help answer user questions about bats.

Pipeline

Step 1: Set up your domain-specific data repository

The first step towards setting up a RAG-based assistant is creating the domain-specific knowledge repository that the LLM should draw its responses from.

For our use case, we leveraged Watson Discovery and ingested some domain-specific PDF files about protection against bats to fuel our assistant. You also have the flexibility to use a vector database like Elasticsearch in watsonx Discovery, or another vector database of your choice.
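As a rough sketch of retrieving passages from that repository, assuming the ibm-watson Python SDK, placeholder credentials, and a Discovery project with passage retrieval enabled (the version date is illustrative), a query might look like this:

```python
# Sketch: querying Watson Discovery for passages relevant to a user question.
# Placeholders (<...>) and the version date are illustrative, not prescriptive.
# pip install ibm-watson

from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("<your-api-key>")
discovery = DiscoveryV2(version="2023-03-31", authenticator=authenticator)
discovery.set_service_url("<your-discovery-instance-url>")

response = discovery.query(
    project_id="<your-project-id>",
    natural_language_query="How do I keep bats out of my attic?",
    count=3,                              # number of top results to return
).get_result()

# Collect the passage text that will later be sent to the LLM prompt
passages = [
    passage["passage_text"]
    for result in response.get("results", [])
    for passage in result.get("document_passages", [])
]
print(passages)
```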

Step 2: Create a use case to track the prompt in a model inventory

Any data science or AI project should begin with the definition of the use case. The use case documentation must capture details like the business purpose of the project, its risk level, and current status. For example, our use case is focused on creating a RAG-based Assistant.

Since we’re just getting started here, we won’t see any details under the lifecycle yet, but as we create prompts under this use case, we’ll be able to track them all and see whether they’re in the development, testing, validation or operational phase.

Step 3: Create and deploy a prompt

Now that the business need for the project has been documented, we can begin the development work.

We can create a prompt asset, associate it with our desired use case, and validate that it shows up under the Develop stage in the inventory. Once we have verified that the prompt is being tracked as expected, we can dive into prompt engineering.

The user can go into the Prompt Lab and try out various styles of prompts, models, and parameters until all their test cases pass. Make sure you define your RAG variables in the prompt here so that the deployed prompt can recognize the information when it receives it from the front-end assistant.

For our use case, we finally settled on the following combination:

Once the prompt has been evaluated in the development environment, we can first deploy the prompt in the pre-production deployment space, followed by evaluation of the prompt deployment to capture pre-production metrics.

After a thorough evaluation in the pre-production space, the same prompt should be promoted to a production deployment space. From there, the prompt can be deployed as an online endpoint and connected to watsonx Assistant.
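As a rough sketch of what calling that online endpoint looks like, assuming a deployed prompt template with prompt variables named question and passages (the URL, version string, and variable names below are assumptions; copy the exact values from the deployment’s API reference panel in the watsonx UI):

```python
# Sketch: calling a deployed watsonx.ai prompt template over REST.
# Endpoint URL, version date, and variable names are assumptions; take the
# exact values from the deployment's API reference in the watsonx UI.

import requests

DEPLOYMENT_URL = (
    "https://<region>.ml.cloud.ibm.com/ml/v1/deployments/"
    "<deployment-id>/text/generation?version=2023-05-29"
)

payload = {
    "parameters": {
        "prompt_variables": {            # fills the RAG variables in the template
            "question": "How far do bats travel in one night?",
            "passages": "Some bat species can fly more than 30 miles while foraging.",
        }
    }
}

response = requests.post(
    DEPLOYMENT_URL,
    headers={
        "Authorization": "Bearer <iam-access-token>",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=60,
)
print(response.json()["results"][0]["generated_text"])
```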

After the deployment has been created, we can come back to the use case and ensure the status of the model has changed.

Step 4: Integrate with watsonx Assistant

You need a few custom assets before you can start Q&A.

Connecting your knowledge base into the ever-expanding realm of channels and web apps will empower your virtual assistant to deliver comprehensive, omni-channel customer support, fetch real-time data and effortlessly automate complex, repetitive tasks. watsonx Assistant continues to expand ways to customize your virtual assistant with starter kits for some of the most popular integrations. Use these extensions to enhance your virtual agent with advanced functionality, retrieve real-time information from a database, reference a CRM, submit a ticket and more. This can be achieved in just a few clicks.

The first integration you need is with Watson Discovery. This integration will take the user input question and send it to Watson Discovery. Watson Discovery will return the top passages it finds in relation to the question that was asked. You can find the OpenAPI specification for that integration here.

The second integration is with the prompt deployed in watsonx.ai. This integration allows watsonx Assistant to send the user input question and the passages extracted from Watson Discovery as payload to the deployed prompt. Inside the deployment, the prompt variables for passages and user question are filled in, and the completed prompt is sent to the LLM specified in the deployment along with the defined parameters. The response from the LLM is sent back to watsonx Assistant and displayed to the user. You can find the OpenAPI specification for that integration here.
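Conceptually, the two extensions together perform the orchestration below, shown here as plain Python for clarity. In practice watsonx Assistant wires these steps declaratively through its actions and the OpenAPI extensions; the demo functions are stand-ins, not a real Assistant API:

```python
# Conceptual sketch of the orchestration performed by the two custom extensions:
# retrieve passages for the question, then generate a grounded answer.
# The demo_* functions stand in for the Discovery and watsonx.ai calls above.

from typing import Callable, List

def answer_question(
    question: str,
    retrieve_passages: Callable[[str], List[str]],
    generate_answer: Callable[[str, str], str],
) -> str:
    passages = retrieve_passages(question)                    # extension 1: Watson Discovery
    return generate_answer(question, "\n\n".join(passages))   # extension 2: watsonx.ai deployment

def demo_retriever(question: str) -> List[str]:
    # Stand-in for the Watson Discovery query sketched earlier
    return ["Bats are the only mammals capable of sustained flight."]

def demo_generator(question: str, passages: str) -> str:
    # Stand-in for the deployed watsonx.ai prompt call sketched earlier
    return f"(answer grounded in: {passages})"

print(answer_question("Can bats really fly?", demo_retriever, demo_generator))
```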

The RAG reference architecture aligned with the IBM products used in this solution

Step 5: Monitor the Assistant interactions in watsonx.governance

The seamless communication between watsonx.ai and watsonx.governance within the watsonx platform means that we are all set once watsonx Assistant is integrated with the prompt deployment. As users interact with the Assistant, the questions, passages, and responses are logged into watsonx.governance as payload data. This payload data is automatically tracked and used to calculate and monitor the PII, HAP, and Readability metrics without human intervention, and it can be analyzed in the watsonx.governance UI to look for content containing Personally Identifiable Information (PII) or Hate, Abuse, Profanity (HAP). Since the assistant is open to public users, we definitely want to keep track of any PII or HAP that a bad actor may be injecting through the user interface.

As watsonx.governance is storing all the interactions with the prompt deployment, these interactions can be downloaded and appended with the ground truth to create a feedback dataset. This feedback data can be used to evaluate model responses against ground truth provided by subject matter experts. Metrics like ROUGE, METEOR, Readability etc. can be calculated and analyzed by comparing the assistant responses against the ground truth.
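As an illustrative sketch of that step with pandas (the file and column names are assumptions; align them with your actual payload export and the schema your evaluation expects):

```python
# Sketch: building a feedback dataset by appending SME ground truth to the
# payload records downloaded from watsonx.governance. File and column names
# are illustrative only.
# pip install pandas

import pandas as pd

# Interactions exported from watsonx.governance (question, passages, model response)
payload = pd.read_csv("payload_records.csv")

# Ground truth answers provided by subject matter experts, keyed by question
ground_truth = pd.read_csv("ground_truth.csv")   # columns: question, reference_answer

# Join the reference answers onto the logged interactions
feedback = payload.merge(ground_truth, on="question", how="inner")

# Save the feedback dataset for ROUGE / METEOR / readability evaluation
feedback.to_csv("feedback_data.csv", index=False)
print(feedback.head())
```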

Alerts are created automatically and displayed in the watsonx.governance UI if any metric value breaches the threshold. Developers also have the option to be alerted about threshold violations over channels like email by setting up alert notifications.

Based on the evaluation results, it looks like the prompt could be engineered further to increase the readability of the model’s output. The model responses could also be tested more thoroughly for the PII they are reported to be generating. The small percentage of PII detected in the output data may just be a phone number to call or an email address to reach out to for help, but it gives the prompt engineer actionable insight to improve the performance of the assistant in production.

Conclusion

As we can see, developers get alerted if any of the assistant metrics breach a given threshold. These alerts give developers actionable insight into prompt improvements, such as increasing or decreasing the token count or temperature, or adding an instruction to keep responses short and stick to the facts. The actionable insight can also be to update the ground truth in instances where, upon further review, the model response turns out to be better than the ground truth.

Using this end-to-end pipeline, you can not only leverage the best-in-class LLMs to supercharge your assistants, but you can also build a mature governance framework around the assistant deployments to ensure they interact with your end users in a safe, trusted and secure way.

Curious to learn more about these IBM offerings? You can begin by requesting a free trial environment for watsonx.ai, watsonx.governance, watsonx Assistant or Watson Discovery!


Aakanksha Joshi, Senior AI Engineer, IBM Client Engineering • M.S. Data Science, Columbia University