Feedback Intelligence - Medium

End-to-End Debugging: Tracing Failures from the LLM Call to the User Experience

Mariam — Mon, 22 Sep 2025 19:00:18 GMT

Why End-to-End Debugging Matters

Building with LLMs has reached a stage where the goal is no longer just generating text but delivering outcomes that are fast, accurate, and genuinely useful to users. But as teams move from prototypes to production systems, they discover that failures can occur at multiple layers of the stack, often without clear visibility into where things are breaking down.

Infrastructure issues like timeouts, routing errors, or sudden cost spikes can derail performance.
Prompt and model issues lead to hallucinations, irrelevant answers, or repetitive responses.
User experience issues show up when users feel confused, rephrase their requests, or abandon the interaction altogether.

Each of these problems looks similar on the surface — “the AI failed” — but the root causes live in very different places. Without connecting signals across infra, model, and user experience, teams end up spending hours chasing dead ends. Debugging becomes guesswork instead of a systematic process.
That’s where end-to-end debugging comes in: tying together what happened at the request level with how the user experienced it. Done right, it shortens the feedback loop from failure to fix, and helps teams build AI systems that are not just functional, but dependable at scale.

Portkey: Full Trace of Requests Across Models and Tools

When debugging AI systems, the first step is to understand what happened at the infrastructure layer. Was the request slow, expensive, or dropped altogether? Did the gateway route it correctly, or did it fail at the provider side?
Portkey gives teams this visibility with end-to-end request traces across 250+ models and providers. Every request flowing through the gateway is automatically instrumented, so developers don’t have to stitch together logs from different SDKs or cloud services.

With Portkey, each trace captures key details:

Latency and throughput — how long each request took, and where time was spent.
Token usage and cost — full visibility into consumption and spend.
Routing and caching behavior — which provider handled the call, whether it was served from cache, and what fallbacks were triggered.
Errors and timeouts — broken down by provider, so teams can isolate external issues quickly.

For example, a developer investigating a latency spike can open the Portkey dashboard and immediately see that responses from a specific provider slowed down after a certain time. Instead of guessing, they now know it’s a provider-side issue, not their code or prompt.
This goes beyond single LLM calls. Portkey can also trace entire agent runs where an LLM is orchestrating multiple steps across APIs, tools, and other models. Instead of seeing just the final output, teams can follow the full chain of reasoning and execution: which tool was invoked, how long it took, whether retries happened, and where bottlenecks emerged.
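
To make the latency-spike example concrete, here is a minimal sketch of the kind of per-provider comparison a trace view supports. The `TraceRecord` fields, the helper functions, and the sample values are all hypothetical and only mirror the details listed above; they are not Portkey's schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class TraceRecord:
    """Hypothetical trace fields for illustration; not Portkey's actual schema."""
    provider: str
    latency_ms: float
    total_tokens: int
    cost_usd: float
    cache_hit: bool
    error: Optional[str] = None


def latency_by_provider(traces):
    """Return {provider: median latency in ms}, ignoring failed requests."""
    buckets = defaultdict(list)
    for t in traces:
        if t.error is None:
            buckets[t.provider].append(t.latency_ms)
    return {provider: median(values) for provider, values in buckets.items()}


def slow_providers(traces, threshold_ms):
    """Return the providers whose median latency exceeds the threshold."""
    medians = latency_by_provider(traces)
    return sorted(p for p, m in medians.items() if m > threshold_ms)


# Made-up sample data: provider_b is slow and also timed out once.
traces = [
    TraceRecord("provider_a", 420.0, 512, 0.004, False),
    TraceRecord("provider_a", 460.0, 498, 0.004, False),
    TraceRecord("provider_b", 2900.0, 505, 0.005, False),
    TraceRecord("provider_b", 3100.0, 520, 0.005, False),
    TraceRecord("provider_b", 0.0, 0, 0.0, False, error="timeout"),
]

print(slow_providers(traces, threshold_ms=1000.0))
```

Grouping by provider before comparing is what separates a provider-side slowdown from a problem in your own code or prompt, which would tend to show up across every provider.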

Feedback Intelligence: Full Trace of User Experience and Interaction Quality

Observability focused on user-facing outcomes

Even when system metrics report all green, the user experience can still fail. The pipeline may execute flawlessly, yet the interaction breaks down. This is where the focus shifts from infrastructure observability to user-facing outcomes: capturing intent alignment, response quality, and satisfaction signals directly from interactions to understand why users leave confused or dissatisfied. That way, instead of waiting for a thumbs-up/down, you get rich insights into what is really happening and can act on them right away.

Proprietary lightweight LLMs orchestrated to evaluate conversations across specific dimensions:

Each interaction is evaluated by Feedback Intelligence’s orchestration layer, which combines specialized small LLMs, ML/NLP techniques, and domain-specific logic for real-time conversational diagnostics. This multi-component approach not only ensures evaluations are context-aware and highly adaptable to different use cases but also generates use-case-specific reports so teams can track the metrics that matter most:

E-commerce → resolution rate and task completion are critical.
Fintech → correctness, compliance, and factual precision dominate.
Mental health → empathy, tone, and user sentiment take priority.

All evaluators and reports are highly customizable, allowing teams to define the exact outcomes they care about, from business KPIs to conversational quality metrics, and measure them systematically.

Some of the evaluation dimensions are below:

User confusion or retries flagged: detects when users rephrase, repeat, or escalate requests, using turn-level patterns and interaction loops to flag friction points or unclear responses.
Misalignment (hallucinations, unhelpful responses) detected: identifies when the model output is off-topic, factually incorrect, or unhelpful, leveraging FI’s evaluators for hallucination risk, coherence, and task relevance.
Intent vs. response match, satisfaction signals, sentiment: scores how well the response aligns with the original user intent and infers satisfaction or frustration from sentiment, dwell time, re-engagement, and drop-off patterns.

Together, these signals build a structured view of the interaction and reveal problems that remain invisible in system logs.

For example, a product manager reviewing this dashboard can immediately see that most chats are rated “Very Satisfied,” but over 130 are marked “Very Unsatisfied.” Instead of guessing why users left frustrated, they now know which conversations to investigate and can act on concrete interaction-level insights, not just system metrics.

Better together

On their own, infrastructure traces or user feedback only tell part of the story. Portkey shows whether the system performed as expected at the request or agent level. Feedback Intelligence shows whether the user actually got what they needed. When you connect the two, you can debug problems with far more precision.

Here’s how the combined view plays out:

Infra view (Portkey): Was the request slow, routed incorrectly, or did the provider fail?
Experience view (Feedback Intelligence): Did the user express confusion, repeat their request, or show signs of dissatisfaction?

By overlaying these perspectives, teams can isolate issues quickly:

If infra is clean but users are retrying → the problem is with prompting or model quality.
If infra latency spikes and users churn → it’s an infrastructure bottleneck.
If both infra and routing are fine, but users still fail → it’s likely an alignment or training gap.
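
The three rules above can be written down as a small triage function. This is only a sketch: the `Signals` fields and the `triage` name are assumptions made for illustration, and in practice the inputs would come from the infra and experience views rather than hand-set booleans.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-incident flags drawn from both views."""
    latency_spike: bool    # infra view
    provider_errors: bool  # infra view
    routing_ok: bool       # infra view
    users_retrying: bool   # experience view
    users_churning: bool   # experience view
    task_failed: bool      # experience view


def triage(s: Signals) -> str:
    """Map combined signals to the most likely failure layer."""
    infra_clean = not s.latency_spike and not s.provider_errors
    if s.latency_spike and s.users_churning:
        return "infrastructure bottleneck"
    if infra_clean and s.users_retrying:
        return "prompt or model quality"
    if infra_clean and s.routing_ok and s.task_failed:
        return "alignment or training gap"
    return "inconclusive"


incident = Signals(
    latency_spike=False, provider_errors=False, routing_ok=True,
    users_retrying=True, users_churning=False, task_failed=False,
)
print(triage(incident))
```

Returning "inconclusive" for anything outside the three rules keeps the function from over-claiming when the signals do not line up cleanly.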

Instead of guessing where failures originate, teams can move straight to fixing them, whether that means switching providers, tuning prompts, or refining the model itself.
The result is a much faster path from failure → diagnosis → resolution, and more reliable AI agents in production.

Towards reliable AI at scale

Reliability in AI is all about whether the system runs well and delivers meaningful outcomes to users.

Portkey ensures performance reliability. It provides the infrastructure visibility: latency, costs, routing, and full traces of requests and agent runs.
Feedback Intelligence ensures experience reliability. It interprets what the logs can’t: whether users felt satisfied, confused, or compelled to retry.

By combining these two layers, teams get a complete debugging loop. Infrastructure health and user alignment are no longer separate silos, but part of the same feedback system. The result is faster iteration, fewer wasted cycles, lower costs, and AI experiences that end users can trust.
As AI becomes central to workflows, this kind of end-to-end debugging isn’t optional — it’s the foundation for building production systems at scale.

👉 Ready to see this in action?

Explore Portkey to unify your LLM infrastructure, tracing, and governance.
Try Feedback Intelligence to turn user interactions into actionable insights for your AI.

Here’s a quick reference for diagnosing AI agent failures, mapping infra metrics and user experience observability to Portkey or Feedback Intelligence.

End-to-End Debugging: Tracing Failures from the LLM Call to the User Experience was originally published in Feedback Intelligence on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why traditional product analytics are simply not working for LLM applications

movchinar — Thu, 30 Jan 2025 01:55:42 GMT

When we developed the LLM-powered AI assistant part of our product offering we implemented a direct feedback mechanism with thumbs up/down and an input field. We thought that channel was enough to know when things go wrong and then improve the assistant using that feedback.

But the reality was different — it was 100 times harder to incorporate the collected feedback from the AI assistant into the improvement pipeline. Also, our users do not like to give us feedback — less than 10% of users are giving feedback so it’s biased.

Director of PM at a Fortune500 company

This is a typical complaint that I’m getting from AI teams daily …

I spend a lot of time speaking to different stakeholders within mid-market to large enterprises to understand how they improve their LLM applications: what feedback mechanisms they have, whether those are scalable, how they use that data, how big their usage is, and so on.

Before talking about the challenges of using traditional product analytics for LLM applications, let’s break down what problems product analytics tools such as Amplitude and Mixpanel solve.

Prior to tools like Amplitude existing, people who built apps, websites, or digital products had a tough time figuring out what users liked and how they behaved. Imagine trying to guess why someone stopped playing with a toy without seeing them play — that’s the kind of problem they faced!

In short, digital analytics tools made life easier for people building apps and websites by helping them see and fix problems faster. Tracking user metadata and behavior, then distilling insights from it, is a much better way of gathering feedback than asking for explicit user input, which can create friction and hence bias the results.

So can we use the same tools to gather feedback for LLM applications? It’s not that simple…

Before explaining why, let’s see what data product analytics tools usually gather:

#1 Specific actions users take within the product

Clicking a button, viewing a page or a screen, completing a certain journey, logging …

#2 Attributes that describe individual users

Demographics, account information, behavior traits (First visit date, last seen date, etc) …

#3 User sessions

Session start and end, duration, how many actions within each session …

And more …

Do you see the pattern? This feedback collection mechanism is deterministic: everything is defined and specified in advance.

Now here’s the kicker — companies often try to apply the same deterministic approach they’ve used for web and mobile to their LLM applications. They set up feedback mechanisms like ratings, thumbs up/down, or in-app text messages, and even Zendesk tickets. But here’s the problem: it’s just not working.

Why?

LLM interactions are fundamentally different. They revolve around free-form text inputs and outputs, which makes it tricky to define success or collect meaningful feedback (remember that less than 10% of users give feedback). Think about it: what exactly counts as “success”? Is it when the user gets the right answer, or when they give a positive rating? And what about feedback: does it come from the user’s input, their tone, or what they do next? It’s not as simple as tracking clicks or page views, but in many ways it is richer, because user intent can be understood more clearly from language (e.g., a button click vs. “I want to know how to do a transfer?”).

And there’s another layer: with reasoning LLMs such as OpenAI’s o1/o3 and DeepSeek’s R1, the problem gets even worse. It’s no longer input -> output; it’s input -> thinking (chain-of-thought) -> output. It gets incredibly hard to predict the output. You need a fundamentally different way of judging.

LLM interactions also rely heavily on context. For instance, imagine a user asking, “What’s the best pizza recipe?” The LLM gives a recipe, but the user responds, “This doesn’t help; I wanted low-calorie options.” Traditional tools might record the initial question and the follow-up, but they miss the bigger picture, such as the gap between the user’s intent and the response or the subtle frustration behind the comment.

The table below compares challenges in LLM applications, how traditional analytics tools (e.g., Mixpanel, Amplitude) address them, why they fall short, and how Feedback Intelligence solves them.

At Feedback Intelligence, we tell you what to fix/build/improve so you can focus on building. We have built the most robust analytics tool specifically for LLM applications.

The results are actionable across teams:

It allows customer-facing stakeholders (sales/BD/marketing/product) to measure the success that matters to their business needs.
It enables data scientists and AI engineers to have a clear picture of how to fix things (change the prompt according to recommendations, or generate a specific type of data to enhance fine-tuning or evaluation).

In the upcoming article, I’ll walk you through each feature, how it works, and why it’s actually useful — with real examples.

Spoiler alert: We’ll be diving into ChatGPT conversations 🤖.


You Deployed an LLM Agent — Now What?

movchinar — Thu, 09 Jan 2025 14:05:40 GMT

You Deployed an LLM Agent — Now What? Understanding the Landscape of Testing, Observability, and Analytics

Imagine you have a robot helper that talks to people and answers their questions. To make sure your robot works well, you need three types of tools — one to check it before it starts talking, one to make sure it doesn’t break, and one to help it learn and get smarter after it starts working. Let’s break it down!

1. Testing Tools (Evaluation Tools)

Evaluation tools are like quiz teachers that test your robot before it starts talking to people. These tools check if the robot is safe, fair, and ready to go so it doesn’t make big mistakes.

What They Do:

Make sure the robot follows rules and doesn’t say anything wrong.
Check if the robot is smart enough to answer tricky questions.
Test if the robot can handle tough situations without breaking.

Example: Imagine giving the robot a practice test to make sure it knows how to answer questions politely before it talks to real people.

Analogy: Evaluation tools are like school tests — they make sure the robot is ready to graduate before going into the real world.

2. Health Check Tools (Observability Tools)

Observability tools are like doctors who check your robot while it’s working. These tools don’t care about what the robot says — they just make sure it’s running smoothly and doesn’t break.

What They Do:

Watch if the robot’s battery (speed) is running low.
Make sure the robot doesn’t freeze or crash.
Send alerts if something is wrong so it can be fixed quickly.

Example: If the robot starts talking too slowly, observability tools will notice and tell someone to fix it right away.

Analogy: Observability tools are like mechanics for a car — they make sure the engine works but don’t teach the driver how to drive better.

3. Listening and Learning Tools (Analytics Tools)

Analytics tools are like coaches that help the robot learn and improve after it starts working. These tools watch how people talk to the robot and help it fix mistakes and get smarter over time.

What They Do:

Listen to what people say and how they react to the robot’s answers and vice versa.
Spot problems — like when people keep asking the same question or leave without finishing.
Teach the robot how to answer better so users don’t get confused or frustrated.

Example: If users keep asking, “How do I return my order?” and the robot gives unclear answers, Feedback Intelligence notices this and helps the robot explain it better next time.

Analogy: Analytics tools are like sports coaches — they watch how the robot performs, give tips to improve, and make it stronger for the next game.

Where Feedback Intelligence Fits In

Feedback Intelligence (FI) is one of the listening and learning tools. It doesn’t just check if the robot is working — it teaches the robot how to get better by learning from real conversations. FI is great for:

Making robots learn from users’ questions without waiting for big updates.
Helping businesses keep their robots relevant and helpful as users’ needs and intentions change.
Saving time and money by fixing small problems quickly instead of retraining the whole robot.

Stay tuned for upcoming articles where I’ll dive deeper into technical examples of tools in each category, from testing/evals to observability to analytics.


Get Implicit Feedback, Explicitly

movchinar — Tue, 05 Nov 2024 02:05:11 GMT

In the world of LLM apps like conversational AI, chatbots, and voice agents, user interactions are filled with valuable insights, often overlooked. Implicit feedback — such as rephrasing questions, hesitations, or abandoning conversations — provides crucial clues about where an app might be falling short. Unlike traditional methods that rely on explicit feedback, such as surveys or ratings, implicit feedback is continuously generated as users interact with the app. Leveraging this data allows for fine-tuning the models, making the app more intuitive and satisfying.

Many LLM apps currently depend on explicit feedback to gauge user satisfaction, similar to traditional product analytics that tracks clicks or session durations (Amplitude or Mixpanel). However, explicit feedback only captures a small part of the user experience. Implicit feedback offers a deeper, real-time understanding of user behavior, enabling rapid iteration and improvement based on what users do, not just what they say.

This is where Feedback Intelligence (FI) comes in. FI helps developers, data scientists, and product managers harness implicit feedback to enhance user satisfaction. By analyzing user interactions and behavior directly within the app, FI identifies user intent and satisfaction levels, prioritizing issues and optimizing the app.

Unlike periodic surveys, FI creates a continuous feedback loop, allowing apps to adapt and improve as user needs evolve. This proactive approach keeps LLM apps responsive and relevant, ensuring a refined and dynamic user experience.

By integrating implicit feedback into the lifecycle of LLM apps, organizations can identify and address friction points early, making their apps more aligned with real-world user expectations. Tools like Feedback Intelligence transform the user experience from static to dynamic, keeping LLMs adaptive and user-centric.

If you want to learn more about the Feedback Intelligence platform, check out the docs here or feel free to reach out at join@feedbackintelligence.ai for more details.


Build a RAG chatbot and optimize the performance through usage

movchinar — Fri, 13 Sep 2024 14:42:31 GMT


RAG is widely used to leverage LLMs in practical applications like chatbots. In this project, a RAG system was implemented using a publicly available dataset (the GDPR dataset), with real-time monitoring to track usage and continuously improve its performance based on human feedback (explicit and implicit). OpenAI serves as the underlying LLM for the system.

The GitHub repository for the code can be found here. The real-time monitoring and optimization implementation is available here.

The process began by splitting the original PDF into 21 separate articles. Each article was then converted from PDF format into text files using a custom script (available in gdprqa/gdprqa/pdf_parsing/utils.py).

Once converted, semantic chunking was applied to break the content into meaningful sections. These chunks were vectorized and stored in ChromaDB, with each article assigned to its own collection for easier retrieval. Additionally, all chunks were combined into a single collection to facilitate cross-article searches when necessary.

Two helper models support the system: one identifies the most relevant article for answering a query, defaulting to a full-document search if no suitable article is found. The other model reformulates the query to enhance the effectiveness of semantic searches within the vector database.
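
The routing step can be sketched as follows. In the project a helper model picks the article; the sketch below swaps in a plain keyword-overlap score so that it runs without an LLM, and every name in it (`pick_collection`, `article_summaries`, the collection names, `min_overlap`) is a hypothetical stand-in rather than code from the repository. The second helper model, query reformulation, is not covered here.

```python
import re

# Small illustrative stopword list; an assumption, not taken from the project.
STOPWORDS = {
    "a", "an", "and", "the", "to", "of", "be", "is", "it",
    "do", "does", "how", "what", "when", "must",
}


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def pick_collection(query, article_summaries, min_overlap=2, fallback="all_articles"):
    """Pick the best-matching per-article collection, or fall back to the combined one."""
    query_tokens = tokens(query)
    best_name, best_score = fallback, 0
    for name, summary in article_summaries.items():
        score = len(query_tokens & tokens(summary))
        if score > best_score:
            best_name, best_score = name, score
    # Too weak a match: search across all chunks instead of a single article.
    return best_name if best_score >= min_overlap else fallback


# Hypothetical one-line summaries standing in for the per-article collections.
article_summaries = {
    "erasure": "right to erasure and deletion of personal data",
    "breach_notification": "notification of a personal data breach to the supervisory authority",
}

print(pick_collection("When must a data breach be reported to the authority?", article_summaries))
print(pick_collection("How much does it cost to become GDPR compliant?", article_summaries))
```

The first query lands on a single article's collection, while the second matches nothing well enough and falls back to the combined collection, which is the cross-article search described above.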

You can ask questions about GDPR in the query section such as ‘How much does it cost to become GDPR compliant’, etc.

When you finish using the chatbot, type exit to send the usage data to Feedback Intelligence (FI) for analysis.

To analyze this chatbot usage and optimize it, we use FI. For that, we created the GDPR chatbot project in the platform and then, through settings, generated the API key to connect the chatbot to the platform. At this point, the count of added and processed data is 0.

After exiting the chatbot, the data is sent to FI successfully and the count is now 3 (we ran 3 queries). The Insights Engine diagnoses the usage and provides insights on issues, topics, user satisfaction score, and more. The results are used to optimize the chatbot.

Drop us a line if you are building any RAG application.

co-authors Mels Hakobyan & Erik Harutyunyan


Feedback Collection Mechanisms for RAG and Prompt-Engineered Systems in Production

movchinar — Thu, 29 Aug 2024 12:38:57 GMT

When running RAG and prompt-engineered (PE) systems in production, gathering feedback is key to keeping these solutions accurate and relevant. Feedback generally falls into two categories: explicit and implicit, both crucial for optimizing the system’s performance.

Explicit Feedback is straightforward — users directly tell you what they think. This can be through ratings, comments, or suggestions about the system’s outputs. Users might provide this feedback through in-app prompts like thumbs-up/down buttons or post-interaction surveys. But it doesn’t stop there. Explicit feedback can also come through other channels like email, Slack, Zendesk, or even phone calls to product owners. If users aren’t happy with a response, they’ll find a way to let you know or they will churn.

Implicit Feedback is more subtle, focusing on user behavior. This includes how long someone spends on a response, whether they scroll past it quickly, or how often they rephrase their query. While harder to interpret, implicit feedback can provide deep insights into the system’s effectiveness and how well it meets user needs.

Why Feedback Matters

Collecting feedback is crucial for several reasons:

Optimizing Accuracy: User feedback helps fine-tune prompts and retrieval methods as they have the most context on the task, ensuring the system continues to deliver relevant results.
Enhancing User Experience: Understanding user interactions allows for adjustments that make the tool more intuitive and effective.
Managing Risks: Feedback helps catch issues early, like when the system generates irrelevant or biased content.

How to Collect Feedback

Here are some effective ways to gather explicit feedback:

In-App Prompts: Simple thumbs-up/thumbs-down buttons or quick surveys embedded in the app make it easy for users to give feedback on the spot.
UI Widgets: Built-in widgets like star ratings or comment boxes enable detailed feedback when needed.
Email, Slack, Zendesk, and More: Users can share feedback through a variety of channels, from emailing support teams to messaging on Slack or filing a ticket in Zendesk. Even a direct call to the product owner can be a valuable source of explicit feedback. Of course, the latter is not ideal …

Implicit feedback, though less direct, is equally important and can be the hardest but most rewarding to gather. Tracking user behaviors — like how they navigate the app or which responses they spend the most time on — can reveal what’s working well and what needs improvement, even without explicit complaints. There are some techniques that engineers can implement to gather implicit feedback:

Preprocess Chat History: Start by preprocessing your users’ chat history into an easy-to-work-with format, such as a CSV file or, for more professional and automated analysis, a database. Then, for each user request, select the next five consecutive requests, skip the following five, and repeat this pattern.
Determine a Similarity Threshold: Establish a threshold for similar requests. You can do this by manually finding examples where users ask the same question but phrase it differently or using GPT to generate such variations. Calculate a similarity score for each example and use these scores to determine an average threshold.
Analyze for Implicit Feedback: Apply the threshold to the selected requests from step one. Filter out similar requests, and if you find at least two to three similar requests, it likely indicates that the user was not satisfied with the initial response and is rephrasing their question to get a better answer.
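
The three steps above can be sketched in a few lines. The article does not say which similarity measure to use, so this sketch assumes character-level similarity from the standard library's `difflib` as a stand-in (an embedding-based score would be a natural replacement). It also reads step one as "take five consecutive requests, skip five, repeat". The function names, the sample paraphrases, and the `min_similar=2` default are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sample_windows(requests, take=5, skip=5):
    """Step 1: take `take` consecutive requests, skip the next `skip`, and repeat."""
    step = take + skip
    return [requests[i:i + take] for i in range(0, len(requests), step)]


def average_threshold(paraphrase_pairs):
    """Step 2: average similarity over known pairs of rephrased questions."""
    scores = [similarity(a, b) for a, b in paraphrase_pairs]
    return sum(scores) / len(scores)


def looks_like_rephrasing(window, threshold, min_similar=2):
    """Step 3: flag a window when enough later requests resemble the first one."""
    first, later = window[0], window[1:]
    similar = sum(1 for request in later if similarity(first, request) >= threshold)
    return similar >= min_similar


# Made-up paraphrase pairs, as might be collected by hand or generated.
paraphrases = [
    ("How do I reset my password?", "How can I reset my password?"),
    ("Where is my order?", "Where's my order right now?"),
]
threshold = average_threshold(paraphrases)

# One user's requests in order (made-up data).
history = [
    "How do I return my order?",
    "How can I return my order?",
    "how do i return an order",
    "What are your opening hours?",
]
flagged = [w for w in sample_windows(history) if looks_like_rephrasing(w, threshold)]
print(len(flagged), "window(s) flagged as likely rephrasing")
```

Because the threshold is an average, roughly half of the reference paraphrases will score below it. Teams that care more about recall than precision may prefer a lower percentile of the same scores.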

Feedback Intelligence: A Holistic Approach

Feedback Intelligence brings everything together in one place. It doesn’t matter if the feedback comes from an in-app prompt, an email, or even a phone call: it consolidates and organizes all explicit feedback so you can easily see what’s working, what’s not, and why the users are not happy. This is particularly useful in applications like conversational AI, chatbots, and agents where quick and clear feedback is essential for making improvements.

But here’s the thing: most users, around 90%, won’t give you direct feedback. That’s where Feedback Intelligence really shines. It automatically tracks how users interact with your system, picking up on subtle behaviors like navigating, rephrasing queries, or responding to outputs. This implicit feedback is crucial because it gives you insights you might miss if you only rely on what users explicitly say. By combining both types of feedback, Feedback Intelligence enables AI teams to continuously optimize their RAG and PE systems to align the output with users’ needs.

For more technical information, check out our product documentation.

co-authors: Erik Harutyunyan & Mels Hakobyan


How to define metrics that matter to your use-case-specific RAG

movchinar — Thu, 22 Aug 2024 12:13:03 GMT


When building a domain-specific chatbot or conversational AI, one of the main goals is to optimize the product to meet users’ needs — what we often call ‘personalization.’ Traditionally, metrics like accuracy, F1 score, and precision have been used to evaluate ML models. However, as LLMs evolve, newer metrics, such as faithfulness, relevance, and retrieval accuracy (and more), have become just as important. Engineers are even defining their own custom test cases, which reflect the specific goals of the application.

But the real question is: Which metrics actually matter for the specific task at hand? More importantly, how can we measure whether the output is what users really expect?

Beyond Fundamental Metrics: Focusing on User Experience and Intention

Building the initial version of an LLM application might be simple enough, but maintaining and continuously improving it is where things get tricky. It’s not just about whether the model works — it’s about ensuring that its outputs consistently meet user expectations. While traditional evaluation metrics are important, they only tell part of the story. Personalization and optimization of the LLM responses based on different usage scenarios are crucial for ensuring a positive user experience.

Introducing the Satisfaction Score Metric

One way to tackle this challenge is by introducing a new metric — Satisfaction Score. This metric is designed to measure whether a user’s intent has been met, factoring in things like how the query was processed, user feedback (both explicit and implicit), and a deeper analysis of why certain outcomes occurred. The idea is to focus less on rigid performance metrics and more on whether users are walking away satisfied with the results. As it turns out, satisfaction levels can vary greatly depending on what users are looking for — whether it’s detailed information, creative responses, or help with a specific task.

To calculate the satisfaction score for each chat entry, a (query, response, context) triplet, our engine considers several of its characteristics. The most important are the user feedback (explicit and implicit), the sentiment of the query and the response, the faithfulness of the response, and the textual data itself.
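
The article does not give the formula the engine uses, so the following is purely an illustration of how signals like these could be combined into one number. The `ChatSignals` fields, the weights, and the weighted-average form are all invented for this sketch and should not be read as Feedback Intelligence's method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatSignals:
    """Hypothetical per-entry signals; names and ranges are assumptions."""
    explicit: Optional[int]    # +1 thumbs up, -1 thumbs down, None if not given
    rephrased: bool            # implicit signal: the user re-asked the question
    query_sentiment: float     # in [-1, 1]
    response_sentiment: float  # in [-1, 1]
    faithfulness: float        # in [0, 1]


# Invented weights, for illustration only.
WEIGHTS = {"explicit": 0.4, "implicit": 0.2, "sentiment": 0.1, "faithfulness": 0.3}


def satisfaction_score(s: ChatSignals) -> float:
    """Weighted average of the available signals, scaled to [0, 1]."""
    parts = {
        "implicit": 0.0 if s.rephrased else 1.0,
        "sentiment": ((s.query_sentiment + s.response_sentiment) / 2 + 1) / 2,
        "faithfulness": s.faithfulness,
    }
    if s.explicit is not None:
        parts["explicit"] = (s.explicit + 1) / 2
    # Renormalize so a missing explicit rating does not drag the score down.
    total_weight = sum(WEIGHTS[name] for name in parts)
    return sum(WEIGHTS[name] * value for name, value in parts.items()) / total_weight


entry = ChatSignals(explicit=None, rephrased=True,
                    query_sentiment=-0.4, response_sentiment=0.2, faithfulness=0.9)
print(round(satisfaction_score(entry), 3))
```

Renormalizing over the available signals matters because most entries will have no explicit rating, so implicit signals and faithfulness carry the score in the common case.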

co-author Erik Harutyunyan

Resources:

https://arxiv.org/abs/2212.09746


Enhancing LLMs with User-Driven Synthetic Data

movchinar — Thu, 15 Aug 2024 12:46:41 GMT

Garbage in, garbage out.

The quality of your data makes or breaks your model, impacting the model’s accuracy and performance in the real world. High-quality, diverse, and domain-specific data are key to building reliable models. But let’s be real — real-world data is rarely perfect. That’s where synthetic data comes in!

Synthetic data can fill the gaps when data is scarce, enhance privacy, and help reduce bias in AI. It’s a cost-effective way to generate data and create a more accurate testing environment. Plus, it can boost your real datasets, making your models even more robust and reliable.

As we’ve seen, traditional ML methods and synthetic data creation techniques don’t work well for LLMs. Creating synthetic data for LLMs is tough — it’s hard to maintain realism, capture context, balance bias, scale data, ensure consistency, and validate quality. Plus, there are legal, ethical, and domain-specific challenges to consider.

Some companies, like RAGAS and Patronus AI, offer tools to help, but many enterprises still create hundreds of Question-Context-Answer pairs manually.

Practice shows that the available solutions are time-consuming and introduce bias. Evaluators simply do not have enough context and typically lack domain knowledge.

Fi places the end-user at the center of the development process as both judge and evaluator. With Fi’s dataset module, users can quickly create datasets for testing or fine-tuning LLMs. They can select individual chats, or make bulk selections based on shared insights or topics. Once selected, Fi’s engine then generates ground truth responses with the goal of aligning responses closer to the user’s original intent. This generation consists of user signals and intent based on implicit and explicit feedback. The algorithm treats users as the main evaluators and judges as they typically have domain expertise and appropriate context for domain-specific products.

This approach is crucial for enterprises evaluating RAG, fine-tuned, or prompt-engineered performance after deployment. Manually created evaluation sets can become outdated and unrepresentative of user intentions. Fi ensures that evaluation data stays relevant and up-to-date by continuously incorporating user feedback.

We performed a benchmark experiment on a synthetic dataset our team created. The idea was to perform ground truth generation and observe the quality of the suggested response instead of the original one. Here’s an example of a good ground truth generation:

Of the 36 evaluated samples, 33 ground truth responses were high quality, like the one displayed above. In the remaining 3 samples, our engine failed to generate a good ground truth response to the user query. Investigating those samples, we found that the cause was knowledge holes in the provided contexts. This is expected: no LLM can fully answer a user query when the required information is partially or entirely missing from the context. To address this, we apply our knowledge hole detection mechanism, which filters out such samples and warns users that they cannot be added to the dataset until the knowledge hole is resolved, while still generating high-quality ground truth responses for the rest.
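To make the filtering step concrete, here is a minimal sketch in Python. It is not Fi's actual detection mechanism: the `Sample` type, the `required_facts` field, and the substring check are all hypothetical stand-ins, chosen only to show the shape of "flag samples whose context is missing what the query needs, keep the rest". The last line just restates the benchmark numbers above as a rate.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    query: str
    context: str
    required_facts: list[str]  # hypothetical: facts the answer depends on

def has_knowledge_hole(sample: Sample) -> bool:
    """Toy stand-in: flag the sample if any required fact is absent from the context."""
    ctx = sample.context.lower()
    return any(fact.lower() not in ctx for fact in sample.required_facts)

def split_samples(samples: list[Sample]) -> tuple[list[Sample], list[Sample]]:
    """Return (accepted, flagged); flagged samples stay out of the dataset."""
    accepted = [s for s in samples if not has_knowledge_hole(s)]
    flagged = [s for s in samples if has_knowledge_hole(s)]
    return accepted, flagged

# Illustrative data only, not taken from the benchmark.
samples = [
    Sample("What is the refund window?", "Refunds are accepted within 30 days.", ["30 days"]),
    Sample("Who approves refunds?", "Refunds are accepted within 30 days.", ["finance team"]),
]
accepted, flagged = split_samples(samples)

# Benchmark figures from the text: 33 of 36 samples were high quality (about 91.7%).
high_quality_rate = 33 / 36
```

In a real pipeline the substring check would be replaced by whatever detector decides that the context cannot support an answer; the point is only that flagged samples are held back with a warning instead of being silently added.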

If you’re looking to improve your LLM’s performance (RAG, fine-tuned, or prompt engineered) and ensure your evaluation data stays relevant, explore Fi’s Dataset module. Drop us a line here.

co-authors Erik Harutyunyan & Haig Douzdjian

Enhancing LLMs with User-Driven Synthetic Data was originally published in Feedback Intelligence on Medium, where people are continuing the conversation by highlighting and responding to this story.

Optimizing LLMs with RLHF

movchinar — Fri, 19 Jul 2024 12:46:03 GMT

Reinforcement Learning from Human Feedback (RLHF) is a well-known technique in machine learning. It is essential for developing LLMs that are aligned with human values and expectations, provide high-quality and reliable outputs, enhance user satisfaction and trust, adapt to diverse tasks, and ensure ethical and safe AI development.

Let’s dive into more technical details on how it is being implemented in general and specifically for LLMs. Most importantly, what are the pros and cons of it in the context of LLMs?

RLHF is a machine learning technique that improves AI training by adding human feedback to the process. Here’s a simple explanation:

Reinforcement Learning Basics: In standard RL, an agent learns by performing actions in an environment to earn rewards or penalties, improving its decisions over time.
Human Feedback Integration: In RLHF, humans provide additional feedback on the agent’s actions, such as ranking outputs, giving rewards, or correcting actions. This extra input helps the agent learn more effectively, especially when traditional rewards are unclear or insufficient.
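
As a toy illustration of the two ideas above, the sketch below uses a deliberately tiny setup: two possible actions, an environment reward that cannot tell them apart, and a human feedback score that can. The action names, reward values, and learning rate are all made up for the example; real RLHF works on model outputs, not a two-entry table.

```python
def update(value: float, reward: float, lr: float = 0.5) -> float:
    """Move the value estimate a step toward the observed reward."""
    return value + lr * (reward - value)

# Hypothetical actions and signals, for illustration only.
values = {"polite": 0.0, "curt": 0.0}
env_reward = {"polite": 0.0, "curt": 0.0}       # environment reward is uninformative
human_feedback = {"polite": 1.0, "curt": -1.0}  # humans prefer the polite reply

for _ in range(3):
    for action in values:
        # Human feedback is added on top of the environment reward.
        reward = env_reward[action] + human_feedback[action]
        values[action] = update(values[action], reward)

best_action = max(values, key=values.get)
```

With the environment reward alone, both value estimates would stay at zero and the agent would have no reason to prefer either action; the human signal is what separates them.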

Implementing RLHF in LLMs involves a few key steps to ensure that the model not only generates text but also aligns its outputs with human values and expectations. Here’s a simplified explanation of the process:

Initial Training: The LLM is first pretrained on a large text corpus to learn language patterns, and is then typically refined with supervised fine-tuning so that it generates coherent text.
Collecting Human Feedback: Human evaluators review and rate the model’s responses based on relevance, coherence, and appropriateness.
Reward Modeling: A reward model, a separate neural network, is trained to predict human feedback, helping the LLM understand preferred responses.
Policy Optimization: The LLM is fine-tuned with reinforcement learning, using the reward model’s evaluations to adjust its generation policy. Techniques like Proximal Policy Optimization (PPO) ensure stable and efficient updates.
Iterative Process: This cycle of generating responses, collecting feedback, and fine-tuning is repeated iteratively to improve the model’s alignment with human expectations continuously.
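
The reward modeling and policy optimization steps can be sketched with two small functions. This is a simplified, scalar-only illustration, assuming one common formulation: a pairwise (Bradley-Terry style) loss for the reward model, and the clipped surrogate objective from PPO. The function names are ours, and a real implementation would operate on batches of tensors inside a training loop.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model
    scores the human-preferred response above the rejected one."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample.

    ratio is new_policy_prob / old_policy_prob for the sampled output;
    clipping it to [1 - eps, 1 + eps] limits how far one update can move the policy.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Reward model agrees with the human ranking: low loss.
loss_agree = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
# Reward model disagrees with the human ranking: high loss.
loss_disagree = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)

# A large policy change on a good output is capped at (1 + eps) * advantage.
capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)
```

The clipping is what the "stable and efficient updates" in the policy optimization step refers to: even if the new policy strongly favors an output, the objective stops rewarding further movement once the ratio leaves the clip range.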

Implementing RLHF in LLM-based applications such as Retrieval-Augmented Generation (RAG), Fine-Tuning (FT), and Prompt Engineering (PE) can present several challenges and drawbacks. Below are some of the main cons:

Because it is complex and costly, only foundation model companies and big tech can currently afford RLHF at scale in its present form.
Designing reward models for LLM-based applications is non-trivial. Poorly designed reward models can misguide the LLM, leading to suboptimal or undesirable outcomes.
Human feedback is usually collected through a very simple mechanism: for each prompt, evaluators are asked to choose between two generated responses. This reduces labeling costs, but it gives the reward model trained on this data only a weak learning signal.
With traditional RLHF, task-specific feedback may not generalize well to other tasks or domains, which limits versatility and reliability.

Feedback Intelligence (FI, formerly Manot) offers a robust alternative to traditional RLHF by providing an integrated SaaS platform designed to streamline the feedback loop for LLM-based products. Here’s how FI helps overcome the drawbacks of RLHF:

Effortless Feedback Consolidation

Evaluators often lack sufficient context about the task, which is why they introduce bias. FI’s Connectors automatically collect explicit and implicit feedback directly from end-users, who have all the context. This ensures consistent, unbiased, and scalable feedback collection from diverse user interactions.

Actionable Insights and Efficient Optimization

FI’s Insights analyzes feedback to derive actionable insights and identify the root causes of issues. This enables targeted improvement of LLM-based products and faster issue resolution, guided by end-user expectations. Additionally, it reduces computational costs and resource demands compared to traditional RLHF methods. We achieve this by integrating traditional deep learning models with unsupervised learning and employing LLM-as-a-judge in a novel orchestration methodology.

co-author Haig Douzdjian

Optimizing LLMs with RLHF was originally published in Feedback Intelligence on Medium, where people are continuing the conversation by highlighting and responding to this story.

Product Analytics — Traditional vs AI Products

Haig Douzdjian — Wed, 17 Jul 2024 19:03:36 GMT

Ahh, the good ol’ days, when click-through rates, usage metrics, and conversion funnels were all a product team needed to thrive.

Traditional product analytics are the crutch all product teams lean on for optimization. They track everything from user actions (page views, clicks, heat maps, etc.), conversion rates, retention, and churn to A/B testing, cohort analysis, and much more.

Then… *dramatic pause*… everything changed

With the rise of LLM-powered products (conversational AI, agents, etc), we stepped into uncharted waters… Why? There are no longer pre-defined user journeys. Each journey generates unique responses and dynamic conversations.

This shift in product engagement makes it extremely difficult to use traditional product analytics tools to understand a user’s experience and expectations.

So, what can be done? There are two options:

Define your own evaluation metrics and do continuous manual analysis. This is an excellent option for early-stage products, less so for scaling and mature products.
Sit back, relax, and embrace Feedback Intelligence.

Despite Feedback Intelligence (FI) having arguably the longest name ever, it makes sense as an out-of-the-box concept. FI acts as the missing link between traditional analytics and the unpredictable world of LLM-powered products.

So, to best describe what Feedback Intelligence is:

It’s a solution designed to understand and analyze the unique interactions between users and LLM systems. By first evaluating high-entropy LLM-powered products, it can then optimize those products.

For example, FI allows AI product teams to understand their users’ experiences and expectations (evaluate), and then personalize and improve those experiences to better align with expectations (optimize).

How does Feedback Intelligence work?

It interprets context, sentiment, and nuances in human-AI interactions through:

Sentiment analysis — identifies relevance, conciseness, completeness, and emotion in each user interaction.
🔑 Root Cause Analysis (RCA) — understands issues at their core: what went wrong, where it went wrong, why it went wrong, and how to resolve it.
Customer evaluation benchmarks — creates personalized evaluation metrics for your product and underlying LLM infrastructure, based on user expectations and experiences.
Synthetic golden set generation — automatically generates high-quality datasets from your user engagement, ensuring personalization without the expense.
Prompt optimization — automatically identifies pitfalls in current prompts and recommends personalized alternatives to better meet user expectations.
Automated hyper-parameter testing — leverages 🔑 RCA to find and resolve underlying issues in the retrieval process, avoiding time-intensive manual testing.
Knowledge gap identification — leverages 🔑 RCA to find missing context and recommend information to add.

We are extremely bullish that in order to optimize solutions, you must first evaluate and understand your audience. Collecting high-quality personalized data is the difference between successful and unsuccessful AI solutions.

😤 Enough said.

We’d love to chat, grab a time here!

More content from @movchinar and me here.

Product Analytics — Traditional vs AI Products was originally published in Feedback Intelligence on Medium, where people are continuing the conversation by highlighting and responding to this story.