Designing an Observability Pipeline for LLM Applications

Patcher
4 min read · Jun 19, 2024


With the rapid adoption of Large Language Models (LLMs) like OpenAI’s GPT-4, Cohere’s Command, Mistral’s Mixtral models, and Anthropic’s Claude, the complexity and dynamism of modern applications are skyrocketing. As developers and tech companies integrate LLM capabilities into their products, having a robust observability pipeline is no longer a luxury; it’s an absolute necessity.

Observability ensures that your LLM-driven applications not only perform reliably but also continuously improve over time. In this article, we’ll delve into the core aspects of observability for LLM applications, highlight essential metrics, and show how OpenLIT — an OpenTelemetry-native observability and evaluation tool for LLMs and GPUs — can address most of your challenges.

Why Is Observability Essential in LLM Applications?

The answer lies in the inherently unpredictable and opaque nature of LLMs. Given their complex architectures, understanding why an LLM behaves a certain way or debugging a performance issue can be a daunting task. Moreover, the rich and nuanced outputs of LLMs necessitate a deep dive into their workings to ensure they meet user expectations, adhere to security guidelines, and align with business goals.

Key Elements to Monitor in LLM Applications

Performance Metrics:

A popular and effective framework for monitoring these metrics is the RED Method, introduced by Tom Wilkie at GrafanaCon EU in 2015. This method focuses on three key areas (a minimal instrumentation sketch follows the list):

  • Rate: Monitor the number of requests per second your application handles. In the context of LLM applications, this gives you a clear picture of your LLM’s throughput and can help identify whether your LLM can manage load efficiently without bottlenecks. For instance, a sudden drop in request rate might indicate issues with scalability or availability of the LLM service. Monitoring rate can also help improve queue size management, ensuring that requests are processed in a timely manner.
  • Errors: Track the number of requests that fail. An uptick in error rates can signal underlying issues with LLM API calls, such as authentication failures, rate limiting errors, or model callback issues. Early detection through error rate monitoring can prevent larger systemic breakdowns and surface frequent causes of errors, which can then be addressed to improve your LLM’s performance and reliability.
  • Duration: Measure the amount of time each request takes to be processed, also known as response time or latency. While fast responses are crucial for a smooth user experience in general, in the LLM landscape, minimizing latency becomes even more critical. LLM tasks, such as generating text or processing complex queries, can be resource-intensive, leading to longer response times. Monitoring and optimizing these durations ensures that users receive timely and accurate responses, which is essential for maintaining user engagement and satisfaction. High latency might also indicate inefficiencies in the model’s parameters or the need for a lighter model.
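
Below is a minimal sketch of recording these three signals around an LLM call with the OpenTelemetry Python SDK. The meter name, metric names, attribute key, and the `call_llm` helper are illustrative assumptions, not OpenLIT’s built-in instrumentation.

```python
import time

from opentelemetry import metrics

# Returns a no-op meter unless a MeterProvider/exporter is configured elsewhere.
meter = metrics.get_meter("llm.app")

request_counter = meter.create_counter(
    "llm.requests", description="Number of LLM requests (Rate)"
)
error_counter = meter.create_counter(
    "llm.errors", description="Number of failed LLM requests (Errors)"
)
latency_histogram = meter.create_histogram(
    "llm.request.duration", unit="s",
    description="End-to-end LLM request latency (Duration)"
)


def observed_completion(prompt: str) -> str:
    attrs = {"gen_ai.request.model": "gpt-4"}  # example attribute for slicing metrics
    request_counter.add(1, attrs)
    start = time.monotonic()
    try:
        return call_llm(prompt)  # hypothetical helper that calls your LLM provider
    except Exception:
        error_counter.add(1, attrs)
        raise
    finally:
        latency_histogram.record(time.monotonic() - start, attrs)
```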

User Interactions:

  • Usage Patterns: Monitor how users interact with your application, including the frequency and context of queries. This helps identify common use cases and operational challenges, allowing for better optimization.
  • Feedback: Collect and analyze user feedback to identify pain points and improvement opportunities. Direct user feedback is crucial for refining the LLM’s performance and meeting user expectations (see the sketch after this list for one way to capture it).
  • Behavioral Insights: Understanding user behavior helps refine the model and enhance the user experience. Analyzing user interactions can uncover trends and preferences that inform improvements and increase satisfaction.
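
As one concrete way to act on the Feedback point above, the sketch below attaches a user’s rating to the currently active trace via the OpenTelemetry API, so feedback can later be correlated with the exact request that produced the response. The event name and attribute keys are assumptions chosen for illustration.

```python
from opentelemetry import trace


def record_feedback(score: int, comment: str = "") -> None:
    """Attach a user feedback event to the span of the request being rated."""
    span = trace.get_current_span()  # no-op span if nothing is being traced
    if span.is_recording():
        span.add_event(
            "user_feedback",
            attributes={"feedback.score": score, "feedback.comment": comment},
        )


# Example: call this from the endpoint that receives thumbs-up/down or ratings.
record_feedback(4, "Helpful answer, but a bit slow")
```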

GPU Performance Metrics:

When self-hosting LLMs using tools like GPT4All or Ollama, it’s crucial to monitor and manage GPU performance, since LLM inference depends heavily on the GPU. The key metrics are listed below, followed by a short polling sketch.

  • Utilization Percentage: Track the percentage of GPU resources being used. High utilization indicates efficient use of GPU capabilities, while consistently low utilization might suggest inefficiencies or bottlenecks in the system.
  • Power Usage: Monitor the power consumption of your GPU. Keeping an eye on power usage helps identify potential issues such as overheating or processing inefficiencies that degrade performance.
  • Memory Usage: Keep tabs on the amount of GPU memory being used. LLMs can be memory-intensive, so monitoring memory usage ensures that the GPU has sufficient resources to handle tasks without running into out-of-memory errors.
  • Temperature: Track the temperature of your GPU. High temperatures can lead to thermal throttling, which reduces performance and can potentially damage your hardware.
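
For reference, here is a minimal sketch of reading all four of these metrics with NVIDIA’s NVML bindings (the `pynvml` package). The polling loop and printed format are illustrative assumptions and not how OpenLIT implements its GPU monitoring.

```python
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the host

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)      # utilization %
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)             # bytes
        temp = pynvml.nvmlDeviceGetTemperature(
            handle, pynvml.NVML_TEMPERATURE_GPU
        )                                                         # degrees Celsius

        print(
            f"util={util.gpu}% power={power:.1f}W "
            f"mem={mem.used / 2**20:.0f}/{mem.total / 2**20:.0f}MiB temp={temp}C"
        )
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()
```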

Do we really need a new tool for LLM Observability?

Nope, existing observability tools like Grafana, DataDog, and SigNoz already do a great job. So, what’s different? Most AI developers aren’t SREs; they need a quick and easy solution for comprehensive monitoring. That’s where OpenLIT comes into play.

Introducing OpenLIT: Open-source LLM Observability

OpenLIT is an OpenTelemetry-native tool designed to help developers gain insights into the performance of their LLM applications in production. It automatically collects LLM input and output metadata and monitors GPU performance for self-hosted LLMs.

  • Advanced Monitoring of LLM, VectorDB & GPU Performance: OpenLIT offers automatic instrumentation that generates traces and metrics, providing insights into the performance and costs of your LLM, VectorDB and GPU usage. This helps analyze how your applications perform in environments like production, enabling efficient resource usage and scalability.
  • Cost Tracking for Custom and Fine-Tuned Models: OpenLIT allows tailored cost tracking for specific models using a custom JSON file. This feature ensures precise budgeting and alignment with your project needs.
  • OpenTelemetry-native & Vendor-neutral SDKs: Built with native OpenTelemetry support, OpenLIT integrates seamlessly with your projects. This vendor-neutral approach reduces barriers to integration, making OpenLIT an intuitive part of your software stack rather than an additional complexity.

So, OpenLIT acts as a simple solution to collect observability data from LLM applications, which can then be viewed in any of the OpenTelemetry-compatible observability tools mentioned above.
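
A minimal sketch of wiring it up, based on the project’s documented `openlit.init()` entry point, might look like the following; the OTLP endpoint value and the GPU flag are assumptions to verify against the OpenLIT README for the version you install.

```python
import openlit
from openai import OpenAI

# Point the SDK at any OTLP-compatible backend (OpenLIT's own stack,
# Grafana, SigNoz, DataDog, and so on).
openlit.init(
    otlp_endpoint="http://127.0.0.1:4318",  # assumed local OTLP HTTP endpoint
    collect_gpu_stats=True,                 # assumed flag for GPU metrics on self-hosted models
)

# Subsequent LLM calls are instrumented automatically.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```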

To get started with OpenLIT, I recommend checking out the GitHub repository: https://github.com/openlit/openlit


Patcher

Principal maintainer of OpenLIT. Developing tools in the LLMOps space and sharing my experiences in Observability and LLM Evals.