Foundations of Scalable LLM Analytics

A Low-Latency Architecture for Storing and Distributing Search History Across Teams

Kristof
6 min read · Jun 4, 2024

In the rapidly growing market for enterprise LLM applications, efficiently managing and analyzing query data is becoming essential. Applications built with tools like Haystack, LangChain, and LlamaIndex produce data that is essential for shipping these applications to production. In addition to classical observability data like latency, understanding search history data is critical for qualitative application improvements, such as detecting hallucinations or fine-tuning models.

We have overhauled our approach, moving from a simple relational database for storing search history data to a scalable and extensible data analytics engine that keeps query latency low and can store millions of searches per day. The key technologies are Amazon Simple Notification Service (SNS), Vector, AWS S3, and AWS Athena.

This new system empowers data teams to enhance LLM applications by collecting and analyzing search data more effectively. Teams can continuously improve their applications through hallucination detection, fine-tuning, or whatever fancy new method gets published next.

The Challenge: A Simple Approach

When starting work on a product, you begin with the simplest architecture to validate your approach first. Once you know it’s valuable to your users, you can optimize for high-volume traffic and scalability. The simplest solution that allowed us to authenticate a request and also store the results for later use was to add a service (the API) in the middle that keeps track of the requests and stores the results in a relational database.

As the number of requests grew, we faced significant challenges:

  1. Latency Issues: The API introduced a constant 20–400 ms of latency per request due to synchronously storing the search history.
  2. Database Bottlenecks: The database became a single point of failure, and it does not exploit the unique characteristics of search history data. Postgres, as we use it here, is not ideal in its default configuration for this time-series-like data, and it would additionally need to be tuned for write performance. We learned that when working with LLM applications, you need to track results for future training, evaluation, or transparency into how your application behaves. This use case occasionally requires reading and aggregating many entries, but we can afford a warm-up, a cache, or higher latency. Cold storage for this data might be cheaper and fit the use case better.
  3. Scalability: Horizontal scaling of pipeline replicas temporarily mitigated performance issues, but the increasing load threatened to overwhelm our existing infrastructure.

Not Just A Scaling Challenge

As we scaled, we needed to make the data we gathered available across teams, enabling them to build features on top of this foundational application data. Calculating the Groundedness score of your pipeline to measure hallucinations or fact-checking the responses are use cases you regularly see in enterprise systems. To address these challenges, we devised a new architecture leveraging SNS and Vector.

Asynchronous persistence with Vector

  • Vector Sidecar: We integrated Vector, an observability data pipeline, as a sidecar to our application. Vector watches the log lines for successful queries and asynchronously forwards the query and response to SNS. This removes the latency impact on the API, as logging occurs out-of-band. To process the logs later, we format them as JSON [1].
  • Log Collection: Vector collects logs in real-time and identifies log entries indicating successful queries. Upon detection, it forwards the relevant data to SNS for further processing.

Before collecting the logs, we need to ensure the required information is logged to a file. For this, we can use structlog, as described in my previous article, and call:

import structlog 
logger = structlog.get_logger(__name__)

# My process that runs some LLM application
# ... run a haystack pipeline
# ... run a langchain chain
# ... run a llamaindex pipeline
# ...
my_response = {"query": "Some query", "response": "Some response"}
logger.info("MyEvent", my_response=my_response)
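For completeness, here is a minimal sketch of a structlog configuration that renders each event as one JSON line into the file Vector tails below. The path and processor list are illustrative; see the linked article for the full setup:

from pathlib import Path

import structlog

# Render every event as a single JSON line and append it to the file
# that the Vector "file" source tails. JSONRenderer places the event
# name under the "event" key, which the Vector filter below matches on.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.WriteLoggerFactory(
        file=Path("/var/log/my-path/app.log").open("at")  # illustrative path
    ),
)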

Now that we have configured our log line, we can route the messages to SNS using Vector’s aws_sns sink. All we need to do is parse the events, filter for the event name, and wire the results to the correct sink:

data_dir: /vector-data-dir
sinks:
  # Tell Vector to send the "my_event" stream to SNS with a configured
  # AWS Simple Notification Service sink
  sns:
    type: aws_sns
    inputs:
      - my_event
    topic_arn: {{ .Values.logging.snsTopic | quote }}
    encoding:
      codec: json
transforms:
  # Messages are read as plain text when received from a file,
  # so we need to parse them as JSON
  parse_json:
    type: remap
    inputs:
      - log_file
    source: |
      .message = parse_json!(.message)
  # We only want to distribute the configured "MyEvent" messages
  # to SNS, since these contain the search history data we want
  # to distribute
  my_event:
    type: filter
    inputs:
      - parse_json
    condition: "match!(.message.event, r'MyEvent')"
sources:
  log_file:
    type: file
    include: ["/var/log/my-path/*.log"]
    # SNS limits the message size to 256 KB,
    # so we drop all messages exceeding this size
    max_line_bytes: 256000

Data Distribution with SNS and Firehose

  • Fire and Forget: SNS facilitates the distribution of messages to multiple clients without requiring the sender to know the recipients. This decouples the logging process from the downstream processing, allowing for more flexible and scalable handling of search history data.
  • AWS Integration: SNS messages are routed to AWS Firehose, which buffers, converts, and sends the data to S3 for long-term storage. The use of the Parquet format ensures efficient storage and retrieval.
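To make the fan-out concrete, here is a minimal boto3 sketch of subscribing a Firehose delivery stream to the SNS topic. All ARNs are placeholders and assume the topic, delivery stream, and IAM role already exist:

import boto3

sns = boto3.client("sns")

# Subscribe an existing Kinesis Data Firehose delivery stream to the
# SNS topic. SNS then fans out every search history message to Firehose,
# which buffers the records, converts them, and writes them to S3.
sns.subscribe(
    TopicArn="arn:aws:sns:eu-central-1:111122223333:search-history",  # placeholder
    Protocol="firehose",
    Endpoint="arn:aws:firehose:eu-central-1:111122223333:deliverystream/search-history",  # placeholder
    Attributes={
        # Role that allows SNS to put records into the delivery stream
        "SubscriptionRoleArn": "arn:aws:iam::111122223333:role/sns-firehose-role",  # placeholder
    },
)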

Data Storage and Querying

  • S3 for Long-Term Storage: Search history data is stored in S3, an economical solution for high-volume, immutable data. This approach mitigates the database bottleneck by offloading storage responsibilities to S3.
  • Athena for Querying: AWS Athena, a serverless SQL query service, allows us to run SQL queries on the Parquet files stored in S3. The columnar format makes Athena a scalable and cost-effective solution for analyzing large datasets.
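As an illustration, querying the stored search history with boto3 could look like the following sketch. The database, table, and column names, as well as the results bucket, are assumptions that depend on your Firehose and Glue setup:

import boto3

athena = boto3.client("athena")

# Count searches per day over the stored history. Table and column
# names are hypothetical and depend on how Firehose partitions the
# Parquet files in S3.
response = athena.start_query_execution(
    QueryString="""
        SELECT date(from_iso8601_timestamp(timestamp)) AS day,
               COUNT(*) AS searches
        FROM search_history
        GROUP BY 1
        ORDER BY 1 DESC
    """,
    QueryExecutionContext={"Database": "llm_analytics"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print("Started query:", response["QueryExecutionId"])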

Implementation and Results

The implementation of this solution yielded significant improvements in performance and scalability:

The Good 🎉

  1. Reduced Latency: By offloading the logging process to an asynchronous pipeline, we reduced the latency by 20–400 ms per request.
  2. Scalability: The new architecture scales horizontally without the previous bottlenecks. Search history data is efficiently managed and stored, allowing us to handle higher query volumes seamlessly.
  3. Enhanced Reliability: Decoupling the search history logging from the main application flow reduces the risk of performance issues affecting the entire system. Any issues with processing the search data do not impact the overall user experience when running queries against the pipeline.
  4. Cross-Team Collaboration: The use of SNS has improved communication and data sharing across different squads. This has enabled better integration of search history data into various product features and analytics tools, enhancing the overall functionality and user experience.

The Bad 🤔

  1. Eventual consistency and potential for data loss: We introduced a delay between receiving a 200 response and being able to fetch the previous questions. The search history is no longer guaranteed to be stored by the time a 200 response is returned. However, this can also be a benefit, as the constraint no longer affects service availability. At scale this is acceptable, since this information is not latency-sensitive. Data loss is possible, though: if your application requires guaranteed storage of search history, a different solution is necessary.
  2. Complex test setup: Although configuring and using glue code on AWS is a stable way of building products, testing is no longer a simple pytest run. To address this, we introduced system tests that run against ephemeral replicas of the environment.

About me: I am a developer based in Cologne, working at deepset. I am part of the team building “deepset Cloud,” which is powered by the open source framework Haystack.

Github | LinkedIn
