Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale
Co-authored with Anudeep
At Uptycs, our core product revolves around telemetry data collection from a vast fleet of endpoints across cloud, containerized, and on-premises environments. This data is central to generating insights for risk management, vulnerability detection, compliance auditing, fleet monitoring, and more.
Our data ingestion pipeline feeds a high-volume, high-velocity data lake comprising thousands of tables and tens of thousands of columns — some tables ingest a few megabytes daily, while others see terabytes per day, with the overall data lake reaching exabyte scale. Querying this data efficiently is critical to providing a smooth user experience and real-time insights.
Understanding the Query Landscape
Customers interact with this data primarily through:
- Pre-built Dashboards: Interactive, visually rich, powered by analytical queries.
- Custom Queries via API: Allow customers to query specific telemetry data tailored to their use cases.
- Scheduled Queries: Periodic queries scoped to specific date ranges, often used for compliance and fleet health reports.
Latency Expectations
Queries in our system span a wide spectrum — ranging from milliseconds for simple lookups to several minutes for complex joins and large aggregations. To handle this variability, especially for analytical workloads, we employ a Polling-Based Asynchronous Query Model, where results are fetched once ready, ensuring system responsiveness even for long-running queries.
That said, cache response times are expected to be sub-second. While actual query execution can take time, the cached result delivery should meet 99th percentile latencies under 1 second, and 100th percentile under 10 seconds. The freshness of the cached data is governed by specific business use cases — it varies from feature to feature, and dashboard to dashboard. How we manage cache refresh and invalidation tailored to these differing needs is a topic for another time (and possibly another blog).
Query Orchestration and Result Retrieval
Challenge: Efficient Result Storage for Polling Queries
In our polling model, clients initiate a query, receive a query ID, and periodically poll until the result is ready. This model requires us to:
- Store query results temporarily.
- Ensure fast read performance for quick retrieval.
- Avoid redundant computation by caching results and serving historical queries from the cache.
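As a minimal sketch of what this looks like from the client side (the endpoint paths and field names here are illustrative, not our actual API):

```python
import time
import requests

BASE_URL = "https://api.example.com"  # illustrative endpoint

def run_query(sql: str, poll_interval: float = 1.0, timeout: float = 300.0) -> dict:
    """Submit a query, then poll until the cached result is ready."""
    # Submit the query; the server responds immediately with a query ID.
    resp = requests.post(f"{BASE_URL}/queries", json={"sql": sql})
    resp.raise_for_status()
    query_id = resp.json()["queryId"]

    # Poll until the result lands in the cache (or we give up).
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = requests.get(f"{BASE_URL}/queries/{query_id}/result")
        if result.status_code == 200:
            return result.json()  # served straight from the result cache
        time.sleep(poll_interval)
    raise TimeoutError(f"query {query_id} did not finish within {timeout}s")
```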
Initial Solution: Redis for Temporary Caching
We started with Redis as the caching layer. As an in-memory key-value store, Redis is excellent for low-latency data access. Initially, Redis provided sub-millisecond reads and writes, meeting our needs.
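The first implementation was straightforward: serialize each result set as JSON and store it under its query ID with an expiry. A simplified sketch using redis-py (the host, key names, and TTL are illustrative):

```python
import json
import redis

r = redis.Redis(host="redis.internal", port=6379)  # illustrative host

RESULT_TTL_SECONDS = 3600  # roughly the polling window; actual TTLs varied by feature

def cache_result(query_id: str, rows: list[dict]) -> None:
    # A single SET of a potentially multi-MB (or larger) payload -- exactly the
    # kind of operation that later blocked Redis's single-threaded event loop.
    r.setex(f"query:result:{query_id}", RESULT_TTL_SECONDS, json.dumps(rows))

def get_result(query_id: str) -> list[dict] | None:
    payload = r.get(f"query:result:{query_id}")
    return json.loads(payload) if payload is not None else None
```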
However, as usage scaled, we faced significant issues:
Performance Bottlenecks
- Large Result Sets: Analytical queries returned MBs to GBs of data.
- Redis is single-threaded; when a large payload is written or read, it blocks other operations.
- Result: High latency spikes during peak hours, affecting all Redis consumers, not just query caching.
Cluster Complexity
- Redis supports clustering for horizontal scaling, but:
— Hash slots can still lead to hotspot nodes.
— Large query results landing on a specific node affected other keys on the same node.
- Memory consumption increased due to the in-memory nature of Redis, putting additional pressure on replicas to keep up with replication.
Could It Have Been Avoided?
A separate Redis cluster dedicated to query results could have isolated this workload and reduced interference. However:
- This would increase operational and infrastructure costs.
- It would still not address core issues like hotspot nodes, single-threaded processing, and lack of reliability when handling large, concurrent payloads.
Despite extensive tuning, Redis couldn’t keep up with the increasing volume and payload size of query results. The need for a more scalable, reliable, and OLAP-friendly caching layer became evident.
Re-Evaluating Caching Requirements
We redefined our needs:
- Parallel Request Handling: Distributed architecture, avoiding single-threaded bottlenecks.
- Efficient Large Object Handling: Store and serve large blobs (query results) with high throughput.
- In-Memory or Tiered Storage: For low latency, but not constrained by total RAM size.
- Flexible Expiry (TTL): Auto-expiration of cached results post-polling window.
- Scalable and Observable: Operable at scale with monitoring and minimal DevOps burden.
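Framing these requirements as a small interface made it easier to evaluate backends on equal footing and swap implementations later. A hypothetical sketch of such an abstraction (not our production code):

```python
from typing import Optional, Protocol

class ResultCache(Protocol):
    """Minimal contract any caching backend must satisfy."""

    def put(self, query_id: str, payload: bytes, ttl_seconds: int) -> None:
        """Store a serialized result set with an expiry."""
        ...

    def get(self, query_id: str) -> Optional[bytes]:
        """Return the cached result, or None if missing or expired."""
        ...

    def delete(self, query_id: str) -> None:
        """Invalidate a cached result explicitly."""
        ...
```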
Why Not Traditional Databases?
RDBMS: Too Rigid for Dynamic Analytical Workloads
Traditional relational databases were quickly ruled out due to schema rigidity and operational overhead. Our query results are stored as JSON objects, and modeling them in a relational format would have required dynamically creating schemas per query, leading to millions of tables per day — a DevOps nightmare in terms of management, maintenance, and scalability.
Additionally, relational databases are optimized for transactional (OLTP) workloads, not for high-throughput, analytical (OLAP) access patterns involving large, semi-structured datasets. The mismatch in data structure, scale, and access patterns made RDBMS an impractical choice for our use case.
MongoDB: Close, But Not Quite
MongoDB initially appeared to be a natural fit for caching query results due to its native support for JSON document storage. However, upon deeper evaluation, several limitations made it unsuitable for our specific needs:
Disk-Based Latency
- MongoDB’s disk-based architecture introduces latencies that are highly dependent on disk I/O performance.
- For our low-latency requirements, this variability was unacceptable, especially when compared to in-memory or memory-first systems.
OLAP Workloads: A Poor Fit
Moreover, the nature of query result retrieval in our system goes far beyond simple “fetch-and-show” use cases. Many dashboards and downstream systems perform complex slice-and-dice operations, aggregations, and analytical queries on these result sets — essentially treating them as mini ETL pipelines.
Traditional key-value stores, including MongoDB, are not optimized for OLAP-style workloads and struggle to meet the performance and flexibility demands of this pattern.
Lack of Integration with Analytical Engines
- During this phase, we were also exploring storage backends compatible with Trino and DuckDB to further optimize analytical workflows.
- MongoDB lacks native integration with DuckDB, effectively disqualifying it from our future-facing architecture that aimed for flexibility and performance.
Coupled Storage and Compute: A Scalability Bottleneck
- In MongoDB (as with traditional RDBMS), storage and compute are tightly coupled, meaning:
— Scaling compute to handle query load also requires scaling storage, even if not needed.
— This makes cost-effective scaling difficult and introduces operational inefficiencies.
- Our aspiration was always to separate storage and compute, allowing:
— Independent scaling based on real demand.
— The freedom to switch compute engines as needed (e.g., Trino for large workloads, DuckDB for small/medium workloads), without being locked into a monolithic stack.
Conclusion:
While MongoDB checked some boxes with JSON storage and query capabilities, its disk-based latency, lack of DuckDB compatibility, OLAP limitations, and coupled architecture made it a poor fit for our high-performance, scalable caching needs.
HDFS / Object Storage (S3): Durable, But Not Designed for Speed
HDFS and S3 were obvious candidates from a durability and scalability standpoint. However, latency and cost concerns quickly surfaced, especially for use cases involving frequent reads of large query result sets. Additionally, these storage layers lack native caching capabilities, meaning any caching behavior would have required custom engineering effort, adding operational complexity.
Enter Alluxio: Designed for Distributed Data Caching
Alluxio stood out — a high-performance, distributed caching system with:
- Tiered Storage: Memory (RAM disk), SSD, HDD — configurable per node.
- S3 API Compatibility: Allows apps expecting S3 (Trino, DuckDB, internal APIs) to access Alluxio transparently.
- Parallel Read/Write: Fully distributed, avoids Redis-like bottlenecks.
- Built-in TTL and Eviction Policies: Handles cache lifecycles without external intervention.
- Minimal DevOps Overhead: Simple cluster management, horizontal scaling.
We realized that Alluxio could do more than just query result caching — it could evolve into a unified caching layer for analytics workloads.
Phase One: Breaking the Mould — Replacing Redis with Alluxio
We deployed Alluxio in RAM disk mode — each node cached query results locally, without persistence to a UFS (under file system).
Advantages
- In-memory speed for both writes (query results) and reads (client polling).
- No replication overhead; results live only on the node that processed the query.
- Fast failure recovery: If an Alluxio node or its RAM disk is reset, we re-run the query (rare, acceptable tradeoff).
- TTL support: Cached results expire based on query polling window and data staleness requirements of dashboards (hours to days).
Results
- Latency reduced across the board; no blocking due to large payloads.
- Redis cluster stabilized: No more CPU/memory spikes from query result loads.
- Improved developer agility: Simple Alluxio APIs (S3-compatible) made integration easy.
Alluxio’s support for Time-To-Live (TTL) on files/folders allowed us to manage cache lifecycles effortlessly, mirroring Redis’ eviction capabilities. Integration was simple — just an additional implementation for our caching interface.
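To illustrate how thin that implementation can be, here is a simplified sketch using boto3 pointed at the Alluxio S3 proxy. The proxy address, bucket name, and credentials are assumptions for the example, and in our setup expiry is handled by Alluxio's TTL policies rather than by this client code:

```python
import boto3
from botocore.config import Config

# A standard S3 client, pointed at the Alluxio proxy instead of AWS S3.
s3 = boto3.client(
    "s3",
    endpoint_url="http://alluxio-proxy:39999/api/v1/s3",  # assumed proxy address
    aws_access_key_id="placeholder",       # what the proxy expects depends on
    aws_secret_access_key="placeholder",   # your Alluxio auth configuration
    region_name="us-east-1",               # placeholder; not meaningful to the proxy
    config=Config(s3={"addressing_style": "path"}),
)

BUCKET = "query-results"  # illustrative bucket (a top-level Alluxio directory)

def put(query_id: str, payload: bytes) -> None:
    """Store a serialized result set; TTL/eviction is managed by Alluxio."""
    s3.put_object(Bucket=BUCKET, Key=f"{query_id}.json", Body=payload)

def get(query_id: str) -> bytes | None:
    """Fetch a cached result, or None if it was never written or has expired."""
    try:
        return s3.get_object(Bucket=BUCKET, Key=f"{query_id}.json")["Body"].read()
    except s3.exceptions.NoSuchKey:
        return None
```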
Benchmark Report: Alluxio vs Redis Query Response Time Comparison (300 Requests/sec)
We conducted a concurrency benchmark to compare Alluxio and Redis in terms of query result transfer times. Given that Redis was already struggling with larger result sets, this test focused on smaller result sets to purely measure whether Alluxio could match Redis in performance under load. The results showed Alluxio performing on par with Redis, even at 300 requests per second.
In a separate test with larger result sets, Redis began to falter — introducing latency spikes and stability issues — while Alluxio handled the load smoothly, showcasing its robustness and scalability for heavier analytical workloads.
Key Insights:
- Performance at Par: Alluxio delivers comparable response times to Redis under heavy concurrency (300 requests/sec), proving its efficiency even in high-pressure scenarios.
- Scalability Advantage: As bytes transferred increased, Redis began to show latency spikes and caused performance degradation in other components due to resource contention.
- Alluxio’s Edge: Alluxio’s memory-first architecture ensured consistent performance even with larger payloads, avoiding the issues Redis faced at scale.
Phase Two: Beyond Caching — Powering Analytics with Alluxio
Given Alluxio’s seamless fit in our architecture, we naturally looked to expand its role. One of the key areas was Detection & Response, a flagship Uptycs feature that analyzes telemetry data in real-time to detect threats and trigger alerts. Traditionally, we relied on Trino to run the complex SQL queries powering this feature.
Similar workloads existed across the platform — high volumes of concurrent queries targeting small to medium datasets (ranging from a few MBs to a few hundred GBs). However, we encountered significant challenges:
- Trino’s compute was overkill for such workloads. Its distributed architecture, while great for large-scale analytics, was computationally expensive and inefficient for smaller queries.
- High compute usage led to resource contention. As these queries scaled, we saw queuing, increased latency, and SLA breaches in production.
- To mitigate this, we were forced to provision and manage a dedicated Trino cluster solely for these workloads, ensuring they did not interfere with critical analytical queries from dashboards and ad hoc investigations.
These challenges prompted us to rethink our compute strategy. With Alluxio enabling faster, localized access to query results, we began exploring DuckDB as a lightweight, high-performance engine for these smaller analytical workloads, further optimizing compute efficiency and cost.
How Alluxio Helped
- Alluxio’s S3 proxy exposed cached data to DuckDB, enabling local data access.
- Query results and raw data fetched from Alluxio were processed without touching S3 or Trino.
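In practice this means DuckDB can treat Alluxio like any other S3 endpoint. A rough sketch, assuming cached results are written as Parquet under an illustrative query-results bucket and that the Alluxio proxy is reachable at alluxio-proxy:39999 (the address, settings, and object key are assumptions, not our exact setup):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Route DuckDB's S3 reads to the Alluxio proxy. Assumed address; depending on
# your deployment, the proxy may serve the S3 API under a path such as /api/v1/s3.
con.execute("SET s3_endpoint='alluxio-proxy:39999'")
con.execute("SET s3_url_style='path'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_access_key_id='placeholder'")      # depends on your proxy's
con.execute("SET s3_secret_access_key='placeholder'")  # auth configuration

# Slice and dice a cached result set locally, without touching S3 or Trino.
rows = con.execute("""
    SELECT severity, count(*) AS alerts
    FROM read_parquet('s3://query-results/q_12345.parquet')  -- illustrative key
    GROUP BY severity
    ORDER BY alerts DESC
""").fetchall()
print(rows)
```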
Impact
- DuckDB adoption increased, offloading the alerting workloads from Trino.
- Faster time-to-insight for detection queries, reduced cost per query.
Benchmark Report: Trino vs DuckDB for Small to Medium Analytical Workloads
We ran a benchmark comparing Trino and DuckDB for small to medium analytical workloads — the kind of queries that typically drive alerts and detections within our platform. This test reinforced our decision to route smaller workloads through DuckDB, reserving Trino’s distributed compute power for larger analytical workloads where it truly shines.
| Metric                     | DuckDB                | Trino               |
|----------------------------|-----------------------|---------------------|
| 95th Percentile Query Time | 1 second              | 1 second            |
| Relative CPU Usage         | Baseline (1X)         | 10X Higher          |
| Resource Efficiency        | Highly efficient      | Resource-intensive  |
| Execution Environment      | Single-node, embedded | Distributed cluster |
Key Insights:
- Both DuckDB and Trino delivered similar query latencies for 95% of requests — completing in ~1 second.
- However, DuckDB achieved this performance at a fraction of the resource cost, consuming 10X less CPU compared to Trino.
- Trino’s distributed architecture introduced unnecessary overhead for small to medium workloads, resulting in inefficient CPU utilization.
- DuckDB’s lightweight, in-process execution model proved ideal for workloads where low-latency + low-resource usage is critical.
Phase Three: Fronting S3 — Supercharging Trino with Alluxio
Encouraged by Alluxio’s performance, we took a bigger leap — caching S3 data in Alluxio to make Trino queries faster and cheaper.
Why?
- 300TB+ of data scanned daily from S3/HDFS by Trino (~200k queries/day).
- S3 read costs add up quickly; latency fluctuates with network and S3 load.
Strategy
- Deploy Alluxio as a caching layer for S3, serving Trino queries.
- Leverage Alluxio’s hot data caching: frequently accessed data stays in RAM/SSD, reducing cold S3 fetches.
- Enable write-through caching for data generated by ETL jobs.
Expected Gains
- Reduced query latency due to local reads.
- Lowered S3 costs with fewer redundant reads.
- Improved SLA adherence for customer-facing queries.
We’re currently rolling out this solution and plan to share performance benchmarks soon.
Phase Four: Powering AI-Powered Analytics with Alluxio
Why?
As previously mentioned, we have thousands of tables in our data lake storing telemetry data across dozens of attack surfaces — ranging from on-prem environments to cloud platforms, from bare metal servers to Kubernetes clusters, and from developer machines to build systems, and beyond. With such a vast and diverse dataset, data discovery within this schema has always been a challenge for our users.
We’ve tried to address this in multiple ways: from pre-baked dashboards to schema discovery via natural language interfaces. While these solutions offered some improvements, nothing truly matches the flexibility and power of querying data as if you were chatting with a security expert equipped with ninja-level SQL skills.
Enter: GenAI + DuckDB + Alluxio — A Game-Changing Combo
This is where our latest architecture begins to change the game. We’ve integrated Generative AI-powered natural language interfaces with DuckDB for query execution and Alluxio for high-speed, distributed caching of results.
Now, after generating your complex analytical or data retrieval query using our GenAI-powered Text-to-SQL engine (more on this in an upcoming blog post!), storing and accessing the result sets is no longer a bottleneck. Alluxio efficiently handles the caching of these large result sets, and DuckDB processes them at lightning speed.
Here’s the beauty of it:
Once the data is retrieved and cached, you can interact with it using simple natural language prompts. All the complexities of the raw schema and the noise from massive datasets are already resolved in this structured result set, making it incredibly easy for GenAI to generate new, meaningful SQL queries on the fly. The speed of DuckDB combined with Alluxio’s in-memory caching enables a seamless, interactive experience — you’re now analyzing data at the speed of thought.
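Conceptually, the interactive loop looks something like the sketch below. `text_to_sql` stands in for our GenAI-powered Text-to-SQL engine (a placeholder here — more on the real thing in the upcoming post), and the cached result path and proxy address are illustrative. The key point: every follow-up question runs against the small, already-shaped result set, not the raw data lake.

```python
import duckdb

def text_to_sql(prompt: str, table_schema: str) -> str:
    """Placeholder for the GenAI Text-to-SQL engine; returns a canned query here."""
    return "SELECT * FROM result LIMIT 10"

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='alluxio-proxy:39999'")  # assumed Alluxio S3 endpoint
con.execute("SET s3_url_style='path'")
con.execute("SET s3_use_ssl=false")

# Register the cached result set (illustrative key) as a view once...
con.execute(
    "CREATE VIEW result AS SELECT * FROM read_parquet('s3://query-results/q_12345.parquet')"
)
schema = con.execute("DESCRIBE result").fetchall()

# ...then every natural-language follow-up becomes SQL over that clean, compact view.
while True:
    prompt = input("ask> ")
    sql = text_to_sql(prompt, str(schema))
    print(con.execute(sql).fetchall())
```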
Lessons Learned
- Single-threaded systems (Redis) don’t scale for large object caching.
- Tiered, distributed storage (Alluxio) offers flexibility for diverse caching needs.
- Caching query results is just one use case — Alluxio can be a central data access layer.
- Simple API compatibility (S3) accelerates adoption across engines like Trino, DuckDB, and custom services.
Final Thoughts
Our journey — from Redis, through Alluxio, to caching S3 for Trino — has transformed how we think about data caching, analytics, and cost optimization. Alluxio empowered us to scale query performance, reduce infrastructure strain, and unlock new possibilities in how we process and serve telemetry data.
If you’re navigating similar challenges — caching large datasets, improving analytics performance, or cutting cloud costs — let’s connect! I’d love to hear your journey and share notes.
Wait… there is Phase 5!
The super beast mode, the Super Saiyan God phase…
where GenAI can query the entire data lake through the Phase 4 transformations, choosing the query engine — Trino or DuckDB — as it sees fit. Stay tuned!