
Evaluating Data Warehouses for Low-latency Analytics

18 min read · Jul 5, 2024

Choosing the right cloud data warehouse can be complex, with many available options. I recently led our team at Agritask through an extensive proof of concept (PoC) comparing major cloud data warehouses. We aimed to understand which platform could best handle our unique requirements of running complex queries over billions of records with high concurrency and low latency at the best price for performance. This process involved thorough testing across multiple systems to identify strengths, weaknesses, and performance benchmarks. Given the depth and breadth of our findings, I’m eager to share these insights to help fellow data engineers navigate the complexities of selecting the right data warehouse for their needs.

I’ll cover:

  1. The use-case we aimed to solve
  2. Technical requirements
  3. Evaluating data warehouses: Snowflake, Timescale, Amazon Redshift, ClickHouse, Google Cloud BigQuery, Apache Druid, Apache Cassandra, Firebolt
  4. PoC results
  5. Vendor comparison breakdown
  6. Challenges and learnings
  7. Advice for fellow engineers

The use-case we aimed to solve

Agritask developed a leading SaaS platform that provides visibility into the predictability and sustainability of crop supply for food and beverage enterprises. Leveraging augmented agronomic data, machine learning, and analytics, Agritask equips enterprises with invaluable insights and cutting-edge tools to optimize supply operations across globally distributed smallholder and corporate farming ecosystems. Our platform offers features such as crop management, cost profiling, yield intelligence, sustainable compliance, and carbon accounting. By directly connecting enterprises to growers and suppliers at the source of their value chain, Agritask plays a crucial role in enhancing company sustainability and supply chain resilience.

We deal with a high volume of semi-structured data, including weather, crop status, soil conditions, pest/disease infestation, field activities, satellite imagery, sensors, and machinery, all essential for our agronomic analysis and risk assessment. The challenge lies in the diverse nature and sheer volume of this data — our system processes tens of millions of records daily, each representing a unique piece of information crucial for decision-making. It then aggregates them for geospatial analyses and presents months and years of accumulated data on country-scale maps to multiple concurrent users with sub-second response times. Moreover, the searches performed in our database need to accommodate the flexible and diverse criteria set by different customers, adding another layer of complexity to our data management needs.

The current architecture, combining tens of horizontally partitioned PostgreSQL databases and a central Snowflake data warehouse, has worked well for over ten years and allows good horizontal scalability to serve a practically unlimited number of customers. However, for a single customer, the horizontal scalability options are limited, and scaling vertically is challenging. Once a specific PostgreSQL database instance grows beyond 1 billion records, scaling it vertically becomes more and more expensive. We want to move the heavy lifting from PostgreSQL to the cloud while still being able to serve our interactive data applications. The use case requires handling huge data volumes with low latency and high concurrency. This is where both PostgreSQL and Snowflake fall short (see more details and explanations below), so we started looking for alternatives or complementary solutions.

We considered different options, like replacing PostgreSQL, replacing Snowflake, or adding a cache on top of Snowflake. One of the solutions we evaluated can replace PostgreSQL, others can serve as a cache, and some can potentially replace Snowflake, so we effectively researched all possible options. We ended up with the cache on top of Snowflake with a future option to replace Snowflake completely. But we still kept PostgreSQL, to avoid complete refactoring of the whole system, just moving the heavy lifting from PostgreSQL to the cloud.

Technical requirements

  • Table layout and SQL compatibility with PostgreSQL: To minimize refactoring of the existing backend components, the plan is to keep the table structure as close to the original PostgreSQL structure as possible. The same SQL queries (with minimal automatic modifications) are supposed to run transparently against a “local” PostgreSQL data storage and a “remote” data warehouse storage (a short sketch follows this list).

This is a pretty hard requirement that eliminated some otherwise perfectly valid options such as Apache Cassandra and affected many of our architectural decisions. Getting deeper into this requires a separate article, as we have to consider the existing codebase, backward compatibility, and customer requirements affecting the backend application logic. In an imaginary scenario of designing a completely new system from scratch, the results of evaluation might differ.

  • Handling unstructured data: The data varies significantly in structure, from simple numeric values to JSON structures. A sensor’s metric can be reported as a simple numeric value, like 17 °C. A list of scanned containers can be reported as an array of IDs. A set of fields of a newly created location object is stored as a JSON structure. Records reported from the field can be associated with materials, machinery, and references to images or documents in the file storage.
  • Dynamic and complex queries: The need for flexible and unpredictable querying due to the diverse nature of data. This requirement also affected the table structure and SQL compatibility requirement.
  • Data volumes: Currently there are 15–20 horizontally scaled PostgreSQL servers with Snowflake as the data warehouse on top of them, managing over 2 billion records. The plan is to scale to over 1 trillion records.
  • High-concurrency: The system needs to support 50–100 concurrent queries per second initially, scaling up significantly to meet future demands without compromising performance.
  • Low-latency: For the data warehouse to be viable, it needs to support sub-second response times for operational queries, as the main goal is to power interactive data apps on top of the data warehouse.
  • Costs: We have a strict annual budget which cannot be exceeded. As part of our testing, we evaluated platforms based on their total cost of ownership.
  • Cloud-hosted solution: We were looking for a managed solution that reduces the maintenance load on our data engineers and DevOps.
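
To make the first two requirements more concrete, here is a minimal sketch with hypothetical table and column names: the same table layout and the same query are expected to work against both the “local” PostgreSQL storage and the “remote” data warehouse, with the JSON column mapped to the warehouse’s native semi-structured type (e.g., VARIANT in Snowflake).

    -- Hypothetical table: one row per reported record, with a JSON payload
    -- holding the semi-structured part (IDs, references, extra fields).
    CREATE TABLE field_records (
        record_id    bigint,
        field_id     bigint,
        record_type  text,
        reported_at  timestamp,
        metric_value double precision,
        payload      jsonb
    );

    -- The same query (with minimal automatic modifications) should run
    -- transparently against PostgreSQL and against the data warehouse.
    SELECT field_id,
           date_trunc('day', reported_at) AS day,
           avg(metric_value)              AS avg_metric
    FROM field_records
    WHERE record_type = 'sensor_reading'
      AND reported_at >= now() - interval '30 days'
    GROUP BY field_id, date_trunc('day', reported_at);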

When planning the evaluation, we gave an additional bonus weight to solutions that have free self-hosted versions, to be possibly used for test and development environments, but none of those passed the other criteria. The cloud-managed variant was a must for production.

Evaluation process

Our evaluation included several popular and emerging cloud data warehouses, some selected by us and others recommended by consultants. As we could not test all possible data warehouses in the world, we focused on those that we thought could work well for our use case. There is a certain bias in our selection as some of the potential candidates (for example, Oracle and MS SQL Server) were not considered by our team at all.

The evaluation process for most platforms consisted of the following steps:

  1. Uploading 100–200M records and running a test suite of SQL queries. This step allowed for the evaluation of ease of use and SQL compatibility.
    For tests to be representative of our use case, we extracted sample SQL queries from our production logs. Each of these example queries included a variety of joins, filtering conditions, and aggregate operations (a simplified example is sketched after this list).
  2. Uploading about 1B records and running the same test suite of SQL queries. This step indicated the potential scalability of the data warehouse with our data and SQL queries. Also, this step indirectly evaluated the ingestion performance.
  3. Running 1–50 concurrent queries from a Java client with a selected SQL query. This step was the last in the evaluation process, and it indicated if the data warehouse could serve our interactive applications.
    Note that not all data warehouses got to this step, so some of those we dropped for other reasons may potentially support very good concurrency.
    Why only 50? Most of the tested data warehouses quickly degraded in performance and few got to even 50 concurrent queries (considering the costs/capacity limitations), so we stopped at 50 and just estimated the costs of running a cluster that proportionally scales up to our required concurrency.
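
For illustration, here is a simplified sketch of the kind of query used in the test suite, combining joins, filter conditions, and aggregation. The table and column names are hypothetical, and the real queries are considerably larger.

    SELECT c.crop_name,
           f.region_id,
           count(*)            AS records,
           avg(r.metric_value) AS avg_value,
           max(r.reported_at)  AS last_reported
    FROM field_records r
    JOIN fields f ON f.field_id = r.field_id
    JOIN crops  c ON c.crop_id  = f.crop_id
    WHERE r.record_type IN ('sensor_reading', 'scouting_report')
      AND r.reported_at >= date '2024-01-01'
      AND f.region_id   IN (17, 23, 42)
    GROUP BY c.crop_name, f.region_id
    ORDER BY records DESC;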

We ended up choosing Firebolt which best met our requirements for ease of use, low latency, and cost-efficiency.

Evaluating cloud data warehouses

PostgreSQL

A word has to be said about PostgreSQL — it is a great open-source database. It is highly performant for very complex SQL queries on a single server, often outperforming many cloud databases, and it is capable of handling billions of data records (with a good data model). It is feature-rich, flexible, stable, and battle-proven. It supports structured, unstructured, and geospatial data with ease.

PostgreSQL hits its limits as soon as the database grows beyond what a single server can handle. It also has well-known limitations imposed by its design that generations of DBAs have learned to cope with (updates, backups, locks, etc.).

Snowflake

Snowflake was the very first option we considered because we already use it for BI and our data team loves it. Unfortunately, it was quickly filtered out because of latency and concurrency limitations. When talking with consultants, we heard about customers with use cases similar to ours who, for the same reason, started using Apache Cassandra, for example, as a cache on top of Snowflake.

Pros:

  • Ease of use: Snowflake is well-liked for its user-friendly interface and is compatible with SQL queries used by our current backend.
  • Easy to manage: As a cloud-native solution, Snowflake reduces the need for extensive database management and maintenance, offering a scalable architecture without the need for physical infrastructure management.
  • Special features: Snowflake supports geospatial data structures and processing, which is important in our use case.
  • Easy migration: Snowflake is already in place with all necessary ETL infrastructure.

Cons:

  • High query latency for UI queries: Even with the most expensive engines, which exceeded our budget, query latency could not be reduced to sub-second levels, which is not suitable for real-time operational demands.
  • Concurrency: Snowflake is not designed for highly concurrent queries, so clustering is required, and that significantly increases costs.
  • Cost efficiency: The current architecture allows running Snowflake on demand, thus significantly reducing costs. To serve interactive data apps, it will have to run 24/7, and that again increases costs.

TimescaleDB

This was the second seemingly ideal candidate because of the ease of migration — it is essentially PostgreSQL on steroids.

TimescaleDB uses so-called “hypertables” to enhance scalability. Hypertables allow for effective data partitioning and query scaling, which is pivotal for managing large datasets dynamically. It is important to explain that hypertable partitions come in two different formats. The default format is the regular PostgreSQL row-based format — it is very similar to standard PostgreSQL table partitioning. The second is a special columnar format that uses an order of magnitude less disk space but has very different performance characteristics and imposes data modification restrictions.
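
As a minimal sketch of the two formats, reusing the hypothetical field_records table from the requirements section: the table is first converted into a hypertable, and older chunks are then compressed into the columnar format.

    -- Convert a regular PostgreSQL table into a hypertable partitioned by time.
    SELECT create_hypertable('field_records', 'reported_at');

    -- Enable the columnar (compressed) format for this hypertable.
    ALTER TABLE field_records SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'field_id'
    );

    -- Automatically compress chunks older than 90 days.
    SELECT add_compression_policy('field_records', INTERVAL '90 days');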

Pros:

  • 100% PostgreSQL compatible — important in our use case because it allowed for the simplest migration and very little refactoring required in our backend.
  • Performance on most queries: TimescaleDB demonstrated strong performance for a variety of queries with time-series data converted to row-based hypertables.
  • Cold start performance: TimescaleDB shows excellent performance for first-time query executions against hypertables. It is a well-known issue in PostgreSQL and many other databases that queries take a long time to execute until the disk cache gets populated, and TimescaleDB solves it almost magically.
  • There are free self-hosted and cloud-managed variants.

Cons:

  • Disk space consumption and cost efficiency: The row-based hypertables with indexes consume slightly more disk space than the original PostgreSQL tables, and the estimated storage for 1B records exceeded even the largest (already over-budget) cloud-managed plan, which became a blocker right away.
  • Mixed data format performance issues: When part of a hypertable is converted to the columnar format for historical data storage, it saves a huge amount of disk space, but it can lead to unpredictable query performance. Queries might run fast on fresh data but slow on historical data or vice versa, making performance tuning challenging.
  • Mixed data format modification restrictions: With the columnar format, it is also difficult to manage historical data modifications, which are sometimes required by the application logic.
  • Scalability: Like PostgreSQL, the database is limited to a single server and can only scale out by adding read replicas to increase read query concurrency. Scalability is therefore limited to whatever vertical scaling of the single master server allows.

Amazon Redshift

Amazon Redshift felt like the most reasonable option at AWS, especially when combined with PostgreSQL hosted at Amazon RDS.

Pros:

  • Serverless architecture: Redshift’s serverless solution alleviates the need for ongoing server management. There is also a provisioned option but it was not evaluated since we decided to prioritize uninterrupted scaling up and down.
  • PostgreSQL compatibility: It is largely compatible with PostgreSQL in data types and query syntax, except for certain corner cases, such as the inability to store more than 64 KB of text in a single value (which PostgreSQL allows in its TEXT fields) or the lack of support for NaN numeric values (a short sketch follows this list).
  • Good data volume scalability: Redshift easily handles huge data volumes and has a good bulk ingestion performance.
  • Easy integration with PostgreSQL hosted at Amazon RDS — automatic ETL and federated queries.
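
To illustrate the TEXT corner case from the list above, here is a hypothetical DDL sketch (the real tables are much wider):

    -- PostgreSQL: TEXT is unbounded, and 'NaN' is accepted in numeric columns.
    CREATE TABLE field_reports (
        report_id bigint PRIMARY KEY,
        payload   text,     -- may exceed 64 KB
        metric    numeric   -- may hold 'NaN'
    );

    -- Redshift: VARCHAR is capped at 65535 bytes, so oversized text has to be
    -- truncated, split, or stored elsewhere, and the NaN values mentioned above
    -- need to be replaced (e.g., with NULL) during ETL.
    CREATE TABLE field_reports (
        report_id BIGINT,
        payload   VARCHAR(65535),
        metric    DECIMAL(18, 6)
    );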

Cons:

  • Mixed query performance results: Redshift performed better than PostgreSQL on many (though not all) queries, but not as well as the other evaluated solutions. It felt like a good BI heavy-lifting solution but less suitable for low-latency queries serving interactive data applications.
  • Latency issues: Redshift Serverless is affected by the cold start problem, much like PostgreSQL, and it often exhibited higher-than-acceptable latencies on the first query execution. Due to the elasticity of Redshift Serverless, this becomes especially problematic if specific queries stop running for some time and the database has to spin up again.
  • Concurrency limitations: Redshift struggles to handle high concurrency levels effectively. The system quickly reached its limit with fewer than 10 concurrent queries. There are configuration parameters that might increase concurrency, but they increase costs proportionally and the maximum permitted values are still not good enough to reach the planned concurrency.
  • Cost and scalability concerns: Redshift Serverless can be very cost-efficient for the occasional processing of huge volumes of data, but in our case, based on the costs of already executed queries, the projected costs in a fully scaled scenario were higher than the planned budget.

ClickHouse

ClickHouse was recommended to us as an extremely scalable columnar database that has proven to handle petabytes of data in production and serve highly concurrent low-latency queries. It has both cloud-managed and free self-hosted variants, which was also attractive.

Pros:

  • Data ingestion speeds: During our tests, we were able to upload hundreds of millions of records in a matter of minutes, which is impressively fast and beneficial for high-throughput environments.
  • High performance on specific queries: In scenarios where ClickHouse could leverage its strengths, such as simple analytical queries without joins, it demonstrated very high performance, handling large volumes of data efficiently.
  • Disk space consumption: ClickHouse uses far less disk space than PostgreSQL — easily two orders of magnitude less. Roughly the same applies to all columnar databases.

Cons:

  • Challenges with complex queries: Despite its strengths in data ingestion and simple queries, ClickHouse struggled significantly with more complex queries, especially those involving joins. On the same (relatively small) data volumes, PostgreSQL executed the same queries 10+ times faster than ClickHouse (see the sketch at the end of this section).
  • Query execution reliability: We observed that ClickHouse could not consistently handle complex SQL queries that Postgres could execute in milliseconds. In some instances, queries that involved hierarchical data structures or multiple joins would not complete, hanging indefinitely.

At the last AWS conference, I talked with ClickHouse representatives, and they said they are working on improving their query planner and JOIN performance.

Because of the blocking issue with complex SQL queries, we did not continue with the concurrency tests.
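
To make the contrast concrete, here is a rough sketch using the same hypothetical tables as earlier (not queries from our actual suite); the first pattern played to ClickHouse’s strengths, the second did not.

    -- Fast in our tests: one wide table, simple filters, aggregation.
    SELECT field_id,
           count(*)          AS records,
           avg(metric_value) AS avg_value
    FROM field_records
    WHERE reported_at >= date '2024-01-01'
    GROUP BY field_id;

    -- Much slower (and sometimes hanging): the same aggregation once a
    -- couple of joins and extra filter conditions are added.
    SELECT c.crop_name,
           count(*)            AS records,
           avg(r.metric_value) AS avg_value
    FROM field_records r
    JOIN fields f ON f.field_id = r.field_id
    JOIN crops  c ON c.crop_id  = f.crop_id
    WHERE r.reported_at >= date '2024-01-01'
      AND r.record_type = 'sensor_reading'
    GROUP BY c.crop_name;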

Google Cloud BigQuery

Even though we were less interested in using a non-AWS-hosted data warehouse, we decided to evaluate Google Cloud BigQuery because we heard good things about its performance.

Pros:

  • Strength in batch processing and BI: BigQuery proved to be highly effective for large-scale batch processing tasks and business intelligence (BI) applications.
  • Data ingestion: One of the standout features of BigQuery is its data ingestion performance. During our tests, we were able to load large volumes of data swiftly and reliably, which is crucial for maintaining up-to-date analytics.
  • Serverless architecture: As a fully managed serverless solution, BigQuery eliminates much of the overhead associated with traditional data warehousing solutions.

Cons:

  • Data type compatibility: Automatic discovery of column data types makes it very complicated (sometimes practically impossible) to ingest data from loosely typed formats like CSV. Switching to the Parquet format helps a lot (see the sketch after this list).
  • Cost concerns: BigQuery can become quite costly, especially as query loads increase. During our PoC, we found that operational costs could quickly escalate, making it a less viable option for keeping within our budget constraints.
  • Latency issues: Being an analytical database engine, BigQuery offers good performance for heavy-lifting queries processing huge volumes of data, but it is less suitable for low-latency queries serving interactive data apps.
  • Integration with existing infrastructure: Our existing infrastructure is primarily AWS-centric, and integrating BigQuery, a Google Cloud product, introduced additional complexity and potential latency due to data having to move across cloud platforms.
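
As a hedged sketch of the Parquet ingestion path mentioned above (bucket, dataset, and table names are hypothetical; check the current BigQuery documentation for the exact LOAD DATA syntax):

    -- Load Parquet files from Cloud Storage; column types come from the
    -- Parquet schema rather than from CSV auto-detection.
    LOAD DATA INTO analytics.field_records
    FROM FILES (
        format = 'PARQUET',
        uris   = ['gs://example-bucket/field_records/*.parquet']
    );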

Apache Druid (Imply)

Initially, we did not plan to evaluate Apache Druid because its self-hosted variant is known to require significant setup and maintenance efforts. But once we learned about its cloud-managed variant marketed by Imply, we decided to give it a try. Eventually, it took a solid second place in our evaluations.

Pros:

  • Good performance with time-series data: Apache Druid performs very well with time-series data, making it suitable for our use cases that involve chronological data analysis. Although the first tests of complex SQL queries resulted in bad timings, by optimizing data formats, indexes, and queries together with their technical support team, eventually we were able to get very good performance results.
  • Good concurrency: It showed very good concurrency results in our tests, outperforming most of the other solutions (note that some of the listed solutions, like ClickHouse, did not get to the concurrency test stage, so we are not comparing with those here).
  • Self-hosted variant: While our target was the cloud-managed solution, there is this option as well.
  • Responsive support: The support from the Imply team was responsive and effective, helping us navigate technical challenges efficiently.
  • Good SQL compatibility: While there were some peculiarities with creating tables and defining column data types, the actual SQL query syntax was sufficiently compatible with PostgreSQL.
  • Advanced companion tools: The cloud-managed solution includes an integrated data visualization dashboard tool (similar to Tableau) which is a great selling point.

Cons:

  • Schema and indexing limitations: The database schema and the way tables are indexed in Apache Druid differ from more traditional SQL databases. This difference can lead to compatibility issues with existing data models and may require adjustments to data ingestion and querying processes.
  • Mixed query performance results: While it is possible to get very good performance results from Apache Druid, in our case it often required fine-tuning and consulting with the support team.

The differences from the “regular” RDBMSs our team is used to working with, such as Snowflake and PostgreSQL, and the amount of work required to fine-tune data structures and queries were perceived by our team as the main issues. Other than that, the performance and scalability of Apache Druid were quite impressive.

Apache Cassandra

Apache Cassandra is The Database for time-series data. So it was one of the first obvious candidates we considered. After a series of consultations with the support team behind the Cassandra cloud-managed solution, we concluded that employing Apache Cassandra would require a complete redesign of our database schema and queries, which we wanted to avoid. Therefore we did not start evaluating Apache Cassandra.

While checking on Apache Cassandra, we noticed ScyllaDB (“Cassandra on steroids”), which looks very promising performance- and resource-wise, though it is still not suitable for our use case.

Firebolt

Firebolt was not on the list of solutions we originally planned to evaluate, but at some point, it was recommended by different people independently. Even though it was quite late in the evaluation process, we decided to add it to the list and were pleasantly surprised by the evaluation results.

Firebolt became the leading solution for our use case, balancing performance, scalability, compatibility, costs, and ease of use.

Considering the similarities between Firebolt and Snowflake in features, SQL compatibility, and operation, we may even consider completely replacing Snowflake with Firebolt in the future (though Firebolt still lacks the geospatial capabilities that Snowflake supports).

Pros:

  • Performance with large datasets: Firebolt demonstrated commendable performance on large volumes of data, executing many of our heavier queries in tens of milliseconds. More importantly, it delivered this performance without requiring us to adapt our data structures, and with minimal or no SQL query modifications.
  • Scalability and flexibility: One of Firebolt’s strengths is its scalability, which allows it to efficiently manage increased data loads without a significant drop in performance. In one of the tests, increasing the data set by 10 times increased the query execution times by 25–30% on average on the same hardware, which is very impressive.
  • Cost-effective at scale: Compared to the other solutions we tested, Firebolt presented a more cost-effective option when scaling up. Firebolt’s architecture allows for elastic engine resizing up and down without service interruption, thus optimizing costs and efficiently handling load peaks.
  • Good concurrency: Firebolt supports a higher level of concurrency than many other databases we evaluated, which is essential for our operational model that requires serving multiple simultaneous queries from the interactive data apps.
  • Responsive support: The Firebolt support team was very helpful and responsive during the onboarding, test data ingestion, and query performance analysis. At some point, we discovered a minor PostgreSQL SQL syntax incompatibility that we were able to adapt to on our side, and they promptly fixed that on their side.

Cons:

  • Streaming not supported: Firebolt performs well with large datasets; however, streaming ingestion is currently not supported. We used micro-batch ingestion to address the need for near-real-time data (a sketch follows this list).
  • AWS only: Firebolt runs only on AWS, which wasn’t a problem for us since our existing infrastructure is primarily AWS-centric.
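
A minimal sketch of the micro-batch approach, assuming new rows land in a staging table first; all names are hypothetical, and the actual staging/ingestion mechanism is Firebolt-specific:

    -- Runs every minute or so: append only the rows newer than the last
    -- successfully loaded watermark.
    INSERT INTO field_records
    SELECT *
    FROM staging_field_records s
    WHERE s.reported_at > (
        SELECT coalesce(max(last_loaded_at), timestamp '1970-01-01')
        FROM ingestion_watermark
    );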

The data stack we went for

The diagram is simplified for the sake of illustrating the critical database components; it leaves out Kafka, geospatial processing, ETL complications, aggregation processing, etc.

Here, Firebolt serves as a fast cache between the application servers and the data warehouse — it allows querying data for visualizations with low latency and high concurrency. And, thanks to its general-purpose design and heavy-lifting capabilities, it can potentially replace Snowflake in the future, thus simplifying our architecture.

PoC results

Each vendor was benchmarked against our existing database, PostgreSQL, using the sample SQL queries extracted from our production logs as described above, each combining a variety of joins, filtering conditions, and aggregate operations.

We also created two datasets for testing: one dataset with 150 million total records, and another dataset with 966 million total records. The following table shows some of the results from benchmarking Firebolt’s performance versus our existing platform.

Firebolt PoC results vs our existing platform, PostgreSQL, for 150 million records:

Firebolt PoC results for 966 million records:

Vendor comparison breakdown

The table below is subjective and represents our team’s experience in our specific use case. In other use cases, the ratings may be distributed differently.

Query performance and latency ratings should also not be taken as a general indication of a specific solution’s performance. For example, the bad query performance of ClickHouse in our case does not mean that ClickHouse is slow — it is super-fast when used correctly. It only means that it is unsuitable for the type of queries we attempted to run with it — many JOINs and multiple filter conditions.

Query performance is the ability of the database to execute heavy-lifting queries (those we selected) in around 100 ms or faster, preferably in tens of milliseconds.

Latency is the ability to respond quickly to very fast queries (network time included, of course). For instance, Snowflake is very good at heavy lifting, but its latency renders it unsuitable for real-time querying.

The same goes, for example, for “SQL compatibility” — only TimescaleDB deserves 5 stars here because it is practically the same PostgreSQL engine. All other solutions can get very close but never to 100%. We are not talking about an objective SQL:20xx standard here; it is a subjective measure of the number of problems we hit while trying to run our queries against a certain database.

There were problems with using non-constant values on the right side of LIKE, with “$” in the column names, with hierarchical queries, with column data types, with NULL values, etc.
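
Two of those patterns, sketched with hypothetical table and column names:

    -- 1. A non-constant value on the right side of LIKE:
    SELECT l.location_id, l.name
    FROM locations l
    JOIN naming_rules r ON r.region_id = l.region_id
    WHERE l.name LIKE r.name_pattern;  -- the pattern comes from another column

    -- 2. A hierarchical query over a parent/child relationship:
    WITH RECURSIVE region_tree AS (
        SELECT region_id, parent_id, name
        FROM regions
        WHERE parent_id IS NULL
        UNION ALL
        SELECT r.region_id, r.parent_id, r.name
        FROM regions r
        JOIN region_tree t ON r.parent_id = t.region_id
    )
    SELECT * FROM region_tree;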

Challenges and learnings

The primary takeaway from our PoC was the importance of aligning warehouse capabilities with our specific data structures and query requirements. Each platform has its strengths, but also limitations that are sometimes only apparent under real-world load conditions. Cost, performance, and scalability have to be carefully balanced, with a clear understanding that the perfect solution might not exist and compromises might be necessary.

As we move forward, Firebolt remains our leading option, thanks to its ability to analyze large volumes of data with low latency and a friendly SQL interface.

Advice for fellow engineers

  • When running your PoC, document every finding, no matter how small. It helps in understanding not just the capabilities but also the limitations of each technology.
  • Define clear requirements, performance characteristics, and ways to test them before engaging in a PoC.
  • Engage with vendor support teams; their responsiveness and expertise can be invaluable — they can help with your initial data ingestion/onboarding and help resolve your query performance issues.
  • Finally, keep an open mind — we ended up choosing both Firebolt and Snowflake for different use cases. Sometimes, the best solution is a combination of technologies tailored to your specific requirements.


Written by Andrew Schetinin

Tech enthusiast, passionate about software development and people using technologies to make the world a better place.
