Comparing Trino, ClickHouse, and Apache Doris: Architectures, Use Cases, and Performance
Datalakes and Data Platforms are going through another cycle of change and evolution. In recent years I have been implementing solutions using Trino and ClickHouse(other than Hive) — but in recent months Apache Doris is showing promise for building ‘tighter’ solutions.
This article explores each of these systems, comparing their architectures, use cases, and performance characteristics to help guide decision-makers in selecting the optimal solution for their needs. We also take a look at Apache Iceberg support since it is fast becoming a key feature for Datalakes.
Trino: Distributed SQL Query Engine
Formerly known as PrestoSQL, Trino is a distributed SQL query engine designed for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Trino’s architecture allows it to query data where it lives, including heterogeneous sources, without requiring data movement. This makes it ideal for organisations that need to analyse data spread across multiple databases and storage systems.
Key Features:
- Federated Queries: Ability to execute queries across various data sources, including NoSQL and RDBMS.
- In-Memory Processing: Enhances query performance by processing data in-memory.
- Scalable and Flexible: Scales horizontally by adding more nodes to the cluster.
Use Cases:
- Real-time analytics across multiple data sources.
- Situations where data needs to remain in its original data store.
- Large-scale data environments where query flexibility is crucial.
Disadvantages:
- Resource Intensive: Trino can be quite resource-intensive, especially in terms of memory usage. Because it holds data in memory during query execution, large queries can require significant amounts of RAM, which might not be ideal for smaller setups or environments with limited resources.
- No Data Storage: Trino acts purely as a query engine and does not manage data storage. This requires an additional layer of data management, as you need to maintain other systems for actual data storage, which can increase complexity and overhead.
- Not Optimised for Updates: Trino is not designed for high-frequency small updates or transactions. It is primarily optimized for read-heavy workloads and batch insertions, making it less suitable for applications that require real-time data updates or transactional processing.
ClickHouse: Columnar Database Management System
ClickHouse is an open-source, column-oriented DBMS primarily designed for online analytical processing (OLAP) queries. With its robust data compression and columnar storage, ClickHouse allows for incredibly fast query execution, making it suitable for real-time query processing on large datasets.
Key Features:
- Columnar Storage: Optimises IO operations for typical OLAP queries.
- Real-Time Query Execution: High performance for both read and write operations.
- Vectorised Query Execution: Enhances performance by executing database operations in batches.
Use Cases:
- High-speed data ingestion and real-time analytics.
- Reporting and business intelligence.
- Large datasets where rapid data retrieval is required.
Disadvantages:
- Limited Transaction Support: ClickHouse does not support full ACID (Atomicity, Consistency, Isolation, Durability) transactions, which can be a critical drawback for applications that require strong consistency and transactional integrity.
- Write Performance: While ClickHouse offers exceptional read performance, its write performance can be relatively slower, especially in scenarios involving frequent updates or deletes. This makes it less ideal for use cases where data is constantly being updated.
- Complexity in Horizontal Scaling: Scaling out ClickHouse involves sharding and replication, which can add complexity to the deployment and management of the database, especially compared to more automatically scalable systems.
Apache Doris: MPP Database Engine
Apache Doris is an MPP (Massively Parallel Processing) database optimised for real-time analytics and serving. It is designed to handle large-scale data analysis “from gigabytes to petabytes.” Doris stands out for its high performance, real-time response, and simplicity of use, particularly in environments requiring rapid response times and high concurrency.
Key Features:
- MPP Architecture: Enables efficient query processing and optimization across multiple nodes.
- Real-Time Updates: Supports rapid data updates and real-time analytics.
- Ease of Use: Simple to deploy and manage, with MySQL-compatible syntax.
Use Cases:
- Real-time business intelligence and data warehousing.
- Environments needing frequent data updates and rapid responses.
- Large-scale data aggregation and analysis tasks.
Disadvantages:
- Limited Community and Ecosystem: As a newer and less widely adopted platform compared to other established data warehouse technologies, Apache Doris has a smaller community and ecosystem. This can affect the availability of third-party tools, integrations, and experienced developers.
- Operational Complexity: Although designed for ease of use, managing and optimising Doris for specific workloads can still involve a steep learning curve and operational complexity, particularly in tuning performance for various scenarios.
- Memory Usage: Similar to Trino, Apache Doris can also be demanding in terms of memory, especially when dealing with large datasets and complex queries. Effective memory management is crucial to prevent performance bottlenecks.
Performance Comparison
When comparing Trino, ClickHouse, and Apache Doris, it’s important to consider the specific needs of your data operations:
- Trino excels in federated queries across heterogeneous data stores and is highly scalable, making it suitable for complex analytical environments.
- ClickHouse offers superior speed for read-intensive scenarios, especially with its advanced data compression and fast data retrieval capabilities, ideal for environments where query latency is critical.
- Apache Doris provides a balance between high performance and ease of use in real-time analytics, supporting high query concurrency and rapid data updates.
Iceberg Support
Apache Iceberg is an open-source table format for huge analytic datasets that provides powerful features like schema evolution, hidden partitioning, and full transaction support. Iceberg is fast becoming a de-facto standard for Datalakes, as it can significantly simplify data operations. and hence an important factor for decision making.
Trino, ClickHouse and Apache Doris provide support for Iceberg to varying degrees, enhancing their capabilities in handling big data workloads. As always, it’s crucial to stay updated with the latest releases and community contributions for each of these projects, as support and features can rapidly evolve. Here’s an overview of how each supports Apache Iceberg:
Trino and Apache Iceberg
Trino offers robust support for Apache Iceberg. Trino’s integration with Iceberg allows users to leverage Iceberg’s features, including:
- Time Travel: Access data snapshots at specific points in time.
- Schema Evolution: Safely evolve table schema without affecting existing data.
- Efficient Query Execution: Optimised execution plans using Iceberg’s table metadata.
Trino continuously enhances its capabilities with Iceberg, keeping up with the latest features to ensure that users can manage and query Iceberg tables efficiently.
ClickHouse and Apache Iceberg
ClickHouse offers integration with Apache Iceberg, primarily focusing on providing a read-only connection to pre-existing Iceberg tables stored on Amazon S3. This allows users to utilise ClickHouse for querying data stored in the Iceberg format, leveraging features like schema evolution, snapshots for data versioning, and automatic partitioning for enhanced query performance.
ClickHouse supports Iceberg v1, with plans to include v2 and capabilities to query specific snapshots in future releases, further enhancing its utility for large-scale data analytics.
For more details, you can read further on the official [ClickHouse documentation](https://clickhouse.com/docs/en/engines/table-engines/integrations/iceberg).
Apache Doris and Apache Iceberg
Apache Doris has recently added support for Apache Iceberg. This integration is part of Doris’s efforts to extend its compatibility with more data formats and improve its capabilities in handling large-scale data analytics workloads. With Iceberg, Doris users can benefit from:
- Scalable Metadata Management: Handling large numbers of partitions and files without performance degradation.
- Atomic Writes: Ensuring data integrity during updates.
- Hidden Partitioning: Simplifying data management without needing to manage partitions manually.
The support for Iceberg in Doris is still evolving, so users should check the latest documentation and community discussions to understand the current capabilities and limitations.
Personal Views
The following are some ‘personal views’. Trino is an easy implementation except when you have to deal with S3 Object Storage at which point you end up haveing to deal with what I call the metastore problem. It is a great fit in environments with multiple Databases such as a mix of Oracle and Hive for example. I frankly do not rate the Kafka and ES integrations as highly though — I think they are checks in the boxes. Yet it is the closest to a DataMesh architecture IMHO.
ClickHouse is great when your analytics requirements are tightly defined and you want to integrate these capabilities into your own product — for logging and metrics use cases. I found it also great for micro-batching Kafka and build intraday systems and ingestion landing zones on top of. Clickhouse is definitely a speed monster but requires careful table planning.
Apache Doris I have been recently looking into and I find it quite balanced and easy to work with — not to mention some great velocity in feature implementation. We plan on putting it to a bigger test in next weeks on the extensive log data we capture on micro-services. The advantage against our current ElasticSearch/Kibana being that it is more storage efficient and easier to integrate into our monitoring dashboards ( https://medium.com/@ApacheDoris/from-elasticsearch-to-apache-doris-upgrading-an-observability-platform-1426251a6f72 )