In search of the fastest data processing engine with minimal operational overhead
You have billions of rows, terabytes of data, to process. You can't offer that as an excuse to your clients, though: they want insights from their data, and they want them right now. You have only seconds to answer their queries, so latency is the primary concern. This requirement alone eliminates MapReduce-based systems like Hadoop, since their relatively high latency makes them unsuitable as a backend for real-time analytics. The client requirements end roughly here; the Detavu team, however, as the service provider, has a few more of its own.
The requirements
- Most successful data products nowadays are open source, and we believe the future of data analytics is open source too. So, like many other components used within the team, we wanted to put our stake in an open source product. This criterion alone eliminates some strong contenders like Amazon Redshift, Google BigQuery, Snowflake, IBM DB2, Teradata, Vertica, etc.
- Must support the popular SQL query language and connect well with data exploration and visualization platforms like Apache Superset.
- Should be horizontally scalable (sharding, replication, distributed queries, etc.) with minimal operational overhead. We love to manage our workloads in a distributed environment with Kubernetes, so a natively supported Kubernetes operator/chart would be a great value addition.
- With respect to storage optimization, a columnar structure is preferred, since it saves space compared to a row-oriented layout. We plan not to maintain cold backups of user data; instead, we want multiple online replicas, so storage efficiency is highly appreciated.
- Integration with modern data integration/transformation tools like Airbyte and dbt to support ETL/ELT operations.
- Interoperability with data lakes: natively supported storage layers like S3 and HDFS, and support for popular column-oriented, well-compressed file formats like Apache Parquet and ORC, as well as row-based binary formats like Apache Avro.
- Fault tolerance and reliability.
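The space advantage of columnar storage mentioned above is easy to demonstrate: when values of the same column sit next to each other on disk (and, as in ClickHouse's MergeTree tables, rows are sorted by primary key), a general-purpose compressor finds far more redundancy. A minimal Python sketch, using zlib as a stand-in for a real column codec and a made-up two-column table:

```python
import random
import zlib

random.seed(0)

# Toy table sorted by the low-cardinality "country" column, the way
# a MergeTree table would be sorted by its primary key.
countries = sorted(random.choice(["DE", "FR", "JP", "US"]) for _ in range(10_000))
prices = [str(random.randint(100, 999)) for _ in range(10_000)]

# Row-oriented layout: column values interleaved, row by row.
row_layout = "\n".join(f"{c},{p}" for c, p in zip(countries, prices)).encode()

# Column-oriented layout: each column stored contiguously.
col_layout = ("\n".join(countries) + "\n" + "\n".join(prices)).encode()

row_size = len(zlib.compress(row_layout))
col_size = len(zlib.compress(col_layout))
print(f"row-oriented:    {row_size} bytes compressed")
print(f"column-oriented: {col_size} bytes compressed")
```

The column layout compresses noticeably better because the long runs in the sorted country column are never interrupted by unrelated price values. Real ClickHouse goes much further, with per-column codecs such as LZ4, ZSTD, Delta, and DoubleDelta.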
The chosen one
The above is not an exhaustive list of every criterion we examined, but it should give a good starting point for anyone evaluating from a similar perspective. After evaluating various open source projects like Apache Druid, TiDB, ClickHouse, TimescaleDB, MariaDB ColumnStore, Trino, Apache Pinot, and Presto, I decided to move forward with ClickHouse.
Note: The prior statement is only partially true. Soon after I started working with ClickHouse, I became biased; I was so fascinated that I didn't have the patience to go through the whole checklist for every other option. That's a big warning sign for evaluators: do not get carried away by ClickHouse, and control your emotions!
More about the chosen one
So, what does ClickHouse have to offer against my requirements? ClickHouse, available under the Apache-2.0 license, is a column-oriented database that lets its users generate powerful analytics, using SQL queries, in real time. ClickHouse originated at Yandex (often called the Google of Russia); interested readers can take a look here for its history and a short introduction. It ships as a single portable C++ binary. The only external dependency, ZooKeeper, used for data replication and distributed DDL query execution, has since been replaced by ClickHouse Keeper. ClickHouse Keeper is written in C++ and implements the Raft consensus algorithm, which means the cluster stays operational as long as only a minority of its nodes fail. ClickHouse supports multi-master asynchronous replication and can be deployed across multiple datacenters. That such a complex OLAP database with advanced SQL capabilities has such minimal external dependencies is truly amazing, and it is exactly what keeps the operational overhead low.
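That "minority of nodes" guarantee follows directly from Raft's majority-quorum rule: the ensemble can keep electing a leader and committing writes as long as a strict majority is alive. The arithmetic, as a quick sketch:

```python
def raft_tolerated_failures(ensemble_size: int) -> int:
    # Raft needs a strict majority of nodes alive to elect a leader
    # and commit writes, so it survives the largest possible minority
    # of nodes failing.
    return (ensemble_size - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n}-node ensemble survives {raft_tolerated_failures(n)} failure(s)")
```

This is also why consensus ensembles like ClickHouse Keeper are typically deployed with an odd number of nodes: a 4-node ensemble tolerates no more failures than a 3-node one.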
They claim ClickHouse to be the fastest OLAP database on earth. According to the claim, what makes ClickHouse stand out, besides regular good practices, is their "attention to detail". One example: instead of relying on a generic hash table implementation, they developed 30+ variations of hash tables, each appropriate for a specific kind of GROUP BY query. For a small number of keys, a two-level hash table is slower than a single-level one; but for a large number of keys, a two-level hash table is much faster, since it allows for natural parallelization of the work. Interested readers can read more over here.
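The single-level versus two-level trade-off is easy to sketch: route each key into one of a fixed number of sub-tables (ClickHouse's two-level variant uses 256 buckets), so every bucket covers a disjoint slice of the key space and can be aggregated by a separate thread with no locking; the extra routing work is pure overhead when there are only a few keys. A toy Python version of a two-level GROUP BY count, with deliberately simplified bucket routing:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 256  # ClickHouse's two-level tables use 256 sub-tables

def bucket_of(key):
    # Simplified routing: ClickHouse takes a slice of the hash bits,
    # but any deterministic key -> bucket mapping works for the sketch.
    return hash(key) % NUM_BUCKETS

def two_level_group_by_count(keys, workers=4):
    # Phase 1: scatter keys into disjoint buckets.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for k in keys:
        buckets[bucket_of(k)].append(k)
    # Phase 2: aggregate every bucket independently; this parallelizes
    # naturally because no key can appear in two buckets, so the
    # partial results merge by simple concatenation.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(Counter, buckets)
    result = {}
    for partial in partials:
        result.update(partial)
    return result

print(two_level_group_by_count(["a", "b", "a", "c", "b", "a"]))
```

A single-level table would skip phase 1 entirely, which is exactly why it wins when the key count is small.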
ClickHouse natively supports Amazon S3, Apache Superset, dbt, and bidirectional integration with Apache Kafka, RabbitMQ, and NATS, and offers language clients for Java, Python, and Go. Other integrations, such as Airbyte and Apache Spark, are developed by the community. ClickHouse supports a wide range of input and output formats, including the traditional CSV, TSV, JSON, and XML, as well as well-compressed binary formats like Avro, Parquet, ORC, and Arrow. A comprehensive list of input and output formats is available here.
Storage abstraction is another outstanding addition to ClickHouse; honestly speaking, it was not even on my checklist. It makes a hot-cold architecture possible, where data is relocated based on TTLs and fill rates: frequently accessed or new data lives on high-performance, expensive storage, while old or rarely accessed data moves to inexpensive storage like S3, all behind the same processing engine.
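As an illustration (the table, the volume name 'cold', and the storage policy 'hot_cold' are made-up names; the policy and its volumes have to be declared in the server's storage configuration), a MergeTree table that moves parts older than 30 days to a cold volume could look roughly like:

```sql
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold';
```

Queries are unchanged; ClickHouse moves the parts between volumes in the background.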
In addition, ClickHouse has a rich set of data types; aggregate, table, geo, and various other advanced analytical functions; and powerful data structures like the Dictionary. You can even do some machine learning right inside the database, including regression trained with stochastic gradient descent (SGD).
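For instance (all table and column names here are invented for illustration), training and applying a linear model with ClickHouse's stochasticLinearRegression aggregate function follows a train-state, then evaluate pattern, roughly like:

```sql
-- Train: arguments are learning rate, L2 regularization,
-- mini-batch size, and update method.
CREATE TABLE price_model ENGINE = Memory AS
SELECT stochasticLinearRegressionState(0.01, 0.1, 15, 'SGD')(price, rooms, area) AS state
FROM listings;

-- Predict using the trained state.
WITH (SELECT state FROM price_model) AS model
SELECT evalMLMethod(model, rooms, area) AS predicted_price
FROM listings;
```

Nobody is retiring their ML pipeline over this, but for simple regressions it keeps everything in one system.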
As of this writing, ClickHouse has 25K+ GitHub stars and a thousand-plus contributors, and it is battle-tested in production by companies like Yandex, Uber, eBay, Cloudflare, and Spotify.
What’s next
In upcoming articles, we will evaluate hands-on how fast ClickHouse really is and, in the process, explore various aspects of this feature-rich database, including sharding, replication, clusters, the various table engines, and functions that I have barely touched on here. Obviously, we'll do everything on Kubernetes.
Originally published at https://www.linkedin.com.
