Reliable apps with (HA)Proxy — Part 5.

Belgian OLAP Cube

Clickhouse is a very interesting open source columnar database, mostly used for online analytical processing (OLAP). It speaks custom native TCP protocol, but also supports HTTP, so you don’t really need any special client or “driver” to talk to it.

There are many great features, too many to mention here, but I’ll mention a few killer ones I’ve used:

  • Replicated and Distributed tables (sharding)
  • Materialized Views (continuous aggregation)
  • Zstandard compression support (columnar databases are great match for compression)

You can read this great blog post about Replicated/Distributed setup. The main takeaways are:

  • You can use a number of servers, grouped in shards (at least two servers in every shard).
  • For each shard you can create Replicated table. Every time you write to a table in the shard, data gets replicated to other tables/servers in the shard group.
  • You can create Distributed table over shards in the cluster, creating scalable and reliable solution (no single point of failure, you can spread reads and writes).
  • When processing the incoming data (ClickHouse likes big data batches, thousands of rows at once), you can process and aggregate them on the fly into Materialized views. It’s really hard to explain how convenient this is, and how much resources it saves. Some analytical queries could take minutes for short time spans of data (hours), where you’d get analytics for free just by using continuous aggregation. You can create hourly/daily/monthly roll-ups of your data easily and enjoy your chocolatey OLAP cube. (If the only thing you have is PostgreSQL, all is not lost, check out the excellent TimescaleDB)

How does HAProxy fit in this story? Very easily, it can help you with data ingestion, monitor and route traffic to the proper ClickHouse server, to balance the data ingestion, and to simulate sticky sessions for your ClickHouse users (or to force them to fresh server every time, so they don’t forget to use Distributed tables, like a real BOFH).

If I wanted to be funny, I’d say there is a great synergy between HAProxy and ClickHouse.

listen clickhouse_http
mode http
bind :8443 ssl crt /path/to/cert.pem alpn h2,http/1.1
option httpchk GET /ping
http-check expect string Ok.
balance roundrobin
server ch01_shard01 10.0.0.1:8443 check ssl verify required ca-file CA_chain.pem
server ch02_shard01 10.0.0.2:8443 check ssl verify required ca-file CA_chain.pem
server ch03_shard02 10.0.0.3:8443 check ssl verify required ca-file CA_chain.pem
server ch04_shard02 10.0.0.4:8443 check ssl verify required ca-file CA_chain.pem
listen clickhouse_tcp
mode tcp
bind :9440 ssl crt /path/to/cert.pem
option httpchk GET /ping
http-check expect string Ok.
balance leastconn
server ch01_shard01 10.0.0.1:9440 check-ssl port 8443 ssl verify required ca-file CA_chain.pem
server ch02_shard01 10.0.0.2:9440 check-ssl port 8443 ssl verify required ca-file CA_chain.pem
server ch03_shard02 10.0.0.3:9440 check-ssl port 8443 ssl verify required ca-file CA_chain.pem
server ch04_shard02 10.0.0.4:9440 check-ssl port 8443 ssl verify required ca-file CA_chain.pem

It is quite unbelievable how effective this simple configuration is. We proxy both, HTTP and TCP traffic to ClickHouse (secured by SSL). Bare TCP can be used by official ClickHouse client like this:

cat ldjson.data | clickhouse client --host 1.2.3.4 --query "INSERT INTO mydb.mytable FORMAT JSONEachRow" --max_insert_block_size=100000

Yep, you can directly insert line delimited json into ClickHouse, life is beautiful, isn’t it? Moreover, as you can see, we use different port and protocol for checks, and we’ve selected leastconn algorithm for balancing (expecting that TCP connections are long lived).

Using HTTP interface is very easy too:

cat ldjson.data | gzip | curl --data-binary "@-" \
-H "Content-Encoding: gzip" \ "https://1.2.3.4:8443?query=INSERT+INTO+mydb.mytable+FORMAT+JSONEachRow"

We can even use client side compression for our own benefit!

What HAProxy provides is automatic routing and sharding, it will pick up one of our four servers and push the data through, ClickHouse will then replicate the data to the other servers in the same shard. HAProxy also helps with daily cluster operations, e.g. you can drain specific servers, shut them down, without impact on regular cluster usage.

There you go. I hope you enjoyed this multi-essay series of "recipes" that explore the ways of bulding reliable applications with (HA)Proxy.If by any chance you missed any of the prior essays, I chronologically outlined them below. Serve yourself:PS. If you have any questions on any part of this series, feel free to share them in the comment section.1. Reliable apps with (HA)Proxy — Intro2. Club sandwich + Double PostgreSQLSolving problems with more layers3. TLS Calzone + Wrap Crispy JWTAuthentication on the Edge4. Docker Cake + Say my nameLog and resolve doubts5. Service Discovery Ice creamThe long and winding road to K8s6. Belgian OLAP Cube — (you’re here now) Turn the tables on failures with shards

--

--