Using ClickHouse for Financial Charts

How to stop worrying and embrace eventual consistency

Intro

While preparing this article, a book by the late professor Niklaus Wirth came to mind — “Algorithms + Data Structures = Programs”. While written in different times (50 years ago!), I think it still holds today.

The data consumed or produced by programs defines the architecture, operations and life-cycle of the application. It's not enough to know your algorithms; to expand your engineering skillz you should leave the safe waters of conventional data backends (relational databases, plain old files) and embrace innovative and different systems like ClickHouse (or Kafka, Cassandra, S3). All of these have different tradeoffs, but they also open up a world of possibilities for your app.

On the surface, ClickHouse is just a columnar variant of a relational database system, but there's more than meets the eye. From its very roots at Yandex, the defining property of ClickHouse has been practicality: its features have been driven by the needs of real companies and systems. It also helped that another Internet behemoth, Cloudflare, started using, improving and blogging about ClickHouse soon after it became freely available.

While it started as proprietary software, ClickHouse was eventually released to the world under the permissive Apache license. I can't emphasize this enough. If you can help it, keep your data in open source databases and open formats; no hostages allowed!

If you have the time, read the ClickHouse intro in our HAProxy series.

ClickHouse — TL;DR

Use ClickHouse …

  • if you have an analytical, aggregation-heavy data set (OLAP, pivot tables)
  • if you don’t care about persisting individual rows one by one, but do care about processing all the rows
  • if a risk of having some data duplicates is acceptable
  • if your queries process a large number of rows, but a relatively small number of columns
  • if you want to process and save vast amounts of data (petabytes!)
  • if your data source is a continuous stream of records/messages.

Don’t use ClickHouse …

  • if you can describe your data as “transactions” (OLTP)
  • if you mutate a lot of data in place (SQL DELETE/UPDATE; see the sketch after this list)
  • if you can’t live without referential integrity and normalized data
  • if you need database indices to optimize access to individual rows (or a small set of resulting rows).
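
To make the DELETE/UPDATE point concrete: in ClickHouse, row-level deletes and updates are heavyweight, asynchronous "mutations" that rewrite whole data parts in the background rather than cheap in-place changes. A minimal sketch (the trades table here is hypothetical):

-- Mutations are scheduled asynchronously and rewrite entire parts;
-- fine for an occasional cleanup, painful as your core workload.
ALTER TABLE trades DELETE WHERE symbol = 'EURUSD';
ALTER TABLE trades UPDATE close = 1.0702 WHERE timestamp = toDateTime('2023-01-01 17:04:00');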

Intro Quest — Charts

Talk is cheap, so let’s get our hands dirty (cheaply, too) with sample data. Imagine we’re exploring a trading system (forex, shares, crypto, etc.). Before committing ourselves, we first want to learn from history and see if we can extrapolate future trends (strictly speaking, past performance does not predict the future, but we have a vivid imagination around here).

First, let’s download some historical data for our playground. We’ll download forex data for EURUSD, 1-minute signals for the whole year 2023. In real life your signal would be more frequent (you’d collect multiple events per second), but this will suffice for now.

2023.01.01,17:04,1.069700,1.069740,1.069700,1.069700,0
2023.01.01,17:05,1.069730,1.069780,1.069700,1.069710,0
2023.01.01,17:06,1.069660,1.069660,1.069660,1.069660,0
2023.01.01,17:08,1.069700,1.069740,1.069700,1.069740,0
2023.01.01,17:10,1.069750,1.069800,1.069720,1.069720,0
...
2023.12.29,16:58,1.103660,1.103660,1.103610,1.103610,0

The general format of such data is a timestamp, followed by the open, high, low and close values for the observed interval (plus a trailing volume column, all zeros in this data set). These are then used to construct various variants of candlestick charts.

Candles

Single candle

Let’s get to it: given 1-minute candles, make us one-day candles:

❱ clickhouse local -q "SELECT count() FROM file(DAT_MT_EURUSD_M1_2023.csv)"
322638

❱ clickhouse local -q "DESCRIBE file(DAT_MT_EURUSD_M1_2023.csv)"
c1 Nullable(Date)
c2 Nullable(String)
c3 Nullable(Float64)
c4 Nullable(Float64)
c5 Nullable(Float64)
c6 Nullable(Float64)
c7 Nullable(Int64)
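
As an aside, if the auto-inferred c1 … c7 names bother you, file() also accepts an explicit format and structure argument, so the columns get proper names and types up front. A quick sketch (the column names are our own guess at the MetaTrader export layout; we’ll keep using the inferred names below):

❱ clickhouse local -q "SELECT count() FROM file('DAT_MT_EURUSD_M1_2023.csv', 'CSV',
    'date Date, time String, open Float64, high Float64, low Float64, close Float64, volume UInt32')"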

❱ clickhouse local --output-format PrettyCompact -q \
"SELECT
toDate(timestamp) AS date,
argMin(open, timestamp) AS open,
max(high) AS max,
min(low) AS min,
argMax(close, timestamp) AS close
FROM (
SELECT
toDateTime(toString(c1) || 'T' || c2 || ':00') AS timestamp
c3 AS open,
c4 AS high,
c5 AS low,
c6 as close
FROM file(DAT_MT_EURUSD_M1_2023.csv)
)
GROUP BY date
ORDER BY date
LIMIT 5
"
┌───────date─┬────open─┬─────max─┬─────min─┬──────────────close─┐
│ 2023-01-01 │  1.0697 │ 1.07087 │ 1.06788 │            1.06993 │
│ 2023-01-02 │ 1.06994 │ 1.06997 │ 1.06499 │ 1.0667200000000001 │
│ 2023-01-03 │ 1.06673 │ 1.06706 │ 1.05194 │            1.05727 │
│ 2023-01-04 │ 1.05726 │ 1.06354 │ 1.05594 │             1.0604 │
│ 2023-01-05 │ 1.06041 │ 1.06288 │ 1.05148 │            1.05243 │
└────────────┴─────────┴─────────┴─────────┴────────────────────┘

This is almost too cute to be true: our first magic trick. ClickHouse lets us query a CSV file without importing the data first. The barrier to entry is very low, and even if clickhouse-local is your only takeaway from this article, I sincerely hope it becomes a valuable little tool for your daily use.

Also note the use of the argMin/argMax functions in the example above. ClickHouse has a rich selection of aggregate functions, so you don’t need to hand-craft complex aggregation/window function combinations yourself.
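
If argMin/argMax are new to you, here is a tiny self-contained sketch you can paste into clickhouse local (the inline values are made up): for each symbol it picks the price at the earliest and at the latest timestamp.

SELECT
    symbol,
    argMin(price, ts) AS first_price, -- price at the smallest ts
    argMax(price, ts) AS last_price   -- price at the largest ts
FROM values(
    'symbol String, ts DateTime, price Float64',
    ('EURUSD', '2023-01-01 17:04:00', 1.0697),
    ('EURUSD', '2023-01-01 17:05:00', 1.0673),
    ('EURUSD', '2023-01-01 17:06:00', 1.0666))
GROUP BY symbol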

Aggregations

And now for the next magical trick. We’ve seen how to summarize minute data with a query, so let’s work on that idea further and make the machine work for us in summarizing candles for other intervals:

CREATE TABLE forex (
    symbol String,
    timestamp DateTime('UTC'),
    open Float64,
    high Float64,
    low Float64,
    close Float64
)
ENGINE = Null

The `Null` table engine is a black hole where data enters but never leaves. There are no primary keys, ordering or similar; it simply defines a sieve through which our data can be filtered.

clickhouse local --output-format CSVWithNames -q \
"SELECT
'EURUSD' AS symbol,
toDateTime(toString(c1) || 'T' || c2 || ':00') AS timestamp,
c3 AS open,
c4 AS high,
c5 AS low,
c6 as close
FROM file(DAT_MT_EURUSD_M1_2023.csv)" > eurusd_2023.csv

clickhouse client -q 'INSERT INTO forex FORMAT CSVWithNames' < eurusd_2023.csv

clickhouse client -q 'SELECT count() FROM forex FORMAT PrettyCompact'
┌─count()─┐
│       0 │
└─────────┘

We’ve slightly adapted the data to include a symbol column and to format the timestamp. However, the data is still not actually saved; this is just a framework for the next step:

CREATE TABLE forex_1h_agg
(
    interval DateTime,
    symbol LowCardinality(String),
    open AggregateFunction(argMin, Float64, DateTime),
    open_timestamp SimpleAggregateFunction(min, DateTime),
    close AggregateFunction(argMax, Float64, DateTime),
    close_timestamp SimpleAggregateFunction(max, DateTime),
    high SimpleAggregateFunction(max, Float64),
    low SimpleAggregateFunction(min, Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (interval, symbol)
PARTITION BY toYear(interval)
SETTINGS index_granularity = 8192;

CREATE MATERIALIZED VIEW forex_1h_agg_mv
TO forex_1h_agg
AS
SELECT
    toStartOfInterval(timestamp, toIntervalHour(1)) AS interval,
    symbol,
    argMinState(open, timestamp) AS open,
    min(timestamp) AS open_timestamp,
    argMaxState(close, timestamp) AS close,
    max(timestamp) AS close_timestamp,
    greatest(max(t.open), max(t.close)) AS high,
    least(min(t.open), min(t.close)) AS low
FROM forex t
GROUP BY
    symbol,
    interval
ORDER BY
    symbol ASC,
    interval ASC;

-- Run the above INSERT INTO forex FORMAT CSVWithNames command
-- and check out the results

SELECT count() FROM forex_1h_agg;
┌─count()─┐
│    5409 │
└─────────┘

SELECT * FROM forex_1h_agg ORDER BY interval LIMIT 5;
┌────────────interval─┬─symbol─┬─open──────┬──────open_timestamp─┬─close────┬─────close_timestamp─┬────high─┬─────low─┐
│ 2023-01-01 18:00:00 │ EURUSD │ �H}�c │ 2023-01-01 18:04:00 │ ��c │ 2023-01-01 18:59:00 │ 1.07066 │ 1.06788 │
│ 2023-01-01 19:00:00 │ EURUSD │ �
�� ʱc │ 2023-01-01 19:00:00 │ W[���c │ 2023-01-01 19:59:00 │ 1.07047 │ 1.06829 │
│ 2023-01-01 20:00:00 │ EURUSD │ ;��0رc │ 2023-01-01 20:00:00 │ �[ �� │ 2023-01-01 20:59:00 │ 1.07041 │ 1.06924 │
│ 2023-01-01 21:00:00 │ EURUSD │ Ut<f �@� │ 2023-01-01 21:00:00 │ �����c │ 2023-01-01 21:59:00 │ 1.07047 │ 1.06962 │
│ 2023-01-01 22:00:00 │ EURUSD │ ����P�c │ 2023-01-01 22:00:00 │ {���$c │ 2023-01-01 22:59:00 │ 1.06977 │ 1.06871 │
└─────────────────────┴────────┴───────────┴─────────────────────┴──────────┴─────────────────────┴─────────┴─────────┘

-- What's up with that output, binary? Let's fix it:

SELECT
    interval,
    symbol,
    argMinMerge(open) AS open,
    max(high) AS high,
    min(low) AS low,
    argMaxMerge(close) AS close
FROM forex_1h_agg
GROUP BY
    interval,
    symbol
ORDER BY interval ASC
LIMIT 5

Query id: 5f5be484-e89f-426b-ac4e-6fb4d36cb9c3

┌────────────interval─┬─symbol─┬────open─┬────high─┬─────low─┬───close─┐
│ 2023-01-01 18:00:00 │ EURUSD │  1.0697 │ 1.07066 │ 1.06788 │ 1.06929 │
│ 2023-01-01 19:00:00 │ EURUSD │ 1.06896 │ 1.07047 │ 1.06829 │ 1.07005 │
│ 2023-01-01 20:00:00 │ EURUSD │ 1.07007 │ 1.07041 │ 1.06924 │  1.0704 │
│ 2023-01-01 21:00:00 │ EURUSD │ 1.07041 │ 1.07047 │ 1.06962 │ 1.06977 │
│ 2023-01-01 22:00:00 │ EURUSD │ 1.06977 │ 1.06977 │ 1.06871 │ 1.06904 │
└─────────────────────┴────────┴─────────┴─────────┴─────────┴─────────┘

This is a lot to unpack; have some coffee now.

First, we’ve created a table to hold our aggregated data — forex_1h_agg, and then used a few interesting ClickHouse features to sculpt the data:

  • We’ve used LowCardinality to encode the symbol column. Usually, it’s a three-letter (or longer) string column. ClickHouse doesn’t really have the usual database indices (b-tree, hash, etc.), only sparse primary and data-skipping indices. Since we know that the set of symbols is limited in size (at most a few thousand symbols), we can instruct ClickHouse to encode the data in a more efficient dictionary format (ClickHouse even has enums, but for string data LowCardinality is better).
  • The next revelation is the AggregateFunction column type. This is the real deal that distinguishes ClickHouse from other databases.
    open AggregateFunction(argMin, Float64, DateTime)
    Here we’ve instructed ClickHouse that our open column is of the Float64 type, but the data is not final (i.e. the column will keep being updated). In addition to that, we’ve noted that the data uses the argMin(Float64, DateTime) aggregation (i.e. the open value depends on another column of the DateTime type).
    With this, ClickHouse is able to deduce when to update the column depending on the value of the (open, timestamp) pair: for a given time interval, we always want the “oldest” value to become the open value, whenever the data is inserted into ClickHouse.
  • For easier debugging, we’ve saved the timestamp of the open value ourselves in the open_timestamp column. The aggregation used here is “simple”, a special case of AggregateFunction, so ClickHouse can persist the data in a more convenient format (the storage type of a SimpleAggregateFunction(min, DateTime) column is equivalent to a plain DateTime).
  • The resulting data set for hourly candles has shrunk to about 5 thousand rows. As we insert the data, ClickHouse filters and aggregates it on the fly. The insert batch is first aggregated by the materialized view query, and then aggregated again when merging with the data on disk, hence the AggregatingMergeTree table engine. If an insert batch contained 100k rows with nanosecond timestamps, our MV query would aggregate them away into hourly chunks before hitting the forex_1h_agg backing table (and the disk).
  • Our forex_1h_agg_mv materialized view is triggered on every insert into the forex table. That allows us to use the Null table engine and avoid saving unnecessary data (we only care about OHLC).
  • We’ve also observed that the AggregateFunction data is stored in a binary format (it looks space efficient, since in our case it is just a Float64 + DateTime pair), and that to expose the data we need to add the Merge suffix to unveil the value (argMinMerge/argMaxMerge).
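
A handy companion while debugging those binary states is the finalizeAggregation function, which turns a stored aggregation state into its final value, so you can peek at individual rows without a GROUP BY. Note that it shows whatever partial states happen to be on disk; for correct results across unmerged parts you still want the Merge combinators shown above. A quick sketch:

SELECT
    interval,
    finalizeAggregation(open) AS open,  -- unwraps the argMin state
    finalizeAggregation(close) AS close -- unwraps the argMax state
FROM forex_1h_agg
ORDER BY interval
LIMIT 5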

To recap, you can ingest millions of data points into ClickHouse, reduce the data set on the fly, and persist the aggregations. The fine folks at TimescaleDB call this continuous aggregates, which beautifully describes the concept.
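
One nice property of storing aggregation states is that they can be re-merged at any coarser granularity. For example, here is a sketch of daily candles derived straight from the hourly aggregate table, without touching the raw minute data again:

SELECT
    toStartOfDay(interval) AS day,
    symbol,
    argMinMerge(open) AS open,
    max(high) AS high,
    min(low) AS low,
    argMaxMerge(close) AS close
FROM forex_1h_agg
GROUP BY day, symbol
ORDER BY day ASC
LIMIT 5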

If it is not already obvious, this means that ClickHouse can also be used as a time series database, where you can keep all your analytical data in one place and query it efficiently.

Charts

Last, but not least, before we fall asleep, let’s fire up our candles!
If you have used a time series database before, you’ve probably used Grafana, a wonderful visualization and alerting tool. It has good support for ClickHouse as a data source, and it also has candlestick charts out of the box.

Our data table is slightly too complicated for the Grafana query builder, but we can dive in with a custom query.

SELECT
    $__timeInterval(interval) AS timestamp,
    argMinMerge(open) AS open,
    toFloat64(max(high)) AS high,
    toFloat64(min(low)) AS low,
    argMaxMerge(close) AS close
FROM "default"."forex_1h_agg"
WHERE ( timestamp >= $__fromTime AND timestamp <= $__toTime )
GROUP BY
    timestamp,
    symbol
ORDER BY timestamp ASC
LIMIT 1000

The dollar variables are for templating, so you can select the interval in the Grafana dashboard. For some reason, Grafana could not detect our max(high) and min(low) columns, so we had to typecast them manually.

The end result is surprisingly good:

2023-01-02T18:00:00 — 2023-01-03T18:00:00

Grafana can also alert on various conditions, so it is just perfect for exploring and observing live data.

What’s next?

We’ve just begun our journey with ClickHouse:

  1. Learn about ClickHouse!
  2. Charts! ← you are here
  3. Profit(s)!

Before reaching our goal, we’ll have two more side quests so you can learn how to stitch together a useful real-life application:

  • How to feed the ClickHouse machine with Kafka
  • How to cross the Kafka message stream with Java

Both technologies, Kafka and Java, are enterprise(tm), big data, and would complement anyone's resume.

Ministry of Programming is a supercharged, remote-optimized venture builder specialized in building startups and new products 💡 We were voted one of the top 1000 fastest-growing companies in Europe by the Financial Times.

We offer product management, design, development, and investment services to support entrepreneurs and startups towards product success.

If you want to work with us on your startup feel free to contact us at — https://ministryofprogramming.com/contact/
