Big data exploratory analysis with ClickHouse
Any analysis starts with an exploratory stage, when we try to understand the big picture of the data itself. During this step we usually do things like getting medians or finding distributions, which Python’s NumPy/Pandas let us accomplish easily. But when we talk about large data volumes (terabytes and more), we can’t simply use tools that rely on RAM. ClickHouse can help here: it has a powerful set of tools to dig into data beyond trivial min/max/avg analysis, so let’s take a look at them.
Test table and data
We’re going to use the following table and the clickhouse.cloud service for tests:
CREATE TABLE test (
`time` DateTime, `a` UInt64, `b` Float64,
`c` UInt32, `d` UInt32, `e` UInt32
) ORDER BY time
Yeah, we can skip ENGINE in recent versions of ClickHouse (it’s 22.10.1.17597 on our test cloud service). Also, we have 1 billion rows of randomly generated values in our table:
INSERT INTO test
SELECT now() - number, rand64()/1000000000,
round(rand32()/1000000)/100,
rand32()/1000000, rand32()/10000, rand32()/100
FROM numbers(1000000000);
Min, Max, Avg, Median
First things first, so let’s see what we have in terms of basic aggregations:
SELECT min(a), max(a), avg(a), median(a) FROM test
Which will give us the following overview of the a column:
Note that the median function is an approximate implementation (for the sake of speed), so consider using medianExact() if you need exact results.
For columns of the same type, we can use the *ForEach combinator to get aggregation results for all listed columns at once:
WITH [a, c, d, e] AS col
SELECT maxForEach(col), minForEach(col),
avgForEach(col), medianForEach(col)
FROM test\G
Note that we had to skip the b column, because it’s a Float, not an Int. This query returns multiple arrays, with values positioned according to the columns listed in the WITH clause:
Weighted average
In some cases we want to understand weighted averages, when we have a value in one column (let it be the a column) and its weight in another (let’s use the c column for that):
SELECT avgWeighted(a, c) FROM test
Now the result is different from the standard average value, because the c column has a different value (weight) for each row:
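Under the hood, a weighted average is just the sum of value·weight products divided by the sum of weights. A minimal Python sketch (with made-up sample numbers) of what avgWeighted(a, c) computes:

```python
# Weighted average = sum(value * weight) / sum(weight) --
# the same quantity avgWeighted(value, weight) computes in ClickHouse.
values = [10, 20, 30]   # sample "a" values (made up for illustration)
weights = [1, 1, 2]     # sample "c" weights

plain_avg = sum(values) / len(values)
weighted_avg = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(plain_avg)     # 20.0
print(weighted_avg)  # (10*1 + 20*1 + 30*2) / (1 + 1 + 2) = 22.5
```

The last value gets pulled up because its weight doubles its contribution.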
Quantiles
Value distribution analysis starts with understanding the most popular quantile values, which can be accomplished using the quantile() function:
SELECT quantile(0.01)(a), quantile(0.05)(a), quantile(0.1)(a),
quantile(0.5)(a), quantile(0.9)(a), quantile(0.95)(a),
quantile(0.99)(a)
FROM test\G
Which gives us the following quantiles for the a column:
The numeric argument of the quantile() function specifies which quantile we need to calculate; values go from 0 to 1, where:
- 0.01 is the 1% percentile (meaning 1% of all column values are less than the calculated quantile value),
- 0.99 is the 99% percentile,
- 0.5 is the 50% percentile (or the median).
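To make the definition concrete, here’s a quick Python sketch using the standard library’s statistics.quantiles on a toy dataset (an exact calculation, unlike ClickHouse’s approximate quantile()):

```python
# For level q, the quantile is the value below which roughly a q fraction
# of the data lies; quantile(q)(col) estimates the same thing in ClickHouse.
import statistics

data = list(range(1, 101))  # toy dataset: 1..100

# n=100 returns the 99 percentile cut points (exclusive method by default)
pct = statistics.quantiles(data, n=100)
p1, p50, p99 = pct[0], pct[49], pct[98]
print(p1, p50, p99)  # ~1 (1%), 50.5 (median), ~100 (99%)
```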
Histogram
ClickHouse has a histogram() function which can calculate the number of values within a given number of bins:
SELECT histogram(10)(a) FROM test\G
Which gives us:
Since the raw output doesn’t look too useful for evaluation, we can use the bar() function to plot it:
WITH count(*) AS m, histogram(10)(a) AS h
SELECT m, arrayJoin(h).3 AS v, bar(v, 0, m / 10, 50)
FROM test
Here, we calculate a histogram for the a column based on 10 bins and plot the results by using the calculated histogram “height” values (v) as bar widths (with 50 symbols as the maximum width of a bar). Now we can see our values are distributed pretty evenly:
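ClickHouse’s histogram() picks bin borders adaptively, but the core idea can be sketched in Python with fixed-width bins and a text bar per bin (made-up uniform data; scaling to the tallest bin is a simplification of the bar() call above):

```python
# Split the value range into 10 equal-width bins, count values per bin,
# and draw each count as a bar (50 chars for the largest bin).
import random

random.seed(42)
data = [random.random() for _ in range(10_000)]  # uniform toy data

bins = 10
lo, hi = min(data), max(data)
width = (hi - lo) / bins

counts = [0] * bins
for x in data:
    i = min(int((x - lo) / width), bins - 1)  # clamp hi into the last bin
    counts[i] += 1

for i, c in enumerate(counts):
    print(f"bin {i}: {c:5d} " + "■" * round(c / max(counts) * 50))
```

Uniform data produces roughly equal bars, just like the ClickHouse plot.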
Skewness
In order to understand how symmetric our data is, we can use skewness evaluation:
SELECT skewPop(a), skewPop(b), skewPop(c), skewPop(d) FROM test
And we can see we have pretty symmetrical datasets (a dataset is considered symmetric if its skewness falls into the -0.5…0.5 range):
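Population skewness is the third central moment normalized by the second central moment raised to the power 3/2. A small Python sketch of the formula behind skewPop(), on toy data:

```python
# skewness = m3 / m2**1.5, where mk is the k-th central moment.
# ~0 means symmetric; positive means a longer right tail.
def skew_pop(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # mirror-symmetric around 3
right_skewed = [1, 1, 1, 2, 10]  # long tail to the right

print(skew_pop(symmetric))     # 0.0
print(skew_pop(right_skewed))  # positive
```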
Unique values
As we know, ClickHouse uses a memory-efficient approximate unique-count algorithm, which is more than OK for most cases during exploratory analysis:
SELECT uniq(a), uniq(b), uniq(c), uniq(d), uniq(e) FROM test\G
Which will give us the following:
Top popular values
Sometimes we need to understand which values are the most popular in our datasets. Though it seems simple with standard SQL, e.g. for the a column:
SELECT a, count(*) AS total FROM test
GROUP BY a ORDER BY total DESC LIMIT 5
… we can actually use the topK() function, which does the same thing in a much more efficient way:
SELECT topK(5)(a) FROM test
This will return the 5 most popular values from the a column but, more importantly, it takes 10x less time to execute:
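The exact analogue of topK() in Python is collections.Counter.most_common, which counts everything and then sorts, the same work the GROUP BY query above does; topK trades a bit of accuracy for speed on huge datasets. A tiny sketch with made-up values:

```python
# Exact top-N: count every value, then take the N largest counts --
# what topK(N)(col) approximates in ClickHouse.
from collections import Counter

data = ["a", "b", "a", "c", "a", "b", "d"]
top2 = Counter(data).most_common(2)
print(top2)  # [('a', 3), ('b', 2)]
```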
Variance
Variance helps us understand the scale of deviation of values in a dataset. We can calculate that in ClickHouse as well:
SELECT varPop(a), varPop(c), varPop(d) FROM test
Which gives us the variance of 3 specified columns:
ClickHouse uses the standard population variance formula here: varPop(x) = Σ(xᵢ − x̄)² / n.
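A quick Python sanity check of the population variance formula against the standard library, on toy data:

```python
# Population variance: sum((x - mean)**2) / n -- the formula behind varPop().
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

n = len(data)
mean = sum(data) / n
var_pop = sum((x - mean) ** 2 for x in data) / n

print(var_pop)                     # 4.0
print(statistics.pvariance(data))  # 4.0 -- same result
```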
Covariance
We can also calculate the covariance of 2 columns:
SELECT covarPop(c,d) FROM test
Which returns:
The population covariance formula is used for calculations here: covarPop(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / n.
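And the same kind of Python sketch for population covariance (toy data; a positive result means the two columns tend to move together):

```python
# Population covariance: sum((x - mx) * (y - my)) / n -- behind covarPop().
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys grow together with xs

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
covar_pop = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

print(covar_pop)  # 2.5
```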
Summary
ClickHouse math and aggregation functions can be used to do initial exploratory data analysis on large datasets. ClickHouse uses optimized approximation algorithms for some calculations (like uniques or quantiles) to help get faster results, which are usually OK during initial analysis.
List of popular functions used for data exploration:
- min(), max(), avg() and median() for the most basic stuff,
- avgWeighted(col, weights) to calculate an average with weights,
- quantile(level)(col) to get quantiles,
- histogram(bins)(col) and bar() to plot histograms,
- skewPop(col) to understand how symmetric data is,
- uniq(col) for approximate (but fast) uniques calculation,
- topK(N)(col) to find the N most popular values,
- varPop(col) to estimate the deviation of values.