Big data exploratory analysis with ClickHouse
Any analysis starts with an exploratory stage, when we try to understand the big picture of the data itself. During this step we usually do things like getting medians or finding distributions, which Python’s NumPy/Pandas let us accomplish easily. But when we talk about large data volumes (terabytes and more), we can’t simply use tools that rely on RAM. ClickHouse can help here: it has a powerful set of tools to dig into data beyond trivial min/max/avg analysis, so let’s take a look at them.
Test table and data
We’re going to use the following table and the clickhouse.cloud service for tests:
CREATE TABLE test (
`time` DateTime, `a` UInt64, `b` Float64,
`c` UInt32, `d` UInt32, `e` UInt32
) ORDER BY time
Yeah, we can skip ENGINE in recent versions of ClickHouse (it’s 22.10.1.17597 on our test cloud service). Also, we have 1 billion rows of randomly generated values in our table:
INSERT INTO test
SELECT now() - number, rand64()/1000000000,
round(rand32()/1000000)/100,
rand32()/1000000, rand32()/10000, rand32()/100
FROM numbers(1000000000);
Min, Max, Avg, Median
First things first, so let’s see what we have in terms of basic aggregations:
SELECT min(a), max(a), avg(a), median(a) FROM test
Which will give us the following overview of the a column:
Note that the median function is an approximate implementation (for the sake of speed), so consider using medianExact() if you need exact results.
For columns of the same type, we can use the *ForEach combinator to get aggregation results for all listed columns at once:
WITH [a, c, d, e] AS col
SELECT maxForEach(col), minForEach(col),
avgForEach(col), medianForEach(col)
FROM test\G
Note that we had to skip the b column, because it’s a Float, not an Int. This query returns multiple arrays, with values positioned according to the columns listed in the WITH clause:
Weighted average
In some cases we want to understand weighted averages, when we have a value in one column (let it be the a column) and its weight in another (let’s use the c column for that):
SELECT avgWeighted(a, c) FROM test
Now the result is different from the standard average value, because the c column has a different value (weight) for each row:
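Under the hood, a weighted average is just the sum of value·weight products divided by the sum of weights. A minimal Python sketch (with made-up sample numbers) of what avgWeighted(a, c) computes:

```python
# Weighted average = sum(value * weight) / sum(weight) --
# the same quantity avgWeighted(value, weight) computes in ClickHouse.
values = [10, 20, 30]   # sample "a" values (made up for illustration)
weights = [1, 1, 2]     # sample "c" weights

plain_avg = sum(values) / len(values)
weighted_avg = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(plain_avg)     # 20.0
print(weighted_avg)  # (10*1 + 20*1 + 30*2) / (1 + 1 + 2) = 22.5
```

The last value gets pulled up because its weight doubles its contribution.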
Quantiles
Value distribution analysis starts with understanding the most popular quantile values, which can be accomplished using the quantile() function:
SELECT quantile(0.01)(a), quantile(0.05)(a), quantile(0.1)(a),
quantile(0.5)(a), quantile(0.9)(a), quantile(0.95)(a),
quantile(0.99)(a)
FROM test\G
Which gives us the following quantiles for the a column:
The numeric argument of the quantile() function specifies which quantile we need to calculate; values go from 0 to 1, where:
- 0.01 is the 1% percentile (meaning 1% of all column values are less than the calculated quantile value),
- 0.99 is the 99% percentile,
- 0.5 is the 50% percentile (or the median).
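To make the definition concrete, here’s a quick Python sketch using the standard library’s statistics.quantiles on a toy dataset (an exact calculation, unlike ClickHouse’s approximate quantile()):

```python
# For level q, the quantile is the value below which roughly a q fraction
# of the data lies; quantile(q)(col) estimates the same thing in ClickHouse.
import statistics

data = list(range(1, 101))  # toy dataset: 1..100

# n=100 returns the 99 percentile cut points (exclusive method by default)
pct = statistics.quantiles(data, n=100)
p1, p50, p99 = pct[0], pct[49], pct[98]
print(p1, p50, p99)  # ~1 (1%), 50.5 (median), ~100 (99%)
```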
Histogram
ClickHouse has a histogram() function which can calculate the number of values within a given number of bins:
SELECT histogram(10)(a) FROM test\G
Which gives us:
Since the raw output doesn’t look too useful for evaluation, we can use the bar() function to plot it:
WITH count(*) AS m, histogram(10)(a) AS h
SELECT m, arrayJoin(h).3 AS v, bar(v, 0, m / 10, 50)
FROM test
Here, we calculate a histogram for the a column based on 10 bins and plot the results by using the calculated histogram “height” values (v) as bar widths (with 50 symbols as the maximum width of a bar). Now we can see our values are distributed pretty evenly:
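ClickHouse’s histogram() picks bin borders adaptively, but the core idea can be sketched in Python with fixed-width bins and a text bar per bin (made-up uniform data; scaling to the tallest bin is a simplification of the bar() call above):

```python
# Split the value range into 10 equal-width bins, count values per bin,
# and draw each count as a bar (50 chars for the largest bin).
import random

random.seed(42)
data = [random.random() for _ in range(10_000)]  # uniform toy data

bins = 10
lo, hi = min(data), max(data)
width = (hi - lo) / bins

counts = [0] * bins
for x in data:
    i = min(int((x - lo) / width), bins - 1)  # clamp hi into the last bin
    counts[i] += 1

for i, c in enumerate(counts):
    print(f"bin {i}: {c:5d} " + "■" * round(c / max(counts) * 50))
```

Uniform data produces roughly equal bars, just like the ClickHouse plot.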
Skewness
In order to understand how symmetric our data is, we can use skewness evaluation:
SELECT skewPop(a), skewPop(b), skewPop(c), skewPop(d) FROM test
And we can see we have pretty symmetrical datasets (a dataset is considered symmetric if its skewness falls into the -0.5…0.5 range):
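Population skewness is the third central moment normalized by the second central moment raised to the power 3/2. A small Python sketch of the formula behind skewPop(), on toy data:

```python
# skewness = m3 / m2**1.5, where mk is the k-th central moment.
# ~0 means symmetric; positive means a longer right tail.
def skew_pop(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # mirror-symmetric around 3
right_skewed = [1, 1, 1, 2, 10]  # long tail to the right

print(skew_pop(symmetric))     # 0.0
print(skew_pop(right_skewed))  # positive
```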
Unique values
As we know, ClickHouse uses a memory-efficient approximate unique-count algorithm, which is more than OK for most cases during exploratory analysis:
SELECT uniq(a), uniq(b), uniq(c), uniq(d), uniq(e) FROM test\G
Which will give us the following:
Top popular values
Sometimes we need to understand which values are the most popular in our datasets. Though it seems simple with standard SQL, e.g. for the a column:
SELECT a, count(*) AS total FROM test
GROUP BY a ORDER BY total DESC LIMIT 5
… we can actually use the topK() function, which does the same thing in a much more efficient way:
SELECT topK(5)(a) FROM test
This will return the 5 most popular values from the a column but, more importantly, it takes 10x less time to execute:
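The exact analogue of topK() in Python is collections.Counter.most_common, which counts everything and then sorts, the same work the GROUP BY query above does; topK trades a bit of accuracy for speed on huge datasets. A tiny sketch with made-up values:

```python
# Exact top-N: count every value, then take the N largest counts --
# what topK(N)(col) approximates in ClickHouse.
from collections import Counter

data = ["a", "b", "a", "c", "a", "b", "d"]
top2 = Counter(data).most_common(2)
print(top2)  # [('a', 3), ('b', 2)]
```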
Variance
Variance helps us understand the scale of deviation of values in a dataset. We can calculate that in ClickHouse as well:
SELECT varPop(a), varPop(c), varPop(d) FROM test
Which gives us the variance of 3 specified columns:
ClickHouse uses the standard population variance formula here: varPop(x) = Σ(xᵢ − x̄)² / n.
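A quick Python sanity check of the population variance formula against the standard library, on toy data:

```python
# Population variance: sum((x - mean)**2) / n -- the formula behind varPop().
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

n = len(data)
mean = sum(data) / n
var_pop = sum((x - mean) ** 2 for x in data) / n

print(var_pop)                     # 4.0
print(statistics.pvariance(data))  # 4.0 -- same result
```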
Covariance
We can also calculate the covariance of 2 columns:
SELECT covarPop(c,d) FROM test
Which returns:
The population covariance formula is used for calculations here: covarPop(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / n.
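And the same kind of Python sketch for population covariance (toy data; a positive result means the two columns tend to move together):

```python
# Population covariance: sum((x - mx) * (y - my)) / n -- behind covarPop().
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys grow together with xs

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
covar_pop = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

print(covar_pop)  # 2.5
```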
Summary
ClickHouse math and aggregation functions can be used to do initial exploratory data analysis on large datasets. ClickHouse uses optimized approximation algorithms for some calculations (like uniques or quantiles) to help get faster results, which are usually OK during initial analysis.
List of popular functions used for data exploration:
- min(), max(), avg() and median() for the most basic stuff,
- avgWeighted(col, weights) to calculate an average with weights,
- quantile(level)(col) to get quantiles,
- histogram(bins)(col) and bar() to plot histograms,
- skewPop(col) to understand how symmetric data is,
- uniq(col) for approximate (but fast) uniques calculation,
- topK(N)(col) to find the N most popular values,
- varPop(col) to estimate the deviation of values.