Efficient Approximation in BigQuery With HyperLogLog++

Seyhan DEMİRCAN
Google Cloud - Community
5 min readJul 30, 2023
Photo by Gabe on Unsplash

Introduction

In data-related problems, from time to time, we need to count distinct values based on different requirements. For instance, the following examples can be given;

  • Cardinality
  • How many unique users visit use our system for a predetermined time interval?
  • How many unique customers bought a specific product or products within a particular category?
  • Trend Analyses
  • Geographical Segmentation

In most cases, sacrificing some precision over performance is acceptable. We can follow several ways to achieve that, and HyperLog++ is the most optimized one.

Possible Ways

Before we go further, let’s populate some data.

DECLARE i INT64 DEFAULT 0;

create or replace table `test.generated_array` as
select
id,
case when mod(id,4) = 1 then 'TR'
when mod(id,4) = 2 then 'DE'
when mod(id,4) = 3 then 'GB'
else 'US'
end as country,
current_timestamp() as _load_time
from unnest(generate_array(1,1000000)) as id;


REPEAT
insert into test.generated_array select * from test.generated_array;
SET i = i + 1;
UNTIL i >= 7
END REPEAT;

With this script, we created uniformly distributed 128M rows.

Now, everything is ready to explore possible ways and their pros/cons.

COUNT(DISTINCT)

select
COUNT(distinct id) from `test.generated_array`
WHERE country = 'TR'

Pros;

  • The result is precise.

Cons;

  • Performance impact
  • Not iterable; we must recalculate it each time.

APPROX_COUNT_DISTINCT

select
approx_count_distinct(id) from `test.generated_array`
WHERE country = 'TR'

Pros;

  • Performance advantage

Cons;

  • Not iterable; we must recalculate it each time.
  • The result is error-prone.
    Error Margin = |(Real Value — Predicted Value) / Real Value| * 100
    Error Margin = |(250,000–248,348) / 250,000| * 100
    Error Margin ≈ 0.6608

HyperLogLog++

The HyperLogLog++ algorithm (HLL++) estimates cardinality from sketches.

HLL++ is another approximate function but works differently. Instead of producing a result directly, it creates sketches in BigQuery’s BYTE format.

SELECT
HLL_COUNT.INIT(id) AS hll_sketch
FROM `test.generated_array`
WHERE country = 'TR'

This situation gives us several benefits.

  • Low resource consumption
  • Iterable calculation
  • More precise results

To see the result from this sketch;

select HLL_COUNT.EXTRACT(HLL_sketch)as cnt from(
SELECT
HLL_COUNT.INIT(id) AS hll_sketch
FROM `test.generated_array`
WHERE country = 'TR'
)

APPROX_COUNT_DISTINCT VS HLL++

First Run

You may notice that both functions produced the exact value, and their resource usage was almost identical. What is the issue here? Let’s take a closer look at APPROX_COUNT_DISTINCT’s query plan;

The optimizer is already using HLL++ in the background. Should we use HLL++ over approx_count_distinct? The answer is yes.

Precision

We already examined the error margin of approx_count_distinct. HLL++ allows us to work with different precision levels.

select HLL_COUNT.EXTRACT(HLL_sketch)as cnt from(
SELECT
HLL_COUNT.INIT(id,24) AS hll_sketch --with different precision
FROM `test.generated_array`
WHERE country = 'TR'
)

The precision level can be anywhere between 10 and 24.

Error Margin (%) = |(250,000–249,996) / 250,000| * 100
Error Margin (%) ≈ 0.0016

Unleash The Beast

This section will examine the long-term effect of using the sketch over aggregation. This approach will allow us to see the real power of this algorithm. Before moving on, we will store this sketch in a table for subsequent iterations.

create or replace table `test.tmp_generated_array_sketch` as
SELECT
HLL_COUNT.INIT(id,24) AS hll_sketch
FROM `test.generated_array`;

We insert another 1M rows;

insert into test.generated_array 
select
id,
case when mod(id,4) = 1 then 'TR'
when mod(id,4) = 2 then 'DE'
when mod(id,4) = 3 then 'GB'
else 'US'
end as country,
current_timestamp() as _load_time
from unnest(generate_array(2000000,3000000)) as id;

Now, we have already initialized the first run, and we have that sketch in a table. Since then, only another 1M rows have been inserted. Let’s unleash the beast!

select HLL_COUNT.MERGE(HLL_sketch)as cnt from(
SELECT
hll_sketch
FROM `test.tmp_generated_array_sketch` --previous sketch
union all
SELECT
HLL_COUNT.INIT(id,24) AS hll_sketch --new sketch
FROM `test.generated_array`
where _load_time > timestamp_sub(current_timestamp(), interval 15 minute)
);

It took only 2 seconds slot time! This approach remains constant regarding resource consumption, even if we imagine the table growing over time. In other words, as the table size increases, the performance of this method stays almost unchanged. On the other hand, the count method will use more and more resources. The benefits are even more significant when we consider data sizes in the terabytes or petabytes range, as the performance advantage becomes more crucial with more extensive data volumes, and the precision advantage adds extra value.

Side Notes

  • For every query given in this article, the cache result mechanism was disabled.
  • For every HLL++ iteration, updating the sketch table with the new sketch is advised.

--

--