HyperLogLog in Google BigQuery

Timothy Carbone · Unsplash Blog · May 15, 2019

Counting and reporting uniques is always a challenge, as it usually requires a full scan of the dataset to count the number of distinct values it contains. On small datasets that's fine, but with larger volumes it quickly becomes a performance and resource issue. We recently ran into that problem when trying to measure the number of unique users reached by Unsplash images.

Photo by Joanna Kosinska on Unsplash

Uniques can’t be aggregated

The uniqueness of a value also depends on the time range considered, and per-range counts can't simply be added together. If you have 2M distinct identifiers per day for a week, it doesn't mean you have 14M distinct identifiers for that week: some of those identifiers appear on several days, making the true weekly count lower than the sum of the daily counts.
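
As a quick illustration (the `weekly_logs` table and its `identifier` and `day` columns are hypothetical), summing seven daily distinct counts overcounts any identifier that shows up on more than one day, while a single distinct count over the whole week does not:

#standardSQL
-- Hypothetical table: one row per photo view, with a `day` date column
-- and a device `identifier`.
SELECT
  -- Sum of per-day distinct counts: counts a repeat identifier once per day
  (SELECT SUM(daily_uniques)
   FROM (
     SELECT day, COUNT(DISTINCT identifier) AS daily_uniques
     FROM `weekly_logs`
     GROUP BY day
   )) AS sum_of_daily_uniques,
  -- True weekly distinct count: counts each identifier exactly once
  (SELECT COUNT(DISTINCT identifier) FROM `weekly_logs`) AS true_weekly_uniques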

In a time-partitioned table, the full scan is a problem because whenever you need the unique count over a different time range, you need to rescan that entire range.

Example for a daily partition:
1 day = 1 table = 1 scan to count uniques
1 week = 7 tables = 7 scans
1 month = 31 tables = 31 scans
etc…

Table scans are both slow and expensive, so we want to avoid them as often as possible.

What is HyperLogLog (HLL)?

I won't describe the algorithm itself; you can find better explanations elsewhere. What's important for us is that it lets you compute a close estimate of the number of unique values in a set. It still requires a full table scan, of course, because every value has to be fed into the algorithm. Note, however, that you don't have to fit the data in memory: you can stream the values through HyperLogLog, keeping only the running uniques count (and the HLL structure) at any time.

Google’s implementation of HLL in BigQuery

In BigQuery’s standard SQL, HLL is available to speed up distinct counts if you’re willing to trade off a little bit of accuracy.
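
As a minimal sketch (reusing the table and column names from the examples later in this post), the simplest entry point is APPROX_COUNT_DISTINCT, BigQuery's HLL-based approximation of COUNT(DISTINCT):

#standardSQL
-- Table and column names follow the examples further down in this post.
-- APPROX_COUNT_DISTINCT trades a little accuracy for a much cheaper
-- computation than an exact COUNT(DISTINCT).
SELECT APPROX_COUNT_DISTINCT(identifier) AS approx_uniques
FROM `daily-logs-20190101`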

Another benefit of using HLL in BigQuery is that a single scan of your daily table is enough, even if you need daily, weekly, monthly, quarterly and yearly uniques. It solves the uniques aggregation issue and saves a huge amount of processed volume.

This works because Google's implementation lets you save and load HyperLogLog sketches. HLL sketches are intermediate results that you can query, save and load. You can count uniques, save the HLL sketch, load it the next day and keep counting uniques for that next day while still taking into account the unique values HLL processed the previous day. If a value is present on both days, it's counted only once, not twice like a simple sum of daily counts would do. You can also merge multiple daily sketches and the behaviour is the same.
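
Here is a small sketch of that idea, with hypothetical table names: each daily table holds the HLL sketch built from that day's identifiers, and merging the two sketches counts an identifier seen on both days only once.

#standardSQL
-- Hypothetical daily tables, each storing the `hll_sketch` built from that
-- day's identifiers. HLL_COUNT.MERGE merges the sketches and returns the
-- deduplicated uniques estimate across both days.
SELECT HLL_COUNT.MERGE(hll_sketch) AS two_day_uniques
FROM (
  SELECT hll_sketch FROM `daily-hll-sketches-20190101`
  UNION ALL
  SELECT hll_sketch FROM `daily-hll-sketches-20190102`
)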

The math is pretty simple. If you need daily, monthly and yearly unique counts, every daily table ends up being scanned three times (once for its daily count, once for its month's count and once for the year's count), so over a year you pay:

cost = 3 * daily volume * 365 * price per volume
cost = 1095 * daily volume * price per volume

But with the HLL approach presented here, a single scan of each daily table is enough, so it only costs you:

cost = daily volume * 365 * price per volume
cost = 365 * daily volume * price per volume

Since the cost of merging HLL sketches is negligible, you're saving about 67% on your query processing costs.

Practical example of how we use BigQuery’s HLL at Unsplash

Coming back to our original problem: we're trying to estimate how many people and devices Unsplash photos reach so we can report it to our contributors. We collect logs of all our photo views, each tied to an anonymous device identifier. Counting how many distinct identifiers appear in those logs tells us how many devices we reach.

Photo view logs are stored in daily partitioned tables in BigQuery. Each day, we count the number of distinct device identifiers with a true count, and we also estimate it with HLL. For that day, we store:

  • The true distinct count
  • The HLL estimate
  • The HLL sketch (which we can encode in Base64, for example)
#standardSQL
-- A single scan of the daily table produces both the exact distinct count
-- and the HLL sketch; the outer query extracts the HLL estimate and keeps
-- the sketch so it can be stored for later merges.
SELECT
  exact_count,
  HLL_COUNT.EXTRACT(hll_sketch) AS hll_estimate,
  hll_sketch
FROM (
  SELECT
    COUNT(DISTINCT identifier) AS exact_count,
    HLL_COUNT.INIT(identifier) AS hll_sketch
  FROM `daily-logs-20190101`
)
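
If you'd rather store the sketch as text (as mentioned in the list above), a small variant of the same query, on the same hypothetical table, encodes it with TO_BASE64:

#standardSQL
-- HLL sketches are BYTES; TO_BASE64 turns them into a string that is easy
-- to store, and FROM_BASE64 decodes them again before merging.
SELECT
  COUNT(DISTINCT identifier) AS exact_count,
  TO_BASE64(HLL_COUNT.INIT(identifier)) AS hll_sketch_base64
FROM `daily-logs-20190101`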

The true distinct count lets us chart the daily evolution of our reach. Comparing the HLL estimate to the true count tells us how precise the estimate is. We can measure that precision with something like:

precision = 1 - (|exact_count - hll_estimate| / exact_count)
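
In SQL, over a hypothetical `daily-hll-results` table holding the three stored values per day, that check could look like:

#standardSQL
-- `daily-hll-results` is a hypothetical table with one row per day:
-- (day, exact_count, hll_estimate, hll_sketch).
SELECT
  day,
  1 - SAFE_DIVIDE(ABS(exact_count - hll_estimate), exact_count) AS hll_precision
FROM `daily-hll-results`
ORDER BY day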

To chart the monthly evolution, we leverage the daily HLL sketches we stored. At the end of the month, we merge the 31 daily sketches and get the monthly estimate. We store both the estimate and the merged sketch.

#standardSQL
-- Merge the daily sketches into a single monthly sketch, then extract the
-- monthly uniques estimate from it. The merged sketch is kept as well so it
-- can be stored and merged again later (quarterly, yearly).
SELECT
  HLL_COUNT.EXTRACT(monthly_hll_sketch) AS monthly_estimate,
  monthly_hll_sketch
FROM (
  SELECT HLL_COUNT.MERGE_PARTIAL(hll_sketch) AS monthly_hll_sketch
  FROM `daily-hll-sketches-january`
)

At the end of the quarter, we merge the 3 monthly sketches. At the end of the year, we merge the 12 monthly sketches… or the 4 quarterly ones.
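
The yearly roll-up is the same pattern one level up; here with a hypothetical `monthly-hll-sketches-2019` table holding the twelve monthly sketches:

#standardSQL
-- Hypothetical table with one merged sketch per month of 2019.
SELECT HLL_COUNT.MERGE(monthly_hll_sketch) AS yearly_estimate
FROM `monthly-hll-sketches-2019`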

The impact on processing is huge. A single scan of each daily table is enough to estimate uniques over any time period; the rest of the work is merging HLL sketches, which is very cheap.
