Building petabyte-scale analytics with BigQuery and HLL

Paulius Imbrasas
Published in Permutive
Aug 6, 2019

One of Permutive’s most valuable features is providing analytics to our publishers: from the simple number of unique visitors to a page, to multi-axis filtering by things like country, visited page and intersections with multiple segments. Think of it as advanced Google Analytics.

As we’ve developed this functionality we’ve learned that providing performant, cost-efficient analytics while maintaining high accuracy is really hard. We’ve learned a lot during this process, particularly about BigQuery and HyperLogLog.

To illustrate some of the complexities we’ll start with the simplest use-case: counting the number of unique users for one of our customers.

At the core of this query is the page view event and a user id. With this in mind, and assuming a SQL-like database, we could run a very simple query: SELECT COUNT(DISTINCT user_id) FROM pageview_events. This should work great and get an accurate number relatively quickly, but the complexity starts when you need to work with tens of millions, hundreds of millions or billions of users.

Enter BigQuery.

We won’t go in-depth on what BigQuery is; suffice it to say that it’s a managed database available in Google Cloud which allows for the storage and querying of petabytes of data. Importantly, it’s not a relational database — it doesn’t have the concept of primary or foreign keys and it’s built to be horizontally scalable, which gives it the ability to query massive amounts of data.

With BigQuery in hand we can run the previous query on hundreds of millions or billions of users. In fact, running this exact query on a single month’s worth of pageview events for one of our customers (Publisher A, ~800GB of data) we get a result back in 16.9 seconds with an indicated count of 95,135,851 unique users. Not too bad.

Unfortunately, BigQuery comes at a BigCost™️. The current pricing model for querying data is 5 USD per 1TB of billable bytes, and that adds up quickly. Imagine a dashboard that shows multiple charts — distribution of users by country, distribution of visitors by domain, total number of unique visitors, unique pageviews, etc. In the most naïve case, every chart is a query in its own right, meaning we’ve spent ~$25 just by opening a single page for Publisher A!

Fortunately, BigQuery helps out by being column-oriented, which means you’re only billed for the selected columns. So in our basic example, for Publisher A, if we want to count only unique users we need to select only the user_id column. This means our 800GB query becomes 27GB (we store user ids as UUID v4).

In addition, BigQuery has two more cost-saving features:

  1. Data is partitioned by day — which means you’re only billed for the partitions you query
  2. BigQuery is able to analyse and optimise queries which contain common subqueries. So if you reference the same data over and over again in the same query, you’ll only be billed once for it

As you might expect, our customers aren’t satisfied with a single number of “unique visitors”; they expect rich and insightful analytics. This means selecting more columns, which costs more money, and (at our scale) this becomes a big problem. Finally, there’s the issue of performance. Even though BigQuery is capable of querying petabytes of data, it doesn’t mean it does it very quickly. Customers expect results in real-time.

A New Hope — HyperLogLog.

HyperLogLog (HLL) is an algorithm to count distinct items in a multi-set. An interesting primer can be found in this blog post, but for the purposes of this article all you need to know is that while it counts distinct items very efficiently, it is a statistical approximation, so you trade some accuracy. This accuracy can be controlled by a precision parameter, which in BigQuery’s case goes up to 24 and by default is 15. The tradeoff of increasing accuracy is that you need to store bigger sketches, which once again increases the cost as the amount of data that needs to be queried grows.

A simple HLL query in BigQuery would look as follows:

A simple query using HLL in BigQuery
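A minimal sketch of such a query, reusing the pageview_events table and user_id column from the earlier example (precision 15 is BigQuery’s default), might look like this:

    SELECT
      -- INIT aggregates user ids into an HLL sketch; EXTRACT turns that sketch
      -- into an approximate distinct count
      HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_id, 15)) AS approx_unique_users
    FROM pageview_events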

There are two important functions in the preceding query:

  • INIT — this aggregates a provided column into an HLL sketch that compactly represents the distinct elements in it. The second parameter is the precision, where the higher the number, the better the approximation
  • EXTRACT — this function gets the actual count of distinct elements from a sketch; more on this later

In our example for Publisher A, this query returns an indicated count of 95,242,513 in 12.0 seconds; a saving of 4.9 seconds while still being within 0.11% of the exact count.

HLL’s Key Strength — Composability.

So that looks OK, but where’s the interesting part? Well, the INIT function generates a sketch, and sketches compose. We love functional programming at Permutive — so we love things that compose, and HLL can do just that via the MERGE and MERGE_PARTIAL functions.

Composability allows us to continue counting where we left off and merge the results of two counts to form a more complex count.

Imagine a table that stores pre-aggregated sketches, each row containing an HLL sketch¹ that represents all user ids that have visited a given page on Publisher A’s website:

The page_users table
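A scheduled query along these lines could populate such a table (the users_hll column name is illustrative; page and user_id come from the raw pageview events):

    SELECT
      page,
      -- one pre-aggregated sketch per page, at precision 15
      HLL_COUNT.INIT(user_id, 15) AS users_hll
    FROM pageview_events
    GROUP BY page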

There’ll be users that have visited both the homepage and a given article, but we still want to get a count of distinct user ids. To do that we’ll first need to merge the sketches:

Merging HLL sketches
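Using the illustrative page_users table sketched above, the merge might look like:

    SELECT
      -- MERGE_PARTIAL combines several sketches into one new sketch
      -- (still a sketch, not a count)
      HLL_COUNT.MERGE_PARTIAL(users_hll) AS merged_sketch
    FROM page_users
    WHERE page IN ('/home', '/articles/some-article')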

This results in a new sketch that represents user ids from both pages without double counting users that have visited both pages. As we did before, we can use the EXTRACT function to get the final count.

BigQuery provides a convenience function called MERGE that combines the two steps of MERGE_PARTIAL followed by EXTRACT into a single step:

Using the single-step MERGE function
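With the same illustrative table, that becomes:

    SELECT
      -- MERGE = MERGE_PARTIAL followed by EXTRACT: merge the sketches and
      -- return the approximate distinct count in one go
      HLL_COUNT.MERGE(users_hll) AS approx_unique_users
    FROM page_users
    WHERE page IN ('/home', '/articles/some-article')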

Composability vastly enhances what we can do with HLL. What if we took it a step further and used it to maximise performance and minimise cost? We could pre-aggregate sketches based on some predefined attributes and compose them to build extremely complex queries.

Pre-aggregation and periodic sketch generation

BigQuery supports scheduled queries — the name says everything — it’s a query that runs periodically and appends its result to a destination table. In our case, these queries run daily and append the previous day’s sketches to a pre-aggregated table.

We can use BigQuery’s cost-saving features to generate these pre-aggregated tables. As previously mentioned, BigQuery optimises queries which have multiple subqueries referencing the same tables — you are billed for those bytes only once even though the data is used in multiple places in the query.

The most obvious solution would be to generate a table that enumerates all possible combinations:

Enumerate tables
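For example, a daily scheduled query might group by every dimension at once (column names here are illustrative), producing one small sketch per combination:

    SELECT
      event_date,
      domain,
      page,
      country,
      -- a (potentially tiny) sketch per date/domain/page/country combination
      HLL_COUNT.INIT(user_id, 15) AS users_hll
    FROM pageview_events
    WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY event_date, domain, page, country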

But there are a couple of issues:

  • The sketches become increasingly tiny sets, and we wrote about small sets and their issues previously
  • Another issue is cost — in order to do the equivalent of a SELECT COUNT(DISTINCT user_id) with no filtering, you have to select all the sketches in the table and merge them
  • Finally, merging potentially millions of sketches quickly becomes very slow

So we need a better solution, one that utilises HLL’s composability better:

Composable sketches
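A rough sketch of the daily query behind such a table (event_date and page_hll are illustrative names; country_hll matches the column mentioned below): each dimension is aggregated on its own and the results are stacked, so every row carries exactly one populated sketch column:

    SELECT
      event_date,
      country,
      HLL_COUNT.INIT(user_id, 15) AS country_hll,
      CAST(NULL AS STRING) AS page,
      CAST(NULL AS BYTES) AS page_hll
    FROM pageview_events
    WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY event_date, country

    UNION ALL

    SELECT
      event_date,
      CAST(NULL AS STRING),
      CAST(NULL AS BYTES),
      page,
      HLL_COUNT.INIT(user_id, 15)
    FROM pageview_events
    WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY event_date, page

Because both halves reference the same pageview_events data, BigQuery’s common-subquery optimisation means we only pay for that scan once.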

We call these tables “master tables” as they contain all the fields necessary, but as separate sketches. We don’t query these tables directly; instead we go through views. We then use scheduled queries to generate the sketches for each day.

This is cost-efficient not only because we select the data once in order to insert it into this pre-aggregated table, but also because BigQuery only bills for selected columns. This means that if we want to select only the sketches of some countries, we can do it by adding a WHERE country IS NOT NULL and selecting only country and country_hll. This way no other columns are referenced or billed.

Finally, the master table’s structure might allow us to write really small and efficient queries if BigQuery’s MERGE function accepted multiple arguments:

A better solution, that sadly isn’t possible
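Purely as an illustration of what we would like to write (this multi-argument form is hypothetical; BigQuery rejects it):

    -- HYPOTHETICAL: HLL_COUNT.MERGE does not accept more than one argument
    SELECT
      HLL_COUNT.MERGE(page_hll, country_hll) AS approx_unique_users
    FROM master_precomputed
    WHERE page IS NOT NULL
       OR country = 'UK'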

Unfortunately, this is currently not supported. We’ve made a feature request to Google which hasn’t so far been taken up.

Querying the data

So far we’ve talked about two main things:

  • How to use HLL and its benefits (performance, composability)
  • Utilising its composability to pre-aggregate reusable sketches, which in turn can be used to create more complex queries

However, so far we’ve only done unions of sets. This isn’t enough for an insightful report, so we want to do some filtering — for example, let’s say we have another table which groups users by country:

Grouped by country
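Analogously to page_users, such a table might be built with a query like this (names are illustrative):

    SELECT
      country,
      -- one pre-aggregated sketch per country
      HLL_COUNT.INIT(user_id, 15) AS users_hll
    FROM pageview_events
    GROUP BY country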

Using a traditional (non-HLL) approach, we could do something like:

By country using a traditional approach
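Over the raw events, rather than the sketches, that would simply be:

    SELECT
      COUNT(DISTINCT user_id) AS unique_users
    FROM pageview_events
    WHERE page = '/home'
      AND country = 'UK'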

The trouble is we’re storing the sketches separately so we can’t do that anymore — we need to do an intersection (we’ll talk more about storing sketches later on in the post, for now assume the sketches are stored separately for ‘reasons’). Unfortunately, HLL doesn’t natively support intersections, but luckily, we have a solution.

Inclusion-exclusion principle.

To quote Wikipedia, the inclusion-exclusion principle is a:

counting technique which generalises the familiar method of obtaining the number of elements in the union of two finite sets

While the main use case of this principle is the union, the equation can be rearranged to get the intersection instead. In our case, it can be expressed as:

|A ∩ B| = |A| + |B| - |A ∪ B|

In other words, the size of the intersection of set A and set B is equal to the sum of their individual sizes minus the size of their union. Fortunately, HLL does have all of these operators built in, but what does it look like?

Intersections using HLL
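A sketch of that query, reusing the illustrative page_users and country_users tables: each cardinality comes from MERGE, and the union term is built by stacking both sketch columns into a single column with UNION ALL:

    SELECT
      -- |A|: users who visited the page
      (SELECT HLL_COUNT.MERGE(users_hll)
       FROM page_users
       WHERE page = '/home')
      -- |B|: users seen in the country
      + (SELECT HLL_COUNT.MERGE(users_hll)
         FROM country_users
         WHERE country = 'UK')
      -- |A ∪ B|: both sets of sketches merged through one column
      - (SELECT HLL_COUNT.MERGE(sketch)
         FROM (
           SELECT users_hll AS sketch FROM page_users WHERE page = '/home'
           UNION ALL
           SELECT users_hll AS sketch FROM country_users WHERE country = 'UK'
         ))
      AS approx_intersection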

It doesn’t look great. The MERGE function is limited to a single parameter in BigQuery — which means the sketches have to be on the same column. Ideally we could do something like the following, but it’s not currently possible in BigQuery.

A second parameter being passed to MERGE

Going back to the result of the intersection — we end up with the final count and not a sketch. This is important as it means we lose HLL’s composability, hence the intersection must always be the final step in the query.

The problem with intersections.

First, since the inclusion-exclusion formula expands over combinations of sets, its complexity depends on the number of sets used in the equation. If you want to get an intersection of three sets (the equivalent of 3 WHERE conditions) the formula becomes significantly larger:

|A ∩ B ∩ C| = |A| + |B| + |C| - |A ∪ B| - |A ∪ C| - |B ∪ C| + |A ∪ B ∪ C|

As you can imagine, it gets worse with every term that you add — while the query with 1 filter is 16 lines of SQL (nicely formatted), 3 filters becomes 81 lines.

Is this a problem? In one sense, no: execution time is still reasonable, but we’ve entered the domain of unmaintainable SQL. The more filters you add, the more terms there are, and the more complex the query. This is very error-prone, but luckily it’s easily automatable through code generation. It’s still a problem though, as machine-generated code is hard to maintain.

Another caveat of intersections is error rate. Every intersected set compounds the error rate. We’ve found this to be the case especially with small sets (and even more so when a huge set is intersected with a tiny set). The publisher Schibsted has found similar issues, although it’s important to note they do not use Google’s implementation of HLL, hence Schibsted’s numbers cannot be directly compared to ours.

Improving performance and reducing the error rate

After using BigQuery, HLL and intersections we still have two problems we need to mitigate:

  1. Slow queries, due to BigQuery limitations
  2. Elevated error-rates due to the number of intersections

While the former can’t currently be mitigated, the latter can be, to a degree. To do so we’ll use another BigQuery optimisation — automatically removing branches which always evaluate to FALSE. For example, the following query will do nothing and cost nothing:

The query, it does nothing
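For instance, a query like this scans no data at all:

    SELECT
      HLL_COUNT.MERGE(country_hll) AS approx_users
    FROM master_precomputed
    WHERE "1" = "2"  -- always FALSE, so BigQuery prunes the whole scan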

Since the WHERE always evaluates to FALSE no rows will be returned. BigQuery is smart enough to understand this, even in complex situations. Using the master_precomputed table, we can pick our base data as the union of all page sketches:

A union of page sketches
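With the illustrative page_hll column, the base query is simply:

    SELECT
      HLL_COUNT.MERGE(page_hll) AS approx_unique_users
    FROM master_precomputed
    WHERE page IS NOT NULL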

But now we want to do filtering based on country as well:

Country filtering in action
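Without any pre-processing, that again means inclusion-exclusion, this time entirely against the master table (column names are illustrative):

    SELECT
      -- |A|: the union of all page sketches
      (SELECT HLL_COUNT.MERGE(page_hll)
       FROM master_precomputed WHERE page IS NOT NULL)
      -- |B|: the selected country
      + (SELECT HLL_COUNT.MERGE(country_hll)
         FROM master_precomputed WHERE country = 'UK')
      -- |A ∪ B|
      - (SELECT HLL_COUNT.MERGE(sketch)
         FROM (
           SELECT page_hll AS sketch
           FROM master_precomputed WHERE page IS NOT NULL
           UNION ALL
           SELECT country_hll AS sketch
           FROM master_precomputed WHERE country = 'UK'
         ))
      AS approx_users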

We can now utilise a little bit of pre-processing and do something like:

Country filtering, with preprocessing
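An illustrative version of the templated query, where "[country|1]" stands for the pre-processor placeholder described just below and is substituted before the SQL ever reaches BigQuery:

    SELECT approx_users
    FROM (
      -- Branch 1: no country filter. Kept when the template renders as "1"="1".
      SELECT HLL_COUNT.MERGE(page_hll) AS approx_users
      FROM master_precomputed
      WHERE page IS NOT NULL
        AND "[country|1]" = "1"

      UNION ALL

      -- Branch 2: a country filter is set. Kept when the template renders as,
      -- say, "UK"!="1", and performs the inclusion-exclusion intersection.
      SELECT
        (SELECT HLL_COUNT.MERGE(page_hll)
         FROM master_precomputed WHERE page IS NOT NULL)
        + (SELECT HLL_COUNT.MERGE(country_hll)
           FROM master_precomputed WHERE country = "[country|1]")
        - (SELECT HLL_COUNT.MERGE(sketch)
           FROM (
             SELECT page_hll AS sketch
             FROM master_precomputed WHERE page IS NOT NULL
             UNION ALL
             SELECT country_hll AS sketch
             FROM master_precomputed WHERE country = "[country|1]"
           )) AS approx_users
      FROM (SELECT 1)
      WHERE "[country|1]" != "1"
    )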

The pre-processor will find instances that look like "[X|1]"="1" and evaluate them.

If the filter is not set, a default value is used, hence the first template is evaluated into "1"="1" and the second one into "1"!="1". Since the second query is statically evaluated to never return any rows, BigQuery automatically removes it and the whole query simply becomes (note the second query is completely missing):

BigQuery optimising the query

If the filter is set (e.g. to UK) the first template is evaluated into "UK"="1" and the second into "UK"!="1". This time the first query is always false, meaning it’s optimised out and not used, resulting in the following query:

Further BigQuery optimisations

This means that we only do intersections when necessary — only when the filter is set.

Final optimisations

One final step we can take to optimise the cost and performance of the queries is addressing the issue of including today’s data. As previously mentioned, our pre-aggregation is executed once a day and runs for the previous day. This means we’re always missing some data.

By utilising our filter pre-processor and HLL, we have the ability to control both the cost and performance of the queries.

We’ve added a couple of things: a new subquery that retrieves today’s data from the raw table and generates the HLL sketch. Since these are additional users we may or may not have seen, we need to add them to all terms. Finally, we filter with our pre-processor to dynamically include or exclude the subquery.
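A simplified illustration of just the base term: the pre-aggregated sketches are unioned with a sketch built from today’s raw events, and the extra branch is guarded by another pre-processor template ("[include_today|0]", the partition filter and the column names are all assumptions):

    SELECT HLL_COUNT.MERGE(sketch) AS approx_unique_users
    FROM (
      -- pre-aggregated sketches for all fully processed days
      SELECT page_hll AS sketch
      FROM master_precomputed
      WHERE page IS NOT NULL

      UNION ALL

      -- today's users straight from the raw events; this branch is pruned
      -- (and not billed) whenever the template renders as "0"="1"
      SELECT HLL_COUNT.INIT(user_id, 15) AS sketch
      FROM pageview_events
      WHERE DATE(_PARTITIONTIME) = CURRENT_DATE()
        AND "[include_today|0]" = "1"
    )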

This means that our customers can choose to exclude today’s data, which means their queries are faster (as they only query pre-aggregated data) and the queries are cheaper to run for us — a win-win.

Does all of this work?

Yes! We can optimise a dashboard with 10 charts (all of which run independently) and compare the two versions — one using precomputation and HLL, and one querying the raw data.

  • Originally the report took >5 minutes to generate, for a period of 30 days. In fact, it would often time out after being billed for 10 TB (!) of data.
  • For the same date range, but excluding today’s data, the query takes less than 20 seconds and consumes only 21GB of data. This means a saving of 4.895 USD per query, and crucially, a vastly improved user experience.
  • The impact of including today’s data varies by customer. For one of our larger customers, it could mean an extra 100GB of data and a query time of 20 seconds

What’s next?

Unfortunately, this solution isn’t without flaws. Our filtering optimisation has issues: a subquery is needed for every combination of filters, e.g. if you have three possible filters, the query will contain eight subqueries.

At this point we hit another BigQuery restriction:

Resources exceeded during query execution: Not enough resources for query planning — too many subqueries or query is too complex.

The queries (with all of our cost-optimisations) become too complex and are immediately rejected. Having contacted Google support, we learned this is apparently a restriction in order to “protect BigQuery” and is not configurable per-project.

So what do we do? All-in-all, HLL in BigQuery works well, it’s very cost-efficient and reasonably accurate, but there are too many caveats. The machine-generated SQL is difficult to maintain and we’re hitting a BigQuery imposed query complexity limit.

Google has recently released their HLL code as a Java library called zetasketch. It’s compatible with the sketches stored in BigQuery, meaning we could still use features such as scheduled queries, but would no longer be limited to the awkward syntax of SQL and limitations such as MERGE only accepting a single parameter. This suggests a hybrid solution where BigQuery is still responsible for pre-aggregation but intersections are handled in an external service, avoiding the syntax limitations, but also introducing inflexibility as we move out of the world of SQL.

Perhaps we can go a step further: we could develop a custom DSL which removes the restrictions imposed by SQL? Or we could develop a streaming solution, similar to Reddit? We could even look at a solution other than HyperLogLog: Yahoo has developed an algorithm called the Theta Sketch which natively supports intersections.

We plan on publishing a follow-up article on how we can take HyperLogLog and related algorithms and further improve our analytics products. This means better performance, lower cost and improved accuracy with smaller datasets.

¹ HLL sketches in BigQuery are stored as Base64-encoded protobufs.
