BigQuery’s HyperLogLog++ as a Snowflake Java UDTF

Counting unique snowflakes is fun. Thanks to Google open-sourcing ZetaSketch and Snowflake’s ability to run Java UDFs natively, we can make our approximate counts sketch-compatible with other players in the ecosystem.

“Counting unique Snowflakes”, image generated by AI

In 2019 Google open-sourced ZetaSketch, a “collection of libraries for single-pass, distributed, approximate aggregation and sketching algorithms”. This has allowed other projects to add compatibility for approximate counts and partial results.

Even before Google open-sourced theirs, other projects had written their own implementations of HyperLogLog++ (which they could now switch to the open-source library).

The paper “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm” is an interesting read on HLL and the HLL++ improvements:

  • A useful property of the HyperLogLog algorithm is that the memory requirement does not grow linearly with L, unlike other algorithms such as MinCount or LinearCounting. Instead, the memory requirement is determined by the number of registers and the maximum size of ϱ(w) (which is stored in the registers).
  • Compared to the practical variant of HyperLogLog from [7], the accuracy is significantly better for a large range of cardinalities and equally good on the rest. For precision 14 (and p′ = 25), the sparse representation allows the average error for cardinalities up to roughly 12000 to be smaller by a factor of 4. For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms due to less steep error curves. The sparse representation also allows for a more adaptive use of memory; if the cardinality n is much smaller than m, then HyperLogLog++ requires significantly less memory.
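
To put those precision numbers in context, a standard property of HLL (not specific to ZetaSketch): a sketch with precision p keeps m = 2^p registers, and the typical relative error of the normal representation is about 1.04/√m. For precision 14 that means m = 16,384 registers and an error of roughly 1.04/128 ≈ 0.8%, while sparse precision p′ = 25 buys much better accuracy at low cardinalities while using less memory.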

While Snowflake has its own fast APPROX_COUNT_DISTINCT and HLL functions, here we are going to leverage Snowflake Java UDFs to get HyperLogLog++ running on your Snowflake account.
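
For context, the built-ins look like this; a quick illustration over a generated sequence (their sketches are not compatible with BigQuery’s, which is the gap we’re filling):

-- Snowflake built-ins: HLL() is an alias of APPROX_COUNT_DISTINCT()
select approx_count_distinct(seq) approx_distinct,
  hll(seq) hll_alias
from (select seq4() seq from table(generator(rowcount => 100000)));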

Step 1: Set up

Download and build a JAR

To bring a copy of ZetaSketch into Snowflake, first clone it from github.com/google/zetasketch and build it with the included Gradle wrapper (./gradlew build outputs build/libs/zetasketch-0.2.0-SNAPSHOT.jar). You’ll also need the fastutil JAR that ZetaSketch depends on, available on Maven Central.

I had some problems compiling the fat JAR, but they were easy to solve:

  • “Could not initialize class org.codehaus.groovy.reflection.ReflectionCache”: just edit gradle/wrapper/gradle-wrapper.properties and change gradle-5.5-bin to gradle-6.3 (h/t: Stack Overflow).
  • Later Snowflake might tell you “class file has wrong version 59.0, should be 55.0”. Don’t despair, just add to build.gradle the line targetCompatibility = 11 (class file version 55 corresponds to Java 11, which is what Snowflake expects).

Stage JARs into Snowflake

With SnowSQL you can stage both JARs into Snowflake:

put 'file://build/libs/zetasketch-0.2.0-SNAPSHOT.jar' @~ auto_compress=false;
put 'file://fastutil-8.5.8.jar' @~ auto_compress=false;
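
To double-check the upload, something like this should show both files in your user stage:

list @~ pattern='.*jar';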

Step 2: Create Snowflake Java UDFs

Read external HLL++ sketches in Snowflake

Let’s first create a function that understands HLL++ sketches created on other platforms. For example, in BigQuery you can create one with SELECT HLL_COUNT.INIT(1) FROM (SELECT 1).
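
For illustration, here’s a slightly bigger sketch built the same way (standard BigQuery SQL; to_base64() just makes the encoding explicit):

-- BigQuery: aggregate three values into one HLL++ sketch,
-- returned as a base64 string
select to_base64(hll_count.init(x)) as sketch
from unnest([1, 2, 3]) as x;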

Note that BigQuery encodes binary output as base64 strings by default, so that’s what you get back from HLL_COUNT.INIT(). The binary itself is a serialized proto message, which ZetaSketch parses with forProto(). Hence we want a UDF that can extract an approximate count out of any BigQuery-encoded HLL++ sketch within Snowflake, like:

select hll_count_extract(
base64_decode_binary('CHAQARgCIAiCBwsQARgPIBQyA5ajLg==')
);

Note that Snowflake can decode the base64 string to binary, and our new Java UDF should be able to receive binaries. Let’s write this function:

create or replace function hll_count_extract(x binary)
returns number
language java
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass.doit'
as
$$
import com.google.zetasketch.HyperLogLogPlusPlus;

class MyClass {
  public long doit(byte[] x) {
    // Parse the serialized HLL++ proto and return its cardinality estimate.
    return HyperLogLogPlusPlus.forProto(x).result();
  }
}
$$;
Decoding a binary proto into an HLL++ unique count

That was easy: the function imports the two JARs we staged earlier, lets HyperLogLogPlusPlus parse the binary input as a proto, and finally returns the resulting count as a number.

Compute your own HLL++ approximate counts in Snowflake

Now let’s get to the fun part: Computing approximate counts with HLL++ in Snowflake. It’s easy once you get the basics of defining a Java UDTF in Snowflake.

We are going to write a tabular function that goes through every row of an input table, adding each value to a HyperLogLogPlusPlus object with .add(value). Internally, HyperLogLogPlusPlus keeps a bounded representation that allows it to produce approximate distinct counts over an unbounded number of elements.

The code:

create or replace function hllpp_count_strings(x string)
returns table(hll_count number)
language java
target_path='@~/OptionalOptimizationHlppCountStrings.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
  public final long hll_count;
  public OutputRow(long outputValue) {
    this.hll_count = outputValue;
  }
}

class MyClass {
  private final HyperLogLogPlusPlus<String> _hll;

  public MyClass() {
    _hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
  }

  // Tells Snowflake the shape of the rows this UDTF emits.
  public static Class getOutputClass() {
    return OutputRow.class;
  }

  public Stream<OutputRow> process(String inputValue) {
    _hll.add(inputValue != null ? inputValue : "");
    // Return nothing until the whole partition is computed.
    return Stream.empty();
  }

  // Emit a single row per partition with the approximate count.
  public Stream<OutputRow> endPartition() {
    return Stream.of(new OutputRow(_hll.result()));
  }
}
$$;

Note that HyperLogLogPlusPlus.Builder().buildForStrings() uses default parameters that you can change depending on your use case. For example, once you’ve considered the trade-offs between precision, memory usage, and sketch size, you could ask for Builder().normalPrecision(13).sparsePrecision(19).buildForStrings() instead.

As a sample query, these are the approximate unique authors for the top 100 tags on Stack Overflow:

with data as (
  select owner_user_id::string id, tags[0]::string tag
  from temp.public.stackoverflow_posts202203
  where type_id=1
)
select hll_count, tag
from data, table(hllpp_count_strings(id) over(partition by tag))
order by 1 desc
limit 100

Optimization: Partial sketches for partition parallelization

The code above doesn’t perform well on whole tables or huge partitions, as Snowflake does not parallelize the processing of a single partition in a UDTF.

The following code is able to leverage Snowflake’s automatic partitioning when dealing with whole tables (when no PARTITION BY is specified), by computing partial sketches and then merging them.

In Java a similar UDTF outputs partial sketches instead of counts:

create or replace function hllpp_count_strings_sketch(x string)
returns table(sketch binary)
language java
target_path='@~/OptionalOptimizationHlppCountStringsSketch.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
  public final byte[] sketch;
  public OutputRow(byte[] outputValue) {
    this.sketch = outputValue;
  }
}

class MyClass {
  private final HyperLogLogPlusPlus<String> _hll;

  public MyClass() {
    _hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
  }

  public static Class getOutputClass() {
    return OutputRow.class;
  }

  public Stream<OutputRow> process(String inputValue) {
    _hll.add(inputValue != null ? inputValue : "");
    return Stream.empty();
  }

  // Emit the serialized partial sketch instead of a count.
  public Stream<OutputRow> endPartition() {
    return Stream.of(new OutputRow(_hll.serializeToByteArray()));
  }
}
$$;
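
Before merging, you can peek at how many partial sketches Snowflake’s automatic partitioning actually produced. A quick check, reusing the data CTE from the earlier example (only the id column is needed here):

with data as (
  select owner_user_id::string id
  from temp.public.stackoverflow_posts202203
  where type_id=1
)
select count(*) partial_sketches
from data, table(hllpp_count_strings_sketch(id));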

Then a Java UDF can merge an array of sketches into another binary sketch:

create or replace function hll_merge(x array)
returns binary
language java
target_path='@~/OptionalOptimizationHlppMerge.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass.doit'
as
$$
import java.util.Base64;
import com.google.zetasketch.HyperLogLogPlusPlus;

class MyClass {
  // Snowflake passes the SQL array in as an array of base64-encoded strings.
  public byte[] doit(String[] sketches) {
    HyperLogLogPlusPlus hll = null;
    for (String sketch : sketches) {
      if (hll == null) {
        // Initialize from the first sketch...
        hll = HyperLogLogPlusPlus.forProto(Base64.getDecoder().decode(sketch));
      } else {
        // ...then merge every remaining sketch into it.
        hll.merge(Base64.getDecoder().decode(sketch));
      }
    }
    return hll.serializeToByteArray();
  }
}
$$;
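
As a quick sanity check of the merge logic, you can merge the sample BigQuery sketch from earlier with itself; since HLL merges are idempotent, the estimate should be unchanged:

select hll_count_extract(hll_merge(array_construct(
  'CHAQARgCIAiCBwsQARgPIBQyA5ajLg==',
  'CHAQARgCIAiCBwsQARgPIBQyA5ajLg==')));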

Then in SQL we can compute a unique count for a whole table by letting Snowflake create its own partitioning, merging the resulting sketches, and using the previously created function to transform the merged sketch into a count:

select hll_count_extract(
  hll_merge(
    array_agg(to_char(sketch, 'base64'))))
from data, table(hllpp_count_strings_sketch(id::string));
Aggregating sketches out of Snowflake’s default partitioning for a whole-table HLL++ distinct count.

The above screenshot shows 4,444,953 unique users asking questions on Stack Overflow, according to HLL++. Meanwhile the exact number is 4,471,161, and Snowflake’s internal APPROX_COUNT_DISTINCT says it’s 4,472,075. Thanks, algorithms.
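
For reference, the exact and native approximate numbers can be reproduced with a query like this (assuming the same source table as the earlier examples):

with data as (
  select owner_user_id::string id
  from temp.public.stackoverflow_posts202203
  where type_id=1
)
select count(distinct id) exact_count,
  approx_count_distinct(id) native_approx
from data;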

Performance observations

With Java UDFs we can expect slower performance than a native implementation, and these are the numbers I got when counting unique Stack Overflow users (22m rows) partitioned by tag:

  • HLL++ in a Snowflake UDTF, XL-WH: ~5.80s.
  • Exact COUNT DISTINCT in BigQuery: ~2.70s.
  • HLL++ in BigQuery: ~1.50s.
  • Exact COUNT DISTINCT in Snowflake, XL-WH: ~0.06s.
  • Native Snowflake HLL, XL-WH: ~0.02s.

These results are interesting: counting exact distinct numbers in Snowflake ended up being >10x faster than counting approximate numbers in BigQuery. Take these numbers with a grain of salt, though: when doing these comparisons, what really matters is the performance you get with your own data.

We should also note that HLL++ gives better approximations than Snowflake’s native HLL, but that comparison is moot when calculating exact distinct counts is already so fast. If you need HLL++ compatibility in Snowflake, though: take the code I shared here and make the best out of it.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
