BigQuery’s HyperLogLog++ as a Snowflake Java UDTF
Counting unique snowflakes is fun. Thanks to Google open sourcing ZetaSketch and Snowflake’s ability to run native Java UDFs, we can achieve sketch compatibility for approximate counts with other ecosystem players.
In 2019 Google open-sourced ZetaSketch, a “collection of libraries for single-pass, distributed, approximate aggregation and sketching algorithms”. This has allowed other projects to enable compatibility for approximate counts and partial results:
- BigQuery: “Counting uniques faster in BigQuery with HyperLogLog++” (Fun fact: I wrote that one).
- Apache BEAM: “HyperLogLog++ Integration with Apache Beam”.
Even before Google open-sourced theirs, other projects had written their own implementations of HyperLogLog++ (which they could now switch to the open-source library):
- Elastic: HyperLogLogPlusPlus
- Spark: HyperLogLogPlusPlus.scala
The paper “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm” is an interesting read on HLL and the HLL++ improvements:
- “A useful property of the HyperLogLog algorithm is that the memory requirement does not grow linearly with L, unlike other algorithms such as MinCount or LinearCounting. Instead, the memory requirement is determined by the number of registers and the maximum size of ϱ(w) (which is stored in the registers)”
- “Compared to the practical variant of HyperLogLog from [7], the accuracy is significantly better for a large range of cardinalities and equally good on the rest. For precision 14 (and p′ = 25), the sparse representation allows the average error for cardinalities up to roughly 12000 to be smaller by a factor of 4. For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms due to less steep error curves. The sparse representation also allows for a more adaptive use of memory; if the cardinality n is much smaller than m, then HyperLogLog++ requires significantly less memory”
While Snowflake has its own fast APPROX_COUNT_DISTINCT and HLL functions, here we are going to leverage Snowflake Java UDFs to get HyperLogLog++ running on your Snowflake account.
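For reference, the native path needs no Java at all. A minimal sketch, assuming a hypothetical events table with a user_id column:

-- Snowflake’s built-in approximate distinct count (plain HLL, not HLL++)
select approx_count_distinct(user_id) from events;
-- HLL() is an alias for the same function
select hll(user_id) from events;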
Step 1: Set up
Download and build a JAR
To bring a copy of ZetaSketch into Snowflake, first download it and compile:
- Get a copy: git clone https://github.com/google/zetasketch.git
- Compile a fat JAR: ./gradlew shadowJar
- The fat JAR does not include all dependencies, so you’ll also need to download fastutil-8.5.8.jar.
I had some problems compiling the fat JAR, but they were easy to solve:
- “Could not initialize class org.codehaus.groovy.reflection.ReflectionCache”: Just edit gradle/wrapper/gradle-wrapper.properties and change gradle-5.5-bin to gradle-6.3 (h/t: Stack Overflow).
- Later Snowflake might tell you “class file has wrong version 59.0, should be 55.0”. Don’t despair, just add the line targetCompatibility = 11 to build.gradle.
Stage JARs into Snowflake
With SnowSQL you can stage both JARs into Snowflake:
put 'file://build/libs/zetasketch-0.2.0-SNAPSHOT.jar' @~ auto_compress=false;
put 'file://fastutil-8.5.8.jar' @~ auto_compress=false;
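To double-check that both JARs landed in your user stage, you can list it (this assumes the default user stage @~, as above):

list @~ pattern='.*jar';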
Step 2: Create Snowflake Java UDFs
Read external HLL++ sketches in Snowflake
Let’s first create a function that understands HLL++ sketches created on other platforms. For example, in BigQuery you can create one with SELECT HLL_COUNT.INIT(1) FROM (SELECT 1).

Note that BigQuery encodes binary output as base64 strings, so that’s what HLL_COUNT.INIT() gives you back. That binary is a serialized proto message, which the Java code below parses with forProto(). Hence we want to write a UDF that can extract an approximate count out of any BigQuery-encoded HLL++ sketch within Snowflake, like:
select hll_count_extract(
base64_decode_binary('CHAQARgCIAiCBwsQARgPIBQyA5ajLg==')
);
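If you want to produce a sketch like that yourself on the BigQuery side, a minimal sketch of the export query (the table my_dataset.events and column user_id are just placeholders) could be:

-- BigQuery: build an HLL++ sketch and base64-encode it for export
select to_base64(hll_count.init(user_id)) as sketch_base64
from my_dataset.events;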
Note that Snowflake can decode the base64 string to binary, and our new Java UDF should be able to receive binaries. Let’s write this function:
create or replace function hll_count_extract(x binary)
returns number
language java
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass.doit'
as
$$
import com.google.zetasketch.HyperLogLogPlusPlus;

class MyClass {
public long doit(byte[] x) {
return HyperLogLogPlusPlus.forProto(x).result();
}
}
$$;
That was easy: The function is simple. It brings in the 2 JARs we staged earlier, asks HyperLogLogPlusPlus to parse the binary input as a proto, and finally returns a number.
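If you load a whole table of such sketches into Snowflake, the same UDF extracts a count per row. A quick sketch, assuming a hypothetical bq_sketches table with a tag column and a base64-encoded sketch column:

-- One approximate count per imported BigQuery sketch
select tag, hll_count_extract(base64_decode_binary(sketch)) as approx_uniques
from bq_sketches;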
Compute your own HLL++ approximate counts in Snowflake
Now let’s get to the fun part: Computing approximate counts with HLL++ in Snowflake. It’s easy once you get the basics of defining a Java UDTF in Snowflake.
We are going to write a tabular function that goes through every row in an input table, adding each value to the HyperLogLogPlusPlus object with .add(value). HyperLogLogPlusPlus then keeps a bounded internal representation that allows it to produce approximate distinct counts over an unbounded number of elements.
The code:
create or replace function hllpp_count_strings(x string)
returns table(hll_count number)
language java
target_path='@~/OptionalOptimizationHlppCountStrings.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
  public final long hll_count;

  public OutputRow(long outputValue) {
this.hll_count = outputValue;
}
}

class MyClass {
private final HyperLogLogPlusPlus<String> _hll;
public MyClass() {
_hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
}
public static Class getOutputClass() {
return OutputRow.class;
}
public Stream<OutputRow> process(String inputValue) {
_hll.add(inputValue != null ? inputValue : "");
// Return nothing until whole partition computed.
return Stream.empty();
}
public Stream<OutputRow> endPartition() {
return Stream.of(new OutputRow(_hll.result()));
}
}
$$;
Note that HyperLogLogPlusPlus.Builder().buildForStrings() sets up some default parameters that you can change depending on your own use case. For example, once you’ve considered the trade-offs between precision, memory usage, and sketch size, you could instead ask for Builder().normalPrecision(13).sparsePrecision(19).buildForStrings().
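To illustrate, here is a sketch of the same UDTF with those lower precisions (the name hllpp_count_strings_p13 is just for illustration; only the builder call changes):

create or replace function hllpp_count_strings_p13(x string)
returns table(hll_count number)
language java
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
  public final long hll_count;

  public OutputRow(long outputValue) {
    this.hll_count = outputValue;
  }
}

class MyClass {
  // Lower precisions mean smaller sketches and less memory, at the cost of accuracy.
  private final HyperLogLogPlusPlus<String> _hll =
      new HyperLogLogPlusPlus.Builder()
          .normalPrecision(13)
          .sparsePrecision(19)
          .buildForStrings();

  public static Class getOutputClass() {
    return OutputRow.class;
  }

  public Stream<OutputRow> process(String inputValue) {
    _hll.add(inputValue != null ? inputValue : "");
    return Stream.empty();
  }

  public Stream<OutputRow> endPartition() {
    return Stream.of(new OutputRow(_hll.result()));
  }
}
$$;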
As a sample query, these are the approximate unique authors for the top 100 tags on Stack Overflow:
with data as (
select owner_user_id::string id, tags[0]::string tag
from temp.public.stackoverflow_posts202203
where type_id=1
)
select hll_count, tag
from data, table(hllpp_count_strings(id) over(partition by tag))
order by 1 desc
limit 100
Optimization: Partial sketches for partition parallelization
The code above doesn’t perform well with whole tables or huge partitions, as the processing within each UDTF partition doesn’t get parallelized by Snowflake.
The following code is able to leverage Snowflake’s automatic partitioning when dealing with whole tables (when no PARTITION BY is specified), by computing partial sketches and then merging them.
First, a similar Java UDTF outputs partial sketches instead of counts:
create or replace function hllpp_count_strings_sketch(x string)
returns table(sketch binary)
language java
target_path='@~/OptionalOptimizationHlppCountStringsSketch.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
  public final byte[] sketch;

  public OutputRow(byte[] outputValue) {
this.sketch = outputValue;
}
}

class MyClass {
private final HyperLogLogPlusPlus<String> _hll;
public MyClass() {
_hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
}
public static Class getOutputClass() {
return OutputRow.class;
}
public Stream<OutputRow> process(String inputValue) {
_hll.add(inputValue != null ? inputValue : "");
return Stream.empty();
}
public Stream<OutputRow> endPartition() {
return Stream.of(
new OutputRow(_hll.serializeToByteArray()));
}
}
$$;
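If you call it without a PARTITION BY, Snowflake picks the partitions for you and you get one partial sketch per implicit partition. A quick way to see how many partitions Snowflake created (assuming data is available as a table, or re-declared as the CTE from earlier):

-- Each row is one partial HLL++ sketch, emitted by endPartition() for an implicit partition
select count(*) as partial_sketches
from data, table(hllpp_count_strings_sketch(id::string));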
Then a Java UDF can merge an array of sketches into another binary sketch:
create or replace function hll_merge(x array)
returns binary
language java
target_path='@~/OptionalOptimizationHlppMerge.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass.doit'
as
$$
import java.util.Base64;
import com.google.zetasketch.HyperLogLogPlusPlus;

class MyClass {
public byte[] doit(String[] sketches) {
HyperLogLogPlusPlus hll = null;
for (String sketch : sketches) {
if(hll==null) {
hll = HyperLogLogPlusPlus.forProto(Base64.getDecoder().decode(sketch));
} else {
hll.merge(Base64.getDecoder().decode(sketch));
}
}
return hll.serializeToByteArray();
}
}
$$;
Then in SQL we can compute a unique count for a whole table, by letting Snowflake create its own partitioning, merging these sketches, and then using the previously created function to transform a sketch into a count:
select hll_count_extract(
hll_merge(
array_agg(to_char(sketch, 'base64'))))
from data, table(hllpp_count_strings_sketch(id::string));
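For the comparison below, the exact and native approximate numbers come from queries like these (again assuming data is available as a table):

-- Exact distinct count of users
select count(distinct id) as exact_uniques from data;

-- Snowflake’s native HLL approximation
select approx_count_distinct(id) as native_hll_uniques from data;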
Running this shows 4,444,953 unique users asking questions, according to HLL++. Meanwhile the exact number is 4,471,161, and Snowflake’s internal APPROX_COUNT_DISTINCT says it’s 4,472,075. Thanks, algorithms.
Performance observations
With Java UDFs we can expect slower performance than a native implementation, and these are the numbers I got when counting unique Stack Overflow users (22m rows) partitioned by tag:
- HLL++ in a Snowflake UDTF, XL-WH: ~5.80s.
- Exact COUNT DISTINCT in BigQuery: ~2.70s.
- HLL++ in BigQuery: ~1.50s.
- Exact COUNT DISTINCT in Snowflake, XL-WH: ~0.06s
- Native Snowflake HLL, XL-WH: ~0.02s.
These results are interesting: Counting exact distinct numbers in Snowflake ended up being >10x faster than counting approximate numbers in BigQuery. Remember to take these numbers with a grain of salt: when doing these comparisons, what really matters are the numbers you get with your own data.
We should also note that HLL++ gives better approximations than Snowflake’s native HLL, but that comparison is moot when computing exact distinct counts is so fast. If you need HLL++ compatibility in Snowflake, though: take the code I shared here and make the best out of it.
Want more?
- Try this out with a Snowflake free trial account — you only need an email address to get started.
I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.