Published in Snowflake

BigQuery’s HyperLogLog++ as a Snowflake Java UDTF

Counting unique snowflakes is fun. Thanks to Google open sourcing ZetaSketch and Snowflake’s ability to run native Java UDFs, we can achieve sketch compatibility for approximate counts with other ecosystem players.

“Counting unique Snowflakes”, image generated by AI
  • A useful property of the HyperLogLog algorithm is that the memory requirement does not grow linearly with L, unlike other algorithms such as MinCount or LinearCounting. Instead, the memory requirement is determined by the number of registers and the maximum size of ρ(w) (which is stored in the registers).
  • Compared to the practical variant of HyperLogLog from [7], the accuracy is significantly better for a large range of cardinalities and equally good on the rest. For precision 14 (and p′ = 25), the sparse representation allows the average error for cardinalities up to roughly 12000 to be smaller by a factor of 4. For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms due to less steep error curves. The sparse representation also allows for a more adaptive use of memory; if the cardinality n is much smaller than m, then HyperLogLog++ requires significantly less memory.
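To make the first bullet concrete, here is a minimal, stdlib-only sketch of the core register update: the top p bits of a hash pick a register, and only ρ(w), the position of the leftmost 1-bit of the remaining bits, is kept (element-wise max). This is an illustration of the algorithm, not ZetaSketch's actual implementation, and it omits the sparse representation and bias corrections the second bullet describes.

```java
// Core HyperLogLog register update: memory is m registers of a few
// bits each, independent of how many items are added.
class HllSketchDemo {
    static final int P = 14;          // precision: 2^14 = 16384 registers
    static final int M = 1 << P;
    final byte[] registers = new byte[M];

    void add(long hash) {
        int idx = (int) (hash >>> (64 - P));        // first p bits pick a register
        long w = hash << P;                          // remaining bits
        int rho = Long.numberOfLeadingZeros(w) + 1;  // rho(w): leftmost 1-bit position
        if (rho > registers[idx]) registers[idx] = (byte) rho;
    }

    double estimate() {
        double sum = 0;
        for (byte r : registers) sum += Math.pow(2, -r);
        double alpha = 0.7213 / (1 + 1.079 / M);     // bias constant for large m
        return alpha * M * M / sum;                  // raw estimate, no corrections
    }
}
```

With p = 14 the standard error of the raw estimate is about 0.8%, which is why the document's approximate counts track the exact ones so closely.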

Step 1: Set up

Download and build a JAR

  • If the build fails with “Could not initialize class org.codehaus.groovy.reflection.ReflectionCache”, just edit gradle/wrapper/gradle-wrapper.properties and change gradle-5.5-bin to gradle-6.3 (h/t: Stack Overflow).
  • Later, Snowflake might tell you “class file has wrong version 59.0, should be 55.0”. Don’t despair, just add the line targetCompatibility = 11 to build.gradle.
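For clarity, that last tweak amounts to this build.gradle fragment (assuming the Gradle java plugin is applied; sourceCompatibility is our addition to keep the source level consistent, not part of the original fix):

```groovy
// Emit Java 11 bytecode (class file version 55.0) so Snowflake loads the JAR
sourceCompatibility = 11
targetCompatibility = 11
```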

Stage JARs into Snowflake

put 'file://build/libs/zetasketch-0.2.0-SNAPSHOT.jar' @~ auto_compress=false
put 'file://fastutil-8.5.8.jar' @~ auto_compress=false

Step 2: Create Snowflake Java UDFs

Read external HLL++ sketches in Snowflake

create or replace function hll_count_extract(x binary)
returns number
language java
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass.doit'
as
$$
import com.google.zetasketch.HyperLogLogPlusPlus;

class MyClass {
    public long doit(byte[] x) {
        return HyperLogLogPlusPlus.forProto(x).result();
    }
}
$$;

select hll_count_extract(
    base64_decode_binary('CHAQARgCIAiCBwsQARgPIBQyA5ajLg==')
);
Decoding a binary proto into an HLL++ unique count

Compute your own HLL++ approximate counts in Snowflake

create or replace function hllpp_count_strings(x string)
returns table(hll_count number)
language java
target_path='@~/OptionalOptimizationHlppCountStrings.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
    public final long hll_count;

    public OutputRow(long outputValue) {
        this.hll_count = outputValue;
    }
}

class MyClass {
    private final HyperLogLogPlusPlus<String> _hll;

    public MyClass() {
        _hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
    }

    public static Class getOutputClass() {
        return OutputRow.class;
    }

    public Stream<OutputRow> process(String inputValue) {
        _hll.add(inputValue != null ? inputValue : "");
        // Return nothing until the whole partition has been computed.
        return Stream.empty();
    }

    public Stream<OutputRow> endPartition() {
        return Stream.of(new OutputRow(_hll.result()));
    }
}
$$;
with data as (
    select owner_user_id::string id, tags[0]::string tag
    from temp.public.stackoverflow_posts202203
    where type_id=1
)
select hll_count, tag
from data, table(hllpp_count_strings(id) over(partition by tag))
order by 1 desc
limit 100;
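The UDTF above follows Snowflake's tabular-handler lifecycle: one handler instance per partition, process() called once per input row (emitting nothing), and endPartition() emitting the single aggregate row. A stdlib-only mock of that contract, with an exact set standing in for the HLL++ state (the class and method names mirror the pattern, not Snowflake's API):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Mock of the per-partition handler lifecycle: accumulate on every
// process() call, emit one row only when the partition ends.
class CountHandler {
    private final Set<String> seen = new HashSet<>(); // stand-in for the HLL++ sketch

    public Stream<Long> process(String value) {
        seen.add(value != null ? value : ""); // nulls folded to "", as in the UDTF
        return Stream.empty();                // no output until endPartition()
    }

    public Stream<Long> endPartition() {
        return Stream.of((long) seen.size()); // one distinct-count row per partition
    }
}
```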

Optimization: Partial sketches for partition parallelization

create or replace function hllpp_count_strings_sketch(x string)
returns table(sketch binary)
language java
target_path='@~/OptionalOptimizationHlppCountStringsSketch.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass'
as
$$
import java.util.stream.Stream;
import com.google.zetasketch.HyperLogLogPlusPlus;

class OutputRow {
    public final byte[] sketch;

    public OutputRow(byte[] outputValue) {
        this.sketch = outputValue;
    }
}

class MyClass {
    private final HyperLogLogPlusPlus<String> _hll;

    public MyClass() {
        _hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
    }

    public static Class getOutputClass() {
        return OutputRow.class;
    }

    public Stream<OutputRow> process(String inputValue) {
        _hll.add(inputValue != null ? inputValue : "");
        return Stream.empty();
    }

    public Stream<OutputRow> endPartition() {
        return Stream.of(
            new OutputRow(_hll.serializeToByteArray()));
    }
}
$$;
create or replace function hll_merge(x array)
returns binary
language java
target_path='@~/OptionalOptimizationHlppMerge.jar'
imports = ('@~/zetasketch-0.2.0-SNAPSHOT.jar', '@~/fastutil-8.5.8.jar')
handler='MyClass.doit'
as
$$
import java.util.Base64;
import com.google.zetasketch.HyperLogLogPlusPlus;

class MyClass {
    public byte[] doit(String[] sketches) {
        HyperLogLogPlusPlus hll = null;
        for (String sketch : sketches) {
            byte[] proto = Base64.getDecoder().decode(sketch);
            if (hll == null) {
                hll = HyperLogLogPlusPlus.forProto(proto);
            } else {
                hll.merge(proto);
            }
        }
        // Guard against an empty input array, which would otherwise NPE.
        return hll == null ? null : hll.serializeToByteArray();
    }
}
$$;
select hll_count_extract(
hll_merge(
array_agg(to_char(sketch, 'base64'))))
from data, table(hllpp_count_strings_sketch(id::string));
Aggregating sketches out of Snowflake’s default partitioning for a whole table distinct HLL++ count.
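The reason partial sketches can be combined like this is that merging two HLL sketches of the same precision reduces to an element-wise max over their registers, so the merge order and partitioning don't affect the result. A stdlib-only illustration of that property (real ZetaSketch merging additionally reconciles precisions and sparse encodings):

```java
// Merging HLL register arrays: element-wise max, commutative and
// associative, which is what makes per-partition sketches safe to combine.
class HllMergeDemo {
    static byte[] mergeRegisters(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("precisions must match");
        }
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) Math.max(a[i], b[i]);
        }
        return out;
    }
}
```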

Performance observations

  • HLL++ in a Snowflake UDTF, XL-WH: ~5.80s.
  • Exact COUNT DISTINCT in BigQuery: ~2.70s.
  • HLL++ in BigQuery: ~1.50s.
  • Exact COUNT DISTINCT in Snowflake, XL-WH: ~0.06s
  • Native Snowflake HLL, XL-WH: ~0.02s.

Want more?

Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.