Snowflake UDFs for Hash Validation, Part 2: Java

Snowpark UDFs support Java, JavaScript, SQL, and Python

Recap

A few weeks ago, I published an article about helping a customer write a Python function that generates hash values matching the MD5_NUMBER_LOWER64 hash function we support in Snowflake SQL.

Our customer needs to divide their users more or less randomly into roughly equal-size groups, so they want a function that produces an effectively random number from some piece of user information. They need to do this both inside Snowflake, where they can simply use the function we already provide, and outside Snowflake, where they want to use Python. The Python they had written wasn’t producing the same numbers, so we gave them a Python function that does, and used our new Python UDF feature (still in Private Preview right now) to confirm with some confidence that the results are consistent.

Will you…

A friend of mine, who shall remain nameless, promptly asked me: “Will you create a Java version of the same to help other customer personas?” My initial answer was “probably not,” but then I decided I might as well. Java is a key part of Snowpark, which lets you build Java code to suit your requirements, deploy a JAR to Snowflake, and then execute that code directly in Snowflake SQL queries.

So, back to Yoshi Matsuzaki for the Java version.

CREATE OR REPLACE FUNCTION MD5_NUMBER_LOWER64_JAVA(msg VARCHAR)
RETURNS NUMBER
LANGUAGE JAVA
HANDLER='Example.md5_number_lower64'
TARGET_PATH='@~/example.jar'
AS
$$
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.math.BigInteger;

class Example {
    public static BigInteger md5_number_lower64(String msg) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] rawDigest = md.digest(msg.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, Arrays.copyOfRange(rawDigest, 8, 16));
    }
}
$$;

You can see that this is very similar to the Python version. The initial definition portion tells Snowflake what to call the function, what arguments it takes, what type of value it returns, and where to find the handler. In this particular case, Yoshi used only functions built into the base Java packages available to UDFs, and the code for our hash is simple enough to include inline. If we needed custom functions or the UDF itself were very complex, we could deploy custom JAR files (or Python packages) to Snowflake instead.

The actual Java code is a bit more complex than the Python code, but ultimately it does the same thing: import some libraries and provide a function to execute. One thing some Java programmers may notice: although the NoSuchAlgorithmException class has to be imported, I haven’t bothered to catch that exception. I know the MD5 algorithm exists, and I didn’t want the extra lines. I would’ve removed the import if I could!
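
Since the whole point is for the customer to compute the same values outside Snowflake, here’s a minimal client-side sketch of how the same method might be reused to assign a user to a group. The user key and the group count of 1000 are purely illustrative assumptions, not the customer’s actual values.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class BucketDemo {
    // Same logic as the UDF handler: take the lower 64 bits of the MD5 digest
    // and interpret them as an unsigned value.
    static BigInteger md5NumberLower64(String msg) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] rawDigest = md.digest(msg.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, Arrays.copyOfRange(rawDigest, 8, 16));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Hypothetical user key and group count, for illustration only.
        String userKey = "alice/123 Main St";
        int numGroups = 1000;
        int group = md5NumberLower64(userKey).mod(BigInteger.valueOf(numGroups)).intValue();
        System.out.println(userKey + " -> group " + group);
    }
}

The .mod() call at the end is just one convenient way to turn the hash into a bucket number; any deterministic mapping from the 64-bit value to a group would do.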

One little problem we ran into initially: Yoshi’s first Java version didn’t include the first argument (the signum of 1) in the BigInteger constructor. As a result, Yoshi’s test, using the word ‘Snowflake’, returned the correct result, while my first test, using a few different words, only returned the correct result some of the time.
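
To see why, here’s a small sketch comparing the two constructors; the digest bytes below are made up for illustration:

import java.math.BigInteger;

public class SignumDemo {
    public static void main(String[] args) {
        // Eight digest bytes whose high bit happens to be set -- roughly half
        // of all MD5 lower-64 values look like this.
        byte[] lower64 = {(byte) 0x9A, 0x12, 0x34, 0x56, 0x78, 0x1A, 0x2B, 0x3C};

        // Without the signum argument, the bytes are read as a signed,
        // two's-complement value, so this prints a negative number.
        System.out.println(new BigInteger(lower64));

        // With signum = 1, the same bytes are treated as an unsigned magnitude,
        // which is what Snowflake's MD5_NUMBER_LOWER64 returns.
        System.out.println(new BigInteger(1, lower64));
    }
}

The one-argument constructor interprets the bytes as a signed, two’s-complement value, so any digest whose ninth byte has its high bit set comes back negative; the two-argument form with a signum of 1 always yields the unsigned value.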

And how’d it do?

As before, I ran a simple SQL query to validate the hashes of 598 GPS tracks, comparing the internal function to the Java function (and I left the Python function in as well). At this point, it will come as no surprise that every Java-generated hash value matched the value from the built-in Snowflake function.

SELECT MD5_NUMBER_LOWER64(TRACK)
         = MD5_NUMBER_LOWER64_PY(TRACK)   PY_MATCH,
       MD5_NUMBER_LOWER64(TRACK)
         = MD5_NUMBER_LOWER64_JAVA(TRACK) JAVA_MATCH,
       COUNT(*) NUM_ROWS
  FROM RAW_TRACKS
 GROUP BY PY_MATCH, JAVA_MATCH;

For these matching queries, I’ve changed the approach so that they return counts of rows that match and rows that don’t; I think that’s more convincing than a query that simply returns no results when everything matches.

For kicks, I also ran the incorrect version and got about 50% correct answers. Considering that the code difference comes down to how the number is interpreted (signed two’s complement versus unsigned), that makes sense: slightly more than half of the results came out negative when read as signed values.

Next up? The much larger test against the TPCH_SF1000 database, also with no differences:

WITH CTE AS (
    SELECT c_name || '/' || c_address CUST_ID
      FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER
)
SELECT MD5_NUMBER_LOWER64(CUST_ID)
         = MD5_NUMBER_LOWER64_JAVA(CUST_ID) MATCH,
       COUNT(*) NUM_ROWS
  FROM CTE
 GROUP BY MATCH;

In this query, I’ve not only changed to the more definitive count of matching rows, I’ve also switched to a CTE so that the CUST_ID expression is assembled only once. In a real use case, where several columns might be concatenated, a slight mismatch between repeated copies of that expression could cause errors.

By the way, I should probably mention that Snowflake includes two TPC-DS and four TPC-H sample data sets available in all of our deployments. The TPC-H samples run from 150K customers with 6M line items across 1.5M orders up to 150M customers with 6 billion line items across 1.5B orders. Couple that with all the data sets available in the Snowflake Data Marketplace, some of them free (like the COVID-19 data powering the State of California’s COVID-19 website), and you’ve got a treasure trove of interesting data you can use seamlessly!

Need More STATS!

My friend, Mr. Nameless, though he knows who he is, also asked “if all the buckets are within 1% of the median, what’s the standard deviation?” and “I wonder how many collisions there are?” He actually wanted to know about collisions with real customer data, but I don’t have that, so I’m sticking with TPC-H.

WITH BUCKETS AS (
    SELECT MD5_NUMBER_LOWER64(c_name || '/' || c_address) % 1000 BUCKET,
           COUNT(*) users
      FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER
     GROUP BY 1
)
SELECT MEDIAN(users) - AVG(users) "MEDIAN DEV",
       MIN(users)    - AVG(users) "MIN DEV",
       MAX(users)    - AVG(users) "MAX DEV",
       ROUND(STDDEV(users))       "STDDEV"
  FROM BUCKETS;

This query tells us the deviation of the median, smallest, and largest buckets from the precise average of 150,000 customers per bucket, as well as the standard deviation. Note that, in this case, I do need the double quotation marks around 3 of the column aliases, since I chose to include spaces in those names.

Now, to be clear, I picked 1000 buckets because it was convenient. We could have used a larger or smaller number, or we could even have computed a number of buckets based on the approximate number of people we wanted in each.
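
For example, here’s a tiny sketch of that last idea; the total and the target group size are just assumptions for illustration:

public class BucketCount {
    public static void main(String[] args) {
        // Illustrative numbers only: a real pipeline would plug in its own totals.
        long totalUsers = 150_000_000L;   // e.g., the TPCH_SF1000 customer count
        long targetPerBucket = 100_000L;  // desired approximate group size
        long numBuckets = Math.max(1, Math.round((double) totalUsers / targetPerBucket));
        System.out.println(numBuckets + " buckets of roughly " + targetPerBucket + " users each");
    }
}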

Meanwhile, the following query returns no rows, telling us that there are no hash collisions among 150 million essentially random strings. I also ran it against the 10,000 scale factor version of TPC-H, which has 1.5 billion customer records, and again got no collisions.

SELECT MD5_NUMBER_LOWER64(c_name || '/' || c_address),
       COUNT(*) users
  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER
 GROUP BY 1
HAVING COUNT(*) > 1;
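
That outcome is also what the math predicts. As a rough back-of-the-envelope check, using the standard birthday approximation for a 64-bit hash space and the two row counts from the tests above, the expected number of colliding pairs among n random inputs is about n(n-1)/2^65:

public class CollisionEstimate {
    public static void main(String[] args) {
        // Expected number of colliding pairs among n uniformly random 64-bit
        // hashes is approximately n * (n - 1) / 2^65 (birthday approximation).
        double hashSpace = Math.pow(2, 64);
        for (long n : new long[] {150_000_000L, 1_500_000_000L}) {
            double expected = (double) n * (n - 1) / (2 * hashSpace);
            System.out.printf("n = %,d -> ~%.3f expected collisions%n", n, expected);
        }
    }
}

Even at 1.5 billion rows, that works out to only about 0.06 expected colliding pairs, so finding none is exactly what we’d hope for.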

Let me Sum Up

The Python and Java UDFs I’ve presented are meant both as examples of how to use custom UDFs in Snowflake and as a demonstration of how UDFs can help you avoid moving data into or out of Snowflake just for validation. These hash values can be used for validating, grouping, or even keying data¹, so knowing that the Python and Java functions can generate the same hash outside of Snowflake is very useful for this customer’s use case.

¹ There is a risk of collisions with MD5, so some people may worry that it isn’t ideal for keying, but the likelihood of a collision is pretty low. MD5 hashes are commonly used as keys in Data Vault modeling, for example.


Dylan Kaufman
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Sales Engineer at Snowflake; biker, father, reader, programmer, long-time analytics consultant