Validating a Python Hash Function Inside Snowflake

Using Snowpark with Python to group users

Generating Even Buckets Based on a Hash

A customer of mine uses MD5 hashing to split their users into a number of groups, randomly but fairly evenly. The idea is that the hash will be unique if it is computed across enough pieces of user information to define a unique key for the user. Treating the effectively random hash value as an integer and applying modulo division then puts each person into one of some number of buckets at random. The distribution may not be perfectly even, but it’ll probably be pretty close across enough users.

Crucially, my customer also needs to be able to do this both inside and outside Snowflake, producing the same hash values, and therefore the same buckets, in both places.

It’s important to understand that although MD5 was originally intended for cryptographic purposes, it isn’t secure enough for that use any longer. It can still be used for my customer’s bucketing use case, which is very similar to the common partitioning usage in Data Vault data modeling, among others.

It can also still be used to identify corruption in files, as is commonly done with software downloads, including our drivers and connectors: download the file and the MD5 hash, run the MD5 function against the file yourself and compare it to the original hash. Although the hash values aren’t guaranteed to be unique for different inputs, if you get the same value as the one you downloaded, the file most likely hasn’t been corrupted.
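
As a minimal Python sketch of that check (the file name and expected hash below are placeholders, not values from any real download page):

import hashlib
from pathlib import Path

# Placeholder values; substitute the real download and its published hash.
expected = "0123456789abcdef0123456789abcdef"
actual = hashlib.md5(Path("driver-download.zip").read_bytes()).hexdigest()
print("file intact" if actual == expected else "file corrupted")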

As a really quick example of how to do this, I used the MD5_NUMBER_LOWER64 hash function. I picked that one because it directly returns an integer, rather than a hex string.

In this example, I ran it on two strings: one with my name and one with another name, each followed by company name and company address. One could easily imagine including a bunch of other information in the input string, such as title, phone number, etc.; presumably, as many attributes as necessary to form a fairly unique key for the user.

SELECT 'Me' "WHO", MD5_NUMBER_LOWER64('Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402') "HASH"
UNION
SELECT 'Mgr', MD5_NUMBER_LOWER64('Alex Rottinghuis/Snowflake/450 Concar Dr, San Mateo, CA 94402');
Snowsight output of the example query

Suppose we want 1000 buckets of users. Since “mod 1000” is very easy to do in your head, that would put me in bucket 378 and Alex in bucket 664. If we only wanted 750 groups, we’d use MOD("HASH", 750), and then I’d be in group 128 and Alex in 414.
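
As a rough Python sketch of that arithmetic (the hash value below is a placeholder, not the actual output of the query above):

# Placeholder; substitute a real MD5_NUMBER_LOWER64 result.
hash_value = 4242424242424242378

print(hash_value % 1000)  # bucket out of 1000 (just the last three digits)
print(hash_value % 750)   # bucket out of 750, like MOD("HASH", 750) in SQL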

This is all well and good, but as I mentioned before, my customer needs to be able to generate the exact same hash outside of Snowflake as well. They were trying to do this in Python themselves but were not getting the same results as our LOWER64 function, so they asked me for help.

MD5 Hashing Support in Snowflake SQL

Snowflake supports a few MD5 hashing functions. The standard MD5 function returns a hexadecimal string, as does MD5_HEX. MD5_BINARY returns the hash as a binary value, which Snowflake displays by default as a hexadecimal string (differing from the MD5/MD5_HEX version only in that it uses capital letters rather than all lower-case). We also have the MD5_NUMBER_LOWER64 and MD5_NUMBER_UPPER64 versions, which return part of the hash as an integer, as I showed earlier with the MD5_NUMBER_LOWER64 function.
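
As a quick Python illustration of the hex-returning variants (a sketch on my part, using the sample string from earlier and the display behavior described above; the integer variants come up again later in this article):

import hashlib

# One 16-byte digest, formatted the two ways described above.
h = hashlib.md5(b"Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402")
print(h.hexdigest())          # lower-case hex, like MD5 / MD5_HEX
print(h.hexdigest().upper())  # upper-case hex, like MD5_BINARY's default display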

This SQL script shows the output of each of those functions (I left out MD5_HEX, since it is the same as MD5), using the very cool UNPIVOT function to convert it into an easier format for reading here.

SET ME = 'Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402';

WITH CTE AS (
SELECT MD5($ME) "MD5",
TO_VARCHAR(MD5_BINARY($ME)) "MD5_BINARY",
TO_VARCHAR(MD5_NUMBER_UPPER64($ME)) "MD5_NUMBER_UPPER64",
TO_VARCHAR(MD5_NUMBER_LOWER64($ME)) "MD5_NUMBER_LOWER64"
)
SELECT * FROM CTE
UNPIVOT("HASH" FOR "FUNCTION" IN
(MD5, MD5_BINARY, MD5_NUMBER_UPPER64, MD5_NUMBER_LOWER64));
Output of 4 different MD5 hash functions

It is perhaps worth mentioning that Snowflake also supports SHA1 and SHA2 hashing, each of which also has a _BINARY version (though note that SHA-1, like MD5, is no longer considered secure for cryptographic purposes; use SHA2 where that matters). Also, when you’re using these functions, though I didn’t bother to do it here, you may want to TRIM your input strings to make sure there’s no whitespace at the end. That whitespace will not be easy to see when you’re looking at the data, but it will change the hash. If you’re trying to compare the output between two systems and one of them isn’t trimming, you’re going to get a little frustrated!
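
To see why, here’s a quick Python demonstration using the sample string from earlier; a single trailing space produces a completely different hash:

import hashlib

clean = "Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402"
padded = clean + " "  # one invisible trailing space

print(hashlib.md5(clean.encode("utf-8")).hexdigest())
print(hashlib.md5(padded.encode("utf-8")).hexdigest())          # completely different
print(hashlib.md5(padded.strip().encode("utf-8")).hexdigest())  # matches again, like TRIM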

Since the customer needs to do this in Python, they either need to figure out how to fix their implementation or start using one that is known to work. I asked around internally about whether we had a Python example I could pass along, and soon had a small Python script from Yoshi Matsuzaki, complete with the matching outputs of a single run of each: Yoshi’s Python script and our internal function. Felipe Hoffa promptly turned that example into a Stack Overflow post.

One thing that concerned me, though, was the very small sample size in the example. Does doing a single hash on one word prove that the little Python implementation will consistently return the same value as the built-in function?

I don’t actually know how far off the results were for my customer. For all I know, their code gives a different result on every input, in which case getting the right result on a single input would be sufficient. But on the assumption that their code sometimes agrees with ours, I decided I should run some other tests to get more certainty.

Generating Some Hash Values to Test

As it happens, I have 598 GPS tracks from bike rides, recorded on a Garmin GPS or with the Strava app, already loaded into a table. The XML of the .gpx and .tcx files is stored in a VARIANT column, and they range in size from 1,887 bytes (must’ve been a very short ride!!) to almost 46 MB. The Snowflake VARIANT column converts automatically to a string, so running it through one of the MD5 functions is trivial:

SELECT MD5_NUMBER_LOWER64(TRACK) FROM RAW_TRACKS;

The output of that function, which I’m not going to bother showing here, is, unsurprisingly, 598 integer values. Throwing a DISTINCT on there still got me 598, so I know these particular values are unique; it’s important to remember that MD5 isn’t guaranteed to be unique across different inputs.

Now I need to run the same XML strings through a Python function to do the same thing, and I need to be able to tie the output of the Python function to the output of the Snowflake function so that I can tell whether they’re the same.

I don’t yet have the names of the files associated with their rows in the RAW_TRACKS table, so writing a Python program on my computer to generate a list of file names and hashes and then doing something to compare that to the list from Snowflake isn’t really practical right now.

Generating Hashes using Python inside Snowflake

Instead, what better way than to run the Python code directly inside Snowflake, against the same table, using Snowflake Python UDFs? Python UDFs, part of Snowpark, are a new feature currently in Private Preview, so not yet available to everyone. As the name suggests, this new feature allows users to create user-defined functions that use Python code; as shown in the diagram below, creating a UDF with Python installs Python bytecode into a secure sandbox for execution.

Basic Python UDF operation diagram in Snowflake

Here is what the SQL code to create the Python UDF looks like in Snowsight, with syntax highlighting, at least for the SQL parts. You can also find it farther down in this article if you want to copy it.

SQL code to create the Python UDF

The first part of this code is similar to creating functions in the other languages we already support: SQL, JavaScript, and Java. That section:

  1. names the function,
  2. provides the input argument names and types,
  3. specifies the return type,
  4. identifies the language (not necessary for SQL),
  5. specifies the runtime version (for Python), and
  6. gives the name of the handler function in the code (for Python and Java), since the code may include many functions.

The rest of the code, between the $$ markers, is the actual Python code. As you can see, that code is pretty simple. The md5 function comes right out of Python’s built-in hashlib library, and we have to encode the input, hash it, and then manipulate the result a little to get the lower 64 bits of numeric data and return that number as an integer.
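
Pulled out of the UDF, those steps look like this as standalone Python (using the sample string from earlier):

import hashlib

msg = "Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402"
digest = hashlib.md5(msg.encode("utf-8")).digest()  # 16 raw bytes
lower64 = digest[8:]                                # last 8 bytes = lower 64 bits
print(int.from_bytes(lower64, "big"))               # the integer the UDF returns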

Validating the Hashes with a SQL Query

Now that I have the Python UDF created, it’s a simple matter to create a query that computes the hash using both the built-in MD5_NUMBER_LOWER64 function and the UDF and returns only those combinations that don’t match:

SELECT MD5_NUMBER_LOWER64(TRACK) INTERNAL, 
MD5_NUMBER_LOWER64_PY(TRACK) UDF
FROM RAW_TRACKS
WHERE "INTERNAL" <> "UDF";

I ran this against all 598 GPS tracks, and didn’t get back any rows, so we now know that all the hashes are the same, at least for these few sample strings.

598 GPS tracks isn’t really all that big a sample size… I wonder if we can test a much larger data set? It’s not as if we don’t have plenty available. For example, Snowflake shares the SNOWFLAKE_SAMPLE_DATA database with every account. This database includes several different sizes of the TPC-H and TPC-DS benchmark databases, which can have huge amounts of data in them. I decided to use TPC-H SF1000, which has a customer table consisting of 150 million randomly generated customer names and addresses.

The customer names are “Customer#” followed by a number, and the addresses are completely random text. Combined, they give us 150 million unique text values such as Customer#044207085/9ON0dUnisbCaXsMDB3XBFrhmAWWPWWkSr33. After running the query below, there are still no mismatches between the internal function and the Python UDF. That’s still not definitive proof that the two will always return the same value, but it’s enough to make me feel pretty confident.

SELECT MD5_NUMBER_LOWER64(c_name || '/' || c_address) INTERNAL, 
MD5_NUMBER_LOWER64_PY(c_name || '/' || c_address) UDF
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER
WHERE "INTERNAL" <> "UDF";

Oh, and going back to my comment earlier about not getting perfectly even groups out of this, I ran another query to see how many of our TPC-H customers fall into each of the 1000 buckets:

SELECT MD5_NUMBER_LOWER64(c_name || '/' || c_address) % 1000,          
COUNT(*)
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER
GROUP BY 1
ORDER BY 2;

The resulting buckets contain from 148,572 to 151,156 users per bucket, with a median of 149,998. Every bucket is within 1% of the average of 150k.
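
If you want a feel for that spread without a Snowflake account, a rough local simulation on synthetic keys (an assumption on my part; these are generated strings, not the TPC-H rows) shows the same near-uniform behavior:

import hashlib
from collections import Counter

# One million synthetic keys, bucketed the same way as in the SQL above.
buckets = Counter(
    int.from_bytes(hashlib.md5(f"user-{i}".encode("utf-8")).digest()[8:], "big") % 1000
    for i in range(1_000_000)
)
print(min(buckets.values()), max(buckets.values()))  # both should land near 1,000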

Here’s the code for the UDF in-line in case you want to copy it:

CREATE OR REPLACE FUNCTION MD5_NUMBER_LOWER64_PY(msg STRING)
RETURNS INTEGER
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'compute_hash'
AS
$$
import hashlib

def compute_hash(msg):
    # Hash the UTF-8 bytes, take the last 8 of the 16 digest bytes
    # (the lower 64 bits), and read them as a big-endian integer.
    return int.from_bytes(hashlib.md5(msg.encode('utf-8')).digest()[8:], 'big')
$$;

And don’t worry, I’ll be adding the .fit files from my Garmin as well, but those are binary and require more conversion to be useful, so I haven’t gotten to them yet. That’s a story for another day, though.

Updates

A colleague pointed out that using VALUES can make for simpler queries in some of my examples above.

SELECT $1 AS WHO, 
MD5_NUMBER_LOWER64($2) AS HASH
FROM VALUES
('Me', 'Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402'),
('Mgr', 'Alex Rottinghuis/Snowflake/450 Concar Dr, San Mateo, CA 94402');

In this case, using two rows of VALUES lets me avoid the UNION I used originally (or UNION ALL, which is technically more efficient when you know there are no duplicate rows, since it skips the duplicate-elimination step).

Similarly, VALUES allows a single statement where I originally used a SET command and then embedded a variable in the query:

WITH CTE AS (
SELECT MD5($1) MD5,
TO_VARCHAR(MD5_BINARY($1)) MD5_BINARY,
TO_VARCHAR(MD5_NUMBER_UPPER64($1)) MD5_NUMBER_UPPER64,
TO_VARCHAR(MD5_NUMBER_LOWER64($1)) MD5_NUMBER_LOWER64
FROM VALUES
('Dylan Kaufman/Snowflake/450 Concar Dr, San Mateo, CA 94402')
)
SELECT * FROM CTE
UNPIVOT(HASH FOR FUNCTION IN
(MD5, MD5_BINARY, MD5_NUMBER_UPPER64, MD5_NUMBER_LOWER64));

His version of the matching query will always give some result, either showing that all the samples match, or showing how many do and how many don’t, rather than returning no rows like mine:

SELECT (MD5_NUMBER_LOWER64(TRACK) = MD5_NUMBER_LOWER64_PY(TRACK)) AS IS_MD5_MATCH,
COUNT(*) AS NUM_ROWS
FROM RAW_TRACKS
GROUP BY 1
ORDER BY 1;

Also, I used double quotes (") to mark various identifiers above, which wasn’t really necessary.


Dylan Kaufman
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Sales Engineer at Snowflake; biker, father, reader, programmer, long-time analytics consultant