Custom Function Validation in Snowflake, Part 3: JavaScript or _____

Staging external hash results to validate in Snowflake

Previously On…

In two earlier articles, I’ve described how to create a Python or Java UDF in Snowflake to replicate one of Snowflake’s built-in hashing functions, and then query various amounts of data to verify with some confidence that the UDF and the built-in function are generating the same result.

This is by far the easiest way to validate such a function: because everything happens inside Snowflake, we don’t have to import or export any data, there’s no risk of the data being left somewhere we don’t want it, and we don’t have to worry about finding or supplying compute resources.

Having done that, we can have reasonable confidence that using the same Python or Java code outside Snowflake will get us the same result as running the built-in function inside Snowflake. Indeed, this was the original request from a customer:

We need a Python version of MD5_NUMBER_LOWER64, so that we can bucket users both in Snowflake and outside Snowflake, but the one we wrote isn’t working; can you help?

Not an exact quote.

What about JavaScript?

Since Snowflake has long supported JavaScript for UDFs and stored procedures, it would seem very logical to do the same thing in JavaScript that I’ve already done with Python and Java.

Unfortunately, JavaScript doesn’t include a built-in crypto library and our JS UDFs don’t support importing an external crypto library such as CryptoJS. Although it would very likely be possible to include the code from an external library in the UDF, that seems like a lot of work just for this validation.

If you really want to do this with JavaScript, you can follow the general instructions in my earlier articles, but build a UDF in JS that includes the full code for the MD5 hash function, so that you can do it all in Snowflake.

I’m going to leave that as an exercise for readers, though. That said, you could use JavaScript as your language of choice for the method I’m going to describe below.

Well, OK, but can I use <fill-in-the-blank>?

As I’ve mentioned before, Snowflake supports UDFs with Python, Java or JavaScript. It is easy to imagine, though, that you want to generate a value in some language we don’t support. Maybe you have an app in Go and you need to generate the same hash in the app that’s going to be available in Snowflake. Or maybe you use Ruby. Or Julia. Or TypeScript, even if that is basically JavaScript.

Fair enough. Snowflake provides connectors and drivers for a bunch of different languages; between the language-specific interfaces and the JDBC & ODBC drivers, it should be possible to connect to Snowflake from just about anything.

There are some other ways to do it, but the method I’m going to describe is as shown in this diagram: run a hash on customer data, load the customer keys and hash into Snowflake, and then compare the custom hash to the Snowflake hash for each customer key on the fly.

Diagram showing data flow and processing
  1. Create the hash function in the language of your choice
  2. Write a script (a Python sketch follows this list) that will:
    - connect to Snowflake and query for the sample data,
    - use the hash function to generate the hash for each sample value,
    - write the sample value & hash out to a CSV or JSON file (or a collection of files, if there’s a lot of data involved), and
    - deliver the data file(s) to Snowflake
  3. Run a query in Snowflake to apply the built-in function to each sample value and compare to the hash, returning rows (or a count) if there are any discrepancies
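Here is a minimal Python sketch of that script (step 2), assuming the Snowflake Python connector; the connection parameters and file name are placeholders, compute_hash stands in for whatever hash implementation you’re validating, and the delivery step (2d) is covered in its own section below.

import csv
import gzip
import snowflake.connector

def compute_hash(value: str) -> int:
    # Placeholder for your language's equivalent of MD5_NUMBER_LOWER64
    # (a Python version is sketched in the next section).
    raise NotImplementedError

# Connection parameters are placeholders -- use your own account details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="my_wh"
)
cur = conn.cursor()

# 2a. Connect to Snowflake and query for the sample data
cur.execute(
    "SELECT c_name || '/' || c_address AS INPUT "
    "FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER"
)

# 2b/2c. Hash each sample value and write value/hash pairs to a gzipped CSV file.
# (A single file here for simplicity; for 150 million rows you would split the
# output across many files, as described below.)
with gzip.open("data_0_0_0.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    for (input_value,) in cur:
        writer.writerow([input_value, compute_hash(input_value)])

# 2d. Deliver the file(s) to Snowflake -- see "Delivering the External Hash Data
# to Snowflake" below for the options (PUT via SnowSQL, a connector, etc.).
conn.close()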

I chose this approach because it involves relatively few steps, we’re only moving data into Snowflake rather than out of it, and it is generally simpler than some of the other options.

Generating External Hash Data

Let’s say I need to use PYFL (Pick-Your-Favorite-Language) to generate the hash in a non-Snowflake compute engine. I need to make sure that the PYFL version of the hash will match the Snowflake version. To do that validation, I wrote a program in PYFL that connected to my Snowflake account and ran a query like the following against the TPC-H SF1000 sample database:

SELECT c_name || '/' || c_address AS INPUT
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER;

Next, it took the 150 million records and generated a hash for each one, using a PYFL version of MD5_NUMBER_LOWER64, pairing each input value with its 64-bit hash.
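For reference, MD5_NUMBER_LOWER64 returns the lower 64 bits of the 128-bit MD5 digest as an unsigned integer, so one plausible Python equivalent treats the digest as a big-endian number and keeps the low 64 bits, as in the sketch below. Whether that assumption holds exactly is, of course, what this whole validation exercise is meant to confirm.

import hashlib

def md5_number_lower64(value: str) -> int:
    """Lower 64 bits of the 128-bit MD5 digest, read as a big-endian unsigned integer.
    Whether this matches Snowflake's built-in exactly is what the validation confirms."""
    digest = hashlib.md5(value.encode("utf-8")).digest()  # 16 bytes
    return int.from_bytes(digest[8:], byteorder="big")    # last 8 bytes = low 64 bits

# Quick example with an arbitrary input string
print(md5_number_lower64("Customer#000000001/123 Example Street"))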

Then, it wrote all the rows into a collection of CSV files, splitting the data across many files named hash_test/csv/data_#_#_#.csv and compressing them with gzip, just to reduce the amount of data being moved.

Finally, it delivered the resulting 328 csv.gz files to a named internal Snowflake stage called TESTDB.PUBLIC.TEST. Snowflake’s LIST command lets us see what’s in a stage; the screenshot below shows 5 of the files, as displayed by LIST @TESTDB.PUBLIC.TEST/hash_test/csv;:

Delivering the External Hash Data to Snowflake

At this point, we have a list of 150 million values with their PYFL hashes, and we need to generate the hash for each value with the Snowflake function and compare it to the PYFL result.

In order to run the Snowflake internal function on the sample data and compare the results to the external data within Snowflake, we have to get the data into Snowflake. I’ve mentioned delivering it to a named internal stage, but we could also use a user or table stage, which are also types of internal stages, or we could even use an external stage on AWS S3, Azure Blob Storage, or Google Cloud Storage. Here’s a quick summary of the pros and cons:

  1. user stage: every user has a user stage accessible only to them; nothing to create, just upload the files to it
  2. table stage: only exists with a table, and is intended specifically for files meant & formatted for that table; any user with access to the table can use the stage
  3. internal named stage: easy to create and easy to use, just upload the files to it once it is created; anyone with access to the stage can use it
  4. external stage: requires an account in one of the 3 supported cloud storage services, and takes a bit more effort to create than an internal named stage, but it may be easier to write data files directly to the blob storage and avoid writing them locally and then uploading them; anyone with access to the stage can use it

If you’re writing the files to blob storage, “delivering the data to Snowflake” just means setting up the stage object in Snowflake, if you haven’t already done so. If you’re writing the files locally and then moving them into one of the internal stages, or writing directly to a table, you have a bunch of choices, including SnowSQL or one of Snowflake’s many connectors and drivers.
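For example, with the Python connector, creating the named internal stage (if it doesn’t exist yet) and uploading the locally written files might look roughly like the sketch below; the connection details and local path are placeholders.

import snowflake.connector

# Placeholder connection details -- use your own account.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

# Create the named internal stage if it doesn't exist yet.
cur.execute("CREATE STAGE IF NOT EXISTS TESTDB.PUBLIC.TEST")

# Upload the already-gzipped CSV files; AUTO_COMPRESS=FALSE so they aren't re-compressed.
cur.execute(
    "PUT file:///tmp/hash_test/csv/data_*.csv.gz "
    "@TESTDB.PUBLIC.TEST/hash_test/csv/ AUTO_COMPRESS=FALSE"
)

# Confirm the files landed in the stage.
for name, size, *_ in cur.execute("LIST @TESTDB.PUBLIC.TEST/hash_test/csv"):
    print(name, size)

conn.close()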

In the case of my PYFL program, as I mentioned, it split the data into a set of gzipped CSV files and delivered them to an internal named stage.

Comparing with the Internal Function

With the files sitting in the stage, the next step for a general data pipeline use case would likely be to ingest the data into tables for further querying and ELT usage. In this particular case, though, we’ll keep it simple by querying directly against the stage:

SELECT TOP 50 $1, $2
FROM @TESTDB.PUBLIC.TEST/hash_test/csv;

The $1 and $2 tokens simply represent the two columns in the files; if there were more columns, each would have a number. We can run the validation query that way too, without actually bothering to ingest it into a Snowflake table first:

SELECT $2 = MD5_NUMBER_LOWER64($1) AS MATCH,
COUNT(*) AS NUM_ROWS
FROM @TESTDB.PUBLIC.TEST/hash_test/csv
GROUP BY MATCH;

This query will return one or two rows, telling us how many of the hashes match and how many don’t. If our PYFL function is doing what we want, there will be one row with the MATCH column showing TRUE.

Just to be clear — we’re actually running this query directly against the compressed CSV files, without taking the time to load them into a table.

If there are hashes that don’t match, they will show up on a row with the MATCH column set to FALSE. In that case, we may want to see some of the values where the hashes don’t match. For that, we’ll run a slightly different query that displays each input value where the internal and external hashes differ, along with both hashes.

SELECT $1 AS INPUT,
$2 AS EXT_HASH,
MD5_NUMBER_LOWER64($1) AS INT_HASH,
METADATA$FILENAME
FROM @TESTDB.PUBLIC.TEST/hash_test/csv
WHERE EXT_HASH <> INT_HASH;

I’ve also thrown in another potentially interesting piece of information: the METADATA$FILENAME column will display the name of the file the input row came from, which could be useful for debugging in some cases. There’s also a METADATA$FILE_ROW_NUMBER column, which will show the row within each file.

Naturally, the pretend PYFL function I wrote worked flawlessly the very first time, so the first query returns only a single row showing 150m matches, while the second returns none.

Pay No Attention to the Man Behind the Curtain

Hey, remember when I said “let’s say” earlier? Well, I didn’t actually write a function in another language and write a wrapper around it to connect to Snowflake and all that stuff. I’d have included all the code if I had!

No, instead, I took advantage of my original Python UDF and Snowflake’s COPY INTO <location> feature, which allows writing any query out to data files in a stage in CSV, JSON, or Parquet format.

Here’s the query I used; I kept this pretty simple, just telling it to prefix all the files with hash_test/csv/ and format them as CSV.

COPY INTO @TESTDB.PUBLIC.TEST/hash_test/csv/
FROM (SELECT c_name || '/' || c_address AS INPUT,
TESTDB.PUBLIC.MD5_NUMBER_LOWER64_PY(INPUT) AS HASH
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER)
FILE_FORMAT = (TYPE = CSV);

This simulated the behavior of the PYFL code I described: use a custom function to generate a hash and put the data into a bunch of files that can be read in Snowflake.

If you’re curious, the numbers in the file names, shown in the LIST output earlier, represent the node number of the compute node within the virtual warehouse, the CPU core number, and a sequential number for how many files were written by that node. _0_1_38 means, since they are all zero-based, it is the 39th file written by node 0’s CPU core 1.

But Wait! There’s More!

With these three articles, we’ve seen that whether you’re using Python, Java, or some other language, there is a way to validate functions that need to be run both inside and outside Snowflake. In fact, this is true even if the function running inside Snowflake is not one of Snowflake’s built-in functions.

Here are the scenarios we’ve covered specifically:

  • Use Python or Java inside Snowflake to compare the outputs of a custom function and a built-in Snowflake function, entirely within Snowflake
  • Use Snowflake to validate data generated by a custom function in just about any language outside Snowflake against a built-in Snowflake function

We aren’t limited to those uses, though. Here are some more possibilities, just in the ‘validating function output’ category:

  • Use Snowflake to generate data for validation outside Snowflake against a custom function in any language: generate hashes in Snowflake, export the sample data & hash values to files, then validate with a function written outside of Snowflake. This requires moving data out of Snowflake, which can take time and is a potential security risk. It also requires having compute resources on which to receive and process the data.
  • Snowflake also supports External Functions, which are a type of UDF configured to call out to a third-party service, whether that’s a custom function you’re running in AWS Lambda (or similar) or a function actually provided by someone else. In this case, it mirrors my original Python and Java samples, except that the data is being sent over to the external service provider and the results are being sent back (over HTTPS, so it should be fairly secure, as long as you trust the external service).
  • We don’t have to be validating against a built-in function: if we need to be able to generate the same output with Python and/or Java, and another language, we could combine these techniques: deploy the Python and Java functions into Snowflake, load data from the third language’s function into Snowflake, and then use a single query to validate all three against each other (a rough sketch follows this list).
  • We don’t even necessarily need to be planning to use Snowflake in general. The ability to run a custom function against a substantial amount of data makes it a great option for this sort of validation, even for functions that won’t be run in Snowflake other than for the validation. Suppose you have two versions of an app, one in Python and the other in Java, and you’ve written a function meant to do the same thing in each — again, deploy the two functions into Snowflake, then query them against a large amount of data to validate that they are returning the same values.
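As a rough illustration of that three-way check, a single aggregate query over the staged files could compare everything at once. MD5_NUMBER_LOWER64_PY is the Python UDF used in the COPY INTO example below, while MD5_NUMBER_LOWER64_JAVA is a hypothetical name for a Java UDF deployed the same way; the connection details are placeholders.

import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")

# $1 is the sample value and $2 is the external (third-language) hash staged earlier.
# MD5_NUMBER_LOWER64_JAVA is a hypothetical Java UDF name.
three_way_check = """
    SELECT $2 = TESTDB.PUBLIC.MD5_NUMBER_LOWER64_PY($1)
           AND $2 = TESTDB.PUBLIC.MD5_NUMBER_LOWER64_JAVA($1) AS ALL_MATCH,
           COUNT(*) AS NUM_ROWS
    FROM @TESTDB.PUBLIC.TEST/hash_test/csv
    GROUP BY ALL_MATCH
"""
for all_match, num_rows in conn.cursor().execute(three_way_check):
    print(all_match, num_rows)

conn.close()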

Ultimately, the many ways Snowflake supports running complex functions can be used to do a great deal more than this sort of validation, such as:

  • machine learning scoring
  • inference (such as sentiment analysis)
  • parsing unstructured or complex semi-structured data into something more easily queried
  • image or video analysis (such as manufacturing QA or equipment damage analysis)
  • audio analysis (such as transcribing physician’s recorded visit notes)
  • converting PDF or other formats into semi-structured or tabular formats
  • and many more

“Just One More Thing” — Columbo

If you loaded a big chunk of data into a stage for the validation, you’ll probably want to remove it when finished, rather than let it take up that space. Don’t forget…
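For instance, Snowflake’s REMOVE command will clear out everything under the path used for the validation files; here it’s run through the Python connector (with placeholder connection details), though a worksheet or SnowSQL works just as well.

import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")

# Delete the staged validation files now that the comparison is done.
conn.cursor().execute("REMOVE @TESTDB.PUBLIC.TEST/hash_test/csv")

conn.close()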

Dylan Kaufman
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Sales Engineer at Snowflake; biker, father, reader, programmer, long-time analytics consultant