Sharing Data with Snowflake Data Sharing

Robert Fehrmann
Snagajob Engineering
5 min read · Jun 26, 2017


If you have been working as a data engineer for a while, you have probably come across a request to build a one-time, quick-and-dirty data extract to share a data set with another company. No big deal, right?

That’s what I thought as well when a co-worker recently asked me whether we could share some data with a new third-party vendor our marketing department wanted to evaluate. Traditionally, we would implement something similar to the following steps (the whole manual flow is sketched in code right after these lists):

  • Identify the data elements we want to share (let’s assume the data is in a relational database)
  • Extract the data set via some client tool
  • Zip the output into one file so it can be easily attached to an email
  • Send the output via email to our partner

And at the partner’s side:

  • Save the attachment
  • Build a table with the appropriate columns and data types
  • Unzip the attachment
  • Import the data set into the target database

And voilà, we are all done. But are we really?

Usually, what happens next is that the real work is only just beginning.

  • Of course, the data set is much bigger than we originally thought, and all of a sudden the extraction process runs into problems because of the size of the data set. Maybe we have to break the data set up into multiple parts. So we pick our favorite scripting language and start automating the process.
  • Then we discover that the data set includes sensitive information, so the output files need to be encrypted. More coding is needed.
  • Of course, email doesn’t work anymore and we have to switch the process to FTP, which requires more coding.
  • Over time, we discover that we need to share more attributes. So we probably have to change the file format multiple times. Of course, the table structure on the other side has to change as well. This is usually a slow and painful process.
  • Then we discover that the integration would be so much more valuable if the dataset could be shared in a more real-time fashion, rather than being sent only once per night. Thus, more coding is needed.
  • Then the import process starts to act up and we find that the data isn’t as clean as we thought it was. Maybe there are some special characters in the extract or quotes that should have been escaped. So we need to build a more sophisticated data extraction process. Now even more coding is needed.
  • And last but not least, after the process has failed multiple times on the extraction and/or import side, we add retry logic and monitoring.

Does this sound familiar? Well, consider yourself lucky if it doesn’t. Unfortunately for me, I’ve been there at least a few times, and I kept wondering whether there was a better way: a way to share data that doesn’t require so much development and that is more reliable, more secure, and closer to real time.

About a year ago, Snagajob transitioned to a new data warehousing platform called Snowflake. We tested it and concluded that Snowflake is the first and only data warehousing platform specifically born in the cloud and built for the cloud. One of the key differences compared to other high-performance data warehousing platforms is that data in Snowflake isn’t stored with the compute layer (as it is, for instance, in Hadoop). Instead, the compute and storage layers are completely separated. In AWS terms, the compute layer is implemented via EC2 instances and the storage layer via S3. This key difference is the foundation for almost all of the incredible properties of the Snowflake data warehouse service, for instance “elasticity”: not only can Snowflake scale up and out, it can also scale down and contract, all through simple API calls. It is just as easy to start multiple warehouses on exactly the same data sets, which can be used to separate traffic between internal and external users. Snowflake now takes this further with a brand new feature called Snowflake Data Sharing.
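
As a taste of that elasticity, here is a minimal, hypothetical sketch that resizes a warehouse and spins up a second one on the same data, issued through the snowflake-connector-python package. The warehouse names, sizes, and credentials are placeholders, not our actual setup.

    import snowflake.connector

    # Placeholder credentials; a real script would pull these from a secrets store
    conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
    cur = conn.cursor()

    # Scale the existing warehouse up for a heavy load, then back down afterwards
    cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'")
    cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL'")

    # Start a second warehouse that queries exactly the same data,
    # e.g. to keep external traffic away from internal users
    cur.execute(
        "CREATE WAREHOUSE IF NOT EXISTS partner_wh "
        "WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
    )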

So what is Data Sharing? As mentioned previously, data in Snowflake is stored in S3, so a single copy of the data can support not just one but multiple warehouses within the same Snowflake instance. Data Sharing essentially extends this concept from one Snowflake instance to multiple external Snowflake instances. Brilliant.

Technically, the object that makes Data Sharing possible is a “share”. A “share” is a handle that is exported on the source instance and imported on the target instance. The magic is that a share encapsulates everything essential: which tables or views we want to share and with whom we want to share them. Details on how to configure Snowflake Data Sharing can be found here.

And that’s it. It was as easy as advertised. All I had to do to share data with our marketing partner was create a “share” with the proper permissions on the data to be shared. Though it is possible to share the base tables directly, we decided to create and share a view on top of our abstraction layer instead. That way we could hide the complexity of our own data model and keep tighter control, since we didn’t want to share all of the data in the base tables.
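
For the curious, the whole setup boils down to a handful of statements. The sketch below (again via snowflake-connector-python, with hypothetical database, view, share, and account names that are not our real ones) creates the share, exposes only a curated view rather than the base tables, and shows in comments how the partner imports it on their side.

    import snowflake.connector

    # Placeholder credentials for the provider (source) account
    provider = snowflake.connector.connect(account="provider_acct", user="my_user", password="***")
    cur = provider.cursor()

    # Share a curated view instead of the base tables, hiding our internal data model
    cur.execute(
        "CREATE SECURE VIEW analytics.public.partner_metrics_v AS "
        "SELECT campaign_id, job_id, applied_at FROM analytics.public.applications"
    )

    # Create the share and grant it access to the database, schema, and view
    cur.execute("CREATE SHARE partner_share")
    cur.execute("GRANT USAGE ON DATABASE analytics TO SHARE partner_share")
    cur.execute("GRANT USAGE ON SCHEMA analytics.public TO SHARE partner_share")
    cur.execute("GRANT SELECT ON VIEW analytics.public.partner_metrics_v TO SHARE partner_share")

    # Make the share visible to the partner's Snowflake account
    cur.execute("ALTER SHARE partner_share ADD ACCOUNTS = partner_acct")

    # On the partner's side, the share is imported as a read-only database:
    #   CREATE DATABASE shared_analytics FROM SHARE provider_acct.partner_share;
    #   GRANT IMPORTED PRIVILEGES ON DATABASE shared_analytics TO ROLE analyst;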

Let’s see how Data Sharing compares to the challenges listed above:

  • Source and target systems use exactly the same metadata definition:
      • No more metadata inconsistencies between source and target, and no more latency between schema updates on the source and the target system
  • Source and target systems use exactly the same data sets:
      • No more stale data. Changes don’t have to propagate from source to target, since both systems are reading the same data
      • No need for data extraction, transmission, or import processes, which introduce latency and opportunities for failure
      • Data never travels between source and target, which reduces the risk of data leakage

Summary:

Snowflake Data Sharing is a fantastic new feature that not only dramatically reduces the need to code and maintain your own application for sharing data, but also dramatically improves the performance, reliability, security, and agility of sharing it. In this particular case I’d estimate that total implementation time dropped from about a week to just a few hours. That not only saves money, but it becomes more and more important in a world where the fast win against the slow, not the big against the small!
