Connecting SWRCB to the Open Data Portal using Snowflake Data Sharing

The California State Water Resources Control Board (SWRCB) has collaborated with Snowflake and the California Department of Technology (CDT) to bring Open Water Data to the Snowflake Marketplace and the California Open Data Portal.

This is the second installment of my previous article, “Connecting Snowflake to CKAN for publishing to the Open Data Portal”. This time, we worked with SWRCB to expand the work we did at CDT. As a brief review, our original objective was to link Snowflake tables to the Portal in a more automated fashion, replacing the Python script and server that were previously required.

Original Implementation

This diagram shows the good work that was done by the teams. A Streamlit interface allowed users to pick tables, add metadata, and kick off the job that would publish the data table. To do this, we used an External Function that called an Azure API Management service configured in CDT’s Azure account, which in turn let us reach the CKAN API and post a presigned URL to the portal. The original UI collected all the metadata required to create a new CKAN package and resource and associate a data table with it.
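For reference, the wiring behind that original flow looked roughly like the sketch below. This is a minimal illustration only; the integration name, function name, and Azure endpoint are placeholders, not the actual CDT objects.

    -- Sketch only: names and the Azure endpoint are placeholders, not CDT's real objects.
    -- API integration pointing at the Azure API Management service in CDT's Azure tenant
    CREATE OR REPLACE API INTEGRATION ckan_apim_integration
      API_PROVIDER = azure_api_management
      AZURE_TENANT_ID = 'placeholder-tenant-id'
      AZURE_AD_APPLICATION_ID = 'placeholder-app-id'
      API_ALLOWED_PREFIXES = ('https://example-apim.azure-api.net/ckan')
      ENABLED = TRUE;

    -- External Function that relays a presigned file URL (plus the target resource) to the CKAN API
    CREATE OR REPLACE EXTERNAL FUNCTION publish_to_ckan(resource_id STRING, presigned_url STRING)
      RETURNS VARIANT
      API_INTEGRATION = ckan_apim_integration
      AS 'https://example-apim.azure-api.net/ckan/resource_update';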

SWRCB maintains many datasets on the portal and was interested in trying out this pipeline. Their current process involves manually extracting files and uploading them via the Open Data Portal UI. While this was mostly fine for them, the one big pain point is that, as data changes, the data steward must constantly update and manage the resources. So we set out to automate this pipeline, but two things about SWRCB’s situation changed our approach:

  1. SWRCB Data Stewards like manually creating the CKAN Package and the Resource via the Open Data Portal UI because it gives them control over the look and feel of the page as well as explicit control over the metadata.
  2. SWRCB does not have a significant Azure presence so configuring the Azure API Management service was going to be a challenge.

Changing the Pipeline

To solve our first problem we created a new GitHub repository forked from the original CDT repo. Instead of having the Streamlit UI provide all the package and resource information and push it over the API in one request, we simplified the process to only updating the resource URL. The program staff would create the package and the resource in the Open Data Portal. From there, IT staff needed to identify the table to publish in Snowflake and link it to the resource UUID in CKAN. Originally, we wanted to offer a drop-down experience in the Streamlit UI, but we ran into the same issue: we couldn’t configure an External Function at SWRCB. In addition, because they have a preference for Microsoft and the .NET framework, we couldn’t install or support open-source Streamlit on client devices at SWRCB. We will solve that later with the Native App Framework, but for now it means IT staff would execute an INSERT into the resources table to kick off the process.

DDL for the resources table
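Conceptually, the resources table is just a mapping between a Snowflake table and a CKAN package/resource. A minimal sketch follows, with illustrative column names rather than the exact ones in the repository:

    -- Illustrative sketch of the mapping table; see the repository for the actual DDL
    CREATE TABLE IF NOT EXISTS ckan_resources (
      snowflake_database  STRING,         -- where the data to publish lives
      snowflake_schema    STRING,
      snowflake_table     STRING,
      ckan_package_id     STRING,         -- package created by program staff in the portal UI
      ckan_resource_id    STRING,         -- resource UUID created by program staff in the portal UI
      last_published_at   TIMESTAMP_NTZ   -- bookkeeping for the refresh pipeline
    );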

As you can see in the DDL, we’re now just mapping a Snowflake table to the package and resource IDs in CKAN. Once we insert a record into the resources table, a stored procedure unloads the data to a file in an internal stage, where we can generate a presigned URL to post to the resource in CKAN.
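Under the same assumed object names, kicking off a publish looks something like the following: IT staff insert the mapping row, and the stored procedure’s core steps boil down to an unload plus a presigned URL lookup. Again, this is a sketch under those naming assumptions, not the exact code in the repository.

    -- IT staff link a Snowflake table to an existing CKAN resource (illustrative values)
    INSERT INTO ckan_resources
      (snowflake_database, snowflake_schema, snowflake_table, ckan_package_id, ckan_resource_id)
    VALUES
      ('WATER_DB', 'PUBLIC', 'SAFER_WATER_SYSTEMS', 'placeholder-package-id', 'placeholder-resource-uuid');

    -- Inside the stored procedure (sketch): unload the table to a single CSV file in an internal stage...
    CREATE STAGE IF NOT EXISTS publish_stage;
    COPY INTO @publish_stage/safer_water_systems.csv
      FROM WATER_DB.PUBLIC.SAFER_WATER_SYSTEMS
      FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE)
      HEADER = TRUE
      SINGLE = TRUE
      OVERWRITE = TRUE;

    -- ...then generate a presigned URL (seven-day maximum lifetime) for CKAN to serve as the download link
    SELECT GET_PRESIGNED_URL(@publish_stage, 'safer_water_systems.csv', 604800);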

But we had one last problem: we weren’t able to use an External Function from the SWRCB side, so how do we make the API calls from the SWRCB Snowflake account?

Data Sharing to the Rescue

This is a fantastic example of data sharing in action that helped connect two CA State departments without the usual extract-and-upload process. Instead, we created a Snowflake Data Share from SWRCB (provider) to CDT (consumer). SWRCB shares the resources table with CDT. CDT has a stream on the shared table, so any time SWRCB makes an update, CDT sees the changes in real time. A CDT task then kicks off the External Function that was already configured to post the URL to the Open Data Portal. Awesome! We are able to reuse the work that CDT did, keep the data under SWRCB’s control, and any time a public user clicks the link on the Open Data Portal, the link points them directly to the SWRCB Snowflake account.
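In Snowflake terms, the hub-and-spoke wiring looks roughly like the sketch below. Object and account names are illustrative, and one assumption worth calling out: the provider has to enable change tracking on the shared table so the consumer-side stream can see changes.

    -- SWRCB (provider): enable change tracking and share the resources table with CDT
    ALTER TABLE water_db.public.ckan_resources SET CHANGE_TRACKING = TRUE;
    CREATE SHARE IF NOT EXISTS open_data_share;
    GRANT USAGE ON DATABASE water_db TO SHARE open_data_share;
    GRANT USAGE ON SCHEMA water_db.public TO SHARE open_data_share;
    GRANT SELECT ON TABLE water_db.public.ckan_resources TO SHARE open_data_share;
    ALTER SHARE open_data_share ADD ACCOUNTS = cdt_account;  -- placeholder consumer account identifier

    -- CDT (consumer): mount the share, watch it with a stream, and react with a task
    CREATE DATABASE swrcb_shared FROM SHARE swrcb_account.open_data_share;  -- placeholder provider account identifier
    CREATE STREAM ckan_resources_stream ON TABLE swrcb_shared.public.ckan_resources;

    CREATE TASK publish_ckan_updates
      WAREHOUSE = publish_wh
      SCHEDULE = '1 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('CKAN_RESOURCES_STREAM')
    AS
      CALL publish_resource_updates();  -- wraps the existing External Function call to the CKAN API
    ALTER TASK publish_ckan_updates RESUME;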

Data Sharing to reuse a programmatic interface

You can see the results for yourself. The SAFER Drinking Water dataset identifies Failing and At-Risk community water systems. The package level shows the metadata and an explanation of the data itself, but at the resource level (where the link goes), the download link actually pulls directly from Snowflake. It’s seamless, and all the infrastructure is abstracted away.

Sharing on the Snowflake Marketplace

The CA Open Data Portal is a great place to get all kinds of data. However, one of the problems with the data on the portal is that you have to download it yourself, load it into your own data system, and transform the columns into appropriate data types. In addition, popular datasets can strain the hardware underlying the Open Data Portal, creating situations like when Elon Musk tweetstormed (is it an X-storm now?) about COVID-19 hospital data and crashed the HHS portal. On the Snowflake Marketplace, data consumers bring their own compute for leveraging the data, which makes sharing data on the Marketplace extremely scalable and cost efficient. In addition, because the data shared with you is real-time and strongly typed, there’s no need for the usual extract, transform, and load (ETL) process. For the departments that leverage Snowflake, we can grab that data freely from the Marketplace, and for the others that don’t, the same data is made available in the Open Data Portal.
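For example, once a department accepts the listing in its Snowflake account, the data is immediately queryable with its original types. The database and column names below are placeholders for however the listing is mounted and the dataset is modeled, purely to illustrate the “no ETL” point:

    -- Placeholder database/column names: query the mounted Marketplace share directly, no load step
    SELECT water_system_id, risk_status, last_updated
    FROM ca_open_water_data.public.safer_water_systems
    WHERE risk_status = 'FAILING'
    ORDER BY last_updated DESC;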

The Future State for the Open Data Connector

This effort has been great for providing an automated way of updating the data on a regular basis. Whenever data changes in Snowflake, it’s now updated on the Open Data Portal within the minute. However, it’s not without its quirks. If you review the repository, you’ll see that there are a handful of streams, tasks, and tables that all need to be deployed in the consumer account. You also have to hand-create the data share between the agencies to get access to CDT’s External Functions. If we were to scale this out to more departments, we would have to lay down all those objects in the other departments’ accounts, and whenever code changes are made, whether from a request or the identification of a bug, we would need to apply them to every “spoke” account. This would be hard to manage. In addition, installing the Streamlit app became a hurdle due to security and support concerns. However, Snowflake has introduced a couple of new features that are going to help us solve these problems moving forward. Currently in Private Preview (as of Nov 11 2023) are the Native App Framework, Streamlit in Snowflake, and External Access. These will help us solve the code management problem as well as the External Function requirement. More on this in a future article.

Thank you

A big thank you to both the teams at CDT and the teams at SWRCB for their partnership, collaborative attitudes, and for thinking differently about how we can approach government data solutions. 🙏
