CKAN Open Data Connector in Snowflake using Native App Framework

Gabriel Mullen
10 min read · Feb 3, 2024


CKAN Open Data Connector App on the Snowflake Marketplace

Snowflake has collaborated with the California Department of Technology (CDT) and the California State Water Resources Control Board (SWRCB) to implement the CKAN Open Data Connector using the Snowflake Native App Framework. The California Open Data Portal is a hub for making government data publicly available and is implemented using the open source software known as CKAN. In combination with Snowflake Secure Data Sharing, this connector will reduce infrastructure internally at CDT for publishing data as well as extend that capability to all CA State Departments on Snowflake. The process automates data synchronizations to the Open Data Portal.

Scalable application development and deployment with Snowflake

Previously, as described in “Connecting Snowflake to CKAN for publishing to the Open Data Portal,” we connected to the Open Data Portal using a series of bespoke streams, tasks, data shares, and external functions. We extended this to SWRCB, as described in “Connecting SWRCB to the Open Data Portal using Snowflake Data Sharing,” which allowed the Water Boards to piggyback on the External Function deployed at CDT. While this worked well between the two departments, there were two primary pain points we aimed to solve:

  • Code needs to be deployed in each consumer Snowflake account. For multiple deployments, this becomes difficult to scale and nearly impossible to keep updated.
  • CDT acts as a middleman between Data Providers and the Open Data Portal, which creates an additional point of failure when errors occur.

To solve this, we used the Snowflake Native App Framework to centralize application development and scalably deliver to an unlimited number of consumers. Now California state departments can install the Open Data Connector directly from the Snowflake Marketplace for free and publish data directly to the Open Data Portal. With the codebase centralized and delivered through a single channel, application updates can be made in one place and delivered to all consumers at one time. This solves the problem of managing the codebase across multiple Snowflake accounts. Furthermore, with the use of External Access, we can remove the dependency on CDT for delivering the API call and make it directly from the data provider’s Snowflake account.

Once a file is in a published state, there’s no further interaction to keep the data fresh and accessible to the portal.

Snowflake Data Share

Snowflake data sharing allows organizations to securely share data across their business ecosystem. With Snowflake Secure Data Sharing, users can access live data across the organization, control governed access, and publish data for discovery and controlled access. With Snowflake, organizations can be data consumers, data providers, or both. Data sharing excels in business to business collaboration. Many California state departments, including California Energy Commission, California Department of Motor Vehicles, California Air Resources Board, California Department of Technology, California Department of Social Services, and others, use this technology today to securely share data without the usual middleware to copy and transform data.
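As a minimal sketch of the mechanics described above (object and account names are hypothetical, for illustration only), creating a share and exposing a table through it looks roughly like this:

```sql
-- Create a share and expose a database, schema, and table through it
CREATE SHARE water_quality_share;
GRANT USAGE ON DATABASE open_data TO SHARE water_quality_share;
GRANT USAGE ON SCHEMA open_data.public TO SHARE water_quality_share;
GRANT SELECT ON TABLE open_data.public.test_results TO SHARE water_quality_share;

-- Make the share available to a consumer account; no data is copied,
-- the consumer queries the provider's live tables
ALTER SHARE water_quality_share ADD ACCOUNTS = partner_org_account;
```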

Scalable Data Sharing

During the pandemic, California Local Health Jurisdictions benefited from real-time access to test results from the state as a result of this feature. But for the State of California, citizen data belongs to the citizens, and the state is just a steward of that data. Government transparency, reliability, privacy, and fairness are guiding factors the state uses to make decisions in support of the public. As such, and especially during intense, politically charged moments like the pandemic, the state needs to make that data available as broadly as possible. So CDT looks to the Open Data Portal as an additional channel to make the data available to the public. The Open Data Connector App allows for dual channeling: making data available through a data share for Snowflake users, and also directly available to the public.

This same data sharing mechanism is the foundation for the distribution of our Snowflake Native App. Let’s review what the app looks like and explore the components along the way.

Installing the Open Data Connector

Install the application with the click of a button

In the past iteration of the Connector, deploying it meant manually creating tables, streams, and tasks in the Data Provider’s account. If we had matured that approach, we might have been able to deploy it from a GitHub repo with a couple of parameters, but there would still have been many variables to set up in order to scale out to different consumers properly. With the Native App Framework, developers publish the code to the Marketplace or in a private listing. For testing purposes, the screenshot above shows a Private Listing that’s ready to be installed. Once you “get” the application, Snowflake begins executing the setup steps. The documentation for Native Apps is excellent, so if you want to get started, that’s the go-to resource. That is all that’s necessary for the initial install; Snowflake will send us an email when it’s complete.

A successful installation

The User Interface

As part of the Native App, a Streamlit in Snowflake (SiS) interface is created to navigate users through the process of picking data in their Snowflake account and associating it to a pre-configured resource in the Open Data Portal.

Introduction Page

SiS is a lightweight, Python-based UI framework that results in fewer lines of code for dynamic interactions. During my career as a CA State employee, and working for systems integrators in Sacramento, I wrote a lot of .NET applications that were deployed on IIS. Anything other than that, and I took more of a backseat toward the “middle layer” or database administration. Until recently, I hadn’t had much experience with Python, yet I was able to make quick work of developing a Streamlit UI thanks to the many native widgets and the hands-off approach to deployment. There were no virtual machines to instantiate, no bucket configurations, no integrations between layers. Snowflake is already a well-architected solution, and by using the framework, we’re standing on the shoulders of giants.

Configuring the app

There are a few initial setup steps the consumer must go through in order to give the app enough permissions and visibility to do its work. This is a result of the strong security model that is in place as part of the framework, but it requires some out-of-band processing: the consumer will need to open a worksheet and run some GRANT statements to give the app the appropriate access.

First, I am going to ask the consumer to grant the app access to a particular virtual warehouse. I could have used the application manifest to allow the app, as part of the install, to create a serverless task. That would have simplified the process, but we wanted to be transparent about the compute cost incurred in the account. As such, we thought it was a good idea to let the consumer identify the warehouse in their account, so they can either consolidate workloads into a single warehouse or separate them out for cost tracking. In this case, we leave that preference up to the consumer. Type in an existing warehouse name, run the generated GRANT statements in a separate worksheet, and hit CREATE.
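The generated statement is along these lines (the app and warehouse names here are illustrative, not the actual names the app generates):

```sql
-- Allow the installed app to run its scheduled tasks
-- on a warehouse the consumer already owns and pays for
GRANT USAGE ON WAREHOUSE publish_wh TO APPLICATION open_data_connector;
```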

Creating a task on consumer VWHs

Next, the app uses External Access to call out to the underlying CKAN API that the California Open Data Portal is based on. This requires authenticating with an API key that is generated via the Data Portal. The app stores this key in a SECRET object on the consumer side. The provider does not know anything about this value.
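Under the hood, the key lands in an encrypted SECRET object, roughly like this (the object name and placeholder value are illustrative):

```sql
-- Store the CKAN API key in an encrypted secret inside the app's schema;
-- the provider never sees this value
CREATE OR REPLACE SECRET config.ckan_api_key
  TYPE = GENERIC_STRING
  SECRET_STRING = '<api-key-generated-in-the-data-portal>';
```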

Password protected UI widget passes the value to an encrypted SECRET object

In the third step, we take the secret object we just created and integrate it with an External Access Integration object, which allows access to the Open Data Portal (and only the Open Data Portal) API endpoint. The endpoint values are currently hardcoded in the app as a network rule that deploys into the consumer account.
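A sketch of the two objects involved (the object names are illustrative; the network rule itself ships hardcoded with the app):

```sql
-- Network rule restricting egress to the Open Data Portal host only
CREATE OR REPLACE NETWORK RULE config.ckan_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('data.ca.gov');

-- External Access Integration tying the network rule and the secret together
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION ckan_access_int
  ALLOWED_NETWORK_RULES = (config.ckan_rule)
  ALLOWED_AUTHENTICATION_SECRETS = (config.ckan_api_key)
  ENABLED = TRUE;
```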

Create External Access Integration

Finally, the last piece of code executes the FINALIZE stored procedure. This is a bit of clever technical magic I learned from Jim Pan. There are two problems we solve with this SP. First, we can’t create the necessary User Defined Functions, which rely on the External Access Integration, until after that integration exists, so we need to run the UDF DDL after the fact. Second, we can’t run dynamic code (code that calls code) in a Native App: it’s disallowed for externally distributed apps for security reasons, so any dynamic code calls we develop won’t pass the automated security scan. By asking the consumer to execute the SP out of band, we are able to deploy the necessary UDFs. The stored procedure uses Snowpark to grab the DDL files and execute them in the consumer’s installed app.


CREATE OR REPLACE PROCEDURE CONFIG.FINALIZE(EXTERNAL_ACCESS_OBJECT string)
returns string
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'create_functions'
AS
$$
import logging

logger = logging.getLogger("finalize")

def create_functions(session, external_access_object):
    try:
        files = ['get_orgs.sql', 'package_search.sql', 'resource_update.sql']
        for f in files:
            create_function(session, external_access_object, '/scripts/function_ddls/' + f)
        return "Finalization complete"
    except Exception as ex:
        logger.error(ex)
        raise

def create_function(session, external_access_object, filename):
    # Read the UDF DDL template from the app's stage and substitute in
    # the consumer-provided External Access Integration name
    file = session.file.get_stream(filename)
    create_function_ddl = file.read(-1).decode("utf-8")
    create_function_ddl = create_function_ddl.format(external_access_object)
    session.sql("begin " + create_function_ddl + " end;").collect()
    return f'{filename} created'
$$;
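For context, each of those staged files is assumed to be a DDL template with a {0} placeholder that the format() call fills in with the integration name. A hypothetical get_orgs.sql might look roughly like this (the function signature and body here are my illustration, not the app’s actual code):

```sql
CREATE OR REPLACE FUNCTION config.get_orgs()
  RETURNS VARIANT
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  HANDLER = 'get_orgs'
  PACKAGES = ('requests')
  -- {0} is replaced with the consumer's External Access Integration name
  EXTERNAL_ACCESS_INTEGRATIONS = ({0})
  SECRETS = ('api_key' = config.ckan_api_key)
  AS 'def get_orgs(): ...';
```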

Lastly, while the security permissions framework allows an app provider to make requests through the app to access a table, the Open Data Portal Connector needs additional access to the table names in order to correctly match a table name to a CKAN resource. So for any table we want to publish to the Portal, the consumer needs to grant the app access by executing a GRANT on that table.
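For each table to be published, the grants look roughly like this (database, schema, and table names are illustrative):

```sql
-- Give the app visibility into the containers, then SELECT on the table itself
GRANT USAGE ON DATABASE source_db TO APPLICATION open_data_connector;
GRANT USAGE ON SCHEMA source_db.public TO APPLICATION open_data_connector;
GRANT SELECT ON TABLE source_db.public.beach_water_quality
  TO APPLICATION open_data_connector;
```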

Table Grants

Publishing a dataset

The use case we are solving means that a business or program user will first use the Open Data Portal to create a package and a resource. This is a prerequisite desired by the Water Resources Control Board, because they like developing all the metadata with the WYSIWYG editor. But the process of uploading data regularly was laborious. We aim to fix that pain with automation.

With the initial one-time configuration out of the way, the rest of the app is pretty straightforward. The Publish Resources page uses your API key to authenticate you and retrieve all your CKAN Organizations via the API endpoints. Select an Organization, and the packages related to that organization are populated. Each step forward is selecting options in a drop-down box, reducing user error and improving navigation.

External Access dynamically calls out to the CKAN API to gather information

A fully populated page shows the CKAN Organization, Package, and Resource (uuids are managed by the app) mapped to a Snowflake table.

Associating a CKAN Resource with a Snowflake table

Pushing the final PUBLISH button invokes the end-to-end process. Because CKAN is a file-based system, the app unloads the selected table to an internal stage based on user settings. This means no data leaves the encrypted, secure boundary of Snowflake, and there remains only one copy of the data. After writing the file(s) to the stage, a presigned URL is generated and registered with CKAN. The maximum lifetime of a presigned URL is one week, so a weekly process regenerates the presigned URLs and issues an update for all published files. In addition, a daily process checks whether the original base tables have changed; if they have, a new file is generated and an update is pushed to CKAN. This way, once a file is in a published state, there’s no further interaction needed to keep the data fresh and accessible to the portal.
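The unload-and-register step can be sketched in SQL (stage, file, and table names are illustrative):

```sql
-- Unload the selected table to the app's internal stage as a single CSV file;
-- the data stays inside Snowflake's encrypted boundary
COPY INTO @config.publish_stage/beach_water_quality.csv
  FROM source_db.public.beach_water_quality
  FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE)
  SINGLE = TRUE OVERWRITE = TRUE;

-- Generate a presigned URL to register with CKAN;
-- 604800 seconds (1 week) is the maximum lifetime
SELECT GET_PRESIGNED_URL(@config.publish_stage, 'beach_water_quality.csv', 604800);
```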

Native App Workflow

The Future State of the Open Data Connector

With all the core automated logic now contained in a centralized location, any California state department will be able to get the app directly from the Marketplace with no intervention from CDT. Updates occur directly through the framework, automated as part of the app provider’s release process.

Feedback has already rolled in: some departments would like to create the packages and resources directly from this UI, so we’ll be looking to add that in the future. In addition, what about other states or organizations that want to provide a secondary channel for data sharing outside of Snowflake? We’ll be parameterizing the CKAN endpoint and making a generic CKAN Connector available in the Marketplace too! But let’s not forget the native power of the Snowflake data share. What if we could publish to both the portal AND the Snowflake Marketplace at the same time? This helps realize the vision of sharing in multiple capacities, both directly on the Snowflake Marketplace and on the Open Data Portal: one platform to ease the burden and expense of traditional infrastructure while democratizing data access across the enterprise.

The Snowflake Data Cloud — Mobilize Data, Apps, and AI

And if you’re interested in building the above visualization, check out how you can do it directly in Snowflake.

Thank you

I can’t thank the teams at CDT (particularly Sam Hayashi), SWRCB, and my peers at Snowflake enough for making this all possible.


Gabriel Mullen

Sales Engineer at Snowflake. All content is solely that of Gabriel Mullen.