Importing Semi-Structured Data from Google Sheets to Snowflake

TL;DR

We published a small tool you can use to import semi-structured data from Google Sheets to Snowflake, taking advantage of Snowflake’s variant type.

Background

There are a variety of cases where you might want to import data provided by business users into your data warehouse — for example, inventory cost information or logs of marketing spend. These probably won’t come directly from your application, or from a third-party integration, but rather from the operations team, the finance team, the growth team, etc.

If your organization uses Google Sheets a lot, that’s probably a fairly convenient way for business users and input and manage the data.

So, why not just pull the information directly from a sheet? One possible challenge is that these sheets are likely to change over time, with columns being added and removed. We can take advantage of Snowflake’s support for semi-structured data, in particular the variant type, to keep things flexible.

Usage example

Suppose you have a Google Sheet, accessible via a service account, that

looks something like this:

Using this module, the result of importing the sheet above would look like this:

By default, all items are imported as strings, but a “ — coercions” argument is supported to specify fields that should be interpreted as numbers, dates, and timestamps.

You can then use Snowflake’s support for querying semi-structured data to work with the “data” object conveniently.

The repository has additional information on initial setup and supported options.

Limitations

The most important limitation is that this tool replaces the entire table every time, so if historical information is removed from the sheet, it will also be removed from the database.

Alternatives

While this is a convenient technique, it won’t be appropriate for every situation. We also import data into traditional tables, import from CSVs, and import similarly but in an append-only fashion, some or all of which we’ll likely expand on in future commits and articles.

In terms of implementation strategy, you could export CSVs to, say, S3, and then import them from there. Or use an ETL/sync tool, some of which support reading from Google Sheets or files stored in Google Drive or an SFTP server. Yet another option would be to use something like Airtable.

Whichever approach you take, we hope this is an interesting example of dealing with semi-structured data from heterogeneous sources while taking advantage of non-relational features in databases.