Snowflake External Tables — Quick Setup Guide

Access your cloud data without bringing it into Snowflake’s storage layer

Kedar Waval
Hashmap, an NTT DATA Company
8 min read · Sep 29, 2021


Snowflake external tables provide a unique way to access your cloud datasets without actually moving them into Snowflake. This opens up interesting possibilities for handling low-volume, less frequently accessed, and ad hoc datasets that need quick turnaround times in terms of availability on the analytics platform. The setup process is straightforward and easy to follow.

In this guide, I’ll walk you through a general overview of Snowflake external tables and a quick setup using Snowflake’s storage integration object and AWS SQS. So, let’s get started!

What are External Tables?

  • In a typical table, the data is stored in the database
  • In an external table, the data is stored in files in an external stage such as Amazon S3, a GCP bucket, or Azure Blob Storage
  • External tables store file-level metadata about the data files, such as the filename, a version identifier, and related properties
  • This enables querying data stored in files in an external stage as if it were inside a database
  • External tables are read-only, so no DML operations can be performed on them
  • External tables can be used for query and join operations
  • Views can be created against external tables
  • Querying data stored external to the database is likely to be slower than querying native database tables, but materialized views based on external tables can improve query performance

(Source: Snowflake Documentation)

There are some interesting use cases possible with external tables:

  1. Files uploaded by business teams or customers: We live in a world where not everything is automated (yet!). More often than not, there are requests from business or customer teams to “quickly” make some data available in the analytics layer, e.g. fiscal calendars, budgets, etc.
  2. Small data exports from HR-type applications: Your organization might be using applications that do not provide an easy way to push/pull data to your data pipelines; the best they can do is export delimited flat files to a storage location.
  3. Making data available to analysts/modelers before it has been integrated through data pipelines: You may have some files accumulated in your cloud storage and haven’t yet figured out when/how you want to bring them to your analytics platform. There could be pending data modeling activities or business approvals. You may want to make this data accessible to your team for data modeling or for making certain decisions related to the ELT process.
  4. Accessing data from a data lake: You may want to make datasets in your data lake accessible via Snowflake without actually moving the data. This could be because of some compliance regulations or strategic decisions that prohibit moving data or making copies of the data at different places.

I recently worked on a use case where there was a need to make the following types of data available in Snowflake for further analysis using Looker:

  1. Customer Success and Product Experience data
  2. HR/Recruitment Data — Applicant Tracking, Onboarding, Employee Database, Time-off Management
  3. Revenue Attribution data
  4. Employee Engagement data

Snowflake external tables turned out to be a good option for some of the datasets involved. This article explores how these external tables were set up using Snowflake’s storage integration object and AWS SQS for auto-refresh.

Quick Setup using Storage Integration and AWS SQS

There are several ways to build external tables, with different configuration options available. Snowflake provides comprehensive documentation here.

This quick setup assumes that the files are staged in AWS S3 and the Snowflake environment uses AWS as the cloud provider. The following steps should give you a good understanding of the underlying concepts and make it easy to adapt them to other cloud provider combinations and configuration options.

Setup process and environments visualized:

External Tables Setup

9-Step Setup Process:

  1. Create or decide on an S3 bucket and paths for the different types of files. This bucket is governed by an IAM policy that is assigned to an IAM role. The IAM policy allows the GetObject, GetObjectVersion, and ListBucket permissions that Snowflake needs to access the files in the bucket:
IAM Access Policy on the bucket
IAM Role
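The original screenshots aren’t reproduced here, but a minimal sketch of such a policy, assuming a hypothetical bucket my-data-bucket with files under an extdata/ prefix, looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::my-data-bucket/extdata/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-data-bucket",
      "Condition": { "StringLike": { "s3:prefix": ["extdata/*"] } }
    }
  ]
}
```

Attach this policy to a dedicated IAM role; the role’s trust policy gets filled in during step 3.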
  2. Create a Storage Integration object in Snowflake:
Storage Integration in Snowflake
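A minimal sketch, with hypothetical names (the integration name, role ARN, and allowed location are placeholders for your own values):

```sql
-- Role ARN from step 1; ALLOWED_LOCATIONS restricts which paths stages may use
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-bucket/extdata/');
```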
  3. Trusted Relationship: Upon successful creation of the Storage Integration object, Snowflake makes its USER_ARN and EXTERNAL_ID available via the DESC command. Use this user and external ID to establish a trusted relationship between the AWS IAM role and Snowflake:
Trust Relationship between AWS IAM Role and Snowflake
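Roughly, continuing the hypothetical names from the previous sketches:

```sql
-- STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID appear in this output
DESC INTEGRATION s3_int;
```

Those two values go into the IAM role’s trust policy, along these lines:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<STORAGE_AWS_IAM_USER_ARN from DESC output>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID from DESC output>" }
      }
    }
  ]
}
```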
  4. Create External Stage Object: Create an external stage using the storage integration object. Note that this object is just a pointer to the S3 location; it only points to the data and does not contain the data itself.
External Stage Object
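For example (stage name hypothetical):

```sql
-- The stage is only a pointer to the S3 location; no data is copied
CREATE STAGE ext_stage
  STORAGE_INTEGRATION = s3_int
  URL = 's3://my-data-bucket/extdata/';
```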
  5. Create a file format: A file format tells Snowflake about the structure of the data files. In most cases the default type, CSV, gets the job done, but you may need a file format object to override certain defaults or to use specific options like FIELD_OPTIONALLY_ENCLOSED_BY. This option was useful in our use case, as it provided a way to handle occurrences of the delimiter character within certain data fields.
File Format Definition
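A sketch of such a file format (the exact options depend on your files):

```sql
CREATE FILE FORMAT csv_ff
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'  -- tolerates delimiters inside quoted fields
  NULL_IF = ('', 'NULL');
```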
  6. Define the external table: Snowflake provides two pseudo-columns with every external table, VALUE and METADATA$FILENAME. The VALUE column structures each row as an object with elements identified by column position (i.e. {c1: <column_1_value>, c2: <column_2_value>, c3: <column_3_value> ...}). You may want to parse the objects in the VALUE column to define the virtual columns of your external table. Note that this step requires some knowledge of the underlying file structure and data types.
External Table Definition
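As an illustration, assuming a hypothetical budgets dataset with three columns:

```sql
-- Virtual columns are parsed out of VALUE by position (c1, c2, c3, ...)
CREATE EXTERNAL TABLE budgets_ext (
  fiscal_period VARCHAR       AS (VALUE:c1::VARCHAR),
  department    VARCHAR       AS (VALUE:c2::VARCHAR),
  budget_amount NUMBER(12,2)  AS (VALUE:c3::NUMBER(12,2))
)
WITH LOCATION = @ext_stage/budgets/
FILE_FORMAT = (FORMAT_NAME = 'csv_ff');
```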
  7. Partitioning: The folder/partitioning structure defined in S3 storage can be leveraged to register partitions in the external table metadata, which helps scan only the required partitions. For example, your datasets might be stored in different sub-folder hierarchies based on customer IDs or dates. If the external table is made aware of this partitioning scheme, reads against it will be faster when they filter on customer IDs or date components.
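A sketch of a partitioned variant, assuming hypothetical year/month sub-folders such as budgets/2021/09/ under the stage location:

```sql
-- The partition column must be derived from METADATA$FILENAME
CREATE EXTERNAL TABLE budgets_part_ext (
  part_month    DATE          AS TO_DATE(
                                   SPLIT_PART(METADATA$FILENAME, '/', 2) || '-' ||
                                   SPLIT_PART(METADATA$FILENAME, '/', 3) || '-01',
                                   'YYYY-MM-DD'),
  fiscal_period VARCHAR       AS (VALUE:c1::VARCHAR),
  budget_amount NUMBER(12,2)  AS (VALUE:c3::NUMBER(12,2))
)
PARTITION BY (part_month)
WITH LOCATION = @ext_stage/budgets/
FILE_FORMAT = (FORMAT_NAME = 'csv_ff');
```

A query that filters on part_month then scans only the matching sub-folders.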
  8. Materialized Views: Materialized views act as a caching layer that provides faster data access, since the data actually gets materialized within Snowflake and thus benefits from clustering and the query result cache as well. They are useful if external tables are accessed frequently and results need to be cached for better performance.
Materialized View created on top of the External Table
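For instance, on the hypothetical table from step 6 (note that materialized views require Snowflake Enterprise Edition or higher):

```sql
CREATE MATERIALIZED VIEW budgets_mv AS
SELECT fiscal_period, department, budget_amount
FROM budgets_ext;
```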
  9. Set Up Auto Refresh: The external table, as created, needs to be refreshed manually for Snowflake to update its metadata about the underlying files. You should consider setting up the auto-refresh process so that any new or changed files are automatically detected by the external table. For this, Snowflake provides an SQS queue whose ARN can be found with the “show external tables” command (notification_channel column). This SQS ARN needs to be configured in the notification settings on the S3 bucket (for ObjectCreated (All) and ObjectRemoved events):
S3 events notification to SQS
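Continuing the hypothetical example:

```sql
-- Manual refresh: re-scans the stage and updates the file-level metadata
ALTER EXTERNAL TABLE budgets_ext REFRESH;

-- The notification_channel column holds the SQS ARN to configure on the bucket
SHOW EXTERNAL TABLES LIKE 'budgets_ext';
```

Once the S3 event notifications are in place, tables created with AUTO_REFRESH = TRUE pick up new and removed files automatically.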

Important Notes:

  • Privileges: Some steps in the setup process need elevated privileges on Snowflake and AWS. You may need to work with different admins/teams within your organization to complete such steps.
  • Data Volume & Access Frequency: External Tables do not necessarily apply to all data-access use cases. If you are dealing with high-volume datasets and/or if they are going to be accessed frequently from the Snowflake layer, it would be worth considering other data ingestion patterns.
  • Parsing of the VALUE column: This may need some trial and error based on the data values, e.g. the presence of the delimiter character within data values, null or empty strings, non-standard date formats, etc. You may want to start with an external table definition with just the VALUE column and then work out your parsing logic using SELECT queries against that base definition. Once the parsing logic is figured out, the table definition can be revised, as sketched below.
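A minimal sketch of that exploratory approach, with hypothetical names:

```sql
-- Base definition: no column list, so only the raw VALUE object is exposed
CREATE EXTERNAL TABLE raw_ext
WITH LOCATION = @ext_stage/budgets/
FILE_FORMAT = (FORMAT_NAME = 'csv_ff');

-- Iterate on parsing logic with plain SELECTs before revising the definition
SELECT VALUE:c1::VARCHAR                 AS fiscal_period,
       TRY_TO_NUMBER(VALUE:c3::VARCHAR)  AS budget_amount,
       METADATA$FILENAME                 AS source_file
FROM raw_ext
LIMIT 100;
```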

Closing Thoughts

The Snowflake external tables feature provides a unique way to access your cloud data without actually bringing it into Snowflake’s storage layer. Performance improvement options for data access, such as partitioning and materialized views, are worth exploring.

I hope this guide has given you a good understanding of the underlying concepts and makes it easy to work through other cloud provider combinations and configuration options!

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better, together.

If you are considering moving data and analytics products and applications to the cloud, or if you would like help, guidance, and a few best practices for delivering higher-value outcomes in your existing cloud program, please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements.


Kedar Waval is a Senior Architect at Hashmap, an NTT DATA Company, and provides Data, Cloud, ELT, and BI solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
