Why You Should Use Snowflake’s External Tables

A practical example showcasing the value of Snowflake’s external tables for building a Data Lake

Chinmayee Lakkad
Hashmap, an NTT DATA Company
8 min read · Jul 28, 2021


Overview

Snowflake, in its modern avatar, is a data cloud: a collection of services built atop a highly elastic, distributed processing engine. Alongside the many services it offers (data engineering, data warehousing, data sharing, data science, and data lakes), it also paves the way for a connected ecosystem of data applications that can be accessed from a single stop.

The applications written for the platform are typically powered by ETL or ELT pipelines running in the background to ensure that data is moved into Snowflake and then across it, through various logical layers of “data preparedness”, until it is ready to be served or consumed.

When designing data infrastructure with Snowflake, it is important to evaluate all the options. The purpose of this post is to share one option for designing a Raw data layer to power the higher-level models in data pipelines.

The Context

Traditionally, all data intended to be processed within Snowflake has had to be onboarded using the COPY INTO command or via a PIPE (Snowpipe). This works out well if all the processing on that data is to be done within Snowflake alone.
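
For reference, that traditional onboarding pattern looks roughly like the sketch below; the stage, table, and pipe names are placeholders for illustration only and are not part of the example used later in this post.

-- Hypothetical names, shown only to illustrate the onboarding pattern
-- One-off or scheduled batch load into a native Snowflake table:
COPY INTO RAW_DEVICE_STATUS
FROM @MY_EXTERNAL_STAGE
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Or continuous loading with Snowpipe:
CREATE OR REPLACE PIPE RAW_DEVICE_STATUS_PIPE
AUTO_INGEST = TRUE
AS
COPY INTO RAW_DEVICE_STATUS
FROM @MY_EXTERNAL_STAGE
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);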

An alternative approach to onboarding data focuses on the data lake. With this approach, data can be stored outside the boundaries of Snowflake, in cloud object stores, as flat files (in a variety of formats), thereby making the data available to any present or future tool/technology without an explicit extraction step.

With Snowflake, this approach is implemented via the EXTERNAL TABLE feature, which gives a tabular window of access over the flat files that reside externally in cloud storage.

So, what are External Tables?

These are Snowflake objects that overlay a table structure on top of files stored in an EXTERNAL STAGE. They provide a “read-only” level of access for data within these remote files straight from the object store.

These tables store metadata (name, path, version identifier, etc.) about the underlying files in order to facilitate this type of access; that metadata is itself made available through views and table functions in the INFORMATION_SCHEMA.
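
For instance, the list of files currently registered with an external table can be pulled from one of those table functions (the table name below is a placeholder):

SELECT *
FROM TABLE(
    INFORMATION_SCHEMA.EXTERNAL_TABLE_FILES(TABLE_NAME => 'MY_EXTERNAL_TABLE')
);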

Snowflake provides DDL statements to create, alter, drop, and show external tables within a SCHEMA.

Conveniently, external tables can also be configured with “auto-refresh” so that any new files dropped at the connected EXTERNAL STAGE get registered automatically. This does require another type of Snowflake object (a NOTIFICATION INTEGRATION, which uses a cloud vendor’s messaging service to be notified of file events in the object store).

Side Note: Without this setup, file detection can still be triggered using the ALTER EXTERNAL TABLE…REFRESH statement, which needs to be executed manually on a periodic basis.
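
Roughly, the two refresh options look like this; the table, integration, and queue details below are placeholders, and the notification setup varies by cloud vendor (the Azure flavor is shown purely as an illustration).

-- Manual refresh: rescan the stage and register any new or changed files
ALTER EXTERNAL TABLE MY_EXTERNAL_TABLE REFRESH;

-- Auto-refresh: a notification integration that listens to the cloud vendor's
-- messaging service (Azure storage queue shown here as an example)
CREATE NOTIFICATION INTEGRATION MY_NOTIFICATION_INT
ENABLED = TRUE
TYPE = QUEUE
NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://<storage_account>.queue.core.windows.net/<queue_name>'
AZURE_TENANT_ID = '<tenant_id>';

On Azure and GCP, the integration is then referenced from the external table definition (INTEGRATION = 'MY_NOTIFICATION_INT') together with AUTO_REFRESH = TRUE.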

A real-life scenario

Let's run through a scenario and take this feature for a quick spin.

Imagine there is a network of sensors that captures device statuses periodically and pools the information together into a CSV file that is dumped into object storage every hour. In practice the frequency could be even lower, depending on the importance of the activity being “sensed” by the devices (and the data might not be routed via an object store at all, but it’s just an example).

Assume that every file contains each device’s identification, its location, and its present status/condition.

This file can now be queried directly from the object store, inside Snowflake, via an external table, just like any other table:

CREATE OR REPLACE EXTERNAL TABLE EXT_TABLE_ARRAYOFTHINGS_LOCATIONS
(
    year_part  varchar as split_part(metadata$filename, '/', 1),
    month_part varchar as split_part(metadata$filename, '/', 2),
    day_part   varchar as split_part(metadata$filename, '/', 3)
)
PARTITION BY (year_part, month_part, day_part)
WITH LOCATION = @NAME_OF_YOUR_EXTERNAL_STAGE
AUTO_REFRESH = FALSE
FILE_FORMAT = (
    TYPE = CSV
    SKIP_HEADER = 1
    SKIP_BLANK_LINES = TRUE
);
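
Note that the partition expressions above assume the files land under a year/month/day directory layout beneath the stage location (split_part simply peels those three path segments out of metadata$filename); adjust the positions to match your own path structure.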

The difference between an internal table and an external table shows up in the columns that external tables make available straight out of the box, namely:

- VALUE: a VARIANT-type column that contains the entire record as JSON.

- METADATA$FILENAME: the name of each file present in the object store, preceded by any partition directories in its path.

It is possible to provide additional column definitions when creating the table, but they are virtual, computed using expressions over the VALUE column. Virtual column definitions can also be used to create partitions, using expressions based on METADATA$FILENAME (as in the example above).
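
To make that concrete, querying the external table might look like the sketch below; the c1/c5/c6/c7 positions are an assumption that mirrors the column order used in the dbt snapshot later in this post, and the casts are only illustrative.

SELECT
    value:c1::varchar AS device_name,
    value:c5::varchar AS status,
    value:c6::float   AS latitude,
    value:c7::float   AS longitude,
    metadata$filename AS source_file
FROM EXT_TABLE_ARRAYOFTHINGS_LOCATIONS
WHERE year_part = '2021'      -- the virtual partition columns enable pruning
  AND month_part = '07';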

Using this data, it is now possible to build a model on top that stores the history (say, a Type-2 SCD) of device statuses, such that every change in the state of a device over the course of its lifetime on the network is captured as its own record.

Over time, as files land in the object store, data will build up; and to understand status changes at regular intervals for each sensor, CDC (Change Data Capture) should be performed on this accumulating data. To be specific, the main point of interest is capturing the most recent file(s) containing possible updates to devices.

At this point, you can add a STREAM on top of the external table to identify the records from the latest file or set of files dropped in the bucket.

CREATE OR REPLACE STREAM STREAM_EXT_TAB_ARRAYOFTHINGS_LOCATION
ON EXTERNAL TABLE EXT_TABLE_ARRAYOFTHINGS_LOCATIONS
INSERT_ONLY = TRUE;

Please note that STREAMs on external tables are “insert-only” which means that they track new files as and when they get added. However, they will not keep track of any files that get removed.
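
A quick way to peek at what the stream is currently tracking (for an insert-only stream, METADATA$ACTION is always ‘INSERT’ and METADATA$ISUPDATE is always FALSE):

-- Shows the records from files added since the stream was last consumed,
-- along with the stream's metadata columns
SELECT * FROM STREAM_EXT_TAB_ARRAYOFTHINGS_LOCATION;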

So how are “updates” or “deletes” recorded when all the data is captured by the stream as insert-only?

First, by designing the file contents in such a way that the changes are captured in a meaningful manner, recording all kinds of DML operations.

In this example scenario, a possible design option is to always include one row of data (status, location, etc.) for each sensor, whether its state changes or not. In that case, a sensor going down or being taken off the grid (a delete op) will always be *explicitly* called out in a row with a corresponding status value; as will any other change to its state (an update op) or even a new sensor being added to the network (an insert op). This provides at-least-once semantics, with one row per sensor in each file.

Second, by writing the MERGE statements that consume the stream in such a way that all types of DML operations get handled, leveraging that knowledge of the file semantics.

Assuming the file semantics from before, the MERGE statements will always have to factor in the sensor state to decide whether it’s time to upsert a record for a sensor or not.

Also, if a file contains multiple statuses, or if multiple files accumulate before the history model is refreshed, the MERGE statement also needs to factor in the time of each status change and apply the changes in order, to build the right history. This necessitates the presence of a timestamp-based watermark column.
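
Pulling those pieces together, a MERGE that consumes the stream could look roughly like the sketch below. It is only an illustration under the file semantics described above: the DEVICE_STATUS_CURRENT target, the column positions, and the value:c4 watermark are assumptions, and it maintains a latest-state table rather than a full Type-2 history (which would additionally version and close out prior records, much as the dbt snapshot in the side note below does).

MERGE INTO DEVICE_STATUS_CURRENT tgt          -- hypothetical target table
USING (
    -- Newest record per device among the files the stream has picked up
    SELECT * FROM (
        SELECT
            hash(value:c1, value:c2, value:c3, value:c6, value:c7) AS device_id,
            value:c1::varchar AS device_name,
            value:c5::varchar AS status,
            value:c6::float   AS latitude,
            value:c7::float   AS longitude,
            value:c4::timestamp_ntz AS updated_on,   -- assumed watermark column
            row_number() OVER (
                PARTITION BY hash(value:c1, value:c2, value:c3, value:c6, value:c7)
                ORDER BY value:c4::timestamp_ntz DESC
            ) AS rn
        FROM STREAM_EXT_TAB_ARRAYOFTHINGS_LOCATION
    )
    WHERE rn = 1
) src
ON tgt.device_id = src.device_id
WHEN MATCHED AND src.updated_on > tgt.updated_on THEN UPDATE SET
    status     = src.status,
    latitude   = src.latitude,
    longitude  = src.longitude,
    updated_on = src.updated_on
WHEN NOT MATCHED THEN INSERT
    (device_id, device_name, status, latitude, longitude, updated_on)
VALUES
    (src.device_id, src.device_name, src.status, src.latitude, src.longitude, src.updated_on);

Because the stream is consumed inside a DML statement, its offset advances once the transaction commits, so the same files are not reprocessed on the next run.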

Side note: For those familiar with dbt, an alternative to writing MERGE statements is to use ‘snapshots’ to create models with history:

{% snapshot device_status_history %}

{{
    config(
        unique_key='device_id',
        strategy='timestamp',
        updated_at='updated_on'
    )
}}

select
    hash(value:c1, value:c2, value:c3, value:c6, value:c7) as device_id,
    value:c1 as device_name,
    parse_json(concat('{', '"lat":', value:c6, ',', '"long":', value:c7, '}')) as location,
    value:c5 as status,
    case
        when ifnull(lower(status), 'unknown') = 'live' then true
        else false
    end as is_live,
    current_timestamp() as updated_on
from stream_ext_tab_arrayofthings_location

{% endsnapshot %}
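
A snapshot like this is built by running the dbt snapshot command; dbt then maintains the validity-window columns (dbt_valid_from / dbt_valid_to) on the target table, which is what gives you the Type-2 history without hand-written MERGE logic.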

Overall, using external tables is about as straightforward as using internal (local) tables, with just a few notes on how they are created and configured.

However, there are a few caveats and pros/cons to weigh.

My Perspective

Pros

- This approach leverages the single copy of data already extracted from sources into the object storage.
- The format of the data remains open for other tools/technologies to process in case the need arises in the future.
- Raw data storage only needs to be managed at one level which also simplifies cost management for storage.

Cons

- External tables are less performant than local tables and are therefore not a suitable choice for higher-level tables, especially tables that directly feed BI reports or other service layers for consumers.
- Additional supporting objects are needed to enable the notification integration and auto-refresh.

Caveats

- Since external tables are not as performant as local Snowflake tables, using them in any higher data zone might not be the best choice. However, using them to build the Raw zone for batch-processed data is still a good fit.
- Use them if the auto-refresh latencies are acceptable and in line with your data processing/publishing SLAs.
- When designing CDC over external tables, first design how the changes (especially deletes) will be captured from a source in the file extracts that get placed in the object store.

Conclusion

All things considered, using external tables can be a viable approach to building a data lake with Snowflake. It saves one hop in an ETL/ELT pipeline and, best of all, that second copy of data is no longer needed within Snowflake, which means one less cost profile to manage.

The window of opportunity to process data within Snowflake just got wider and more accessible.

Note: The dataset shown in the example was taken from data.world’s open dataset “ArrayOfThings”.

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better, together.

If you are considering moving data and analytics products and applications to the cloud or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements.

Feel free to share on other channels, and be sure and keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap as well on Spotify, Apple, Google, and other popular streaming apps.

Additional Resources

Chinmayee Lakkad is a Regional Technical Expert and Cloud/Data Engineer at Hashmap, an NTT DATA Company, and provides Data & Cloud solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
