How Snowflake enables its customers to create a unified source of truth by extending their data lake to their internal and external sources using External Tables (Data Lake & Data Sharing)

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

Introduction

Most organizations with multiple heterogeneous systems, whether on-premises or in the cloud, are looking to move their workloads to the cloud to take advantage of performance, scalability, and elasticity with minimal administration effort.

However, when it comes to migration, some question their existing data lake, especially the complexity and cost of migrating it. Others do not want to migrate all of their data because they want to keep some of it in their on-premises environment.

The Snowflake Data Cloud answers the requirements that come with ever-growing data complexity and volume. Snowflake offers a set of features and services to simplify the access, exploitation, and analysis of data in the cloud, allowing its customers to implement different types of workloads: Data Engineering, Data Science, Cybersecurity, Data Collaboration/Sharing, Apps, …

Snowflake gives its customers the ability to leverage and extend their data lake to sources and systems outside the Data Cloud. These systems can be in the cloud and, soon, on-premises thanks to the partnership with Pure Storage.

End-to-end Platform for Snowflake

What are External Tables and how do they work?

With external tables, the data is stored in files in an external stage. External tables store file-level metadata about these data files, such as the file name, a version identifier, and related properties. This allows data stored in files in an external stage to be queried as if it were inside a database.

External tables can access data stored in any format supported by COPY INTO <table> statements.

External tables are read-only, so no DML operations can be performed on them. However, they can be used in SELECT and join queries, and views can be created on top of them in Snowflake.

Querying these tables is likely to be slower than querying native tables because performance depends on the source systems. Snowflake offers a way to overcome this problem: materialized views based on external tables can improve query performance.

Data is not ingested or copied into Snowflake; it stays in the external stage. You can define virtual columns on top of the VALUE column.

Note: External tables support external stages only (S3, Azure Blob Storage, GCS, …); they do not support internal (i.e. Snowflake) stages.

How to create an External Table on Snowflake?

To create External Tables, you only need to have some knowledge of the file format and record format of the source data files. It is not necessary to know the schema of the data files.

Note that SELECT * always returns the VALUE column, in which all regular or semi-structured data is cast to VARIANT rows.

CREATE OR REPLACE EXTERNAL TABLE event (
  event_id INT AS (value:c2::INT),
  event_name STRING AS (value:c3::STRING),
  event_date DATETIME AS (value:c4::DATETIME)
)
LOCATION = @event_stage/path_prefix
FILE_FORMAT = (TYPE = CSV);
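
For illustration, here is a quick query against this table (reusing the names from the example above) that mixes the VALUE column, the METADATA$FILENAME pseudo-column, and the virtual columns:

SELECT metadata$filename,                -- source file of each row
       value,                            -- the full record as a VARIANT
       event_id, event_name, event_date  -- virtual columns defined above
FROM event
LIMIT 10;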

Why does it matter?

  1. Ability to extend your Data Lake to external storage.
  2. Snowflake complements your Data Lake by fitting in with existing architectures.
  3. No need to redesign or re-architect your Data Lake.
  4. Automatically refresh external table metadata using your cloud notification service.
  5. Reduced maintenance overhead for notification event management and table metadata refresh.
  6. External data is always up to date and available to users from a single governed data platform.
  7. Query in place — As a tool to query data that cannot or will not be moved into Snowflake.
  8. Import for data migrations — More easily import data into Snowflake and leverage the full power of the Data Cloud.

What external systems are supported?

Below is a high-level architecture showing the use of external tables and the different supported source systems:

Amazon S3

Create a named stage object that references the external S3 bucket in which your data files are made available. Then create an external table that references the named stage and integration.

Thereafter, the S3 event notifications trigger the metadata refresh automatically once the S3 event configuration is done.
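
As a minimal sketch of these two steps (the bucket, integration, and object names below are assumptions):

-- A storage integration named s3_int is assumed to already exist.
CREATE OR REPLACE STAGE event_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (TYPE = CSV);

CREATE OR REPLACE EXTERNAL TABLE event_s3
  LOCATION = @event_stage
  AUTO_REFRESH = TRUE                 -- metadata refreshed via the S3 event notifications
  FILE_FORMAT = (TYPE = CSV);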

Azure Blob Storage

Create a named stage object that references the external Azure Blob Storage in which your data files are made available. Then create an external table that references the named stage.

Thereafter, Azure Event Grid notifications will automatically trigger the metadata refresh once the configuration is done.
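
A minimal sketch, assuming the Event Grid subscription already routes blob events to a storage queue (all names below are hypothetical):

CREATE OR REPLACE NOTIFICATION INTEGRATION azure_event_int
  ENABLED = TRUE
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
  AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://myaccount.queue.core.windows.net/my-queue'
  AZURE_TENANT_ID = '<tenant_id>';

CREATE OR REPLACE EXTERNAL TABLE event_azure
  LOCATION = @azure_stage             -- stage pointing at the Blob Storage container
  AUTO_REFRESH = TRUE
  INTEGRATION = 'AZURE_EVENT_INT'     -- notification integration name, in upper case
  FILE_FORMAT = (TYPE = CSV);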

Google Cloud Storage

Create a named stage object that references the external GCS bucket in which your data files are made available. Then create an external table that references the named stage and integration.

Thereafter, Pub/Sub notifications will automatically trigger the metadata refresh once the Google Pub/Sub configuration is done.
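
The GCS setup follows the same pattern, with a Pub/Sub-based notification integration (subscription and stage names are assumptions):

CREATE OR REPLACE NOTIFICATION INTEGRATION gcs_event_int
  ENABLED = TRUE
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = GCP_PUBSUB
  GCP_PUBSUB_SUBSCRIPTION_NAME = 'projects/my-project/subscriptions/my-subscription';

CREATE OR REPLACE EXTERNAL TABLE event_gcs
  LOCATION = @gcs_stage
  AUTO_REFRESH = TRUE
  INTEGRATION = 'GCS_EVENT_INT'
  FILE_FORMAT = (TYPE = CSV);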

Dell Technologies (coming soon)

Dell Technologies and Snowflake will work together to enable Dell and Snowflake customers to use on-premises data stored on Dell object storage with the Snowflake Data Cloud, while having the choice to keep their data local or move it to public clouds (AWS, Azure, or GCP).

Pure Storage (coming soon)

Pure Storage FlashBlade is a unified, dynamic, high-performance object platform designed for modern analytics, delivering real-time data response with cloud-like agility. Snowflake customers will be able to analyze multiple data sources on Pure Storage FlashBlade with the Snowflake Data Cloud, while keeping their data local.

This will allow organizations to extract more value from their on-premises data and eliminate the need to create separate copies of data shared with other tools or workflows, while leveraging Snowflake’s analytics and governance capabilities.

MinIO (coming soon)

MinIO is a high-performance, cloud-native object store. It runs on all versions of Kubernetes (upstream, EKS, AKS, GKE, etc.) as well as on virtual machines (public cloud VMs, VMware, etc.) and on bare-metal hardware.

With the widespread use of Amazon S3, many vendors and customers have adopted the S3 API, such as Dell, Pure Storage, MinIO, and many others. This is why we chose the S3 API in this design: it is a de facto standard across a number of vendors. The storage device provided by the vendor must offer a highly compliant S3 API; some vendors and software may not work because their S3 API is not compliant enough.

What input file formats are supported?

Let’s see the standard formats supported by external tables:

CSV

A CSV file typically stores tabular data. Each column is separated by the same delimiter and each line is a data record consisting of one or more fields (columns).

XML

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human and machine readable.

Apache ORC

Apache ORC (Optimized Row Columnar) is a free, open-source, column-oriented data storage format, similar to Parquet, and a very efficient way to store Hive data. It was designed to overcome the limitations of other Hive file formats. Using ORC files improves performance when Hive reads, writes, and processes data.

JSON

JSON is an open standard file format and data exchange format consisting of attribute-value pairs and arrays.

Apache Avro

Unlike Apache Parquet, Avro is a row-oriented data serialization and remote procedure call framework (part of the Apache Hadoop project). It uses JSON to define data types and protocols, and serializes data into a compact binary format.

Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format in the Hadoop ecosystem. It is similar to ORC and compatible with most data processing frameworks around Hadoop. It is one of the most optimized formats, especially when data needs to be transferred over the network.

Apache Hive Metastore

Snowflake supports Apache Hive metastore integration using External Tables. The Hive connector detects events in the metastore and passes them to Snowflake to keep the External Tables synchronized with the Hive metastore.

How does it work with Snowflake?

What table formats are supported by External Tables?

Snowflake

A Snowflake table is a native, performant, and optimized table created by Snowflake, and it resembles Iceberg tables.

Apache Iceberg

Iceberg is an open-source table format that was developed by Netflix and then donated to the Apache Software Foundation. In addition to concurrency, basic schema support, and better performance, Iceberg offers a number of additional benefits, such as support for multiple file types, a strong community, and adoption by a wide range of companies.

To create an external table that references Apache Iceberg data, set TABLE_FORMAT = ICEBERG in the CREATE EXTERNAL TABLE statement.

Delta

Delta Lake is a table format on your data lake that supports, among other things, ACID (atomicity, consistency, isolation, durability) transactions. All Delta Lake data is stored in Apache Parquet format.

To create an external table that references a Delta Lake, set TABLE_FORMAT = DELTA in the CREATE EXTERNAL TABLE statement.
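
A minimal sketch (stage and table names are assumptions; Delta Lake data files are Parquet, and the table metadata is refreshed manually here):

CREATE OR REPLACE EXTERNAL TABLE sales_delta
  LOCATION = @delta_stage/sales/      -- root directory of the Delta table
  TABLE_FORMAT = DELTA
  AUTO_REFRESH = FALSE                -- refresh manually: ALTER EXTERNAL TABLE sales_delta REFRESH;
  FILE_FORMAT = (TYPE = PARQUET);     -- Delta Lake data files are stored as Parquet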

Optimize queries on External Tables using Materialized Views

Materialized views on external tables can provide faster performance than equivalent queries on the underlying external table. This performance difference can be significant for frequent or complex queries.

Refresh the file-level metadata of all queried external tables so that the data in the materialized views stays up to date.

To create a materialized view, you just have to run a query similar to the following:

CREATE MATERIALIZED VIEW mat_view AS
SELECT event_id, event_name, event_date
FROM event
WHERE event_date > '2022-09-01';

Enable Data Sharing on your Data Lake

Data Sharing capabilities are available to Snowflake customers whether their data is stored on their own systems (external stages) or in the Snowflake Data Cloud (internal storage).
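
One common pattern (all names below are hypothetical) is to expose the external table through a secure view and add that view to a share:

CREATE SECURE VIEW mydb.public.event_sv AS
  SELECT event_id, event_name, event_date FROM mydb.public.event;

CREATE SHARE event_share;
GRANT USAGE ON DATABASE mydb TO SHARE event_share;
GRANT USAGE ON SCHEMA mydb.public TO SHARE event_share;
GRANT SELECT ON VIEW mydb.public.event_sv TO SHARE event_share;
ALTER SHARE event_share ADD ACCOUNTS = consumer_account;   -- account locator of the consumer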

Best practices

  1. Manually refresh the external table metadata once, after creating the table, using this command: ALTER EXTERNAL TABLE MYTABLE REFRESH;
  2. Organize your underlying data using logical paths that include date, time, country, or similar data dimensions.
  3. Define a partitioning scheme by leveraging the METADATA$FILENAME pseudo-column and the underlying logical paths (see the sketch after this list). This improves the performance of keeping the external table metadata in sync.
  4. To optimize the number of parallel scan operations when querying external tables, keep Parquet files between 128 MB and 256 MB in size, and follow the same sizing guidelines as COPY INTO (100 MB to 250 MB) for other file formats.
  5. To minimize the cost of the materialized view, use the partitioned field as the cluster key, if clustering is required.
  6. Set up an automatic refresh method (blob notifications, …).
  7. Name the external tables with an EXT prefix/suffix to facilitate their identification, e.g. EXT_SALES, SALES_EXT.
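
As a sketch of practices 2 and 3, assuming the data files are organized under a logical path such as @event_stage/logs/YYYY/MM/DD/ (the path layout, SPLIT_PART indexes, and column names below are illustrative and must be adapted to your own layout):

CREATE OR REPLACE EXTERNAL TABLE event_partitioned (
  event_date DATE AS TO_DATE(
        SPLIT_PART(METADATA$FILENAME, '/', 2) || '/'
     || SPLIT_PART(METADATA$FILENAME, '/', 3) || '/'
     || SPLIT_PART(METADATA$FILENAME, '/', 4), 'YYYY/MM/DD'),   -- partition column derived from the file path
  event_id INT AS (value:c2::INT),
  event_name STRING AS (value:c3::STRING)
)
PARTITION BY (event_date)
LOCATION = @event_stage/logs/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = CSV);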

More resources
