Snowflake Snowpro Core Preparation Part 15 — Semi Structured Data Transformations & Unstructured Data Transformations Using Directory Tables & File Functions In Snowflake

Ganapathy Subramanian.N
7 min readJun 5, 2024

--

In this blog, we are going to discuss about Semi structured data transformations, Data types, Querying Semi structured data, Semi structured data functions, Flatten, Lateral Flatten etc. Unstructured data access using Dictionary tables & File Functions.

Semi Structured Data:

● Semi-structured data, or partially structured data, doesn’t follow the tabular structure associated with relational databases or other forms of data tables. However, it does contain tags and metadata to separate semantic elements and establish hierarchies of records and fields.

● Unlike structured data, semi-structured data doesn’t require a prior schema definition like structured data does.

> Without a predefined, fixed schema, semi-structured data is more flexible and freer to evolve over time as new attributes are added.

● Semi-structured data supports a hierarchical data structure that contains nested information.

● Formats -

> XMLExtensible Markup Language — Easy-to-use markup language allows users to define tags and attributes required for storing data in a hierarchical form.

> JSON JavaScript Object Notation — Collects semi-structured data from IoT devices, web browsers, and smartphones. Used to transfer data in between servers and apps or internet-connected devices.

> Avro — Originally developed for use with Apache Hadoop, Avro is a remote procedure call (RPC) framework and data serialization.

> ORCOptimized Row Columnar — Designed to achieve more-efficient compression and enhance performance for reading, writing, and processing data over earlier Hive formats.

> Parquet — columnar storage file format. Is ideal for working with complex data in large volumes and features different methods for efficient data compression and encoding types.

Data Types:

Querying Semi Structured Data:

● Traversing data -

● Dot Notation -

● Bracket Notation -

● Retrieving a Single Instance of a Repeating Element -

● Casting -

Semi Structured Data Functions:

● Parsing JSON and XML data.

● Creating and manipulating ARRAY’s and OBJECT’s.

● Extracting values from semi-structured and structured data (e.g., from an ARRAY, OBJECT, or MAP).

● Converting/casting semi-structured data types and structured data types to/from other data types.

● Determining the data type for values in semi-structured data (i.e., type predicates).

Examples:

PARSE_JSON & TO_JSON:

ARRAY_CONSTRUCT:

OBJECT_CONSTRUCT:

FLATTEN:

● Flattens (explodes) compound values into multiple rows.

● Is a table function that takes a VARIANT, OBJECT, or ARRAY column and produces a lateral view.

● Can be used to convert semi-structured data to a relational representation.

Output:

Example:

LATERAL FLATTEN:

All the details about Snowflake’s Semi Structured data transformations are available in this YouTube Video:

Unstructured Data:

● Unstructured data doesn’t have a predefined structure.

● Examples — Emails, PDFs, Images, Audio files, Video files etc.

Directory Tables:

● A directory table is an implicit object layered on a stage (not a separate database object) and is conceptually similar to an external table because it stores file-level metadata about the data files in the stage.

● Both external (external cloud storage) and internal (Snowflake) stages support directory tables.

Query a directory table to retrieve a list of all the files on a stage — Contains information about each file, including the size, a timestamp of when it was last modified, and its Snowflake file URL.

Join a directory table with a Snowflake table that contains additional data and metadata about unstructured files to see unstructured files and their related data in a single view.

● Use a directory table with the Snowpark API or external functions to create a file processing pipeline.

● Auto Refresh — Refresh operation synchronizes the metadata with the latest set of associated files in storage.

  • AWS — SQS, Azure — Event Grid, GCP — Pub/Sub.

Directory Table Examples:

Querying Directory Tables:

Returns the File URLs whose size > 10000 bytes
Returns the File URLs for the CSV files
Combining Directory tables with regular tables
Automated Metadata Refresh using Streams, Tasks & UDF.

File Functions:

● Enables access to files staged in cloud storage.

File Functions Examples:

GET_PRESIGNED_URL — Generates a pre-signed URL to a file on a stage using the stage name and relative file path as inputs.

BUILD_SCOPED_FILE_URL — Generates a scoped Snowflake file URL to a staged file using the stage name and relative file path as inputs. A scoped URL is encoded and permits access to a specified file for a limited period — 24 hours.

BUILD_STAGE_FILE_URL — Generates a Snowflake file URL to a staged file using the stage name and relative file path as inputs. A file URL permits prolonged access to a specified file. That is, the file URL does not expire.

GET_STAGE_LOCATION Example
GET_ABSOLUTE_PATH Example

All details about the Snowflake’s Dictionary tables and File functions are available in this YouTube Video:

Photo Courtesy:

https://docs.snowflake.com

https://k21academy.com/microsoft-azure/dp-900/structured-data-vs-unstructured-data-vs-semi-structured-data/

https://select.dev/posts/snowflake-semi-structured-data

Reference Links:

Semi structured data 101:

https://www.snowflake.com/guides/semi-structured-data-101/

Loading semi structured data:

https://docs.snowflake.com/en/user-guide/semistructured-intro

Semi structured data types:

https://docs.snowflake.com/en/sql-reference/data-types-semistructured

Querying Semi structured data:

https://docs.snowflake.com/en/user-guide/querying-semistructured

Semi structured data functions:

https://docs.snowflake.com/en/sql-reference/functions-semistructured

Flatten:

https://docs.snowflake.com/en/sql-reference/functions/flatten

Directory tables:

https://docs.snowflake.com/en/user-guide/data-load-dirtables

Directory table enable in stages:

https://docs.snowflake.com/en/sql-reference/sql/create-stage

Querying directory tables:

https://docs.snowflake.com/en/user-guide/data-load-dirtables-query

File Functions:

https://docs.snowflake.com/en/sql-reference/functions-file

Catch you in the next blog — Snowflake Snowpro Core Preparation Part 16 — Time Travel, Fail-Safe, Cloning, Replication,Encryption & Data Sharing — https://medium.com/@trustngs/snowflake-snowpro-core-preparation-part-16-time-travel-fail-safe-cloning-cab23692a8cf

--

--

Ganapathy Subramanian.N

Director - Data Engineering @ Tiger Analytics. 100+ Technical Certifications. AWS-CB, Cloud(Architect,Devops,Data Engg,DW&ML),YouTuber & Blogger