Snowflake Snowpro Core Preparation Part 15 — Semi Structured Data Transformations & Unstructured Data Transformations Using Directory Tables & File Functions In Snowflake
In this blog, we are going to discuss about Semi structured data transformations, Data types, Querying Semi structured data, Semi structured data functions, Flatten, Lateral Flatten etc. Unstructured data access using Dictionary tables & File Functions.
Semi Structured Data:
● Semi-structured data, or partially structured data, doesn’t follow the tabular structure associated with relational databases or other forms of data tables. However, it does contain tags and metadata to separate semantic elements and establish hierarchies of records and fields.
● Unlike structured data, semi-structured data doesn’t require a prior schema definition like structured data does.
> Without a predefined, fixed schema, semi-structured data is more flexible and freer to evolve over time as new attributes are added.
● Semi-structured data supports a hierarchical data structure that contains nested information.
● Formats -
> XML — Extensible Markup Language — Easy-to-use markup language allows users to define tags and attributes required for storing data in a hierarchical form.
> JSON — JavaScript Object Notation — Collects semi-structured data from IoT devices, web browsers, and smartphones. Used to transfer data in between servers and apps or internet-connected devices.
> Avro — Originally developed for use with Apache Hadoop, Avro is a remote procedure call (RPC) framework and data serialization.
> ORC — Optimized Row Columnar — Designed to achieve more-efficient compression and enhance performance for reading, writing, and processing data over earlier Hive formats.
> Parquet — columnar storage file format. Is ideal for working with complex data in large volumes and features different methods for efficient data compression and encoding types.
Data Types:
Querying Semi Structured Data:
● Traversing data -
● Dot Notation -
● Bracket Notation -
● Retrieving a Single Instance of a Repeating Element -
● Casting -
Semi Structured Data Functions:
● Parsing JSON and XML data.
● Creating and manipulating ARRAY’s and OBJECT’s.
● Extracting values from semi-structured and structured data (e.g., from an ARRAY, OBJECT, or MAP).
● Converting/casting semi-structured data types and structured data types to/from other data types.
● Determining the data type for values in semi-structured data (i.e., type predicates).
Examples:
PARSE_JSON & TO_JSON:
ARRAY_CONSTRUCT:
OBJECT_CONSTRUCT:
FLATTEN:
● Flattens (explodes) compound values into multiple rows.
● Is a table function that takes a VARIANT, OBJECT, or ARRAY column and produces a lateral view.
● Can be used to convert semi-structured data to a relational representation.
Output:
Example:
LATERAL FLATTEN:
All the details about Snowflake’s Semi Structured data transformations are available in this YouTube Video:
Unstructured Data:
● Unstructured data doesn’t have a predefined structure.
● Examples — Emails, PDFs, Images, Audio files, Video files etc.
Directory Tables:
● A directory table is an implicit object layered on a stage (not a separate database object) and is conceptually similar to an external table because it stores file-level metadata about the data files in the stage.
● Both external (external cloud storage) and internal (Snowflake) stages support directory tables.
● Query a directory table to retrieve a list of all the files on a stage — Contains information about each file, including the size, a timestamp of when it was last modified, and its Snowflake file URL.
● Join a directory table with a Snowflake table that contains additional data and metadata about unstructured files to see unstructured files and their related data in a single view.
● Use a directory table with the Snowpark API or external functions to create a file processing pipeline.
● Auto Refresh — Refresh operation synchronizes the metadata with the latest set of associated files in storage.
- AWS — SQS, Azure — Event Grid, GCP — Pub/Sub.
Directory Table Examples:
Querying Directory Tables:
File Functions:
● Enables access to files staged in cloud storage.
File Functions Examples:
● GET_PRESIGNED_URL — Generates a pre-signed URL to a file on a stage using the stage name and relative file path as inputs.
● BUILD_SCOPED_FILE_URL — Generates a scoped Snowflake file URL to a staged file using the stage name and relative file path as inputs. A scoped URL is encoded and permits access to a specified file for a limited period — 24 hours.
● BUILD_STAGE_FILE_URL — Generates a Snowflake file URL to a staged file using the stage name and relative file path as inputs. A file URL permits prolonged access to a specified file. That is, the file URL does not expire.
All details about the Snowflake’s Dictionary tables and File functions are available in this YouTube Video:
Photo Courtesy:
https://select.dev/posts/snowflake-semi-structured-data
Reference Links:
Semi structured data 101:
https://www.snowflake.com/guides/semi-structured-data-101/
Loading semi structured data:
https://docs.snowflake.com/en/user-guide/semistructured-intro
Semi structured data types:
https://docs.snowflake.com/en/sql-reference/data-types-semistructured
Querying Semi structured data:
https://docs.snowflake.com/en/user-guide/querying-semistructured
Semi structured data functions:
https://docs.snowflake.com/en/sql-reference/functions-semistructured
Flatten:
https://docs.snowflake.com/en/sql-reference/functions/flatten
Directory tables:
https://docs.snowflake.com/en/user-guide/data-load-dirtables
Directory table enable in stages:
https://docs.snowflake.com/en/sql-reference/sql/create-stage
Querying directory tables:
https://docs.snowflake.com/en/user-guide/data-load-dirtables-query
File Functions:
https://docs.snowflake.com/en/sql-reference/functions-file
Catch you in the next blog — Snowflake Snowpro Core Preparation Part 16 — Time Travel, Fail-Safe, Cloning, Replication,Encryption & Data Sharing — https://medium.com/@trustngs/snowflake-snowpro-core-preparation-part-16-time-travel-fail-safe-cloning-cab23692a8cf