Snowflake Snowpro Core Preparation Part 14 — Estimation Functions, System Functions & Table Functions
In this blog, we are going to discuss about Estimation Functions, Sampling, System Functions & Table Functions
Estimation Functions:
● Functions which are helpful to estimate the values.
Cardinality Estimation:
● Snowflake uses HyperLogLog to estimate the approximate number of distinct values in a data set.
● HyperLogLog is a state-of-the-art cardinality estimation algorithm, capable of estimating distinct cardinalities of trillions of rows with an average relative error of a few percent.
● Snowflake recommend using HyperLogLog whenever the input is potentially large, and an approximate result is acceptable.
● The average relative error of our HyperLogLog implementation is 1.62338% (i.e., the average relative difference to the corresponding COUNT(DISTINCT …) result).
Similarity Estimation:
● Snowflake uses MinHash for estimating the approximate similarity between two or more data sets.
● The MinHash scheme compares sets without computing the intersection or union of the sets, which enables efficient and effective estimation.
● The Jaccard similarity coefficient (or index) is used to compare the similarity between two sets — J(A,B) = (A ∩ B) / (A ∪ B).
● However, this calculation can consume significant resources and time and, therefore, is not ideal for large data sets.
● In contrast, the goal of the MinHash scheme is to estimate J(A,B) quickly, without computing the intersection or union.
Frequency Estimation:
● Snowflake uses the Space-Saving algorithm, a space and time efficient way of estimating approximate frequent values in data sets.
● It is implemented through the APPROX_TOP_K family of functions.
● Additionally, the APPROX_TOP_K_COMBINE function utilizes the parallel Space-Saving algorithm.
● The percentage of error for the algorithm depends heavily on how skewed the data is, and the number of counters used in the algorithm.
● As data becomes more skewed, or more counters are used, the output will be more accurate.
Percentile Estimation:
● Snowflake uses an improved version of the t-Digest algorithm, a space and time efficient way of estimating approximate percentile values in data sets.
● It is implemented through the APPROX_PERCENTILE family of functions.
● This algorithm has a constant relative error.
● This algorithm has substantial empirical support, but no rigorous proof of any accuracy guarantees.
Sampling:
● Returns a subset of rows sampled randomly from the specified table.
● Fraction Based — Sample a fraction of a table, with a specified probability for including a given row.
- The number of rows returned depends on the size of the table and the requested probability.
- A seed can be specified to make the sampling deterministic.
● Fixed Size — Sample a fixed, specified number of rows.
- The exact number of specified rows is returned unless the table contains fewer rows.
● SAMPLE and TABLESAMPLE are synonymous and can be used interchangeably.
Fraction-based row Sampling Examples:
Fixed-size row Sampling Examples:
All details about the Estimation functions are available in this Youtube Video:
System Functions:
● Snowflake provides the following types of system functions –
- Control functions that allow you to execute actions in the system (e.g., aborting a query).
- Information functions that return information about the system (e.g., calculating the clustering depth of a table).
- Information functions that return information about queries (e.g., information about EXPLAIN plans).
● Many of these system functions have the prefix SYSTEM$ (e.g., SYSTEM$TYPEOF).
● For the system functions that use SYSTEM$ prefix, must specify the prefix when calling the function.
Examples:
Table Functions:
● A table function returns a set of rows for each input row.
● The returned set can contain zero, one, or more rows. Each row can contain one or more columns.
● Table functions are sometimes called “tabular functions”.
● Table functions are typically used when a function returns multiple rows for each individual input.
● Example — A function that accepts an account number and a date, and returns all charges billed to that account on that date. (More than one charge might have been billed on a particular date.)
● System-Defined Table Functions — Snowflake provides hundreds of built-in functions, many of which are table functions.
● User-Defined Table Functions — Users can also write their own functions, called user-defined functions or “UDFs”.
Examples:
All the details about System and Table Functions are available in this Youtube Video:
Photo Courtesy:
Reference Links:
Cardinality estimation:
https://docs.snowflake.com/en/user-guide/querying-approximate-cardinality
Similarity estimation:
https://docs.snowflake.com/en/user-guide/querying-approximate-similarity
Frequency Estimation:
https://docs.snowflake.com/en/user-guide/querying-approximate-frequent-values
Percentile Estimation:
https://docs.snowflake.com/en/user-guide/querying-approximate-percentile-values
Sampling:
https://docs.snowflake.com/en/sql-reference/constructs/sample
System Functions:
https://docs.snowflake.com/en/sql-reference/functions-system
Table Functions:
https://docs.snowflake.com/en/sql-reference/functions-table
Catch you in the next blog — Snowflake Snowpro Core Preparation Part 15— Semi Structured Data Transformations & Unstructured Data Transformations using Directory Tables & File Functions In Snowflake — https://medium.com/@trustngs/snowflake-snowpro-core-preparation-part-15-semi-structured-data-transformations-unstructured-ee1199ad6a0a