Snowflake Snowpro Core Preparation Part 14 — Estimation Functions, System Functions & Table Functions

Ganapathy Subramanian.N
6 min readJun 5, 2024

--

In this blog, we are going to discuss about Estimation Functions, Sampling, System Functions & Table Functions

Estimation Functions:

● Functions which are helpful to estimate the values.

Cardinality Estimation:

● Snowflake uses HyperLogLog to estimate the approximate number of distinct values in a data set.

● HyperLogLog is a state-of-the-art cardinality estimation algorithm, capable of estimating distinct cardinalities of trillions of rows with an average relative error of a few percent.

● Snowflake recommend using HyperLogLog whenever the input is potentially large, and an approximate result is acceptable.

● The average relative error of our HyperLogLog implementation is 1.62338% (i.e., the average relative difference to the corresponding COUNT(DISTINCT …) result).

Similarity Estimation:

● Snowflake uses MinHash for estimating the approximate similarity between two or more data sets.

● The MinHash scheme compares sets without computing the intersection or union of the sets, which enables efficient and effective estimation.

● The Jaccard similarity coefficient (or index) is used to compare the similarity between two sets — J(A,B) = (A ∩ B) / (A ∪ B).

● However, this calculation can consume significant resources and time and, therefore, is not ideal for large data sets.

● In contrast, the goal of the MinHash scheme is to estimate J(A,B) quickly, without computing the intersection or union.

The similarity index of these two tables is approximated as 0.79, as opposed to the exact value 0.8 (i.e., 4/5).

Frequency Estimation:

● Snowflake uses the Space-Saving algorithm, a space and time efficient way of estimating approximate frequent values in data sets.

● It is implemented through the APPROX_TOP_K family of functions.

● Additionally, the APPROX_TOP_K_COMBINE function utilizes the parallel Space-Saving algorithm.

● The percentage of error for the algorithm depends heavily on how skewed the data is, and the number of counters used in the algorithm.

● As data becomes more skewed, or more counters are used, the output will be more accurate.

Percentile Estimation:

● Snowflake uses an improved version of the t-Digest algorithm, a space and time efficient way of estimating approximate percentile values in data sets.

● It is implemented through the APPROX_PERCENTILE family of functions.

● This algorithm has a constant relative error.

● This algorithm has substantial empirical support, but no rigorous proof of any accuracy guarantees.

The value returned in this case is higher than any value in the data set

Sampling:

● Returns a subset of rows sampled randomly from the specified table.

Fraction Based — Sample a fraction of a table, with a specified probability for including a given row.

  • The number of rows returned depends on the size of the table and the requested probability.
  • A seed can be specified to make the sampling deterministic.

Fixed Size — Sample a fixed, specified number of rows.

  • The exact number of specified rows is returned unless the table contains fewer rows.

● SAMPLE and TABLESAMPLE are synonymous and can be used interchangeably.

Fraction-based row Sampling Examples:

Return a sample of a table in which each row has a 10% probability of being included in the sample
Return a sample of a table in which each row has a 20.3% probability of being included in the sample
Return an entire table, including all rows in the table
Return an empty sample
Return a sample of a table in which each block of rows has a 3% probability of being included in the sample, and set the seed to 82

Fixed-size row Sampling Examples:

Return a fixed-size sample of 10 rows in which each row has a min(1, 10/n) probability of being included in the sample, where n is the number of rows in the table

All details about the Estimation functions are available in this Youtube Video:

System Functions:

● Snowflake provides the following types of system functions –

  • Control functions that allow you to execute actions in the system (e.g., aborting a query).
  • Information functions that return information about the system (e.g., calculating the clustering depth of a table).
  • Information functions that return information about queries (e.g., information about EXPLAIN plans).

● Many of these system functions have the prefix SYSTEM$ (e.g., SYSTEM$TYPEOF).

● For the system functions that use SYSTEM$ prefix, must specify the prefix when calling the function.

Examples:

Aborts the specified session
Cancels all active/running queries in the specified session.
Recursively resumes a specified task and all its dependent tasks.
Returns hostnames and port numbers to add to your firewall’s allowed list so that you can access Snowflake from behind your firewall.
Returns hostnames and port numbers for AWS PrivateLink, Azure Private Link, and Google Cloud Private Service Connect deployments.
Retrieves a JSON representation of the current status of a pipe.
Given the text of a SQL statement, this function generates the EXPLAIN plan in JSON.

Table Functions:

● A table function returns a set of rows for each input row.

● The returned set can contain zero, one, or more rows. Each row can contain one or more columns.

● Table functions are sometimes called “tabular functions”.

● Table functions are typically used when a function returns multiple rows for each individual input.

● Example — A function that accepts an account number and a date, and returns all charges billed to that account on that date. (More than one charge might have been billed on a particular date.)

System-Defined Table Functions — Snowflake provides hundreds of built-in functions, many of which are table functions.

User-Defined Table Functions — Users can also write their own functions, called user-defined functions or “UDFs”.

Examples:

Using a Table Function in the FROM clause
Using a Table as Input to a Table Function
Using a Table as Input to a Table Function
System defined function — Data loading INFER_SCHEMA

All the details about System and Table Functions are available in this Youtube Video:

Photo Courtesy:

https://docs.snowflake.com

Reference Links:

Cardinality estimation:

https://docs.snowflake.com/en/user-guide/querying-approximate-cardinality

Similarity estimation:

https://docs.snowflake.com/en/user-guide/querying-approximate-similarity

Frequency Estimation:

https://docs.snowflake.com/en/user-guide/querying-approximate-frequent-values

Percentile Estimation:

https://docs.snowflake.com/en/user-guide/querying-approximate-percentile-values

Sampling:

https://docs.snowflake.com/en/sql-reference/constructs/sample

System Functions:

https://docs.snowflake.com/en/sql-reference/functions-system

Table Functions:

https://docs.snowflake.com/en/sql-reference/functions-table

Catch you in the next blog — Snowflake Snowpro Core Preparation Part 15— Semi Structured Data Transformations & Unstructured Data Transformations using Directory Tables & File Functions In Snowflake — https://medium.com/@trustngs/snowflake-snowpro-core-preparation-part-15-semi-structured-data-transformations-unstructured-ee1199ad6a0a

--

--

Ganapathy Subramanian.N

Director - Data Engineering @ Tiger Analytics. 100+ Technical Certifications. AWS-CB, Cloud(Architect,Devops,Data Engg,DW&ML),YouTuber & Blogger