Provision a Faster Development Dataset Using Snowflake Table Sampling

In this blog, we are going to discuss how Snowflake Table Sampling can help you create a faster development dataset and ease your development lifecycle.

We all love getting our development queries to complete faster on production-like data. But masked production data comes in huge volumes, which doesn't support a fast development lifecycle.

Say you are working on a new feature, building it from scratch; you will go through a lot of trial and error before reaching the final state. This blog is all about the significant role data volume plays in that loop. Think of running a query on a small dataset versus a big one when your goal is to finish development quickly, not to test performance (for performance testing, you will get a different environment). Table sampling uses different probability methods to create a sample dataset from a production-sized dataset. A smaller, production-like sampled dataset means a faster outcome; whether the query passes or fails, it gives you a boost into your next step.

Let's understand this better with a simple example. Say you want to capture a picture of a sight you recently visited. That can be done nicely with your mobile phone camera or any digital camera, and the same can also be done with a professional photo-shoot camera. It depends on the scope of your requirement: are you interested in the detail of every data point in the photograph, or do you just need to capture the moment?

So if you need the detail, you will go for the professional camera; otherwise the phone camera will do a decent job for you. Similarly, it is not always necessary to develop your code against production-volume data; the same can be achieved with a smaller set of sampled data.

In this blog, we are going to look at the different table sampling methods and how they can help provision a development dataset.

What is Snowflake Table Sampling?

Below is the syntax of how we can do table sampling in Snowflake.

SELECT …
FROM …
{ SAMPLE | TABLESAMPLE } [ samplingMethod ] ( { <probability> | <num> ROWS } ) [ { REPEATABLE | SEED } ( <seed> ) ]
[ … ]

This syntax is nicely defined in the Snowflake documentation.

Snowflake Table Sampling returns a subset of rows sampled randomly from the specified table. The following sampling methods are supported:

  • Sample a fraction of a table, with a specified probability for including a given row. The number of rows returned depends on the size of the table and the requested probability. A seed can be specified to make the sampling deterministic.
  • Sample a fixed, specified number of rows. The exact number of specified rows is returned, unless the table contains fewer rows.

SAMPLE and TABLESAMPLE are synonymous and can be used interchangeably. The following keywords can be used interchangeably:

SAMPLE | TABLESAMPLE
BERNOULLI | ROW
SYSTEM | BLOCK
REPEATABLE | SEED

What are the different sampling methods Snowflake supports?

Snowflake supports two different methods for table sampling.

  • BERNOULLI (or ROW).
  • SYSTEM (or BLOCK).

What is BERNOULLI or ROW sampling?

In this sampling method, Snowflake includes each row with a probability of <p>/100. The resulting sample size is approximately <p>/100 * the number of rows in the FROM expression. This is the default method if no sampling method is specified.

It is similar to flipping a weighted coin for each row. Let's see more in the demo below:
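For instance, a minimal sketch (the ORDERS table name here is just an illustration, not a real table from this blog):

```sql
-- Return roughly 10% of the rows; each row is included
-- independently with probability 10/100.
SELECT *
FROM orders SAMPLE BERNOULLI (10);

-- Equivalent, using the ROW keyword:
SELECT *
FROM orders TABLESAMPLE ROW (10);
```

Because BERNOULLI/ROW is the default, `SAMPLE (10)` with no method keyword behaves the same way.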

What is SYSTEM or BLOCK sampling?

In this sampling method, Snowflake includes each block of rows with a probability of <p>/100. The sample is formed of randomly selected blocks of data rows, i.e. the micro-partitions of the FROM table.

It is similar to flipping a weighted coin for each block of rows. This method does not support fixed-size sampling. Let's see more in the demo below:
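A minimal sketch, again using a hypothetical ORDERS table:

```sql
-- Return roughly 10% of the table, selected micro-partition
-- (block) by micro-partition rather than row by row.
SELECT *
FROM orders SAMPLE SYSTEM (10);

-- Equivalent, using the BLOCK keyword:
SELECT *
FROM orders TABLESAMPLE BLOCK (10);
```

Because whole micro-partitions are kept or skipped, the actual row count can deviate more from <p>/100 * n than with row sampling, especially on small tables.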

What is the REPEATABLE | SEED parameter?

The REPEATABLE or SEED parameter makes sampling deterministic: generating samples with the same <seed> and <probability> from the same table will return the same sample, as long as the table is not modified.
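For example (hypothetical ORDERS table), these two queries return the exact same sample on every run, as long as the table has not been modified in between:

```sql
-- Same seed + same probability = same sample each run.
SELECT * FROM orders SAMPLE SYSTEM (10) SEED (99);

-- REPEATABLE is a synonym for SEED:
SELECT * FROM orders SAMPLE SYSTEM (10) REPEATABLE (99);
```

Without the seed, each execution would draw a fresh random sample.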

Which is the better solution for faster development dataset creation: Snowflake Table Clone or Table Sampling?

Both Table cloning & Table sampling can help you provision a development environment faster.

Snowflake Zero Copy Clone is one of the most powerful features in Snowflake. It provides a convenient way to quickly take a point-in-time "snapshot" of any table, schema, or database by creating a reference to the underlying partitions, so the clone shares the source's storage until either side is modified.

Table sampling helps you create smaller datasets from production datasets based on the sampling method you opt for.

A clone helps provision a development environment in a fraction of the time because it doesn't copy the source data; it references the source storage. It only costs storage if you modify the cloned data. But since it references the source, every query on the cloned table runs against production-like volume and hence costs compute.

Sampling, on the other hand, creates a smaller dataset out of the bigger chunk, so it costs you some storage but at the same time helps you reduce compute cost.

One thing to keep in mind: in today's world, storage is cheaper than compute.

Let's see the same in action in the demo below:
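The trade-off can be sketched like this (the database, schema, and table names are hypothetical):

```sql
-- Zero-copy clone: instant, no extra storage until data changes,
-- but every query still scans production-sized data.
CREATE TABLE dev_db.public.orders_clone
  CLONE prod_db.public.orders;

-- Sampled copy (CTAS): costs a little storage for the ~1% subset,
-- but every development query scans far less data.
CREATE TABLE dev_db.public.orders_dev AS
SELECT * FROM prod_db.public.orders SAMPLE SYSTEM (1);
```

For iterative development work, the sampled copy usually wins: the small one-time storage cost is quickly offset by the compute saved on every query.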

Can we do sampling during a table join?

Yes. You can sample all of the joined tables or just some of them, and you can also sample the join result set. You can also sample a table based on a fraction of rows. You can see all of this in the live demo below:
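Both variants can be sketched as follows (ORDERS and CUSTOMERS are hypothetical tables):

```sql
-- Sample each input table before joining:
SELECT o.order_id, c.customer_name
FROM orders SAMPLE (5) o
JOIN customers SAMPLE (20) c
  ON o.customer_id = c.customer_id;

-- Or sample the join result itself (must be row-based, no seed);
-- note the join is still fully processed first:
SELECT *
FROM (
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id
) SAMPLE (10);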

Things to Remember:

  • In addition to using literals to specify <probability> | <num> ROWS and seed, session or bind variables can also be used.
  • SYSTEM | BLOCK sampling is often faster than BERNOULLI | ROW sampling.
  • Sampling without a seed is often faster than sampling with a seed.
  • Fixed-size sampling can be slower than equivalent fraction-based sampling because fixed-size sampling prevents some query optimization.
  • Sampling the result of a JOIN is allowed, but only when all of the following are true:
      ◦ The sample is row-based (Bernoulli).
      ◦ The sampling does not use a seed.
      ◦ The sampling is done after the join has been fully processed. Therefore, sampling does not reduce the number of rows joined and does not reduce the cost of the JOIN.

For Fraction-based:

  • For BERNOULLI | ROW sampling, the expected number of returned rows is (p/100)*n.
  • For SYSTEM | BLOCK sampling, the sample might be biased, in particular for small tables.
  • For very large tables, the difference between the two methods should be negligible.
  • Also, because sampling is a probabilistic process, the number of rows returned is not exactly equal to (p/100)*n rows, but is close.

For Fixed-size:

  • If the table is larger than the requested number of rows, the number of requested rows is always returned.
  • If the table is smaller than the requested number of rows, the entire table is returned.
  • SYSTEM | BLOCK and seed are not supported for fixed-size sampling.
  • Sampling with a <seed> is not supported on views or subqueries.
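Fixed-size sampling, given the constraints above, can be sketched as (hypothetical ORDERS table):

```sql
-- Return exactly 100 rows (or the whole table if it has fewer).
-- Only BERNOULLI / ROW sampling supports fixed-size, and no seed
-- is allowed.
SELECT *
FROM orders SAMPLE (100 ROWS);
```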

Hope this blog & YouTube video help you gain insight into Snowflake Table Sampling and how it can help you create a faster development dataset to ease your development lifecycle. If you are interested in learning more about Snowflake Table Sampling, you can refer to the Snowflake documentation. Feel free to ask a question in the comment section if you have any doubts. Give a clap if you like the blog, and stay connected to see many more such cool stuff. Thanks for your support.

You Can Find me:

Follow me on Medium: https://rajivgupta780184.medium.com/

Follow me on Twitter: https://twitter.com/RAJIVGUPTA780

Connect with me on LinkedIn: https://www.linkedin.com/in/rajiv-gupta-618b0228/

Subscribe to my YouTube Channel: https://www.youtube.com/c/RajivGuptaEverydayLearning

#Keep learning #Keep Sharing #Everyday Learning.
