Smarter Analytics in Snowflake

We live in a time when cloud data platforms like Snowflake can scale flexibly to analyse large volumes of data. But just because we can do something doesn't mean we should. Take the Census, for example: it takes place every 10 years, every citizen is contacted, and the entire operation requires years of planning and costs billions of dollars.

There must be a better way, and as it turns out, there is. Research analysts can answer most questions by sampling a population. Done right, data sampling for analytics reduces data volumes, which in turn speeds up time to insight and reduces costs.

Before we go any further let’s define what data sampling is:

Data sampling is the process of creating a subset of a full data set that can be analysed to uncover insights that hold for the full data set.

In this post, we go through two options for creating sampled datasets.

The first option uses out-of-the-box sampling functionality from Snowflake that gets you up and running quickly, so you can validate whether sampling is a viable technique for your analytical use cases.

The second option is more involved: it maintains a sampled dataset as an ongoing workload. Once you've established that sampling is viable, you can use this option to sample subjects randomly and grow the sample deterministically.

Option 1: Quick Sampling

Snowflake provides the sample keyword, which can be used to select a subset of rows from a table. The example below uses it to select subjects (in this case, users) from the user table.

-- sample each row with a probability of 0.1%, using a fixed seed
-- so repeated runs over an unchanged table return the same rows
create or replace table user_sample as
select user_id from user
sample row (0.1) seed (4);

Note that deterministic sampling is guaranteed only if the data set being sampled is a table, the table is unchanged between runs, and the same seed value is used on every run.
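As a quick check that your setup meets these conditions, you can sample twice with the same seed and confirm the two result sets match (a sketch; it assumes the user table is unchanged between the two reads):

-- If sampling is deterministic, the MINUS of two identical
-- seeded samples returns zero rows.
(select user_id from user sample row (0.1) seed (4))
minus
(select user_id from user sample row (0.1) seed (4));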

Once you have a sample of users, you can use the list to restrict transactional data to events that belong to those users.

create or replace table user_event_sample as
select *
from event
where user_id in (
    select user_id
    from user_sample
);
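Because the rows were sampled uniformly at a known rate, aggregates over the full dataset can be estimated from the sample by scaling up. For example (a sketch, assuming the 0.1% rate used above):

-- Estimate the total event count by scaling the sample count
-- by the inverse of the 0.1% sampling rate (1000x).
select count(*) * 1000 as estimated_total_events
from user_event_sample;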

For a detailed account of these foundational sampling features, please refer to Snowflake's documentation on sample or this excellent write-up on sampling to create development environments by Rajiv Gupta.

Option 2: Continuous, Ongoing Sampling

The first option allows us to create sampled data quickly. Once the value of sampling has been demonstrated, you can use this option to create and maintain a sampled dataset that grows along with the full dataset.

Step 1 — Standardise Subject Identifiers

The first step is to convert the subject identifiers to be sampled into values that are more evenly and randomly distributed. For this we use Snowflake's hash function.

hash(user_id)

The hash function returns a signed 64-bit integer. To keep the arithmetic simple, we'll apply the abs function so the value is always non-negative.

abs(hash(user_id))
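As a quick illustration (the literal user id is arbitrary):

-- hash can return negative values; abs makes the result non-negative
select hash('user-42') as hashed,
       abs(hash('user-42')) as standardised;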

Note that the limitations of the hash function, and the documentation's advice not to use it to create unique values, do not apply here. We do not expect the returned values to be unique; we use the hash function mainly to produce a more evenly distributed, random-looking value.
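You can check the distribution empirically; for a well-distributed identifier the bucket counts below should be roughly equal (a sketch against the user table):

-- Spread users across 10 buckets and count each one.
select
    mod(abs(hash(user_id)), 10) as bucket,
    count(*) as users_in_bucket
from user
group by bucket
order by bucket;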

Step 2 — Sample

Now that we have an evenly distributed number for each subject, we can use simple modular arithmetic to sample our dataset. To do this we use Snowflake's mod function. To sample 1 in 3 users, we would use the following expression.

mod(abs(hash(user_id)), 3) = 1

To sample 1 in 1,000 users, we'd use the following expression.

mod(abs(hash(user_id)), 1000) = 1

To sample 3 in 1,000 users, we'd use the following expression.

mod(abs(hash(user_id)), 1000) between 1 and 3
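To confirm that an expression yields roughly the expected rate, you can compare the sampled fraction against the target (a quick sanity check; assumes the user table from earlier):

-- Compare the sampled fraction of users against the expected 1-in-3 rate.
select
    count_if(mod(abs(hash(user_id)), 3) = 1) as sampled_users,
    count(*) as total_users,
    sampled_users / total_users as sampled_fraction
from user;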

Step 3 — Create an Ongoing Sample

Putting it all together, to sample from the user table we would do the following.

create or replace table user_sample (
user_id string
);

insert into user_sample
select user_id from user
where mod(abs(hash(user_id)), 3) = 1;

It’s even easier to create a sample over transactional data.

create or replace table user_event_sample (
...
);

insert into user_event_sample
select * from user_event
where mod(abs(hash(user_id)), 3) = 1;

By using the same expression in the where clause, we ensure that the same set of users is selected across the user_sample and user_event_sample tables.
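We can verify this directly (a sketch; the query should return zero if the two samples are consistent):

-- Every user in user_event_sample should also appear in user_sample.
select count(*) as orphan_events
from user_event_sample e
where not exists (
    select 1
    from user_sample s
    where s.user_id = e.user_id
);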

The insert statements can be added to your existing data pipelines to continuously build the sampled data sets and keep them ready for analytical use.
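For example, an incremental load might look like the following (a sketch only; it assumes a hypothetical event_timestamp column on user_event that can serve as a load watermark):

-- Load only events newer than the latest event already sampled.
-- event_timestamp is a hypothetical watermark column.
insert into user_event_sample
select *
from user_event
where mod(abs(hash(user_id)), 3) = 1
  and event_timestamp > (
      select coalesce(max(event_timestamp), '1970-01-01'::timestamp)
      from user_event_sample
  );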
