Why do you need random sampling?
Random sampling is one of the most important techniques for data scientists. It is especially useful in the following situations:
- checking the data visually
- reducing volume of the data
- splitting the data for validation of the ML model (i.e. train/valid/test split, k-fold cross-validation)
There are many posts describing sampling techniques in Python. But if the amount of data is too large to handle in Python, BigQuery can be a desirable choice. Here, I'm going to share some tips for random sampling in BigQuery using public data, and you can quickly try all the queries below if you have access to BigQuery.
TABLESAMPLE Sampling
The TABLESAMPLE clause allows for easy sampling by specifying the percentage of records to be sampled. Moreover, the amount of data scanned is also reduced with the TABLESAMPLE clause, so it helps reduce query costs.
select * from `bigquery-public-data.samples.natality`
tablesample system (0.1 percent)
Keep in mind that the TABLESAMPLE clause is not reproducible: running the same query twice can return different results.
Also, the TABLESAMPLE clause samples data not by individual records but by storage blocks. A BigQuery table is usually composed of multiple blocks, but if the table is small (roughly less than 1 GB) it may consist of a single block, and in that case the TABLESAMPLE clause will not reduce the data at all.
Here is an example: bigquery-public-data.samples.shakespeare consists of a single block, so both of the following queries return the same number of records.
--with TABLESAMPLE clause
select count(*) from `bigquery-public-data.samples.shakespeare`
tablesample system (0.1 percent)
----------------------------------------------------------------
164656
--without TABLESAMPLE clause
select count(*) from `bigquery-public-data.samples.shakespeare`
----------------------------------------------------------------
164656
FARM_FINGERPRINT Sampling
Another way to sample your data is to use the FARM_FINGERPRINT function. This function returns a hash value, so the result is both sufficiently random and reproducible.
select * from `bigquery-public-data.samples.shakespeare`
order by farm_fingerprint(format('%T', (word || corpus, "random_seed")))
limit 10000
The random seed, and therefore the sampling result, can be changed by replacing the string "random_seed" with another string.
select * from `bigquery-public-data.samples.shakespeare`
order by farm_fingerprint(format('%T', (word || corpus, "random_seed_2")))
limit 10000
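If you need a reproducible fixed fraction of the table rather than a fixed number of records, one common pattern (a sketch, not from the original post) is to bucket each row by its fingerprint modulo 100 and keep the rows below a threshold; here 10 keeps roughly 10% of rows:

```sql
-- Keep roughly 10% of rows, reproducibly.
-- mod(abs(...), 100) maps each row to a stable bucket from 0 to 99;
-- changing the string "random_seed" reshuffles the bucket assignment.
select * from `bigquery-public-data.samples.shakespeare`
where mod(abs(farm_fingerprint(format('%T', (word || corpus, "random_seed")))), 100) < 10
```

The same idea extends to the train/valid/test splits mentioned above: for example, buckets below 80 for training, 80 to 89 for validation, and the rest for testing.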
Unlike TABLESAMPLE sampling, a query using FARM_FINGERPRINT scans the full table, so its cost is the same as a full scan and it does not help reduce costs.
Segment Sampling
If you want to sample N records from each segment, combine the FARM_FINGERPRINT function with the ROW_NUMBER function.
In this example, five records are sampled from each corpus.
select * from `bigquery-public-data.samples.shakespeare`
where true
qualify row_number() over (
partition by corpus
order by farm_fingerprint(format('%T', (word || corpus, "random_seed")))
) <= 5
Conclusion
There are several ways to perform random sampling, and each method has pros and cons. In short: if you want to reduce BigQuery costs, TABLESAMPLE sampling is suitable; if you need reproducible sampling, FARM_FINGERPRINT sampling is recommended.