Efficient Non-Deterministic Sampling of Large BigQuery Tables

Matt Dixon
Google Cloud - Community
Feb 26, 2024

As table size grows, so too does the cost, execution time, and iteration time of downstream tasks that depend on that table. Of course, in modern cloud data warehouses and MPP database systems, techniques such as clustering and partitioning replicate (and often multiply) the utility of the tried-and-true indexing solutions of relational DBMSs. Sometimes, however, you need a representative sample of the table’s data, and limiting results to a single partition (or even a range of partitions) doesn’t cut it. So how can we do this efficiently in BigQuery? This is where a lesser-known feature (currently still in preview) comes to the rescue: TABLESAMPLE

Let’s pretend you are working on a super complicated analytical query that will be responsible for feeding multiple downstream reporting processes, hydrating a feature store, and producing data quality alerts for proactive resolution of issues. The source for this data is a customer events table that is over 10 TB in size. Your team has followed best practices and partitioned by event_datetime and clustered by customer_id, but you want to develop, test, and iterate on your query against a representative dataset without the overhead of querying the entire table every time you want to make a change (oh, and your boss really doesn't want to see the bill you’re going to generate either). So what do you do?

For illustration here I am using the Wikipedia public dataset available in BigQuery. It is ~36GB on disk (kind of small by today’s standards), but the efficiency gains we will get here can be extrapolated out as table size increases.

The Naive LIMIT Clause

For our first attempt, we simply limit the results of a SELECT * query to 1000 records using the LIMIT clause.
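For reference, the query looks something like this:

-- Naive approach: LIMIT alone still scans the full table.
SELECT * FROM `bigquery-public-data.samples.wikipedia`
LIMIT 1000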

As you can see (and probably guessed), this will scan the ENTIRE table and then limit the result-set returned to the calling process. BAD. Let’s try something else.

Level 1: TABLESAMPLE

Now, let’s break out the TABLESAMPLE clause. This uses the metadata that BQ stores about the table’s physical data blocks to query random subsets of data (random data blocks) from large BigQuery tables. The docs for it are here but, at its core, it lets you specify the percentage of the source table that BigQuery should read (randomly selected blocks adding up to roughly that share of the original size) before executing the associated query.
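Applied to our Wikipedia table, the clause looks something like this:

-- Sample roughly 10% of the table's data blocks.
SELECT * FROM `bigquery-public-data.samples.wikipedia`
TABLESAMPLE SYSTEM (10 PERCENT)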

From our example here, you can see that after supplying the TABLESAMPLE SYSTEM (10 PERCENT) clause, our volume of data processed drops from 35.69 GB to 3.56 GB (for all you math wizards out there, that is indeed 10%).

Sick. But we can do better. Depending on the size of the table, this can still return a ton of records, which could slow us down. We just want to work with, say, 1000 records for now. Can we accomplish this?

Level 2: TABLESAMPLE with a LIMIT

Nothing groundbreaking here, but you can still use the LIMIT clause we leveraged in our first attempt AFTER we sample the data. This limits the result-set after reducing the source data size through the TABLESAMPLE clause (in this case, roughly 10% of the original source data).

SELECT * FROM `bigquery-public-data.samples.wikipedia` 
TABLESAMPLE SYSTEM (10 PERCENT)
LIMIT 1000

You’ll see from this screenshot that we are still, in fact, only processing 3.56 GB of data, but have constrained our result-set to only 1000 records.

Okay, this is good, but how random/representative is this sample of data? Remember, this is one of the key reasons we are going through this whole exercise in the first place. Currently we are not explicitly adding any logic to force randomization; we are relying solely on an assumption that the above SQL will give non-deterministic results. For reasons beyond the scope of this article, we should not trust this to be the case. (Take my word for it, for now.) So, how do we force the non-deterministic behavior we are after?

Level 3: Explicitly randomizing the result-set

For this we are going to leverage a built-in function in BQ called rand(). This will allow us to both select a random subset of rows from our sampled table and order this subset randomly, before we finally limit the result.

SELECT * FROM `bigquery-public-data.samples.wikipedia` 
TABLESAMPLE SYSTEM (10 PERCENT)
WHERE rand() < 0.1 -- Percent of rows from table subset.
ORDER BY rand() -- Randomly ordered.
LIMIT 1000

You’ll see from this screenshot that we are still only processing 3.56 GB of data and have constrained our result-set to 1000 records. This logic gives us the added benefit of (mostly, see below for why) non-deterministic results.

Level 4: Making this repeatable

Now, we’ve spent some time building up this logic, and although it is not all that complicated when broken down into its parts, let’s be sure we make this process repeatable, parameterized, and shareable (both the code and the results) across members of the team.

For the purposes of this demo, in order to keep it relatively simple, I’ve kept this 100% SQL based and leveraged Stored Procedures for the implementation. You could just as easily wrap some of this code in Python and use it through CI/CD or shared code repos as well, but again, that’s outside the scope of this article.

First, let’s write the Stored Procedure. Because we want to make this parameterized and one of those parameters is the source table, we need to leverage EXECUTE IMMEDIATE to run the dynamic SQL we build within the procedure.

Here’s the code:


CREATE OR REPLACE PROCEDURE `development.rand_table_sample`(
  source_table STRING,             -- Fully qualified source table (including backticks).
  dest_table STRING,               -- Fully qualified destination table (including backticks).
  tablesample_percent_int FLOAT64, -- Percent of the source table to sample via TABLESAMPLE.
  rows_sampled_percent FLOAT64,    -- Fraction (0-1) of sampled rows to keep via rand().
  row_limit INT64                  -- Maximum number of rows in the final result-set.
)
BEGIN

DECLARE sql STRING;

-- Build the dynamic sampling statement against the supplied source table.
SET sql = format("""
CREATE OR REPLACE TABLE %s AS
SELECT * FROM %s TABLESAMPLE SYSTEM (%f PERCENT)
WHERE rand() < %f
ORDER BY rand()
LIMIT %d
""", dest_table, source_table, tablesample_percent_int, rows_sampled_percent, row_limit);

-- CREATE SAMPLE DATASET.TABLE
EXECUTE IMMEDIATE (sql);

END;

To call it, you pass in the source table from which you want to copy the data, the destination table where you want to persist the sample, the percentage of the source table to sample (through TABLESAMPLE), the percentage of rows to keep from that subset, and the number of rows to limit the final result-set to.

-- DECLARE VARIABLES
DECLARE tablesample_percent_int FLOAT64;
DECLARE sampled_table STRING;
DECLARE rows_sampled_percent FLOAT64;
DECLARE row_limit INT64;
DECLARE destination_table STRING;

-- SET VARIABLES
SET sampled_table = '`bigquery-public-data.samples.wikipedia`';
SET destination_table = '`development.sample_wiki_1000`';
SET tablesample_percent_int = 10;
SET rows_sampled_percent = 0.1;
SET row_limit = 1000;

-- SAMPLE AND COPY DATA
CALL `development.rand_table_sample`(sampled_table, destination_table, tablesample_percent_int, rows_sampled_percent, row_limit );

As you probably guessed, this persists the data into destination_table for reuse. That table can then be referenced by multiple users, processes, and procedures to build downstream work off of a representative sample of the sampled_table.

SELECT * FROM `development.sample_wiki_1000`

Improvement Opportunities

I won’t pretend this is a perfect solution, but it is one that has worked for me multiple times over the years. I admit, the stored procedure implementation is not the route I have gone in the past (I have historically opted for a version-controlled, Python-wrapped implementation), but you can’t beat the simplicity of the SP version.

There is no notion of maintaining referential integrity. If this process is executed against two separate tables that share an implied referential constraint, there is no guarantee that the constraint will be maintained across the samples (i.e., you may have orphaned rows). This is becoming less and less of a problem in cloud data warehouses as extensions to standard SQL, such as arrays and structs, become commonplace, eliminating much of the need for Star or Snowflake schemas in these systems. I did want to call it out, though, as a current weakness of this implementation.
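To make the orphaned-rows point concrete, here is a hypothetical pair of calls (the development.orders and development.customers tables are made up for illustration). Nothing ties the two samples together, so a customer_id present in the orders sample may be missing from the customers sample:

-- Two independent samples of (hypothetically) related tables.
CALL `development.rand_table_sample`('`development.orders`', '`development.sample_orders_1000`', 10, 0.1, 1000);
CALL `development.rand_table_sample`('`development.customers`', '`development.sample_customers_1000`', 10, 0.1, 1000);
-- A customer_id in sample_orders_1000 is not guaranteed to exist in sample_customers_1000.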

Lastly, as the percentage of the table you are sampling shrinks (i.e., as the TABLESAMPLE percent trends toward 0) and the percentage of rows kept from that sample increases, the more deterministic the set of records included in the result-set becomes. The order should remain random, but basically, the more data you select from the shrunken source table, the less random the sample will be. This may be obvious, but I wanted to call it out.
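As a rough illustration, a parameterization like the following (values chosen purely for illustration, reusing the variables declared in the earlier calling snippet) keeps nearly every row that TABLESAMPLE returns, so which records you end up with is driven almost entirely by the block selection and the LIMIT:

-- Tiny TABLESAMPLE percentage plus a high rand() threshold:
-- nearly every row from the small sampled subset survives the WHERE clause.
CALL `development.rand_table_sample`(sampled_table, destination_table, 1, 0.99, 1000);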

I am curious. What techniques have you used over the years to accomplish similar things? What did you think of this implementation? And if you are reading this and work for Google — when is this feature coming out of Preview and into GA?!

Until next time… ☟

https://www.beardeddata.com

