Sampling data with dbt and BigQuery: a strategy to reduce costs

Gabriel Campos
Indicium Engineering

--

When modeling data with dbt, or simply when creating a table directly in BigQuery, you may have come across very large tables that make data transformation quite expensive, especially in environments that charge for processing. My goal with this text is to describe a way to run data transformations on sample data using dbt and BigQuery. I hope it helps others who want to better manage transformation costs in projects that involve building a data warehouse, and that it opens a discussion about better ways to work with sample data.

The tablesample clause and partitioned tables

Before going into how to work with sample data directly in dbt and BigQuery, it seems appropriate to first talk about BigQuery's tablesample clause and the use of partitioned tables.

Let’s say you want to run the following example query in order to reduce costs:

select
    food_id
    , food_type
    , food_description
from `food_project.schema_food.food_table_raw`
where cd_date = '2022-07-08'
limit 5

It is a common misconception that the where clause by itself reduces the amount of data processed. It doesn't: to evaluate the where condition, BigQuery still has to scan all records to check whether each one matches the date (2022-07-08). Using limit does not reduce the amount of data processed either. Another mistake is to think that the limit clause returns a random subset of rows from the table; it doesn't.
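
If you want to check what a query will actually cost before running it, a dry run reports the number of bytes that would be processed without executing the query. A minimal sketch using the bq command-line tool (assuming it is installed and authenticated against the project):

bq query --use_legacy_sql=false --dry_run \
'select food_id, food_type, food_description
 from `food_project.schema_food.food_table_raw`
 where cd_date = "2022-07-08"'

For a non-partitioned table, the reported bytes should match the full size of the selected columns, confirming that the where clause alone does not shrink the scan.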

According to the Google Cloud Platform documentation on the tablesample clause:

Unlike the LIMIT clause, TABLESAMPLE returns a random subset of data from a table. Also, BigQuery does not cache the results of queries that include a TABLESAMPLE clause, so the query might return different results each time.

Thus, the following query selects a 30% sample of the data from the table:

select
    food_id
    , food_type
    , food_description
from `food_project.schema_food.food_table_raw` tablesample system (30 percent)

As the official documentation explains, BigQuery tables are stored in data blocks. In this example, the tablesample clause makes the query randomly select roughly 30% of the table's blocks and read only the rows in those blocks, which also saves costs because the entire table does not have to be read.

You can combine tablesample with other clauses, such as where. In the following example, the query reads only 30% of the table's blocks and, within those selected blocks, keeps only the rows that have the desired date.

select
    food_id
    , food_type
    , food_description
from `food_project.schema_food.food_table_raw` tablesample system (30 percent)
where cd_date = '2022-07-08'

There are more examples of clause combinations and further details in the official documentation of the tablesample clause.

Another way to further reduce costs is to create the sample tables as partitioned tables. The following example partitions the table by extraction date.

create or replace table `food_project.schema_food.food_table_sample`
partition by extraction_date as (

    select
        food_id
        , food_type
        , food_description
        , parse_date('%Y-%m-%d', extracted_at) as extraction_date
    from `food_project.schema_food.food_table_raw` tablesample system (30 percent)
    where cd_date = '2022-07-08'

)

According to the GCP documentation on partitioned tables,

a partitioned table is a table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and control costs by reducing the number of bytes read by a query.
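
For example, once the sample table created above is partitioned by extraction_date, a query that filters on that column only scans the matching partitions, so BigQuery reads far fewer bytes (a sketch reusing the table from the previous example):

select
    food_id
    , food_type
    , food_description
from `food_project.schema_food.food_table_sample`
where extraction_date = '2022-07-08'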

However, if you create samples from tables that are already partitioned, keep in mind that tablesample results from partitioned tables tend to be more biased, since partitioning produces blocks where all the rows in a block share the same partitioning key.

Creating and working with sample data in dbt

Having covered partitioned tables and the tablesample clause, my goal now is to present a way of working with sample data in dbt, using BigQuery.

First, create a folder for samples inside the models folder of your dbt project; you can name it "samples", for example. It is also a good idea to have a folder for each data source inside the "samples" folder, in order to keep the work organized.

Inside the samples folder, within the folder of the data source you want to sample, create a .sql file containing the code that does the sampling.
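
As an illustration, the project could end up with a layout along these lines (the folder and file names here are just an assumed example):

models/
  samples/
    source_food/
      sample_food_table.sql
      sample_source.yml

The sampling model itself, inside that .sql file, could look like the following: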

{{ config(
    materialized='table',
    partition_by={
        "field": "extraction_date",
        "data_type": "date"
    }
) }}

with

raw_table as (

    select
        food_id
        , food_type
        , food_description
        , parse_date('%Y-%m-%d', extracted_at) as extraction_date
    from {{ source('data_source_food', 'food_table_raw') }} tablesample system (30 percent)
    where cd_date = '2022-07-08'

)

select *
from raw_table
Note that we are partitioning the created sample table by extraction date, and we are selecting only 30% of the table's storage blocks, with a where clause that keeps only the rows where cd_date is '2022-07-08'.

Attention: when sampling data, make sure that related tables are sampled over the same range. For example, if table1 needs to join with table2 and each one was sampled over a different date range, the join between them will almost certainly fail to match rows.
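
One way to keep the samples aligned is to centralize the sampled date in a single place. The sketch below assumes a hypothetical sample_date variable declared in dbt_project.yml, which every sample model then reads, so all samples cover the same date:

# dbt_project.yml
vars:
  sample_date: '2022-07-08'

-- in each sample model
where cd_date = '{{ var("sample_date") }}'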

After creating the sample model for the desired table, add a sample_source.yml file that declares the created table as a source, as in the following example:

version: 2

sources:
  - name: sample_raw_food
    tables:
      - name: sample_food_table

After these steps, you need to set up a few things in your profiles.yml and dbt_project.yml files. Below is an example of how to do it in profiles.yml.

project_dbt:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: dw-project
      dataset: stg_analyst_name
      threads: 2
      timeout_seconds: 300
      priority: interactive
      retries: 1
      maximum_bytes_billed: 5500000000
    sample:
      type: bigquery
      method: oauth
      project: dw-project
      dataset: sample_raw
      threads: 2
      timeout_seconds: 300
      priority: interactive
      retries: 1
      maximum_bytes_billed: 5500000000

Notice that there is a sample output, in addition to the usual dev output used for development, just below it. This makes it possible to run one of the following commands so that the samples are written to the sample_raw dataset:

dbt run --target sample

or

dbt run -m sample_model --target sample

By running these commands, dbt creates the tables with the sample results in the BigQuery sample dataset (in BigQuery, a schema is called a dataset).

After generating the sample data in the sample dataset, you could already start working with it. However, there is another interesting configuration in dbt_project.yml that can further reduce costs.

Once a sample has been created, there is no need to recreate it over and over. So let's configure dbt_project.yml to disable the folder where the sample models live: models in the samples folder will not run while the +enabled config is set to false.

models:
  project_dbt:
    samples:
      +enabled: false
      +materialized: table
      source_food:
        +schema: food

Notice the +enabled config, which controls whether the models in the "samples" folder are allowed to run. Whenever you want to rebuild a data sample, change enabled from false to true, run the sample models, and then set it back to false. This avoids unnecessary costs, since the samples are not regenerated on every run.

You can also make this switch between true and false happen automatically, using the Jinja expression shown in the code below.

models:
  project_dbt:
    samples:
      +enabled: "{{ true if target.name == 'sample' else false }}"
      +materialized: table
      source_food:
        +schema: food

Alternating between the sample data and the original data

After all the necessary settings are made, you can now use the sample data to do the data transformations and build your Data Warehouse in order to save costs.

However, it is important to add some Jinja to the models that consume the sample data, so that dbt can switch between the sample data in development environments and the original data in production. Below is an example of how to use Jinja to achieve this switch between environments.

with

example_table as (

    select
        food_id
        , food_type
        , food_description
    {% if target.name == 'prod' %}
    from {{ source('data_source_food', 'food_table_raw') }}
    {% else %}
    from {{ source('sample_raw_food', 'sample_food_table') }}
    {% endif %}

)

select *
from example_table

This way, whenever the model is not running in production (that is, without the prod target), dbt builds it from the sample table instead of the original one.
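
If many models need this switch, the if/else block gets repetitive. One option is to wrap it in a small macro; the sketch below assumes a hypothetical macros/sample_or_source.sql file and the source names used earlier:

{% macro sample_or_source(source_name, table_name, sample_source_name, sample_table_name) %}
    {%- if target.name == 'prod' -%}
        {{ source(source_name, table_name) }}
    {%- else -%}
        {{ source(sample_source_name, sample_table_name) }}
    {%- endif -%}
{% endmacro %}

A model can then simply write:

from {{ sample_or_source('data_source_food', 'food_table_raw', 'sample_raw_food', 'sample_food_table') }}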

Closing remarks

Using sample data to transform and model data is very important for cost management in projects that deal with large volumes of data.

I briefly presented the tablesample clause and table partitioning in BigQuery, along with a strategy for working with sample data using dbt. I hope this text is useful to anyone interested in dbt optimization.

I have described the strategy that I know. If you know a more interesting one for working with sample data, or would like to contribute an improvement to this one, I'll be very glad. That would be of great value to the dbt community.

I hope you like it. Thank you for your time!
