How to save 90% of BigQuery’s storage cost

A tale of default configurations

Yerachmiel Feltzman
Israeli Tech Radar


Hey man, stop with the click-bait titles — I can hear you thinking. ;)

That’s a true story, though.

BigQuery Storage costs in a real project I worked on (numbers hidden for privacy)

I’m a senior Big Data Engineer at Tikal — a company of tech experts that provides hands-on consultancy and scales R&D teams with cutting-edge technologies.

By changing a simple default configuration, I managed to cut the BigQuery storage bill by 90% for one of Tikal’s clients.

Photo by micheile henderson on Unsplash

A tale of default configurations

Feature-rich tools are usually built on top of many configurations, and many of them should be changed to fit your workflows.

However, to avoid overwhelming users, most products ship with default values. Why did solution X decide to default configuration Y to value Z? I don’t know, but they did.

BigQuery has an important feature called time travel. It lets you retrieve older states of a given table’s data, and it works for recovering deleted data too.

Did you delete data by mistake? Just time-travel, get it back, and save your day.
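
For instance, rows deleted from a table by mistake can be read back with a point-in-time query. A minimal sketch, assuming a placeholder my_dataset.my_table:

-- Read the table as it was 24 hours ago, before the accidental delete
SELECT *
FROM `my_dataset.my_table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);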

Time travel, however, potentially increases storage costs. Depending on the data volume and the frequency of changes, time travel storage can pile up fast.

If you set your storage billing model to use physical bytes, the total storage costs you are billed for include the bytes used for time travel and fail-safe storage. If you set your storage billing model to use logical bytes, the total storage costs you are billed for do not include the bytes used for time travel or fail-safe storage. You can configure the time travel window to balance storage costs with your data retention needs. (source: https://cloud.google.com/bigquery/docs/time-travel#billing)

And, it’s enabled by default. 🤔

Really? I don’t want it, let’s shut it down! 😜

1. Set the storage billing model to physical, so we pay only for what we actually store:

ALTER SCHEMA my_landing_zone
SET OPTIONS (
  storage_billing_model = 'PHYSICAL'
);

2. Reduce the time travel window to the minimum (two days, at the time of writing):

ALTER SCHEMA my_landing_zone
SET OPTIONS (
  max_time_travel_hours = 48
);
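
If there are many datasets to update, one option is to generate the DDL for all of them from INFORMATION_SCHEMA. A sketch, assuming the datasets live in region-us (review the output before running it):

-- Generate ALTER SCHEMA statements for every dataset in the region
SELECT
  CONCAT(
    "ALTER SCHEMA `", catalog_name, ".", schema_name, "` ",
    "SET OPTIONS(storage_billing_model = 'PHYSICAL', max_time_travel_hours = 48);"
  ) AS ddl
FROM
  `region-us`.INFORMATION_SCHEMA.SCHEMATA;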

That’s it, we saved a lot of money for our company. We can close this article.

Photo by King Lip on Unsplash

Ok, ok, not so fast. Let’s put a bit more thought into this process.

Should we turn time travel off?

1. Assess potential cost savings

To assess the potential savings, the following query can help:

-- pricing details: https://cloud.google.com/bigquery/pricing#storage
WITH t AS (
  SELECT
    *,
    (active_logical_bytes / POW(1024, 3) * 0.02
      + long_term_logical_bytes / POW(1024, 3) * 0.01) AS logical_cost_forecast,
    (active_physical_bytes / POW(1024, 3) * 0.04
      + long_term_physical_bytes / POW(1024, 3) * 0.02) AS physical_cost_forecast,
    ((active_physical_bytes - time_travel_physical_bytes - fail_safe_physical_bytes) / POW(1024, 3) * 0.04
      + long_term_physical_bytes / POW(1024, 3) * 0.02) AS physical_cost_forecast_with_zero_time_travel_and_fail_safe,
    SAFE_DIVIDE(time_travel_physical_bytes, active_physical_bytes) AS time_travel_per_active_byte_ratio,
    SAFE_DIVIDE(fail_safe_physical_bytes, active_physical_bytes) AS fail_safe_per_active_byte_ratio
  FROM
    `<my-project>.<my-region>.INFORMATION_SCHEMA.TABLE_STORAGE`
  WHERE
    project_id = "<my-project>"
)
SELECT
  t.table_schema,
  t.table_name,
  ROUND((t.active_logical_bytes + t.long_term_logical_bytes) / POW(1024, 3), 0) AS total_logical_GiB,
  ROUND((t.active_physical_bytes + t.long_term_physical_bytes) / POW(1024, 3), 0) AS total_physical_GiB,
  ROUND(t.logical_cost_forecast, 0) AS logical_cost_forecast,
  ROUND(t.physical_cost_forecast, 0) AS physical_cost_forecast,
  ROUND(t.physical_cost_forecast_with_zero_time_travel_and_fail_safe, 0) AS physical_cost_forecast_with_zero_time_travel_and_fail_safe,
  ROUND(t.time_travel_per_active_byte_ratio, 3) AS time_travel_per_active_byte_ratio,
  ROUND(t.fail_safe_per_active_byte_ratio, 3) AS fail_safe_per_active_byte_ratio
FROM
  t
ORDER BY
  t.physical_cost_forecast DESC

2. Assess why time travel is being used

Do we need it for every dataset? It depends.

Do we need it for analytical queries? Perhaps we want to query older states of the data as part of our analytical needs, or we are running data-quality checks that compare upserts over time.

Or do we want it as protection against mistaken deletes? In other words, are we using it for “sleeping well at night” reasons, i.e., as a data backup?

If time travel is being used to query logical data states over time, we should consider whether more classic designs are better solutions. Slowly changing dimensions (SCD), more specifically SCD type 2, are worth considering (see the sketch below). Compare the costs easily with the query from item 1 above, and also account for the engineering cost of implementing SCDs.
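
For reference, a minimal SCD-2 sketch, assuming hypothetical dim_customer and customer_updates tables with an attributes_hash column and validity metadata:

-- Step 1: close the currently valid row when its attributes changed
UPDATE `my_dataset.dim_customer` t
SET is_current = FALSE,
    valid_to = CURRENT_TIMESTAMP()
FROM `my_dataset.customer_updates` s
WHERE t.customer_id = s.customer_id
  AND t.is_current
  AND t.attributes_hash != s.attributes_hash;

-- Step 2: insert the new version (and brand-new keys) as the current row
INSERT INTO `my_dataset.dim_customer`
  (customer_id, attributes_hash, valid_from, valid_to, is_current)
SELECT s.customer_id, s.attributes_hash, CURRENT_TIMESTAMP(), NULL, TRUE
FROM `my_dataset.customer_updates` s
LEFT JOIN `my_dataset.dim_customer` t
  ON t.customer_id = s.customer_id AND t.is_current
WHERE t.customer_id IS NULL;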

If the reason is backup-related, the next items will be valuable.

3. Assess the risks of turning it off

The main alternative is to keep it on. It’s costly, but definitely might save your day.

To put it in simple terms, the trade-off is:

  • Pay upfront for time-travel storage (keep it on).
  • Pay only when needed, by re-ingesting the lost data (turn it off).

I would vote for the latter: better to defer costs until they are actually necessary.

Nonetheless, if one or more of the following hold, we might prefer paying upfront so we can sleep better at night:

  • Re-ingest compute costs more than time-travel storage.
  • Re-ingest is hard.

4. Deal with re-ingest compute costs

At the time of writing, BigQuery offers free batch ingestion when using the shared slot pool. That’s not always sufficient, but quite often it’s a good choice, all things considered.

https://cloud.google.com/bigquery/pricing#data_ingestion_pricing
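
For example, a plain batch load from Cloud Storage runs on the shared slot pool at no ingestion charge. A sketch with hypothetical bucket and table names:

-- Batch load from GCS using the free shared slot pool
LOAD DATA INTO `my_dataset.my_landing_table`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/landing/2023-08-06/*.parquet']
);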

If my product manager asks for low-latency streaming ingestion, I’ll be sure to show him the pricing table linked above. We might still decide to go with streaming inserts, but it’s my responsibility to let him know the trade-offs. Many times, conversations like this have ended up changing the first-draft tech requirements, because business stakeholders reconsidered the actual requirements vis-à-vis the potential costs.

Consequently, if we can use the free batch-loading tier, the trade-off becomes irrelevant and we are better off deferring the costs.

If that’s not the case, use the cost assessment query from item 1 and do the math.

In any case, in the cloud, learning your vendor’s pricing structure is what differentiates good engineers from excellent ones.

5. Deal with re-ingest complexity

Design for idempotent backfills.

We should design for idempotency, and we should design for easy backfilling. Usually, the latter follows when the former is implemented.

In other words, backfilling is way easier if the data pipeline is idempotent.

Consider the following two data ingestion pipelines:

a. append new partition
b. insert or replace partition

Which option is idempotent?

If we run option-a twice, the data will be duplicated. Option-b, on the other hand, can be run a thousand times and, for the same input data, the landing table will always contain the same result.
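
To make option-b concrete, one way to replace a partition atomically in SQL is a delete-and-insert inside a transaction. A sketch with hypothetical table names:

-- Replace a single day's partition: running this twice yields the same result
BEGIN TRANSACTION;

DELETE FROM `my_dataset.events`
WHERE event_date = '2023-08-06';

INSERT INTO `my_dataset.events`
SELECT *
FROM `my_dataset.events_staging`
WHERE event_date = '2023-08-06';

COMMIT TRANSACTION;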

Consider now that we need to re-ingest data for any reason. In which of the data ingestion pipelines above is re-ingesting easier?

To backfill the option-a pipeline, duplication needs to be handled, perhaps by deleting the affected partitions first and then re-ingesting. The option-b pipeline, in contrast, can simply be re-run, and “that’s all folks”.

BigQuery supports loading data into partitions idempotently by using the WRITE_TRUNCATE write disposition. When ingesting programmatically with the Python SDK, it looks like this:

from google.cloud import bigquery

load_job_metadata = dict(
    # Replace the load destination (e.g., a partition decorator such as my_table$20230806)
    # instead of appending to it, so re-runs are idempotent
    write_disposition=bigquery.job.WriteDisposition.WRITE_TRUNCATE,
    # other configs
)
job_config = bigquery.job.LoadJobConfig(**load_job_metadata)

Hence, if we design our data pipeline with re-ingest in mind from the very first architecture discussion, and make it idempotent, we will have mitigated the “re-ingest is hard” risk. It’s not hard anymore.

With both in place, we can create simple commands that re-ingest specific partitions and run them whenever needed. For instance, if the 2023-08-06 12:00:00 partition was mistakenly deleted, we could run:

./commands.sh gcp:workflow:run:partition "2023080612"

Saving storage costs

Read the docs. Find hidden gems. Go back to the drawing board. Do your math.

… and make the CFO happier. 🤑

Is he a happy CFO? — Photo by Joe Caione on Unsplash

Happy costs-under-control pipeline. 🚀
