A Migration Misstep: From Redshift to BigQuery

Andy Yan
Published in 99.co
Dec 8, 2021 · 4 min read
Photo: Upper Cross Street, Singapore

Context

Our team has been working on migrating our services from Amazon Web Services (AWS) to Google Cloud Platform (GCP). One major service is the data warehouse, which hosts terabytes of data that power our daily data needs. Migrating such a huge data warehouse is daunting enough on its own, and the complexity multiplies further when you factor in the dependencies that come with it. A misstep was bound to happen, and this article illustrates how it happened, how we fixed it, and what we learned from it.

Sound the Alarm

Unlike Redshift, which charges a fixed amount based on the cluster size you choose, BigQuery's costs are variable. They depend on the storage size and, more importantly, on the amount of data scanned, i.e. the query costs. Over the course of the migration, we regularly checked GCP's billing dashboard to make sure we weren't incurring excessive query costs. For the most part, the cost increased linearly and stayed well within our expectations. That was until one day, when it ballooned to three times the day before. A quick triangulation through GitHub commits, Slack logs and messages finally led us to the culprit: a non-data-engineering production service that queries BigQuery data indiscriminately.
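
If you ever need to hunt down a similar spike, BigQuery's INFORMATION_SCHEMA jobs views record bytes billed per job. Below is a minimal sketch of the kind of query that can surface the most expensive jobs; the region qualifier and the seven-day look-back window are illustrative assumptions, not our exact setup:

-- Find the query jobs that billed the most bytes over the last 7 days
SELECT
user_email,
job_id,
total_bytes_billed,
query
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
ORDER BY
total_bytes_billed DESC
LIMIT 20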

Dissecting the Problem

To illustrate the problem, we'll use one of the queries the service runs, pre- and post-migration:

Pre-migration:

SELECT
user_id,
received_at,
listing_id
FROM
listing_viewed
WHERE
date_trunc('hour', "received_at") = date_trunc('hour', %(timestamp)s)
AND user_id IS NOT NULL
AND listing_id IS NOT NULL

Post-migration:

SELECT
user_id,
received_at,
listing_id
FROM
listing_viewed
WHERE
TIMESTAMP_TRUNC(received_at, HOUR) = TIMESTAMP_TRUNC(TIMESTAMP "{timestamp}", HOUR)
AND user_id IS NOT NULL
AND listing_id IS NOT NULL

The query selects user_id, received_at and listing_id from the listing_viewed table, where received_at falls within a given hour that depends on the time of the query, and both user_id and listing_id are not null. The post-migration query is adapted to BigQuery's syntax. The logic is sound. Everything looks fine, right? Yes… but not really.

The post-migration query consumes a few gigabytes of scanned data every time it runs on BigQuery. The service runs the query (with different permutations) in a loop with a different timestamp every five minutes, looking for deltas to cache in its own database. As a result, we racked up query costs in no time. On the other hand, this query would cost nothing extra on Redshift. We had to find a way for the service to scan less data.

Fortunately, the listing_viewed table is partitioned by _PARTITIONTIME. This means that under the hood, BigQuery has already split the table into smaller, daily sub-tables based on when the data was ingested. For example, all rows ingested on one day sit in one sub-table, the rows ingested on the next day sit in another, and so on. As long as we pass in a WHERE clause that filters on _PARTITIONTIME, we can direct BigQuery to scan less data.
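
To see this partitioning at work, BigQuery exposes an INFORMATION_SCHEMA.PARTITIONS view that lists each daily partition and how much data it holds. A rough sketch, with the dataset name (analytics) assumed purely for illustration:

-- List the daily partitions of listing_viewed and their sizes
SELECT
partition_id,
total_rows,
total_logical_bytes
FROM
analytics.INFORMATION_SCHEMA.PARTITIONS
WHERE
table_name = 'listing_viewed'
ORDER BY
partition_id DESC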

Post-migration, amended:

SELECT
user_id,
received_at,
listing_id
FROM
listing_viewed
WHERE
TIMESTAMP_TRUNC(received_at, HOUR) = TIMESTAMP_TRUNC(TIMESTAMP "{timestamp}", HOUR)
AND user_id IS NOT NULL
AND listing_id IS NOT NULL
AND _PARTITIONTIME >= TIMESTAMP_SUB(TIMESTAMP "{timestamp}", INTERVAL 7 DAY)

With _PARTITIONTIME included in the WHERE clause, BigQuery only needs to scan the few recent partitions covered by the filter to get what we need. The new query scans about 50 megabytes, bringing the cost down by a factor of roughly 20. This translates to a significant amount of cost savings every month.

Note: TIMESTAMP_SUB is used in the partition filter to ensure that we don't filter away data we potentially need, as the ingestion time might differ from the actual received_at of the data.

Learnings

Apart from holding more post-mortems to share the nuances of our new data warehouse, we created a cost monitoring alert on GCP that fires whenever BigQuery spend substantially exceeds the expected range. We also updated our code review process so that any code change to data-warehouse-related services requires explicit approval from Data Engineers.
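
As a rough illustration of what such a check can look like (not our exact alert), a scheduled query over the INFORMATION_SCHEMA jobs view can aggregate bytes billed per day and flag any day that stands out. The region qualifier and 30-day window below are assumptions:

-- Daily TiB billed by query jobs; multiply by the on-demand rate to estimate spend
SELECT
DATE(creation_time) AS query_date,
ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
GROUP BY
query_date
ORDER BY
query_date DESC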

We would also caution that it is generally not advisable to build a product-facing production service on top of a data warehouse, because data warehouses are designed for a different set of needs, catered to and optimized for a data team's queries and analyses. We arrived at this design for legacy reasons and will have to work within that constraint for the time being.
