You’re not the only one who messes up — how I dropped half of the data from a new table in BigQuery

Marin Aglić
3 min read · Apr 13, 2023


In this story, I’ll share how a minor mistake I made caused us to lose half of the data in a new table. Be careful when migrating data.

Losing rows

Introduction

A business requirement had us create a new EL (extract and load) pipeline in Airflow. Basically, we have a (somewhat) big table in BigQuery with lots of columns and data, and querying it is costly. However, one of the teams that works with the data needs only a subset of the columns. So, we were tasked with creating a new table containing that subset of the existing data, which would make querying cheaper.

The first mistake

Our first mistake when creating the new table was not making it partitioned. Why was this a mistake? The pipeline that fills the table runs daily, and within 7 days the table already held some 5 GB of data. We concluded that we should definitely partition it.

The approach was pretty obvious. Since we’re using Terraform, we’d need to drop and re-create the same table, only with (required) partitioning. But we kind of wanted to keep the data. So: create the same table in the sandbox project, store the current data there, and then put it back into the production project.

The second mistake

So, of course, we started writing the Terraform for the table in sandbox, made it partitioned and all. But we don’t really want to keep much data in the sandbox project, so we made the partitions expire after 7 days. At this point, we had some 12 days of data in the production table… see where I’m going with this?

We created the partitioned table in sandbox with a partition expiration time of 7 days.
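For illustration, this is roughly the kind of table definition we ended up with. Ours was written in Terraform, so this is only a BigQuery DDL sketch of the same idea, and all project, dataset, table, and column names are made up:

```sql
-- A partitioned table with partition expiration, sketched as BigQuery DDL
-- (hypothetical names; the real table was defined in Terraform).
CREATE TABLE `my-sandbox-project.my_dataset.events_subset`
(
  event_date DATE,
  user_id    STRING,
  event_name STRING
)
PARTITION BY event_date
OPTIONS (
  -- The line that later bit us: partitions older than 7 days
  -- are deleted automatically by BigQuery.
  partition_expiration_days = 7
);
```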

Migrating the data

I wrote the query for copying the data from production to sandbox. After running the query, almost 8 million rows had been inserted into the sandbox table, almost 8 GB of data in our case.
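The copy query itself was nothing special. A minimal sketch of what it looked like, again with hypothetical names and columns:

```sql
-- Copy the subset of columns from the production table into the
-- sandbox table (hypothetical names, for illustration only).
INSERT INTO `my-sandbox-project.my_dataset.events_subset`
  (event_date, user_id, event_name)
SELECT
  event_date,
  user_id,
  event_name
FROM
  `my-prod-project.my_dataset.events_subset`;
```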

Then, we dropped the production table, re-created it, and just slightly changed the query to copy the data in the other direction (from sandbox to production). Imagine my surprise when, instead of (almost) 8 million rows, only around 4.5 million rows were inserted, which was around 4.5 GB.

At first we were just: “What the…? What happened?”. My first instinct was that there was a mistake in the query. So I double-checked the query… of course there was no mistake, as the query was pretty simple.

I went through my personal query history: “Did I maybe write a delete statement somewhere?” (a lot of self-doubt there).

Only after some 5–10 minutes of looking through my personal query history and talking with my coworker about how weird this was… did I notice that we had set the partition expiration time to 7 days. Of course I forgot about partition expiration — why not. And BigQuery just dropped the older partitions — of course 😅😄.

Conclusion

Good thing it’s a new table, a copy of the data still exists, and we’re required to backfill the data for the past few months anyway. So, in this case, there was no real harm done. But do check whether partition expiration is set if you’re migrating data into a table.
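One way to check (besides bq show or the table details in the console) is to query the dataset’s INFORMATION_SCHEMA. A small sketch, again with hypothetical names:

```sql
-- Check whether the destination table has a partition expiration set;
-- returns a row with the value in days if it does.
SELECT
  option_name,
  option_value
FROM
  `my-prod-project.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE
  table_name = 'events_subset'
  AND option_name = 'partition_expiration_days';
```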


Marin Aglić

Working as a Software Engineer. Interested in Data Engineering. Mostly working with Airflow, Python, Celery, BigQuery. Ex-PhD student.