How to schedule a BigQuery ETL job with Dataprep

Lak Lakshmanan
Jan 21, 2018 · 4 min read

The BigQuery user interface lets you do all kinds of things — run an interactive query, save as Table, export to table, etc. — but there is no way (yet!) to schedule a query to run at a specific time or periodicity. Graham Polley’s been on a roll lately, coming up with four workarounds that are all serverless and involve other GCP products. But he’s missed what, in my mind, is the clear winner for ETL jobs — it’s flexible, it’s powerful, it involves no coding, and it’s likely you’ll want it in your toolkit even after the BigQuery team rolls out the ability to run scheduled queries.

To be clear: once BigQuery has scheduled queries, you want to use that, so that you can keep your data in BigQuery and take advantage of its scale and power. However, if you are doing transformations (the T in ETL), then consider this approach:

  1. In the BigQuery UI, save the desired query as a View.
  2. In Cloud Dataprep, write a new recipe, with a BigQuery source. Optionally, add some transforms to your recipe. For example, you might want to add some formulas, de-deduplications, transformations, etc.
  3. Export result of transformation to a BigQuery table or CSV file on Cloud Storage
  4. Schedule the Dataprep flow to run periodically

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

Scheduling a Query to run periodically is easy, really.

1. Save BigQuery query as View

I’m going to demonstrate using a query that pulls recent weather from a public dataset of weather, so type this into the BigQuery UI and then save as View:

#standardSQL
SELECT
date,
MAX(prcp) AS prcp,
MAX(tmin) AS tmin,
MAX(tmax) AS tmax
FROM (
SELECT
wx.date AS date,
IF (wx.element = 'PRCP',
wx.value/10,
NULL) AS prcp,
IF (wx.element = 'TMIN',
wx.value/10,
NULL) AS tmin,
IF (wx.element = 'TMAX',
wx.value/10,
NULL) AS tmax
FROM
`bigquery-public-data.ghcn_d.ghcnd_2018` AS wx
WHERE
id = 'USW00094846'
AND DATE_DIFF(CURRENT_DATE(), wx.date, DAY) < 15 )
GROUP BY
date
ORDER BY
date ASC
Save the query as a view

2. In Cloud Dataprep, write a new recipe

Step 1: Launch Dataprep from the GCP web console
Recipes are part of a flow

Click on Import Datasets and follow the UI flow to get to your BigQuery dataset and import your newly created current_weather view as a Dataprep dataset.

Import the current_weather BigQuery View as a Dataprep Dataset
Add steps to the recipe

After you click on “Edit recipe”, Dataprep will pull out a sample of your dataset and show you the columns, their distributions, etc.

Click on the blue plus button to add a step to the recipe. We’ll remove any rows where the data columns are empty:

Writing a delete transform to delete rows that match a criterion

Note that as you write the formula, Dataprep shows you which rows/columns will be affected.

3. Run the job to Export the data

Click on “Run job”. The default is to create a CSV file on Cloud Storage, but we can change this to BigQuery by clicking on the pencil (“Edit”) icon:

Change the output of the job to write to your dataset in BigQuery

Click on “Run job” to run it once. Here, we ask the job to drop and replace the table each run, but as you can see, there are other options. This runs as a Dataflow job, i.e. at scale and distributed.

Dataprep flows run as Dataflow jobs! You don’t need to write Java/Python to use Dataflow.

4. Schedule the job to run periodically

Go to the “Flows” section of the Dataprep UI and click on the three buttons next to your new Flow. You’ll see an option to add a schedule:

Options include daily, weekly, etc. but also a crontab format for further flexibility:

Scheduling the job to run periodically

And that’s it!

Google Cloud - Community

A collection of technical articles published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Lak Lakshmanan

Written by

Data Analytics & AI @ Google Cloud

Google Cloud - Community

A collection of technical articles published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade