How to schedule a BigQuery ETL job with Dataprep

Lak Lakshmanan
Jan 21, 2018 · 4 min read

The BigQuery user interface lets you do all kinds of things (run an interactive query, save results as a table, export a table, etc.), but there is no way (yet!) to schedule a query to run at a specific time or on a recurring schedule. Graham Polley’s been on a roll lately, coming up with four workarounds that are all serverless and involve other GCP products. But he’s missed what, in my mind, is the clear winner for ETL jobs: it’s flexible, it’s powerful, it involves no coding, and it’s likely you’ll want it in your toolkit even after the BigQuery team rolls out the ability to run scheduled queries.

Note: BigQuery now has scheduled queries, so please use that feature instead, so that you can keep your data in BigQuery and take advantage of its scale and power.

The approach in this article is useful only if you are doing transformations (the T in ETL). Here’s the workflow:

  1. In the BigQuery UI, save the desired query as a View.
  2. In Cloud Dataprep, write a new recipe with a BigQuery source. Optionally, add some transforms to your recipe. For example, you might want to add some formulas, deduplications, transformations, etc.
  3. Export the result of the transformation to a BigQuery table or to a CSV file on Cloud Storage.
  4. Schedule the Dataprep flow to run periodically

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

[Image: Scheduling a Query to run periodically is easy, really.]

1. Save the BigQuery query as a View

I’m going to demonstrate using a query that pulls recent weather observations from a public dataset, so type this into the BigQuery UI and then save it as a View:

#standardSQL
-- Most recent two weeks of precipitation and min/max temperature
-- for one GHCN weather station. GHCN reports values in tenths of
-- a unit, so value/10 yields mm (precipitation) and degrees C.
SELECT
  date,
  MAX(prcp) AS prcp,
  MAX(tmin) AS tmin,
  MAX(tmax) AS tmax
FROM (
  SELECT
    wx.date AS date,
    IF(wx.element = 'PRCP', wx.value/10, NULL) AS prcp,
    IF(wx.element = 'TMIN', wx.value/10, NULL) AS tmin,
    IF(wx.element = 'TMAX', wx.value/10, NULL) AS tmax
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_2018` AS wx
  WHERE
    id = 'USW00094846'  -- restrict to a single weather station
    AND DATE_DIFF(CURRENT_DATE(), wx.date, DAY) < 15 )
GROUP BY
  date
ORDER BY
  date ASC

[Image: Save the query as a view]
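
If you’d rather script this step than click through the BigQuery UI, the same view can be created with a DDL statement. This is a minimal sketch, not part of the original walkthrough; the dataset name demos is a placeholder, so substitute your own:

#standardSQL
-- Sketch only: demos is a placeholder dataset name.
CREATE OR REPLACE VIEW demos.current_weather AS
SELECT
  date,
  MAX(prcp) AS prcp,
  MAX(tmin) AS tmin,
  MAX(tmax) AS tmax
FROM (
  SELECT
    wx.date AS date,
    IF(wx.element = 'PRCP', wx.value/10, NULL) AS prcp,
    IF(wx.element = 'TMIN', wx.value/10, NULL) AS tmin,
    IF(wx.element = 'TMAX', wx.value/10, NULL) AS tmax
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_2018` AS wx
  WHERE
    id = 'USW00094846'
    AND DATE_DIFF(CURRENT_DATE(), wx.date, DAY) < 15 )
GROUP BY
  date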

2. In Cloud Dataprep, write a new recipe

[Image: Step 1: Launch Dataprep from the GCP web console]
[Image: Recipes are part of a flow]

Click on Import Datasets and follow the UI flow to get to your BigQuery dataset and import your newly created current_weather view as a Dataprep dataset.

[Image: Import the current_weather BigQuery View as a Dataprep Dataset]
[Image: Add steps to the recipe]

After you click on “Edit recipe”, Dataprep will pull out a sample of your dataset and show you the columns, their distributions, etc.

Click on the blue plus button to add a step to the recipe. We’ll remove any rows where the data columns are empty:

[Image: Writing a delete transform to delete rows that match a criterion]

Note that as you write the formula, Dataprep shows you which rows/columns will be affected.
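
For reference, the delete step amounts to the filter sketched below in SQL terms. This is an illustration of the intent rather than the actual recipe Dataprep records, and it assumes “empty” means all three measurement columns are missing (again using the placeholder demos dataset):

#standardSQL
-- Sketch: keep a row only if at least one measurement is present;
-- rows where prcp, tmin, and tmax are all NULL are dropped.
SELECT *
FROM demos.current_weather
WHERE prcp IS NOT NULL
   OR tmin IS NOT NULL
   OR tmax IS NOT NULL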

3. Run the job to export the data

Click on “Run job”. The default is to create a CSV file on Cloud Storage, but we can change this to BigQuery by clicking on the pencil (“Edit”) icon:

[Image: Change the output of the job to write to your dataset in BigQuery]

Click on “Run job” to run it once. Here, we ask the job to drop and replace the table each run, but as you can see, there are other options. This runs as a Dataflow job, i.e., distributed and at scale.

[Image: Dataprep flows run as Dataflow jobs! You don’t need to write Java/Python to use Dataflow.]
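
The drop-and-replace setting means every scheduled run overwrites the previous output instead of appending to it. Since the view only ever covers the most recent 15 days, recomputing the whole table each run is the natural choice here. In SQL terms, each run behaves roughly like this sketch (the destination table name current_weather_cleaned is hypothetical):

#standardSQL
-- Sketch: drop-and-replace recreates the destination table on
-- every run, so it always reflects the latest source data.
CREATE OR REPLACE TABLE demos.current_weather_cleaned AS
SELECT *
FROM demos.current_weather  -- plus whatever transforms the recipe applies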

4. Schedule the job to run periodically

Go to the “Flows” section of the Dataprep UI and click on the three-dot menu next to your new Flow. You’ll see an option to add a schedule:

Options include daily, weekly, etc., but there is also a crontab format for further flexibility (for example, 0 6 * * * runs the job every day at 06:00):

[Image: Scheduling the job to run periodically]

And that’s it!
