Building SQL pipelines in BigQuery with Dataform

Collaboratively transform, document, schedule datasets using SQL

Increasingly, we see a move from building ETL pipelines (where much of the transformation is carried out in tools like Spark or Dataflow before the data is loaded into BigQuery) to ELT pipelines (where the transformation is carried out within BigQuery itself). The reasons are that (1) SQL is easier for business users to write (2) BigQuery scales better and is less expensive than alternative data processing technologies.

The problem with doing all the transformation code in SQL, though, is that it can become hard to maintain. How often have you come back to a project after a few months and been faced with a bunch of views, tables, user-defined functions, and scripts and scratched your head in bewilderment?

That’s why it’s very useful to have a environment that supports best practices in terms of transformation code — the same sort of best practices you want to apply to any code: documentation, reusability, readability, assertions, unit testing, source code control, and so on. Dataform, newly acquired by Google Cloud, and soon to be part-and-parcel of the Google Cloud Data Analytics portfolio provides such an environment. It’s free to all users.

1. Install Dataform

In the GCP Console, open up CloudShell (it’s the > icon in the top blue ribbon). Then, use NPM to install Dataform:

npm i -g @dataform/cli

2. Initialize a Dataform project

Then create a new Dataform project (I’m calling it bikes_weather):

dataform init bigquery bikes_weather \
--default-database $(gcloud config get-value project) \
--include-schedules

At this point, the project has been created with the file structure recommended by Dataform:

3. Set up authentication to BigQuery

Go to https://console.cloud.google.com/iam-admin/serviceaccounts and create a service account. Give this account the role of BigQuery Admin (so that Dataform can create new tables etc.). Then, download the JSON key to the project and upload the file to CloudShell.

Finally, in CloudShell, type:

cd bikes_weather
dataform init-creds bigquery

Provide the path to the JSON key file when asked. Then make sure to add the key file to .gitignore so that you don’t check it in by mistake.

echo filename.json > .gitignore
git add -f .gitignore

4. Set up raw Sources

Let’s start out by adding a definition for a source named definitions/sources/weather.sqlx. Use the CloudShell editor.

config {
type: "declaration",
database: "bigquery-public-data",
schema: "ghcn_d",
name: "ghcnd_2020",
description: "Weather data from Global Historical Climate Network (GHCN) in 2020"
}

Basically, this is the raw data as it exists in the BigQuery tables `bigquery-public-data.ghcn_d.ghcnd_2020`

This data looks like this:

It needs to be prepared to be useful. Let’s do that next.

5. Prepare data

Create a file named definitions/staging/nyc_weather.sqlx. This will take the raw data and format it to be usable weather data for New York City in this form:

config {
type: "table",
schema: "staging",
description: "Cleaned up data corresponding to New York City",
// column level documentation, defined in includes/docs.js
// columns: docs.nyc_weather
}
SELECT
date,
MAX(prcp) AS rain_mm,
MIN(tmin) AS tmin_celsius,
MAX(tmax) AS tmax_celsius
FROM (
SELECT
wx.date AS date,
IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp,
IF (wx.element = 'TMIN', wx.value/10, NULL) AS tmin,
IF (wx.element = 'TMAX', wx.value/10, NULL) AS tmax
FROM
${ref("ghcnd_2020")} AS wx
WHERE
wx.id = 'USW00094728'
)
GROUP BY
date

We are asking Dataform to create a table for us (we could have also asked it to create a view). In fact, we can even get Dataform to partition and cluster the table that it creates for us:

config {
type: "table",
bigquery: {
partitionBy: "date",
clusterBy: ["tmin_celsius", "tmax_celsius"]
}
}

The view is going to be created using the query that we specify. There are two key things to note. First, the FROM clause of the table is a reference to the table name in weather.sqlx. Second, the columns in the view will be documented. The documentation is specified in includes/docs.js as follows:

const DATE = {
date: `The date (UTC)`
};
const RAIN_MM = {
rain_mm: `Daily rainfall, in mm`
};
const TMIN_CELSIUS = {
tmin_celsius: `Daily minimum temperature, in Celsius`
};
const TMAX_CELSIUS = {
tmax_celsius: `Daily maximum temperature, in Celsius`
};
// group documentation by table
const nyc_weather = {
...DATE,
...RAIN_MM,
...TMIN_CELSIUS,
...TMAX_CELSIUS,
};
module.exports = {
nyc_weather
}

Basically, we define the columns and then define the view (nyc_weather). The idea is that column names will be unique in the project and appear in multiple views and tables.

6. Compile and Dry Run

Let’s try it out. First, compile the project by typing on the command-line:

dataform compile

You should see that Dataform tells you that one dataset staging.nyc_weather will be created as a table.

You can also check out the dependencies:

dataform run --dry-run

6. Run

Run it to create the views and tables we want.

dataform run

Indeed, the table has gotten created for us:

7. Use Dataform web UI

The real user experience with Dataform, however, comes with the web UI. To do that, you need a Git repository (which is how the collaboration with your team mates happens).

Go to my Git repository (https://github.com/lakshmanok/bikes_weather) and fork it into your GitHub account.

Then, in https://dataform.co, sign in and go to the User settings and link your GitHub account.

From the project list, import from a Git repository and import your fork of my repo.

Specify your GCP project id to set up BigQuery as well. You will have to upload the service account JSON file.

8. Build a SQL pipeline to create a report

Let’s build a pipeline to create a report about New Yorkers’ bicycling behavior on weekdays, and weekends, and the impact of weather on their behavior.

9. Prepare the New York bikes data

Click on the three dots next to Sources and Create a New File. Give this the name nyc_citibike.sqlx. Create an empty file and put this in the contents:

config {
type: "declaration",
database: "bigquery-public-data",
schema: "new_york_citibike",
name: "citibike_trips",
description: "New York bicycle rentals"
}

Then, create staging/bicycle_trips.sqlx as a View and put in these contents:

config {
type: "view",
schema: "staging",
description: "Bicycle rental data corresponding to New York City",
}
select
EXTRACT(DATE from TIMESTAMP(starttime, 'UTC')) as date,
DATETIME_DIFF(stoptime, starttime, SECOND) AS duration
from
${ref("citibike_trips")}
LIMIT 10

Note that the right-hand side menu compiles the query, lets you preview results, etc. Once the query is good, remove the LIMIT clause.

This query gives us the date and duration of bicycle trips in New York. Note that I’m converting the time to UTC to match the weather data.

Finally, create reports/bikes_weather.sqlx with the following contents:

config {
type: "view",
schema: "staging",
description: "Bicycle trip count and duration by weather",
}
select
date,
AVG(duration) AS avg_duration,
COUNT(duration) AS num_trips,
ANY_VALUE(rain_mm > 5) AS rainy,
ANY_VALUE(tmin_celsius + tmax_celsius)/2 AS temperature
from
${ref("bicycle_trips")}
JOIN
${ref("nyc_weather")}
USING(date)
GROUP BY date

This is now doing a join. Dataform will ask you to create the dependencies to ensure that the weather data exists (recall that we wanted it to be a table):.

10. Run it

Click on the dependency tree to visualize the graph of how the data got created.

The view itself looks like this:

You can now use this view as the input to your BI tool.

You can also click on the “Create Schedule” to schedule the whole process.

Enjoy! The code so far is in https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/dataform/bikes_weather

p.s. Dataform vs. UDFs:

One question I got on Twitter was when you’d use Dataform, and when you would use User Defined Functions. Both let you reuse code. However, Dataform lets you define projects, and reuse SQL code/table definitions/column documentation/tagging among team members in a project. UDFs let you reuse only SQL code, but let you share the SQL functions globally (not just within a project).

--

--

--

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Recommended from Medium

Role of Data Science in Digital Marketing in 2020

Well, unfortunately the real world is not that simple. If you are dropping a

Menu for preparing data & serving it cold

2nd Generation Crowd Annotation to Shift AI-NLP Game

K-Means Clustering,Real Use Case in the security domain

Experiments with: Altair

Dimension Reduction Tutorial on Kuzushiji MNIST

Exploring a powerful SQL pattern: ARRAY_AGG, STRUCT and UNNEST

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Lak Lakshmanan

Lak Lakshmanan

Operating Executive at a technology investment firm; articles are personal observations and not investment advice.

More from Medium

Multi-Cloud Analytics with BigQuery Omni : No time to load !

How to use Backfill: the Time Machine for Scheduled Queries in BigQuery

BigQuery Materialized Views without Aggregation

Materialized views in BigQuery