How we use DBT and BigQuery External Connections to easily and reliably warehouse Cloud SQL data

Paddy Alton
Apolitical Engineering
5 min read · May 22, 2023

Introduction

Like many data teams, we have spent our fair share of time figuring out how to retrieve data from the many places it’s generated and pool it all in one central location — a data warehouse — so that our data can be easily activated by the business.

Activation of data may mean

  • passing it to a business intelligence tool
  • retrieving it for analysis
  • using it to train a machine learning model
  • sending recombined data back out to the third-party tools we use

This ‘hub and spoke’ model is in widespread use. There are huge advantages to having up-to-date, reliable data all available in one place, ready to be queried and retrieved with one set of tools. The data warehouse acts as a central exchange for data.

If it weren’t for this, any tool we wanted to use to activate our data would have to be compatible with many different data sources.

In practice

We do a lot of our work in Google Cloud Platform (GCP), which supports our operations in a variety of ways. Two GCP services are relevant for this article.

First, GCP operates a popular data warehouse service, Google BigQuery. It’s optimised for data analysis, supports SQL queries, and integrates directly with a wide variety of other tools.

Second, GCP provides a Cloud SQL service — managed infrastructure that makes it easy to securely and reliably operate relational databases. SQL databases are ubiquitous, acting as the back end for many web applications, including some parts of the Apolitical platform.

If we follow the hub and spoke approach, clearly we must load data from Cloud SQL into BigQuery. That way, data from our platform can be combined with data from other systems and used for business intelligence and other purposes.

How might we do this?

Historically, we took a standard approach:

  • set up infrastructure to reliably run a series of scheduled tasks
  • export data from each table in our databases to a file
  • load data from the latest file into the warehouse, transforming as needed

This was simple but reliable. It gave us a daily backup of our data (stored in a third GCP service, the Google Cloud Storage blob-store) and an up-to-date snapshot of each table. There wasn’t a strong business need for fully-fledged change data capture (CDC), but a history of each table was available if needed, as was reasonably fresh (i.e. less than 1 day old) data for each table.

What changed?

In 2022 we overhauled our data infrastructure, increasing capacity and bringing more order to our data.

As part of this work, we introduced DBT (Data Build Tool) to our stack.

This had a number of advantages. By executing data transformation in the warehouse itself, we were able to decouple data loads from data transformation.

This allowed us to

  • store near-raw and fully processed data in BigQuery
  • adopt Airbyte, speeding up pipeline development (since we removed the need for bespoke transformations in our pipelines, an off-the-shelf option for pipelines became feasible)
  • keep better track of data lineage, since all the transformations now occur in one place (DBT and BigQuery both have built-in table-level lineage tools)

Further,

  • common data transformations were captured in version control in a reusable form
  • we were able to easily apply more complex transformations to our data tables and therefore create proper data models for different business uses

In short, DBT enabled a paradigm shift — unlocking many new abilities for our small data team.
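
To give a flavour of what that looks like: a DBT model is just a version-controlled SQL file that DBT materialises in the warehouse on each run. Here is a minimal sketch (the source and column names are hypothetical, not part of our actual project):

    -- models/staging/stg_customers.sql
    -- hypothetical DBT model: DBT materialises the result of this query
    -- as a view or table in BigQuery on every run
    SELECT
        customer_id,
        customer_name,
        LOWER(email) AS email,
        created_at
    FROM
        {{ source("app_database", "customers") }}

Because models like this live alongside the rest of the project, the transformations they encode are versioned, reviewed, and reusable.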

In this article, I will describe one particular way in which DBT allowed us to simplify our data pipelines.

Replacing our pipelines with direct connections

One other change we took advantage of was the introduction of direct Cloud SQL to BigQuery connections and federated querying.

By setting up such a connection from BigQuery to our Cloud SQL instances, we can write queries that retrieve data from our SQL databases, via BigQuery. These are known as federated queries.

(As an aside, we in fact configured these connections to target a read replica instance — this is good practice, as it means federated queries don’t place any load directly on production databases.)

The syntax for such a query might look something like this:

    SELECT
        customerId AS customer_id,
        customerName AS customer_name,
        subscriptionType AS subscription_type,
        email,
        JSON_EXTRACT_SCALAR(location, "$.countryCode") AS country_code,
        createdAt AS created_at
    FROM
        EXTERNAL_QUERY(
            "<project-id>.<external-connection>",
            """
            SELECT
                -- double quotes are interpreted by PostgreSQL as
                -- a case-sensitive identifier name:
                "customerName",
                -- some PostgreSQL types (e.g. UUID, Enum) are not
                -- supported by BigQuery, and must be coerced:
                CAST("customerId" AS TEXT),
                CAST("subscriptionType" AS TEXT),
                -- other types are handled transparently by BigQuery:
                email,
                location,
                "createdAt"
            FROM
                customers;
            """
        )

Here I have tried to provide a realistic (though fictional) example of how a table called customers in a Cloud SQL PostgreSQL database could be read via BigQuery.

It’s worth pausing on the structure of this example:

  • there is an inner query (written in the PostgreSQL dialect) wrapped inside the EXTERNAL_QUERY() expression
  • there is an outer query that selects from the output of the EXTERNAL_QUERY() expression

Queries like this can be executed in the same way as a regular query. That means we can store them as DBT models.

What we did

We created DBT base models that use the federated query syntax described above.

By materialising these models as tables, we can create a snapshot of a table in Cloud SQL every time DBT is run.

This effectively recreates the existing warehousing workflow, but in a simpler manner. We no longer need to ensure that scheduled data loads complete before DBT runs, because DBT itself retrieves the data as part of the run. The only additional infrastructure is the external connection, which GCP maintains for us.
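
As a sketch, reusing the fictional customers example from above (the model name and column subset are illustrative), such a base model might look like this:

    -- models/base/base_customers.sql
    -- hypothetical DBT base model: each run rebuilds this table in
    -- BigQuery from a federated query against the Cloud SQL read replica
    {{ config(materialized="table") }}

    SELECT
        customerId AS customer_id,
        customerName AS customer_name,
        createdAt AS created_at
    FROM
        EXTERNAL_QUERY(
            "<project-id>.<external-connection>",
            """
            SELECT "customerName", CAST("customerId" AS TEXT), "createdAt"
            FROM customers;
            """
        )

Every DBT run rebuilds the table from the live replica, so the snapshot is as fresh as the most recent run.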

Tradeoffs

There are some drawbacks to our approach.

First, we have elided the distinction between a source table and a base model in DBT.

Federated Cloud SQL queries are not currently supported by the dbt-external-tables package, which allows some DBT sources to reference external datasets that the warehouse can directly access. However, this might change in future.

Secondly, on its own, our approach does not allow us to maintain a history of the external table. DBT supports snapshots, which would allow a history of the base model to be captured (but not the source itself).
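
For completeness, here is a minimal sketch of what such a snapshot could look like (the snapshot and model names are hypothetical; the check strategy is used so that no updated-at column is required):

    -- snapshots/customers_snapshot.sql
    -- hypothetical DBT snapshot: records row-level changes to the base
    -- model each time dbt snapshot is run
    {% snapshot customers_snapshot %}

    {{
        config(
            target_schema="snapshots",
            unique_key="customer_id",
            strategy="check",
            check_cols="all"
        )
    }}

    SELECT * FROM {{ ref("base_customers") }}

    {% endsnapshot %}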

The future

We’re keeping an eye on GCP’s Datastream for BigQuery, which became generally available not long before this article was written.

Datastream promises to replicate data directly into BigQuery continuously (rather than in batches) and to provide a built-in, exact history of changes to source data.

As a fully managed service, it should be about as straightforward to set up and use as our federated query approach. Furthermore, such a deployment would cleanly separate raw source tables from base models.
