Recreate your Google Analytics reports in BigQuery using dbt

Part II — Deep dive

Patryk Buczyński
Analytics Vidhya
8 min read · Oct 25, 2019


Data modeling is a huge topic.

In order to emphasize it properly using my limited English, I should probably write HUGE instead.

If you want to get a grasp of how huge it is, just look at the Directed Acyclic Graph in this post by Olivier Dupuis, or at some of the DAGs posted in the #measure Slack channel. It’s a tedious job of transforming a lot of raw, unclear, messy data, often requiring multiple interconnected steps. But every analyst with access to a SQL database usually does it every day, on a smaller scale: one SQL query that creates a view, another one that creates a table out of that view, another one that joins it with some other data…

In the previous post of this mini-series I tried to showcase the idea of data build tool (dbt), designed to make data transformations easier. I presented a working dbt model which recreates some Google Analytics reports in BigQuery, using GA export data, and provided some instructions on how to run it.

“Mini” should be a keyword of this article. The whole idea of recreating some Google Analytics reports in BigQuery is an example of a ‘mini’ transformation, something that a lot of web analysts might do from time to time and then forget about. It might be done using large queries with a lot of CTEs or subqueries, or using a chain of scheduled queries.

Not very nice, not too efficient, might break. The goal of this post is to explore the possibilities of using dbt to make these small transformations easier, more reliable, and most importantly — reusable.

Part I was barely scratching the surface; this time we’ll try to deliver some meat. We’re gonna dig into some features that make dbt so interesting for our little web analytics world. Ready?

The basics

Let’s start with some recap.

What are we doing? Recreating 3 Google Analytics reports (Source / Medium, Product Performance, All Pages) in BigQuery using Google Analytics export data.

Here you’ll find Part I, showing how to install dbt and run the model.

Here you’ll find the model repository. It contains two folders and the dbt_project.yml file. dbt_project.yml stores all the model-related settings, like folder paths and variables. There is no surprise when it comes to the folders: the “macros” folder contains all the macros used in the model.

The “model” folder is divided into two subfolders: Staging, containing the only query that will not materialize into a table, and Production, containing the final queries that will become tables in BigQuery.

If you analyze each query independently, you’ll find that they are just plain SQL with some weird additional formulas. And that’s correct. With a lot of oversimplification, dbt can be defined as a collection of SQL queries empowered by the Jinja templating framework. But the power of dbt lies in relations: between models and other models, models and macros, macros and variables, and so on.

dbt and the power of relations

Now we’re getting to the core of what makes dbt the next hot thing. It might be a very broad statement, but it seems to be true that in simpler cases most of the data build tool’s power comes from the ability to make SQL dynamic: populating values in one model based on other parts of the project, like variables or macros. What does that mean? How does data build tool do that?

  • determining the order of execution

dbt does this based on the relations and references between the queries. It is super important when you are working in a limited BigQuery environment and are not familiar with other cloud workflow automation tools (like me). If you have ever tried to use BigQuery scheduled queries to orchestrate more than a few queries that depend on each other, you know that it’s not the most reliable solution. Data build tool gets rid of this problem by building a DAG (directed acyclic graph) from your queries. It checks which elements of the model depend on which others and builds the execution queue based on that.

Let’s check the example from the model.

{{ config(materialized='table') }}

SELECT *,
sum(revenue/quantity) avgPrice,
sum(quantity/transactions) avgQuantity
FROM {{ ref('stg_revenue') }}
GROUP BY 1,2,3,4
ORDER BY 2 DESC

This is a simple query that recreates the Product Performance report, getting all the metrics from the stg_revenue table (to be found in the Staging folder) and doing some calculations on top of them. The

FROM {{ ref('stg_revenue') }}

command tells dbt that it needs to resolve the stg_revenue query first in order to be able to finish this one. Thus the simplest of DAGs is created:

stg_revenue -> productPerformance

It’s so simple that it doesn’t need those fancy DAG graphics :)
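Just to make it concrete: the {{ ref('stg_revenue') }} call is not only a dependency hint, it is also what gets substituted at compile time. Assuming stg_revenue were materialized as a view or table, the compiled query would look roughly like this (the project and dataset names below are placeholders); for an ephemeral model, dbt would inline the staging query as a CTE instead.

SELECT *,
sum(revenue/quantity) avgPrice,
sum(quantity/transactions) avgQuantity
FROM `your-project.your_dataset.stg_revenue`
GROUP BY 1,2,3,4
ORDER BY 2 DESC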

  • Jinja

In the context of our model (and probably GA data in BQ in general) this is the most important feature of dbt. Jinja is a templating framework for Python which turns out to be amazingly useful for writing DRY (don’t repeat yourself) SQL. Let’s try to understand it using some examples.

This is a fragment of sourceMedium.sql:

with s as ({{ getSessions('concat(trafficSource.source, " / ", trafficSource.medium)', 'sm') }}),

In the allPages model you will find something very similar:

JOIN ({{ getSessions('h.page.pagePath', 'page') }}) as s

What you can see above is a Jinja macro reused in two different models. It gets the number of sessions, but split by different dimensions.

If you compile and run the first statement, you’ll get the number of sessions split by traffic source and medium (brought to the “source / medium” form present in Google Analytics).

If you compile and run the second statement, you’ll get the number of sessions split by pages. It’s like changing the dimensions in Google Analytics custom report. Here’s the code that does that:

{% macro getSessions(field, alias) -%}

SELECT
{{ alias }}
-- count each session's totals.visits only once, on the session's first hit
, sum(CASE WHEN hitNumber = first_hit THEN visits ELSE null END) AS sessions
FROM (SELECT
{{ field }} {{ alias }}
-- first hit number within the session (fullVisitorId + visitStartTime)
, MIN(h.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
, h.hitNumber
, totals.visits
FROM {{var('tableName')}}, {{ unnest('hits', 'h') }}
WHERE {{tableRange()}}
GROUP BY 1, h.hitNumber, fullVisitorId, visitStartTime, totals.visits, visitId
ORDER BY 2 DESC)
{{ group_by(1) }}
ORDER BY 2 DESC

{%- endmacro %}

Jinja macros resemble functions in programming languages. They are defined with a set of parameters (in our case ‘field’ and ‘alias’) and can then be called with those parameters. The SQL query is compiled with the parameter values filled in. So, basically, if you use:

{{ getSessions('h.page.pagePath', 'page') }}

the compiled SQL will look like this:

SELECT 
page
, sum(CASE WHEN hitNumber = first_hit THEN visits ELSE null END) AS sessions
FROM (SELECT
h.page.pagePath page
, MIN(h.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit

The macro tries to abstract the GA session logic so that it’s possible to get the number of sessions split by any dimension. The second macro (getBounces) does exactly the same but returns the bounces metric (the number of bounces, not the bounce rate).
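The actual getBounces macro lives in the model repository; a rough sketch along the lines of getSessions (using totals.bounces from the export schema, which is 1 for bounced sessions) could look like this:

{% macro getBounces(field, alias) -%}

SELECT
{{ alias }}
-- sketch only: count each session's totals.bounces once, on the session's first hit
, sum(CASE WHEN hitNumber = first_hit THEN bounces ELSE null END) AS bounces
FROM (SELECT
{{ field }} {{ alias }}
, MIN(h.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
, h.hitNumber
, totals.bounces
FROM {{var('tableName')}}, {{ unnest('hits', 'h') }}
WHERE {{tableRange()}}
GROUP BY 1, h.hitNumber, fullVisitorId, visitStartTime, totals.bounces, visitId)
{{ group_by(1) }}
ORDER BY 2 DESC

{%- endmacro %}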

In the utility.sql file you’ll find some small macros that may make your life easier, and some redundant ones (like the unnest one, which needs the same amount of code as the statement it replaces). We’ll take a deeper look at some of them in the next section.
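The exact definitions are in the repository, but based on how they are called in the model, macros like unnest and group_by are essentially one-liners, roughly:

{% macro unnest(field, alias) -%}unnest({{ field }}) as {{ alias }}{%- endmacro %}

{% macro group_by(n) -%}GROUP BY {{ n }}{%- endmacro %}

So {{ unnest('hits', 'h') }} compiles to unnest(hits) as h, exactly the code it replaces, which is why it saves no typing at all.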

  • variables

Another thing of beauty. The Google Analytics BigQuery export schema makes you repeat small things a lot. Table names are a good example.

If you have a lot of Google Analytics properties with data exported to BigQuery, you might be used to juggling between table names like duggly-muggly.819301803.ga_sessions_. It’s not the most time-consuming task in the world, but copying and pasting different table names might be a little bit annoying. dbt variables solve this problem: you can define the table names in the dbt_project.yml file (remember the tableName variable in the “Project Settings” section?) and then refer to them using {{var('variable_name')}}

Just like in the allPages model:

(SELECT 
h.page.pagePath Page
, count(*) Pageviews
, count(distinct concat(fullVisitorId, cast(visitID as string))) uniquePageviews
, sum(CASE when h.isEntrance is not null then 1 end) Entrances
, sum(CASE when h.isExit is not null then 1 end) Exits
FROM {{var('tableName')}}, unnest(hits) as h

This will simply compile to

FROM `table_name_that_you_specified_in_dbt_project.yml file`

You can reuse that setup in every new model you create.

In our model we have two more variables: rangeStart and rangeEnd. They define the time range of the queries and are not used in the models directly. Instead, you’ll find them in the tableRange() macro, placed in the utility.sql file in the macros folder:

{% macro tableRange() -%}_table_suffix between '{{var("rangeStart")}}' and '{{var("rangeEnd")}}'{%- endmacro %}

This macro populates the _table_suffix condition (which is heavily used in BQ with GA data) with the variables you specified in the dbt_project.yml file. So, in order to use the time ranges in our model, we only need to call the macro:

(SELECT 
concat(trafficSource.source, " / ", trafficSource.medium) sm ,
sum(totals.transactions) transactions,
sum(totals.totalTransactionRevenue / 1000000) revenue
FROM {{var('tableName')}}
WHERE {{ tableRange() }}

Which will compile to:

WHERE _table_suffix between 'rangeStart value' and 'rangeEnd value'

That way you do not have to change the values in every query each time you want to change the date range.
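For reference, the variable definitions themselves sit in dbt_project.yml. The exact layout depends on your dbt version; with a recent dbt, a top-level vars: block roughly like this one does the job (the wildcard table name and the dates below are placeholders):

vars:
  tableName: "`your-project.123456789.ga_sessions_*`"
  rangeStart: "20190901"
  rangeEnd: "20190930"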

It’s even more handy than that. When running the model (dbt run) you can override the variables on the fly using the

dbt run --vars '{key: value}'

flag. Instead of changing the ranges in multiple queries, you only have to do it once. That’s as DRY as it gets. You can find more information about that on the dbt website.
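For example, to rebuild the whole model for a different date range (the dates below are just placeholders) without touching a single file:

dbt run --vars '{rangeStart: "20191001", rangeEnd: "20191031"}'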

What else?

This is only a small selection of the features that might be the most useful in small modelling tasks. If you decide to dig deeper, you’ll find stuff like materialization handling, scheduling (obviously), and what seems to be the most exciting: automated testing. Those features are not used in our model (or not to a great extent), though.

What’s next for dbt in web analytics?

Even scratching the surface of dbt’s most basic features gives you the feeling that it’s something worth investing your time in. I’m not planning to make data modelling my daily business, but I can see it as a valuable addition to my toolkit. Especially when thinking about the challenges that our industry brings, like merging historical Google Analytics data with the new Web and App properties/Snowplow, or building real warehouses for marketing, with billions and billions of Google Ads tables… :)

Feasible data modeling is slowly becoming a requirement, not an option, in web analytics. At least in some bigger companies.

Yet let me repeat myself: the biggest advantage that dbt promises is not technical. It’s all about the mindset shift. It makes you think about your SQL as a reusable resource, not something that you can forget the moment the data is there.

It might be a good move to start building your SQL-dbt codebase now. Even if that only means recreating some Google Analytics data in BigQuery.
