Recreate your GA reports in BigQuery using data build tool — part I, magic

Patryk Buczyński
Sep 25, 2019 · 8 min read


If your work is somehow related to code and technology, you have probably experienced those “everything I’ve been doing before is lame” moments. I had a few of them: when I understood how dataLayer works, started digging into JavaScript, discovered BigQuery, used some APIs for the first time, discovered R… The not-so-surprising paradox is that the more you know, the more often that moment comes.

Now I am in the middle of another moment like this. It is caused by data build tool (dbt), an open-source work of beauty created by Fishtown Analytics. If you have ever caught yourself thinking “God, I already wrote a query like this”, this might be a thing for you.

Don’t be shy, you’re at least a semi-developer. Or you will be

One of the milestones I mentioned was discovering BigQuery, which resulted in a gradual move away from the Google Analytics reporting interface. It was so important because it made me slowly realize that I was entering some sort of engineering discipline and couldn’t be afraid of code anymore, be it SQL, JS, R, Python or anything else. It was a first step towards an engineering mindset, and it shouldn’t be too risky to say that I was not the only web analyst who took it. And it had a lot of implications.

Even if we focus on ‘web analytics’ data sources only, the amount of SQL that one produces tends to pile up with time. The data model of the GA BigQuery export is not too straightforward, and even recreating basic metrics can be a challenge. Then, even within the same company, analysts might have different ideas of how to get session data (totals.visits? a distinct count of concatenated ids?), count unique events, simulate pageview reports or follow the logic of GA’s Enhanced Ecommerce module. Product teams can be very fond of the metrics they know from Google Analytics and want them reported the same way in their BI tool (which is often connected to BQ). This leads analysts to the depressing work of trying to guess what Google Analytics engineers did x years ago when they were creating the reporting interface. It’s likely that right now, somewhere in the world, some analyst is trying to figure out how to make their BQ number of users match the number from GA.
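To make the ambiguity concrete, here is a small sketch you can run against Google’s public GA sample dataset (the table below is the public sample, not the model discussed later). The two ways of counting sessions can legitimately return different numbers, because totals.visits is null for sessions without interaction events.

-- Two common definitions of "sessions" in the GA BigQuery export.
-- Run against Google's public sample dataset; the two numbers may differ.
SELECT
  SUM(totals.visits) AS sessions_ga_style,
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS sessions_distinct_ids
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`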

It’s natural that high-complexity environments create chaos very fast. dbt seems to be motivated by the realization that it’s time for analysts and analytics teams to start organizing their chaos in a manner similar to what developers do. But I won’t try to rewrite something that has already been written: the dbt team explains this way of thinking wonderfully in their viewpoint. It leaves nothing to add, and I highly recommend it.

So, what’s dbt?

It’s a command line tool for creating analytics workflows or, as its creators put it, “the T in ELT”. It allows you to create models that transform data using SQL queries combined with the Jinja templating language. A full description would need another post, so let me oversimplify the matter: it allows you to end the misery of ad-hoc SQL queries, think about your data in a more strategic way (even if it’s only GA data in BigQuery), and create processes that will make your life, and the lives of other analysts in your company, much easier.

You can think of it as a way of embracing software development principles in the data analytics world (again, see the dbt viewpoint). You can find more detailed information about the tool itself on the dbt website.
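To make that a bit more concrete, here is a minimal sketch of what a dbt model file could look like. The file name and the SQL are made up for illustration (the variables mirror the ones this project uses, described below); the real models in the repository are more involved.

-- daily_sessions.sql: a hypothetical, minimal dbt model (one SELECT per file).
-- dbt renders the Jinja parts into plain SQL and materializes the result
-- as a table in your target dataset.
{{ config(materialized='table') }}

SELECT
  date,
  SUM(totals.visits) AS sessions
FROM {{ var('tableName') }}
WHERE _TABLE_SUFFIX BETWEEN '{{ var("rangeStart") }}' AND '{{ var("rangeEnd") }}'
GROUP BY date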

Even with a very basic understanding of the tool I can see that it should be tremendously useful for web analysts in a few major ways:

a) saving time on rewriting SQL queries (the DRY principle)
b) standardizing metrics across the whole company
c) quickly creating data schemas that simplify daily work and dashboarding

To prove it, we will try to solve the aforementioned problem of product people needing Google Analytics logic recreated in BigQuery queries. Let’s face it: it’s boring, it’s bad, it’s mundane. That is exactly what makes it perfect for the DRY, dbt approach. Do it once, forget about it, and quickly reuse and modify it when needed.

The goal is to create tables containing data from three of the most popular GA reports: All Pages, Product Revenue and Source / Medium, in their exact GA form (or as exact as possible). If that seems easy and straightforward to you… I am sorry, but I have to assume that you never tried to do it :) Google Analytics reports have a lot of non-obvious logic underneath. Google provides no concise explanation of it (at least none I know of), so recreating even one metric in BigQuery can be an absolute pain.

To ease this pain I’ve created a data build tool model that recreates all three mentioned tables in the exact, or almost exact, form in which they are presented in Google Analytics. If everything goes as it should, you should be able to install dbt, get the model running and create all three tables in 15–30 minutes.

Prerequisites:

You need to have Google Analytics data exported into BigQuery in the standard form (the Google-provided export). If you’re not an Analytics 360 user or do not want to use a real dataset, you can always work with the sample dataset provided by Google. The only drawback is that the data in that dataset is not the same as in the corresponding sample Google Analytics property, so the results will not match (although I am not completely sure about this).

What’s going to happen:

  • you will compile and run a dbt model (several chained SQL queries) against your BigQuery data and create three tables in your BigQuery project with exactly, or almost exactly, the same data as in Google Analytics
  • you will query your BigQuery data, so be aware that you might get billed for that according to your BQ pricing

If that doesn’t scare you, let’s start.

1) Install dbt (and Python and Git if you’re not using them)

dbt is installed via pip, so you need to have Python installed. Git is also required. The installation process is very straightforward and well described here.

2) Configure the profiles file

After installation you’ll find the profiles.yml file in your .dbt folder. It holds information about the warehouses you are connecting to, your credentials and the locations where you want to save your data.

For the purpose of this model we’ll keep it very simple. This is how your file should look:
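Roughly, a minimal BigQuery profile using OAuth looks like the sketch below. The profile name, project and dataset are placeholders, and the profile name has to match the profile referenced in the repository’s dbt_project.yml.

# A minimal sketch of profiles.yml for BigQuery with OAuth.
# 'default', 'your-gcp-project' and 'your_target_dataset' are placeholders.
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: your-gcp-project      # project the model output will be saved to
      dataset: your_target_dataset   # dataset the model output will be saved to
      threads: 1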

You need to specify the project and the dataset. Important: this will be the project and the dataset that you’ll save the data to. The project and dataset that the queries will run against are specified at the model settings level.

For the sake of simplicity we’re using OAuth. If you want to use another authentication method, you can set it up by following the steps here.

3) Clone/download the model repository

You’ll find it here: dbt-ga-bq-replicate.

4) Change the variables in the dbt_project.yml file

In the cloned repository you will find a .yml file called dbt_project.yml. It contains all of the project settings. In our case that amounts to only three variables:

- rangeStart

- rangeEnd

- tableName

The first two describe the range of daily tables to be queried. To limit the amount of data being queried, the default is set to a single day (July 1, 2019).

The tableName variable should be set to the name of your BigQuery table containing GA data. The name needs to follow Standard SQL conventions (a dot instead of a colon, backticks instead of square brackets). Also, leave the table name in the table_name_* wildcard format: all of the model logic is built on a WHERE _table_suffix clause, and including a concrete date in the table name would cause an error.
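For illustration, the variables section could look roughly like this. The values are placeholders, and you should keep the structure that the repository’s dbt_project.yml already has (where exactly vars sits in the file has changed between dbt versions).

# Placeholder values; keep the structure already present in the repo's dbt_project.yml.
vars:
  rangeStart: '20190701'
  rangeEnd: '20190701'
  tableName: '`your-project.your_dataset.ga_sessions_*`'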

5) Compile the model

This is where we’ll do something with the actual dbt. First, you need to compile the model by running the

dbt compile

command in your shell in the project directory.

This will make the magic happen. Jinja variables, references and macros will be translated into real SQL queries and saved in a newly created ‘target’ folder. The tool will create a virtual DAG (Directed Acyclic Graph) based on the references in the model, so all the queries will be run in the correct order. But not at this point: compiling the model doesn’t run any actual SQL or query the data. If you go to the newly created ‘target’ folder in your project, you’ll be able to see the plain SQL that will be executed if you decide to run the model. Ready?
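As an example, the hypothetical model sketched earlier would show up under the ‘target’ folder as plain SQL with every Jinja expression resolved, roughly like this (table name and dates are placeholders):

-- Roughly what the compiled version of the earlier hypothetical model looks like:
-- every Jinja expression has been replaced with a literal value.
SELECT
  date,
  SUM(totals.visits) AS sessions
FROM `your-project.your_dataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190701' AND '20190701'
GROUP BY date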

6) Run the model

A few important points here:

1) Please make sure that you don’t have any tables named ‘productPerformance’, ‘allPages’ or ‘sourceMedium’ in your target dataset (specified in the profiles.yml file). If you do and you run the model, those tables will be replaced! To avoid this you can change the target dataset in your profiles.yml file, rename the .sql files so they don’t share names with the tables you currently have, or use aliases (see the sketch after these notes).

2) Running the model starts normal SQL queries in BigQuery, so if you decide to do it, you will be billed according to your Google Cloud pricing. By default, the model is set up to query only one day of data (20190701). You can check the amount of data processed by simply copy-pasting the compiled queries from your ‘target’ folder into the BigQuery interface.
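If renaming files feels clunky, one way to use an alias (assuming your dbt version supports the alias config, which recent releases do) is to add a config line at the top of the relevant model file; the alias name below is just an example:

-- Hypothetical: keep the file name productPerformance.sql, but write the output
-- to a table called productPerformance_v2 so an existing table is not overwritten.
{{ config(alias='productPerformance_v2') }}
-- ...the rest of the model's SQL stays exactly as it is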

So, whenever you’re ready, you can just use:

dbt run

This will run the model, prompting authentication in your browser. Provided that everything went well, it will create three tables, productPerformance, allPages and sourceMedium, in the BigQuery project and dataset that you specified in your profiles.yml file. The tables should contain the data in exactly, or in some cases almost exactly, the same form as in the corresponding Google Analytics reports. Some metrics are deliberately omitted due to complexity, and metrics calculated using cardinality estimation algorithms might be off by a very small margin, but generally it should be the same data.

That’s it?

That’s it. After the initial fuss of installing dbt and configuring the connection profile, implementing and running the model should be a matter of minutes. It’s extremely reusable: even the ability to change the date range and the source table in one place might save insane amounts of time spent on updating _table_suffix in the long run. You can create pretty little variables that hold your complicated table names. I’m pretty sure there is a way to add some simple segmentation logic that would let you quickly create tables with data for only a chunk of your users. And don’t forget that my understanding of dbt is very basic and the model uses only a very small chunk of dbt’s features. How about automated testing? Scheduling models? Separate environments? More sophisticated macros?

With the landscape changing so fast, much of web-analytics-related work can become a matter of reconciling different data schemas and joining historical data. Web and App (Firebase) vs the old Google Analytics vs Snowplow vs advertising data vs Mixpanel? Whatever the combination, you can still expect a lot of fuss, for which you’ll need a proper toolkit.

dbt offers a lot. But leaving the technical features aside, I believe that the most important thing about it is the mindset shift. It encourages building your own model library and moving away from overproducing ad-hoc SQL. Combined with GitHub, it enables sharing and pulling queries out of analyst silos. Maybe somebody will add more tables to this model until all GA reports are replicated, however useless that would be?

If you’re up for this task and want to learn more about how dbt really works, stay tuned. In the second part of this dbt duology I will try to explain the model itself and how it was created.
