How to Schedule Queries Using Airflow and Google BigQuery

Karl Christian · Published in DataSeries · Jun 22, 2019

I’m part of an amazing data team as a Business Intelligence engineer at Tokopedia (full profile in LINK). Our team is growing fast and always has one simple mission: automate the manual jobs and scale without limits to handle any case. And we are hiring (I should brand my company as well, haha), please check Tokopedia Career.

I want to thank Tokopedia for continually giving me valuable knowledge and pushing me to keep learning; I feel I have something I need to share with the universe.

DEMOCRATIZE ACCESS TO KNOWLEDGE

That’s our data team’s vision. This time, I want to share my experience using Google BigQuery (BQ) and Airflow to create an orchestration workflow that builds one of our important reports.

Problem Statement

The company needs a dashboard of the Conversion Rate (CVR) of products in Tokopedia. It mostly tracks the user journey: for example, when users first visit the platform, how many of them register, then create an order, pay for the order, and so on. CVR is usually defined as the end-of-funnel session count divided by the start-of-funnel session count.
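To make the definition concrete, here is a tiny sketch with made-up numbers (purely illustrative, not real Tokopedia figures):

```python
# Hypothetical funnel counts for one day -- illustrative numbers only.
start_funnel_sessions = 120_000  # sessions that started the funnel (first visit)
end_funnel_sessions = 3_600      # sessions that reached the end (paid order)

cvr = end_funnel_sessions / start_funnel_sessions
print(f"CVR: {cvr:.2%}")  # -> CVR: 3.00%
```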

Those numbers come mostly from Google Analytics. That’s why, to create this report, I need to liaise with other teams as well: the data tracking team (the people who know GA inside out) and the data warehouse team (the people who ingest data directly from GA into BQ).

Assume I have discussed this thoroughly with those teams, and the numbers can be trusted enough to be consumed by Tableau and turned into a dashboard.

Then, what I need to do is make the dashboard reliable day after day: no missing data, and the dashboard always shows the right numbers when users look at it.

Analyzing the Problem

From the problem statement above, I want to gather the problems, get to the point, and construct the right solution.

The problems:

  1. The data is big and keeps growing, because it is event-based data from GA consisting of a huge number of rows.
  2. Users want to analyze it in arbitrarily complex ways, with a lot of calculation involved, so there are two options: do the calculation in BigQuery, or do it as calculated fields in Tableau.
  3. It needs a scheduler tool to automate the ETL process from the source tables to the result tables that the dashboard will consume.

Solution

For the second problem, I recommend that whenever a calculation can be handled by BigQuery, it should be done in BigQuery. That creates the best user experience when users open the dashboard in Tableau (read: less loading time because there is less calculation).
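As an illustration, here is a minimal sketch of materializing the CVR calculation in BigQuery with the Python client, so Tableau only reads a small pre-aggregated table. The project, dataset, table, and column names (`my-project`, `analytics.ga_sessions_summary`, `funnel_step`, `reporting.cvr_daily`) are hypothetical placeholders, not our real schema:

```python
# A minimal sketch: push the CVR aggregation into BigQuery and write the
# result to a small summary table. All names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project id

sql = """
SELECT date,
       COUNTIF(funnel_step = 'visit') AS start_sessions,
       COUNTIF(funnel_step = 'paid')  AS end_sessions,
       SAFE_DIVIDE(COUNTIF(funnel_step = 'paid'),
                   COUNTIF(funnel_step = 'visit')) AS cvr
FROM `my-project.analytics.ga_sessions_summary`
GROUP BY date
"""

job_config = bigquery.QueryJobConfig(
    destination="my-project.reporting.cvr_daily",  # the small table Tableau reads
    write_disposition="WRITE_TRUNCATE",            # rebuild the summary each run
)
client.query(sql, job_config=job_config).result()  # block until the job finishes
```

With the aggregation materialized like this, Tableau scans a handful of rows per day instead of the raw event-level GA export.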

For the first and the third, I will use BigQuery and Airflow in combination to create a reliable data source.

Airflow is a workflow management tool for scheduling, and it’s built on Python. It is as dynamic as Python itself, extensible and customizable, and it scales well.

[Figure: preview of an Airflow workflow]

From the figure above, you can see that I set dependencies from one task to another, and it can be done with just a bit of Python. You can also check whether the source is ready; if it’s not ready yet, the scheduler will delay the ETL and run it the next hour. This tool can help you fully automate the dashboard, right up to the point where the summary data is created in BigQuery.
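To give a feel for how little code this takes, below is a minimal sketch of such a DAG using the Google provider package for Airflow. The DAG id, schedule, project/dataset/table names, and sensor timings are assumptions for illustration, not our production pipeline; the sensor mirrors the "wait and retry next hour" behavior described above:

```python
# A minimal sketch of the DAG described above -- all ids and names are
# illustrative assumptions, not the production pipeline.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.sensors.bigquery import BigQueryTableExistenceSensor

# The daily CVR aggregation (hypothetical table/column names).
CVR_SQL = """
SELECT date,
       SAFE_DIVIDE(COUNTIF(funnel_step = 'paid'),
                   COUNTIF(funnel_step = 'visit')) AS cvr
FROM `my-project.analytics.ga_sessions_summary`
GROUP BY date
"""

with DAG(
    dag_id="cvr_daily",
    start_date=datetime(2019, 6, 1),
    schedule_interval="0 6 * * *",  # run once a day at 06:00
    catchup=False,
) as dag:
    # Wait until the day's GA export has landed in BigQuery. If it is not
    # there yet, the sensor re-checks every hour, delaying the ETL exactly
    # as described above.
    wait_for_ga = BigQueryTableExistenceSensor(
        task_id="wait_for_ga_export",
        project_id="my-project",
        dataset_id="analytics",
        table_id="ga_sessions_{{ ds_nodash }}",  # daily GA export shard
        poke_interval=3600,   # re-check every hour
        timeout=12 * 3600,    # stop waiting after 12 hours
    )

    # Rebuild the summary table that the Tableau dashboard reads.
    build_cvr_summary = BigQueryInsertJobOperator(
        task_id="build_cvr_summary",
        configuration={
            "query": {
                "query": CVR_SQL,
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "reporting",
                    "tableId": "cvr_daily",
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )

    wait_for_ga >> build_cvr_summary  # the ETL only runs once the source is ready
```

The sensor-then-query pattern keeps the dependency explicit: the summary query never runs against a partially loaded source table.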

DNA Tokopedia: Make it Happen, Make it Better

That DNA is what makes me able to do all of this. The first step is to make it happen: the ETL scheduler now runs with almost no human interference. The next step, probably, is to make it better: trim the Airflow automation down to even less Python code while keeping the same quality.

Thank you for taking the time to read my writing. Please clap if you liked it, and watch out for the next one. See ya!

