Data everywhere
Photo by Joshua Sortino on Unsplash

Streamlining Helpshift’s Internal Analytics & Reporting Pipelines with DBT, Airflow, AWS, & Metabase

The Next-Gen Tech Stack for Transforming Raw Data into Actionable Insights

Mithil Oswal
Apr 13, 2023


Introduction

Reports
Photo by Carlos Muza on Unsplash

As data and reporting pipelines continue to become more complex and the need for timely and accurate data insights grows, it’s increasingly important to have a solid data integration framework.

DBT, Airflow, AWS, and Metabase are four powerful tools that can help companies manage their data infrastructure and build robust, scalable, and reliable data and reporting pipelines.

In this blog post, we’ll take a closer look at how these tools can be used together to create a powerful framework and streamline the end-to-end process of generating custom reports for internal purposes.

But before moving on to that, let's first look at how this was previously achieved.

Previous Approach

Pythonic — Jenkins approach

Steps:

  • Write a Python script containing queries that pull data from various databases and write the results to an S3 bucket location.
  • Write a Groovy file that sets the script up as a Jenkins job scheduled to run daily/weekly.
  • Create tables in Athena based on the result files.
  • Design various reports / dashboards in Metabase.
  • Done!

Cons of this framework:

  • Lengthy & time-consuming dev work.
  • Requires knowledge of Python.
  • Complex Python script: contains code for fetching credentials, connecting to databases using different libraries, running SQL queries, pre-processing data, & pushing to S3 (see the sketch after this list).
  • Requires manually creating tables in Athena.
  • Needs comprehensive testing.
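To give a sense of how much plumbing each report required, here is a minimal, simplified sketch of what such a script typically looked like. This is an illustrative reconstruction, not the actual production code; the host, query, bucket, and file names are placeholders.

# Simplified, illustrative sketch of the old Pythonic-Jenkins style script.
# Credentials, the query, and the S3 path below are placeholders.
import io

import boto3          # AWS SDK, used to push the result file to S3
import pandas as pd
import psycopg2       # one of several DB drivers such a script had to juggle

def fetch_metrics() -> pd.DataFrame:
    # Connect to one of the source databases and pull the raw data
    conn = psycopg2.connect(
        host="db.internal.example.com",   # placeholder host
        dbname="analytics",
        user="report_user",
        password="********",              # in practice, fetched from a secrets store
    )
    try:
        return pd.read_sql("select domain, id, time_to_resolve from issues", conn)
    finally:
        conn.close()

def write_to_s3(df: pd.DataFrame, bucket: str, key: str) -> None:
    # Serialize the pre-processed data and upload it so Athena can query it
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

if __name__ == "__main__":
    metrics = fetch_metrics()
    # ... pre-processing / aggregation logic lived here ...
    write_to_s3(metrics, bucket="my-report-bucket", key="reports/metrics.csv")

Every new report meant repeating some variation of this boilerplate, plus the Groovy job definition and the manual Athena DDL.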

Now let us look at a much improved version of this framework.

Let’s break the ice with some introductions now, shall we?

The Tech Stack

  • DBT (Data Build Tool):
DBT — https://www.getdbt.com/

DBT is an open-source tool that helps you manage your SQL-based data transformations. With DBT, you can build a SQL-based data pipeline that transforms raw data into a format that is ready for analysis. It provides a structured way to define SQL-based transformations, which can be versioned, tested, and deployed as code.

DBT is particularly useful for data modelling and building data pipelines because it allows users to create data models in a modular and reusable way. DBT is compatible with multiple databases.

The full list of supported data platforms is available here: Supported data platforms | dbt Developer Hub.
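As a small illustration of that modularity, one model can build on another through dbt's ref() macro. The sketch below is hypothetical and only meant to show the pattern; the staging and downstream model names are made up, while hs_gold.issues_fact is the source table used later in this post.

-- models/stg_issues.sql (hypothetical staging model)
select id,
       domain,
       created_at_date,
       time_to_resolve
from hs_gold.issues_fact

-- models/daily_issue_counts.sql (hypothetical downstream model)
select created_at_date,
       domain,
       count(id) as issue_volume
from {{ ref('stg_issues') }}  -- dbt resolves this reference and tracks the dependency
group by created_at_date, domain

Because the dependency is declared through ref(), dbt knows to build stg_issues before daily_issue_counts, and both models can be versioned, tested, and documented as code.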

  • Airflow:
Airflow — https://airflow.apache.org/

Airflow is used to create, schedule, and monitor workflows. It allows users to define complex workflows as DAGs (Directed Acyclic Graphs), which can be composed of different tasks that run in parallel or sequentially. Airflow is particularly useful for managing data pipelines because it provides a way to schedule and orchestrate data transformations, and it can handle dependencies between tasks.

Airflow supports a wide range of data sources and data processing tools, which makes it a flexible solution for managing pipelines.
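As a quick, hypothetical illustration of how tasks and their dependencies are expressed (the DAG name, task names, and commands below are placeholders, not part of our pipeline):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two extract tasks run in parallel; the aggregate task waits for both
with DAG(
    'example_dependencies_dag',
    start_date=datetime(2023, 3, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    extract_a = BashOperator(task_id='extract_a', bash_command="echo 'extract source A'")
    extract_b = BashOperator(task_id='extract_b', bash_command="echo 'extract source B'")
    aggregate = BashOperator(task_id='aggregate', bash_command="echo 'aggregate results'")

    [extract_a, extract_b] >> aggregate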

  • Metabase:
Metabase — https://www.metabase.com/

Metabase is an open-source tool for visualizing and exploring your data. It is used to create charts, dashboards, and reports that help you understand your data and make informed decisions. It provides a user-friendly interface for exploring and visualizing data, and it can be connected to a variety of data sources.

Metabase is particularly useful for data exploration and analysis because it allows users to easily create and share reports and dashboards.

The significantly optimized DBT — Airflow — AWS — Metabase Framework for Data & Reporting Pipelines

DBT — Airflow — AWS — Metabase Approach

Steps:

  • Write a DBT Model, which is basically the SQL transformation query.
  • Write a DAG Script that schedules the DBT model to run on a daily/weekly cadence.
  • Create your reports in Metabase using the tables automatically created during the DBT run.
  • Voila! Done!

Let’s look at this framework and the process in detail:

The fundamental prerequisite is that you have DBT, Airflow, and Metabase installed and set up on your server.

1. Install dbt-athena or any other DBT adapter for your database.

Reference: GitHub — Tomme/dbt-athena: The athena adapter plugin for DBT
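At the time of writing, the community Athena adapter can be installed with pip; the exact package name may change, so check the adapter's README for your dbt version.

pip install dbt-athena-community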

2. Initialize a new DBT project using dbt init project_name and add the database credentials to the profiles.yml file.

The profiles.yml file might look somewhat like this:

custom_analytics_athena:
  target: dev
  outputs:
    dev:
      aws_profile_name: data_analysis
      database: awsdatacatalog
      region_name: us-east-1
      s3_staging_dir: s3://*****/*****/dbt
      schema: hs_mle_reporting
      threads: 1
      type: athena
      work_group: primary

3. Add your SQL transformation query as a .sql file in the models directory.

Here is a sample DBT model file that uses the insert_overwrite incremental strategy:

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partitioned_by=['created_at_date'],
    external_location='s3://*****/*****/dbt/metrics_aggregate'
) }}

with fact_table as (
    select created_at_date,
           domain,
           id,
           time_to_resolve
    from hs_gold.issues_fact
    where created_at_date >= date('2022-01-01')

    {% if is_incremental() %}
    and created_at_date >= current_date - interval '3' month
    {% endif %}
)

select domain,
       count(id) as issue_volume,
       avg(time_to_resolve) / 1000 as time_to_resolve_avg,
       created_at_date
from fact_table
group by created_at_date, domain
order by created_at_date, domain

On its first run, the above SQL creates a table partitioned by the created_at_date column, containing aggregated metrics derived from a huge master table, and stores it at the specified S3 bucket location.

For every run after that initial one, the model executes incrementally, i.e. it refreshes only the data for the last 3 months by overwriting the corresponding partitions.

This is achieved using Jinja. You can read more about it here:

Jinja and macros | DBT Developer Hub
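To make the incremental behaviour concrete, this is roughly what the WHERE clause compiles to in each case (illustrative, slightly simplified):

-- First run (full build): the is_incremental() block is skipped
where created_at_date >= date('2022-01-01')

-- Subsequent runs: the extra filter is included, so only the last ~3 months
-- of partitions are rebuilt and overwritten
where created_at_date >= date('2022-01-01')
  and created_at_date >= current_date - interval '3' month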

4. Add the above model to the schema.yml file.

version: 2

models:
  - name: metrics_aggregate
    description: "Test model for issues data"

5. Run the DBT model using dbt run --models metrics_aggregate.

6. Now let’s schedule this model to run periodically using a DAG Python Script in Airflow.

So what’s a DAG?

In Airflow, a DAG (Directed Acyclic Graph) is a collection of tasks that define a data pipeline. A DAG is essentially a graph that shows the flow of data between tasks, with each task representing a unit of work that needs to be completed as part of the pipeline.

Each task in a DAG is typically an instance of an operator (or a Python function wrapped with the TaskFlow @task decorator), and tasks can be chained together to form a pipeline. When a DAG runs, Airflow schedules its tasks according to the dependencies and relationships defined in the DAG.

Let’s look at an example DAG script below —

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 3, 1),
    'retries': 0
}

with DAG(
    'metrics_aggregate_dag',
    default_args=default_args,
    catchup=False,
    dagrun_timeout=timedelta(minutes=30),
    schedule_interval='0 7 * * *'   # every day at 7:00 am
) as dag:

    # Run only the metrics_aggregate model from the dbt project directory
    dbt_run_metrics_aggregate = BashOperator(
        task_id='metrics_aggregate_model',
        bash_command="cd ~/dbt/data_analysis && dbt run --models metrics_aggregate"
    )

The above DAG script defines a DAG named metrics_aggregate_dag with a BashOperator task that runs a dbt command to execute the metrics_aggregate model.

The DAG is scheduled to run every day at 7:00 am, and it uses default arguments such as the DAG owner, the start date, and the number of retries.

The dbt_run_metrics_aggregate task is a BashOperator that uses the bash_command parameter to run the dbt run shell command against only the metrics_aggregate model.

Setting catchup to False ensures that Airflow does not backfill runs for intervals that have already passed since the start date; dagrun_timeout sets the maximum time a single DAG run is allowed to take.
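If you also want the tests to run automatically after each build, a second task can be chained onto the same DAG. This is a minimal sketch; the test task is an addition for illustration, placed inside the same with DAG block as the run task above:

    # Hypothetical follow-up task: run the dbt tests for the same model
    dbt_test_metrics_aggregate = BashOperator(
        task_id='metrics_aggregate_tests',
        bash_command="cd ~/dbt/data_analysis && dbt test --models metrics_aggregate"
    )

    # The tests run only after the model has been built successfully
    dbt_run_metrics_aggregate >> dbt_test_metrics_aggregate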

Airflow UI with DAGs

7. Once DBT creates the corresponding tables, you should be able to see the data in Metabase — this is assuming you have the database & schema connections already set up.

Metabase Tables

8. Now go ahead and create highly powerful and optimized reports and dashboards in Metabase using the transformed data!

Pros of this framework:

  • Faster turnaround time for dev work.
  • Only need to write SQL transformations in the DBT Model.
  • Simple Airflow DAG script to schedule model run.
  • Tables are created automatically in Athena.
  • Significantly easier to modify the SQL, deploy, and maintain.
  • Everyone can build their own pipelines!

Wrapping it up

In conclusion, the combination of DBT, Airflow, and Metabase creates a powerful and flexible data pipeline for managing and analyzing data. DBT provides a reliable and scalable way to transform and model data, while Airflow enables the orchestration and scheduling of data pipelines. Metabase provides an intuitive and user-friendly interface for exploring and visualizing data.

By integrating these three tools together, data teams can streamline their data pipelines, improve data quality, and gain valuable insights into their business. DBT, Airflow, and Metabase have quickly become go-to tools for modern data teams, and their popularity is only expected to grow as data volumes and complexity continue to increase.

Overall, the DBT + Airflow + Metabase framework provides a powerful and efficient solution for data teams looking to improve their data pipelines and gain valuable insights from their data.
