
Taming Dependency Hell with dbt

Managing data warehouse views with dbt at Tiqets


At Tiqets, we maintain a number of derived views in our data warehouse. Instead of always going back to the raw data to answer a question, we model the data into higher-level abstractions that are useful to the business. For example, we derive:

  • active customers from orders
  • ad cost allocation from orders and Google Ads reports
  • conversion funnels from pageviews and a multitude of goals

These views have been very useful, not only in preventing repetitive work but also in improving our understanding of the business. However, they were also a pain to manage.

We made heavy use of Looker Persistent Derived Tables (PDTs) to model around 80 views. They were created on a schedule, managed in Looker. The main problem was that we had no way of tracking dependencies between these views. For instance, we wanted to make sure that the latest order data had landed in the warehouse before updating the active customers view. This gets even more complicated when PDTs depend on other PDTs.

To make sure everything was up-to-date, we created an intricate schedule that took into consideration the dependencies and how long it took to create each of the PDTs. It mostly worked, until it didn’t. Data ingestion could be delayed, queries could take longer than usual, or we could see failures due to schema changes. Whenever that happened, someone had to manually tame this dependency hell and refresh the PDTs.

In this blog post, we’ll describe how we’re using dbt to simplify the management of these views and build more trust in the data we store in our data warehouse. Our solution includes Gitlab CI for automated tests and delivery, and Apache Airflow for scheduling to keep the views up-to-date.

What’s dbt?

Historically, data warehouses were resource-constrained, slow, and expensive systems. That led to the development of the ETL (extract, transform, load) pattern: the process of creating new database objects by extracting data from a source database, transforming it on a separate server, and loading the transformed data into the warehouse. Modern data warehouses have reduced the cost of storing data while increasing the available processing power and memory. That enabled the emergence of the ELT (extract, load, transform) pattern, where the extracted raw data is first loaded and then transformed inside the data warehouse.

The Data Build Tool, dbt, is becoming the de-facto standard for performing transformations in modern data warehouses (and other SQL engines). With dbt, one describes transformations from the existing data into new views, called models, by writing SQL queries. These queries are templated with Jinja, which adds a lot of flexibility. A templated query looks like this:
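
    with source as (

        {#-
        Normally we would select from the table here, but we are using seeds to
        load our data in this project
        #}
        select * from {{ ref('raw_payments') }}

    ),

    renamed as (

        select
            id as payment_id,
            order_id,
            payment_method,

            -- `amount` is stored in cents, so we convert it to dollars
            amount / 100 as amount

        from source

    )

    select * from renamed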

From: https://github.com/fishtown-analytics/jaffle_shop/blob/master/models/staging/stg_payments.sql

The ref directive on line 7 is one of the most powerful features of dbt. By templating the references to other models, dbt keeps track of the dependencies between them. As a result, it can build the models in the correct order, and even run queries in parallel when possible.
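
When the project is compiled, each ref resolves to the fully qualified name of the referenced model for the current target. Roughly (the schema name here is illustrative):

    -- line 7 after compilation, against an illustrative production schema
    select * from analytics.raw_payments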

Developing with dbt

For development, we use a typical software development flow with a Continuous Integration (CI) pipeline. The code is kept in a git repository, and changes are made in feature branches. Every time a change is pushed, an automated CI pipeline runs linting and tests (more on that later). The code is then reviewed by another Tiqeteer. If everything checks out, the branch is merged, triggering an automated deployment. Finally, after deployment, the pipeline generates the documentation. This logic lives in a Gitlab CI pipeline (similar to what the folks at the Gitlab data team do).

Having automated tests allows us to make changes confidently. This is made possible by two dbt features:

  • Tests: provide a simple way to run assertions on the data (e.g., is this column unique?). There are a few built-in tests, but you can also test anything you can express with a SQL query.
  • Custom schemas: these allow us to build models with the updated code without touching the models in our production environment.

We combine these two features to automatically create the models in a custom schema and run tests against them, for every feature branch.
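
As an illustration of how flexible the tests are, a custom test is just a query that should return zero rows. A minimal sketch, assuming a hypothetical orders model:

    -- tests/assert_no_future_orders.sql (hypothetical example)
    -- dbt reports the test as failed if this query returns any rows
    select order_id, created_at
    from {{ ref('orders') }}
    where created_at > current_date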

The pipeline has the following steps:

  1. Build a Docker image with dbt and our project.
  2. Compile the dbt project to catch simple mistakes, and check with sqlfluff that the queries follow our standards.
  3. Deploy over 200 models with dbt run.
  4. Test the models with dbt test; we currently have over 1000 tests in master.
  5. Document the project with dbt docs generate; we serve the results with Gitlab Pages.

The interesting bits are in steps 3 and 4. The way they work depends on whether the pipeline is for a merge request or for the master branch. Let’s start with the simpler case: master pipelines. Basically, we just execute dbt run targeting the production schema. If the execution is successful, we save the results (the target folder) to an S3 bucket that will be used in merge request runs.

For merge requests, we create a schema based on the branch name. For instance, a branch called feature/add_awesome_model would create a schema called dbt_feature_add_awesome_model in our warehouse. Because building every model from scratch takes a long time, we only build the models that were created or modified in the branch. That’s possible with the “slim CI” feature; here’s how it works: the --state flag expects the results from a previous dbt run (which we saved to S3 in our master pipeline). Based on this state, dbt can identify and build only new and modified models (selected with the state:modified selector).

An interesting problem happens when one of these new or modified models depends on an unmodified model, which is therefore not built by the merge request pipeline. Using the --defer flag, you can instruct dbt to use the last production version of this model. Together, these flags greatly reduce the time required to run the pipeline.

Still, some models are very expensive to build from scratch because they simply have too many rows. In this case, we use a simple Jinja template that checks the current target. If it’s not production, we limit the number of rows with a custom where clause. It looks something like this:
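
    -- sketch: the source, date column, and cutoff below are illustrative
    select *
    from {{ source('clickstream', 'pageviews') }}
    {% if target.name != 'production' %}
        -- outside production, keep only recent rows to speed up CI builds
        where viewed_at > dateadd(day, -3, current_date)
    {% endif %}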

Finally, testing after deployment may seem backwards from a software development perspective. However, dbt tests are assertions on the generated models, so we can only run them after the models are built. We often catch development “bugs” in the feature branches, before merging them. Failures in the master pipeline rarely happen when a feature is first merged. However, they are very useful in scheduled runs, indicating that something in the source data has changed. We’ll discuss scheduling in the next section.

Scheduling updates and tests

Merging a branch to master updates the models whenever the code changes, but we also need a solution to keep the models up-to-date when new data lands in the warehouse. At Tiqets, we use Apache Airflow for scheduling our data pipelines, so we’ve implemented a DAG that triggers our Gitlab CI pipeline after we load data from our production database. It consists of the following tasks:

  • wait_dwh_sync is a sensor that blocks until the export from the production database has been completed
  • start_pipeline calls the Gitlab API to start the CI pipeline, updating the models and running the tests
  • wait_pipeline is a sensor that checks the pipeline status and generates a Slack alert in case of errors
  • drop_feature_schemas is a supporting task that drops all schemas created to run the CI pipeline for feature branches in Gitlab, as discussed in the previous section.
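
In outline, the DAG looks something like the sketch below. The callables are stubs and the dag_id is illustrative; the real tasks poll our warehouse metadata and the Gitlab API:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor

    def _dwh_sync_finished() -> bool:
        return True  # stub: check that the production export has landed

    def _start_ci_pipeline() -> None:
        pass  # stub: call the Gitlab API to trigger the pipeline on master

    def _pipeline_finished() -> bool:
        return True  # stub: poll the Gitlab API; alert Slack on failure

    def _drop_feature_schemas() -> None:
        pass  # stub: drop the dbt_feature_* schemas from merge request runs

    with DAG(
        dag_id="dbt_production_run",  # illustrative name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        wait_dwh_sync = PythonSensor(
            task_id="wait_dwh_sync", python_callable=_dwh_sync_finished)
        start_pipeline = PythonOperator(
            task_id="start_pipeline", python_callable=_start_ci_pipeline)
        wait_pipeline = PythonSensor(
            task_id="wait_pipeline", python_callable=_pipeline_finished)
        drop_feature_schemas = PythonOperator(
            task_id="drop_feature_schemas", python_callable=_drop_feature_schemas)

        wait_dwh_sync >> start_pipeline >> wait_pipeline >> drop_feature_schemas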

However, we soon realized that we wanted to update models that depend on different data sources. We have currently implemented two other use cases, which we call Ad Cost Allocation and Time Machine (with several more planned). The Ad Cost Allocation pipeline depends on reports from Google Ads, Google Analytics, and Bing Ads. It transforms the data to a common format and allocates costs to orders, following a number of business rules. The Time Machine pipeline is a model holding daily snapshots of important tables from the production database. It allows us to travel back in time and retrieve the contents of a table; for instance, the availability of products in the Rijksmuseum on 2020-07-03.

These models have different data sources and could potentially execute at different frequencies. To support them, we’ve implemented custom hooks and operators based on the airflow-dbt project. Our main change was to integrate with Gitlab to retrieve our dbt project: before any execution, we clone a fresh copy of the repository. That way we don’t need to worry about how to ship dbt models to Airflow; we execute whatever is on master. We authenticate using an API token, stored in the Airflow metadata database.

This setup is fairly clean and straightforward, however we do have some pain points. Before describing them, let’s first discuss some other dbt use cases at Tiqets.

Other use cases

Besides running the three data pipelines described above, we have a few other use cases.

Managing User Defined Functions (UDFs): adding UDFs to our dbt repo gives us a way to version-control the code. UDFs are also part of our CI pipeline, meaning they get attached to a custom database schema, allowing us to try something new without touching the production data. Bonus points if you use seeds to create fixtures, and test your UDFs before shipping!
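
As a sketch, a version-controlled Redshift UDF could look like the one below. The function itself is hypothetical; {{ target.schema }} is what attaches it to the schema of the current run:

    -- hypothetical UDF, created in the schema of the current dbt target
    create or replace function {{ target.schema }}.f_is_weekend (d date)
    returns boolean
    stable
    as $$
        -- Redshift SQL UDFs reference arguments positionally: $1 is `d`
        select extract(dow from $1) in (0, 6)
    $$ language sql;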

Data freshness sensors: these verify whether fresh data has landed in a table. Running the ad cost allocation process before all reports had landed in the warehouse was a big source of confusion. As some reports are provided by a third party, we have no visibility into when they are done. However, we can use freshness checks to detect when data for the current day has landed, and then trigger the allocation process.
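
Under the hood, a source freshness check boils down to a query like the following, whose result is compared against the configured thresholds (the table and column names here are illustrative):

    -- how fresh is the Google Ads report? (the loaded_at_field, in dbt terms)
    select max(_loaded_at) as max_loaded_at
    from raw.google_ads_report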

Data integrity tests: not all data in our warehouse is under dbt control (yet?). Nonetheless, we can still use dbt to make assertions about it. For instance, we have tests that check whether our latest scraper runs have the expected output, or whether our clickstream data contains invalid records.

Anomaly detection: another use of tests is to check whether we observe large changes in our datasets. Anything that can be expressed in SQL can be a test, so we keep track of our acquisition channel mix, triggering a Slack message if we detect large shifts. Bonus points if you message @stakeholder instead of @data-crew.
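
Such a test can be as simple as the sketch below, which fails (returns rows) when yesterday’s order count for a channel deviates more than 50% from its trailing average. The model, columns, and threshold are illustrative:

    -- hypothetical anomaly test: any returned row means "anomaly detected"
    with daily as (
        select channel, created_at::date as order_day, count(*) as n_orders
        from {{ ref('orders') }}
        where created_at > dateadd(day, -8, current_date)
        group by 1, 2
    ),

    compared as (
        select
            channel,
            avg(case when order_day < current_date - 1 then n_orders end)
                as trailing_avg,
            max(case when order_day = current_date - 1 then n_orders end)
                as yesterday
        from daily
        group by 1
    )

    select *
    from compared
    where yesterday not between 0.5 * trailing_avg and 1.5 * trailing_avg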

Snapshotting: the dream of every data engineer is immutable data, making it possible to reconstruct any state by re-running tasks. Unfortunately, in the real world, data often changes in place. Snapshots keep track of changes in a model, creating a new row for every update. This allows us to keep track of updates in our Google Analytics reports.
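
A minimal snapshot definition looks something like this (the source and unique key are illustrative):

    {% snapshot google_analytics_report_snapshot %}

    {{
        config(
            target_schema='snapshots',
            unique_key='report_row_id',
            strategy='check',
            check_cols='all'
        )
    }}

    select * from {{ source('google_analytics', 'report') }}

    {% endsnapshot %}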

Pain points and next steps

While we are very happy with the automatic dependency management provided by dbt, we do have a few pain points.

First, debugging failed tests is not easy. The queries for failed tests are stored in the logs, but once you have them it’s still hard to find the offending rows. For instance, the query for testing the uniqueness of a column returns the number of duplicate rows. You’ll need to rewrite the query to understand what’s going on. This is a known issue.
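
For the uniqueness case, that rewrite typically ends up looking like this (table and column names are illustrative):

    -- show the duplicated values, instead of just counting them
    select payment_id, count(*) as n_rows
    from analytics.stg_payments
    group by payment_id
    having count(*) > 1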

A second pain point is not a dbt limitation, but a flawed assumption on our side. We started from the assumption that the dbt models should be updated after we sync data from our production database to our warehouse. Although it is true that most of our models are based on that data, we have a multitude of data sources: CRM, email SaaS tools, ticketing systems, etc. Splitting these into different DAGs gave us some more flexibility, but now we have to worry about concurrency issues: are these DAGs updating the same models or UDFs? We’ll need to design a more flexible way to organize and update models. That way, our models will always be as fresh as possible.

A third issue is around our CI pipeline runs for feature branches. Some models are very expensive to build, using a lot of our Redshift cluster resources. So, while we can in theory test all models without changing what’s in production, we can still have performance issues, and angry Looker users, while some models are being built. In one instance, running a model in CI caused the cluster to run out of disk space 🔥

Besides addressing these pain points, we are working towards simplifying the flow by having all “production” dbt runs in Airflow. This should mainly address the second pain point, by allowing us to use data source updates to trigger dbt runs. Plus, it looks cool.

Stay tuned for a follow-up post going into more detail on how Data Analysts have been using dbt at Tiqets to model our data. Or, if you are very interested, we are hiring:

Senior data engineer

Data engineer

Jobs at Tiqets

Rafael Barbosa is a Data Engineer at Tiqets. When he’s not busy taming dependencies, he (unsuccessfully) tries to learn Russian to speak with his kids.
