dbt modeling at scale

William Nouet
Checkout.com-techblog
11 min read · Jul 27, 2022

Data platform is a widely used term, but it can mean many different things from one company to the next. Why is that? Let’s start with a definition of what a data platform is. After going through a number of different sources, we believe this definition is the best fit: a data platform is an integrated set of technologies that collectively meet an organization’s end-to-end data needs. It enables the acquisition, storage, preparation, delivery, and governance of your data, as well as a security layer for users and applications. This also explains why a data platform can mean so many things: every company has a different data ecosystem.

As a company, we see data as a strategic asset for making data-driven decisions that ultimately benefit our merchants, and this is where the Data Platform team comes in. The Data Platform aims to be a one-stop shop for all the analytical data needs of the company by providing the necessary tools and platforms.

The Data Platform team is here to ensure internal stakeholders can easily collect, store, process and utilize data to build reports or products aiming to solve business problems. Our focus is on maximizing the amount of time business stakeholders spend on solving business problems and minimizing time spent on technical details around implementation, deployment, and monitoring of their solutions.

We are the central data team that helps other teams solve their business problems, and not just by working on ad-hoc requests (solving one problem at a time). We structured our platform into four main pillars that cover key areas affecting major parts of the company (solving many problems at a time), namely: infrastructure, ingestion, processing, and governance.

In this article, we will focus on our data processing abilities in Snowflake through the use of dbt from development to deployment and monitoring.

1. Data Modeling Decentralisation

We use Snowflake as our data warehouse solution to build data applications and visualizations (through Looker). Snowflake serves 900+ active quarterly users running 20k+ ad hoc queries per day. Most users query either raw data, for development purposes to build data models, or data that has been previously modeled. The Data Platform Team is responsible for ingesting raw data into Snowflake in one database (LANDING) and performing out-of-the-box transformations (such as system deduplication) in another one (SOURCE), from where developers can build business-driven models.

Enter dbt! dbt is a SQL-based open source data transformation tool made by dbt Labs that allows developers to implement analytics pipelines within a data warehouse. With dbt, one can easily build models and tests, and document data assets, based on source transformations developed in SQL. The resulting artifacts usually take the form of a set of tables and views in a data warehouse. One can develop dbt models and tests using the open source CLI from the core project and integrate them within any run-time environment.

Checkout.com currently runs 15+ dbt projects containing 1000+ models, querying 1+ TB of data per day for use cases ranging from internal analytics and operational financial reporting to merchant-facing analytics and fraud detection. We have 100+ dbt developers contributing to these projects on a weekly basis. They are part of a company-wide community focused on analytics engineering, where they share technical knowledge as well as best practices and demos in dedicated forums (e.g. Slack channels, lunch & learns). The community relies on the Data Platform Team to provide the controls for their development efforts, such as data access, and libraries in the form of macros centralizing common utilities such as feature/development database set-up, data retention and data sampling.

A common ask from developers was to be able to develop models on top of production data, as the data contained in lower environments could differ from production and its volume was too small to test models’ performance. The solution was to develop data models directly in Production, on top of Production data! Such a setup meant implementing a dedicated Role Based Access Control model to enable users to write data in the Snowflake Production environment, but only into dedicated feature/development databases, whilst ensuring they could not run any destructive actions over production data, as well as creating dedicated tooling to test against a subset of existing data.

We use the terms feature and development database interchangeably. In practical terms, this means that a dbt developer, in any dbt project, will typically create a feature branch in Git, which will be complemented by a feature/development database in Snowflake. This database will most likely be destroyed once the feature branch is merged into the main Git branch.

2. Development

2.1. Project template

The Data Platform Team created a dbt project template in Github that any team can fork in order to set up their own project. This template contains a step-by-step README, dedicated packages to pull common dbt macros, as well as common table sources and profiles with environment variables. The barrier to entry to develop in Production is thus very low, provided the right data access and Snowflake resources have been granted beforehand.

2.2. Feature database

To enable dbt developers to work out of Production, we first created dedicated roles on a per-project basis of the form <PROJECT>_DBT_DEVELOPER, with underlying access to a dedicated warehouse <PROJECT>_WH (size: medium, scaling policy: economy) and access to the data the project requires. With this role, developers are able to create feature databases where all tables/views will be built via dbt during development.
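
As a rough illustration (not our exact production setup), the per-project objects and grants might look something like the following, with ANALYTICS standing in for a project name and the specific grants being assumptions:

```sql
-- Illustrative sketch only: per-project developer role and warehouse.
CREATE ROLE IF NOT EXISTS ANALYTICS_DBT_DEVELOPER;

CREATE WAREHOUSE IF NOT EXISTS ANALYTICS_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  SCALING_POLICY = 'ECONOMY'
  AUTO_SUSPEND = 60;

GRANT USAGE ON WAREHOUSE ANALYTICS_WH TO ROLE ANALYTICS_DBT_DEVELOPER;

-- Read-only access to the data the project requires
GRANT USAGE ON DATABASE SOURCE TO ROLE ANALYTICS_DBT_DEVELOPER;
GRANT USAGE ON ALL SCHEMAS IN DATABASE SOURCE TO ROLE ANALYTICS_DBT_DEVELOPER;
GRANT SELECT ON ALL TABLES IN DATABASE SOURCE TO ROLE ANALYTICS_DBT_DEVELOPER;

-- Allow the role to create its own feature/development databases,
-- without any destructive privileges on shared production databases
GRANT CREATE DATABASE ON ACCOUNT TO ROLE ANALYTICS_DBT_DEVELOPER;
```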

The dbt profile (sitting in the default location ~/.dbt/profiles.yml) of a dbt developer would be agnostic to the project they would work on, as per below. Here the PROJECT variable name would have been previously defined, along with the DBT_DATABASE_SUFFIX, which, ideally, would be of the same name as the Git feature branch/JIRA story.
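
The original profile snippet is not reproduced here; a minimal sketch of such a project-agnostic profile might look like the following, where the account, authentication method and database naming convention are all assumptions:

```yaml
# ~/.dbt/profiles.yml -- illustrative sketch, values and naming convention are assumptions
default:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "<snowflake_account>"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      authenticator: externalbrowser
      role: "{{ env_var('PROJECT') }}_DBT_DEVELOPER"
      warehouse: "{{ env_var('PROJECT') }}_WH"
      database: "{{ env_var('PROJECT') }}_{{ env_var('DBT_DATABASE_SUFFIX') }}"
      schema: DBT
      threads: 8
```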

With the above setup, developers can then create their feature database through the setup_database macro the Data Platform Team created (cf. Appendix 5.1) and contained in a dedicated Github repository that can be pulled as a dbt package.
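
Assuming the macro reads the same environment variables as the profile (a hypothetical signature, as the real macro lives in the shared package), invoking it could look like:

```bash
# Hypothetical invocation: create the feature database for the current branch
export PROJECT=ANALYTICS
export DBT_DATABASE_SUFFIX=FEATURE_JIRA_1234

dbt deps                          # pull the shared macro package
dbt run-operation setup_database  # create the feature/development database
```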

Once created, the data workflow for development looks as follows:

Figure 1: Feature database development workflow

It is worth noting that feature databases are automatically dropped through a separate process if no queries have been executed in the database over the last two weeks.

2.3. Retention

In the course of developing new models, dbt developers might wish to test against only a subset of the production data (thus avoiding queries over tables with potentially billions of records). To speed up development and save on resources, we opted for optional retention variables that can be invoked by developers. In order to reference only a subset of the existing data contained in a table, retention parameters can be added to the sources defined in a sources.yml file as per below.
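
The original snippet is not reproduced; the following is an illustrative sketch, with the metadata keys interpreted by our custom macros rather than by dbt itself:

```yaml
# models/sources.yml -- illustrative sketch of retention metadata on a source
version: 2

sources:
  - name: ledger
    database: LANDING
    schema: LEDGER
    meta:
      retention_column: replication_date
      retention_period: "1 day"
    tables:
      - name: ledger_entries
```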

In order to apply data source filtering (cf. Appendix 5.2), we add parameters as metadata, namely, the retention_column and the retention_period, which enable filtering out the data older than a defined time period. In the above example, any records which contain a replication_date older than 1 day will be filtered out of the source.

Simply put, it means that the macro adds a WHERE clause when referencing the existing source table, along the lines of the following (illustrative) query:
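
```sql
-- Illustrative only: the exact generated SQL may differ
SELECT *
FROM LANDING.LEDGER.LEDGER_ENTRIES
WHERE replication_date >= CURRENT_TIMESTAMP - INTERVAL '1 day'
```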

In the example above, any reference to LANDING.LEDGER.LEDGER_ENTRIES will be filtered to records for which the replication_date is newer than 1 day.

Note that the metadata above is applied at the schema level, meaning that any table in the schema that contains the defined retention_column will have retention applied. Schema-level retention can be overridden at the table level by defining the metadata directly on the table, as per below:
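
Again an illustrative sketch rather than the original snippet:

```yaml
# Illustrative sketch: table-level metadata overriding the schema-level default
version: 2

sources:
  - name: ledger
    database: LANDING
    schema: LEDGER
    meta:
      retention_column: replication_date
      retention_period: "1 day"
    tables:
      - name: ledger_entries            # inherits the schema-level 1 day retention
      - name: ledger_accounts
        meta:
          retention_column: replication_date
          retention_period: "10 years"  # overrides the schema-level default
```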

The above means that the data in the LANDING.LEDGER.LEDGER_ENTRIES source table will be filtered to the last day, whilst the data in the LANDING.LEDGER.LEDGER_ACCOUNTS table will be filtered to the last 10 years. Had we not defined an exception for the LANDING.LEDGER.LEDGER_ACCOUNTS table, it would have inherited the definition at the schema level (1 day).

2.4. Sampling

An initial idea to speed up development and provide access to production-like tables (at least as far as accurate table schemas) was to use sampling. The basic idea was to let users spin up a production-like development environment quickly and cheaply using Snowflake’s native sampling functionality, without having to clone entire databases or large production-scale tables, and without letting users interact with production directly during development (cf. Appendix 5.3).
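
For reference, Snowflake’s native sampling can be invoked directly in SQL; a minimal example (table and percentage chosen purely for illustration) is:

```sql
-- Roughly 1% of rows; a fixed seed makes repeated runs return the same sample
SELECT *
FROM LANDING.LEDGER.LEDGER_ENTRIES
SAMPLE BERNOULLI (1) SEED (42);
```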

One problem we encountered was reliably sampling tables when joining between datasets. Think of a typical data warehousing strategy that splits tables into facts and dimensions: how could we account for development work that needed a complete set of dimensions to inform fact tables? Or how could we test all edge cases in a fact table with only a sample, and reliably advance a feature to production?

2.5. Sources

In order to make retention and sampling transparent to end users, we have developed a bespoke macro that overrides the built-in source dbt macro (cf. Appendix 5.4). Our macro applies retention and sampling based on the targets defined in the dbt_project.yml file. Most of the dbt projects, by default, use retention for development and CI:
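
The variable names below are illustrative rather than the real configuration, but the idea is that the target lists drive when the custom source macro applies filtering:

```yaml
# dbt_project.yml -- illustrative sketch of which targets get filtering applied
vars:
  retention_targets: ['dev', 'ci']  # retention applied when running with these targets
  sampling_targets: []              # sampling is opt-in and disabled by default here
```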

Developers can thus call the source macro as they would normally do in their code, knowing that they will develop with only a subset of data and that the macro will behave exactly like the built-in source macro when models are running in production, i.e., with a prod target.

3. CI/CD

Once data modeling and development are complete, to ensure only high-quality code is deployed to production, we have developed and templated robust deployment pipelines. They include all the necessary steps to test, validate and ensure that the new models or the changes to an existing model are of the highest quality.

3.1. Release process

When a pull request against the main branch is opened, it triggers a CI step in our pipeline to check and validate that the dbt models still run as expected. Developers instantly see on Github whether the CI build passed or failed. At this step, the developer has two options: either run the CI build against the entire database or add deployment scripts for their specific feature. To reduce costs and time, we usually encourage developers to add scripts for all of their feature deployments. Our pipelines use these scripts for automation.
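
A feature deployment script might, for instance, build only the models touched by the feature plus their downstream dependencies; the model name and target below are placeholders, and the real scripts may differ:

```bash
# Illustrative feature deployment script
dbt run  --select ledger_entries_enriched+ --target ci
dbt test --select ledger_entries_enriched+ --target ci
```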

Figure 2: CI/CD release process

Depending on the outcome of the CI build, the feature will either need some changes or be ready for a peer review. Once the changes are successfully built and approved by a peer review, the feature branch gets merged to main and triggers the remaining steps in the pipeline.

An average deployment pipeline contains all the following stages but can be limited solely to Build and Deploy to Prod for simpler projects.

Figure 3: Deployment projects

After a pull request is merged into the main branch, the build step runs for the main branch and, if it is successful, a new version tag is created in Github. At this point, a new version is available for release but hasn’t been deployed yet. It can then be deployed to all environments successively (UAT -> SBOX -> PROD), or, as mentioned before, directly to Production for simpler projects.

3.2. Templating

Given our position as the central data team in the company, we support many different teams and projects and help them set up and successfully run dbt projects. To make onboarding as easy as possible, we have created templates for the different deployment pipelines, which no longer require any help from our team and have reduced onboarding time from a couple of days to a couple of hours.

To ensure that all teams, no matter their size, resources or skill sets, have the best possible onboarding experience, we also provide, in addition to the pipeline templates, the infrastructure to run deployments. This brings us to a problem we have observed a lot in the past: dbt has quite short release cycles, and some releases are not backwards compatible (e.g. dbt v1.0). Teams with fewer resources are not always able to upgrade to the latest dbt version as fast as others.

So as not to prevent some teams from using the newest dbt features, or force others to upgrade for no reason other than catching up, we support at least three different dbt versions by using virtual environments on our infrastructure. We support versions as old as v0.20, which was our dominant stable release for quite some time.
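
Concretely, this can be as simple as one Python virtual environment per supported dbt version, with each project’s pipeline activating the environment that matches its pinned version; the paths and pinned versions below are illustrative:

```bash
# Illustrative setup: one virtual environment per supported dbt version
python -m venv /opt/dbt/venv-0.20 && /opt/dbt/venv-0.20/bin/pip install "dbt==0.20.*"
python -m venv /opt/dbt/venv-1.0  && /opt/dbt/venv-1.0/bin/pip install "dbt-snowflake==1.0.*"
python -m venv /opt/dbt/venv-1.1  && /opt/dbt/venv-1.1/bin/pip install "dbt-snowflake==1.1.*"

# A project's pipeline then activates the environment matching its pinned dbt version
source /opt/dbt/venv-1.1/bin/activate
dbt --version
```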

As part of their onboarding, in addition to potentially leveraging our infrastructure, every new team or project gets a specific role in Snowflake (e.g. ProjectName_CI). This way, they have dedicated resources to run their models, and we can easily track the performance and costs associated with any project and help teams optimize their dbt models or Snowflake resources if we start observing performance issues, increased build times or spikes in cost.

4. Orchestration & monitoring

Now that we’ve got everything up and running, how do we orchestrate and keep an eye on an ever-growing number of workflows? How do we maintain confidence in the state of our production system? How do we write tests for unthought-of scenarios? What about problems that aren’t strictly related to the data in a table but might nonetheless represent a risk to the quality of the data in your pipelines? Well, get ready to settle into a roaring read on orchestration, observability, testing and monitoring. We’ll discuss this more in future articles: watch this space!

Appendix

5.1. Setup database macro
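
The original macro is not reproduced here; a minimal sketch of a macro with this purpose, assuming it reads the PROJECT and DBT_DATABASE_SUFFIX environment variables and omitting grants and other housekeeping, might look like:

```sql
-- macros/setup_database.sql -- illustrative sketch, not the original macro
{% macro setup_database() %}
    {% set database_name = env_var('PROJECT') ~ '_' ~ env_var('DBT_DATABASE_SUFFIX') %}
    {% do run_query('CREATE DATABASE IF NOT EXISTS ' ~ database_name) %}
    {% do run_query('CREATE SCHEMA IF NOT EXISTS ' ~ database_name ~ '.DBT') %}
    {% do log('Feature database ' ~ database_name ~ ' is ready', info=True) %}
{% endmacro %}
```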

5.2. Retention macro
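
An illustrative sketch of a retention helper, assuming the retention period is passed as a Snowflake interval string such as '1 day' or '10 years':

```sql
-- macros/apply_retention.sql -- illustrative sketch, not the original macro
{% macro apply_retention(relation, retention_column, retention_period) %}
    (
        SELECT *
        FROM {{ relation }}
        WHERE {{ retention_column }} >= CURRENT_TIMESTAMP - INTERVAL '{{ retention_period }}'
    )
{% endmacro %}
```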

5.3. Sampling macro
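
An illustrative sketch of a sampling helper using Snowflake’s native SAMPLE clause (the default percentage is an assumption):

```sql
-- macros/apply_sampling.sql -- illustrative sketch, not the original macro
{% macro apply_sampling(relation, sample_percent=1) %}
    (
        SELECT *
        FROM {{ relation }}
        SAMPLE BERNOULLI ({{ sample_percent }})
    )
{% endmacro %}
```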

5.4. Source macro
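
An illustrative sketch of how the built-in source macro could be overridden to apply retention for non-production targets, relying on dbt’s builtins and graph context. The attribute and variable names are assumptions, and sampling is omitted for brevity (it would be applied the same way via apply_sampling):

```sql
-- macros/source.sql -- illustrative sketch of overriding the built-in source() macro
{% macro source(source_name, table_name) %}

    {% set relation = builtins.source(source_name, table_name) %}

    {# Only filter data for non-production targets, e.g. dev and CI #}
    {% if not execute or target.name not in var('retention_targets', ['dev', 'ci']) %}
        {{ return(relation) }}
    {% endif %}

    {# Hypothetical lookup of the metadata defined in sources.yml;
       table-level meta takes precedence over source (schema) level meta #}
    {% set node = graph.sources.values()
                  | selectattr('source_name', 'equalto', source_name)
                  | selectattr('name', 'equalto', table_name)
                  | list | first %}
    {% set meta = node.meta if node.meta else node.source_meta %}

    {% if meta and meta.get('retention_column') and meta.get('retention_period') %}
        {{ apply_retention(relation, meta['retention_column'], meta['retention_period']) }}
    {% else %}
        {{ relation }}
    {% endif %}

{% endmacro %}
```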
