Data Warehouse Automation Using dbt on (Azure) SQL Server
5 things you should be aware of before adopting dbt on (Azure) SQL Server for automating data warehousing/modeling
Introduction
data build tool (dbt) is one of the rising stars in the world of data and analytics and has become part of virtually every modern data stack. In a nutshell, dbt takes code, compiles it into SQL, and runs it against a data warehouse. This simple concept lets you easily create and manage data pipelines that automatically test, build, and (incrementally) populate a target analytical model, with documentation and data lineage out of the box, which makes it ideal for automating data warehousing workloads. The company behind dbt, dbt Labs (previously Fishtown Analytics), also provides an IDE named dbt Cloud in addition to its open-source Python library dbt-core. The cloud service additionally helps developers write version-controlled code in the browser and orchestrate transformation workloads, among other features. Their excellent documentation also contains a brief and concise overview of the features and of what you can actually achieve with dbt.

This article covers the must-knows when adopting dbt on (Azure) SQL Server, including aspects of traditional data warehouse development that you may need to rethink during the transition. This two-part series guides you through the setup of dbt on Windows to target (Azure) SQL Server and Azure Synapse Analytics.
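Before diving into the specifics, here is a minimal sketch of what a dbt model looks like; the model, source, and column names are purely illustrative. Running `dbt run` compiles the Jinja templating into plain SQL and wraps the resulting SELECT statement in whatever DDL/DML is needed to materialize it as a table or view in the target database.

```sql
-- models/marts/dim_customer.sql (hypothetical model; all names are illustrative)
{{ config(materialized='table') }}

select
    customer_id,
    customer_name,
    country
from {{ ref('stg_customers') }}  -- ref() resolves to the relation of another model
```

Because `ref()` declares the dependency between models, dbt can derive the execution order and the lineage graph automatically.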
1. No dbt-core (currently)
To start with, all features and functionality necessary for running dbt on (Azure) SQL Server or Azure Synapse Analytics are, at the time of this writing, entirely community-driven. As such, they are currently neither part of the official dbt-core library nor supported by the dbt Cloud service. This also means that you have to set up your own environment for running dbt. The good news is that there is a remedy for both points. Firstly, this article shows you how to deploy and run dbt on Azure Container Instances. Secondly, there already exists a repository dedicated to dbt projects for the Microsoft (Azure) data ecosystem; it is likewise not directly affiliated with Microsoft but entirely community-driven. Its goal is to make all features and functionality provided by dbt-core, including some extensions, work on SQL Server (2016 and later), Azure SQL, and Azure Synapse Analytics. It also includes examples for deploying dbt via Azure DevOps pipelines.
2. No data extraction
Furthermore, dbt is sometimes mistakenly referred to as an ETL tool, which it is NOT. It is not made to transfer or load data between systems, but to transform data that has already been loaded into a data warehouse, using a mix of SQL and YAML syntax. This “transform after load” concept, also known as ELT, has meanwhile become commonplace: firstly, because of the scalable processing power and near-endless storage of cloud-based data warehouses, and secondly, because it does not require you to develop and maintain complex data integration processes to derive new information. It is therefore recommended to use dbt together with an off-the-shelf data integration tool such as Azure Data Factory and to apply the transformations on top of the raw data. Modern ELT tools are also a great time-saver for pulling data out of SaaS applications, as they come with lots of built-in connectors.
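As an illustrative sketch of this division of labour, the model below assumes that an ingestion tool such as Azure Data Factory has already landed a raw table, which is declared as a dbt source in a `.yml` file; dbt then only takes care of the transformation. All names here are hypothetical.

```sql
-- models/staging/stg_sales.sql (hypothetical; assumes a source named 'raw'
-- with a table 'sales_orders' loaded beforehand by a tool such as Azure Data Factory)
select
    cast(order_id as int)           as order_id,
    cast(order_date as date)        as order_date,
    cast(amount as decimal(18, 2))  as amount
from {{ source('raw', 'sales_orders') }}
```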
3. No data orchestration
The same goes for the orchestration of data workloads. While dbt Cloud enables analysts to orchestrate the entire analytics engineering workflow in the browser, dbt-core only provides a command line interface (CLI) to execute the dbt commands needed to run the workloads on a regular basis. For this reason, it is common to see organizations using orchestration tools such as Airflow, Luigi, or Dagster in addition to dbt; dbt can also be integrated with Azure Data Factory. With Airflow, for instance, you can recreate the entire dbt DAG (Directed Acyclic Graph) from the manifest.json file generated by dbt itself, which contains the dependencies between models. This enables you to “have fine-grained control of the success, failure, and retry of each dbt model as a corresponding Airflow task”¹. You may wonder at this point why to use dbt at all instead of triggering SQL or Python scripts directly from the orchestration tool: it is the simplicity, reusability, and integrity that make dbt so attractive.
4. No auto-increment key
In traditional data warehousing, there is furthermore a tendency to use incrementing numbers as surrogate keys. This has been common practice to speed up joins between Dimensions and Facts and to decouple keys from the business logic. dbt, however, does NOT support auto-increment numbers because they are not idempotent; instead, hashed business keys are promoted. These can be computed independently in Facts and Dimensions, without having to inject the keys from the Dimension into the Fact at a later stage, which allows you to run most Facts and Dimensions safely in parallel. Using auto-incrementing numbers as surrogate keys, on the other hand, requires you to run Dimensions first and then populate Facts with the newly generated keys. This not only lengthens the time required to refresh the analytical model, but also makes it very difficult to maintain separate development and production versions of your transformations, since rerunning all your code could lead to different auto-incrementing keys.
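Below is a minimal sketch of this pattern on SQL Server, assuming hypothetical staging models and column names: both the Dimension and the Fact derive the same hashed key directly from the business key, so neither depends on the other being loaded first. The dbt_utils package also offers a surrogate key macro for this purpose, although on SQL Server it may require the community compatibility shims mentioned above.

```sql
-- models/marts/dim_product.sql (hypothetical names)
select
    convert(char(32), hashbytes('md5', cast(product_code as nvarchar(100))), 2) as product_key,
    product_code,
    product_name
from {{ ref('stg_products') }}
```

```sql
-- models/marts/fct_sales.sql: the same hash is derived straight from the business
-- key, so the Fact does not have to wait for the Dimension to be populated
select
    convert(char(32), hashbytes('md5', cast(product_code as nvarchar(100))), 2) as product_key,
    order_id,
    amount
from {{ ref('stg_sales') }}
```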
5. No graphical user interface
dbt is strong on the engineering side; as dbt Labs puts it, “code is the best abstraction to express complex analytic logic”². This is also what makes dbt so flexible: it lets you share variable definitions across files, build modular code including extensions and unit tests, commit all of it to Git, and hence easily integrate it into CI/CD pipelines. As an engineer, all of this gives you much more confidence that you can actually run something in production and build healthy release cycles around it. Fancy drag-and-drop products with a shiny UI, on the other hand, might seem impressive at first sight, but they can make life difficult when it comes to maintaining them in production, including proper CI/CD pipelines. Compared to writing SQL within SQL Server Management Studio or Azure Data Studio, however, dbt falls short when it comes to fetching and propagating schemas, as well as interactive syntax checking and SQL previews. In practice, this often means switching back and forth between one of those tools and the dbt CLI combined with a code editor such as VS Code.
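As a small illustration of this code-first, modular style, the sketch below shows a hypothetical macro that standardizes business keys before they are hashed; any model can reuse it instead of repeating the expression.

```sql
-- macros/clean_key.sql (hypothetical macro and cleansing rule)
{% macro clean_key(column_name) %}
    upper(ltrim(rtrim(cast({{ column_name }} as nvarchar(100)))))
{% endmacro %}
```

In a model it is then used as `{{ clean_key('customer_code') }}`, and a change to the cleansing rule only has to be made in one place.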
Conclusion
dbt is doubtlessly an extremely powerful and flexible tool that allows you to quickly build tested data pipelines which populate analytical data models ready for reporting. This is especially interesting when developing standardized models that should be easily deployable to multiple environments. dbt’s DRY (don’t repeat yourself) approach of putting business logic and reusable parts into macros encourages clean and understandable code. However, the flexibility of dbt also requires you to establish a shared set of practices that everybody follows, otherwise you could end up debugging and fixing failing model deployments. If you are thinking about adopting dbt, it is therefore important to invest enough time in evaluating whether it is the right tool and whether everything is technically feasible.
Resources
[1]: Astronomer. 2021. Building a Scalable Analytics Architecture with Airflow and dbt: Part 1. [ONLINE] Available at: https://www.astronomer.io/blog/airflow-dbt-1. [Accessed 5 August 2021].
[2]: blog.getdbt.com. 2021. What, exactly, is dbt?. [ONLINE] Available at: https://blog.getdbt.com/what-exactly-is-dbt/. [Accessed 5 August 2021].