Enabling engineering best practices in data workflows

Eetu Huhtala
If Technology
May 12, 2022

We use dbt with a few additions to enhance the data and analytics engineering experience here at If.

Image by Archer Mechanical from Pixabay

Like many other companies, we have preferred ELT over ETL in new data pipeline development over the last six years. ETL (Extract-Transform-Load) is a three-step data integration process used to combine and synthesize data from multiple sources into a data warehouse, and it has been in use for many decades. The modern approach is ELT (Extract-Load-Transform), and the clearest difference from ETL is the order of operations. The new approach also brings some key benefits, such as the ability to use the compute resources of the target platform for data processing, and greater flexibility, especially when moving large amounts of data.

As a part of this change, we first started to express our data transformations with code, mainly by using SQL and stored procedures to execute our transformations. Before this transition, we relied on GUI-based data integration tools for the transformation phase. While the previous approach worked well in some cases, with more complex scenarios the pipelines were often challenging to maintain because the GUI provided limited visibility and transparency into the underlying logic.

Code, not graphical user interfaces, is the best abstraction to express complex analytic logic. — What, exactly, is dbt?

In our case, stored procedures were a step in the right direction. Still, they came with some weaknesses, especially when comparing the development process with the best practices and methods used in software development. Without custom-built solutions or integrations, there is no out-of-the-box version control support, testing is complex, deployments are manual, and debugging is clunky, to name a few examples. In addition, making minor changes to an existing process can often be cumbersome.

About two years ago, we started to investigate automating our deployments but soon understood that this would require custom-built in-house solutions if we continued with stored procedures. At the same time, there was a lot of buzz around a tool called dbt (data build tool). Exciting features bundled in a single product, a fast-growing ecosystem, and a vibrant community were enough to spark our initial interest, so we started to read the documentation and learn dbt. Finally, about a year ago, once we were convinced it would be worth investigating in more detail, we began to test it more extensively with our data and for our use cases.

What is dbt?

In short, dbt is a modern transformation tool that enables engineering best practices in data workflows. In the overall data integration flow, dbt does only the T in the ELT process, and it has quickly become a new standard for SQL-based data transformations. In addition, dbt integrates with many different data warehouses, so you get almost the same data engineering experience regardless of which data warehouse platform you’ve chosen.

The following list contains some of the key benefits we’ve identified in using dbt:

  • Automated dependency management within a data pipeline
  • Modular code: single query per file
  • Express all transformations as plain SQL SELECT statements
  • Use macros to write reusable SQL
  • Data quality testing for input and output data
  • Automatically generate documentation, including data flow diagrams
  • Automate deployments, both for test and production, enabling continuous integration (dbt plays nicely together with CI/CD tools)
  • Allow multiple separate test environments on a single test server by using database parameters

Some of the listed features are available through other offerings and products. Still, it is very nice to have all the above available within one open-source product that connects to multiple different data warehouse platforms.
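
To make the list above more concrete, here is a minimal sketch of what a dbt model can look like. The model and column names (stg_claims, dim_customer, fct_claims and so on) are hypothetical and only illustrate how ref() declares dependencies between models and how a transformation is expressed as a single SELECT statement.

```sql
-- models/marts/fct_claims.sql (hypothetical model, names for illustration only)
-- dbt resolves {{ ref(...) }} to the correct schema and table, and uses these
-- references to build the dependency graph of the pipeline automatically.

{{ config(materialized='table') }}

with claims as (
    select * from {{ ref('stg_claims') }}
),

customers as (
    select * from {{ ref('dim_customer') }}
)

select
    claims.claim_id,
    customers.customer_id,
    claims.claim_amount,
    claims.reported_at
from claims
join customers
    on claims.customer_id = customers.customer_id
```

A companion YAML file can then attach data quality tests such as unique and not_null to the model’s columns, which is how the testing point in the list is typically covered.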

Our setup

The tool is available in two flavors, dbt Core and dbt Cloud. The Core version is free and open source and comes with a command-line interface, while dbt Cloud is a managed cloud version with a browser-based UI. At If, we are using the open-source version of dbt. To make the deployment and maintenance of our development environment easier, we’ve decided to containerize the setup and run it inside a Docker container. Our engineers use Visual Studio Code (VS Code) and the Remote-Containers extension to connect to the development container. The main benefit of this approach is the ability to rebuild the environment from scratch just by creating a new container from a container image. All the packages needed for development are stored inside the container, not on the local machine. If we installed the components locally, there would be a risk of spending our days fixing dependencies on someone’s laptop. This approach also allows us to update versions in a controlled manner and distribute the changes to everyone who needs them.
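
To give an idea of what this looks like in practice, the sketch below shows a simplified development image. It is not our actual image: the Python base image, the dbt adapter (dbt-snowflake is only an example), and the installed tools are assumptions you would adjust to your own data warehouse platform.

```dockerfile
# Hypothetical development image, adjust to your own platform.
# In practice you would also pin exact versions so the environment
# can be rebuilt reproducibly and updated in a controlled manner.
FROM python:3.10-slim

# dbt Core, an adapter for the target warehouse, and the linting tooling.
RUN pip install --no-cache-dir dbt-core dbt-snowflake sqlfluff pre-commit

# VS Code Remote-Containers mounts the repository into the container,
# so only the tooling lives in the image, not the project code.
WORKDIR /workspace
```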

For code linting, we are using a tool called sqlfluff, which flags stylistic errors and auto-formats SQL code. It is hard to find a person who wants to be nitpicky about code formatting in code reviews. Still, if many engineers work on the same code and neither a person nor a tool takes care of the formatting, readability and maintainability will go down, so at the very least a detailed SQL style guide is required. In our setup with sqlfluff, all code is forced to follow the same standard formatting rules, so we can spend code review time on something more productive than someone’s formatting. We run sqlfluff in git pre-commit hooks to make the linting process convenient.
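
As a rough sketch, the hook configuration can look like the snippet below, using the hooks that sqlfluff ships for pre-commit. The version tag is only an example, and depending on your setup you may also need to configure the dbt templater for sqlfluff.

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 0.13.2            # example tag only; pin to the version you actually use
    hooks:
      - id: sqlfluff-lint  # flag style violations before the commit lands
      - id: sqlfluff-fix   # optionally auto-fix formatting of staged SQL files
```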

The setup also allows us to customize both the terminal and the overall VS Code experience. For example, we use starship for terminal customization, and the Material Icon Theme, vscode-dbt, and GitLens extensions in VS Code. These are enabled for all engineers by default, but they are not mandatory for dbt development; they just make the user experience a bit nicer.

For continuous integration and continuous deployment (CI/CD), we use Azure Pipelines, which lets us define our deployment pipelines in YAML files. Currently, we run a deployment and various sets of tests when a code review process (pull request) is initiated in Azure Repos. Once the code has been reviewed and approved, we deploy the changes automatically into production.
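
The pipeline definitions themselves are not shown here, but a stripped-down sketch of the idea could look like the following. The step contents, the DBT_TARGET variable, and the adapter package are assumptions; credentials would come from pipeline variables or a key vault, and pull request validation against Azure Repos is attached through a branch policy rather than in the YAML itself.

```yaml
# azure-pipelines.yml (simplified sketch)
trigger:
  branches:
    include:
      - main                 # merges to main deploy to production

pool:
  vmImage: ubuntu-latest

steps:
  - script: pip install dbt-core dbt-snowflake sqlfluff
    displayName: Install tooling

  - script: sqlfluff lint models/
    displayName: Lint SQL

  - script: |
      dbt deps
      dbt build --target $(DBT_TARGET)   # run and test the models; target set per environment
    displayName: Build and test dbt project
```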

In addition, we are leveraging the ability to write reusable SQL. We have created a set of macros for the parts of our pipelines that are common to nearly all of them, for example, to write our dbt run results to a database log table. Overall, reusable code is a great addition to SQL, but it needs to be used carefully so that the code remains readable and maintainable, which should be a top priority when creating new pipelines.
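
As an illustration of the idea, here is a small hypothetical macro, not one of our actual ones; the date function uses Snowflake-style syntax and would differ on another warehouse platform.

```sql
-- macros/recent_rows.sql (hypothetical example of reusable SQL)
-- Returns a WHERE-clause fragment that limits a model to recent rows.
{% macro recent_rows(timestamp_column, days=30) %}
    {{ timestamp_column }} >= dateadd(day, -{{ days }}, current_timestamp())
{% endmacro %}
```

A model can then call the macro in its WHERE clause, and dbt expands it to plain SQL at compile time, so every pipeline applies the same logic in the same way.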

Conclusions

During the last 12 months, we have tested dbt and its capabilities and built the required environments and processes to support data pipeline development and production runs. dbt is already in use for production use cases, and the number of pipelines is growing fast.

We are continuously helping more people build knowledge about dbt and the new approach. Thanks to good publicly available documentation and free courses, the learning curve is not steep. As our employees adopt the new setup, the benefits become clear: we spend our time solving actual problems and building complex logic rather than on manual activities and code repetition.

If you are interested in learning more about data at If, have a look at the articles below:
