How dbt fits into ETL/ELT

Blosher Brar
May 17, 2024


Image is from the Zuar website

In 2006, mathematician Clive Humby coined the phrase “data is the new oil.” Indeed, entire tech empires have been built by harnessing the power of data. Companies like Netflix, Uber, and Amazon have used data to outcompete rivals like Blockbuster, taxi companies, and Sears that relied on the collective wisdom of executives instead of data.

Moneyball: The Art of Winning an Unfair Game is a popular book by Michael Lewis that tells the true story of how the Oakland Athletics baseball team used statistical analysis to gain a significant competitive advantage in Major League Baseball (MLB).

Although this article will not fix all your organization’s data problems, I hope you learn something that will help improve your organization's data infrastructure.

Developing a great data infrastructure requires having the right tools and technologies, as well as the right engineers to implement this technology. As the title states, this article will introduce you to a very popular open-source data tool known as dbt (Data Build Tool) and how it fits into the ETL/ELT processes.

dbt can greatly increase the efficiency of data teams by offering a data transformation tool that automates tedious tasks that are necessary but very time-consuming, such as documentation and testing, both of which dbt integrates directly into its solution. Hence, many data teams are looking to augment their data warehouse or data lake using dbt.

Let's begin by discussing recent trends in data-driven organizations. In a cloud-first technology world where data storage and compute are becoming increasingly cheap, many data teams have adapted their data pipeline approach from ETL (Extract -> Transform -> Load) to ELT (Extract -> Load -> Transform).

Data Extraction is the process of connecting to a source system, whether directly to a database or via an API, and “extracting” that data. This can happen in a variety of ways, including batch processing or streaming. Some data teams use a third-party SaaS tool like Fivetran or Zapier, while others write their own code in AWS Lambda or Azure Functions to interact with the source system.

As mentioned previously, data teams historically transformed this data BEFORE it was loaded into their data warehouse. However, most modern data architectures advocate for transforming it AFTER it has been loaded into the data warehouse.

Image by the author

Data Loading is the process of “loading” the data from source systems into a data warehouse or data lake, depending on whether the data being loaded is structured, semi-structured, or unstructured. Some popular options for data storage include an AWS S3 bucket, Azure Data Lake, or Snowflake if you prefer a fully managed solution. Due to the scalability, flexibility, and cost-effectiveness of cloud storage, many organizations prefer to load all of their raw data rather than a subset, as was done historically with the ETL process.
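
To make the load step concrete, here is a minimal sketch of what landing raw files might look like in Snowflake SQL, assuming a hypothetical S3 stage and raw schema (the stage, schema, table, and column names are made up for illustration):

```sql
-- Hypothetical example of the "L" step: land raw CSV files from an S3 stage
-- into a raw table without applying any transformation yet.
create table if not exists raw.orders (
    order_id    varchar,
    customer_id varchar,
    order_total varchar,  -- loaded as text; typing is applied at transform time
    ordered_at  varchar
);

copy into raw.orders
from @raw_s3_stage/orders/
file_format = (type = 'csv' skip_header = 1)
on_error = 'continue';  -- land everything; problem rows are handled downstream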

Raw data can take many forms, including text files, log files, images, audio recordings, emails, CSV files, CRM (Customer Relationship Management) data, in addition to more traditional structured data from source system databases.

The benefit is that organizations retain the option to analyze their data in its entirety in case more information can be extracted from it in the future. If only historical aggregated data is stored, as opposed to historical raw data, there is a chance future opportunities for analysis are missed. Preserving raw data allows organizations to conduct retrospective analysis and research while ensuring data integrity.

On the flip side, organizations must be careful not to have too little control over what goes into their data lake, turning it into a “data swamp.” It is vital that organizations follow data management best practices, including data governance, data quality, and metadata management. Teams that rely on data must be able to discover the appropriate data sets and have access to reasonably clean data.

Image is from the Collibra website

Data Transformation is the process of “cleaning” data to maintain data quality and usability. This includes checking for and removing duplicate records, standardizing formatting, applying business logic, and combining appropriate data sets. More often than not, this is NOT a simple process and involves writing transformation scripts in SQL and/or Python. Most organizations will also NEVER have a completely clean data set; it is a work in progress that must be iteratively improved.
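
As a hedged sketch of what these cleaning steps can look like in plain SQL, the query below de-duplicates, standardizes formatting, casts types, and applies a simple business rule against the hypothetical raw.orders table from the loading example (the column names and the rule itself are assumptions):

```sql
-- Typical cleaning steps: de-duplicate, standardize, cast, apply business logic.
with deduplicated as (
    select
        *,
        row_number() over (
            partition by order_id
            order by ordered_at desc
        ) as row_num
    from raw.orders
)

select
    trim(order_id)                       as order_id,
    upper(trim(customer_id))             as customer_id,   -- standardize formatting
    cast(order_total as decimal(12, 2))  as order_total,   -- apply proper typing
    cast(ordered_at as timestamp)        as ordered_at,
    case
        when cast(order_total as decimal(12, 2)) >= 100 then 'large'
        else 'standard'
    end                                  as order_size     -- example business rule
from deduplicated
where row_num = 1;  -- keep only the most recent record per order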

The approach organizations take to transformations varies widely depending on business requirements, the cleanliness of raw data, and team expertise. There are countless third-party SaaS tools that can help transform data, as well as cloud services such as AWS Glue and Azure Data Factory.

Image is from Fivetran

dbt (Data Build Tool) is one such open-source data transformation tool that has become increasingly popular with many organizations. It can greatly increase the efficiency of data teams by automating tedious tasks that are necessary but can be very time-consuming, such as documentation and testing of data transformations.

Hence, many data teams are looking to augment their data warehouses or data lakes using dbt. It works by combining modular SQL with software engineering best practices to make data transformation reliable, fast, and easy to understand. With dbt, anyone who knows how to write SQL SELECT statements has the power to produce reliable, actionable data sets for analytics. Although SQL is not as powerful as Python for transforming data, many data teams find they do not have the data engineering skill set to transform data using Python. dbt sits as an orchestration layer on top of a data warehouse to improve and accelerate data transformation.
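
To make this concrete, a dbt model is just a SELECT statement saved in a .sql file; dbt materializes it as a table or view in the warehouse and resolves dependencies through ref(). The sketch below uses hypothetical model and column names:

```sql
-- models/marts/fct_orders.sql (hypothetical model)
-- dbt resolves {{ ref(...) }} to the upstream model's table or view and uses
-- those references to build the dependency graph automatically.
{{ config(materialized='table') }}

select
    orders.order_id,
    orders.customer_id,
    customers.customer_region,
    orders.order_total
from {{ ref('stg_orders') }} as orders
left join {{ ref('stg_customers') }} as customers
    on orders.customer_id = customers.customer_id
```

Because this model references stg_orders and stg_customers via ref(), dbt knows to build those models first and can draw the lineage graph between them without any extra configuration.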

Moreover, dbt is a free, open-source project that is cloud agnostic, meaning it works with each of the major cloud vendors. dbt also allows transformation code to be managed with Git version control, ensuring more control and better data governance. It contains very powerful features such as generic and custom validation tests. It allows transformation code to be written following software engineering best practices, such as modularity through macros (functions) that enable code re-usability. Lastly, dbt automatically generates data lineage graphs as well as standardized documentation from model and column descriptions.
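
As a small illustration of that re-usability and testing, the sketch below shows a hypothetical macro and a singular (custom) test; in a real project these live in separate files under macros/ and tests/, as noted in the comments:

```sql
-- macros/cents_to_dollars.sql: a reusable snippet that any model can call
-- instead of repeating the same expression everywhere.
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- tests/assert_no_negative_order_totals.sql: a singular test. dbt marks the
-- test as failing if this query returns any rows.
select *
from {{ ref('fct_orders') }}
where order_total < 0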

Image is from dbt documentation

In my next blog post, I will provide a hands-on guide to transforming data in dbt, including best practices. Please follow me if you are interested in learning more about the modern data stack and associated tools such as dbt.

I hope you enjoyed reading my article, and feel free to reach out to me on LinkedIn to connect and learn more about everything data & technology.
