A Modern Data Warehousing Tool: dbt & Introduction to Analytics Engineering

Creating our own experimental lab with Docker to simulate ELT, what Analytics Engineering is, an introduction to dbt, and the difference between ETL and ELT.

Introduction

I have been getting acquainted with new tools for the last 3 weeks. After starting a new job with a new scope, I needed to learn new technologies. I already knew some of the tools, but during this time I became more familiar with them. For example, I knew Docker before starting the job, but I hadn’t used it much until then. Since then, I have gotten used to Docker and started using it for everything 😋

In this story, we will create our own experimental lab based on Docker to simulate an ELT process. I will use the same data that is used in the official dbt training program, with one difference: I will use PostgreSQL instead of Redshift, Snowflake, etc., because I don’t want to spend a long time explaining what they are and how to create accounts for them. Our lab will be fairly simple: we will read data from the source, which is AWS S3, then load it into the PostgreSQL data warehouse we create in our lab using Docker. That way, we can use dbt against our local data warehouse. I have also started preparing a repo to gather my dbt notes; you can access it via the link below. I am sure I will keep enhancing the repo, but I don’t know yet whether I will write a story for each dbt module I learn.

Okay, first let’s talk about the core terms. I will share my own viewpoint of what I understood from my reading, so please don’t judge me if you disagree, and keep in mind that I am not claiming to know best or that my opinion is beyond discussion 😇

Image from dbt

What is ELT and what is the difference between ELT and ETL?

Firstly, let’s start by explaining what ETL is. ETL is a procedure to Extract, Transform and Load data. It starts with extracting data from the source, continues with transforming the data in an intermediate staging area, and finishes with loading the transformed data into the target warehouse. ELT (Extract, Load, Transform) is slightly different: we first load the raw data from the source systems into our data warehouse, then transform it there for end usage. My understanding is that ETL is the traditional procedure, while ELT is more modern and more widely applied nowadays.

Compared to ETL, ELT lets us reduce our staging steps by transforming the data after it has been loaded into the data warehouse.

What is Analytics Engineering?

I think I can describe Analytics Engineering as a bridge between data engineers and data analysts. Basically, analytics engineers model datasets in the data warehouse to provide clean datasets to end users like data analysts. In small teams, the analytics engineer’s responsibilities are shared between data analysts and data engineers; nowadays, modern data teams hire dedicated analytics engineers.

Image from dbt

If you want to dig deeper into Analytics Engineering, you should check the link below.

A Modern Data Warehousing Tool: dbt

dbt gives us the chance to apply software engineering best practices to our data modelling processes. For example, we can put our data transformations under version control.

We can transform our data in the data warehouse using nothing but SELECT statements.

dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

- Official dbt Documentation

Let’s start trying dbt!

Creating the Experimental Docker Lab

Actually, I will not explain this section step by step; you can access the lab source via the link below. I prefer to explain the main logic of the lab and its main components.

We have 4 core files for this lab: loader_script.py, requirements.txt, Dockerfile and docker-compose.yml.

The loader_script.py file does the EL (extract and load) part of our lab. Basically, it reads the data from CSV files, then loads it into the data warehouse that we define in the docker-compose.yml file.
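To make the flow concrete, here is a minimal sketch of what such a loader can look like. It assumes pandas and SQLAlchemy (plus a Postgres driver) in requirements.txt; the host name, credentials and CSV paths are placeholders that have to match the rest of the lab, and the actual script in the repo may differ.

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection string; host/user/password must match docker-compose.yml.
engine = create_engine("postgresql://postgres:postgres@warehouse_db:5432/WarehouseDB")

# Create the target schemas so the loads below have somewhere to go.
with engine.begin() as conn:
    for schema in ("transformed", "jaffle_shop", "stripe"):
        conn.execute(text(f"CREATE SCHEMA IF NOT EXISTS {schema}"))

# (schema, table) -> source CSV; in the lab these would point at the dbt training data on S3.
SOURCES = {
    ("jaffle_shop", "customers"): "data/jaffle_shop_customers.csv",
    ("jaffle_shop", "orders"): "data/jaffle_shop_orders.csv",
    ("stripe", "payments"): "data/stripe_payments.csv",
}

for (schema, table), path in SOURCES.items():
    df = pd.read_csv(path)  # read_csv also accepts https:// or s3:// URLs
    # if_exists="replace" keeps the lab re-runnable from scratch.
    df.to_sql(table, engine, schema=schema, if_exists="replace", index=False)
    print(f"Loaded {len(df)} rows into {schema}.{table}")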

In the docker-compose.yml file, we prepare the lab. Docker first creates a PostgreSQL database named WarehouseDB, then starts the loader service to load the data into WarehouseDB.
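A sketch of what such a docker-compose.yml can look like; the image tag, service names and credentials are illustrative, and the real file in the repo may differ:

version: "3.8"
services:
  warehouse_db:
    image: postgres:13
    environment:
      POSTGRES_DB: WarehouseDB
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"   # exposed so dbt on the host can reach the warehouse
  loader:
    build: .          # built from the Dockerfile described below
    depends_on:
      - warehouse_db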

The loader service is built from the Dockerfile. First, it installs the requirements, then starts the load process into WarehouseDB: it creates 3 schemas (transformed, jaffle_shop and stripe), then loads the data into jaffle_shop and stripe.
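A minimal, illustrative Dockerfile for such a loader service (the base image and file names are assumptions):

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY loader_script.py .
CMD ["python", "loader_script.py"]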

We can simulate our ELT procedure by using the lab we created with Docker. To do that, I point my terminal at the lab folder.

Image by Author

I can execute the command below to run the lab.

docker-compose up -d
Image by Author

Now, if I connect to the WarehouseDB database, I can see the schemas that our loader service created.

Image by Author

Now, let’s meet dbt.

First Steps in dbt

There are several ways to install dbt. I have chosen to install it with pip. Then I can create an empty dbt project by using the dbt CLI.
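Depending on the dbt version, the pip installation may look like the line below; newer releases ship the Postgres adapter as a separate dbt-postgres package, while older ones used a single dbt package.

pip install dbt-postgres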

dbt init
Image by Author

We see the default files created by the dbt init command in the picture below.

Image by Author

Core dbt components

The dbt_project.yml file holds the core settings of our dbt project.

Image by Author
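A minimal sketch of such a file; the project and profile names (dbt_lab here) are placeholders, and key names vary slightly between dbt versions:

name: dbt_lab
version: "1.0.0"
config-version: 2
profile: dbt_lab          # must match a profile in ~/.dbt/profiles.yml
model-paths: ["models"]   # called source-paths in older dbt versions
models:
  dbt_lab:
    +materialized: view   # default materialization for all models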

The models folder holds our models. In this folder, we write our SQL models as SELECT statements; dbt then looks under this folder and transforms the data inside the data warehouse. In this tutorial, we will only use models.

We have another very important file that holds our database credentials for dbt: ~/.dbt/profiles.yml. In this file, we set our database credentials, and we can easily switch between production and development environments.

Image by Author

Now, I am going to change this file for our WarehouseDB.

Image by Author
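For reference, a profile for our lab can look like the sketch below; the profile name is a placeholder, and the credentials are the illustrative ones from docker-compose.yml above.

dbt_lab:
  target: dev             # the default environment to use
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: postgres
      password: postgres
      dbname: WarehouseDB
      schema: transformed # dbt loads the transformed models here
      threads: 1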

dbt automatically loads the transformed data into the schema we set in this file.

Creating our first dbt model

I am going to create a basic SQL file under the models folder: first_model.sql. We will create a model that runs the pure SQL query below.

select
    o."STATUS" as "ORDERSTATUS",
    p."PAYMENTMETHOD",
    p."STATUS" as "PAYMENTSTATUS",
    count(o."ID") as "ORDERCOUNT"
from jaffle_shop.orders o
left join stripe.payments p on o."ID" = p."ORDERID"
group by o."STATUS", p."PAYMENTMETHOD", p."STATUS"

The query returns the result below.

Image by Author

In dbt models, we build on common table expressions (CTEs) that we create using the WITH clause. We see the query for our first_model.sql model below.

with orders as (
    select * from jaffle_shop."orders"
),
payments as (
    select * from stripe."payments"
),
final as (
    select
        o."STATUS" as "ORDERSTATUS",
        p."PAYMENTMETHOD",
        p."STATUS" as "PAYMENTSTATUS",
        count(o."ID") as "ORDERCOUNT"
    from orders o
    left join payments p on o."ID" = p."ORDERID"
    group by o."STATUS", p."PAYMENTMETHOD", p."STATUS"
)
select * from final

Then we run dbt run to execute all transformation models.

Image by Author

dbt materializes all models as views by default. We can see our transformation under the first_model view in our WarehouseDB database.

Image by Author

Changing dbt models’ configurations

If we want to load our model into a table, we need to use a config block at the top of the model file.

{{
config(
materialized = "table"
)
}}
...

Now I look at the database after executing dbt run again.

Image by Author

You can check here to learn about other configurations for dbt models.
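For instance, instead of repeating the config block in every file, the materialization can also be set project-wide in dbt_project.yml. A sketch, assuming the project is named dbt_lab as above (older dbt versions use materialized: without the plus sign):

models:
  dbt_lab:
    +materialized: table   # applies to every model in the project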

Referencing dbt models

We can use dbt models inside other dbt models by using {{ ref('model_name') }}. I am going to improve my model.

Image by Author

In orders.sql, I extract the columns I need for my final model from the orders table.

select
    o."ID",
    o."STATUS" as "ORDERSTATUS"
from jaffle_shop."orders" o

In payments.sql, I extract the needed columns for my final model from the payments table.

select
    p."PAYMENTMETHOD",
    p."STATUS" as "PAYMENTSTATUS",
    p."ORDERID"
from stripe."payments" p

Then, I create final_model.sql. In this file, I use the models I created above.

with orders as (
    select * from {{ ref('orders') }}
),
payments as (
    select * from {{ ref('payments') }}
),
final as (
    select
        o."ORDERSTATUS",
        p."PAYMENTMETHOD",
        p."PAYMENTSTATUS",
        count(o."ID") as "ORDERCOUNT"
    from orders o
    left join payments p on o."ID" = p."ORDERID"
    group by o."ORDERSTATUS", p."PAYMENTMETHOD", p."PAYMENTSTATUS"
)
select * from final

We see the data transformed by the models we created and loaded into the warehouse.

Image by Author

Finally

Hopefully, you enjoyed it. I enjoyed writing it a lot; learning new things is so enjoyable for me, especially if I can use them to make money 🙃 I am very new to dbt and I will keep learning about it. I will also extend the repo I created to follow the official dbt training programs. You can access the repo by using the link below.

Regards
