Introduction to Modern Data Analytics Tools
Docker, Airbyte, DBT, Apache Superset with Brazilian Ecommerce Data & Applying RFM in DBT
In this story, we will talk about modern data architectures and the new generation of open-source tools used in them. I have created a repo that you can use to follow along with this text, and I will share some examples from it as we go. The dataset we will use is the Brazilian Ecommerce dataset from Kaggle.
Introduction to Modern Data Architecture
In the modern data stack, ETL has given way to ELT. So, what is the difference between ELT and ETL? In ELT, the data is first extracted from the sources and loaded into the target warehouse; the transformations are then applied inside the warehouse. For the extract and load (EL) steps, we can use Airbyte. For the transform (T) step, we can use dbt to reshape the data directly in the warehouse. From there, we can serve the transformed data to business intelligence tools like Superset, Metabase, Redash, etc. dbt also brought a new role to the market: the analytics engineer.
If you want more details about the role and about dbt, you can check out the text I published on Medium.
Introduction to Tools
Airbyte
Airbyte is an open-source data integration tool that provides over 140 data source connectors. With Airbyte, we can easily create and manage data integration connections.
It is easy to set up Airbyte; we just need to clone it from GitHub and then run it with Docker Compose.
docker-compose up -d
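This command is run from the root of the cloned repository, so the full sequence looks like this (assuming Airbyte's public GitHub repo):

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up -d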
For our scenario, we will download the data from Kaggle and load it into our local Postgres database with Airbyte. To make the loading process easier, I first loaded the data into a database on my local machine.
Then, I loaded the data from the local database into my local data warehouse, which is named BrazilianEcommerce. I finished the extract & load steps by creating a connection between the LocalDatabase database and the BrazilianEcommerce database.
Then I ran the extract & load process, and the data sources were successfully loaded into the local warehouse.
dbt
dbt gives us the chance to apply software engineering practices to our data modelling processes; for example, we can put our data transformations under version control. I have published a text about dbt on my Medium, which you can access via the link below. In this story, I will focus on how we use dbt rather than on what dbt is.
In dbt, we use just SELECT statements to create models. We can also create our own macros and use Jinja templating in our SQL code! dbt has its own package hub, hub.getdbt.com, where we can download macros and packages shared by others.
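For instance, a minimal macro looks like the sketch below (a hypothetical example, not one from the repo); it can then be called in any model as {{ cents_to_dollars('payment_value') }}:

{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}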
For our scenario, I created a dataset with RFM segmentation applied to it. RFM segments customers by the Recency, Frequency, and Monetary value of their purchases. The fact model below joins the RFM segments with customer, geolocation, order, and payment data.
with fct as (

    select
        sr."customer_unique_id",
        sr."segment",
        sg."geolocation_lat",
        sg."geolocation_lng",
        sg."geolocation_city",
        sg."geolocation_state",
        so."order_id",
        date(so."order_purchase_timestamp") as order_date,
        sp."payment_value",
        sp."payment_installments",
        sp."payment_type"
    from {{ ref('stg_rfm') }} sr
    left join {{ ref('stg_customers') }} sc
        on sr."customer_unique_id" = sc."customer_unique_id"
    left join {{ ref('stg_geolocation') }} sg
        on sc."customer_zip_code_prefix" = sg."geolocation_zip_code_prefix"
    left join {{ ref('stg_customers') }} sc2
        on sr."customer_unique_id" = sc2."customer_unique_id"
    left join {{ ref('stg_orders') }} so
        on sc2."customer_id" = so."customer_id"
    left join {{ ref('stg_payments') }} sp
        on so."order_id" = sp."order_id"

)

select * from fct
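The stg_rfm model referenced above is where the segments come from. As a minimal sketch of how the underlying RFM scores could be computed in a dbt model (my assumption of the approach, not the repo's exact code; it reuses the staging models from the fact model above):

with orders as (

    -- one row per order payment, tied back to the unique customer id
    select
        sc."customer_unique_id",
        so."order_id",
        date(so."order_purchase_timestamp") as order_date,
        sp."payment_value"
    from {{ ref('stg_orders') }} so
    join {{ ref('stg_customers') }} sc
        on so."customer_id" = sc."customer_id"
    join {{ ref('stg_payments') }} sp
        on so."order_id" = sp."order_id"

),

rfm as (

    select
        "customer_unique_id",
        current_date - max(order_date) as recency,   -- days since the last order
        count(distinct "order_id")     as frequency, -- number of orders
        sum("payment_value")           as monetary   -- total amount spent
    from orders
    group by 1

),

scores as (

    -- split customers into quintiles; recency is ordered descending
    -- so that more recent buyers get the higher score
    select
        *,
        ntile(5) over (order by recency desc)  as r_score,
        ntile(5) over (order by frequency asc) as f_score,
        ntile(5) over (order by monetary asc)  as m_score
    from rfm

)

select * from scores

Segments such as Champions or Hibernating can then be assigned with a CASE expression over the r_score and f_score values.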
dbt provides powerful built-in code documentation and visualizes the SQL flow as a DAG. To see the documentation, we just need to execute two commands in our dbt project.
# To create documentation
dbt docs generate

# To see documentation
dbt docs serve --port 8010
Also, dbt makes it possible to create custom documentation for our models and sources.
{% docs segment_info %}

One of the following values:

| segment             | definition |
| ------------------- | ---------- |
| Loyal_Customers     |            |
| Need_Attention      |            |
| Hibernating         |            |
| Potential_Loyalists |            |
| Champions           |            |
| At_Risk             |            |
| Cant_Loose          |            |
| Promising           |            |
| New_Customers       |            |
| About_to_Sleep      |            |

{% enddocs %}
We created a documentation markdown file named stg_ecommerce.md. As we see below, we can set custom documentation for models by referencing these docs blocks with Jinja in our configuration files.
models:
- name: stg_rfm
description: '{{ doc("segment_info") }}'
In dbt, there are two types of tests: generic and singular. Generic tests check columns using four built-in tests (unique, not_null, accepted_values, and relationships) and are declared in YAML, as in the sketch below. In addition to generic tests, we can create our own singular tests as plain SQL, like the test shown after it.
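A hedged example of generic tests for our stg_rfm model (the exact tests are my assumption, not the repo's configuration):

models:
  - name: stg_rfm
    columns:
      - name: customer_unique_id
        tests:
          - unique
          - not_null
      - name: segment
        tests:
          - not_null
          - accepted_values:
              values:
                - Loyal_Customers
                - Need_Attention
                - Hibernating
                - Potential_Loyalists
                - Champions
                - At_Risk
                - Cant_Loose
                - Promising
                - New_Customers
                - About_to_Sleep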
-- Fails if any order has a negative total payment value
with fct_orders as (
    select * from {{ ref('fct_orders') }}
)

select
    "order_id",
    sum("payment_value") as total_payment
from fct_orders
group by "order_id"
having sum("payment_value") < 0
Then we execute our tests by using the dbt test command.
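For example, to run only the tests for a single model (the --select flag is available in recent dbt versions; older versions use -m/--models):

dbt test --select stg_rfm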
Honestly, I fell in love with dbt; everything is well documented on their official website, and they also provide well-organized training courses there.
Superset
Superset is an open-source business intelligence tool supported by the Apache Software Foundation. I have also published a text about Superset on my Medium; you can access it via the link below.
Superset uses SQLAlchemy to create database connections, so you can connect Superset to any database that SQLAlchemy supports.
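For example, a connection string for our local Postgres warehouse is a standard SQLAlchemy URI (the user, password, and host below are placeholders):

postgresql://username:password@localhost:5432/BrazilianEcommerce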
Superset has a powerful built-in SQL editor, SQL Lab.
We can create datasets from the tables or views in the databases Superset is connected to.
Then we can create charts on top of these datasets. Superset offers more than 30 chart types.
We can then build dashboards by gathering the charts we have created in Superset.
Finally
I know the text is a bit short, but I didn't want to spend much time explaining introductory-level coding. None of the points mentioned above require deep knowledge to apply; they are just basic examples. Hopefully, you learned something new. I have been working with these technologies for almost two months, and I really enjoy using them.
I have also created a repo for this text; you can access it via the link below. The repo contains all the examples of the tools mentioned in this story.
Regards.