Introduction to Modern Data Analytics Tools
Docker, Airbyte, DBT, Apache Superset with Brazilian Ecommerce Data & Applying RFM in DBT
In this story, we will talk about modern data architectures and the new generation of open-source tools used in them. I have created a repo that you can use to follow along with this text, and I will share some examples from it as we go. The dataset we will use is the Brazilian Ecommerce dataset from Kaggle.
Introduction to Modern Data Architecture
In the modern data stack, ETL has given way to ELT. So, what is the difference between ELT and ETL? In ELT, the data is first extracted from the sources and loaded into the target warehouse; the transformations are then applied inside the warehouse. For the extract and load (EL) steps, we can use Airbyte. For the transform (T) step, we can use dbt to reshape the data directly in the warehouse. From there, we can serve the transformed data to business intelligence tools like Superset, Metabase, Redash, etc. dbt also brought a new role to the market: the analytics engineer.
If you want more details about the role and about dbt, you can check out the text I published on Medium.
Introduction to Tools
Airbyte
Airbyte is an open-source data integration tool that provides over 140 data source connectors. With Airbyte, we can easily create and manage data integration connections.
It is easy to set up Airbyte; we just need to clone it from GitHub and then run it with Docker Compose.
docker-compose up -d
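This command is run from the root of the cloned repository, so the full sequence looks like this (assuming Airbyte's public GitHub repo):

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up -d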
For our scenario, we will download the data from Kaggle and load it into our local Postgres database with Airbyte. To make the loading process easier, I first loaded the data into a database on my local machine.
Then, I loaded the data from the local database into my local data warehouse, which is named BrazilianEcommerce. I finished the extract & load steps by creating a connection between the LocalDatabase database and the BrazilianEcommerce database.
Then I ran the extract & load process, and the data sources were successfully loaded into the local warehouse.
dbt
dbt gives us the chance to apply software engineering practices to our data modelling processes; for example, we can put our data transformations under version control. I have published a text about dbt on my Medium, which you can access via the link below. In this story, I will focus on how we use dbt rather than on what dbt is.
In dbt, we use just SELECT statements to create models. We can also create our own macros and use Jinja templating in our SQL code! dbt has its own package hub, hub.getdbt.com, where we can download macros and packages shared by others.
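For instance, a minimal macro looks like the sketch below (a hypothetical example, not one from the repo); it can then be called in any model as {{ cents_to_dollars('payment_value') }}:

{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}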
For our scenario, I created a dataset with RFM segmentation applied to it. RFM segments customers by the Recency, Frequency, and Monetary value of their purchases. The fact model below joins the RFM segments with customer, geolocation, order, and payment data.
with fct as (

    select
        sr."customer_unique_id",
        sr."segment",
        sg."geolocation_lat",
        sg."geolocation_lng",
        sg."geolocation_city",
        sg."geolocation_state",
        so."order_id",
        date(so."order_purchase_timestamp") as order_date,
        sp."payment_value",
        sp."payment_installments",
        sp."payment_type"
    from {{ ref('stg_rfm') }} sr
    left join {{ ref('stg_customers') }} sc
        on sr."customer_unique_id" = sc."customer_unique_id"
    left join {{ ref('stg_geolocation') }} sg
        on sc."customer_zip_code_prefix" = sg."geolocation_zip_code_prefix"
    left join {{ ref('stg_customers') }} sc2
        on sr."customer_unique_id" = sc2."customer_unique_id"
    left join {{ ref('stg_orders') }} so
        on sc2."customer_id" = so."customer_id"
    left join {{ ref('stg_payments') }} sp
        on so."order_id" = sp."order_id"

)

select * from fct
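The stg_rfm model referenced above is where the segments come from. As a minimal sketch of how the underlying RFM scores could be computed in a dbt model (my assumption of the approach, not the repo's exact code; it reuses the staging models from the fact model above):

with orders as (

    -- one row per order payment, tied back to the unique customer id
    select
        sc."customer_unique_id",
        so."order_id",
        date(so."order_purchase_timestamp") as order_date,
        sp."payment_value"
    from {{ ref('stg_orders') }} so
    join {{ ref('stg_customers') }} sc
        on so."customer_id" = sc."customer_id"
    join {{ ref('stg_payments') }} sp
        on so."order_id" = sp."order_id"

),

rfm as (

    select
        "customer_unique_id",
        current_date - max(order_date) as recency,   -- days since the last order
        count(distinct "order_id")     as frequency, -- number of orders
        sum("payment_value")           as monetary   -- total amount spent
    from orders
    group by 1

),

scores as (

    -- split customers into quintiles; recency is ordered descending
    -- so that more recent buyers get the higher score
    select
        *,
        ntile(5) over (order by recency desc)  as r_score,
        ntile(5) over (order by frequency asc) as f_score,
        ntile(5) over (order by monetary asc)  as m_score
    from rfm

)

select * from scores

Segments such as Champions or Hibernating can then be assigned with a CASE expression over the r_score and f_score values.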
dbt provides powerful built-in code documentation and visualizes the SQL flow as a DAG. To see the documentation, we just need to execute two commands in our dbt project.
# To create documentation
dbt docs generate

# To see documentation
dbt docs serve --port 8010
Also, dbt makes it possible to create custom documentation for our models and sources.
{% docs segment_info %}

One of the following values:

| segment             | definition |
| ------------------- | ---------- |
| Loyal_Customers     |            |
| Need_Attention      |            |
| Hibernating         |            |
| Potential_Loyalists |            |
| Champions           |            |
| At_Risk             |            |
| Cant_Loose          |            |
| Promising           |            |
| New_Customers       |            |
| About_to_Sleep      |            |

{% enddocs %}
We created a documentation markdown file named stg_ecommerce.md. As we see below, we can set custom documentation for models by referencing these docs blocks with Jinja in our configuration files.
models:
- name: stg_rfm
description: '{{ doc("segment_info") }}'
In dbt, there are two types of tests: generic and singular. Generic tests check columns using four built-in tests (unique, not_null, accepted_values, and relationships) and are declared in YAML, as in the sketch below. In addition to generic tests, we can create our own singular tests as plain SQL, like the test shown after it.
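A hedged example of generic tests for our stg_rfm model (the exact tests are my assumption, not the repo's configuration):

models:
  - name: stg_rfm
    columns:
      - name: customer_unique_id
        tests:
          - unique
          - not_null
      - name: segment
        tests:
          - not_null
          - accepted_values:
              values:
                - Loyal_Customers
                - Need_Attention
                - Hibernating
                - Potential_Loyalists
                - Champions
                - At_Risk
                - Cant_Loose
                - Promising
                - New_Customers
                - About_to_Sleep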
-- Fails if any order has a negative total payment value
with fct_orders as (
    select * from {{ ref('fct_orders') }}
)

select
    "order_id",
    sum("payment_value") as total_payment
from fct_orders
group by "order_id"
having sum("payment_value") < 0
Then we execute our tests by using the dbt test command.
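For example, to run only the tests for a single model (the --select flag is available in recent dbt versions; older versions use -m/--models):

dbt test --select stg_rfm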
Honestly, I fell in love with dbt; everything is well documented on their official website, and they also provide well-organized training courses there.
Superset
Superset is an open-source business intelligence tool supported by the Apache Software Foundation. I have also published a text about Superset on my Medium; you can access it via the link below.
Superset uses SQLAlchemy to create database connections, so you can connect Superset to any database that SQLAlchemy supports.
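For example, a connection string for our local Postgres warehouse is a standard SQLAlchemy URI (the user, password, and host below are placeholders):

postgresql://username:password@localhost:5432/BrazilianEcommerce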
Superset has a powerful built-in SQL editor, SQL Lab.
We can create datasets from the tables or views in the databases Superset is connected to.
Then we can create charts on top of these datasets. Superset offers more than 30 chart types.
We can then build dashboards by gathering the charts we have created in Superset.
Finally
I know the text is a bit short, but I didn't want to spend much time explaining introductory-level coding. None of the points mentioned above require deep knowledge to apply; they are just basic examples. Hopefully, you learned something new. I have been working with these technologies for almost two months, and I really enjoy using them.
I have also created a repo for this text; you can access it via the link below. The repo contains all the examples of the tools mentioned in this story.
Regards.