Published in CodeX

Introduction to Modern Data Analytics Tools

Docker, Airbyte, DBT, Apache Superset with Brazilian Ecommerce Data & Applying RFM in DBT

In this story, we will talk about modern data architectures and the new generation of open-source tools used in them. I have created a repo that you can follow along with this text, and I will share some examples from it throughout. The dataset we will use is the Brazilian Ecommerce dataset from Kaggle.

Image by Author

Introduction to Modern Data Architecture

In the modern data stack, ETL has given way to ELT. So, what is the difference between the two? In ELT, the data is first extracted from the sources and loaded into the target warehouse; the transformation processes are then applied inside the warehouse itself. For the extract and load (EL) steps we can use Airbyte, and for the transform (T) step we can use dbt. Finally, we can serve the transformed data to business intelligence tools like Superset, Metabase, Redash, etc. dbt also brought a new role to the market: the Analytics Engineer.

If you want to get more detail about the role and dbt, you can check the text I published on Medium.

Introduction to Tools

Airbyte

Airbyte is an open-source data integration tool that provides over 140 data source connectors. We can easily manage and create the data integration connections by using Airbyte.

It is easy to set up Airbyte; we just need to clone it from GitHub and then run it with Docker Compose.

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up -d

For our scenario, we will download the data from Kaggle and load it into our local Postgres database with Airbyte. I first loaded the CSV files into a local database to make the loading process easier.
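As a rough illustration, one of the Kaggle CSV files can be bulk-loaded into Postgres with psql. The table and file names below come from the Kaggle dataset, but the DDL itself is a simplified sketch, not the exact schema used in the repo:

```sql
-- Simplified sketch: create one Olist table and bulk-load its CSV file.
-- Column names follow the Kaggle dataset; types are kept as text for brevity.
CREATE TABLE olist_customers (
    customer_id              text,
    customer_unique_id       text,
    customer_zip_code_prefix text,
    customer_city            text,
    customer_state           text
);

-- psql meta-command; the file path is illustrative
\copy olist_customers FROM 'olist_customers_dataset.csv' WITH (FORMAT csv, HEADER true)
```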

Image by Author | LocalDatabase

Then, I moved the data from the local database to my local data warehouse, named BrazilianEcommerce. I finished the extract & load steps by creating a connection between the LocalDatabase and BrazilianEcommerce databases.

Image by Author | Connection between the databases

I then ran the extract & load process, and the data sources were successfully loaded into the local warehouse.

Image by Author

dbt

dbt gives us the chance to apply software engineering techniques to our data modelling processes; for example, we can put our data transformations under version control. I have published a text about dbt on my Medium, which you can access using the link below. In this story, I will focus on how we use dbt rather than on what dbt is.

In dbt, we use just SELECT statements to create models. We can also create our own macros and use the Jinja templating language in our SQL code! dbt has its own package hub, hub.getdbt.com, where we can download macros and packages written by others.
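As a small illustration of macros, here is a hypothetical one (not part of the repo; the name and thresholds are made up) that buckets a numeric column into categories:

```sql
-- Hypothetical example macro; the name and logic are illustrative only.
-- Usage in a model: select {{ payment_bucket('"payment_value"') }} as bucket from ...
{% macro payment_bucket(column) %}
    case
        when {{ column }} < 50  then 'low'
        when {{ column }} < 200 then 'medium'
        else 'high'
    end
{% endmacro %}
```

When the model is compiled, dbt expands the macro into plain SQL before running it against the warehouse.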

For our scenario, I created a dataset with RFM (Recency, Frequency, Monetary) segmentation applied to it.

with fct as (
    select
        sr."customer_unique_id",
        sr."segment",
        sg."geolocation_lat",
        sg."geolocation_lng",
        sg."geolocation_city",
        sg."geolocation_state",
        so."order_id",
        date(so."order_purchase_timestamp") order_date,
        sp."payment_value",
        sp."payment_installments",
        sp."payment_type"
    from {{ ref('stg_rfm') }} sr
    left join {{ ref('stg_customers') }} sc on sr."customer_unique_id" = sc."customer_unique_id"
    left join {{ ref('stg_geolocation') }} sg on sc."customer_zip_code_prefix" = sg."geolocation_zip_code_prefix"
    left join {{ ref('stg_customers') }} sc2 on sr."customer_unique_id" = sc2."customer_unique_id"
    left join {{ ref('stg_orders') }} so on sc2."customer_id" = so."customer_id"
    left join {{ ref('stg_payments') }} sp on so."order_id" = sp."order_id"
)

select * from fct
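The stg_rfm model referenced above lives in the repo and is not reproduced here. As a hedged sketch (the segment rules below are simplified compared with a classic RFM grid), such a model could be built like this:

```sql
-- Hypothetical sketch of an stg_rfm-style model: score each customer on
-- Recency, Frequency and Monetary value with ntile(5), then map the scores
-- to named segments. Rules are illustrative, not the repo's exact logic.
with customer_orders as (
    select
        sc."customer_unique_id",
        max(date(so."order_purchase_timestamp")) as last_order_date,
        count(distinct so."order_id")            as frequency,
        sum(sp."payment_value")                  as monetary
    from {{ ref('stg_orders') }} so
    join {{ ref('stg_customers') }} sc on so."customer_id" = sc."customer_id"
    join {{ ref('stg_payments') }} sp on so."order_id" = sp."order_id"
    group by 1
),

scores as (
    select
        "customer_unique_id",
        ntile(5) over (order by last_order_date) as recency_score,
        ntile(5) over (order by frequency)       as frequency_score,
        ntile(5) over (order by monetary)        as monetary_score
    from customer_orders
)

select
    "customer_unique_id",
    case
        when recency_score >= 4 and frequency_score >= 4 then 'Champions'
        when recency_score >= 3 and frequency_score >= 3 then 'Loyal_Customers'
        when recency_score >= 4 and frequency_score <= 2 then 'New_Customers'
        when recency_score <= 2 and frequency_score >= 4 then 'Cant_Loose'
        when recency_score <= 2 and frequency_score <= 2 then 'Hibernating'
        else 'Need_Attention'
    end as "segment"
from scores
```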

dbt provides powerful built-in code documentation. It also visualizes the SQL lineage as a DAG. To see the documentation, we only need to execute 2 commands in our dbt project.

# To create documentation
dbt docs generate

# To see documentation
dbt docs serve --port 8010
dbt Documentation | Image by Author
DAG data flow | Image by Author

Also, dbt makes it possible to create custom documentation for our models and sources.

{% docs segment_info %}

One of the following values:

| segment             | definition |
| ------------------- | ---------- |
| Loyal_Customers     |            |
| Need_Attention      |            |
| Hibernating         |            |
| Potential_Loyalists |            |
| Champions           |            |
| At_Risk             |            |
| Cant_Loose          |            |
| Promising           |            |
| New_Customers       |            |
| About_to_Sleep      |            |

{% enddocs %}

We created a documentation markdown file named stg_ecommerce.md. As we see below, we can attach custom documentation to models by using the Jinja template in our configuration files.

models:
  - name: stg_rfm
    description: '{{ doc("segment_info") }}'

In dbt, there are 2 types of tests: generic and singular. Generic tests check columns against 4 built-in tests: unique, not_null, accepted_values and relationships. In addition to generic tests, we can write our own singular tests like the one below.
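Generic tests are declared in the model's YAML configuration. A hypothetical example for our segmentation model (column and value lists are illustrative, not the repo's exact config):

```yaml
# Hypothetical schema.yml fragment applying dbt's built-in generic tests
models:
  - name: stg_rfm
    columns:
      - name: customer_unique_id
        tests:
          - unique
          - not_null
      - name: segment
        tests:
          - accepted_values:
              values: ['Champions', 'Loyal_Customers', 'Hibernating']
```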

with fct_orders as (
    select * from {{ ref('fct_orders') }}
)

select
    "order_id",
    sum("payment_value")
from fct_orders
group by "order_id"
having sum("payment_value") < 0

Then we execute our tests by using dbt test command.

Honestly, I fell in love with dbt: everything is well documented on their official website, and they also provide well-organized training courses there.

Superset

Superset is an open-source business intelligence tool supported by the Apache Software Foundation. I have also published a text about Superset on my Medium; you can access it using the link below.

Superset uses SQLAlchemy to create database connections, so you can connect Superset to anything that SQLAlchemy can connect to.
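For example, a Postgres connection to our local warehouse would follow the standard SQLAlchemy URI format (the username and password here are placeholders):

```
postgresql://postgres:password@localhost:5432/BrazilianEcommerce
```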

Creating a connection on Superset | Image by Author

Superset has a powerful SQL editor built-in.

Superset’s SQL Editor | Image by Author

We can create datasets from the tables or views in the databases that Superset is connected to.

Superset datasets | Image by Author

Then we can create charts on top of these datasets. Superset offers more than 30 chart types.

Creating a chart in Superset | Image by Author

We create dashboards by gathering the charts we created in Superset.

Superset dashboard | Image by Author

Finally

I know this text is a bit short, but I did not want to spend much time explaining introductory-level coding. None of the points mentioned above needs deep knowledge to apply; they are just basic examples. Hopefully, you learned something new. I have been working with these technologies for almost 2 months and really enjoy using them.

I have also created a repo for this text, which you can access using the link below. It contains examples of all the tools mentioned in this story.

Regards.

Baysan

Lifelong learner & Freelancer. I use technology that helps me. I’m currently working as a Business Intelligence & Backend Developer. mebaysan.com