Use these open-source tools for Data Warehousing

Simon Späti
Nov 29, 2018 · 6 min read
Data Warehousing with Open-Source Druid, Apache Airflow & Superset

These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?

For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.

I went with Apache Druid for data storage, Apache Superset for querying, and Apache Airflow as a task orchestrator.

Druid — the data store

Why use Druid?

With the comparison of modern OLAP technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. Recently, Microsoft announced it will add Druid to its Azure HDInsight 4.0.

Why not Druid?

The Architecture of Druid

The great thing is that you can add as many nodes as you want, exactly where you need them. If you have many queries to serve, you add more Brokers; if a lot of data needs to be batch-ingested, you add more MiddleManagers, and so on.

A simple architecture is shown below. You can read more about Druid’s design here.
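To make the batch-ingestion part a bit more concrete, here is a minimal sketch of submitting a native "index" task to the Overlord, which then hands the work off to a MiddleManager. The host, port and spec file are placeholders I chose for illustration, not something from this setup:

```python
# Minimal sketch: submit a Druid native batch-ingestion ("index") task to the Overlord.
# Host, port and the spec file are placeholders.
import json
import requests

OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"  # default Overlord port

with open("wikipedia-index.json") as f:   # a native "index" task spec, prepared separately
    task_spec = json.load(f)

response = requests.post(OVERLORD_URL, json=task_spec)
response.raise_for_status()
print("Submitted task:", response.json()["task"])  # Druid answers with the task id
```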

Apache Superset — the UI

It is easy to use and has all the common chart types like Bubble Chart, Word Cloud, Heatmap, Box Plot and many more.

Druid provides a REST API, and in the newest version also a SQL query API. This makes it easy to use from any tool, whether it speaks standard SQL, is an existing BI tool or is a custom application.
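As a small, hedged example of that SQL API, the following sketch posts a query to a Broker; the host, port and the "wikipedia" datasource are just placeholders:

```python
# Minimal sketch of Druid's SQL endpoint: POST a query to a Broker, get rows back as JSON.
# Broker host/port and the "wikipedia" datasource are placeholders.
import requests

BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql/"

payload = {
    "query": """
        SELECT channel, COUNT(*) AS edits
        FROM wikipedia
        GROUP BY channel
        ORDER BY edits DESC
        LIMIT 10
    """
}

for row in requests.post(BROKER_SQL_URL, json=payload).json():
    print(row["channel"], row["edits"])
```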

Apache Airflow — the Orchestrator

In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.

In more modern architectures, these tools aren’t enough anymore.

Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company.

I highly recommend you read a blog post from Maxime Beauchemin about Functional Data Engineering — a modern paradigm for batch data processing. It goes much deeper into how modern data pipelines should be built.

Also consider reading The Downfall of the Data Engineer, where Max writes about breaking up "data silos" and much more.

Why use Airflow?

Instead of encapsulating your critical transformation logic somewhere inside a tool, you place it where it belongs: inside the orchestrator.

Another advantage is using plain Python. There is no need to wrap other dependencies or requirements separately, such as fetching from an FTP server, copying data from A to B or writing a batch file. You do that and everything else in the same place.
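To illustrate, here is a minimal sketch of such a plain-Python workflow, written against the Airflow 1.x API that was current at the time; the task bodies, ids and schedule are placeholders, not code from this project:

```python
# Sketch of a plain-Python DAG (Airflow 1.x style imports); task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def fetch_from_ftp(**context):
    # e.g. download the daily extract into a staging folder
    pass


def copy_to_dwh(**context):
    # e.g. move the staged file into the warehouse landing area
    pass


dag = DAG(
    dag_id="daily_dwh_load",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

fetch = PythonOperator(task_id="fetch_from_ftp", python_callable=fetch_from_ftp,
                       provide_context=True, dag=dag)
load = PythonOperator(task_id="copy_to_dwh", python_callable=copy_to_dwh,
                      provide_context=True, dag=dag)

fetch >> load  # the load only runs after the fetch succeeded
```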

Features of Airflow

Another relevant feature of Airflow is that you write workflows as if you were writing programs. External jobs like Databricks, Spark, etc. are no problem.

Job testing goes through Airflow itself. That includes passing parameters to downstream jobs, verifying what is running on Airflow and seeing the actual code. Log files and other metadata are accessible through the web GUI.

(Re)running only parts of the workflow and their dependent tasks is a crucial feature, and it comes out of the box when you create your workflows with Airflow. Jobs/tasks run in a context: the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.
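As a rough illustration of that context and parameter passing, here is a small sketch; the callables would be wired up as PythonOperators like in the DAG sketch above, and the names and values are placeholders:

```python
# Sketch: the scheduler injects the run context (e.g. the logical date "ds"),
# and XCom passes small values to downstream tasks. Names and values are placeholders.
def extract(**context):
    batch_date = context["ds"]          # logical run date, e.g. "2018-11-29"
    print("extracting data for", batch_date)
    row_count = 42                      # placeholder result of the extract step
    context["ti"].xcom_push(key="row_count", value=row_count)


def load(**context):
    row_count = context["ti"].xcom_pull(task_ids="extract", key="row_count")
    print("loading", row_count, "rows for", context["ds"])
```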

For many more features, visit the full list.

ETL with Apache Airflow

At the same time, there is a Docker container you can use, meaning you don't even have to set up any infrastructure. You can pull the container from here.

For the GitHub repo, follow the link to etl-with-airflow.

Conclusion

My experience so far is that Druid is bloody fast and a perfect replacement for traditional OLAP cubes, but it still needs an easier way to get started: installing clusters, ingesting data, viewing logs and so on. If you need that, have a look at Imply, which was created by the founders of Druid. It provides all the services around Druid that you need. Unfortunately, though, it's not open-source.

Apache Airflow and its features as an orchestrator are something that has not arrived yet in most traditional Business Intelligence environments. I believe this change comes very naturally once you start using open-source and newer technologies.

And Apache Superset is an easy and fast way to get up and running and show data from Druid. There are better tools like Tableau, etc., but not for free. That's why Superset fits well into the ecosystem if you're already using the above open-source technologies. As an enterprise, though, you might want to spend some money in this category, because the dashboards are what the users see at the end of the day.

Originally published at www.sspaeti.com on November 29, 2018.

