Master DevOps Data Architecture with Apache Airflow, Kubernetes and Talend
Data Architects know that DevOps in a data-driven domain is a discipline of its own. You need to manage tons of dependencies, stateful components and legacy, interleaved batch runs. Fortunately, technology evolves rapidly and you can choose from a broad range of open source tools to ease the job a bit.
After reading this blog post you will have at least a rough idea of how to bring the Microservice world together with the good old Data Warehouse world. You can then try it out yourself using the linked code example.
Why Apache Airflow?
When I first heard about Apache Airflow I was really impressed by the maturity of the platform, and I had always planned to try it out myself. Airflow was originally introduced by Airbnb as a tool 'to author workflows as directed acyclic graphs (DAGs) of tasks'. When working with complex Data Pipelines you always need a tool that orchestrates your loosely coupled data job components and gives your operations team a set of control panels and dashboards. In the legacy world you would choose a heavy-weight solution like Automic or some Cron-based framework. Airflow would be the Engineer's choice #1 as it is Open Source, Python-based and equipped with a really nice frontend.
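To make the 'workflows as code' idea concrete: a DAG is nothing more than a small Python file that Airflow picks up from its dags folder. A minimal sketch (the task names, commands and schedule below are made up purely for illustration) could look like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A DAG is plain Python: tasks plus the dependencies between them.
dag = DAG(
    dag_id='hello_dag',
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
)

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# The '>>' operator declares an edge of the directed acyclic graph:
# 'load' only runs after 'extract' has finished.
extract >> load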
The latest release contains new Kubernetes Operators which perfectly fit the modern idea of composing data pipelines out of isolated, containerized Microservice components. Please read Daniel Imberman's Airflow on Kubernetes blog post to get an idea of the concept.
Kubernetes Installation
We will run the whole example on Kubernetes. That gives us all the advantages of the Docker world combined with scalability, performance and a broad range of management tools and options. As I am forced to use my Windows 10 developer notebook (helpful for offline demos), I had to install VirtualBox and a 64-bit Ubuntu 18.04 image. First the VM must be equipped with a Docker 18.06 installation. Next, Minikube 0.28.2 needs to be installed.
Airflow Installation
Now you can clone and install the pre-configured Airflow Incubator repository in order to deploy the tool into your Minikube cluster. After running sudo kubectl get pods you should see something like this:
Well done! You will be able to connect to the Airflow UI on port 30809 and run some of the bundled examples if you like (be aware that some will fail due to wrong configuration).
Talend Open Studio Installation (optional)
TOS will be our ETL tool. It is Open Source, Java-based and very simple to use. Apart from that, you can easily build each ETL job as a standalone Java zip, which makes TOS a perfect choice for container deployment. In case you want to view or change the example ETL jobs, feel free to install TOS and the example code by following the install guide. Our Kubernetes/Airflow demo will run without a Talend installation: all jobs have been pre-compiled and checked in within our little example project.
Understanding the Example Architecture
The example simulates typical data requirements. We receive input data from different source systems and load it in raw format into our Operational Data Store (ODS). Then we aggregate and enrich it, and finally extract the result as CSV for further processing in a consuming target system. A Postgres database serves as the persistent datastore.
In our case we load a file with 5,000 generic customers and simply look up the State for each of them. Even this simple example needs some kind of orchestration: the ODS load jobs can run in parallel, but the aggregation ETL job can only start after both load jobs have completed, and the extract job in turn needs to wait for the aggregation job to finish. We will prepare an Airflow DAG to manage these dependencies.
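A quick word on the aggregation step itself: in Talend it is built graphically, but conceptually it is nothing more than a join of the raw customer records against a state lookup table. Purely as an illustration (the file and column names below are assumptions, not the actual Talend job definition), the same logic in Python/pandas would be:

import pandas as pd

# Raw ODS extracts as written by the two load jobs (names are assumptions).
customers = pd.read_csv('ods_customers.csv')  # e.g. id, name, state_code
states = pd.read_csv('ods_states.csv')        # e.g. state_code, state_name

# Enrich every customer with the full state name and hand the result
# over to the extract job as a CSV for the consuming target system.
enriched = customers.merge(states, on='state_code', how='left')
enriched.to_csv('customers_enriched.csv', index=False)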
Apart from that we want to make sure that we can package, test and deploy each ETL job in isolation. Docker and Kubernetes will do that for us; we simply need to use the KubernetesPodOperator within our DAG config, as sketched below.
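A rough sketch of such a DAG config, assuming Airflow 1.10 with the contrib Kubernetes operator (the image names, namespace and task ids below are assumptions; the real definitions live in the example repository):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id='etl_pipeline',
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,  # triggered manually for the demo
)

def etl_task(task_id, image):
    # Every ETL job runs in its own pod, built from its own Docker image.
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id,
        namespace='default',
        image=image,
        in_cluster=True,
        get_logs=True,
        dag=dag,
    )

load_customers = etl_task('load_customers', 'etl/load-customers:latest')
load_states = etl_task('load_states', 'etl/load-states:latest')
aggregate = etl_task('aggregate', 'etl/aggregate:latest')
extract = etl_task('extract', 'etl/extract:latest')

# Both load jobs may run in parallel; the aggregation waits for both,
# and the final extract waits for the aggregation.
[load_customers, load_states] >> aggregate >> extract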
Install and run the Example
In order to execute the example you need to follow the steps of the installation guide. With make build you build four Docker containers, one for each ETL job. Use make deploy_dag to install the DAG within the Airflow container. After a few seconds Airflow will automatically detect the new or changed DAG and show it in the GUI:
Now you can watch with kubectl how Airflow creates your pods:
You might also want to open the tree view in Airflow to inspect status and logs:
If everything went fine you should see a lot of green signals in Airflow and find some output data in the mounted output directory:
Conclusion
Compared to legacy data technology we have really made two steps ahead, and it has become relatively easy to set things up. Nevertheless, while things like Kubernetes, DevOps and Microservices have become commodity for standard packaged solutions, we still need to go some extra miles for Data Architectures. That relates to the data work we have to do: our Data Pipelines contain stateful components which still multiply the complexity. I will spend some extra time on this topic in one of my follow-up posts.
Finally
Thanks for reading, and feel free to visit my Xing profile.