A gentle introduction to Data Workflows with Apache Airflow and Apache Spark

Antonio Cachuan · Published in Analytics Vidhya · Mar 2, 2020 · 12 min read


Imagine you’ve developed a transformation process in local Spark and you want to schedule it, so a simple cron job would be sufficient. Now imagine that after that process you need to trigger many other steps, such as a Python transformation or an HTTP request, and that this is your production environment, so you also need to monitor each step.
Does that sound difficult? With only Spark and a cron job, yes, but fortunately we have Apache Airflow.

Airflow is a platform to programmatically author, schedule and monitor workflows [Airflow docs].

Objective

In our case, we need to build a workflow that runs a Spark application and lets us monitor it, with all components production-ready. First, let’s review some core concepts and features.
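As a preview of where we are headed, the sketch below shows roughly what such a workflow looks like in Airflow, assuming the Apache Spark provider is installed and a Spark connection named spark_default is configured; the DAG name, schedule and application path are placeholders, and on Airflow 1.10 the operator lives under airflow.contrib.operators.spark_submit_operator instead.

```python
# A minimal sketch, assuming the Apache Spark provider and a "spark_default" connection.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_transformation",       # hypothetical DAG name
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/transformation.py",  # hypothetical path to the Spark application
        conn_id="spark_default",                    # Spark connection configured in the Airflow UI
    )
```

Once this file is placed in the DAGs folder, the task shows up in the Airflow UI, where each run and its logs can be monitored.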

Features and Core Concepts

Features

  • Creating a workflow in Airflow is as simple as writing Python code: no XML, no command line. If you know some Python, you can do some Airflow (see the DAG sketch in the Core concepts section below).
  • Airflow is not just for Spark: it has plenty of integrations, such as BigQuery, S3, Hadoop, Amazon SageMaker and more.
Airflow integrations [Airflow documentation]

Core concepts

  1. DAG — a Directed Acyclic Graph: the collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
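A minimal sketch of a DAG, using only Airflow’s core operators; the dag_id, schedule and task names are placeholders, and on Airflow 1.10 the imports are airflow.operators.bash_operator and airflow.operators.python_operator instead.

```python
# A minimal DAG sketch: two tasks and one dependency, all in plain Python.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder for a Python transformation step.
    print("transforming data")


with DAG(
    dag_id="example_dag",                 # placeholder name
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract >> transform_task             # run "extract" before "transform"
```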
