A gentle introduction to Data Workflows with Apache Airflow and Apache Spark

Antonio Cachuan
Mar 2 · 12 min read

Imagine you’ve developed a transformation process in local Spark and you want to schedule it: a simple cron job would be sufficient. Now imagine that after that process you need to start many others, like a Python transformation or an HTTP request, and that this is your production environment, so you need to monitor each step.
Does that sound difficult? With only Spark and a cron job, yes, but thankfully we have Apache Airflow.

Airflow Logo

Airflow is a platform to programmatically author, schedule and monitor workflows [Airflow docs].

Objective

In our case, we need to build a workflow that runs a Spark application and lets us monitor it, and all components should be production-ready. First, let’s review some core concepts and features.

Features and Core Concepts

Features

  • Creating a workflow in Airflow is as simple as writing Python code: no XML, no command line. If you know some Python, you can do some Airflow!
  • Airflow is not just for Spark: it has plenty of integrations, like BigQuery, S3, Hadoop, Amazon SageMaker and more.
Airflow Integrations [Airflow documentation]

Core concepts

  1. DAG

A Directed Acyclic Graph is the group of all the tasks programmed to run, organized in a way that reflects their relationships and dependencies [Airflow ideas].

Airflow DAG represented graphically

2. Operator

The description of a single task; it is usually atomic. For example, the PythonOperator is used to execute Python code [Airflow ideas] (a minimal sketch follows this list of concepts).

3. Task

A parameterized instance of an Operator; a node in the DAG [Airflow ideas].

4. Task Instance

A task instance represents a specific run of a task and is characterized as the combination of a DAG, a task, and a point in time (execution_date). Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc. [Airflow ideas]
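To make concept 2 concrete, here is a minimal sketch of a PythonOperator task (the task id and the greet function are hypothetical; in a real DAG file this would sit inside a with models.DAG(...) block like the ones we write below):

from airflow.operators.python_operator import PythonOperator

def greet():
    # The Python function this hypothetical task will execute
    print('Hello from a PythonOperator!')

greet_task = PythonOperator(
    task_id='greet',        # unique identifier of the task inside the DAG
    python_callable=greet   # function executed when the task runs
)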

Building Environment

We have different options to deploy Spark and Airflow, and there are many interesting articles on the web about them.

Following our objective, we need a simple way to install and configure Spark and Airflow; to help us, we’ll use Cloud Composer and Dataproc, both products of Google Cloud.

Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow [Cloud Composer docs]. The main advantage is that we don’t have to worry about deployment and configuration: everything is backed by Google, which also makes it simple to scale Airflow. Cloud Composer integrates with GCP, AWS and Azure components, as well as technologies like Hive, Druid, Cassandra, Pig, Spark, Hadoop, etc.

Cloud composer logo

Dataproc is a fully managed cloud service for running Apache Spark, Apache Hive and Apache Hadoop [Dataproc page]. Some of its features are easy deployment and scaling, integration with Cloud Composer (Airflow) and, the one we’ll use here, creating a Dataproc cluster automatically just for the processing and then destroying it, so you pay by the minute and avoid unused infrastructure.

Dataproc logo and main components

Deploying a Cloud Composer Cluster

To start using Google Cloud services, you just need a Gmail account; register to access the $300 in credits of the GCP Free Tier. After registration, select Cloud Composer from the Console.

Console Google Cloud

If it’s your first time, you need to enable the Cloud Composer API.

Enabling API

Click create environment

Create Environment

Here you can customize your Cloud Composer environment; to understand more about Composer’s internal architecture (Google Kubernetes Engine, Cloud Storage and Cloud SQL) check this site. That said, it is not necessary for the objective of this article. The parameters below are for a small cluster.

Parameters for creating the cluster

The Service Account is a parameter from your own project, so it will be different for you; the rest is the same. Creating the cluster can take from 5 to 15 minutes.

Cloud Composer is ready!

Click the cluster name to check important information

To validate the correct deployment click the Airflow web UI

Airflow Web UI

Yes! We have deployed a Cloud Composer cluster in less than 15 minutes, which means we have a production-ready Airflow environment.
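If you prefer the command line, you can also double-check the environment from Cloud Shell with gcloud (the environment name is a placeholder; use the name and location you chose when creating it):

# List the Composer environments in a location and inspect the one just created
gcloud composer environments list --locations us-central1
gcloud composer environments describe <your-environment-name> --location us-central1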

To deploy a Dataproc cluster (Spark) we’re going to use Airflow itself, so there is no more infrastructure to configure. Let’s code!

Code

This part goes from a simple Airflow workflow to the more complex workflow needed for our objective.

Simple DAG

Everyone starts learning to program with a Hello World! so let's do the same but in Airflow.

Writing an Airflow workflow almost always follows these 6 steps:

  1. Import the libraries needed: Airflow, datetime and others
  2. Define a start date for the workflow
  3. Set the default arguments for our DAG
  4. Define the DAG
  5. Set the Operators
  6. Define the DAG’s dependencies
# STEP 1: Libraries needed
from datetime import timedelta, datetime
from airflow import models
from airflow.operators.bash_operator import BashOperator

# STEP 2: Define a start date
# In this case yesterday
yesterday = datetime(2020, 2, 28)

# STEP 3: Set default arguments for the DAG
default_dag_args = {
    'start_date': yesterday,
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

# STEP 4: Define DAG
# Set the DAG name, add a DAG description, define the schedule interval
# and pass the default arguments defined before
with models.DAG(
        'simple_workflow',
        description='Hello World =)',
        schedule_interval=timedelta(days=1),
        default_args=default_dag_args) as dag:

    # STEP 5: Set Operators
    # BashOperator
    # Every operator has at least a task_id; the other parameters are particular
    # to each one. This simple BashOperator executes echo "Hello World!"
    helloOp = BashOperator(
        task_id='hello_world',
        bash_command='echo "Hello World!"'
    )

    # STEP 6: Set DAG dependencies
    # Since we have only one task, we just write the operator
    helloOp

Now save the code in a file named simple_airflow.py and upload it to the DAGs folder in the bucket that was created. That folder is exclusively for your DAGs.

Click the link to access the folder
The file is uploaded
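As an alternative to the console upload, here is a sketch of the same step from Cloud Shell (environment name, location and bucket are placeholders to replace with your own values):

# Option 1: copy the DAG file straight into the environment's bucket
gsutil cp simple_airflow.py gs://<your-composer-bucket>/dags/

# Option 2: let gcloud resolve the DAGs bucket for you
gcloud composer environments storage dags import \
    --environment <your-environment-name> \
    --location us-central1 \
    --source simple_airflow.py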

After the file is uploaded, return to the Airflow UI tab and refresh (check the indentation in your code; it can take up to 5 minutes for the page to show the new DAG).

DAG, Task execution Log

Yes! Our DAG ran correctly. To access the log, click the DAG name, then click the task id hello_world and View Log.

Click hello_world task
Click View Log

On that page, you can check all the steps executed and, of course, our Hello World! So this simple DAG is done: we defined a DAG that runs a BashOperator executing echo "Hello World!" and we validated its correct execution =)

[Optional] Error handling and Debugging

If any error occurs, a red alert with brief information will show under the Airflow logo; to view a more detailed message, go to the Stackdriver monitor.

Click view logs to access Stackdriver

You can expand all entries and review the error, then upload a new version of your .py file to Google Cloud Storage and refresh the Airflow UI.

More details about the datetime.datetime error

Complex DAG

Now that we understand the basic structure of a DAG, our objective is to use the dataproc_operator to make Airflow deploy a Dataproc cluster (Apache Spark) with just Python code!

[Optional]
If it is the first time using Dataproc in your project, you need to enable the API.

Look for Dataproc in the main menu
Enable Dataproc API

After enabling the API, don’t do anything else; just close the tab and continue ;)

Close this tab

Before reviewing the code I’ll introduce two new concepts that we’ll be using in this DAG

  • Macros (Default Variables): the Airflow engine passes a few variables by default that are accessible in all templates [Airflow Macros]. It means we’re able to call these variables in any part of our DAG (.py file); in this case, we’ll use {{ ds_nodash }} to make the name of our cluster unique on each run.
Airflow default variables [Airflow Macros]
  • Airflow Variables: these variables are defined by the user and can be called from any part of the DAG.

We need to create two variables: one to set the zone for our Dataproc cluster and the other for our Project ID. To do that, click ‘Variables’.

Enter dataproc_zone as Key and us-central1-a as Value, then save.

For your project_id, remember that this ID is unique for each project across all of Google Cloud. To get the value, click your project name (in this case ‘My First Project’); a modal with a table will pop up, and you just copy the value from the ID column.

Getting the project id

If everything is right, your Variables table should look like this.

Airflow variables defined by the user
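If you prefer not to click through the UI, the same two variables can be set from Cloud Shell through the Airflow CLI wrapped by gcloud (a sketch assuming an Airflow 1.10-based Composer environment; the environment name, location and project ID are placeholders):

# Set the Airflow Variables used by the next DAGs
gcloud composer environments run <your-environment-name> \
    --location us-central1 \
    variables -- --set dataproc_zone us-central1-a

gcloud composer environments run <your-environment-name> \
    --location us-central1 \
    variables -- --set project_id <your-project-id>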

Let’s check the code for this DAG. It has the same 6 steps; we only added the dataproc_operator, first to create and then to delete the cluster. Note the Default Variable ({{ ds_nodash }}) and the Airflow Variable (dataproc_zone) defined before. It’s important to validate the indentation to avoid any errors.

# STEP 1: Libraries needed
from datetime import timedelta, datetime
from airflow import models
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule

# STEP 2: Define a start date
# In this case yesterday
yesterday = datetime(2020, 2, 29)

# STEP 3: Set default arguments for the DAG
default_dag_args = {
    'start_date': yesterday,
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

# STEP 4: Define DAG
# Set the DAG name, add a DAG description, define the schedule interval
# and pass the default arguments defined before
with models.DAG(
        'complex_workflow',
        description='DAG for deployment a Dataproc Cluster',
        schedule_interval=timedelta(days=1),
        default_args=default_dag_args) as dag:

    # STEP 5: Set Operators
    # BashOperator
    # A simple print date
    print_date = BashOperator(
        task_id='print_date',
        bash_command='date'
    )

    # dataproc_operator
    # Create small dataproc cluster
    create_dataproc = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc',
        cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
        num_workers=2,
        zone=models.Variable.get('dataproc_zone'),
        master_machine_type='n1-standard-1',
        worker_machine_type='n1-standard-1'
    )

    # BashOperator
    # Sleep function to have 1 minute to check the dataproc created
    sleep_process = BashOperator(
        task_id='sleep_process',
        bash_command='sleep 60'
    )

    # dataproc_operator
    # Delete Cloud Dataproc cluster.
    delete_dataproc = dataproc_operator.DataprocClusterDeleteOperator(
        task_id='delete_dataproc',
        cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )

    # STEP 6: Set DAG dependencies
    # Each task runs after the previous one has finished.
    print_date >> create_dataproc >> sleep_process >> delete_dataproc

Save the code as complex_dag.py and, like with the simple DAG, upload it to the DAGs directory on Google Cloud Storage (the bucket).

Upload complex_airflow.py

If everything is running OK, you can check that Airflow is creating the cluster.

Airflow creating Dataproc cluster
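You can also confirm the cluster from Cloud Shell while the workflow sits in the sleep_process step (a sketch; the region flag is an assumption matching the us-central1-a zone used above, and the older contrib operator may register the cluster under the global region instead):

# The cluster name follows the pattern from the DAG: dataproc-cluster-demo-<ds_nodash>
gcloud dataproc clusters list --region us-central1
# If the list comes back empty, try the global endpoint
gcloud dataproc clusters list --region global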

After some minutes our process finished successfully.

Complex workflow done!

Spark workflow

This is the moment to complete our objective. To keep it simple, we won’t focus on the Spark code, so it will be an easy transformation using DataFrames; nevertheless, this workflow can apply to more complex Spark transformations or pipelines, since it simply submits a Spark job to a Dataproc cluster, so the possibilities are unlimited.

Spark code

First, we use the data from the Spark: The Definitive Guide repository (2010-12-01.csv). Download it locally and then upload it to the /data directory in your bucket with the name retail_day.csv (a command-line sketch follows the screenshots).

Retail data
upload the data
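A command-line sketch of the same step (the GitHub path is the usual location of that file in the databricks/Spark-The-Definitive-Guide repository and the bucket name is a placeholder; adjust both if yours differ):

# Download the sample CSV from the book's repository
curl -o retail_day.csv \
    https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/by-day/2010-12-01.csv
# Upload it to the /data directory of your bucket
gsutil cp retail_day.csv gs://<your-composer-bucket>/data/retail_day.csv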

Our Spark code will read the data uploaded to GCS, create a temporary view in Spark SQL, filter the rows with a UnitPrice of 3.0 or more and finally save the result to GCS in Parquet format. Remember to replace the bucket name with your own Google Cloud Storage bucket. Save the file as transformation.py and upload it to the spark_files directory (create this directory).

transformation.py
#transformation.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("My PySpark code") \
    .getOrCreate()

df = spark.read.options(header='true', inferSchema='true') \
    .csv("gs://us-central1-cl-composer-tes-fa29d311-bucket/data/retail_day.csv")
df.printSchema()

df.createOrReplaceTempView("sales")
highestPriceUnitDF = spark.sql("select * from sales where UnitPrice >= 3.0")
highestPriceUnitDF.write.parquet("gs://us-central1-cl-composer-tes-fa29d311-bucket/data/highest_prices.parquet")
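If you want to test transformation.py by hand before wiring it into Airflow, you can submit it to any running Dataproc cluster (a sketch; the cluster name and region are placeholders for a cluster you create yourself, since the one managed by the DAG is deleted at the end of each run):

gcloud dataproc jobs submit pyspark \
    gs://<your-composer-bucket>/spark_files/transformation.py \
    --cluster <your-test-cluster> \
    --region us-central1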

The Airflow code for this is the following. We added two Spark references needed for our PySpark job: the location of transformation.py and the name of the Dataproc job.

# STEP 1: Libraries needed
from datetime import timedelta, datetime
from airflow import models
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule

# STEP 2: Define a start date
# In this case yesterday
yesterday = datetime(2020, 2, 29)

# Spark references
SPARK_CODE = ('gs://us-central1-cl-composer-tes-fa29d311-bucket/spark_files/transformation.py')
dataproc_job_name = 'spark_job_dataproc'

# STEP 3: Set default arguments for the DAG
default_dag_args = {
    'start_date': yesterday,
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'project_id': models.Variable.get('project_id')
}

# STEP 4: Define DAG
# Set the DAG name, add a DAG description, define the schedule interval
# and pass the default arguments defined before
with models.DAG(
        'spark_workflow',
        description='DAG for deployment a Dataproc Cluster',
        schedule_interval=timedelta(days=1),
        default_args=default_dag_args) as dag:

    # STEP 5: Set Operators
    # BashOperator
    # A simple print date
    print_date = BashOperator(
        task_id='print_date',
        bash_command='date'
    )

    # dataproc_operator
    # Create small dataproc cluster
    create_dataproc = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc',
        cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
        num_workers=2,
        zone=models.Variable.get('dataproc_zone'),
        master_machine_type='n1-standard-1',
        worker_machine_type='n1-standard-1'
    )

    # Run the PySpark job
    run_spark = dataproc_operator.DataProcPySparkOperator(
        task_id='run_spark',
        main=SPARK_CODE,
        cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
        job_name=dataproc_job_name
    )

    # dataproc_operator
    # Delete Cloud Dataproc cluster.
    delete_dataproc = dataproc_operator.DataprocClusterDeleteOperator(
        task_id='delete_dataproc',
        cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )

    # STEP 6: Set DAG dependencies
    # Each task runs after the previous one has finished.
    print_date >> create_dataproc >> run_spark >> delete_dataproc

We can also check during the execution that the job worked correctly.

Dataproc Job Succeeded

We can view the parquet file created.

Parquet file created correctly
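Besides the console, a quick Cloud Shell check that the Parquet output exists (replace the bucket name with your own):

# Spark writes a directory of part files plus a _SUCCESS marker
gsutil ls gs://<your-bucket>/data/highest_prices.parquet/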

Finally, after some minutes, we can validate that the whole workflow executed successfully!

Success

Conclusions and Future work

The purpose of this article was to describe the advantages of using Apache Airflow to deploy Apache Spark workflows, in this case using Google Cloud components. If you want to check any of the code, I published a repository on GitHub.

PS: if you have any questions or would like something clarified, you can find me on Twitter and LinkedIn. Also, if you are considering taking a Google Cloud certification, I wrote a technical article describing my experience and recommendations.

