Management is doing things right;
leadership is doing the right things.
— Peter Drucker
Part 1: Introduction, Installation, and Setup of Apache Airflow in Python
Data is at the center of every business today. As companies become more reliant on data, the importance of data engineering continues to grow. Data engineering makes data more useful and accessible to the people and systems that consume it. Data engineers play a major role in building the end-to-end processes known as “data pipelines.” Each pipeline has one or more sources and one or more destinations. Within the pipeline, data may undergo several steps such as transformation, validation, enrichment, and summarization. Data engineers create these pipelines and manage the workflow with dedicated engineering tools. While new data technologies emerge frequently, Apache Airflow is one of the most widely used tools for workflow management.
What is Airflow?
As a data engineer, there will come a point when you need to orchestrate your workflows, and this is where Apache Airflow fills the gap. Originally created at Airbnb in 2014, Apache Airflow is an open-source tool for orchestrating complex workflows and data processing pipelines. It is a platform for programmatically authoring, scheduling, and monitoring workflows.
Authoring: Workflows in Airflow are written as Directed Acyclic Graphs (DAGs) in Python, which lets users define custom workflows in ordinary Python code.
Scheduling: The user can specify when a workflow should start, when it should end, and at what interval it should run again.
Monitoring: The Airflow UI makes it easy to monitor and troubleshoot your data pipelines, with tools to track your workflows as they run.
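To make the "directed acyclic graph" idea a bit more concrete before we install anything, here is a small sketch that uses only Python's standard library (graphlib, Python 3.9+), not Airflow itself. The task names and the run_order helper are hypothetical and only illustrate how dependencies between tasks determine the order in which they can run.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load": {"transform"},
}


def run_order(graph):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


print(run_order(pipeline))  # ['extract', 'validate', 'transform', 'load']
```

If two tasks depended on each other, run_order would raise CycleError, which is why a workflow has to be acyclic: a cycle leaves no task that can safely run first.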
Installation and Setup
Now that we know what Airflow does, let’s set it up on our workstation so that we can locally test and run the pipelines we build. The first step is to create a new folder, then create and activate a Python 3 virtual environment in which we will install and run Airflow:
$ mkdir Airflow # make a folder
$ cd Airflow # move into the directory
$ python3 -m venv airflowenv # create a virtual environment
$ source airflowenv/bin/activate # activate the environment
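If you are not sure whether the environment is really active, one quick check (a sketch, not part of Airflow) is to ask the Python interpreter itself. The in_virtualenv helper name is my own; it relies on the fact that inside a venv sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

Run it with the same python you intend to use for Airflow; if it prints False, the activate step did not take effect in that shell.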
The easiest way to install the latest stable version of Airflow is with pip. Before that, upgrade pip itself, then install Airflow:
$ pip install --upgrade pip # pip upgrade
$ pip install apache-airflow # installing airflow
After installation, check which version of Airflow you have installed:
$ airflow version
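The same check can be done from Python, which is handy if the airflow command is not on your PATH. This is a minimal sketch using the standard library's importlib.metadata; the installed_version helper is a hypothetical name, and apache-airflow is the package name we installed with pip above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="apache-airflow"):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("apache-airflow:", installed_version() or "not installed in this environment")
```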
Airflow needs a home directory on your local system, set through the AIRFLOW_HOME environment variable. If we don’t specify it, it defaults to ~/airflow under your home directory. I recommend setting AIRFLOW_HOME to the directory you are currently in, i.e. where you created the virtual environment.
$ export AIRFLOW_HOME=$(pwd) # absolute path to the current directory
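To see which directory that variable points to in your current shell, here is a small sketch. It only mimics the lookup described above (the environment variable first, then the ~/airflow default) and is not Airflow's own code; resolve_airflow_home is a hypothetical helper name.

```python
import os
from pathlib import Path


def resolve_airflow_home(environ=None):
    """Return the absolute path that AIRFLOW_HOME (or its default) points to."""
    environ = os.environ if environ is None else environ
    # ~/airflow is the default location when AIRFLOW_HOME is unset.
    raw = environ.get("AIRFLOW_HOME", "~/airflow")
    return Path(raw).expanduser().resolve()


print("Airflow home:", resolve_airflow_home())
```

Printing the resolved path makes it obvious when a new shell tab has lost the variable and fallen back to the default.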
After configuration, you’ll need to initialize the database before you can run tasks:
$ airflow db init
This sets up the metadata database Airflow needs before it can run your tasks. If you look closely, you will see some new files and folders created under your Airflow directory. We will look into them in a later post.
As you are getting started with Airflow for the first time, the web UI will ask you to log in. To create a user, run the following command, which sets both the username and the password to “admin”.
$ airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
Now we are ready to run the Airflow web server and scheduler locally. Use two separate shell tabs to run the two processes, and don’t forget to activate the virtual environment and set AIRFLOW_HOME again in each new tab.
As good practice, I suggest you start the scheduler first and then the web server.
Start the Scheduler:
$ airflow scheduler
Start the web server:
$ airflow webserver --port 8080 # default port is 8080
To access the Web Server UI, visit localhost:8080 in the browser and use the admin account you just created to log in.
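Besides opening the UI, you can check from a script that both processes are up. The sketch below assumes an Airflow 2.x web server, which exposes a /health endpoint returning JSON with a status for the metadatabase and the scheduler; the helper names, the base URL, and the timeout are my own choices for illustration.

```python
import json
import urllib.error
import urllib.request


def fetch_health(base_url="http://localhost:8080", timeout=5):
    """Return the parsed /health JSON from the web server, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return None


def component_status(health, component):
    """Extract one component's status string (e.g. 'healthy') from the payload."""
    if not isinstance(health, dict):
        return "unreachable"
    return health.get(component, {}).get("status", "unknown")


if __name__ == "__main__":
    health = fetch_health()
    for name in ("metadatabase", "scheduler"):
        print(name, "->", component_status(health, name))
```

If the scheduler tab is not running, the web server still answers, but the scheduler status will not be reported as healthy, which is a quick way to notice that you forgot to start it.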
Congratulations, you have successfully set up your Airflow environment. In the next blog post, I’ll show you how to run a real-time workflow using Airflow. Until then, keep your setup ready and stay motivated.
It’s expected that more than 10K software engineers, data scientists, business intelligence analysts, and DevOps engineers will virtually attend the Airflow Summit 2021 (July 8th-16th). Airflow Summit is a free online conference for the worldwide community of developers and users of Apache Airflow. The event will consist of keynotes, community talks, and in-depth workshops.
Register for Airflow Summit 2021.
Thank you for reading!
Follow me on Medium for the latest updates. 😃