Running Airflow on Heroku

Avram Dames
8 min read · Jul 30, 2017

The scope of this tutorial is to walk you through a minimal Airflow deployment using Git and Heroku. It assumes you have basic knowledge of Airflow and its concepts. At the end, you will have:

  • A running Airflow instance on Heroku (secured with user and password).
  • A Heroku Postgres instance for storing your jobs history and users.
  • An easy way to create and deploy data pipelines.

Prerequisites:

  • Git, Python 3 and pip are installed on your local machine.
  • The Heroku CLI is installed. If not, follow these steps!
  • I am using Ubuntu. Please adapt the instructions to your OS.

The whole application in its final state is available here. I recommend building it from scratch on your local machine following the instructions below, but feel free to peek at it at any time if the instructions are not clear.

What is Heroku?

You can skip this part if you are already familiar with Heroku.

In case you haven’t used Heroku before, here is the gist of it: Heroku is like GitHub, Docker and AWS, all combined into one interface.

Heroku lets you build, run and manage your application. All you have to do is provide Heroku with the source code of your application together with its dependencies (pandas, numpy, database connectors, etc.).

For instance, consider a local Python project. In order to run it on Heroku, you’ll have to follow a few general steps:

  1. Set up your project with Git, for version control.
  2. Define a remote Heroku repository for your project.
  3. Push your local files to the remote Heroku repository.
  4. Heroku will build the app, based on your files.
  5. Heroku will run the app, based on your instructions.
  6. You can access your app on the public internet.

In the context of our deployment, the question is: what is Airflow? Is it our app or is it a dependency? It turns out it's both.

Airflow is a Flask app (and a Python module) that we will run on Heroku. The source code for this app has already been written by other people and is easily available via pip. Hence, Airflow will be installed with pip as a dependency, but will also be run as the main app.

Will we write any code?

Yes. We'll also use Airflow as a Python module, importing from it the basic building blocks for our DAGs, which define the data pipelines to be executed by Airflow's scheduler. The DAG files will be part of our application.

So what do we need to start?

The answer is pretty straightforward. Our deployment will consist of two files, plus our DAG folder and its contents (the resulting project layout is sketched right after this list). Hence, this is what we need:

  1. A requirements.txt file, for our dependencies (e.g. apache-airflow, pandas, etc.)
  2. A Procfile, to instruct Heroku what to run (e.g. airflow webserver)
  3. DAG scripts, placed in a sub-folder called “dags”
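
Put together, the project in its final state will be organized roughly like this (a sketch; tutorial_dag.py is the example DAG we create towards the end):

airflow_tutorial/
├── .gitignore
├── Procfile
├── requirements.txt
├── dags/
│   └── tutorial_dag.py
└── .venv/   (local virtual environment, not tracked by git)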

Get familiar with Heroku by reading their excellent “Getting Started on Heroku with Python” guide. Once you are comfortable with the basics, you can read more about how it works.

The local project

We’ll create the above files in a local folder called “airflow_tutorial”, preferably created inside your workspace directory. I personally use /home/my_user_name/dev to store all my projects.

~$ cd <your_workspace_directory> # let's call it "~/wd"
~/wd$ mkdir airflow_tutorial

Have a look here to see how the project is organized, in case the instructions are confusing.

Create a Python virtual environment to install Airflow together with its dependencies. This way we get to keep things clean and easy to debug.

~/wd$ cd airflow_tutorial
~/wd/airflow_tutorial$ python3 -m venv .venv
~/wd/airflow_tutorial$ source .venv/bin/activate

Install the apache-airflow package together with the PostgreSQL dependency using pip. We will use Postgres as a back-end, so it is a good idea to install the Python connector for it right away. Once the installation is successful, we can freeze the state of the virtual environment into a requirements.txt file. This file will be automatically used by Heroku to install our app's dependencies. It is mandatory that requirements.txt is located in the root folder of the application.

(.venv) ~/wd/airflow_tutorial$ pip install "apache-airflow[postgres, password]"
(.venv) ~/wd/airflow_tutorial$ pip freeze > requirements.txt

In case you are running Ubuntu, you might encounter a bug when running pip freeze. Either manually delete the line pkg-resources==0.0.0 from your requirements.txt, or use this command to generate the file correctly instead: pip freeze | grep -v "pkg-resources" > requirements.txt

If you are encountering other issues while installing airflow and its postgres dependency, you’ll have to google the error. Most likely some system level packages are missing.

The Procfile tells Heroku what to run once the application is built. Create one in your application's root folder.

(.venv) ~/wd/airflow_tutorial$ touch Procfile

Copy the following line into your Procfile, to tell Heroku to execute the airflow initdb command as soon as the container starts. This will create all the necessary tables in the database that Airflow will use to store its metadata.
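
A minimal Procfile for this step could look like the single line below (a sketch: using the web process type for this one-off initialization is my assumption; the command itself simply initializes the database and exits):

web: airflow initdb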

Now that the two most important files of your deployment are defined, it is time to put the project under version control (git). It is a good idea to also create a .gitignore file, so git knows which files not to track, such as Python bytecode files, the virtual environment directory and others.

(.venv) ~/wd/airflow_tutorial$ touch .gitignore

Copy the lines below into your newly created .gitignore file:
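
A reasonable starting point is sketched below; the Airflow-specific entries are an assumption on my part and only matter if you also run Airflow locally:

# Python bytecode
__pycache__/
*.pyc

# local virtual environment
.venv/

# local Airflow artifacts (only created if you run Airflow on your machine)
airflow.cfg
airflow.db
unittests.cfg
logs/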

As your project grows, do not forget to update this file to make sure git is not tracking sensitive files (credentials) as well as project-specific directories that would just create a mess and maybe errors down the road.

Final steps: add your files to version control and make an initial commit.

(.venv) ~/wd/airflow_tutorial$ git init
(.venv) ~/wd/airflow_tutorial$ git add .
(.venv) ~/wd/airflow_tutorial$ git commit -m "initial commit"

Deploy to Heroku

Assuming the Heroku CLI tool is installed on your computer and you are already logged in, creating a new app and assigning your local project to it is a piece of cake.

(.venv) ~/wd/airflow_tutorial$ heroku create

This will create a new app on Heroku service and will assign a remote Heroku repository to your local project. Pushing files to this repository is just as easy as creating the app. Before we do that, let’s add another piece of the puzzle: the back-end database.

Heroku will provision a free Postgres instance for our app in the form of an add-on. Simply run the following command:

(.venv) ~/wd/airflow_tutorial$ heroku addons:create heroku-postgresql:hobby-dev

If you are familiar with Airflow, you know that the airflow.cfg file has a big role in the setup process. This is where you define the connection string to your database as well as many other settings. Personally, I like to use environment variables for this configuration instead of having local copies of the file. Heroku makes setting environment variables very easy. For example, run the following line to check which environment variables you already have:

(.venv) ~/wd/airflow_tutorial$ heroku config

The output should include at least the DATABASE_URL for the newly created Postgres instance: DATABASE_URL: postgres://<secret_string>

Setting up environment variables is just as easy:

(.venv) ~/wd/airflow_tutorial$ heroku config:set AIRFLOW__CORE__SQL_ALCHEMY_CONN=<your_postgres_con_string>
(.venv) ~/wd/airflow_tutorial$ heroku config:set AIRFLOW__CORE__LOAD_EXAMPLES=False
(.venv) ~/wd/airflow_tutorial$ heroku config:set AIRFLOW_HOME=/app
(.venv) ~/wd/airflow_tutorial$ heroku config:set AIRFLOW__CORE__FERNET_KEY=<secret_key>

By the way, if you don't already have a Fernet key, this is how you create one. Just open a Python console and type the instructions below:

>>> from cryptography import fernet
>>> fernet.Fernet.generate_key()
b'pZcwcoB8RQfjtE9n0Du5Weu8zLKoFphKkiGDBihOwcM='
>>>

Once you have set your environment variables, all you have to do is push your local files to Heroku and watch the app being built in front of your eyes.

(.venv) ~/wd/airflow_tutorial$ git push heroku master

As our Procfile instructed Heroku to initialize the database only, and not to run Airflow's webserver or scheduler, we need to verify in the logs that the tables were indeed created and that no errors occurred during the process.

(.venv) ~/wd/airflow_tutorial$ heroku logs --tail

In case there are no errors encountered during the build, we need to adjust a few things before we start the Airflow webserver and access the web interface.

First, we must modify the Procfile, instructing Heroku to run our webserver instead of initiating the database. Modify the local Procfile to look like the one below.
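
Something along these lines should do (a sketch; $PORT is the port Heroku assigns to the dyno at runtime):

web: airflow webserver --port $PORT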

Second, we need to secure our app by adding a couple of extra environment variables.

(.venv) ~/wd/airflow_tutorial$ heroku config:set AIRFLOW__WEBSERVER__AUTHENTICATE=True
(.venv) ~/wd/airflow_tutorial$ heroku config:set AIRFLOW__WEBSERVER__AUTH_BACKEND=airflow.contrib.auth.backends.password_auth

Commit the changes to git and re-deploy the app.

(.venv) ~/wd/airflow_tutorial$ git add .
(.venv) ~/wd/airflow_tutorial$ git commit -m "change Procfile to start the webserver"
(.venv) ~/wd/airflow_tutorial$ git push heroku master

Again, once the push is successful, you can check the logs:

(.venv) ~/wd/airflow_tutorial$ heroku logs --tail

In case there are no issues with the push and the subsequent build, you can access your application by using the following command:

(.venv) ~/wd/airflow_tutorial$ heroku open

If everything went well, you should see Airflow's login screen in your browser.

Now it's time to create the first user. For that, we'll need to open a remote shell on our Heroku app:

(.venv) ~/wd/airflow_tutorial$ heroku run bash

… open a Python console:

<remote_host>$ python

… and run the following commands as also described in Airflow’s official documentation, replacing the quoted strings with your own credentials.

>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'new_user_name'
>>> user.email = 'new_user_email@example.com'
>>> user.password = 'set_the_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()

These credentials will be saved in the Postgres instance you created earlier. As you redeploy your Airflow app on Heroku, these credentials, as well as everything else saved in the database (job history, connections, variables, etc.), will persist. However, resetting the database will delete them.

Time to exit the remote host:

<remote_host>$ exit

Now, you can finally log in to your brand new Airflow app and check it out.

Scheduling DAGs

Now that the app is live and you can log into the web interface, you probably wish to set up a data pipeline. Let’s start by creating a new folder inside your project, called dags.

(.venv) ~/wd/airflow_tutorial$ mkdir dags
(.venv) ~/wd/airflow_tutorial$ touch dags/tutorial_dag.py

Copy the instructions below into the newly created “tutorial_dag.py”.
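
If you don't have a pipeline at hand, a minimal sketch in the spirit of Airflow's own tutorial DAG could look like this (the DAG id, schedule and the single BashOperator task are purely illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Default arguments applied to every task in this DAG.
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 7, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# A tiny DAG with a single task that prints the current date once a day.
dag = DAG('tutorial_dag', default_args=default_args, schedule_interval=timedelta(days=1))

print_date = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)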

Heroku will place the dags folder inside the /app directory, which is also the home folder for Airflow (if you paid attention when we created the environment variables, you already know that). Hence, our Airflow app will know how to find our DAGs as soon as we push them to Heroku, which we'll do in the next step, but not before we make one last change to our Procfile.

(.venv) ~/wd/airflow_tutorial$ git add dags
(.venv) ~/wd/airflow_tutorial$ git commit -m "create test dag"

As you probably know, the last time we modified the Procfile, we instructed Heroku to run the Airflow webserver, but we did not specify anything about the Airflow scheduler, without which Airflow is not going to run our DAGs any time soon. So, let's replace our previous Procfile with the one below.

As the free tier of Heroku allows us to run only one process, we have to start the webserver as a background daemon and run the scheduler as the main app.
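
A Procfile in that spirit could be the one-liner below (a sketch; the --daemon flag sends the webserver to the background, leaving the scheduler as the foreground process that Heroku monitors):

web: airflow webserver --daemon --port $PORT && airflow scheduler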

(.venv) ~/wd/airflow_tutorial$ git add Procfile
(.venv) ~/wd/airflow_tutorial$ git commit -m "change Procfile to run the webserver as daemon and scheduler as main app"

Finally, we're ready to push our changes to Heroku.

(.venv) ~/wd/airflow_tutorial$ git push heroku master

That is it! At this point you should have a running Airflow app with a secured web interface and a Postgres backend database. Of course, there are many other options available for both Airflow and Heroku, in order to take this app to the next level, but the point of this tutorial is to provide a minimal deployment structure to get you started.

Hope you enjoyed following along, and please feel free to provide any criticism or suggestions regarding this tutorial.
