Triggering your Jupyter Notebooks with Apache Airflow
Introduction
Jupyter Notebook provides a great environment to load, explore, analyze and export large sets of data. It’s also a widely recognized tool in the data analysis and data science communities.
Once you have explored and extracted the data the way you want, you may want to set up a production process that automatically runs everything you have done in your Notebook.
That’s when you should take a look at Apache Airflow!
Apache Airflow is an Open Source workflow orchestrator, which is a better fit than crontabs when you’re dealing with complex workflows and parallel processes. It also provides a web user interface on which you can monitor your task executions.
This article will help you take your first steps combining these great tools!
The data used in this article is available at https://rebrickable.com/downloads/ and is updated every month.
You can download all the material for this article on my GitHub account: https://github.com/pcoipeault/jupyter_airflow
A last valuable resource is the Airflow GitHub repository, where additional operators can be found: https://github.com/apache/airflow/tree/master/airflow
This article assumes you have a dedicated server for each component: one for Jupyter, another for Airflow, and a last one for your database.
Let’s start writing your Notebook!
Given that you will have to handle data periodically, you may not want to rerun your Jupyter Notebook by hand every time. Your code will have to be generic and able to handle parameters that vary over time. For example, the files you retrieve can have a date in their filenames, as in the short sketch below.
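A minimal sketch of such a parameter could look like this (the filename pattern is hypothetical, adapt it to your own files):

from datetime import date

# Build the filename from the current month, so the same code works on every run.
processing_month = date.today().strftime("%Y-%m")
sets_filename = "sets_" + processing_month + ".csv"  # e.g. sets_2019-07.csv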
Assuming you’re a LEGO fan and you love analyzing every LEGO set that has been created since the company’s beginnings, you create your Jupyter Notebook to handle this data and you want your database to be updated every month.
The first step, in this hypothetical use case, consists of downloading CSV files from an SFTP server. Then, we read some of those CSV files and join them. The last step is to save the data to a new file, upload it to a Google Cloud Storage bucket, and archive the reference files. A condensed sketch of such a Notebook is shown below.
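Everything in this sketch is an assumption (hostnames, credentials, filenames, the bucket name, the pysftp and google-cloud-storage libraries, and the join columns from the Rebrickable schema), so treat it as a starting point rather than the exact Notebook from the repository.

import pandas as pd
import pysftp
from google.cloud import storage

# 1. Download the monthly CSV files from the SFTP server (hypothetical host and credentials).
with pysftp.Connection("sftp.example.com", username="lego", password="secret") as sftp:
    for filename in ("sets.csv", "themes.csv"):
        sftp.get("/data/" + filename, filename)

# 2. Read some of the CSV files and join them (Rebrickable sets reference their theme by id).
sets = pd.read_csv("sets.csv")
themes = pd.read_csv("themes.csv")
sets_with_themes = sets.merge(themes, left_on="theme_id", right_on="id", suffixes=("", "_theme"))

# 3. Save the result to a new file and upload it to a Google Cloud Storage bucket (hypothetical name).
sets_with_themes.to_csv("sets_with_themes.csv", index=False)
client = storage.Client()
bucket = client.bucket("my-lego-bucket")
bucket.blob("exports/sets_with_themes.csv").upload_from_filename("sets_with_themes.csv")

# 4. Archiving the reference files is left out here for brevity.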
Make our Jupyter Notebook executable via SSH
Now that our Notebook handles the LEGO data in a generic way, we want Apache Airflow to run it.
The first thing to do is to save our Notebook as a Python script file. To do so, on your Notebook web interface, click on ‘File’ > ‘Download as’ > ‘Python (.py)’.
This will download the file to your computer. A quick SCP command, and it’s on your Jupyter server!
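If you prefer the command line, the export and the copy can also be done in two commands (the notebook name, hostname and paths below are hypothetical):
> jupyter nbconvert --to script mynotebook.ipynb
> scp mynotebook.py user@jupyter-server:/tmp/myscript.py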
Then, on your server, you create an SSH key that is bound to the execution of the Python script (the bash scripts are included). For example, you type the following commands at your prompt:
> mv myscript.py /home/airflow/
> su airflow
> ./install_python_script.sh myscript.py
At the end, the script will display an SSH key. Copy this key into a new file on your computer.
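What such an install script typically does is append a forced-command entry to the airflow user’s ~/.ssh/authorized_keys file on the Jupyter server, so that connecting with this key can only run the script. A hypothetical entry (the key material and path are placeholders) looks like this:

command="python /home/airflow/myscript.py" ssh-rsa AAAA...your-public-key... airflow@jupyter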
Onto Airflow configuration!
Before coding the DAG that will execute the Python script, you have to configure Airflow.
Begin by uploading the Python script’s SSH key to the Airflow server. Take note of the path where you store it.
On the Airflow web interface, click on the ‘Admin’ menu, then on ‘Connections’, and create a new connection. Fill in the form with your server information, and in the ‘extra’ field, give the ‘key_file’ parameter the path to the SSH key, as in the example below.
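For an SSH connection, the ‘extra’ field holds a small JSON document; assuming the key was stored under /home/airflow/.ssh/ (a hypothetical path), it could look like this:

{"key_file": "/home/airflow/.ssh/jupyter_key"}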
Don’t forget to configure the connection to your PostgreSQL server as well:
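A hypothetical set of values for this connection (the connection id, host and credentials are placeholders for your own) could be:

Conn Id: lego_postgres
Conn Type: Postgres
Host: postgres.example.com
Schema: lego
Login: airflow
Port: 5432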
Now that Airflow is configured, it’s time to code a DAG!
Coding your DAG
The DAG we will implement is quite simple. First, it will execute over SSH the Python script made from our Jupyter Notebook. Then, it will import the data contained in the file on our Google Cloud Storage bucket into a PostgreSQL database. A minimal sketch is shown below.
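This sketch assumes Airflow 1.10-style imports, the hypothetical connection ids and bucket names used above, and a lego_sets table that already exists in PostgreSQL; the import task uses a PythonOperator with the GCS and Postgres hooks rather than a dedicated transfer operator, so adapt it to your own setup.

from datetime import datetime

from airflow import DAG
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator


def load_csv_into_postgres():
    # Download the exported file from the bucket (bucket and object names are hypothetical).
    gcs = GoogleCloudStorageHook(google_cloud_storage_conn_id="google_cloud_default")
    local_path = "/tmp/sets_with_themes.csv"
    gcs.download(bucket="my-lego-bucket", object="exports/sets_with_themes.csv", filename=local_path)

    # Bulk-load the file into PostgreSQL with COPY (the lego_sets table must already exist).
    postgres = PostgresHook(postgres_conn_id="lego_postgres")
    postgres.copy_expert("COPY lego_sets FROM STDIN WITH CSV HEADER", local_path)


default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}

with DAG("lego_monthly_import", default_args=default_args, schedule_interval="@monthly", catchup=False) as dag:

    run_notebook_script = SSHOperator(
        task_id="run_notebook_script",
        ssh_conn_id="jupyter_ssh",  # the SSH connection configured above
        command="python /home/airflow/myscript.py",
    )

    import_into_postgres = PythonOperator(
        task_id="import_into_postgres",
        python_callable=load_csv_into_postgres,
    )

    run_notebook_script >> import_into_postgres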
Once the DAG is coded, you upload it to your Airflow server’s DAGs folder, and it will run automatically on the schedule you set, or you can trigger it manually.
Final thoughts
With Jupyter Notebooks, you can handle data the way you want. With Airflow, you can create complex workflows and tasks, with lots of operators and plugins (and you can create your own!). Combining those tools can help you orchestrate data workflows and a lot more :)
Happy coding!