From Zero to Apache Airflow Contribution — Part 1

Rafael Bottega
May 10, 2020


This post guides you from a clean MacBook, through installing all the required software, to opening your first pull request on the Apache Airflow project.

Install Homebrew

If you are a Mac user, you already know about brew; it is comparable to apt-get on Linux. Homebrew will help us install pyenv.

To install Homebrew, you can follow the instructions at brew.sh, which boil down to:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

Follow the steps and you will be able to install a load of packages with a single brew command.
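For example, installing a tool such as wget (an arbitrary package, just to try it out) is a single command:

brew install wget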

Install Python3

You probably already have Python installed on your laptop; it comes with macOS by default. You can check by running:

which python3

or

python3 -V

If Python 3 is not installed, you can use brew to install it. Don’t worry about which Python 3 version you end up with; we will sort that out with pyenv.

brew install python3

Install Pyenv

Eventually you will need several versions of Python on your machine, either to test a project such as Apache Airflow against multiple versions, or because different projects pin different versions.

To install pyenv you can use brew:

brew install pyenv

After installing the package, we need to configure it correctly (you can follow along with the pyenv GitHub installation guide). You need the .pyenv folder in your home directory; to always have the latest configuration, get it from git:

git clone https://github.com/pyenv/pyenv.git ~/.pyenv
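Since the installation is just a git checkout, you can update pyenv later with a simple pull:

cd ~/.pyenv && git pull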

Then add the pyenv folder to your PATH and define PYENV_ROOT (I strongly advise using zsh; otherwise, check how to do the equivalent in Bash or another shell):

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc

Now add pyenv init to your shell to enable shims and autocompletion.

echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.zshrc

After this, restart the terminal.
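Instead of opening a new terminal window, you can also reload the current shell in place and confirm pyenv is available:

exec "$SHELL"
pyenv --version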

Done. Now we can install other Python versions; let’s install the latest release of each Python 3 minor version:

pyenv install 3.6.10
pyenv install 3.7.7
pyenv install 3.8.2

Now let’s check the installed versions and set the global one to `3.7.7`:

pyenv versions
pyenv global 3.7.7

If you’ve done everything correctly, running python3 -V should now print Python 3.7.7.
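You can also confirm that python3 now resolves to a pyenv shim rather than the system or Homebrew binary:

which python3

The output should look like /Users/[your_user]/.pyenv/shims/python3.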

Create an airflow project folder and set the python version

Create a folder that will hold the Airflow repository and the virtual environment; the folder structure will look like this:

airflow_contrib
|-> airflow (github repo)
|-> env (virtual environment)

Create the airflow_contrib folder and enter it:

mkdir airflow_contrib
cd airflow_contrib

Set the local python version for this folder (I am using python 3.6.10 for this):

pyenv local 3.6.10
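Behind the scenes, pyenv local just writes a .python-version file into the folder, which you can inspect to confirm:

cat .python-version

Any shell that enters this folder will now resolve python3 to 3.6.10.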

Install virtualenv

Even with pyenv managing versions, I advise creating a virtual environment for each project, so each project’s packages stay at the correct versions. You can install virtualenv with pip (note that the venv module we use below already ships with Python 3, so this step is optional):

pip3 install virtualenv

Create the env folder holding the virtual environment (make sure you are inside the airflow_contrib folder):

python3 -m venv env

Activate the virtual environment:

source env/bin/activate
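Your prompt should now show the (env) prefix, and python should resolve to a path ending in env/bin/python:

which python

When you are done working, you can leave the virtual environment with deactivate.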

Fork and clone Apache Airflow repository

Go to the Apache Airflow GitHub page and click the Fork button in the top-right corner. This creates a copy of the repository under your account.

Clone your fork of the Airflow repository into the airflow_contrib folder:

git clone git@github.com:[your_account]/airflow.git
cd airflow

Update upstream (do it every time)

After some time, your fork will need to be updated with new commits made to the upstream repository (apache/airflow). First, check whether your repo is pointing to the original one as upstream:

git remote -v

You should see https://github.com/apache/airflow.git listed as upstream; if not, run:

git remote add upstream https://github.com/apache/airflow.git

Now fetch and merge the new changes:

git fetch upstream
git checkout master
git merge upstream/master
git push origin master
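If you already have a feature branch in progress, you will usually also want to rebase it on the refreshed master (my-feature is a hypothetical branch name here):

git checkout my-feature
git rebase master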

Install Docker

Download and install Docker Desktop for Mac. If you’ve done it correctly, you should be able to run this command without error:

docker --version
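If the command prints a version but later steps fail, check that the Docker daemon itself is running; a quick end-to-end test is:

docker run hello-world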

Install getopt and gstat

Some GNU utilities are necessary to run Breeze on a Mac; to install them, run:

brew install gnu-getopt coreutils

After that, add getopt to the PATH (zsh):

echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.zprofile

Restart the terminal after that.
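To verify that the GNU version is the one being picked up (the BSD getopt shipped with macOS does not support the long options Breeze needs), run:

getopt --version

It should print something like getopt from util-linux, confirming the Homebrew version comes first in your PATH.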

Start Breeze

Now you are ready to start the Breeze environment. Breeze is an easy-to-use development environment based on Docker Compose, used both for local development and for CI tests.

Up to this point you should be using Python 3.6.10 with the virtualenv activated in the airflow_contrib folder. Enter the airflow folder (the Airflow repo folder) and run:

./breeze

This command takes several minutes, so grab a coffee and wait. It starts a Docker container with all the packages required for us to start developing and testing changes to Apache Airflow. After it finishes, you will be connected to the container, showing:

root@[container_id]:/opt/airflow#

If you leave the container, it is automatically turned off. To enter again, just run ./breeze and it will start a fresh new container; any changes you made in the last container will be lost.

Configure the integration between the image and your local folder

You will see that the image has some ports open and also a mounted folder. We will use these to start the Airflow services integrated with the changes made in our repository folder.

The mounted folder points to the files folder inside the airflow repository. Create another folder there called airflow-breeze-config, and inside it a file called variables.env; this file is executed every time you start the Breeze environment. Copy this into variables.env:

export AIRFLOW__CORE__DAGS_FOLDER=/files/dags
export AIRFLOW__CORE__BASE_LOG_FOLDER=/files/logs
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=sqlite:////files/airflow.db
export AIRFLOW__WEBSERVER__EXPOSE_CONFIG=True
airflow initdb
airflow webserver -D
airflow scheduler -D

This makes your Airflow DAGs and logs accessible outside the container, so you can read logs and errors and add DAGs to the dags folder. The SQL_ALCHEMY_CONN line places the SQLite database in the mounted folder, so configuration and runs are kept between restarts. Lastly, the webserver and scheduler are started as daemons (-D), so you still have access to the bash terminal while Airflow runs in the background.

When you run breeze without any parameters, it runs version 2.0.0dev, but the commands we added to the variables.env file only work for 1.10.*, so let's specify the Airflow version we want to run:

./breeze -a 1.10.10

If you’ve done everything right up to this point, you will be able to access Airflow at: http://127.0.0.1:28080
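To confirm the whole integration works end to end, you can drop a trivial DAG into the mounted dags folder from your laptop. Here is a minimal sketch for Airflow 1.10 (hello_dag.py is a hypothetical file name), run from inside the airflow repo folder:

cat > files/dags/hello_dag.py <<'EOF'
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A single-task DAG with no schedule, to be triggered manually from the UI.
dag = DAG(
    "hello_dag",
    schedule_interval=None,
    start_date=datetime(2020, 1, 1),
)

say_hello = BashOperator(
    task_id="say_hello",
    bash_command="echo hello from breeze",
    dag=dag,
)
EOF

After a few seconds the scheduler should pick it up, and hello_dag will appear in the web UI, where you can trigger it manually.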

Install TravisCI into your forked Github repository

Travis CI is the continuous integration service used by the Airflow project to test and build new versions. It is free to use; you just need to enable it for your forked repository.

First, access the Travis CI website and sign up with your GitHub account. It will request authorisation to access your GitHub account; after accepting, go to GitHub Settings and, under Applications, find Travis CI and click Configure. On that screen, give Travis CI access to your forked airflow repository (USERNAME/airflow).

Now, every time you push a commit to your fork, Travis CI starts a build that runs all the necessary tests against every supported Airflow version. You can access the builds for your fork at https://travis-ci.com/USERNAME/airflow.

See you in the next part

Here we close the first part: you now have an environment ready to make changes and propose PRs. In the next part, we will go step by step through making your first contribution to the Apache Airflow project.

Part 2
