CI/CD pipeline with Google Compute Engine and GitHub Actions, part I

I was sitting and looking at the formula that calculates ROI for task automation:
TIME (hours spent on a single manual task) x FREQUENCY (times the task is performed per month) x HOURLY PAY x 12 MONTHS = YEARLY SAVINGS/VALUE FROM AUTOMATION
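For illustration (all of these numbers are hypothetical): a 15-minute (0.25-hour) manual deployment performed 20 times a month at a $50 hourly rate works out to 0.25 x 20 x 50 x 12 = $3,000 of yearly value from automating it.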

But the prospect of automatically building and pushing new or updated DAGs from my local branch upstream, saving me the extra clicks needed to upload the scripts to Google Compute Engine, was not the only reason I started looking into building a CI/CD pipeline for Apache Airflow DAGs.

“A CI/CD pipeline gives developers the power to fail fast and recover even faster…”. Have you ever hit an “Unexpected indent” error in an Airflow DAG that you discovered only after pushing the DAG to production? With a CI/CD pipeline, you can test feature branches in GitHub before they are merged into the main Git branch.

In this post, I want to show you how to get CI/CD configured using GitHub Actions. The requirement I was working on was to automatically deliver updates to the copy of the repository on a Virtual Machine whenever a code change from a feature branch is merged into the master branch.


Why GitHub Actions? The main reason is that we already use GitHub to store and share code, so GitHub Actions, the native continuous integration and continuous delivery tool for GitHub projects, was our first pick. Another reason is that even though GitHub Actions is not free for private repositories, “each GitHub account receives a certain amount of free minutes and storage for use with GitHub-hosted runners, depending on the account’s plan. Any usage beyond the included amounts is controlled by spending limits.” The 2,000 minutes per month included with the GitHub Free plan were enough to satisfy our current and near-future needs.

Prerequisites:

  • A Google Cloud Platform account with a running Google Compute Engine instance;
  • A GitHub account.

Steps:

Google Compute Engine:

(1) To install Git on your Google Compute Engine instance, run sudo apt-get install git

(2) Create a new user named deploy, create its home directory, and lock the account's password so that deploy can only log in with an SSH key, by running

sudo useradd --create-home --shell /bin/bash deploy
sudo usermod --lock deploy

(3) Switch to the new user (deploy), generate an SSH key pair that GitHub Actions will use to execute commands over SSH, and append the public key to authorized_keys so that the key can be used to log in to the server:

sudo -i -u deploy
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C "deploy@server"
cat ~/.ssh/id_ed25519.pub > ~/.ssh/authorized_keys

(4) Print the public and private keys by running the following commands, and copy each of them somewhere safe to use later:

cat ~/.ssh/id_ed25519.pub
cat ~/.ssh/id_ed25519
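Before moving on, you can sanity-check that key-based login works for the deploy user. A minimal check, assuming you saved the private key to a file on your local machine (the path and IP below are placeholders):

ssh -i /path/to/saved/private_key deploy@YOUR_VM_EXTERNAL_IP "echo connection ok"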

(5) Switch back to the root user. Change to the folder that contains your DAGs, as it is the folder you will synchronize with GitHub; by default, the DAGs folder is located in the ~/airflow/dags directory. As per the requirement, I need to keep my VM project folder synchronized with the master branch, which assumes that the user deploy is able to read, write, and modify the files in our /home/airflow/dags directory. To give users permission to read from and write to the DAGs folder, run

chmod -R u=rwx,go=rwx /home/airflow/dags
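If you would rather not open the folder to every user, a narrower alternative (a sketch, assuming only deploy needs to write to the folder and everything else only needs to read it) is to hand ownership of the folder to deploy:

sudo chown -R deploy:deploy /home/airflow/dags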

GitHub:

(6) To give the VM programmatic access to the repository, you need to add a deploy key. Open your GitHub repository >> Settings >> Deploy keys, add a new key, and paste the public key you copied when you ran cat ~/.ssh/id_ed25519.pub

Make sure you check “Allow write access”.

(7) To store and access the server credentials, you can add “secrets” to your GitHub repository. To do so, go to Settings >> Secrets and variables >> Actions >> New repository secret and add the following secrets:

SERVER_IP: yourGoogleComputeEngineExternalIP

SERVER_KEY: the private key that you copied when you ran cat ~/.ssh/id_ed25519

SERVER_USERNAME: deploy
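If you are not sure which external IP to put into SERVER_IP, one way to look it up (a sketch; replace the instance name and zone with your own) is with the gcloud CLI:

gcloud compute instances describe YOUR_INSTANCE_NAME --zone=YOUR_ZONE --format='get(networkInterfaces[0].accessConfigs[0].natIP)'

Keep in mind that an ephemeral external IP changes when the VM is stopped and started again, so you may need to update the secret (or reserve a static IP).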

Google Compute Engine:

(8) Because I already have the project folder (/home/airflow/dags) on the VM, I need to attach this folder to the newly created GitHub repository.

Switch to the deploy user with sudo -i -u deploy and change into the DAGs folder with cd /home/airflow/dags, then run:

git init
git config --global --add safe.directory /home/airflow/dags
git add .
git config --global user.name "username"
git config --global user.email "email"
git commit -m "first commit"
git remote add origin git@github.com:username/githubrepo.git #to connect to your GitHub repo with SSH keys
git pull origin master --allow-unrelated-histories
git push origin master
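If the pull or push fails with an authentication error, you can check that the deploy key is being picked up by testing the SSH connection to GitHub from the VM (still as the deploy user):

ssh -T git@github.com

A successful test prints a greeting that names the repository the deploy key is attached to.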

GitHub:

(9) In the GitHub universe, GitHub Actions workflows are attached to GitHub repositories. Each repository can contain one or several workflows that can be triggered by different events. Each workflow, in turn, can include one or several jobs that define an execution environment, and each job can contain one or several steps that execute a shell script or some other action.

To create a new workflow, go to Actions >> “Skip this and set up a workflow yourself”, then paste and commit the code snippet below.

name: Remote update execution
on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - name: executing remote ssh commands using ssh key
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ${{ secrets.SERVER_USERNAME }}
          key: ${{ secrets.SERVER_KEY }}
          script: |
            cd /home/airflow/dags
            git config --global --add safe.directory /home/airflow/dags
            git checkout master
            git pull origin master --allow-unrelated-histories
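A note on the deploy script: --allow-unrelated-histories is only needed the first time the two histories are joined, so for routine runs a plain git pull origin master is enough. If you want the server copy to always match master exactly, even after a force-push or a stray local edit on the VM, one alternative (a sketch using the same assumed paths) is:

cd /home/airflow/dags
git fetch origin master
git reset --hard origin/master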

There are many repository-related and other events that can trigger your workflow, such as push, pull_request, workflow_dispatch, etc. In the current setup, I wanted to trigger the GitHub Actions workflow on push: as per the original idea, the workflow should run when a pull request is merged, but because GitHub does not offer a dedicated “pull request merged” event, and because a merged pull request always results in a push to the target branch, I will use the push event type.
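If you do want the workflow to run only when a pull request is actually merged, one commonly used workaround (a sketch, not what I ended up using) is to trigger on the closed pull_request event and guard the job with a merged check:

on:
  pull_request:
    branches: [ "master" ]
    types: [ closed ]

jobs:
  build:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    # ...same steps as in the workflow above

Also note that the pull_request trigger in the workflow above runs the deploy whenever a pull request targeting master is opened or updated, not only when it is merged, so you may want to drop it if the push trigger alone covers your needs.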

(10) Save your workflow, make some changes to your DAGs, and push the local changes to the GitHub repository. You can monitor and troubleshoot the runs under your GitHub repository >> Actions. Check the /home/airflow/dags folder on your VM to confirm the changes were reflected.
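If you prefer the terminal, the runs can also be followed with the GitHub CLI (assuming gh is installed and authenticated for the repository):

gh run list --limit 5
gh run watch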

And you are done!

The solution above definitely helps me reduce clicks when trying to synchronize my local machine and the Airflow server, but it doesn’t help me to “fail fast and recover even faster”.

To be continued…

P.S. In case you start getting SSH errors, please refer to https://github.com/appleboy/ssh-action/issues/80 for potential solutions
