Github Actions in Action 🎬

sptkl · Published in NYC Planning Tech · Jul 17, 2020 · 4 min read

A CI Passing Badge

Starting in 2020, the Data Engineering team committed to publishing PLUTO on a monthly cadence. Despite massive efforts to modernize the process using SQL and Python, we still faced challenges in operationalizing PLUTO production, especially in data ingestion, geocoding, and execution. Github Actions, a Continuous Integration / Continuous Deployment (CI/CD) service, completely transformed our data production experience and greatly improved our development efficiency.

Data Production workflows are complex, and data engineers constantly run into issues such as:

  • Who should be running a data pipeline workflow? Engineer A or B?
  • How do we run it? Should I wait for this script to finish, or go get coffee?
  • Where do we run it? Which server/database are we talking about here?
  • When do we run it? When the code is finalized, or when all the datasets are ready to go?

It might seem obvious that some kind of CI/CD tool could help us. However, most CI/CD tools are built for software production. Data Production is unique in that it is much more time consuming, less portable, and its different procedures require different computing environments. Github Actions provides enough flexibility that we are able to address all of the above issues:

  • Who? Since Github Actions is orchestrating and running the workflows, this question is no longer relevant.
  • How? Workflows are defined in .yml files, which let you specify each step of the production process, environment variables, and container environments.
  • Where? Github Actions can spin up an Ubuntu VM for each workflow. We also have the option to use our own servers (self-hosted runners) as the workflow environment.
  • When? An action workflow can be triggered by different Github events, such as a git commit, the opening of an issue, the creation of a release, etc. For routine tasks, we can also set up cron jobs (see the sketch below).
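
For example, a minimal trigger section might combine a push trigger with a monthly cron schedule. This is only a sketch; the workflow name, branch, and schedule here are illustrative, not our actual configuration:

```yaml
# .github/workflows/example.yml (illustrative name)
name: Example data workflow

on:
  push:
    branches: [master]        # run on every commit pushed to master
  schedule:
    - cron: '0 6 1 * *'       # also run at 06:00 UTC on the first of every month

jobs:
  build:
    runs-on: ubuntu-latest    # Github-hosted Ubuntu VM; a self-hosted runner also works here
    steps:
      - uses: actions/checkout@v2
      - run: echo "run the data pipeline here"
```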

Let’s take PLUTO production as an example:

1) Geocoding in Github Actions

One of PLUTO's core source datasets is the DOF Property Tax System (PTS) file. PTS is a mainframe data extract. To make it compatible with PostgreSQL, we have to use a shell script to clean the data, load it into Postgres, and run it through Geosupport.

We start by defining the workflow. As sketched below, we tell Github Actions that this workflow is triggered by a git commit, but should run only if [pts] is mentioned in the commit message. If the workflow is not skipped, it runs in a container built from the nycplanning/docker-geosupport:latest Docker image, with the environment variables it needs.
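
A minimal sketch of what that trigger and container configuration might look like; the job name and secret names are illustrative, not our actual values:

```yaml
name: Process PTS

on: push

jobs:
  process_pts:
    # Skip the whole job unless "[pts]" appears in the commit message
    if: contains(github.event.head_commit.message, '[pts]')
    runs-on: ubuntu-latest
    container:
      image: nycplanning/docker-geosupport:latest
    env:
      # Illustrative secret names: database connection string and object storage credentials
      BUILD_ENGINE: ${{ secrets.BUILD_ENGINE }}
      DO_ACCESS_KEY_ID: ${{ secrets.DO_ACCESS_KEY_ID }}
      DO_SECRET_ACCESS_KEY: ${{ secrets.DO_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v2
      # ...dependency, cleaning, and geocoding steps follow (sketched below)
```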

We then define the steps that clean the PTS files. We first check out the code and install software dependencies, including the psql cli, which allows us to copy files into our Postgres database, and the minio cli, which allows us to pull the raw PTS file from our object storage.
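
Continuing the sketch above, and assuming the container image is Debian-based, those dependency steps could look something like this (the storage alias and endpoint variable are illustrative):

```yaml
      # psql cli: lets us \copy cleaned files into our Postgres database
      - name: Install psql
        run: |
          apt-get update
          apt-get install -y postgresql-client

      # minio cli (mc): lets us pull the raw PTS extract out of object storage
      - name: Install minio client
        run: |
          curl -O https://dl.min.io/client/mc/release/linux-amd64/mc
          chmod +x mc && mv mc /usr/local/bin/
          mc alias set spaces $DO_ENDPOINT $DO_ACCESS_KEY_ID $DO_SECRET_ACCESS_KEY
```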

Once the dependencies are in place, we clean the PTS file and load it into our database. Because we are running in a docker-geosupport container, we can use the python-geosupport bindings to quickly geocode PTS and load the geocoded records back into our database.
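
The final steps might look like the sketch below; the file names, shell script, and Python script are illustrative stand-ins for the actual cleaning and geocoding code, which relies on the python-geosupport bindings baked into the container:

```yaml
      # Pull the raw PTS extract, clean it, and load it into a staging table
      - name: Clean and load PTS
        run: |
          mc cp spaces/data/dof_pts.txt .                  # illustrative bucket path
          ./bash/clean_pts.sh dof_pts.txt > pts_clean.csv  # illustrative cleanup script
          psql $BUILD_ENGINE -f sql/create_pts.sql         # illustrative table definition
          psql $BUILD_ENGINE -c "\copy dof_pts FROM 'pts_clean.csv' CSV;"

      # Geocode the loaded records with python-geosupport and write the results back
      - name: Geocode PTS
        run: python python/geocode_pts.py                  # illustrative script name
```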

Once a PTS processing action is triggered, we can click into the Actions tab to observe our workflow status.

You can find all the workflows that ran or were skipped under the Actions tab. For each action, you can see the commit, author, and branch info that triggered it.
Clicking into an action, you can see logs, error reports, and time elapsed. For a running workflow, the logs are updated in real time, similar to terminal outputs, but in a browser 🎉

2) The PLUTO Pipeline

Similar to PTS processing, PLUTO follows a waterfall procedure that includes the following steps, sketched in the workflow excerpt after this list (you can find the workflow file here):

  1. Install dependencies (psql cli, minio cli, zip, curl).
  2. Data loading (pulling 41 different data sets, including PTS, and the geocoded PTS).
  3. Build (which creates the PLUTO table using all 41 source tables).
  4. Apply research corrections.
  5. Archive output and perform QAQC.
  6. Generate PLUTO and MapPLUTO Shapefile and FileGDB, then push to DigitalOcean Spaces object storage.
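
As a rough sketch of how those steps map onto a workflow file (the step names and script paths below are illustrative, not the actual PLUTO workflow):

```yaml
jobs:
  build_pluto:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies                  # psql cli, minio cli, zip, curl
        run: ./bash/01_install_dependencies.sh      # illustrative script names throughout
      - name: Load source data                      # 41 datasets, including the geocoded PTS
        run: ./bash/02_dataloading.sh
      - name: Build PLUTO                           # create the PLUTO table from the source tables
        run: ./bash/03_build.sh
      - name: Apply research corrections
        run: ./bash/04_corrections.sh
      - name: Archive output and run QAQC
        run: ./bash/05_archive_qaqc.sh
      - name: Export and publish                    # Shapefile + FileGDB, pushed to DigitalOcean Spaces
        run: ./bash/06_export.sh
```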

As you might have noticed, these processes are not complicated. Previously, however, we lacked a standardized procedure and a stable execution environment, so each step had the potential for errors.

For example, when we executed scripts directly on our server, if we closed the terminal window or lost our internet connection, production would still proceed, but we would lose all the logs, making the process untraceable. If another data engineer picked up the work, different server configurations and different environment variables could easily turn an easy update into an endless debugging session.

This back-and-forth rerunning of scripts created a lot of confusion for data engineers, introduced uncertainty into our data quality, and added maintenance overhead.

Using Github Actions, we were able to overcome a lot of these challenges, which had slowed down our work considerably. Now we are able to operationalize data production, improve data quality, and increase our publishing frequency. Most important of all, we are freeing up more time for our data engineers to explore opportunities in developing new open data products and to create better information access.
