IDinsight Blog

Verify your configs using GitHub Actions

How (and why) we use GitHub Actions to help us deploy new surveys even faster.

By Jeenu Thomas and Sid Ravinutala

At IDinsight, we do a lot of surveys. Informing decisions often means collecting primary data. For example, last year we ran 21 survey campaigns in which we reached over 100,000 respondents. With our Data on Demand service, we are able to run representative surveys substantially faster than average. One way we do this is by deploying the right tools to optimize and automate processes. In this post, we discuss how we used GitHub Actions to do automated configuration checks, and show how you can set this up for your own project.

To get us from raw data collection to insights significantly faster, we built an internal system called SurveyStream: a set of data pipelines built on Apache Airflow and PostgreSQL and hosted on Amazon Web Services. SurveyStream makes it easy for the Data on Demand team to spin up a new survey, and it comes with some nice bells and whistles like automated emailing of task lists to enumerators, data quality reports, and enumerator productivity reports. Look out for another blog post on the details of SurveyStream and where it is headed. For now, here’s a high-level diagram of what it currently looks like.

High-level diagram of SurveyStream Interfaces

Configuring a new survey

You don’t need to write any code to set up a new survey in SurveyStream. But you do need to provide the details of the new survey — like the SurveyCTO form ID, the frequency of sending out assignment emails, and which data quality checks to run. We plan to build a web front-end where our teams can enter these details into a nice interface and click a button to set up the survey systems. While we build this front end, we ask our survey teams to fill in a set of JSON files[1] that contain all the required configuration information.

JSON files are great for a lot of reasons (see Box 1), but they are easy to get wrong when written by hand: you might leave out a tag, or make a typo in a tag name.

Though a number of survey teams are quite technically savvy, we can’t demand that all of them use linters or static checkers to catch these errors. Then there are things that a linter would not be able to check: ensuring mandatory fields are not empty or checking that the correct data type (for example a valid date in a date field) has been entered.

A lot of time was lost correcting these issues. We might fix one only to discover the next. The Data Engineer would end up being the bottleneck in getting the survey set up. All in all, it was a time-consuming and frustrating process.

An example of a JSON configuration file
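A minimal illustrative config along these lines — with hypothetical field names, not SurveyStream’s actual schema — might look like:

```json
{
  "surveycto_form_id": "hh_survey_round2",
  "assignment_email_frequency": "daily",
  "survey_start_date": "2021-06-01",
  "data_quality_checks": ["duplicates", "outliers"]
}
```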

Enter GitHub Actions

Steps 3 and 4 were the real bottleneck in the process. Waiting on the Data Engineer to get around to reviewing the JSON files took a while. And it also meant that she was linting JSON files at the expense of working on more complex problems.

We decided to automate a lot of this checking using GitHub Actions. Here’s how we replaced Steps 3 and 4.

The pull request triggers an action that runs a series of tests on the JSON files. It checks to see if the mandatory fields were filled, if the correct data type checks were used, and some additional logical consistency checks. And finally, it “pretty formats” the JSON files to make them more readable. All of these checks are run on the GitHub server, so every Survey Team Member gets to run it on an identical environment — without having to set anything up on their own machine.
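The checks themselves are ordinary validation code. Here is a minimal sketch of the idea in Python — the field names and rules are illustrative, not SurveyStream’s actual checks:

```python
# Sketch of the kind of checks run on each config file.
# Field names ("form_id", "start_date", "email_frequency") are
# illustrative, not SurveyStream's real schema.
from datetime import datetime

MANDATORY_FIELDS = ["form_id", "start_date", "email_frequency"]


def validate_config(config: dict) -> list:
    """Return a list of human-readable error messages (empty if valid)."""
    errors = []
    # Mandatory fields must be present and non-empty
    for field in MANDATORY_FIELDS:
        if not config.get(field):
            errors.append(f"Mandatory field '{field}' is missing or empty")
    # Data type checks, e.g. a valid date in a date field
    start_date = config.get("start_date")
    if start_date:
        try:
            datetime.strptime(start_date, "%Y-%m-%d")
        except ValueError:
            errors.append(f"'start_date' is not a valid YYYY-MM-DD date: {start_date!r}")
    return errors
```

A step in the workflow can run a script like this over every changed file and fail the job if any errors come back, which is what surfaces the near-instant feedback on the pull request.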

Message shown on successful completion of tests

The Survey Team Member gets near-instant feedback on the JSON files. By taking the Data Engineer out of the loop, errors are fixed faster.

How to set up GitHub Actions

Here’s our entire workflow:

A simple GitHub Action workflow file
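As a text sketch of a comparable workflow file — the action names match the community repos we use (trilom, actions, pre-commit), but the versions and Python version here are illustrative:

```yaml
name: pre-commit

on:
  pull_request:
    branches: [master]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the job can see the JSON files
      - uses: actions/checkout@v2
      # Set up Python for the validation hooks
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      # Collect the list of files changed in this pull request
      - id: file_changes
        uses: trilom/file-changes-action@v1.2.4
        with:
          output: ' '
      # Run our pre-commit hooks (validation + pretty-formatting)
      # on just the changed files
      - uses: pre-commit/action@v2.0.0
        with:
          extra_args: --files ${{ steps.file_changes.outputs.files }}
```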

Before we get into how to set one of these up, you need to be familiar with the terminology.

Workflow: A series of automated jobs that need to be run in order to achieve some business logic. They can be triggered by GitHub events like a push or a PR being raised. Each workflow is defined in its own YAML file that you can name whatever you like. In our case, we created a workflow called “pre-commit”, in a file called `pre-commit.yml`, and it was triggered on a pull request to the master branch.

Jobs: A workflow can consist of multiple jobs. Each one runs in its own virtual environment. We have only a single job, called “validate”, and it runs on an Ubuntu virtual machine.

Steps: Each job is made up of steps, and each step is an individual task. It could be a shell command or an “action” (see below). Since all the steps within a job run in the same environment, they can share data — so the output of step 1 can be used as input to step 2. As you can see in the screenshot above, we have only seven steps.

Actions: This is the smallest block of logic and usually runs a single command. The GitHub community has created a bunch of actions that you can use. We are using ones from “trilom”, “actions”, and “pre-commit” repos.

That’s all we need right now for the basic setup.

Quick start guide

1. Click on the Actions tab of your GitHub repository.

2. The screen will show a handful of suggested workflows and some commonly used CI/CD workflows[2]. The suggestions are based on the languages that make up the repository. Select an option closest to your intended workflow or click on “set up a workflow yourself”.

3. This opens a YAML file in edit mode. Based on the option you selected in the previous step, it will be filled with some details to help you get started.

In all workflow files, you will find blocks that answer the questions of when (`on`) and where (`runs-on`) the workflow runs.

When — is configured at the level of the workflow. In this example, the workflow runs on a push or pull request to the master branch. You can select a trigger from a wide range of GitHub events, like creating a branch or closing an issue. You can also run the workflow periodically on a schedule, trigger it manually, or use a combination of these options.

Where — is configured at the level of jobs. Each workflow is made up of one or more jobs, which in turn are made up of one or more steps. Jobs run in parallel by default, but they can be run in sequence if needed. Each job runs in its own environment, configured using the `runs-on` option. In our case, the job called “validate” is set to run on an Ubuntu machine.
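Putting when and where together, the skeleton of such a workflow file looks like this (the workflow and job names are just examples):

```yaml
name: pre-commit

# When: trigger on a push or pull request to the master branch
on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

# Where: each job declares the environment it runs in
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
```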

4. The next stage is defining the sequence of steps to be run within each job. In order to fill out the steps, you can use the panel on the right to look for publicly available Actions.

Each step in the workflow either runs a shell command or uses a pre-built Action. There are Actions available for a number of activities, like checking out a repository, setting up a Python environment, posting Slack messages, and so on. For example, you can use a File Changes Action to get the list of files that changed in a Pull Request and use this output to selectively test and deploy the changed files in subsequent steps.
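As a sketch of that pattern — the `output: ' '` input and `files` output follow the trilom/file-changes-action documentation at the time, so treat the details as assumptions:

```yaml
# Get the changed files, then run a check on just those files
- id: file_changes
  uses: trilom/file-changes-action@v1.2.4
  with:
    output: ' '  # space-separated list instead of the default JSON
- name: Validate changed JSON files
  run: |
    for f in ${{ steps.file_changes.outputs.files }}; do
      case "$f" in
        *.json) python -m json.tool "$f" > /dev/null ;;  # fails the job on invalid JSON
      esac
    done
```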

Instead of using these pre-built Actions, you may choose to run shell commands directly, or create and use your own action. All you have to do is string these steps together in the right order, with the required arguments and dependencies. Notice that we are using just five pre-built Actions in our workflow and only a couple of our own commands. Thanks, GitHub community!

5. When the YAML file is ready, commit it to the `.github/workflows/` folder in the repository.

And that’s it — you are all set to run and test your first GitHub Action!

What’s next

Once we have a front-end, a lot of these business rules can be built into it and the survey team will never have to think about JSON files again. But while the front-end is being built, the show must go on.

We have also been exploring using GitHub Actions for other CI/CD actions like deploying code to a cloud platform, building images or packages from code when changed, and running scheduled workflows. Look out for future blog posts on this.

References & Additional Resources

Interested in souping up your surveying process? Contact us at dataondemand@idinsight.org to talk about how our data systems can save time and increase quality.

[1] Why JSON? The final solution will have a front end anyway, and we want to take advantage of the Postgres JSONB data type to store this config.

[2] Repository with GitHub starter-workflow suggestions: https://github.com/actions/starter-workflows
