PushMePullYou — Dr Dolittle had his git workflow sorted, now we do too!

This is the third in a series of posts charting the design choices, open source tools and analytical workflows that the Trafford Data Lab are adopting.

The Trafford Data Lab supports decision-making in Trafford, a local authority in Greater Manchester, by revealing patterns in data through visualisation. It is committed to publishing open data and using open source tools to encourage a transparent and reproducible analytical workflow.


This article explains how we are using Git and GitHub as an integral part of our workflow and why we have found it to be beneficial. Whilst it is not our intention to go into a full description of either, a little introduction is useful.

OK, so Git is?

Git is a piece of version control software, specifically a Distributed Version Control System (DVCS). It was written by Linus Torvalds, the creator of the Linux kernel, and is not an acronym:

GIT — the stupid content tracker

“git” can mean anything, depending on your mood.

- random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of “get” may or may not be relevant. 
- stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang. 
- “global information tracker”: you’re in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room. 
- “goddamn idiotic truckload of sh*t”: when it breaks

(from the readme in the initial commit by Linus Torvalds on 7th April 2005)

Git’s job is to track changes in files. This means that changes are documented and can be reversed if necessary (version control). Git also allows collaboration: changes can be made to files by different people and merged together. Git is not alone in providing this functionality, but it is one of the most popular systems and, importantly for us, the one used by GitHub.

…and GitHub?

GitHub is a web-based service for hosting repositories (essentially containers for projects) tracked by Git. By default the repositories (or repos) and the files and folders within them are publicly accessible, so everything there can be inspected, commented on, collaborated on, or downloaded by anyone*. This was the primary reason for choosing GitHub over other Git hosting providers. It was launched in 2008 by Tom Preston-Werner, Chris Wanstrath, and PJ Hyett.

* At the time of writing, you can have private repositories and/or grant and refuse collaboration permissions on public ones, but you need to pay for that service.

Our workflow decisions

A common situation in public sector organisations is for teams, or even individuals, to work in silos: isolated from others, neither sharing their experiences nor benefitting from those of others. Data and information are often duplicated, stored in fragmented ways and accessible only to the departments to which their creators belong.

The aim of the Trafford Data Lab is to be open in everything we do, and to provide reproducible outputs so that others can see not just what we did but also how we did it, and try it for themselves. In other words, the complete opposite of the siloed working environment just described. To achieve this we needed to make some decisions about the tools we would use and how we were going to work as a team:

1. Handle our data cleaning, manipulation and visualisation in R. This allows us to create outputs which are reproducible and transparent: anyone can inspect the code. Think of the number of times you’ve returned to a spreadsheet only to wonder where you got the information from, what steps you took to manipulate it and why, and you will understand the value of this.
2. Write documents in Markdown. It’s fast, consistent and open, it can be output to PDF and HTML, and it renders nicely on GitHub, where it is used extensively for README files like this one for our IMD app.
3. Use Git. This means all our files, whether code or documents, are version controlled and auditable. This prevents the following scenario, which I’m fairly sure you will have encountered at some time:
- some_data.csv
- some_data_BACKUP.csv
- some_data_Nov_2017.csv
- some_data_Nov_2017_revision2.csv
 
4. Host on GitHub. This allows us to collaborate and ensures all of our outputs are publicly accessible and obtainable.

GitHub also offers a number of other advantages, such as previewing certain file types. For example, upload a CSV file and GitHub will display it in a familiar spreadsheet format. Similarly, upload a GeoJSON file and GitHub will helpfully display the data on a map! This is one of the reasons we have taken the decision to upload open data in a variety of formats: CSV for data analysts, JSON for developers and GeoJSON if the data is spatial. For example, here are the open data files we created showing Jobcentre Plus locations within Greater Manchester in CSV and GeoJSON formats. (This dataset was created as part of our involvement with the OpenGovIntelligence project, the website of which, incidentally, is also hosted on… you guessed it… GitHub!)

Finally, the Dr Dolittle joke explained

When you start learning Git you will quickly be introduced to various commands such as “push” and “pull”, which are used to publish changes you have made or obtain those made by others. You will also learn that there are various ways to use Git, especially when working with others (Git workflows). Some of these are quite complicated and are suited to activities such as software engineering with medium to large teams; others are much simpler. Here is ours and the reasons why it works for us.

To understand what follows you will need to read at least one of the many Git tutorials available, such as this one from GitHub, this one from Atlassian (who provide Bitbucket, another popular Git hosting service), or this one from Alice Bartlett of the Financial Times. Additionally, you need to decide how you are going to interact with Git: via the command line, or through GUI tools such as GitHub Desktop. The examples shown use the command line. Importantly, of course, you also need to have Git installed on your machine! That is beyond the scope of this article, but there are many guides available; just search for one covering your operating system.

Working locally

The first aspect of our workflow is that we mainly work locally, i.e. on our own machines rather than directly on GitHub via its web interface (except for creating repositories and making minor edits to README files). Whether creating a new repository or working with an existing one, we clone it (which effectively means taking a copy) to our individual computers. You can clone our open_data repository simply by executing the following command:

git clone https://github.com/traffordDataLab/open_data.git

The local copy of the repository can now be worked on independently of anyone else, but it retains a link to where it came from so that any changes made locally can be applied to the version on GitHub, known as the origin.
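If you would like to see that link to the origin for yourself, something like the following can be tried. This is only a minimal sketch and not part of our actual workflow: it assumes a throwaway bare repository in a temporary directory as a stand-in for GitHub, so that it runs without network access, and the directory names are purely illustrative.

```shell
# Sketch only: a throwaway bare repository stands in for GitHub,
# so nothing here touches the network. All paths are illustrative.
set -e
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"

# Clone it, just as you would clone a repository hosted on GitHub.
# (Git warns that the repository is empty; that is expected here.)
git clone --quiet "$work/remote.git" "$work/local"
cd "$work/local"

# The clone remembers where it came from under the name "origin".
git remote -v
```

Running git remote -v inside a real clone, such as the open_data one above, should list the GitHub address in the same way.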

If working on existing files, before doing any new work we ensure we have the latest version of the files by using the git fetch command, followed by git status. The result tells us whether there have been changes made which we need to download. If so we do this by using git pull. As the name suggests, this pulls any changes from the repository on GitHub into our own local repository.
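The fetch, status and pull routine can be sketched end to end as follows. Please treat it as an illustration under stated assumptions rather than a record of our real repositories: a bare repository in a temporary directory plays the part of GitHub, a second working copy plays the part of a colleague, and the file name, commit messages and user details are all placeholders.

```shell
# Sketch only: local stand-ins for GitHub and a colleague; names are placeholders.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.org"

# Setup: a colleague's repository with one commit on master.
git init --quiet "$work/colleague"
cd "$work/colleague"
git symbolic-ref HEAD refs/heads/master   # use the branch name from this article
echo "id,name" > some_data.csv
git add some_data.csv
git commit --quiet -m "Add some_data.csv"

# Setup: the bare stand-in for GitHub, and our own clone of it.
git clone --quiet --bare "$work/colleague" "$work/remote.git"
git remote add origin "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/mine"

# The colleague publishes a change after we cloned.
echo "1,Trafford" >> some_data.csv
git commit --quiet -am "Add first row"
git push --quiet origin master

# Our routine before starting any new work:
cd "$work/mine"
git fetch --quiet   # download information about changes on the origin
git status          # reports that our branch is behind origin/master
behind=$(git rev-list --count HEAD..origin/master)   # commits we are missing
git pull --quiet    # bring those changes into our local repository
```

After the pull, the local copy of some_data.csv contains the colleague's new row.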

Branching out on your own

The second aspect of our workflow is the use of branches. A good way to understand these is to think of a train track. When you create a repository in Git an initial branch is created called master. A branch in Git is where work is done. Using our analogy of train tracks, each time changes are made and committed to the branch (using the git add and git commit commands), a new piece of track is laid. If everyone worked on the same line we would risk interfering with each other’s work. Additionally, we wouldn’t want a piece of track to be laid that wasn’t fully completed, otherwise the train could be derailed. Developments are therefore best done away from the master branch until the changes are completed, tested and ready to be added to it.

We therefore create a new branch on which to make our changes using the git branch <branch name> command, and switch to it (git checkout <branch name>) to start working. Once our changes have been made and we have checked them to ensure everything works correctly, we switch back to master and merge the changes into it (git merge <branch name>) before pushing the changes back to GitHub so that they are publicly accessible: git push origin master.
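Put together, the branch, commit, merge and push steps might look something like this. It is a sketch under assumptions, not a transcript of our own sessions: a bare repository in a temporary directory stands in for GitHub, and the branch name add-rows, the file name, the commit messages and the user details are all hypothetical.

```shell
# Sketch only: a local bare repository stands in for GitHub; names are hypothetical.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.org"

# Setup: a stand-in remote holding one commit on master, plus our local clone.
git init --quiet "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master   # use the branch name from this article
echo "id,name" > some_data.csv
git add some_data.csv
git commit --quiet -m "Add some_data.csv"
git clone --quiet --bare "$work/seed" "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/local"
cd "$work/local"

# 1. Create a branch for the new work and switch to it.
git branch add-rows
git checkout --quiet add-rows

# 2. Make changes and commit them to the branch (a new piece of track).
echo "1,Trafford" >> some_data.csv
git add some_data.csv
git commit --quiet -m "Add first row"

# 3. Once checked, switch back to master and merge the branch into it.
git checkout --quiet master
git merge --quiet add-rows

# 4. Publish the updated master branch to the origin.
git push --quiet origin master
```

Until step 3, master is untouched, which is what keeps it in a complete, working state while changes are in progress.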

Although this methodology involves more steps than simply editing and developing on the master branch, it allows for greater flexibility and is safer, since the master branch always contains complete, working files. This is a somewhat simplified software development approach, but is equally applicable to the outputs we are producing, be they open data, infographics, profiles or apps etc. Using Git for version control in a non-software development environment is also discussed by Jennifer Bryan in the article Excuse me, do you have a moment to talk about version control?

As I mentioned earlier, there are many different workflow types using Git that can be adopted. This git workflows article from Atlassian provides a good overview of the different types and their respective advantages.


Written by James Austin, Trafford Data Lab


References

Bryan J. (2017) Excuse me, do you have a moment to talk about version control? PeerJ Preprints 5:e3159v2 https://doi.org/10.7287/peerj.preprints.3159v2