Towards open health analytics: our guide to sharing code safely on GitHub
In September 2019, our team at the Health Foundation took a big step towards making our analytics more transparent and reproducible: we launched our GitHub Page and by default, made all our code repositories public. Since then, we’ve uploaded more code, welcomed contributions from external collaborators, and added new projects in progress. And another team at the Health Foundation, the Improvement Analytics Unit, has joined us by opening up their GitHub profile too. Along the journey we’ve learned a great deal and so, we’d like to share our experiences to inspire others.
Launching our GitHub page is part of a wider strategy to make our work more open and collaborative, with the twin aims of tackling bigger issues in analytics and sharing learning. We hope this will help accelerate innovation in analytics and data-driven technologies across the health and care ecosystem.
Building on others’ work
As a team, we are lucky to stand on the shoulders of open data science giants. What we’ve achieved so far has only been possible because we can access an incredible range of open-source tools and the support of individuals and communities, who have done lots of hard work already. We owe much of our progress to resources and advice from the NHS-R community, the Turing Way and the wider open data science community on Twitter and Stack Overflow.
We know this is only the start of our journey towards open analytics, but we’ve been encouraged by the positivity of the data community. We’ve had many inspiring conversations with health care analysts around the UK and have lots of ideas about where we might head next.
Sharing our insights
Sharing code is rapidly becoming the norm in many research fields. For example, NHSX — the unit leading on digital transformation in the NHS — has made open code and open-source technology part of its core strategy to create better technology.
Yet, organisational and cultural barriers still exist for many analysts in health and care, and many struggle to access open-source tools and training. This might keep the NHS from realising the full potential of its data and its analytical workforce.
To encourage others, we want to share how we got started on GitHub and now routinely share our code. This blog will cover:
- the unexpected benefits of sharing our code
- how we got started
- building a welcome page
- the challenges of working with sensitive data
- principles and R tools for sharing code safely
- our plans for the future.
The advantages of sharing code
Working transparently increasingly enables others to scrutinise, validate and use our work, which might also reduce duplicated efforts (‘analytical waste’). Though some of our projects still contain code for closed source software and might not be reusable as is, sharing it creates a permanent record of the analysis and can be a useful starting point for others. And while it has been hard work to get to this point, we have also noticed some unexpected benefits to us as analysts and to how we work as a team.
It is now much easier to share our work internally, keep up with each other’s progress, and celebrate milestones and coding successes. In addition, knowing that our code is likely to be seen by others has given us fresh motivation to improve suboptimal coding and project management habits (we all have them!) and fired our enthusiasm to present our work and share our skills and knowledge.
Another great feature of our public ‘code shop window’ is that both internal and external contributors are now automatically and visibly credited for their work.
Setting up a GitHub profile: a checklist
When you work with sensitive health care data, sharing code has particular challenges, which we will explore later. However, it’s too easy to get stuck figuring out the basics before you even get to that point.
To help you get up and running, here’s a checklist similar to ours:
✔️ Figure out what type of GitHub account you need. As a team, we needed a paid organisation account. By default, these come with five members included and on top of this, you can add external collaborators to public repositories. Remember, there are discounts for non-profit organisations and academic institutions.
✔️ Get buy-in from stakeholders and decision-makers. If you need advice, check out the arguments for collaboration and pooling of analytical resources from The Strategy Unit’s draft Analysts’ Manifesto and our recent report on investing in health and care analytics. What also helped us build a case was gathering advice and agreement on the open-source licensing options, the account budget and set-up, any website copy and our plans to share the news about the launch.
✔️ Decide on an open-source code license. All code shared on GitHub is protected by copyright law by default, meaning that the creator retains exclusive rights. In order for others to use or modify the code, you must add a licence specifying what they can or cannot do with it. Ideally, this will also protect developers and contributors from litigation by limiting liability and warranty. We found Karl Broman’s R package primer, Hadley Wickham’s R packages and tldrlegal.com useful to learn about the different licensing options. We also chose the MIT Licence for our current projects because we want our code to be widely and easily used by others. However, a different licence might be a better fit for you and your team and might depend on the needs of your organisation.
✔️ Create a GitHub account.
✔️ Build navigable and user-friendly project templates. The README is the first file in the project someone will read and should cover what the code does, why it is worth looking at, and serve as a manual for the code. README files are written in markdown syntax, which is very easy to pick up if you don’t already use it. Stick to a consistent README structure across projects to make things easier for yourself and others. This README template has worked well for us so far.
✔️ Create repositories and start committing code. Before we went public, we found it easiest to start by adding existing code to private repositories. Once we were comfortable, we shared work-in-progress code. For R users, Jenny Bryan’s Happy Git and GitHub for the useR is a good place to get started.
Optional: Give your account a public face with a GitHub page
GitHub also provides organisations with the option to host a free public website, which can introduce your team, give an overview of the work or even describe and link to individual projects. Depending on the level of customisation, setting up a GitHub page might require basic knowledge of CSS and HTML (or alternatively, advanced googling skills), but the results are certainly worth it. For us, this involved the following steps:
✔️ Design the content and write the copy.
✔️ Decide on a theme. There are a number of supported themes to choose from. We used Steve Smith’s Minimal theme and customised it according to our needs.
✔️ Create a repository. Call it ‘username.github.io’ and follow GitHub’s guidance to add a theme and publish the website.
✔️ Add contents to the README.md file in markdown syntax.
For inspiration, visit our team’s GitHub page and the repository containing the source code.
Once we had the basic infrastructure in place, we also needed content for our repositories — the code! As we often work with confidential health care information, a priority was to develop processes that enabled us to share code while maintaining data security.
Keeping sensitive data safe
Much of our in-depth analytics work relies on pseudonymised patient data, which we hold and access exclusively on our secure server. Any outputs, for example statistical results or code, are checked before leaving the secure server to ensure they do not contain any confidential information or could be used to re-identify patients.
This process, also known as Statistical Disclosure Control (SDC), is performed by colleagues who are not involved in the project themselves and who have additional training in data safety. There are comprehensive guidelines on how perform disclosure checks on statistical results, which were developed by a group of Safe Data Access Professionals. Depending on the complexity of the analysis, this process can take a fair amount of time.
Principles (and our favourite R tools) for sharing code safely
In addition to our robust guidelines on preparing statistical results for disclosure checks, we have also developed the following best practice methods to safely open up our code:
1. Keep server architecture confidential: don’t release absolute file paths
To minimise any security risks to our secure data server, we make sure our code does not contain any absolute file paths, for example to access raw data. First, we figured these wouldn’t be useful to anyone else and second, it meant we could avoid disclosing the layout of our secure server.
In practice, we had to find a convenient way to refer to file locations in our scripts without using explicit paths. Unfortunately, this wasn’t as easy as setting a default working directory through an R Studio projects (although we would recommend using them anyway), as raw data is often stored in a separate location. We are now using a combination of two approaches:
- The here package 📦 to refer to files within the project directory. This is preferable to using relative file paths, as it works from within subdirectories and because the file paths will be generated correctly on other operating systems.
- A separate script containing absolute paths to locations outside the project directory, which can be sourced but is never released from our secure server.
2. Keep code clean: don’t refer to raw data or statistical results
Potentially the biggest confidentiality risk when releasing code is referring to either the raw data or aggregate results in the comments. Although generally considered bad practice, it is easy to fall into this bad habit and to copy-paste any outputs into a comment while working on a particularly tricky bit of analysis. Once disclosive comments are in the code, it’s easy to overlook them later.
We believe that sticking to good habits right from the start is the best way to manage this risk. Knowing that the code will definitely go through Statistical Disclosure Control has helped us to think twice about what we put in the comments. This, in turn, helps our output checkers as they don’t have to spend as much time going through the code.
As a result, it’s also motivated us to write more concise, readable and self-explanatory code. A tool that we can highly recommend here is the tidylog package 📦. Simply loading the package will print feedback on dplyr and tidyr operations in the R console. In this way, it often eliminates the need for hard-coded checks of these operations.
3. Be kind to colleagues: minimise friction during disclosure control
Releasing code on a regular basis could increase the time our colleagues spend on Statistical Disclosure Control and their overall workload. To minimise this risk, we make reading and checking our code as straightforward as possible. This, of course, starts with the point we made in the last section (no disclosive comments!), but there are several other things we do:
- A consistent and logical structure for individual scripts and for projects; including splitting analysis workflows into smaller portions, rather than working in one enormous script.
- Increasing readability through consistent formatting and indentation. We found the tidyverse style guide very useful, as well as the Collin Gillespie’s guide to RStudio features that can automatically reformat and indent code. We are also experimenting with tools such as the styler package 📦 that can flexibly re-format whole scripts or projects.
We are still working on finding the right balance between the added value of openness, by frequently releasing and sharing code in active development, and the workload this creates for our output checkers. At the very least, we’d like to partially automate this process in the future, and we’d welcome any suggestions for existing tools that could help us.
Next steps
While it has been a great experience so far, we have a long list of things left to figure out. These include:
- How to best version-control clinical code lists, such as Read or ICD-10 codes to identify a diagnosis of interest, and reference files, such as lookup tables for geographies. Once published, databases such as the University of Manchester’s ClinicalCodes repository are a good option, but we also want to keep track of them internally. In the future, we might wrap these up into data R packages, rather than using CSV files.
- How to encourage and manage contributions from the wider community. A first step is to add guidelines for external contributors and a code of conduct to each of our repositories.
- How to find a balance between ensuring that each analysis is self-contained (everything in one place) and avoiding unnecessary duplication of code. Part of the solution will be to move from isolated scripts towards reusable, generalisable elements and to automate some data cleaning and routine analyses (such as the pipeline for Hospital Episode Statistics data we are currently working on).
- How to capture the computational environment, as well as the code, to move towards better reproducibility.
We will continue to share our progress. In the meantime, let us know what you think about our GitHub profile and get in touch with your approaches, experiences and the challenges you have faced in sharing code.
This blog was written jointly with Karen Hodgson (@KarenHodgePodge) and Sarah Deeny (@SarahDeeny). All three of us are part of the Data Analytics team at the Health Foundation.