Tutorial 1/3: Starting a Reproducible Coding Project with Popper

Wolfgang Traylor
Published in getpopper
Mar 30, 2021 · 21 min read

Introduction

Welcome to the first of a series of tutorials on coding reproducible experiments and analyses with Popper. In easily digestible steps I will guide you through composing an example project from scratch. We will use Git, Bash scripts, R, and Docker. With REUSE we will follow best practices for licensing our project.

As researchers, we write code for scientific progress. We are eager to place our little contribution on top of the ever-growing tower of scientific knowledge. What keeps this tower strong and healthy is reproducibility. It is like mortar between the bricks: the glue that keeps the tower from falling apart.

Popper is a tool to achieve reproducibility for computational experiments and analyses. Have you ever tried to run an old script of yours again … only to see a stream of error messages? But hadn’t it worked just fine a few months ago?! I find it hard to admit, but I know that situation … and it got me thinking. If we struggle to reproduce our own experiments after just a few months, what will happen in a decade, or longer? To me, it hurts to think that my hard work of today might be obsolete in a few years, just because technology has changed and I didn’t follow best practices. We like to think that research is just about increasing knowledge and understanding, but a lot of it has to do with not forgetting. How, then, can we hold back that tide of ignorance and obsolescence?

Popper helps us keep our research useful for others and meaningful into the future. You don’t need to think in decades here, but just that a reviewer can re-run your analysis, or that the next PhD student can continue your project. Popper is a tool to execute a computational experiment or analysis in a so-called container. You can picture a container as an isolated “capsule” in your computer that provides the ideal environment for your scripts to thrive. This environment is clearly specified and very controlled. That makes your project portable across systems and reusable into the future.

In order to be reusable, the products of your work need to be explicitly licensed. If you don’t attach a license, nobody else is allowed to copy, distribute, or modify your work. That kills collaboration and scientific progress. Therefore I consider licensing (preferably as open source) an essential aspect of reproducibility and will place special emphasis on it.

In this series of tutorials I will guide you through a very minimalistic example project. The content is irrelevant; only the framework is important. I will show you the structures and workflows that have worked for me. Once you have grokked the principles, you will be able to apply them to your own project and adapt them to your needs.

My academic background is in ecology and computer science. Many of my fellow ecologists write software and scripts on a daily basis, but have never received formal training in software development. Therefore I will explain snippets of coding best practices here and there, which may help you compose a clean project.

In the spirit of Agile development, I will build up the project incrementally. At the end of each tutorial, the project will be complete and functional, with room for improvement, of course. In this public Git repository you will find a branch for each tutorial, reflecting the state of the example project at the respective step.

Now, let’s get started!

Prerequisites

In order to follow this tutorial, you need the following:

A Linux system. You will need to have root/sudo permissions for installing and using Docker. On Windows (10), the Windows Subsystem for Linux (WSL) might work fine, too, but I haven’t tested it. Since MacOS is a UNIX system, you might be able to follow in the MacOS terminal, but I haven’t tested it either.

Basic knowledge of the shell/Bash: how to open the terminal, execute a command with parameters, and save a sequence of commands in a script. We will also use file redirects with > and Bash variables. Ubuntu provides the nice beginner's tutorial "The Linux command line for beginners" (ca. 1 hour). The Software Carpentry's tutorial "The Unix Shell" is also recommendable and covers a bit more than what we need here.
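The two Bash features mentioned above, variables and the > redirect, fit in a minimal sketch (the file name greeting.txt is just for illustration):

```shell
#!/usr/bin/env bash
GREETING="Hello, Popper"         # store text in a Bash variable
echo "$GREETING" > greeting.txt  # redirect the output into a file
cat greeting.txt                 # prints: Hello, Popper
```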

Docker Engine: Docker is a container engine, which means that it creates and manages virtual operating systems (containers). Popper will take care to call the appropriate Docker commands for us. Here are the official detailed installation instructions. On Ubuntu/Debian you can also just install the docker.io package: sudo apt-get install docker.io. Then you need to add your own user account to the docker user group with sudo usermod --append --groups docker YOUR_USERNAME. Afterwards a restart might be required to use Docker. (I have tested Docker version 19.03.13-ce, but any more or less recent version should do.)

Git: On Ubuntu/Debian: sudo apt-get install git. I won’t explain the Git commands in detail, so if you’ve never worked with Git, I suggest going through Roger Dudler’s simple guide to jump-start into Git.

Popper, version 2020.09.1 or later: install instructions. Check that Popper has been successfully installed by executing popper version or popper help.

Creating the Barebone Structure

In order to be reproducible, your project must be self-contained, i.e. include (almost) everything necessary for execution. For that we use Git. Although Git is not good at handling large files, it serves well as a tool for collaboration and tracking versions. With some tricks for including large files, which I will explain below, Git becomes the perfect tool for managing scientific coding projects.

Create the Git repository somewhere on your computer:

git init tutorial
cd tutorial

With cd we have changed our working directory to the root of the Git repository. Currently the repository is empty. There is only the hidden folder .git in there, which Git uses to store settings and the version history. Most of the commands in this tutorial series will be executed from the root of the repository.
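You can verify this yourself; ls -A lists all entries including hidden ones:

```shell
ls -A  # in the fresh repository this prints only: .git
```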

How you structure your coding project depends on your topic and your personal taste. The Guide to Reproducible Code in Ecology and Evolution by the British Ecological Society and Wilson et al. (2017) provide some helpful guidelines, which I recommend. In this example project, I start out with the directories scripts/ and input/. I do not store any output in the Git repository. Just like compiled software, it would clutter the repository and create version inconsistencies. In the Appendix of this tutorial, I describe what I do with output files. For now, we just add the line output/ into the file .gitignore at the root of the repository. This instructs Git not to track any output/ folder that we might create.

mkdir scripts
mkdir input
echo "output/" > .gitignore
touch README.md

The README File

The touch README.md command has created an empty file. You can open the README.md in any text editor or a designated Markdown editor. If you are new to Markdown, you might find this guide from GitHub helpful.

This is what our project looks like currently:

Current project structure. Output files are not tracked by Git.

The README.md file is at the core of every Git repository. It is the landing page that informs the reader about everything necessary to navigate and use the project (hence the signpost symbol in the figure). On any Git web interface, like GitLab, GitHub, or Gitea, the README.md will be shown first thing when someone opens your repository. This is what a README.md file should contain in a research project:

  • General overview on what this project is about.
  • If applicable: abstract, DOI, and URL of the connected publication.
  • All authors and contributors with affiliations, ORCID, and potentially email address. (You might say that you can see the contributors in the Git log, too, but for an archival-ready research project we shouldn’t rely on that.)
  • How directories and files are structured, including naming conventions.
  • External dependencies, e.g. datasets one needs to download.
  • Usage instructions for re-running the experiment/analysis, including versions of required third-party software. With Popper, this boils down to popper run.
  • A license or copyright statement. I recommend to comply with the REUSE standard (described below).
  • Known bugs.
  • Specifications of the hardware and operating system that you used to run your experiment/analysis.

For our tutorial, I wrote this README.md:

# Example Project

This is an example for how to use [Popper](https://getpopper.io) to build reproducible projects in computational science.

## Authors

- Firstname Lastname (email@address.com), Affiliation ![ORCID][orcid-logo] <https://orcid.org/0000-0000-0000-0000>
- Firstname Lastname (email@address.com), Affiliation ![ORCID][orcid-logo] <https://orcid.org/0000-0000-0000-0000>

[orcid-logo]: https://orcid.org/sites/default/files/images/orcid_16x16.gif

## Project Structure

- `.popper.yml`: The Popper workflow.
- `input/`: Input datasets.
- `LICENSES/`: Licenses used in this project.
- `output/`: Folder for all output files. Not under version control, but created on execution.
- `README.md`: Landing page for this project.
- `scripts/`: Driver scripts to run the analysis.

## Usage

You need:

- Linux
- [Docker](https://docker.com) Engine (tested with 19.03.13-ce)
- [Popper](https://getpopper.io) (>= 2020.09.1)

Open a terminal in the root of this repository and execute the analysis with `popper run`. You will find the output files in the newly created `output/` folder.

## System Specifications

This has been successfully run on a PC with these specifications:

System: Kernel: 5.8.14-arch1-1 x86_64 bits: 64 Console: N/A
Machine: Type: Desktop System: Hewlett-Packard product: HP ProDesk 600 G1 SFF v: N/A
serial: <filter>
Mobo: Hewlett-Packard model: 18E7 serial: <filter> BIOS: Hewlett-Packard v: L01 v02.21
date: 12/17/2013
CPU: Info: Quad Core model: Intel Core i5-4570 bits: 64 type: MCP L2 cache: 6144 KiB
Speed: 1599 MHz min/max: 800/3600 MHz Core speeds (MHz): 1: 2445 2: 2096 3: 2444
4: 2474

## License

This project is compliant with the [REUSE][] standard:
Each file has a copyright notice; all licenses are in the `LICENSES/` folder.

[REUSE]: https://reuse.software/

Hardware Information

How do you get this nice overview of the system specifications? I used the handy little command-line tool inxi (Free Software). It’s written in Perl. Go ahead and download or install it according to these instructions. Then call inxi -MSCzc0 to get the above output for your system, which you can just copy-paste into the README.md. To learn more about the many different options, call inxi --help. If you can't or don't want to install Perl, you can download and run inxi in a Perl Docker container like this:

docker run --entrypoint /bin/bash --rm perl:latest -c "wget --no-verbose smxi.org/inxi && perl inxi -MSCzc0"

Licenses

Licensing is a topic that many scientists seem to just plainly ignore. I can understand that copyright questions can be extremely complex and overwhelming. Fortunately, we are not left alone here because others have already blazed the trail by creating helpful guides and easy-to-use tools. You can just pick a ready-made open-source license and attach it to your files. Then it’s clear what others can and cannot do with your work. By choosing an open-source license you actively support the free exchange of information and tools, thereby boosting scientific progress!

I recommend these three steps:

First: Pick your favorites among the available licenses. Check out choosealicense.org for an accessible overview. For further reading pertaining to the specifics of scientific works, take a look at Stodden (2009) and Morin et al. (2012).

  • Note that code needs a different license than media or texts.
  • For software, I generally recommend the copyleft license GPLv3 (or later), which forces everybody using your code to release their project as Free Software, too. However, the GPL can cause trouble if, for instance, your code is part of a larger project that has no license (and is thus proprietary). In those cases, use the more permissive MIT license, which allows your code to be used in non-free projects, too.
  • For media and texts, use a Creative Commons license. The CC-BY-4.0 is a good choice for scientific works.

Second: Talk with your employer or supervisor and your collaborators about licensing. Come to a joint decision and write it down. Also check in with your legal department whether there are any rules from your institute or university to consider.

Third: Apply your license(s) to your project following the REUSE standard. We will do that now for our example project.

The Free Software Foundation Europe (FSFE) started the REUSE project to create a standard that makes licensing your project easy. There is a little tutorial available on the REUSE website. It explains a similar procedure to what I am showing here, but goes into more detail. The REUSE FAQ is also very informative. For our purpose, we will use the reuse command-line tool. You have two options here:

  1. Follow the installation instructions and install REUSE with pip3 install --user reuse.
  2. Execute reuse in Docker with the REUSE Docker image. Using Docker has the advantage that you don't need to install anything. However, you will need to replace the reuse command in all my instructions with this Docker command: docker run --rm -it -v $(pwd):/data fsfe/reuse.

Tip: You can create an alias in Bash: alias reuse='docker run --rm -it -v "$(pwd):/data" fsfe/reuse'. Then you can simply type reuse (until you close the shell).

Our first command is reuse init (from the root of our repository):

# With the PIP installation or with alias:
reuse init
# With Docker:
docker run --rm -it -v "$(pwd):/data" fsfe/reuse init

The program will ask you a few questions. If you don’t have an answer to some of them, just hit RETURN. The first question is to specify a license. You need to specify the exact SPDX identifier from this list of open-source licenses. If you misspell the identifier, the reuse tool will suggest a correction.

You could skip this step and add the licenses later because we will be selecting licenses on a per-file basis anyway. But for the sake of practice, let’s pick one here. I am selecting CC0-1.0, which will be for the README.md of the example project. It is the Creative Commons public domain dedication and basically says: do whatever you want with it. :)

You can choose another license to play with. Then hit RETURN. The following questions are about the name and website of your project and the maintainer (which is you). Fill in something that suits your fancy. Afterwards, reuse will download the license texts in a newly created LICENSES/ folder and create a file .reuse/dep5 with metainformation, basically your answers to the questions.

Next, we put README.md and .gitignore in the public domain with the reuse addheader command. In this and all following snippets, change the copyright holder from "Jane Doe" to whatever is applicable to you. Usually that's just your name (and email address), but it might be your employer, too.

reuse addheader --copyright="Jane Doe <jane@example.com>" --license="CC0-1.0" README.md .gitignore

reuse has added a comment at the top of the README.md with a reference to the license, which is kept in the LICENSES/ directory. I have put the .gitignore file in the public domain, too. It really contains only one line, but the REUSE standard is strict and wants license information for each and every file. So, as recommended, we just put these trivial configuration files into the public domain on principle.
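For reference, the comment that reuse prepends to a Markdown file looks roughly like this; the exact formatting depends on your reuse version:

```markdown
<!--
SPDX-FileCopyrightText: Jane Doe <jane@example.com>

SPDX-License-Identifier: CC0-1.0
-->
```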

In order to complete the licensing step, let’s check that reuse is happy with everything. Use the reuse lint command to check for compliance with the standard:

reuse lint

If the linter complains about missing license files, you can now use the command reuse download --all to automatically download all licenses used in the project. It's good practice to run the reuse lint command every now and then to check if all your licensing information is sound and solid. Make this a routine before calling git add so that you're sure every new file is licensed.

This is all we need for now before we start filling our project with content. Let’s commit what we have to our Git repository:

git add --all
git commit -m "Create barebone project structure"

Writing an Example Script

A coding project in science is typically either a computational experiment or a data analysis. Input (usually datasets) are fed into scripts and produce output (figures, datasets, papers, reports, …). For your research project, the analysis or experiment is, of course, the most important part. In this tutorial, however, the analysis is reduced to the bare minimum. Our objective is nothing more than reading a file with random numbers and creating a box-and-whisker plot from it. It’s up to you to flesh out what is input, code, and output in your own projects.

The objective of a research coding project is typically to produce output from input.

You can download my example input file from the tutorial repository. Save it in the input/ directory. Since this is a small file (48 KB), we can add it directly to Git. This is what it looks like in the command line:

curl "https://gitlab.com/wtraylor/popper-tutorial/-/raw/tutorial_1/input/input.txt?inline=false" > input/input.txt

If you’ve downloaded input.txt through your browser, the file might have ended up in your downloads folder. In that case you can move the file to your repository folder with mv ~/Downloads/input.txt input/.

The data in input.txt are just random numbers, so we place it under the CC0-1.0 license. Since this is not a source code file, the reuse tool is not able to add a comment block. Therefore we need to call reuse addheader with the --explicit-license flag. reuse will then create a separate file with license information called input.txt.license. (For files that are obviously in binary format, like images, reuse will do that automatically without us needing to specify --explicit-license.) Afterwards we can create a new commit with our new input file:

reuse addheader --copyright="Jane Doe <jane@example.com>" --license="CC0-1.0" --explicit-license input/input.txt
git add input/
git commit -m 'Add input file'

From this input, let’s create a simple plot using R. If you are not familiar with R, that’s no problem; at this point you just need to copy and paste. If you want to get started with R, though, consider the Software Carpentry’s introductory lesson “Programming with R”. The following script shall be written as scripts/plot_box_and_whisker.R. For that, open your text editor (or RStudio), copy-paste the script into a new text document, and then save it as scripts/plot_box_and_whisker.R in your Git repository.

#!/usr/bin/env Rscript

# This script creates a box-and-whisker plot in the file
# `output/box_and_whisker.png`.
# Call this script from the root of the repository.

message("Creating box-and-whisker plot from input data.")

dir.create("output", showWarnings = FALSE)
numbers <- read.delim("input/input.txt")

png("output/box_and_whisker.png") # Open a PNG “device”.
boxplot(numbers)
dev.off() # Close the “device”, i.e. write the image file to disk.

message("Done!")

The first line is called a shebang. It instructs UNIX (i.e. Linux or MacOS) how to interpret the script. In our case we say that our script should be fed to the R interpreter Rscript. The prefix /usr/bin/env makes the script more portable: while we cannot know in advance where the Rscript executable resides on a host system, env finds the path automatically for us. The rest of the script should be self-explanatory: reading the data, creating the plot, and printing some messages.
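You can see env at work directly in the terminal. Here it locates the sh interpreter on $PATH, just as the shebang lets it locate Rscript:

```shell
env sh -c 'echo "found the interpreter via env"'
# prints: found the interpreter via env
```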

Let’s make the script executable with chmod +x:

chmod +x scripts/plot_box_and_whisker.R

I think that The Unlicense is an appropriate choice for licensing the source code parts that I am providing here. Comparable to the Creative Commons Zero license, The Unlicense puts my code in the public domain. Everybody can use it and doesn’t need to give me credit. The SPDX identifier for The Unlicense is “Unlicense”. Because reuse currently (version 0.11.1) doesn't recognize the .R extension, we need to specify that our script needs Python-style comments (with the # symbol). Since we don't have the text of The Unlicense in our LICENSES/ folder yet, we need to ask reuse to download it for us. Afterwards we can add everything to Git:

reuse addheader --copyright="Jane Doe <jane@example.com>" --license="Unlicense" --style python scripts/plot_box_and_whisker.R
reuse download --all
git add scripts/plot_box_and_whisker.R LICENSES/
git commit -m 'Add script to create boxplot'

If you have R installed, you can call the script right away to give it a test run: ./scripts/plot_box_and_whisker.R. However, the point is to run this in a container. So let's move on and use Popper for that.

Containerize the Script with Popper

For the sake of reproducibility, we run our project in a container. Containerization means that we don’t execute our script on our normal operating system, but in a controlled environment. All the configurations, system libraries, and applications we have installed on our computer will not be visible to the script. Instead, it will only see what’s inside its container. It’s like doing a biological experiment with plants in a greenhouse as opposed to the field. In the greenhouse you can exactly control light, temperature, humidity, nutrition, etc. If you recreate this exact environment and run the experiment again, you will likely get similar results. In the field, however, chances are that some natural hazards jeopardize your experiment: drought, voles, hail, you name it. In computational experiments, these hazards are differing system settings, dependency issues, library versions, etc. They can make your experiment not reproducible. That’s why we put it in a container.

While a container is the actual environment for executing your script, an image is the “blueprint” that defines what a newly-created container instance looks like. Consequently, there can be many different containers created from one image. A container engine is the tool for achieving this containerization, which is also called OS-level virtualization. To stick with the above metaphor, images are the construction plans for a greenhouse, along with all the settings for light, temperature, humidity, and so on. Containers, then, are the actual greenhouse built in the garden. You can picture the container engine as the crew of architects, builders, and gardeners who set up and maintain the whole infrastructure. (If you like to learn from videos: “What is the difference between Docker image and Docker container?”)

There are different tools for containerization. In this tutorial series, we will work with Docker, but Popper also supports Singularity. For installing Docker you need root access, and for using it you need to be in the docker user group. Singularity doesn't need root rights to run, which makes it great for HPC (high performance computing) applications. Another difference is how Docker and Singularity handle images. Docker stores its images in the container engine on your particular system, whereas Singularity stores images in portable image files. Docker provides us with a wide range of ready-to-use images for free download on Docker Hub, and fortunately you can use these images with Singularity, too. All that makes Docker and Singularity interchangeable for many use-cases.

Popper makes containerization easy. It manages the container engine commands for you, and you can easily switch between Docker and Singularity. You only need to define a workflow file and feed it to Popper by calling popper run. Then Popper goes through all the steps in the workflow file, creates the containers, and executes our scripts in them. So let's take a look at our project's workflow file. Store this in a newly created file called .popper.yml in the root of our repository:

steps:
- uses: "docker://r-base:3.5.2"
  args: ["scripts/plot_box_and_whisker.R"]

As usual, we put this new file under Git version control, and I choose the Unlicense for it (since I consider it source code rather than a configuration file):

reuse addheader --copyright="Jane Doe <jane@example.com>" --license="Unlicense" .popper.yml
git add .popper.yml
git commit -m 'Add Popper workflow file'

Popper’s workflow files are in YAML format. You don’t need to learn all the complicated YAML syntax, but Wikipedia’s list of the most important points might be worth a look. For now, keep in mind that whitespace indentation matters, and that the hash sign (#) turns everything that follows into a comment. Popper's workflow syntax is documented here. In this tutorial, we will only focus on those syntax elements that we need at the moment.
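To make this concrete, here is our workflow file again with YAML comments added and the significant indentation made explicit; the comments are mine, not part of Popper's syntax:

```yaml
# Everything after a hash sign is a comment.
steps:
- uses: "docker://r-base:3.5.2"            # the container image for this step
  args: ["scripts/plot_box_and_whisker.R"] # note the two-space indentation under "uses"
```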

Our file consists of a single step that executes our script in a container based on the r-base:3.5.2 image. Here is more information about the r-base image on Docker Hub. Because we specified docker:// in the workflow file, Popper will automatically search for the image on Docker Hub and download it for us. For science projects, it is very important to specify the exact image version after the colon. If you were to write r-base:latest or just r-base, Popper would use the latest version of R, and that might stop working with the rest of your project if a breaking update gets released. When choosing an image from Docker Hub, select only official images; with those you can be reasonably sure that they will be around for a little while.

This is our new project structure:

Current project structure. We depend on Docker Hub to download our container image.

Now, after you have installed Docker and Popper, you can run your workflow by executing popper run from the root of the repository. Try it out! If everything is working, you should get an output similar to this:

$ popper run
[1] docker pull r-base:3.5.2
[1] docker create name=popper_1_862936d4 image=r-base:3.5.2 command=['scripts/plot_box_and_whisker.R']
[1] docker start
Creating box-and-whisker plot from input data.
null device
1
Done!
Step '1' ran successfully !
Workflow finished successfully.

Note that if you choose a filename other than .popper.yml for your workflow file, you need to specify that when calling Popper. Suppose you decided to call your workflow file workflow.yml, you need to execute popper run -f workflow.yml. This way you can have multiple workflow files in one project.

Now check that the file output/box_and_whisker.png has been created correctly. Just open it in an image viewer or double-click in a file explorer.

Congratulations, you have created your first reproducible workflow with Popper! Your only dependencies are Popper, Docker Engine, and the image download from Docker Hub. That’s much better than most projects I have seen.

The final state of the example project is available here on GitLab.

For now we have only used base R, but most R projects require additional packages. In the next installment of this series, I will show you how to include R packages with Packrat. Hope to see you next time, and enjoy coding!

Literature Cited

Appendix: Sharing Output on Open Science Framework

I explained that it is not a good idea to store your output files in the Git repository. So, from the viewpoint of reproducibility, what’s the best way to manage output files?

Output typically includes images and numbers, which are often combined with text. So usually we’re dealing with image files, tables, or PDFs. R Markdown and Jupyter are great tools to compose the results of your analysis/experiment into one PDF. In any case, the result of a Popper workflow is one file or a set of files from which we want to draw conclusions in order to move forward.

Conclusions (or inferences) ought to be reproducible the same way as our scripts. That’s called inferential reproducibility (Goodman et al. 2016). To achieve this, we need to make our train of thought accessible to other researchers (and to our future self). In the context of computational experiments and analyses, that train of thought is typically a train of computations. We run our software, look at the output, draw conclusions, change our code or input, and start the cycle again. Ideally, each iteration of this cycle is documented, and each execution of our software is reproducible.

Cycle: Execute experiment/analysis with popper; analyze and discuss output uploaded to Open Science Framework; draw conclusions in your lab journal; change code or input; execute again…
Reproducible research cycle. Popper logo courtesy of Ivo Jimenez.

I have found that Open Science Framework (OSF) provides the best place to archive and share my intermediate research results. You can sign up for free and create personal or collaborative projects. For each of my projects I create a Git repository and an OSF project. They form a pair. Through storage addons, OSF currently supports linking to GitHub, Bitbucket, and GitLab. Whenever I have a new result, I upload the output file(s) to the OSF project, where it receives a permanent URL. I make sure that I can track back with which revision (commit) I have created a file. In this figure I depict that with a chain:

Ideally, one should be able to trace every output file back to the repository version from which it was produced.

Every Git commit is identified with a hash sum. This command will give you the (abbreviated) hash sum of your current commit:

git rev-parse --short HEAD

How you store the information from which Git commit the output file was generated depends on your specific needs. Here are some options I can imagine:

  • Add the commit to the filename of your output, e.g. analysis_plot_33bfe5d.png.
  • Create a folder in the OSF project where you put all output files, e.g. results_from_commit_33bfe5d/, which then contains analysis_plot.png.
  • Automatically write the commit in your output file. For example in R, you could generate a caption at the bottom of your figure with system("git rev-parse --short HEAD").
  • Write the commit in the metadata of the file (if the format supports it), using the powerful exiftool: exiftool -comment='Created from commit 33bfe5d' box_and_whisker.png
  • Manually keep a list or table saying, “File https://osf.io/hybf8?version=1 was created with commit 33bfe5d.”
  • After having uploaded your output to OSF, create a Git tag (a descriptive label for a commit) that states which files have been produced here, e.g.: git tag -m 'Create https://osf.io/hybf8?version=1' first_results
  • … I’m sure you will come up with some more solutions yourself. ;)

For a chronological research journal (also called notebook or log) I suggest three alternatives:

  • The Wiki of your OSF project, which is in Markdown format.
  • A file in an OSF project, which you can overwrite by uploading a new version with the same filename. (OSF allows to restore old versions.)
  • In the Git repository, e.g. prominently named JOURNAL.md in the root of your repository.

I hope I could explain enough of the principles so that you can get creative yourself. Just don’t forget to document your procedure and naming conventions in detail.


This work is licensed under a Creative Commons Attribution 4.0 International License.
