Tutorial 2/3: Include R packages in Popper with Packrat

Introduction

Wolfgang Traylor
getpopper
May 26, 2021


Welcome back to my tutorial series on how to set up a reproducible research project with Popper. In the first tutorial, we started an example project in a Git repository. We wrote a super simple R script that we executed in a Docker container with the help of Popper. For licensing, we followed the REUSE standard and used the reuse command-line tool. Now we will expand that example project.

This tutorial series is also available in this public Git repository. The state of the example project at the end of each tutorial is captured in one branch each. So branch tutorial_1 is the state where we left off the last time, and tutorial_2 is where we will get to this time. Consequently, if you don’t have your project from the previous tutorial at hand, you can make a clone from branch tutorial_1 like so:
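A minimal sketch of such a clone, wrapped in a small helper so that the repository address stays a parameter. The function name clone_tutorial_branch and the target folder name are placeholders of my own; substitute the address of the tutorial repository.

```shell
# Hypothetical helper: clone a repository and check out one branch.
#   $1 = repository URL or path (placeholder, not spelled out here)
#   $2 = branch to check out, e.g. tutorial_1
#   $3 = target folder for the new working copy
clone_tutorial_branch() {
    git clone --branch "$2" "$1" "$3"
}

# Usage (replace the placeholder with the real repository address):
# clone_tutorial_branch "<repository-url>" tutorial_1 popper-tutorial
```

The --branch option makes Git check out tutorial_1 right after cloning, so the working copy starts exactly where the first tutorial ended.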

In order to follow this tutorial, you will need the same things as before: Linux, Docker Engine, Git, Popper (version >= 2020.09.1), and some basic knowledge of Bash and Git. You can install reuse or use the Docker command docker run --rm -it -v $(pwd):/data fsfe/reuse instead (for which you can create a Bash alias as explained in the first tutorial). It’s useful if you know some R, but don’t worry if you are not an R user. You can just copy-paste the example scripts, and perhaps you’ll pick up some tricks that can help improve the workflow for the tools you are using.

Note that I don’t use RStudio as an IDE (integrated development environment). The functions of RStudio are great for interactive scripting. So if you prefer writing your R scripts with RStudio, that’s fine. However, when it comes to designing your project to be portable and reproducible by another person, don’t assume that this person has RStudio. A GUI (graphical user interface) like RStudio might get redesigned in the future, might stop being developed, or might not run on other operating systems. Therefore, I only show shell commands and text files. They have been around since the dawn of computers and will likely persist for a while.

Using R Packages

Nearly all R scripts will use some R packages that extend the functionality of base R. Our current script in scripts/plot_box_and_whisker.R looks really boring right now (I omitted the top of the file with shebang and license info):

This version is a bit fancier:

  • I used the checkmate package to assert various assumptions that I made while composing the script. If anything doesn’t go as expected, checkmate will provide us with a user-friendly error message.
  • The here package creates filepaths that are relative to the root of our Git repository. This way, it doesn’t matter anymore where we place the script and from which working directory we call it. Our filepaths always relate to the root of our self-contained project.
  • By using the file.path() function (from base R), I have eliminated the slash (/) in the filepath. That is generally a good practice for the sake of portability because Linux uses slashes and Windows backslashes.
  • The shebang (#!/usr/bin/env Rscript) and REUSE license information at the top of the file remain the same.

Let’s create a new commit with our changes right away:

What is Packrat?

Now we have introduced two R packages as dependencies into our script, here and checkmate. How do we include those in our reproducible, Popper-driven project? The answer is Packrat! In a typical workflow, you install R packages with install.packages() into a library folder that is shared by all your projects (usually somewhere in your home folder). In contrast, Packrat stores them per project in the project’s folder. That makes your project portable, reproducible, and isolated because it delivers all dependencies in the correct versions together with your project. All dependencies? Well, not quite. Even with Packrat, you still have no control over the version of the R interpreter itself and over the operating system. That’s why we integrate Packrat into Popper.

In order to use Packrat, we first need to install it. Now I could ask you to fire up RStudio or R in the terminal window to install Packrat on your system. However, as you might have noticed, I didn’t include R in the list of requirements, i.e. you don’t need R installed on your system at all! Instead, we will use an interactive container to initialize Packrat.

Exploring Interactive Containers

So far we have written scripts and had them executed by Popper. Executing scripts is a non-interactive way of working. In contrast, interactive means that you type your commands into a prompt and immediately execute them one after the other by hitting return. When you open a terminal window you are in an interactive shell on your operating system. You can also enter an interactive shell for the virtual operating system inside a Docker container. Imagine this switch into the interactive shell of a Docker container like leaving the laptop you’re working on and going over to another desk in your room. There you find another laptop with its own operating system, its own file system, and different programs installed.

Conveniently, Popper provides an interactive mode out of the box. Earlier, we called popper run from the root of our project repository, and Popper automatically executed the steps in the workflow file .popper.yml. Now, we use popper sh to open an interactive shell in a Docker container. While popper run executes all steps in the workflow file, popper sh can only open a container for one step. Therefore we need to specify that (unnamed) step with a number; in our case 1:
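A minimal sketch of that call. The command itself is just popper sh followed by the step number; the helper name open_step_shell and the check for .popper.yml are additions for illustration, not something Popper requires.

```shell
# Hypothetical helper: open an interactive Bash session for one workflow step.
#   $1 = step identifier; our single unnamed step is addressed as 1
open_step_shell() {
    # Popper looks for the workflow file in the current folder,
    # so refuse to start anywhere but the project root.
    if [ ! -f .popper.yml ]; then
        echo "No .popper.yml here; change to the project root first." >&2
        return 1
    fi
    popper sh "$1"
}

# Usage, from the root of the project repository:
# open_step_shell 1
```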

Now you should be in an interactive Bash session inside a newly created r-base:3.5.2 container. This is your chance to go out exploring what this ominous “virtual operating system” actually looks like:

  • What kind of system are you on? → cat /etc/os-release, hostname
  • What files do you see here? → ls -a, pwd
  • Which software versions are installed? → apt update && apt list --installed
  • How do you install new packages? → apt update && apt install ...
  • Where do those packages come from? → cat /etc/apt/sources.list

You can close the interactive session by issuing exit or pressing Ctrl-D. All changes you have made to the container system will be gone, except for changes to files in your project folder. In the Popper/Docker container, your project folder appears by default in the path /workspace. In the appendix at the end of this tutorial, I point out a pitfall we can run into when creating files in the /workspace folder from within the container. But for now, let’s continue with our project.

Installing Packrat

After our little exploration, close the old container in order to open a fresh one with Popper. This time, however, we don’t want to use Bash, but an interactive R session in order to install Packrat. By passing the argument -e/--entrypoint to Popper, we can choose to execute a program other than Bash. And, obviously, that program is R:
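As a hedged sketch, the call could look like this, assuming that the --entrypoint option is given before the step number. The function name open_step_r_session is made up for illustration.

```shell
# Hypothetical helper: open an interactive R session (instead of Bash)
# in the container of one workflow step.
#   $1 = step identifier; 1 for our single unnamed step
open_step_r_session() {
    popper sh --entrypoint R "$1"
}

# Usage, from the root of the project repository:
# open_step_r_session 1
```

Leave the R session with q() or Ctrl-D to get back to your normal shell.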

Now we have been dropped into an interactive R session and can do stuff inside the container. Let’s set up Packrat according to the official instructions:

This has created the directory packrat/ and a number of files in our repository:

  • packrat/packrat.lock is the list of all the packages we are using in our repository with their exact versions. In the init() call, Packrat has automatically searched through all R scripts in our repository and listed the libraries that are used. Then it downloaded them from CRAN and installed them into the packrat/ folder.
  • packrat/src/ contains the source code of all R packages, each in a .tar.gz archive file in its own subfolder. Packrat will automatically compile them for the specific platform on which it is started.
  • .gitignore was changed to ignore the compiled R packages in the packrat/lib* folders. Since the compiled binaries are platform-specific and built automatically, we don’t want to include them in the Git repository.
  • .Rprofile loads Packrat automatically when we open an interactive R session in the root of our repository. The actual code for that is in packrat/init.R.
  • packrat/packrat.opts contains options for Packrat. You can inspect them with the packrat::get_opts() function and change them with packrat::set_opts().

After leaving the interactive Popper container, we should do a test run:

Does it work? Check if the plot in output/box_and_whisker.png looks alright.

If there are no errors, we can proceed and add the new Packrat files to Git.

But stop! Suddenly we’re dealing with some (comparatively) big binary files in the packrat/src/ folder. Git is not particularly excited about swallowing those; it only likes to eat bite-sized files. Let’s be friendly to Git and serve them in a more digestible form.

Tracking Large Files with Git-LFS

Git has been designed and optimized for managing source code files. Therefore we shouldn’t just blindly add big files to it. We would bloat the repository and jeopardize Git’s excellent performance. When I say “big”, I mean, as a rule of thumb, anything above 1 MB. That applies in particular to compressed files, like our .tar.gz R packages, because Git would try to compress them again (which is pointless). (If you want to learn more about size-related best practices, take a look at git-sizer.) Fortunately, there is a ready-made solution for our problem: Git-LFS.

The Git extension Git-LFS (Git Large File Storage), developed by GitHub, helps us to easily include large files in a repository. Once Git-LFS has been told which files to track, it will handle them automatically in the background. We can apply to them all the Git commands we are already familiar with.

Consider, though, that even Git-LFS is not very good at handling really big files of several gigabytes. As a rule of thumb, I recommend storing no more than a few gigabytes of Git-LFS files in total in one repository. One reason is that Git-LFS typically uses twice the space of what it is actually storing: one copy in the .git folder in your repository, and another copy in the checked-out working directory. I am planning to write another tutorial about this issue some time in the future. It will appear on my yet-to-build website wtraylor.de.

First, install Git-LFS on your system. Note that, after having installed the software, you need to execute git lfs install once. Then come back to our project repository and tell Git-LFS which files it should be in charge of:
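A minimal sketch of the tracking step. The glob pattern is an assumption based on the layout described above (one .tar.gz archive per package subfolder in packrat/src/), and the function name track_packrat_archives is made up. Adding the REUSE license header and committing are left out here.

```shell
# Hypothetical helper: put the Packrat source archives under Git-LFS control
# and stage the tracking rule that Git-LFS writes to .gitattributes.
track_packrat_archives() {
    # Quote the pattern so that the shell does not expand it;
    # Git-LFS stores it verbatim in .gitattributes.
    git lfs track "packrat/src/*/*.tar.gz" &&
        git add .gitattributes
}

# Usage, from the root of the project repository:
# track_packrat_archives
```

Because the pattern names the packrat/src/ folder explicitly, other .tar.gz files that may end up in the repository later are not silently moved to Git-LFS.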

Git-LFS has created the .gitattributes file, where it saves what to track. For that information to be persistent, we have added the file to our repository. Of course, we have added the REUSE license information right away. (Remember from the last tutorial that, if you don’t want to install reuse, you can set an alias for running it in Docker: alias reuse='docker run --rm -it -v "$(pwd):/data" fsfe/reuse')

Don’t forget to mention our new packrat/ directory in the “Project Structure” list in the README.md:

Now everything is prepared for adding all Packrat files. Git-LFS will handle the archives, normal Git the rest:

Note that for Git-LFS to work, the Git server needs to support it. That is the case for all major Git hosting services, like GitLab, GitHub, and Bitbucket, and also for the self-hosted Gitea.

This is what our project looks like now:

Current project structure. R packages are included now.

But wait a minute. Didn’t we forget something? What about licenses for all the Packrat files we’ve added? We have just included third-party software in our repository, which we will redistribute. Legally, that can be thin ice. Fortunately, the CRAN policy prescribes that every R package must be under one of the accepted licenses, which at the time of writing are all Free Software licenses. We can redistribute Free Software with our code, even if we choose a different license (as long as we don’t aggregate the third-party code with ours into one piece of software and as long as we provide the third-party source code). If you want to be sure, look at the DESCRIPTION in each of the .tar.gz package files: There is a line “License: …”. With regard to being allowed to add the packages, we are good to go. (Note that I’m not giving any legal advice here…)

Now, shouldn’t we include license information for each of the packages? Ideally, yes, because only then would we be fully REUSE-compliant. The drawback is that it’s a lot of work to check the license and copyright holder of each R package. Therefore, I have decided to break REUSE compliance at this juncture and simply explain the situation in the “License” section of the README.md:

Even if we are not fully REUSE-compliant anymore, it’s still worth running reuse lint to get an overview. Only the Packrat files should be listed as files without license information.

Bootstrapping Packrat

Each time we execute popper run, a clean r-base container is created, which doesn’t know anything about Packrat yet. When we ran popper run earlier to test our Packrat setup, Packrat restored itself because we had all the compiled R packages already lying in our project folder (in packrat/lib*). We just didn’t track them with Git. Now, if we push our repository to the server, and another person clones it, they will not have those compiled packages conveniently in place. Packrat (currently v0.5.0) is pretty smart, but still needs a little bit of help to bootstrap itself.

To illustrate the issue, let’s delete all packrat/lib* folders and then try to run our workflow:
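A minimal sketch of the deletion, meant to be run inside the interactive container session, where the project folder is mounted at /workspace. The function name remove_compiled_packages is made up for illustration.

```shell
# Hypothetical helper: delete the compiled package libraries
# (all folders matching packrat/lib*) and leave packrat/src/ untouched.
#   $1 = project root; defaults to the current folder
remove_compiled_packages() {
    rm -rf -- "${1:-.}"/packrat/lib*
}

# Usage inside the container session:
# remove_compiled_packages /workspace
```

After leaving the container, a plain popper run from the project root shows how the workflow behaves without the compiled packages.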

We enter the interactive Popper session here because the files we want to delete are owned by root. I describe that in more detail in the appendix and offer alternative solutions. Now trying to run the workflow without the compiled packages, I get this error message:

The solution I use for bootstrapping Packrat is currently not well documented. I came across it in Packrat’s issue #158. Let’s create a Bash script in scripts/bootstrap_packrat.sh with this content:
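Here is a minimal sketch of such a script, reconstructed from the description in this tutorial rather than copied from the original file. It assumes that it is started from the project root and leaves out the REUSE license header. The heredoc simply writes the script to disk.

```shell
# Write a possible scripts/bootstrap_packrat.sh (a reconstruction, not the original).
mkdir -p scripts
cat > scripts/bootstrap_packrat.sh <<'EOF'
#!/usr/bin/env bash
# Stop at the first failing command.
set -e
# Let packrat/init.R install Packrat itself from packrat/src/packrat/.
R --vanilla --slave -f packrat/init.R --args --bootstrap-packrat
# Build and install all other packages listed in packrat/packrat.lock.
Rscript -e 'packrat::restore()'
EOF
```

The --vanilla flag keeps R from reading .Rprofile in the first call, so packrat/init.R runs only once and sees the --bootstrap-packrat argument.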

This script requires no dependencies installed, just plain base R. The magic happens in the packrat/init.R script, which interprets the --bootstrap-packrat argument and, accordingly, installs the Packrat package from packrat/src/packrat/. Afterwards, the call to packrat::restore() compiles the other packages. In the end, all compiled packages are ready for use in the packrat/lib* folders.

Let’s incorporate that in our Popper workflow. It should be the first step, of course, because it is a prerequisite for the plotting script. This is what my .popper.yml file looks like:

Let’s make our script executable and try it out:

The output should start with something like:

And then a string of installation output follows.

Now we can add our bootstrap script and the changes in the Popper workflow to Git:

Congratulations! Your R project now ships with all of its dependencies, while execution is a breeze. Let’s call it good for now. In the next tutorial, we will take it one step further and customize our Docker container with some additional software. Until then, happy coding!

Appendix: File Ownership

Within a Docker container you are (by default) the root user. That means, the files you create from within the container are marked as owned by root. However, in your normal Linux working environment, you don’t work as root and therefore don’t have permission to change or delete those files. Let’s walk through this together with an example.

In order to not mess with our existing example project, let’s create a new, temporary project for that. You can do that anywhere in your system. On most Linux systems, the temporary file folder /tmp provides a good playground because it usually gets cleared after every reboot. I call the folder delete_me so that I know it’s trash.

The popper scaffold command creates an example workflow file in wf.yml. What’s in there doesn’t matter right now; we just want to enter an interactive session in a container with popper sh. Inside that session we create an empty file and leave:

You will see that file appear in the project folder outside of the container. Check it out in your file explorer.

In your file explorer, you can right-click on the file test in your project folder and look at its properties. Or you use the command line:
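A minimal sketch for checking ownership on the command line, assuming GNU stat as shipped with Linux distributions. The helper name show_owner is made up; ls -l shows the same names in its third and fourth columns.

```shell
# Hypothetical helper: print owner and group of a file or folder,
# first as names, then as numeric IDs in parentheses.
#   $1 = path to inspect
show_owner() {
    stat -c '%U:%G (%u:%g) %n' "$1"
}

# Usage in the temporary project folder:
# show_owner test
# ls -l
```

The numeric IDs in parentheses are the ones that chown accepts and that the id command prints for your own user.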

Here we see that the file is owned by root, as is the folder. (The root root part of the output means that user “root” in group “root” is the owner.) I, as a regular user, cannot delete that folder with rm -r folder because I don’t own it. You have different options now:

  1. Use sudo to remove/change the root-owned files/folders. For example sudo rm -r folder.
  2. Go back into an interactive Popper session with popper sh and remove/change the file from there.
  3. Select Singularity or Podman instead of Docker as the container engine in Popper (compare Popper issue #859). With Singularity and rootless Podman, files created in the container end up owned by your regular user rather than by root.
  4. Make yourself the owner using chown. First, find out your user ID (“uid”) and group ID (“gid”) by executing id. Then execute chown --recursive UID:GID . to make yourself owner of all files in our project. You can do that with sudo or in an interactive Docker session as in the following code snippet. Since my user and group ID are both 1000, this is what it looks like on my system:
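A minimal sketch of option 4 with plain Docker. The image and the /workspace mount point mirror the setup of this tutorial and are assumptions; any image that ships chown would do. The function name reclaim_ownership is made up, and the IDs are looked up with id instead of being hard-coded.

```shell
# Hypothetical helper: hand all files in the current folder back to the
# calling user. chown runs as root inside a throwaway container,
# so no sudo is needed on the host.
reclaim_ownership() {
    docker run --rm \
        --volume "$(pwd):/workspace" \
        --workdir /workspace \
        r-base:3.5.2 \
        chown --recursive "$(id -u):$(id -g)" .
}

# Usage, from the folder that contains the root-owned files:
# reclaim_ownership
```

The two id calls are evaluated on the host before the container starts, so the numeric IDs are those of your regular user, not of root inside the container.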

At this point I don’t want to recommend a particular option. Popper and the container engines are developing, and in the future, the phenomenon of those root-owned files might not be an issue anymore. Experiment and find out what works best for you at this moment.
