Tutorial 2/3: Include R packages in Popper with Packrat
Introduction
Welcome back to my tutorial series on how to set up a reproducible research project with Popper. In the first tutorial, we started an example project in a Git repository. We wrote a super simple R script that we executed in a Docker container with the help of Popper. For licensing, we followed the REUSE standard and used the reuse
command-line tool. Now we will expand that example project.
This tutorial series is also available in this public Git repository. The state of the example project at the end of each tutorial is captured in one branch each. So branch tutorial_1
is the state where we left off the last time, and tutorial_2
is where we will get to this time. Consequently, if you don’t have your project from the previous tutorial at hand, you can make a clone from branch tutorial_1
like so:
git clone https://gitlab.com/wtraylor/popper-tutorial.git -b tutorial_1 tutorial
cd tutorial
In order to follow this tutorial, you will need the same things as before: Linux, Docker Engine, Git, Popper (version >= 2020.09.1), and some basic knowledge of Bash and Git. You can install reuse
or use the Docker command docker run --rm -it -v $(pwd):/data fsfe/reuse
instead (for which you can create a Bash alias as explained in the first tutorial). It’s useful if you know some R, but don’t worry if you are not an R user. You can just copy-paste the example scripts, and perhaps you’ll pick up some tricks that can help improve the workflow for the tools you are using.
Note that I don’t use RStudio for an IDE (integrated development environment). The functions of RStudio are great for interactive scripting. So if you prefer writing your R scripts with RStudio, that’s fine. However, when it comes to designing your project to be portable and reproducible by another person, don’t assume that this person has RStudio. A GUI (graphical user interface) like RStudio might get redesigned in the future, might stop being developed, or won’t run on other operating systems. Therefore, I only show shell commands and text files. They have been around since the dawn of computers and will likely persist for a while.
Using R Packages
Nearly all R scripts will use some R packages that extend the functionality of base R. Our current script in scripts/plot_box_and_whisker.R
looks really boring right now (I omitted the top of the file with shebang and license info):
# This script creates a box-and-whisker plot in the file `output/box_and_whisker.png`.
# Call this script from the root of the repository.
message("Creating box-and-whisker plot from input data.")
dir.create("output", showWarnings = FALSE)
numbers <- read.delim("input/input.txt")
png("output/box_and_whisker.png") # Open a PNG “device”.
boxplot(numbers)
dev.off() # Close the “device”, i.e. write the image file to disk.
message("Done!")
This version is a bit fancier:
# This script creates a box-and-whisker plot in the file
# `output/box_and_whisker.png`.
library(checkmate)
library(here)
out_dir <- here("output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "box_and_whisker.png")
assert_path_for_output(out_file, overwrite = TRUE)
input_file <- here("input", "input.txt")
assert_file_exists(input_file)
numbers <- read.delim(input_file)
assert_data_frame(numbers, any.missing = FALSE, ncols = 1, nrows = 9999)
png(out_file) # Open a PNG “device”.
boxplot(numbers)
dev.off() # Close the “device”, i.e. write the image file to disk.
assert_file_exists(out_file)
message("Plot created: ", out_file)
- I used the checkmate package to assert various assumptions that I made while composing the script. If anything doesn’t go as expected, checkmate will provide us with a user-friendly error message.
- The here package creates filepaths that are relative to the root of our Git repository. This way, it doesn’t matter anymore where we place the script and from which working directory we call it. Our filepaths always relate to the root of our self-contained project.
- By using the file.path() function (from base R), I have eliminated the slash (/) in the filepath. That is generally a good practice for the sake of portability because Linux uses slashes and Windows backslashes.
- The shebang (#!/usr/bin/env Rscript) and REUSE license information at the top of the file remain the same.
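The idea of project-relative paths is not limited to R. As a minimal sketch (not part of the example project), a Bash script can derive the project root from its own location instead of relying on the working directory. The function name project_root is made up, and the sketch assumes that the script sits in a direct subfolder of the root, such as scripts/. The here package has its own way of detecting the root; this only mimics the result.

```shell
#!/bin/bash
# Sketch: resolve the project root from the location of a script file.
# Assumption: the script lives one level below the root (e.g. in `scripts/`).
project_root() {
    # $1: path to a script file, e.g. "${BASH_SOURCE[0]}"
    local script_dir
    script_dir="$(cd "$(dirname "$1")" && pwd -P)"
    # The parent of the script's folder is taken to be the project root.
    dirname "$script_dir"
}

# Inside a real script you might use it like this:
#   root="$(project_root "${BASH_SOURCE[0]}")"
#   input_file="$root/input/input.txt"
```

With this, the script finds input/input.txt no matter which directory it is called from.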
Let’s create a new commit with our changes right away:
git add scripts/plot_box_and_whisker.R
git commit -m 'Add assertions and project-relative paths to script'
What is Packrat?
Now we have introduced two R packages as dependencies into our script, here
and checkmate
. How do we include those in our reproducible, Popper-driven project? The answer is Packrat! In a typical workflow, you install R packages with install.packages()
into your home folder (on Linux typically a versioned library folder below ~/R/
). In contrast, Packrat stores them per project in the project’s folder. That makes your project portable, reproducible, and isolated because it delivers all dependencies in the correct versions together with your project.—All dependencies? Well, not quite. Even with Packrat, you still have no control over the version of the R interpreter itself and over the operating system. That’s why we integrate Packrat into Popper.
In order to use Packrat, we first need to install it. Now I could ask you to fire up RStudio or R
in the terminal window to install Packrat on your system. However, as you might have noticed, I didn’t include R in the list of requirements, i.e. you don’t need R installed on your system at all! Instead, we will use an interactive container to initialize Packrat.
Exploring Interactive Containers
So far we have written scripts and had them executed by Popper. Executing scripts is a non-interactive way of working. In contrast, interactive means that you type your commands into a prompt and immediately execute them one after the other by hitting return. When you open a terminal window you are in an interactive shell on your operating system. You can also enter an interactive shell for the virtual operating system inside a Docker container. Imagine this switch into the interactive shell of a Docker container like leaving the laptop you’re working on and going over to another desk in your room. There you find another laptop with its own operating system, its own file system, and different programs installed.
Conveniently, Popper provides an interactive mode out of the box. Earlier, we called popper run
from the root of our project repository, and Popper automatically executed the steps in the workflow file .popper.yml
. Now, we use popper sh
to open an interactive shell in a Docker container. While popper run
executes all steps in the workflow file, popper sh
can only open a container for one step. Therefore we need to specify that (unnamed) step with a number; in our case 1:
popper sh 1
Now you should be in an interactive Bash session inside a newly created r-base:3.5.2
container. This is your chance to go out exploring what this ominous “virtual operating system” actually looks like:
- What kind of system are you on? → cat /etc/os-release, hostname
- What files do you see here? → ls -a, pwd
- Which software versions are installed? → apt update && apt list --installed
- How do you install new packages? → apt update && apt install ...
- Where do those packages come from? → cat /etc/apt/sources.list
- …
You can close the interactive session by issuing exit
or pressing Ctrl-D. All changes you have made to the container system will be gone, except for changes to files in your project folder. In the Popper/Docker container, your project folder appears by default in the path /workspace
. In the Appendix at the end of this tutorial, I point out a pitfall we can run into when creating files in the /workspace
folder from within the container. But for now, let’s continue with our project.
Installing Packrat
After our little exploration, close the old container in order to open a fresh one with Popper. This time, however, we don’t want to use Bash but an interactive R session in order to install Packrat. By passing the argument -e
/--entrypoint
to Popper, we can choose to execute a program other than Bash. And, obviously, that program is R
:
popper sh 1 -e R
Now we have been dropped into an interactive R session and can do stuff inside the container. Let’s set up Packrat according to the official instructions:
install.packages("packrat")
packrat::init()
quit(save = "no")
This has created the directory packrat/
and a number of files in our repository:
- packrat/packrat.lock is the list of all the packages we are using in our repository with their exact versions. In the init() call, Packrat has automatically searched through all R scripts in our repository and listed the libraries that are used. Then it downloaded them from CRAN and installed them into the packrat/ folder.
- packrat/src/ contains the source code of all R packages, each in a .tar.gz archive file in its own subfolder. Packrat will automatically compile them for the specific platform on which it is started.
- .gitignore was changed to ignore the compiled R packages in the packrat/lib* folders. Since the compiled binaries are platform-specific and built automatically, we don’t want to include them in the Git repository.
- .Rprofile loads Packrat automatically when we open an interactive R session in the root of our repository. The actual code for that is in packrat/init.R.
- packrat/packrat.opts contains options for Packrat. You can read them with the packrat::get_opts() function and change them with packrat::set_opts().
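If you are curious which packages and versions ended up in the lockfile, you can peek into it without starting R. The following is a minimal sketch, assuming that the lockfile consists of "Key: value" lines in which every package entry starts with a "Package:" line followed by its "Version:" line. The function name list_locked_packages is made up for this example.

```shell
#!/bin/bash
# Sketch: print "name version" for every package recorded in a Packrat lockfile.
# Assumption: each package entry starts with "Package: <name>" and contains
# a "Version: <version>" line; header keys such as "PackratVersion" are ignored.
list_locked_packages() {
    awk -F': ' '
        $1 == "Package" { pkg = $2 }
        $1 == "Version" && pkg != "" { print pkg, $2; pkg = "" }
    ' "$1"
}

# Usage from the root of the repository:
#   list_locked_packages packrat/packrat.lock
```

The list should contain checkmate, here, Packrat itself, and whatever those packages depend on.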
After leaving the interactive Popper container, we should do a test run:
popper run
Does it work? Check if the plot in output/box_and_whisker.png
looks alright.
If there are no errors, we can proceed and add the new Packrat files to Git.
But stop! Suddenly we’re dealing with some (comparatively) big binary files in the packrat/src/
folder. Git is not particularly excited about swallowing those; it only likes to eat bite-sized files. Let’s be friendly to Git and serve it these files in a form it can digest.
Tracking Large Files with Git-LFS
Git has been designed and optimized for managing source code files. Therefore we shouldn’t just blindly add big files to it. We would bloat the repository and jeopardize Git’s excellent performance. When I say “big”, I mean, as a rule of thumb, anything above 1 MB. That applies in particular to compressed files, like our .tar.gz
R packages, because Git would try to compress them again (which is nonsense). (If you want to learn more about size-related best practices, take a look at git-sizer.) Fortunately, there is a ready-made solution for our problem: Git-LFS.
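To see which files in a project exceed that rule of thumb, a quick search with find can help. This is only a sketch: the function name find_big_files is made up, and I use 1 MiB (1048576 bytes) as the threshold.

```shell
#!/bin/bash
# Sketch: list regular files larger than 1 MiB (1048576 bytes) below a
# directory, skipping Git's own `.git` folder.
find_big_files() {
    # $1: directory to search (defaults to the current directory)
    find "${1:-.}" -name .git -prune -o -type f -size +1048576c -print
}

# Usage from the root of the repository:
#   find_big_files .
```

Anything that shows up here is a candidate for special treatment before you add it to Git.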
The Git extension Git-LFS (Git Large File Storage), developed by GitHub, helps us to easily include large files in a repository. Once Git-LFS has been told which files to track, it will handle them automatically in the background. We can apply to them all the Git commands we are already familiar with.
Consider, though, that even Git-LFS is not very good at handling really big files of several gigabytes. As a rule of thumb, I recommend storing no more than a few gigabytes of Git-LFS files in total in one repository. One reason is that Git-LFS typically uses two times the space of what it is actually storing: one copy in the .git
folder in your repository, and another copy in the checked-out working directory. I am planning to write another tutorial about this issue some time in the future. It will appear on my yet-to-be-built website wtraylor.de.
First, install Git-LFS on your system. Note that, after having installed the software, you need to execute git lfs install
once. Then come back to our project repository and tell Git-LFS which files it should be in charge of:
git lfs track "*.tar.gz"
reuse addheader --copyright="Jane Doe <jane@example.com>" --license="CC0-1.0" .gitattributes
git add .gitattributes
git commit -m 'Track R package archives with Git-LFS'
Git-LFS has created the .gitattributes
file, where it saves what to track. For that information to be persistent, we have added the file to our repository. Of course, we have added the REUSE license information right away. (Remember from the last tutorial that, if you don’t want to install reuse
, you can set an alias for running it in Docker: alias reuse='docker run --rm -it -v "$(pwd):/data" fsfe/reuse'
)
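The rule that git lfs track writes into .gitattributes looks like this: *.tar.gz filter=lfs diff=lfs merge=lfs -text. If you want to double-check that a particular path is covered by it, plain Git can tell you. The following is a small sketch with a made-up function name (is_lfs_tracked); it only evaluates the attributes, so the file in question doesn’t even need to exist yet.

```shell
#!/bin/bash
# Sketch: succeed if Git attributes route the given path through the LFS filter.
# Must be run inside a Git repository; it reads `.gitattributes` only.
is_lfs_tracked() {
    # `git check-attr` prints "<path>: filter: <value>"; keep only the value.
    [ "$(git check-attr filter -- "$1" | sed 's/.*: filter: //')" = "lfs" ]
}

# Usage from the root of the repository (the archive name is a made-up example):
#   is_lfs_tracked packrat/src/checkmate/checkmate_2.0.0.tar.gz && echo "handled by Git-LFS"
```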
Don’t forget to mention our new packrat/
directory in the “Project Structure” list in the README.md
:
- `packrat/`: R packages, included with [Packrat](https://rstudio.github.io/packrat/).
Now everything is prepared for adding all Packrat files. Git-LFS will handle the archives, normal Git the rest:
git add .
git commit -m 'Initialize Packrat'
Note that for Git-LFS to work, the Git server needs to support it. That is the case for all major Git hosting services, like GitLab, GitHub, and Bitbucket, and also for the self-hosted Gitea.
But wait a minute. Didn’t we forget something? What about licenses for all the Packrat files we’ve added? We have just included third-party software in our repository, which we will redistribute. Legally, that can be thin ice. Fortunately, the CRAN policy prescribes that every R package must be under one of the accepted licenses, which at the time of writing are all Free Software licenses. We can redistribute Free Software with our code, even if we choose a different license (as long as we don’t aggregate the third-party code with ours into one piece of software and as long as we provide the third-party source code). If you want to be sure, look at the DESCRIPTION
in each of the .tar.gz
package files: There is a line “License: …”. With regard to being allowed to add the packages, we are good to go. (Note that I’m not giving any legal advice here…)
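Looking into every archive by hand gets tedious, so here is a sketch of how the “License: …” lines could be collected in one go. It rests on two assumptions of mine: the archives lie in packrat/src/<package>/, and each archive contains a top-level folder named after the package. The function name list_package_licenses is made up.

```shell
#!/bin/bash
# Sketch: print "<package>: <license>" for each package archive.
# Assumptions: archives are stored as <src>/<package>/<anything>.tar.gz and
# contain a top-level folder <package>/ with the DESCRIPTION file in it.
list_package_licenses() {
    local archive pkg license
    for archive in "${1:-packrat/src}"/*/*.tar.gz; do
        [ -e "$archive" ] || continue   # the pattern matched nothing
        pkg="$(basename "$(dirname "$archive")")"
        # Write DESCRIPTION to stdout (-O) and keep the value of "License:".
        license="$(tar -xzOf "$archive" "$pkg/DESCRIPTION" 2>/dev/null \
            | sed -n 's/^License: *//p')"
        echo "$pkg: ${license:-unknown}"
    done
}

# Usage from the root of the repository:
#   list_package_licenses packrat/src
```

A package reported as “unknown” simply means that the sketch’s assumptions about the archive layout did not hold for it.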
Now, shouldn’t we include license information for each of the packages? Ideally, yes, because only then would we be fully REUSE-compliant. The drawback is that it’s a lot of work to check the license and copyright holder of each R package. Therefore, I have decided to break REUSE compliance at this juncture and simply explain the situation in the “License” section of the README.md
:
## License
This project is compliant with the [REUSE][] standard:
Each file has a copyright notice; all licenses are in the `LICENSES/` folder.
Each R package in the `packrat/src/` directory comes with its own license and copyright holders.
You will find that information in the `DESCRIPTION` file of each `.tar.gz` archive.
[REUSE]: https://reuse.software/
Even if we are not fully REUSE-compliant anymore, it’s still worth running reuse lint
to get an overview. Only the Packrat files should be listed as files without license information.
Bootstrapping Packrat
Each time we execute popper run
, a clean r-base
container is created, which doesn’t know anything about Packrat yet. When we ran popper run
earlier to test our Packrat setup, Packrat restored itself because we had all the compiled R packages already lying in our project folder (in packrat/lib*
). We just didn’t track them with Git. Now, if we push our repository to the server, and another person clones it, they will not have those compiled packages conveniently in place. Packrat (currently v0.5.0) is pretty smart, but still needs a little bit of help to bootstrap itself.
To illustrate the issue, let’s delete all packrat/lib*
folders and then try to run our workflow:
popper sh 1
# Within the Popper session:
rm -r packrat/lib*
exit
We enter the interactive Popper session here because the files we want to delete are owned by root
. I describe that in more detail in the Appendix and offer alternative solutions. Now trying to run the workflow without the compiled packages, I get this error message:
$ popper run
[1] docker pull r-base:3.5.2
[1] docker create name=popper_1_ca988fa8 image=r-base:3.5.2 command=['scripts/plot_box_and_whisker.R']
[1] docker start
Error in library(checkmate) : there is no package called ‘checkmate’
Execution halted
ERROR: Step '1' failed ('1') !
The solution I use for bootstrapping Packrat is currently not well documented. I came across it in Packrat’s issue #158. Let’s create a Bash script in scripts/bootstrap_packrat.sh
with this content:
#!/bin/bash
# Restore Packrat environment from base R.
R --vanilla --slave -f packrat/init.R --args --bootstrap-packrat
Rscript -e "packrat::restore()"
This script requires no dependencies installed, just plain base R. The magic happens in the packrat/init.R
script, which interprets the --bootstrap-packrat
argument and, accordingly, installs the Packrat package from packrat/src/packrat/
. Afterwards, the call to packrat::restore()
compiles the other packages. In the end, all compiled packages are ready for use in the packrat/lib*
folders.
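Since the bootstrap depends on files that must be present in a fresh clone, a small check before the actual bootstrapping can produce a clearer error message. This is an optional sketch, not something Packrat requires: the function name check_packrat_files is made up, and the list of paths is my reading of what the two commands above rely on.

```shell
#!/bin/bash
# Sketch: verify that the files needed for bootstrapping Packrat are present.
# Prints the first missing path to stderr and returns a non-zero status.
check_packrat_files() {
    # $1: project root (defaults to the current directory)
    local root="${1:-.}" path
    for path in packrat/init.R packrat/packrat.lock packrat/src; do
        if [ ! -e "$root/$path" ]; then
            echo "Missing: $path" >&2
            return 1
        fi
    done
}

# Possible use at the top of a bootstrap script:
#   check_packrat_files . || exit 1
```

A typical cause for a missing packrat/src is a clone made without Git-LFS being set up.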
Let’s incorporate that in our Popper workflow. It should be the first step, of course, because it is a prerequisite for the plotting script. This is what my .popper.yml
file looks like:
steps:
- uses: "docker://r-base:3.5.2"
args: ["scripts/bootstrap_packrat.sh"]
- uses: "docker://r-base:3.5.2"
args: ["scripts/plot_box_and_whisker.R"]
Let’s make our script executable and try it out:
chmod +x scripts/bootstrap_packrat.sh
popper run
The output should start with something like:
Packrat is not installed in the local library -- attempting to bootstrap an installation...
> Installing packrat into project private library:
- 'packrat/lib/x86_64-pc-linux-gnu/4.0.2'
And then a string of installation output follows.
Now we can add our bootstrap script and the changes in the Popper workflow to Git:
reuse addheader --copyright="Jane Doe <jane@example.com>" --license="Unlicense" scripts/bootstrap_packrat.sh
git add .
git commit -m 'Add script for bootstrapping Packrat'
Congratulations! Your R project now ships with all of its dependencies, and running it is a breeze. Let’s call it good for now. In the next tutorial, we will take it one step further and customize our Docker container with some additional software. Until then, happy coding!
Appendix: File Ownership
Within a Docker container you are (by default) the root
user. That means, the files you create from within the container are marked as owned by root
. However, in your normal Linux working environment, you don’t work as root
and therefore don’t have permission to change or delete those files. Let’s walk through this together with an example.
In order to not mess with our existing example project, let’s create a new, temporary project for that. You can do that anywhere in your system. On most Linux systems, the temporary file folder /tmp
provides a good playground because (on most systems) it gets cleared after every reboot. I call the folder delete_me
so that I know it’s trash.
mkdir delete_me
cd delete_me
popper scaffold
popper sh -f wf.yml 1
The popper scaffold
command creates an example workflow file in wf.yml
. What’s in there doesn’t matter right now, we just want to enter an interactive session in a container with popper sh
. Inside that session we create an empty file and leave:
# Within Docker container
mkdir folder
touch folder/test
exit
You will see that file appear in the project folder outside of the container. Check it out in your file explorer.
In your file explorer, you can right-click on the file test
in your project folder and look at its properties. Or you use the command line:
$ ls -l folder/test
-rw-r--r-- 1 root root 0 Oct 12 16:49 folder/test
Here we see that the file is owned by root
, as is the folder. (The root root
part of the output means that user “root” in group “root” is the owner.) I, as a regular user, cannot delete that folder with rm -r folder
because I don’t own it. You have different options now:
- Use sudo to remove/change the root-owned files/folders. For example sudo rm -r folder.
- Go back into an interactive Popper session with popper sh and remove/change the file from there.
- Select Singularity or Podman instead of Docker as the container engine in Popper (compare Popper issue #859). In Singularity and Podman containers, the default user is not root.
- Make yourself the owner using chown. First, find out your user ID (“uid”) and group ID (“gid”) by executing id. Then execute chown --recursive UID:GID . to make yourself owner of all files in our project. You can do that with sudo or in an interactive Docker session as in the following code snippet. Since my user and group ID are both 1000, this is what it looks like on my system:
popper sh -f wf.yml 1
# Within container session:
chown --recursive 1000:1000 .
exit
# Back in our normal shell:
ls -l folder/test
rm -r folder
At this point I don’t want to recommend a particular option. Popper and the container engines are developing, and in the future, those root-owned files might not be an issue anymore. Experiment and find out what works best for you at this moment.
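Whichever option you pick, it helps to know which files are affected in the first place. Here is a minimal sketch for listing everything in a folder that does not belong to the current user. The function name find_foreign_files is made up.

```shell
#!/bin/bash
# Sketch: list files and folders below a directory that are not owned by the
# user running the command (e.g. root-owned files created in a container).
find_foreign_files() {
    # $1: directory to search (defaults to the current directory)
    find "${1:-.}" ! -user "$(id -u)" -print
}

# Usage from the root of the project:
#   find_foreign_files .
```

If the command prints nothing, all files belong to you and there is nothing to clean up.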
This work is licensed under a Creative Commons Attribution 4.0 International License.