Stata on Code Ocean: the case of '
Tracking down a wayward statistical routine
I recently enjoyed (mostly, just watched) a friendly debate among economists about the virtues of different programming languages. One point of general agreement is that Stata makes common operations simple to execute, both syntactically (regress y x, vce(robust))¹ and in terms of necessary dependencies (very few).
When Stata code does call for external libraries, the Boston College Statistical Software Components (SSC) archive is generally users’ first place to look; Stata also includes built-in syntax for installing such packages (e.g., ssc install outreg2). No repository is truly exhaustive, however, and certain packages (often, those that are very new or very old) are likely to slip through the cracks. This was made painfully clear to me as I prepared to publish a project on Code Ocean: The contact hypothesis re-evaluated: code and data, accompanying an eponymous article in Behavioural Public Policy.
While uploading our code, I (re)learned that a set of routines we used, called meta.ado, wasn’t available via ssc install meta. I must have known this at some point, but a few months passed between when we finished our statistical analyses and when we published the results on Code Ocean. Frankly, I had forgotten some of the details, and we hadn’t documented how we downloaded meta.ado. (As it happens, meta.ado is available via the metatrim package, but I didn’t know that at the time.)
There were a few possible solutions. The first was to just upload the .ado (i.e., the set of Stata routines) to my capsule’s /code folder; but in general, best practice is to separate dependencies from code. Intermixing the two can create confusion about what authors created specifically for a project and what they found and relied on, blurring the line between general tools and the code used to generate a paper’s unique claim(s).
Had I been preparing to publish my code on a static repository, I likely would have written instructions for how to find meta.ado with Stata’s built-in search functions. But this solution is hard to future-proof because such syntax can change over time, as can the URLs where dependencies are located.
In the end, I used the setup script (Code Ocean’s escape hatch for complex operations that fall outside the scope of the built-in package managers) to download the relevant file and place it where Stata was set to look for it.² Hereafter, anyone who reproduces our analyses avoids the headache I went through and, to boot, has an example of how to manage such issues for themselves.
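As a rough illustration of that kind of setup-script step, here is a minimal sketch in POSIX shell. The URL and the /root/ado/plus location are taken from footnote 2; the function names ado_dest and install_ado are hypothetical helpers invented for this example, not part of Code Ocean or Stata, and the sketch assumes curl is available in the environment.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: build the destination path for an .ado file.
# Files in Stata's PLUS directory sit in a subfolder named for their
# first letter (meta.ado goes under m/), matching the path in footnote 2.
ado_dest() {
  # $1 = file name (e.g. meta.ado), $2 = PLUS directory
  first_letter=$(printf '%s' "$1" | cut -c1)
  printf '%s/%s/%s\n' "$2" "$first_letter" "$1"
}

# Hypothetical helper: download one .ado file into the PLUS directory.
# --fail makes curl exit non-zero on an HTTP error instead of saving
# the error page as if it were the routine.
install_ado() {
  # $1 = URL of the .ado file, $2 = PLUS directory
  dest=$(ado_dest "$(basename "$1")" "$2")
  mkdir -p "$(dirname "$dest")"
  curl --fail --silent --show-error --location --output "$dest" "$1"
}
```

In a setup script, the call would then be install_ado http://fmwww.bc.edu/RePEc/bocode/m/meta.ado /root/ado/plus. Failing loudly on a bad download is a deliberate choice: if the URL ever moves, the environment build breaks visibly rather than leaving an empty or corrupt file for Stata to trip over later.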
I wouldn’t call this life-changing, but let’s say that it took me 15 minutes to solve this issue, and that I encountered about 8–10 similar minor issues, for a total of about two hours of labor. Let’s further suppose that each published compute capsule requires, on average, about that much work to fix all ‘it worked on my machine’ problems. Multiply that by A) the number of capsules on Code Ocean and B) the total audience who might wish to reproduce published analyses, and we’re starting to look at a lot of time saved and tedium forgone.
Export capsule: another insurance policy for package preservation
On a related note, we recently released the option to export capsules, which offers another layer of preservation and interoperability.
A core technology on which Code Ocean builds is Docker, software that lets you package up code and all its dependencies (in Code Ocean’s case, the code and data accompanying research articles), down to the level of the operating system, into what’s called a container.³ A Dockerfile is a formula for constructing that container. Each exported capsule comes with code, data, metadata, and a Dockerfile.
Additionally, all published capsules have an associated Docker image on Code Ocean’s public Docker registry. An image is a static snapshot of a Docker container. The upside of this is that for each published capsule, all necessary packages have been stored in these images; you don’t need to re-run the commands that installed them. This solution, therefore, is robust to URLs changing or packages disappearing.
To return to The contact hypothesis re-evaluated: to reproduce the results on your computer, there’s no need to download the meta package from its current home. Instead, if you have Docker installed and running, you can docker pull the relevant image.⁴ So if the meta package should ever go out of existence, we’ve got you covered; the version that this particular capsule used is archived and accessible regardless.
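For readers who want to script this step, a minimal sketch follows. The registry path, capsule ID, and version tag come from footnote 4; the function names capsule_image and pull_capsule are hypothetical, and the sketch assumes Docker is installed and that the registry still serves the image.

```shell
#!/bin/sh
set -eu

# Registry path as given in footnote 4.
REGISTRY="registry.codeocean.com/published"

# Hypothetical helper: assemble the image reference for a published capsule.
capsule_image() {
  # $1 = capsule ID, $2 = version tag (e.g. v5)
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
}

# Hypothetical helper: pull the image, then print its local image ID
# so the exact snapshot used can be noted alongside the results.
pull_capsule() {
  image=$(capsule_image "$1" "$2")
  docker pull "$image"
  docker image inspect --format '{{.Id}}' "$image"
}
```

Calling pull_capsule f152260c-bebb-4157-a640-44579452b4e4 v5 would fetch the image for this capsule. Pinning the version tag, rather than relying on a default, is what makes the snapshot reproducible.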
For some examples of Stata code on Code Ocean, see:
Code and Data for “A Practical Introduction to Regression Discontinuity Designs: Volume I”;
Should I Stay or Should I Go? A Behavioral Approach to Organizational Choice in Tajikistan’s Agriculture;
R/Stata code for: “Synthetic Control Method: Inference, Sensitivity Analysis and Confidence Sets”;
Reproducibility during peer review;
The contact hypothesis re-evaluated: code and data;
Who were the voters behind the Schulz effect? A long-term analysis of individual voter trajectories in the run-up to the 2017 German federal election (Wuttke/Schoen).
¹ This is an update to reg y x, robust.
² curl http://fmwww.bc.edu/RePEc/bocode/m/meta.ado > /root/ado/plus/m/meta.ado. Stata has built-in syntax for this as well: copy http://fmwww.bc.edu/RePEc/bocode/m/meta.ado /root/ado/plus/m/meta.ado.
³ This section is going to gloss over some of Docker’s technical details. For more information, see
⁴ In this specific case: docker pull registry.codeocean.com/published/f152260c-bebb-4157-a640-44579452b4e4:v5