Keep Your Damn Lawyers Out of My Notebooks
Open science good, FOIA snooping bad.
It’s not complicated. But it’s not easy.
There’s been quite a lot of controversy lately about the extent to which a scientist’s work on the public dime should be public property.
Two different things are being conflated. On the one hand, there’s a very legitimate quest toward scientific openness and away from a proprietary attitude to publicly funded data. On the other, there’s a bizarre insistence that every breath taken by a scientist should be subject to public scrutiny.
The latter is well-known as a tool of harassment against scientists operating in controversial areas. I won’t revisit it all. Readers unfamiliar with the various debacles are invited to Google FOIA Science.
I argue here that the way to resolve the issue is simple in principle but for technical reasons difficult in practice.
I — KEEP YOUR LAWYERS OUT OF MY NOTEBOOKS
The question is this (speaking as a publicly funded scientist, which these days I’m not, but please bear with my scenario): at what point does my monkeying around in Matlab at State U become something that a Mr J Random Harasser is entitled to look at?
The FOIA principle seems to be that if I did it on state equipment, it is everyone’s to examine and mock the instant I did it. Certainly the instant I use email or a telephone to discuss it.
People who can’t imagine the chilling effect that has on scientific exploration seem to lack any imagination whatsoever. Making every little mistake potentially career-threatening may be how politics works these days but that’s no reason to export it to science, where it’s not the bad moves but the good ones that matter.
II — EXPECT SCIENTISTS WRITING CODE TO LEARN HOW TO TEST AND DEMAND PLATFORMS THAT ALLOW IT
On the other hand, if my computational experiment holds water and leads to me making an assertion in a formal publication, it costs me nothing to make my Matlab routine public, and indeed the entire review process (or whatever we replace it with) ought to insist upon it.
People have been joking that it shows nothing to run the same code on the same data and get the same answer.
Here I tediously insist upon an important technical point that is somewhat tangential to the point of most of the public discussion of these matters. But I think it’s important.
Running the same code with the same data and getting the same result does show something, because in scientific coding this anticipated triviality is not the usual case.
People using modern test-driven software development techniques in commercial enterprises take reproducibility at this level so much for granted that it’s not really explicit. But test-driven development in scientific software is dramatically harder, because the same code will not pass the same bit-for-bit tests on every compiler, optimization level, and platform, even if it is logically correct.
The reason is that science is constrained to using floating point numbers. Floating point addition is not associative, so minor compiler optimizations that reorder operations can change the last few bits of a result without changing whether the result is valid.
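For readers who haven’t been bitten by this, here is a minimal sketch in Python of the underlying effect. It is only an illustration, not anything a particular compiler does: I regroup the additions by hand, which is the kind of rearrangement an optimizer may perform on your behalf. The variable names are mine.

```python
import math

# Same three numbers, same two additions, different grouping.
# A compiler that reorders operations can silently switch between these.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# A bit-for-bit test distinguishes the two results...
bitwise_equal = left == right

# ...while a tolerance-based test treats them as the same answer.
close_enough = math.isclose(left, right, rel_tol=1e-12)

print(repr(left), repr(right))
print("bitwise equal:", bitwise_equal)
print("equal within tolerance:", close_enough)
```

Both answers are valid to within rounding, which is exactly the problem: a test that demands exact equality fails for no scientific reason, and a test that allows a tolerance has to pick that tolerance somehow.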
I can go on at length about how and when this sort of thing bites the practitioner in practice.
This irreproducibility is quite unnecessary. Compiler writers are rewarded for making things go fast. If they can’t change the order of operations, they don’t get to do their thing. But they’re doing more harm than good.
I’d rather pay twice as much for a computation I can repeat. But that’s because I’m the rare person who knows something about software engineering AND something about high performance scientific software. Most scientists just think computers are unreliable and difficult.
Let me put it this way — if you run the same code on the same data and you FAIL to get the same answer, does this cause you concern? I suggest that it should.
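One way to make that question concrete is to fingerprint the exact bits of a result and compare fingerprints between runs. The following is a rough sketch under my own assumptions: `fingerprint` and `model_run` are hypothetical names, and `model_run` is a toy stand-in for a real computation.

```python
import hashlib
import struct

def fingerprint(values):
    """Return a SHA-256 hex digest of the exact IEEE 754 bit patterns."""
    # Pack as little-endian doubles so the digest depends only on the bits.
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()

def model_run(data):
    """Hypothetical stand-in for a real computation: a left-to-right sum."""
    total = 0.0
    for x in data:
        total += x
    return [total]

data = [0.1, 0.2, 0.3]

# Same code, same data, same order: the fingerprints should match.
first = fingerprint(model_run(data))
second = fingerprint(model_run(data))

# Same data summed in a different order, as a reordering optimizer might do.
reordered = fingerprint(model_run(list(reversed(data))))

print("repeat run matches:", first == second)
print("reordered run matches:", first == reordered)
```

If the repeat run ever fails to match on the same machine with the same build, something is wrong that has nothing to do with rounding, and that is worth knowing.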
III — PUBLISH WHEN YOU PUBLISH
Which brings us back to the question of what is the public’s business and what is my own. Once I publish a figure, it costs me nothing to make the data and a script which generates the figure available for download. It costs me nothing, that is, except possibly the exclusive access to my data.
The data should be mine and mine alone until I publish research based upon it. At that point it should be public, and everything between the raw data and the final figure should be public as well, with a pointer to the FTP site right in the paper.
I realize there are cases where data is partly private or proprietary or even classified. The rule has to be more complex in such cases, admittedly.
Those cases are not common in climate science, though, nor in many other disciplines.
Where this doesn’t apply, the rule I’d propose is simple. My correspondence (except regarding funding, which is a legitimate FOIA target) is my own. My notebooks, my scratch disk, my whiteboard, my chats with colleagues, my clever witticisms, and that stupid thing I thought was funny at the time that I wish I hadn’t said, these are all mine.
The data I collect are mine until I publish.
When I publish, every graphic or table that I publish should be reproducible to the last pixel. And the scripts (and code and compiler flags and all) to do that ought to be public.
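What "code and compiler flags and all" might look like in practice is a small manifest published alongside each figure. This is a sketch of one possible shape, not an established standard: the function name, the field names, the script name, and the `-O0` flag are all illustrative assumptions, and the data bytes are a placeholder rather than real data.

```python
import hashlib
import json
import platform

def build_manifest(data_bytes, script_name, build_flags):
    """Collect what a reader would need to try to regenerate a figure."""
    return {
        "script": script_name,
        # Hash of the raw input, so readers can confirm they have the same data.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        # Flags used to build any compiled code behind the figure.
        "build_flags": list(build_flags),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }

# Placeholder bytes standing in for the contents of a raw data file.
example_data = b"placeholder bytes standing in for a raw data file"

# "make_figure_1.py" and "-O0" are hypothetical examples.
manifest = build_manifest(example_data, "make_figure_1.py", ["-O0"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

A manifest like this does not make a result bit-for-bit reproducible by itself; it only records enough that a failure to reproduce can be traced to a difference in data, code, flags, or platform.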
IV — FINAL GRUMBLE
Compilers and platforms which don’t support backward bit-for-bit compatibility are unsuitable for this proposed standard and thus unsuitable for the progress of computational science. Unfortunately, those are the ones we’ve got. So even though you may support my rule, you should not expect me to comply with it overnight.