I released all the code for my publication
Why exposing my code is better than hiding it
Open code is worth the effort
Results are processed, reviews are in, the paper is a go. Why on earth would I spend an extra 8+ hours to clean, comment, and commit my code to GitHub for something that’s technically “done”?
Because science deserves better than shoddy code, or worse, no transparency at all...
Honestly, it would be hypocritical of me to release code that is illegible and difficult to run with “all” the complaining I do about the lack of reproducibility in (computational neuro)science.
I get frustrated when “code is publicly available” and all that code does is produce one part of the first figure. Sure, I can tell the authors have put effort into that piece of code and have even included the generated results in an image, but I just expect more from a discipline that espouses peer review. Theoretical and computational fields have the potential to be even more accessible by sharing not just a virtual experiment’s methodology but also its implementation. This accessibility is especially important when compared to many fields that are restricted by expensive experimental equipment.
What are the minimum requirements for code to meet the specifications of open science?
To many, open science doesn’t just mean logistically accessible, but also intellectually accessible. That is, just because code can be run again by calling a command doesn’t mean it is actually useful for testing hypotheses.
To be more useful than merely re-generating existing results, well-documented code helps with
- understanding the nuances of algorithms from the paper
- changing parameters, which are clearly labelled, for ranges outside the ones in the figure
- extending the model to include one’s own area of interest
- comparing the model to other similar ones
- learning a field by seeing how algorithms that are barely expanded upon because they are “common knowledge” are implemented
- providing a blueprint for attempted independent implementation of the algorithm
- examining the development process by following the history of a repository’s commits
In addition, taking the time to clean one’s code often results in
- catching errors and bugs through re-examination
- improving code efficiency and implementation
“Open science” should stop being a term. In this day and age, science should be open and accessible by default. In the same vein that you submit a manuscript for review by peers, code should also be up for inspection and, crucially, for extending ideas. Some journals are working towards this better than others in an effort to change mindsets.
What my code looked like
The code started as a complete mess.
Files in a single directory. Loads of methods in a few files. And a severe lack of commenting.
Furthermore, it was all in a subdirectory of a bigger code base for a larger project.
This was not in the spirit of open science — sharing code for reproducibility. Just because it ran doesn’t mean it was particularly good for understanding what was going on.
It gets worse.
The code was started in NEURON’s C-like language HOC for defining, running, and saving the neuron model, with analysis of the output files in MATLAB. This seemed reasonable in 2015, as the Python interface for NEURON was aimed at *nix operating systems while I had a Windows laptop, and the lab used MATLAB scripts to analyse experimental data — experience I could leverage. At the end of 2017, a Linux machine was born and I wanted to use more Python to reduce the number of languages I was using (Python was used for my machine learning models, JavaScript for web, Java for another project, and HOC/MATLAB for this research project).
Hesitant to rewrite all the HOC models in Python, I instead started using Python as a scripting language to call HOC files to do the heavy lifting. I had spent a lot of time writing what I already had — both in NEURON and in MATLAB — so my philosophy was to keep what works but wrap it in a nicer language.
Plus, I was the only one using this code base: “how bad could it get?”
Extending the experiments was done by cobbling together configuration files in Python that in turn configured NEURON, and by wrapping the principal running of the simulation with Python-based parameters.
Yeah, it got bad…
Fortunately for code quality, a seemingly simple request by a collaborator led to a new Python file that did all the simulation setup and analysis.
Further experiments were thus increasingly set up in Python. And the more proficient I became over the years, the quicker and cleaner the code became.
But the results-first approach of my research meant the code was poorly maintained.
The speed of results was traded for code debt. Once publicly released “into the wild”, this code debt is no longer just a personal problem: it slows the speed of understanding the model and creates intellectual debt — a waste of intellect on understanding what could have been better structured and explained. If I wanted my results to be interrogated and built upon, the code needed to improve.
How I made the code publishable
My first step was to group file types into folders.
Python files were placed in the almost-root, src, because these were used as the entry into experiments, script or otherwise. HOC files, you go to hoc_files (note that a hoc directory would cause name conflicts for PyNEURON). NMODL files (a language to specify the behaviour of channels in a neuron), you go to mod. MATLAB files, you go to matlab.

Then, for each of these folders, subdirectories were created based on file function. Neuron models go to hoc_files/cells, synapse specifications to hoc_files/synapses, etc. Utility files and functions were sent to utils.
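For concreteness, here is roughly how that grouping could be written out. The top-level names (src, hoc_files, mod, matlab) come from the description above; the exact nesting of the utils and config folders is my illustrative guess, not the repository's actual structure.

```python
# Illustrative only: the folder grouping described above, expressed as paths.
# Top-level names come from the text; the nesting of utils/config is assumed.
from pathlib import Path

LAYOUT = [
    "src/",                  # Python entry points into experiments
    "src/utils/",            # shared helpers (file I/O, NEURON interface, plotting)
    "src/config/",           # settings and shared initialisation (e.g. INIT_NEURON)
    "hoc_files/cells/",      # neuron model definitions
    "hoc_files/synapses/",   # synapse specifications
    "mod/",                  # NMODL mechanisms (channel behaviour)
    "matlab/",               # MATLAB analysis scripts
]

for folder in LAYOUT:
    print(Path(folder))      # or Path(folder).mkdir(parents=True, exist_ok=True) to scaffold
```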
For Python, an IDE with strong refactoring* capabilities comes in handy. This was also the messiest area. Because experiments were contained to files, I changed it so that setup, simulation, analysis, and plotting were separated. In turn, I tried to contain each experiment to a single folder, with a corresponding file in src that was bare-bones and principally ran the experiment with just a few (< 50) lines. Methods common across experiments were moved to a utils folder and grouped into things like file manipulation (e.g. saving, loading), neuron interface methods, common plotting helpers, etc. Using an IDE’s refactoring, references to methods are kept when moving them between files/folders.
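As a rough sketch of what such a bare-bones src-level file could look like (the function and module names here are made-up placeholders, not the repository's real ones; in practice setup/simulate/analyse/plot would be imported from the experiment's own folder):

```python
# Hypothetical src-level entry script. Stub functions stand in for what would be
# imports from the experiment's folder, e.g.
#   from experiments.example_experiment import setup, simulate, analyse, plot
import argparse

def setup(args):            # gather clearly labelled parameters
    return {"duration_ms": args.duration}

def simulate(params):       # the heavy lifting (NEURON, in the real project)
    return {"trace": [0.0] * int(params["duration_ms"])}

def analyse(results):       # summarise the raw output
    return {"n_samples": len(results["trace"])}

def plot(summary):          # one plotting call per figure/subplot
    print(f"would plot {summary['n_samples']} samples")

def main() -> None:
    parser = argparse.ArgumentParser(description="Run one experiment end to end.")
    parser.add_argument("--duration", type=float, default=100.0,
                        help="simulation time (ms)")
    args = parser.parse_args()
    plot(analyse(simulate(setup(args))))

if __name__ == "__main__":
    main()
```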
In addition, a config.settings file kept some constants and configurations that would be generally useful, and a config.shared file allowed a consistent NEURON experience by automatically compiling NMODL files, loading utility HOC files, etc., just by calling INIT_NEURON().
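A minimal sketch of what such a shared initialiser could look like, assuming NEURON's Python package is installed; the function name INIT_NEURON and the mod/hoc_files directories follow the text above, but the body is my illustration rather than the actual config.shared code:

```python
# Sketch of a shared NEURON initialiser (illustrative, not the paper's actual code).
import subprocess
from pathlib import Path

from neuron import h  # requires NEURON with Python support installed

_INITIALISED = False

def INIT_NEURON(mod_dir: str = "mod", hoc_dir: str = "hoc_files") -> None:
    """Compile NMODL mechanisms and load shared HOC utilities, once per session."""
    global _INITIALISED
    if _INITIALISED:
        return
    # Compile the NMODL files; nrnivmodl writes a platform-specific build directory.
    subprocess.run(["nrnivmodl", mod_dir], check=True)
    # (On some platforms the compiled library must then be loaded explicitly,
    #  e.g. via h.nrn_load_dll with the path to the built libnrnmech.)
    # Load NEURON's standard run system, then any shared utility HOC files.
    h.load_file("stdrun.hoc")
    for hoc_file in sorted(Path(hoc_dir, "utils").glob("*.hoc")):
        h.load_file(str(hoc_file))
    _INITIALISED = True
```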
With files in the right place, it was important to me to increase my comment-to-code ratio so that my code wasn’t just nicely grouped for re-running, but actually understandable by another human (and by me in 6+ months’ time). Along with in-line comments, effort went into docstrings and into renaming variables/methods to be self-documenting.
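For example (a made-up function, not one from the paper's code), renaming arguments and stating units in the docstring turns a terse helper into something self-documenting:

```python
# Before: def calc(v, g, e): return g * (v - e)   # what are v, g, e? what units?

def channel_current(membrane_potential_mV: float,
                    conductance_uS: float,
                    reversal_potential_mV: float) -> float:
    """Ohmic channel current in nA for the given driving force.

    Spelling out the argument names and units means a reader does not have to
    cross-reference the methods section to understand the line that calls this.
    """
    return conductance_uS * (membrane_potential_mV - reversal_potential_mV)
```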
It is a good idea to create a new environment for each project (using virtualenv or conda, for example) so that package requirements can be tracked. It also means a person trying your code on their computer can create a new environment that doesn’t pollute or conflict with their existing one. You can export the environment using conda env export --from-history > environment.yml (--from-history means only packages you’ve explicitly asked to be installed will be in the file) or pip freeze > requirements.txt.
Finally, in this modern age, an interactive notebook (like Jupyter) makes walking through some code to produce results significantly easier than files in a directory plus a separate readme. One limitation in this case was that some analysis was done in MATLAB, so those simulations and results were omitted. I know MATLAB-Python bindings exist, but I also wanted the notebook to be runnable in a service like Binder to reduce friction. Create a pull request if this is something you want.
Although logging is inconsistent (HOC files have their own logging, while I use Python’s logging module), at least figures from the paper are reproduced. More precisely, full figures from the paper are not reproduced, but the subplots within them are; some like to combine these in something like Illustrator or Inkscape. My other projects, fortunately, generate full figures in Python (lettering and all), needing only minor tweaks in vector-editing software.
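On the Python side, the logging setup is nothing more exotic than the standard library module; a minimal configuration along these lines (the logger name and format are my illustrative choices) is enough to keep simulation output consistent:

```python
# Minimal Python logging setup (illustrative; the HOC side logs separately).
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("simulation")

logger.info("starting run with dt=%s ms", 0.025)  # example message only
```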
Now we have a readme, an explanatory notebook, files in a clear structure, comments for files and methods, and an easy approach to reproduce figures using either the notebook or the src-level Python files that can be called in the terminal.
It’s not perfect; it never will be, and that’s okay.
There’s a lot of history to the code. We’ve been through a lot together. So much was learnt along the way and there are loads of assumptions built into the code. But it’s out there. Although it’s scary to have it all bare for anyone to see, it’s also what is best for science. And that is what is more important to me.
*refactoring = restructuring code so that it functions the same but has improved speed, legibility, and/or extensibility.
Lessons
- Comment your code. A lot. Comment too much rather than not enough. Use self-documenting code, write file headers, reuse readme notes in your comments and vice versa.
- Structure your files and methods. Do it early. Restructure often.
- Prevent societal intellectual debt by preventing personal code debt.
- Test your code thoroughly. Formal unit tests are better than informal figures. Figures are better than not testing at all.
- Time spent now is time saved later. There will always be a rush. There will never be enough time. Prioritise legibility as much as possible.
- Plan the layout of your project.
- Prepare to throw away the plan if your requirements change.
- Your code is valuable. Your code is precious. But what’s the use in hoarding it?
A note on scooping
Because of the perils of the “publish or perish” model of science (ugh), scientists have a fear that their work will be published by someone else. Fair enough. There is a larger discussion about this to be had, but here I will claim the following:
- Releasing your code with the right licence restricts others from publishing results related to your work without due credit.
- Releasing your code to an online repository gives a credible timestamp of intellectual originality.
- Allowing others to build upon your work makes it easier for new results to be generated. Results which, given any decent scientist, will cite your work.
- If you are unduly scooped, your work is clearly important enough to be scooped in the first place (not sure mine meets that criterion).
- Releasing part of your code is better than releasing none of it.
- Releasing opaque code is nasty: it is the antithesis of open science while claiming to be open. That is, code that is available but illegible can claim to be (logistically) accessible but is preventing others from using it.
- Scientists want to collaborate more than is currently possible.
- Fostering a “mine” attitude is anti-science. Well, at least science for the general betterment of mankind. And what use is selfish science?
- Laziness is no excuse. You’ve put how much effort into writing your manuscript and producing good figures? You’ve spent hours on a presentation or poster of this work to show others. Surely producing code you’re proud of deserves similar love. And it is more reusable than any paper or poster you’ve made.
Fin
These are the reasons why I decided to release my work.
Maybe you don’t feel as comfortable doing so. I challenge you to overcome that insecurity. I admire those who let themselves be vulnerable.
Maybe you’re waiting for the code to be “perfect”. I challenge you to release it anyway. Done is better than perfect.
Maybe there are more obstacles to sharing your work. I challenge you to share what you can. As a scientist, you’re a professional problem-solver.
Maybe your collaborators or supervisor don’t want the code online. I challenge you to convince them otherwise. Foster an open environment.
Sharing your code can be done. I would argue it should be done by default.
There may be exceptions, of course. But I think they should be treated as such: exceptions.
I think science can do better.
I think we all deserve better.
Thanks to Grace Lindsay and Kira Düsterwald for their helpful suggestions and editing.
The code in question is for the paper “Chloride dynamics alter the input-output properties of neurons”, published in PLoS Computational Biology
(link to paper pending)
GitHub repo:
Personal website:
Photos from Unsplash.com