Reproducible science for people who don’t code
Science could be much more efficient if more scientists were encouraged to openly share data and methods along with their manuscripts, or even to publish so-called preprints before submitting to a journal.
Powerful solutions such as IPython/Jupyter are available to create and publish such reproducible documents. However, those interfaces currently require programming knowledge and have therefore failed to win over scientists who aren’t able to code.
How can we remove that barrier and enable any scientist to open up their data and methods?
Looking at the past 20 years, two types of user interfaces stand out when it comes to broad user acceptance: first, word processors (Microsoft Word, Google Docs), and second, spreadsheets (Microsoft Excel, Google Sheets). Neither requires programming skills.
Can we use the potential of 1 billion spreadsheet users and turn many of them into scientists who publish data and methods? Can we redefine those two successful interface paradigms to be interoperable and extensible?
Open Science Desktop
This post describes the idea of an Open Science Desktop (OSD), which serves as a workspace for creating and viewing data-driven reproducible publications. The idea is the result of discussions with my partner Oliver Buchtala and Stencila creator Nokome Bentley.
After starting the app and creating a new project, I can choose from four panels to incrementally develop my research and eventually publish it.
- Sheets: My data lives here (in spreadsheets). I can use formulas to experiment with the data.
- Functions: I implement my own functions here, in a programming language of my choice.
- Examples: I use real data to exemplify the methods I developed. Often the result of an example is a graphic (e.g. a plot).
- Manuscript: I write my actual scientific article here. I can use graphics generated in the examples for my figures and have them update live.
It starts with a spreadsheet
Spreadsheets are a powerful tool, not only for data management, but also for prototyping. It’s like drawing thoughts on a canvas. The sketch below shows an example of the Stencila Sheets interface.
I provide some setup, such as simulation parameters (intercept, slope, variation) as well as sample input values for X. I create random errors based on the provided variation. I also define a custom simulation function right in a spreadsheet cell. The function definition follows the syntax of an Excel formula: I can do basic arithmetic and call existing functions. I apply the simulation function to each input value and plot the results by calling the scatterplot function in a cell.
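For readers who do code, here is a rough sketch of what such a sheet computes, written in Python. The parameter values, the linear model and the reading of "variation" as a standard deviation are my assumptions for illustration; they are not taken from the actual sheet.

```python
import random

# Setup "cells". All values are hypothetical, chosen only for illustration.
intercept = 2.0
slope = 0.5
variation = 1.0           # assumed to be the standard deviation of the errors
xs = [1, 2, 3, 4, 5]      # sample input values for X

# One random error per input value, drawn with the provided variation.
rng = random.Random(42)   # fixed seed so the numbers can be reproduced
errors = [rng.gauss(0, variation) for _ in xs]


def simulate(x, error):
    """Assumed linear model: intercept + slope * x + error."""
    return intercept + slope * x + error


# Apply the simulation function to each input value.
ys = [simulate(x, e) for x, e in zip(xs, errors)]

# The (x, y) pairs that a scatterplot cell would draw.
points = list(zip(xs, ys))
```

Each variable plays the role of a cell or a column, and the list comprehensions stand in for dragging a formula down a column.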
My own functions
I am pretty happy with the results, but as with most spreadsheets, after some hacking, things look a bit messy. It’s time to organise my findings, extract a function definition, and implement it in a real programming language of my choice.
OSD provides a simple code editor for implementing custom functions. I can choose one of the supported languages and start coding. In my dream there’s also a built-in visual debugger for each supported language.
I’m going to replicate what I produced in the spreadsheet before, but this time in easy-to-follow steps that document my methodology.
This interface is inspired by Jupyter. WYSIWYG text editing is available, and code is kept very high-level, which means only basic arithmetic and function calls are available (cf. Excel expressions). Examples are considered the glue, with the intent of being as human-readable as possible. Everything that requires serious programming effort is better off living in a function. It doesn’t matter which language it is implemented in. Use the language that best fits the problem, and mix them. Again, I’m dreaming of setting breakpoints and then debugging an example that calls functions implemented in different languages.
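To make the split between functions and examples concrete, here is a minimal sketch in Python. The names simulate_linear and mean_y, their parameters and all values are hypothetical. For simplicity both parts use the same language, whereas in OSD the function could be written in any supported one.

```python
import random

# --- Function part: anything that needs real programming lives here ---

def simulate_linear(xs, intercept, slope, variation, seed=None):
    """Return (x, y) pairs with y = intercept + slope * x + random error.

    The error is drawn from a normal distribution with standard deviation
    `variation` (an assumption made for this sketch). Passing a seed makes
    the result reproducible.
    """
    rng = random.Random(seed)
    return [(x, intercept + slope * x + rng.gauss(0, variation)) for x in xs]


def mean_y(points):
    """Average of the simulated y values."""
    return sum(y for _, y in points) / len(points)


# --- Example part: high-level glue, readable step by step ---

# Step 1: choose the input values.
xs = list(range(1, 11))

# Step 2: run the simulation with the chosen parameters.
points = simulate_linear(xs, intercept=2.0, slope=0.5, variation=1.0, seed=7)

# Step 3: summarise the result.
average = mean_y(points)
```

The example part contains nothing beyond assignments and function calls, which is roughly the level of expressiveness of a spreadsheet formula.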
Last but not least, I will write the actual manuscript. Below you see the interface of Texture, a community-developed scientific editor. Through this editor, as an author, I will create valid JATS (the de facto standard for archiving and interchange of open-access scientific content in XML). Within OSD, I’ll be able to create figures based on the graphics generated by the examples.
I believe that scientific manuscripts haven’t changed much for a reason. They are a great way to make a literal argument for scientific findings. They are static and can be printed, which is great. I would not put too many (if any) ‘data-driven’ parts into the manuscript itself, as this would mean losing its static nature, which would be a great loss.
What is unique about this idea?
- A desktop application provides not only a user-friendly authoring environment but also a responsive runtime environment for computations. No servers need to be scaled, and there is no longer a need for online services (which are often gatekeepers or at least a single point of failure). The default installation of OSD would include commonly used environments such as Python, R, Java and Node.js. It’s crucial that the development experience matches native development on a local machine.
- The manuscript I wrote is a literal description of my scientific argument. It is shared as a static web page and optimised for reading. By providing this classic scientific paper, I make sure my research is readily accessible: Any browser can display it, no servers, databases etc. are needed to view it.
- In addition to the published static manuscript, an open science archive file can be downloaded and opened in OSD. Since OSD provides a functional runtime environment, readers can run the simulations on their computer, use their own data, change input values etc.
- The functions I created as part of the publication are reusable and documented. If other scientists have my publication in their library, they can also use the functions in their projects.
- Examples (also known as notebooks) are, for me, a perfect addition to the scientific manuscript, not a replacement. First you read the argument and look at the figures; later you dive into the methods used by the author, experiment with them (e.g. run simulations with your own data), and improve and reuse them.
What do you think? Please respond via comments on Medium or use Twitter.
This post was inspired by a Chan Zuckerberg Science workshop on the future of scientific publishing.
Giuliano Maciocci of eLife was looking at it from the reader’s perspective. He wrote a response proposing Progressive Enhancement to allow navigating from a static, to an interactive, to a reproducible view of the content.