Open Science, but not during submission

You’ve spent what is probably a significant fraction of your life solving problems and doing experiments for that project, and now it has finally progressed to the point where you think it is ‘publishable’. That’s when the other important part of a project’s lifecycle starts: the manuscript. The manuscript is too often only an afterthought to the science, but any seasoned scientist will tell you that it is the pragmatic final piece of what started as an abstract puzzle. Much good science has lost out to an unconvincing or confusing piece of text that fails to reveal and explain its findings to others. You see, it’s that old philosophical question: ‘If a tree falls in a forest and there is no one to hear it, did it still fall?’ You’d better write a darn good story of that tree if you want people to hear it.

Having been occupied with the preparation of 3 manuscripts over the last 4 months, I’ve noted quite a few lacunae in the standards journals use for accepting data during submission.

A Picture is worth a Thousand Dollars

In general, figures must be submitted as Adobe Illustrator, EPS or PDF files. Here are the relevant excerpts from the author guidelines of some top-ranking journals.

Cell : They should be either Photoshop or Illustrator files (in .tif, .psd, .eps, .ai, .pdf, or .jpg format) at 300 dpi resolution (for a figure 3 to 5 inches in width).

Nature : Acceptable formats include: AI, Vector EPS, layered PSD, postscript, PDF, PowerPoint, Word, Excel and CorelDraw (up to version 8)

Science : Electronic figure files at the revision stage must be in one of the following formats: Adobe Portable Document Format (PDF), PostScript (PS), or Encapsulated PostScript (EPS) for illustrations or diagrams; Tagged Image File Format (TIFF), EPS, PS, or PDF for photography or microscopy. Authors who have created their files using Adobe Illustrator or Adobe Photoshop should provide their files in these native file formats. We cannot accept files in other formats.

If you cannot see the problem here, let me elaborate. The accepted vector formats are either proprietary (like AI) or badly suited to digital access (like EPS and PS). The case is similar for the raster formats: PSD is proprietary, TIFF creates giant files, and JPG usually degrades quality because of its lossy compression. In most cases, even PDF means an Adobe PDF, which as we all know is another proprietary version of a more or less common format. Adobe’s PDF files often do not play well with software that reads the more open versions of this format. In any case, PDF is a distribution format, not a creation format.

This boils down to the fact that if you need to submit a manuscript with figures, you must purchase licenses for software that typically costs hundreds of dollars per user, and you end up with a locked-down format that has limited interoperability.

I am not going to complain without stating the solution. In fact, the solution has been widely agreed upon already. The standard format for creating and displaying vector graphics is Scalable Vector Graphics (SVG). The SVG specification is an open standard that has been under development by the World Wide Web Consortium (W3C) since 1999. Most open-source and proprietary vector illustration programs can open and edit SVG files. Even browsers can display SVG files, and these files can be animated using another open standard that is fast becoming the basis of the mobile and desktop web: HTML5. SVG is also XML-based, which means it can be parsed, and data can be extracted from it and embedded in it easily.
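To make the "it can be parsed" point concrete, here is a minimal sketch in Python using only the standard library. The tiny bar chart, its element ids and the extract_bars helper are all made up for illustration; a real figure exported from an illustration program would be messier, but the idea is the same.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A hypothetical two-bar chart; the ids and sizes are invented for this example.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="100">
  <title>Hypothetical expression levels</title>
  <rect id="control" x="10" y="60" width="30" height="40"/>
  <rect id="treated" x="60" y="20" width="30" height="80"/>
</svg>"""


def extract_bars(svg_source):
    """Return {bar id: bar height} for every <rect> element in the SVG."""
    root = ET.fromstring(svg_source)
    bars = {}
    # SVG elements live in the SVG namespace, so the tag must be qualified.
    for rect in root.iter(f"{{{SVG_NS}}}rect"):
        bars[rect.get("id")] = float(rect.get("height"))
    return bars


print(extract_bars(svg_text))  # {'control': 40.0, 'treated': 80.0}
```

The bar heights come straight out of the figure file with no proprietary software involved, which is exactly what a locked-down AI or PSD file makes hard.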

For raster data (images mostly), the sane choice is Portable Network Graphics (PNG). PNG was created as an improved, non-patented replacement for the Graphics Interchange Format (GIF), and is the most used lossless image compression format on the World Wide Web. Even a cursory read of the PNG specification shows that it was designed specifically to address the problems with formats like GIF, JPG and TIFF, and it is, as a result, far better suited to the job. As with SVG, PNGs can already be displayed in modern browsers without any other requirements.
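What "lossless" and "open" mean in practice can be sketched in a few lines of Python with only the standard library. This is a toy, not a real PNG library: the function names are mine, it handles only 8-bit grayscale images, and the decoder assumes the scanlines use filter type 0, which is what this encoder writes. In real work you would use an established imaging library instead.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def encode_gray_png(rows):
    """Encode a list of rows of 0-255 integers as an 8-bit grayscale PNG."""
    height, width = len(rows), len(rows[0])
    # IHDR: width, height, bit depth 8, colour type 0 (grayscale),
    # compression 0, filter method 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


def decode_gray_png(png):
    """Decode a PNG written by encode_gray_png (filter type 0 only)."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos, idat, width, height = 8, b"", None, None
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        if kind == b"IHDR":
            width, height = struct.unpack(">II", data[:8])
        elif kind == b"IDAT":
            idat += data
    raw = zlib.decompress(idat)
    stride = width + 1  # one filter byte plus one byte per pixel
    return [list(raw[y * stride + 1:(y + 1) * stride]) for y in range(height)]


pixels = [[0, 64, 128], [255, 200, 10]]
print(decode_gray_png(encode_gray_png(pixels)) == pixels)  # True
```

Every pixel value survives the round trip unchanged, and the whole format is documented well enough that a reader and writer fit on one screen.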

Both of these open formats have excellent free, open-source, cross-platform tools for creating content: Inkscape (vector illustration), GIMP (raster manipulation) and Scribus (layouts, like Adobe InDesign).

Science journals constitute the information distribution system of a community that prides itself on its rationality. Can someone educate me on why, then, they do not even accept the open and most convenient formats for the data that scientists produce?

If I had a penny for every Word…

The situation with the manuscript text is not too different. Most journals will accept only Microsoft Word documents, and sometimes only old-style .DOC files as opposed to the newer and slightly better .DOCX files. Everything that applies to proprietary figure formats also applies to proprietary text formats. I am not going to rant again, but simply state the solution (which already exists and is standardized). The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an XML-based file format for spreadsheets, charts, presentations and word processing documents. It was developed with the aim of providing a universal document format that could be used with any office software suite.
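Because an OpenDocument text file is a ZIP package whose main text lives in a file called content.xml, you can get at the words of a manuscript with nothing but a standard library. The sketch below, in Python, builds a stripped-down stand-in for an .odt file in memory (it is not a complete, valid ODF package, and build_toy_odt and read_paragraphs are names I made up) and then reads the paragraphs back out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def build_toy_odt(paragraphs):
    """Build a minimal ODT-like archive in memory (illustration only)."""
    body = "".join(f"<text:p>{escape(p)}</text:p>" for p in paragraphs)
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text>{body}</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", content)
    return buffer.getvalue()


def read_paragraphs(odt_bytes):
    """Return the text of every paragraph found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # Paragraphs are <text:p> elements; itertext() also collects nested spans.
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


manuscript = build_toy_odt(["Open Science", "Figures 1 & 2 are in SVG."])
print(read_paragraphs(manuscript))
# ['Open Science', 'Figures 1 & 2 are in SVG.']
```

The same read_paragraphs function should work on the bytes of a real .odt file saved from an office suite, although real documents carry styles, headings and lists that this sketch ignores.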

Again, there is fantastic free, open-source, cross-platform software available for creating these documents. I personally use LibreOffice. You could of course use Apache OpenOffice too. Even Microsoft Office can view, edit and save ODF files.

I remain as perplexed as before about why journals will not accept OpenDocument files as manuscript text.

Cite as you write

You’ve probably gotten the gist of this article by now. All I need to say is: why use EndNote when you can use Zotero?

In conclusion, scientists need to persuade journals to accept these open formats. Their openness liberates data for use by everyone, unrestricted by which licenses you can afford to purchase. In my view, this is just as important as the movement to have science journals allow public access to papers. The onus is on scientists to first embrace these open standards themselves (not so easy, but a story for another day).

One day, we might be able to read papers interactively: fit that Gaussian distribution on the fly, see that image of a galaxy in different contrasts, and mash up that population data from one report with the climate data from another, all in a browser. It’s already possible, just not implemented.