Open access and the versioning issue — do we need to solve this?

Published in Academic librarians and open access, Jan 12, 2018

One of the major issues with institutional repositories is that it is difficult to get researchers to self-deposit their work. Assuming one could wave a magic wand and solve that, institutional repositories still have another barrier to overcome — the discovery barrier.

With content scattered across thousands of sites, one would need an aggregator site to provide a one-search across all of them.

Fortunately, institutional (and subject) repositories were designed not only to collect deposits locally; it was also envisioned that aggregators could be built to centralize all this work using OAI-PMH. That said, even with this in place, aggregating the contents of open repositories proved not to be a simple thing.

OAI-PMH

While OAI-PMH easily allows aggregators like BASE/CORE etc to crawl and pull in the metadata from the world’s open repositories, the difficulty lay in the fact that the metadata from all these sources could not be easily combined due to the lack of consistency in standards.

In this post, I discuss the two issues that result from this, one I have already discussed in the past (distinguishing metadata only records vs full text records) but the other (determining version of paper available) is one that has recently become important.

The first difficulty — metadata only record or free full text available?

The foremost issue stemming from this is a problem that sounds odd to outsiders — open repository aggregators have problems figuring out if what they harvested is open access!

The problem is that open repositories were envisioned to be filled with 100% open access content, similar to arXiv, so no one envisioned the need for a standard tag to indicate whether open full text was available (there is actually a fairly new NISO standard on this now). But in fact today most repositories, particularly institutional ones, are a mix of open access items and metadata-only records, which leads to the situation we have today.
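To make the gap concrete, here is a minimal sketch of what an aggregator has to work with after harvesting a plain Dublin Core (oai_dc) record over OAI-PMH. The sample record, the repository URL and the two "hints" are my own illustrative assumptions, not rules used by BASE, CORE or any real aggregator; the point is only that nothing in the record reliably says whether full text is attached.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for the oai_dc metadata format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical harvested record (made up for illustration).
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example paper</dc:title>
  <dc:identifier>https://repository.example.edu/handle/123/456</dc:identifier>
  <dc:rights>All rights reserved</dc:rights>
</oai_dc:dc>
"""


def fulltext_hints(record_xml):
    """Return weak, guessable hints that a record might have open full text."""
    root = ET.fromstring(record_xml.strip())
    identifiers = [e.text or "" for e in root.findall("dc:identifier", NS)]
    rights = [e.text or "" for e in root.findall("dc:rights", NS)]

    hints = []
    # Assumption: a direct link to a PDF suggests a file is attached.
    if any(i.lower().endswith(".pdf") for i in identifiers):
        hints.append("identifier ends in .pdf")
    # Assumption: some repositories write "open access" into dc:rights.
    if any("openaccess" in r.replace(" ", "").lower() for r in rights):
        hints.append("rights mention open access")
    return hints


print(fulltext_hints(SAMPLE_RECORD))  # no hints for this metadata-only record
```

An empty list here does not prove the record is metadata only; it just means the aggregator cannot tell from the metadata alone.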

BASE — one of the biggest OA aggregators

For instance, of all the items BASE indexes from repositories, only 40% are clearly labelled as open access. My own tests with tools like OA Button that rely heavily on BASE also show quite a few false positives, so clearly this isn’t an easy problem.

I’ve written about this problem many times, and it also plagues similar tools like Unpaywall and Open Access Button, which are tasked with figuring out whether a certain paper is available in full text in the various open repositories. This is particularly so if they rely on OA aggregators like BASE.

This is of course a big issue, but solutions exist. ResourceSync (unlike OAI-PMH, which considers only metadata) is a possible new solution, but not many repositories support it yet. Aggregators that mimic Google Scholar’s crawlers and actually download the PDF to check are perhaps a better solution. For example, oaDOI is starting to do this instead of relying just on BASE.

“There are a few challenges to using BASE data. One of the most important, we already solved: they frequently do not know whether a record actually has fulltext available. So we go to the IR and actually check. That way, when we say there’s an OA copy somewhere, we can be sure there really is. We have downloaded the PDF and we know for sure.” — Jason Priem, oaDOI google group Nov 4, 2017
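The "go and actually check" idea can be sketched very roughly as below. This is not oaDOI's code; it is a minimal illustration that relies only on the fact that PDF files carry a "%PDF-" header near the start, so a login page or landing page returned in place of the file can be told apart from a real PDF. The function names are my own.

```python
import urllib.request


def looks_like_pdf(payload: bytes) -> bool:
    """True if the first bytes of a download contain the PDF header marker."""
    # The header is normally at byte 0, but a little leading junk is tolerated.
    return b"%PDF-" in payload[:1024]


def fetch_head(url: str, size: int = 1024) -> bytes:
    """Download only the first `size` bytes of whatever the URL returns."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read(size)


# Offline demonstration with two made-up payloads.
print(looks_like_pdf(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj"))
print(looks_like_pdf(b"<!DOCTYPE html><html><body>Please log in</body></html>"))
```

In practice one would call looks_like_pdf(fetch_head(url)) for each candidate link in a repository record, with politeness delays and error handling that are left out here.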

The second difficulty — what version is harvested?

A secondary problem is also implied by this but is seldom mentioned. Open access aggregators have problems not only telling whether there is full text; they also can’t easily tell what version of the paper is in the repository.

This is in fact a potentially harder problem than the first.

The versioning problem — why it is important

Let’s shift gears and look at a recent feature release by Web of Science, which rolled out an improvement that allows users to filter results by type of open access using oaDOI technology (which is used in Unpaywall as well).

While it’s not new for Web of Science (and Scopus) to allow filtering to open access articles, up to now this has covered only Gold (and possibly hybrid) articles, not Green OA.

Filtering options in Web of Science (Dec 2017 release)

This is a pretty big deal in my opinion. It’s one of the first searches I’ve seen that not only labels what is open access but whether it is Green or Gold. That alone is interesting.

I’m unsure if this will give a big boost to Green OA discovery, as my perception is that Web of Science is losing its importance for discovery purposes (though it is still important for signalling prestige and for bibliometric purposes), but one still wonders if this will start a wave of similar integrations in other databases.

The different types of Green OA linked in Web of Science

But this labelling goes even beyond telling you if it is Green or Gold: it tells you the version of the paper for Green OA!

Unlike oaDOI, which surfaces all versions of Green OA it can find (but notably not from ResearchGate or Academia.edu), Web of Science only links to Green Accepted (labelled as “Free Accepted Article From Repository”) and Green Published (labelled as “Free Published Article From Repository”).

A note about version terminology

I’ve always been a little confused by the various terms used to describe the different versions of papers. Here’s a taste: depending on the discipline, you will run into terms like preprint, postprint, author accepted manuscript, working paper, version of record (VoR) and so on.

But in general, I think there are three major versions that are important:

1. Version of record / final published version. This is the version that is nicely formatted with page numbers and is the version you find in subscribed databases or journals.

2. Author accepted manuscripts (sometimes also called postprints). These are versions that have gone through peer review, with the authors having made amendments in response to reviewers’ comments, and that have been accepted for publication. There might still be minor editing to text and references after this point, but the substantive content is already there.

3. Anything prior to #2. These can be versions that have not yet been submitted for peer review (and may never be), versions submitted for peer review but not yet updated in response to it, and so on. Sometimes they are called preprints or working papers.

There is quite a bit of confusion here due to disciplinary differences and outdated terminology that no longer serves its purpose (e.g. preprint). For instance, I was watching a Center for Open Science webinar on preprint servers in which it was stated that preprints include anything up to the version of record.

Still, I think consensus is starting to form around the idea that versions #1 and #2 are qualitatively different from the other versions because they have gone through peer review, and as such, despite minor stylistic differences, are essentially the same “thing”. This has been recognised by Crossref, which recommends that these two versions be assigned the same DOI, while a preprint should be assigned a different DOI (but the two DOIs can be linked with a relationship such as isPreprintOf).

It seems that Web of Science, by only surfacing links for “Green Accepted” and “Green Published”, is consistent with this position and has chosen not to include links to other versions (e.g. preprints), which are seen as different?

How does one determine the version of a paper in the IR?

Bianca Kramer, a leading librarian in scholarly communication, noted a discrepancy between the percentage of Green OA reported in Web of Science and the figure reported by the developers of oaDOI in their recent paper, The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles.

@oaDOI_org @researchremix @jasonpriem @clarivate Comparing current OA numbers in WoS with analysis of WoS sample in https://t.co/ikg9xHFuVa — why would % green be so much less now? (3.9% for articles+reviews 2009–2015, 11.5% in similar sample in preprint data) pic.twitter.com/WfwjnSPGmX
— Biⓐnca Kramer (@MsPhelps) December 11, 2017

One of the reasons for the lower percentage is that Web of Science doesn’t link to all Green OA versions but only to a subset, as already discussed. But is that the full story?

I asked the next logical question.

Btw how does oadoi determine version? If I’m a IR that want to signal a paper is a accepted or published version, what should I put in?
— Aaron Tay (@aarontay) December 11, 2017

Bianca Kramer pointed out to me that the oaDOI v2 API refers to the DRIVER Guidelines v2.0 VERSION standard. This dates back to 2007, which seems a bit dated. There’s also the NISO JAV (Journal Article Versions) standard, but I’m currently unsure how widespread support for that is either.
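As far as I understand the DRIVER vocabulary, a repository signals the version by adding a term such as info:eu-repo/semantics/acceptedVersion to dc:type. The sketch below shows how a harvester might read that value from the dc:type fields of a record. The term list reflects my reading of the guidelines and should be checked against them, and the function name is my own.

```python
# Assumption: version terms defined by the DRIVER Guidelines v2.0 vocabulary.
DRIVER_PREFIX = "info:eu-repo/semantics/"
DRIVER_VERSIONS = {
    "draft",
    "submittedVersion",
    "acceptedVersion",
    "publishedVersion",
    "updatedVersion",
}


def driver_version(dc_types):
    """Return the DRIVER version term found among dc:type values, or None."""
    for value in dc_types:
        if value.startswith(DRIVER_PREFIX):
            term = value[len(DRIVER_PREFIX):]
            # Other terms (e.g. "article") share the prefix but are not versions.
            if term in DRIVER_VERSIONS:
                return term
    return None


record_types = [
    "info:eu-repo/semantics/article",
    "info:eu-repo/semantics/acceptedVersion",
]
print(driver_version(record_types))   # acceptedVersion
print(driver_version(["Article"]))    # None
```

The second call is the common case the tweet exchange is about: a repository that records only a free-text type gives a harvester nothing to go on.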

In any case, this was a response from oaDOI about the latest development

We found much IR version metadata was inaccurate, alas. So we developed an automated, heuristic-based approach, which we now apply to the IR-hosted PDF itself (eg, looks for crossmark, publisher watermarks, etc). Most of our current version info comes from that.
— Jason Priem (@jasonpriem) December 12, 2017

Intriguing. In a sense this was expected: as always when it comes to institutional repositories, metadata is omitted or messy, so oaDOI decided to download the PDF and try to figure out the version with heuristics.

How easy is it to determine the OA version?

Everyone agrees that Google Scholar is the gold standard for finding OA articles. By not relying on OAI-PMH, they are able to easily identify free full text and avoid the problem traditional OA aggregators have. In fact, because of this, Google Scholar is the #1 source of downloads for most institutional repositories.
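To give a feel for what a heuristic of this kind might look like, here is a toy sketch that guesses a version label from text extracted from the first page of a PDF. The marker phrases and their ordering are entirely my own guesses for illustration; oaDOI's actual rules are not described in the tweet beyond "crossmark, publisher watermarks, etc".

```python
# Hypothetical marker phrases, checked in this order.
ACCEPTED_MARKERS = ("accepted manuscript", "author's accepted", "postprint", "post-print")
PUBLISHED_MARKERS = ("crossmark", "published by", "all rights reserved")
SUBMITTED_MARKERS = ("preprint", "working paper", "submitted to")


def guess_version(first_page_text):
    """Guess a version label from first-page text; 'unknown' if nothing matches."""
    text = first_page_text.lower()
    # Accepted markers come first: repository cover sheets often name the
    # publisher too, which would otherwise look like a published version.
    if any(marker in text for marker in ACCEPTED_MARKERS):
        return "acceptedVersion"
    if any(marker in text for marker in PUBLISHED_MARKERS):
        return "publishedVersion"
    if any(marker in text for marker in SUBMITTED_MARKERS):
        return "submittedVersion"
    return "unknown"


print(guess_version("This is the accepted manuscript of an article published by Wiley."))
print(guess_version("CrossMark. Journal of Examples 12 (2016) 1-10. All rights reserved."))
print(guess_version("Preprint submitted to Journal of Examples"))
```

Even this toy version shows why the problem is hard: an author manuscript with no cover sheet and no telltale phrase falls straight through to "unknown".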

They have even applied their own machine learning algorithms to automatically identify different variants of the same paper and cluster them together, one of their chief innovations when Google Scholar first emerged.

Google Scholar can cluster different versions of a paper together regardless of source

However, they do not attempt to differentiate the type of version of a paper. It’s unclear how difficult it would be for them to do this, though they do try to identify the “primary version” of the paper to display as the title link.

The question then is, how well does oaDOI classify the versions of papers found in open access repositories? For institutional repository managers the question is, is oaDOI picking up the Green OA copies that are author accepted manuscripts or versions of record and labelling them as such?

It’s hard to say, but so far I suspect they are missing a lot of Green Accepted copies in open access repositories, particularly institutional repositories. I highly recommend that IR managers try it themselves by using the oaDOI API or, even easier, by using Web of Science and restricting to their institution’s output and Green OA.

For example, I tried restricting the search to MIT output in Web of Science, followed by further refinement to Green OA published and Green OA accepted. So far, after some sampling, I have not been able to find even one paper linking back to the MIT DSpace repository. The vast majority link to PubMed Central and arXiv.
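For IR managers who want to try the API route, here is a rough sketch. It assumes the v2 endpoint now served at api.unpaywall.org (the successor to the oaDOI API), which takes a DOI and an email parameter and returns JSON with an oa_locations list whose entries carry url, host_type and version. The helper names, the repository domain and the sample response are made up; check the API documentation before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.unpaywall.org/v2/"  # assumed v2 endpoint


def fetch_record(doi, email):
    """Fetch the JSON record for one DOI (needs network access)."""
    query = urllib.parse.urlencode({"email": email})
    url = API_ROOT + urllib.parse.quote(doi) + "?" + query
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def repository_copies(record, repo_domain):
    """List (version, url) pairs for copies hosted on a given repository domain."""
    copies = []
    for location in record.get("oa_locations") or []:
        url = location.get("url") or ""
        if location.get("host_type") == "repository" and repo_domain in url:
            copies.append((location.get("version"), url))
    return copies


# Made-up response fragment, so the example runs without network access.
sample = {
    "doi": "10.1234/example",
    "is_oa": True,
    "oa_locations": [
        {"url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000",
         "host_type": "repository", "version": "publishedVersion"},
        {"url": "https://repository.example.edu/bitstream/1/2/paper.pdf",
         "host_type": "repository", "version": "acceptedVersion"},
    ],
}
print(repository_copies(sample, "repository.example.edu"))
```

Running repository_copies over the DOIs of your institution's output shows at a glance how many of your deposits are found at all, and which version label each one received.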

It’s possible, of course, that oaDOI is able to find Green OA accepted and published versions in MIT DSpace but gives preference to the ones found in disciplinary repositories like PMC. But does this happen in every case?

More testing is required.

How do library discovery services handle different versions?

Besides BASE, the other major OA aggregator is CORE, which recently announced a tie-up with ProQuest such that content in CORE will eventually be indexed in Summon and Primo, two common library discovery services.

CORE — open access aggregator

Someone on Twitter asked the logical question: which version would Primo or Summon display if both the version in CORE (say it was a preprint) and the version of record were available in that instance of Summon/Primo?

Primo and Summon are not strangers to the idea of clustering or merging metadata from different sources.

Primo clusters metadata from different sources into one search result

For instance, Primo might have the metadata for a paper from Wiley (the main publisher) as well as metadata for the same paper from JSTOR or aggregators like ProQuest, and it cleverly clusters them and shows them only once, instead of multiple times in the search results.

But notice the difference here: we are talking about basically the same content or version, the version of record. It’s just that the records have slightly different metadata (e.g. subject headings) from different sources.

This is unlike Google Scholar, which clusters/merges different versions of the same paper.

In fact, I believe Primo/Summon will treat, and in fact currently treat, different versions of a paper as independent records.

There is also currently no attempt to merge or show any relationship between, say, a preprint harvested from a subject repository and the version of record from a publisher.

Post-blog note: I think the above applies only to Primo. Summon does merge them together. The quote below mentions a case of Summon merging a record from arXiv with one from ProQuest. Presumably the arXiv version might not be the VoR.

“Your results set contains an article found in a subscription package from ProQuest which you track as part of your holdings. This same article also merges with a record from an Open Access result from arXiv.org which you do not include in your library holdings. Without the “add results beyond collection” feature selected the arXiv results are NOT returned together with the ProQuest record because the library doesn’t subscribe to arXiv as part of its holdings. Therefore, no OA symbol is shown. “

While Summon now has an open access filter and label (it will be coming to Primo next), it does not tell you what version the paper is (or even if it is Green or Gold).

Label for open access in Summon

In other words, you are likely to see separate entries in the search results for each version, and for the user it would be very difficult to understand why this happens or to know which version is which without clicking in.

Perhaps this is something the current Basecamp set up by Ex Libris on Improving Open Access visibility in Summon and Primo can work on?

Display decisions — how do we show the different versions in search?

Let’s take it that we can solve the versioning technical issue and can even tell how two versions are related. How will the world look then?

In “Academic libraries in a mixed open access & paywall world — Can we substitute open access for paywalled articles?” I discussed Ryan Regier’s idea that libraries should show postprints/author accepted manuscripts in preference to subscribed, paywalled versions of record even when the latter are available. His idea is that this shows the real demand for paywalled journals and in the long run can show which titles can be cancelled.

Similarly, work to integrate ILL services with Open Access Button in pilots by JISC seems to be done with the same idea in mind.

Yet another variant of the idea is this. There are also services appearing that aim to allow you to determine levels of Open access at the journal title level to aid renewal decisions.

All this of course relies on 1) the accuracy of the system in detecting the version of Green OA found and 2) the institution making a decision on whether earlier versions are acceptable compared to the version of record. (or perhaps it can be a user configured option for what the default is)
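The configurable default mentioned above could be as simple as an ordered preference list. The sketch below is only an illustration of that idea; the policy names, the version labels and the record structure are hypothetical, and it assumes the version labels are already accurate, which is exactly condition 1.

```python
# Two hypothetical institutional policies, as ordered lists of acceptable versions.
VOR_FIRST = ["publishedVersion", "acceptedVersion"]
# Regier-style policy: show the open accepted manuscript ahead of the VoR.
ACCEPTED_FIRST = ["acceptedVersion", "publishedVersion"]


def pick_copy(copies, preference):
    """Return the copy whose version ranks highest in `preference`, or None.

    Versions missing from the preference list (here, preprints) are never shown.
    """
    acceptable = [c for c in copies if c["version"] in preference]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: preference.index(c["version"]))


copies = [
    {"version": "submittedVersion", "url": "https://arxiv.org/abs/0000.00000"},
    {"version": "acceptedVersion", "url": "https://repository.example.edu/1"},
    {"version": "publishedVersion", "url": "https://publisher.example.com/1"},
]
print(pick_copy(copies, VOR_FIRST)["url"])       # publisher copy
print(pick_copy(copies, ACCEPTED_FIRST)["url"])  # repository copy
```

An institution that decides preprints are acceptable would simply append "submittedVersion" to its list, and a per-user setting could swap one list for another.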

On one hand, author accepted manuscripts are in theory substantially the same as the version of record, so Ryan’s idea sounds doable if we only show those as default.

Of course, not everyone agrees on what to do. Some argue that if we have the version of record we should always show it, because it is superior and most users would prefer it even to an author accepted manuscript, much less a preprint. There are also difficulties like the lack of page numbers, and the difficulty and uncertainty of citing a postprint as compared to a version of record. If you talk about preprints the problem gets worse.

In fact, I have a gut feeling that a lot of cites to versions of record are actually based on just viewing the author accepted manuscript or even (gasp) preprints.

Regardless of your stance on this, the minimum seems to be clear labelling of exactly which version of a paper is available and of the relationships between versions. If one wants to make Green OA more successful (and perhaps you subscribe to the idea that rising Green OA levels will lower the price of APC Gold OA), lobbying citation styles to clarify how to cite versions other than the version of record is useful as well.

Conclusion

I hope I have shown in this long article why versioning of papers is becoming an important issue. With preprint servers poised to take off, discovery of open access now needs to go beyond saying an OA version exists, and also label it with the version of the paper that is found.

Still, I can envision scenarios where solving this issue is less important.

In particular, in a world where Gold OA is dominant (presumably with APCs, as most traditional publishers hope), there is perhaps less need to solve this versioning issue.

A world where Gold OA coexists with subject repositories/preprint servers providing Green OA might make the versioning issue easier to solve, because there would be fewer players, making it easier to coordinate standards or for aggregators to build ad hoc workarounds for the dozens of big players.

The world we have now, with Gold OA, subject repositories/preprint servers and thousands of institutional repositories, is where the versioning issue is hardest to solve.