The open access aggregators challenge — how well do they identify free full text?
Bielefeld Academic Search Engine (BASE), created by Bielefeld University Library in Bielefeld, Germany, is probably one of the largest and most advanced aggregators of open access articles (over 100 million records). Others at roughly the same level are CORE (around 60 million records) and OAIster (owned by OCLC).
One way of seeing this class of open access aggregators is as analogous to web scale discovery search engines like Summon, EDS, Primo and WorldCat Discovery Service, but focused mainly on open access content.
How well do web scale discovery engines cover open access?
It seems natural to think that index-based solutions like Summon, Primo and EDS should cover both paywalled content and open access content, particularly since they typically can use OAI-PMH to harvest the institution’s own institutional repository. In reality, their coverage of open access material can be spotty. The best ones have indexed OAIster or BASE. But even when open access sources are available in the index, many institutions choose not to turn them on for various reasons. These include unstable links, the inability to correctly show only open access material, and flooding of results by inappropriate data (e.g. foreign language or irrelevant subjects).
Just as web scale discovery services like Summon, Primo and EDS naturally draw comparisons with Google Scholar, so do BASE, CORE and others in their class.
Like their cousins in the web scale discovery area, BASE (as well as CORE) have an advantage over Google Scholar in that they are designed for more advanced users, with more filters and facets as well as an advanced search that offers far more options than Google Scholar’s.
Still, users who are not well versed in the difficulty of aggregation might complain that while BASE has more filters than Google Scholar, it’s still a far cry from what a full-fledged database like Scopus or even PsycINFO offers.
Of course, it’s not realistic to expect search engines like BASE to have detailed filters (and the same reasoning applies to Summon and company), because they need to aggregate a big mass of different sources together, where each source has different fields and, more than that, different vocabularies and data schemas.
For web scale discovery services, this has been a constant thorn in the promise of “one search”, and I understand companies like ProQuest spend enormous staff effort and man-hours to clean, harmonize and even FRBRize all the different data sources they get from various publishers, aggregators and other content providers, particularly when the types of content indexed differ (think journals vs books vs music scores vs images).
Sadly, people who aim to aggregate the open access material in the world face a similarly daunting task, even if we restrict it to the world of papers and theses.
As I wrote in “Aggregating institutional repositories — A rethink”, the inconsistency of metadata harvested from institutional repositories via OAI-PMH is an issue.
Under the heading for “Minimal Repository Implementation” in “Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting” we see it advises that “It is important to stress that there are many optional concepts in the OAI-PMH. The aim is to allow for high fidelity communication between repositories and harvesters when available and desirable.”
Also, under the section on Dublin Core, which today is pretty much the default, we see: “Dublin Core (DC) is the resource discovery lingua franca for metadata. Since all DC fields are optional and repeatable, most repositories should have no trouble creating at least a minimal mapping of their native metadata to unqualified DC.”
Clearly, we see the original framers of OAI-PMH decided to give repositories a lot of flexibility on what was mandatory and what wasn’t and only specified a minimum set.
So yes, aggregators like CORE and BASE actually don’t have it easy when aggregating on fields for filtering.
A unique challenge for open access aggregators
One area where BASE and CORE differ from Summon and Primo is that open access aggregators need to be able to tell whether an article they harvest from a subject or institutional repository has free full text, and this isn’t that easy.
Other challenges of aggregators
Web scale discovery services struggle with problems like the appropriate copy problem (which content provider should the user be sent to so they can access the article), which is weakly analogous to the problem of identifying which items in an open access repository are free.
One other challenge that open access aggregators face is handling varying versions of the same item. An item may exist as a working paper, a conference proceeding/paper, a preprint, a submitted and accepted postprint, and a version of record. Is there a way to link them together and show that they are associated? Can you even reliably tell they are associated?
This seems odd if you do not know the history of open access repositories. OAI-PMH, the standard way of harvesting open access repositories, was established to harvest metadata only, not full text. When it was established it was envisioned that most, if not all, items in such repositories would be open access (following the example of arXiv), so no provision was made for a standard or mandatory field to indicate whether an item is free to access.
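For readers unfamiliar with the protocol, a harvest is just an HTTP GET with a standard verb. The sketch below builds a `ListRecords` request; the verb and parameter names come straight from the OAI-PMH specification, while the repository endpoint and set name are invented for illustration.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a standard OAI-PMH ListRecords request URL.

    verb, metadataPrefix and set are parameter names defined by the
    OAI-PMH spec; the base_url is whatever endpoint the repository exposes.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint and set name, for illustration only.
url = list_records_url("https://ir.example.edu/oai", set_spec="com_etd")
print(url)
```

Note that nothing in this request, or in the metadata that comes back, asks about or asserts access status; that information simply isn’t part of the minimal protocol.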
In today’s world, of course, subject and in particular institutional repositories are a mix of free full text and metadata-only records. This happens in particular for institutional repositories because they have multiple goals beyond just supporting open access.
What are the multiple purposes of Institutional repositories?
While most librarians are familiar with institutional repositories’ mission to support open access, they may not be aware that it is not their only purpose (I also argue that even advocates who support self-archiving in the open access agenda can have different ultimate aims). Other purposes include:
a) “to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus increasing the institution's visibility, status, and public value” (Crow 2002)
b) “Nurture new forms of scholarly communication beyond traditional publishing (e.g. ETD, grey literature, data archiving)” (Clifford 2003)
It is purpose A, tracking the institution’s output, that results in institutional repositories hosting more than just full text items. Many institutional repositories in fact have more metadata-only items than full text. It’s a rare institutional repository that has more than a third full text records.
Truth be told, most open access aggregators I have seen simply give up on this problem and just aggregate the contents of whole institutional repositories giving users a mistaken idea that everything is free.
This leads to users wondering if something is wrong when they click through and are led to a metadata-only record in the repository. This, by the way, is the reason why I (and I suspect many librarians) tend not to turn on the open access repositories available via Summon/Primo: they don’t really show only open access items. A rare few are, say, 99% free items (typically ETD, or electronic theses and dissertations, collections, though even those have the occasional embargoed item), while many have more metadata-only records than full text records, particularly if they blindly pull in metadata via their institution’s research publication systems and/or Scopus/Web of Science.
There are of course ways to identify full text in repositories, and Google Scholar seems to do it beautifully at the item level (via spidering to detect PDFs?), but that doesn’t seem common for non-Google systems. As it stands, Google Scholar is currently my #1 choice whenever I need to check if free articles exist.
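My guess at the spidering approach can be sketched very crudely: follow the links on a record page and check whether any of them resolve to a PDF. This is purely a guess at the technique, not a description of what Google Scholar actually does, and the heuristic below is deliberately simple.

```python
from urllib.request import Request, urlopen

def looks_like_pdf(content_type, url):
    """Heuristic: does this response look like a full text PDF?

    Checks the Content-Type header first, then falls back to the
    file extension in the URL.
    """
    ct = (content_type or "").split(";")[0].strip().lower()
    return ct == "application/pdf" or url.lower().endswith(".pdf")

def has_free_full_text(url, timeout=10):
    """Issue a HEAD request and apply the heuristic (makes a network call)."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return looks_like_pdf(resp.headers.get("Content-Type"), resp.geturl())
```

Even this toy version hints at why spidering at scale is expensive: it takes at least one HTTP request per candidate link, per item, across millions of records, which is exactly the kind of crawling infrastructure a search giant has and most aggregators don’t.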
One possibility is for institutional repositories to create “collections” that are 100% or near 100% full text and pull in such items by collections. This usually is what happens for ETD.
The other way, of course, is to set a metadata tag for each item that has full text, but I’m not sure there is a universal standard for this. A good start might be OpenAIRE’s standard.
BASE does indeed suggest supporting this for optimal indexing. I am not sure how widespread this is outside the EU.
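The OpenAIRE guidelines put an access level term from the info:eu-repo vocabulary into dc:rights, which gives an aggregator something machine-readable to key on. A minimal sketch of how an aggregator might map those terms to a simple flag (the function name is mine, not from any real aggregator):

```python
# Access level terms from the OpenAIRE / info:eu-repo vocabulary,
# carried in dc:rights, mapped to a simple access flag.
OPENAIRE_ACCESS = {
    "info:eu-repo/semantics/openAccess": "open",
    "info:eu-repo/semantics/embargoedAccess": "embargoed",
    "info:eu-repo/semantics/restrictedAccess": "restricted",
    "info:eu-repo/semantics/closedAccess": "closed",
}

def access_level(rights_values):
    """Return the access level if any dc:rights value uses the vocabulary."""
    for value in rights_values:
        if value in OPENAIRE_ACCESS:
            return OPENAIRE_ACCESS[value]
    return "unknown"  # no recognised term: the aggregator can't tell

print(access_level(["info:eu-repo/semantics/openAccess"]))  # open
print(access_level(["© The Authors"]))                      # unknown
```

The “unknown” branch is the crux: any repository that doesn’t adopt the vocabulary leaves the aggregator exactly where it started.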
I’m not a repository manager, so I’m not sure how this works, but I get the distinct impression that Digital Commons repositories can reliably identify full text records, given that they can offer full-text PDF RSS feeds. I’m just not sure how a third party aggregator can exploit that to identify full text, or whether it can be generalised to all Digital Commons repositories.
In any case, I think one can probably “hack” together workarounds to reliably detect full text for any one repository; the trick is to do it without much work for most of them.
In a sense, centralised subject repositories have the advantage over institutional ones here: by virtue of their mass, there is far greater incentive for aggregators to tweak compatibility with them than with any individual institutional repository.
In any case, both BASE and CORE are capable of identifying full text records in their results, the question is how accurate are they?
How well do BASE and CORE do at identifying full text?
The nice thing about BASE is that it allows you to run a “blank search” which gives you everything that meets the criteria (similar to Summon). So one can easily segment the index based on criteria you desire without crude workarounds like searching for common words that all records would have.
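BASE exposes an HTTP search interface that makes this kind of segmentation scriptable. The sketch below is my reading of the BASE API documentation: the endpoint is real as far as I know, but the parameter names and the filter syntax shown should be checked against the current docs before relying on them.

```python
from urllib.parse import urlencode

# Endpoint and parameters per my reading of the BASE API docs;
# treat as an assumption and verify against the current documentation.
BASE_API = "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi"

def base_search_url(query, hits=10):
    """Build a BASE HTTP search request URL.

    A filter-only query (no search terms) acts like the "blank search"
    described above: it returns everything matching the criteria.
    """
    params = {"func": "PerformSearch", "query": query, "hits": hits}
    return BASE_API + "?" + urlencode(params)

# Illustrative filter-only query; the country filter syntax is assumed.
url = base_search_url("country:sg", hits=1)
print(url)
```

The point is less the exact syntax than the capability: because a blank search is allowed, you can carve the index up by source, country or access status and read off the counts, which is exactly what the figures below do.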
Such a search, restricted to Singapore sources, shows that BASE knows of
66,934 records from National University of Singapore’s IR — dubbed ScholarBank@NUS (using DSpace)
21,199 records from Nanyang Technological University’s IR — dubbed DR-NTU (using DSpace)
16,625 records from Singapore Management University’s IR — dubbed INK (using Digital Commons). [Disclosure: I’m a staff member of this institution]
Based on my colleague's recent Singapore update on open access figures for total records in each of the repositories — this shows a rough coverage of 67%, 89%, 98% respectively in BASE.
Take these figures with a pinch of salt, because the total records I am using are from different times: e.g. the NUS total is as of 30 Sept 2016, and the NTU total is as of 18 October 2016. NUS also has fairly substantial non-traditional records, e.g. patents and music recordings, so that might affect the result. Lastly, I did the search in BASE in early Jan 2017 while the totals are from a quarter earlier, so the actual coverage is probably a bit lower.
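The coverage arithmetic itself is trivial; the uncertainty is all in the inputs. A quick sketch using the NUS figures (the ~99,000 total is the rough figure quoted later in this post, and approximate for the reasons just given):

```python
def coverage(indexed, total):
    """Percentage of a repository's records that the aggregator indexes."""
    return 100.0 * indexed / total

# 66,934 is the BASE-indexed count for ScholarBank@NUS from the search
# above; ~99,000 is the approximate repository total quoted in the text.
print(round(coverage(66934, 99000), 1))  # roughly 67%
```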
Overall, the coverage shown isn’t too bad, but the more important point is how well does BASE identify full text? Let us filter to Access : Open Access
Not very well it seems.
It is only able to see 75 free records in National University of Singapore’s IR, 654 free records in Nanyang Technological University’s IR, and 143 free records in Singapore Management University’s IR.
I did not check for false positives in BASE’s identification of full text, but even in the best case scenario where they are 100% correct, we see a full text identification ratio of only 0.6%, 3.8% and 2.7% respectively!
If you consider the case of Singapore Management University (disclosure again: I am staff there), BASE is able to index practically every record in our repository and yet only identifies 2.7% of our free full text. It’s in the same ballpark for the other Singapore repositories.
Let’s do the same for CORE. How many records does it index for the 3 Singapore repositories?
Here are the results
National University of Singapore’s Scholarbank.
Records (100,657) + Full text (12)
Keyword : repository: (“Scholarbank@NUS”)
Singapore Management University — INK
Records (18,312) + Full text (166)
Keyword : repository: (“Institutional Knowledge at Singapore Management University”)
Interestingly enough, I was unable to find any articles indexed in CORE from Nanyang Technological University’s IR; it’s possible I might have missed them somehow.
In any case, I won’t calculate the percentages for the other two IRs; they are broadly similar to the case in BASE, except that CORE seems to show substantially more records (including metadata-only records) indexed than BASE does.
In fact, CORE shows more records indexed for both universities than the total records listed in the Singapore update on open access figures (e.g. 100k vs 99k for NUS and 18k vs 16k for SMU). This is possible because the totals from the Singapore update generally refer to 3Q 2016 figures, and the number of records would have grown since then.
Still I suspect that’s not the full reason, there could be duplicates archived in CORE inflating the result.
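For completeness, CORE’s index is also queryable programmatically. The sketch below reflects my reading of CORE’s API documentation: the endpoint path and parameter names are assumptions to verify against the current docs, and a (free) API key is required for real requests.

```python
from urllib.parse import urlencode

# Endpoint per my reading of CORE's API docs (v3 at the time of writing);
# treat the path and parameter names as assumptions to verify.
CORE_API = "https://api.core.ac.uk/v3/search/works"

def core_search_url(query, limit=10):
    """Build a CORE search request URL (an API key header is also needed)."""
    return CORE_API + "?" + urlencode({"q": query, "limit": limit})

# The repository-name query mirrors the keyword searches quoted above.
url = core_search_url(
    'repository:"Institutional Knowledge at Singapore Management University"'
)
print(url)
```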
More importantly in terms of full text identified the results for CORE are as dismal as BASE.
Both BASE and CORE are extremely sophisticated open access aggregators. For example, they offer APIs (BASE, CORE), are indexed by some web scale discovery services, are doing various interesting things with ORCID, are creating recommendation systems, and are working with oaDOI to help surface green open access articles hiding in repositories.
A difference is that BASE currently doesn’t search through full text while I believe CORE does.
However, identifying which of the articles they have harvested have free full text is still problematic. BASE claims to be able to reliably identify about 40% of its index as full text, though the status of the other 60% is still unknown due to lack of metadata. My own quick tests show that its accuracy is quite bad for certain repositories. My hunch is that BASE either works very well with some repositories or not at all with others.
So this is a major challenge for the open access community, and in particular for institutional repositories, to answer. The alternative is to shrug one’s shoulders and let Google Scholar be the default open access aggregator.