I tried using oaDOI.org API to find the amount of Open Access and you won’t believe what I found!
By now most people would have heard of the very useful Unpaywall service that allows you to bypass many paywalled articles by sending you to free versions.
While not perfect or fully comprehensive, it’s one of the easiest ways to find out whether a free version of an article exists. The service underlying Unpaywall is oaDOI.org, which provides a nifty free API that you can use.
Note: Unpaywall itself can find more articles compared to the oaDOI.org service, because it “supplements oaDOI with other data sources, too; for instance, Unpaywall tries to parse and understand scholarly article pages as you view them. Consequently, Unpaywall’s results are a bit more comprehensive than what you’d get by calling oaDOI directly.”
In this article I show the steps to do the analysis on your own, and in the next article, “Open access rates of a institution’s output vs a LIS Journal output — or are librarians walking the talk?”, I look at the results in more detail.
How to use the oaDOI API to find out how much is free
All you need is a set of DOIs you are curious about. Point the oaDOI.org API at them and it returns a bunch of useful information for each one: whether a free-to-read version exists, the URL of the free version (if one exists), the color of the open access version, and so on.
The very useful “Collecting Open Access information using OpenRefine and the oaDOI API” sets out the steps to do so, but what can you really do with this?
Scenario 1: Check how much of your institution’s scholarly output is free to read.
Ever wondered how much of your institution’s scholarly output is already free to read? While it might be impossible to know with 100% certainty, you can certainly get a ballpark figure using the oaDOI.org API.
First you need a source of DOIs. Perhaps the easiest is to use something like Scopus or Web of Science: run an affiliation search to find all articles and export the results, with DOIs, to CSV. This is what I did. Other sources of DOIs, say your institutional repository or CRIS system, work too.
You may need to do this multiple times, for various years, if your institution’s output exceeds 2,000 results. Once you have checked the results you need, select “Export”.
In my case, I exported records from 2013 to 2017, a total of 1,968 records. I later did the same for the remaining records from 1999–2012 for a second batch of records.
Then I ran OpenRefine and waited for the web interface to load.
Once it’s loaded, I clicked on Create project->Get data from this computer->Choose file, then browsed to and selected the file (2013 to 2017) just downloaded from Scopus.
Once I checked everything was loaded okay, I clicked on “Create project”.
Once everything was imported into OpenRefine, I checked how many of these records have DOIs, as the oaDOI API only works on DOIs.
To do that, go to the DOI column (click on the down arrow)-> Facet-> Customized facets-> Facet by blank
From the facet panel on the left, I see that 220 records have no DOI (blank=true) and hence 1,748 have DOIs. The oaDOI.org API will only work on the 1,748 records with a DOI. I click on false in the side panel to filter down to just the records that have DOIs and ignore the blank records.
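If you would rather do this check outside OpenRefine, the same "facet by blank" idea can be sketched in a few lines of Python. This is only an illustration: the sample CSV is made up, and the column name DOI is assumed to match the header in your export.

```python
import csv
import io


def split_by_doi(rows, doi_column="DOI"):
    """Return (rows_with_doi, rows_without_doi); whitespace-only cells count as blank."""
    with_doi, without_doi = [], []
    for row in rows:
        if (row.get(doi_column) or "").strip():
            with_doi.append(row)
        else:
            without_doi.append(row)
    return with_doi, without_doi


# Tiny made-up sample standing in for the real CSV export.
SAMPLE_CSV = """Title,Year,DOI
Paper A,2015,10.1000/example.1
Paper B,2016,
Paper C,2017,10.1000/example.3
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
with_doi, without_doi = split_by_doi(rows)
print(len(with_doi), "records with DOI;", len(without_doi), "blank")
```

To run it on a real export, replace the `io.StringIO(SAMPLE_CSV)` part with an opened file.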
The next step is where the magic happens. On the DOI column (click on the down arrow), click
Edit column->Add column by fetching URLs.
In the expression space, type
UPDATE (25/7/2019): oaDOI has been renamed to just the Unpaywall API, so do this instead
This constructs, for each record, an API call to oaDOI using its DOI.
Don’t forget to name the new column; in my case I used oadoi.
No registration is needed to use the API. There are no hard rate limits, though it is suggested that you stay below 100k calls per day. Do remember to change the part that says =firstname.lastname@example.org to your own email address.
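For readers who prefer a script, here is a minimal Python sketch of the same URL construction. It assumes the current Unpaywall v2 endpoint form, https://api.unpaywall.org/v2/ followed by the DOI and an email parameter. The function name and the sample DOI are made up for illustration.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint: the Unpaywall v2 REST API.
UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the API call for one DOI (hypothetical helper, not part of any library)."""
    # Keep the slash between DOI prefix and suffix; escape other unsafe characters.
    path = quote(doi.strip(), safe="/")
    return UNPAYWALL_BASE + path + "?" + urlencode({"email": email})


# Made-up DOI; swap in your own email address.
print(unpaywall_url("10.1000/example.1", "firstname.lastname@example.org"))
```

The @ in the email address comes out percent-encoded as %40, which is the standard way to pass it in a query string.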
It will take roughly 2–3 hours for the API calls to complete but when it is done you will see a new column appear. In my case it is the oadoi column.
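The waiting time is mostly the pause between calls. As a rough sketch, assuming a 5-second delay between requests (an assumption; check the throttle delay shown in the OpenRefine dialog), 1,748 calls come to a bit under two and a half hours. The loop below shows the same pattern in Python. The function names are made up, and failed calls are simply stored as None.

```python
import time
import urllib.request


def fetch_text(url, timeout=30):
    """Download one URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_all(urls, fetch=fetch_text, delay_seconds=5.0):
    """Fetch URLs one by one, pausing between calls; failures are stored as None."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the API
        try:
            results[url] = fetch(url)
        except OSError:  # covers urllib's URLError/HTTPError and timeouts
            results[url] = None
    return results


def estimated_hours(n_requests, delay_seconds=5.0):
    """Lower bound on run time: delay only, ignoring network latency."""
    return n_requests * delay_seconds / 3600


print(round(estimated_hours(1748), 1))  # 2.4
```

Passing your own `fetch` function makes it easy to try the loop without touching the network.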
The data extracted is in JSON, which may be intimidating, but don’t worry: OpenRefine provides tools to handle it.
On the oadoi column, click on the down arrow->Edit column->Add column based on this column.
In the expression column try
Again remember to enter a name for the new column.
This extracts the value of the “is_free_to_read” key for each record. For records where the value is “true”, oaDOI found a free version.
You can repeat the process and create new columns with expressions like
a. value.parseJson().results.free_fulltext_url (create column with URL of free full text found)
b. value.parseJson().results.oa_color (create columns with information of OA color — Green/Gold/Blue)
You can of course replace the last part of each expression (the field name, such as free_fulltext_url or oa_color) with other values/fields. But which?
Want to know more about the fields you can add beyond oa_color or free_fulltext_url ? Refer here.
Some interesting ones: evidence shows the source that oaDOI queried.
Another one is license, the license under which the free copy is made available.
And finally oa_color_long
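The same field extraction can be sketched in Python. The sample response below is made up to mirror the field names mentioned above, and the assumption that results holds a list with one object per DOI may not match what the API returns for you, so the helper accepts a single object as well.

```python
import json

# Field names taken from the text above; the lowercase "license" is an assumption.
FIELDS = ["is_free_to_read", "free_fulltext_url", "oa_color", "evidence", "license"]


def extract_fields(raw_json, fields=FIELDS):
    """Return {field: value} from one stored response; missing fields become None."""
    if not raw_json:
        return {name: None for name in fields}
    results = json.loads(raw_json).get("results")
    # Accept either a single object or a list holding one object.
    if isinstance(results, list):
        results = results[0] if results else {}
    results = results or {}
    return {name: results.get(name) for name in fields}


# Made-up response, for illustration only.
SAMPLE = json.dumps({
    "results": [{
        "is_free_to_read": True,
        "free_fulltext_url": "https://repository.example.org/paper.pdf",
        "oa_color": "green",
    }]
})

print(extract_fields(SAMPLE))
```

Fields that are absent from a response (here evidence and license) come back as None rather than raising an error, which keeps a long batch from stopping halfway.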
How much of the content is free to read?
In my example above, I extracted fields like OA_color, Free to read (set to “true” if free copy found, “false” otherwise), Free Text URL (blank if no free full text found), Found Green OA , Hybrid available (“true” if copy is free to read in hybrid journal) etc.
Using this data with OpenRefine, it’s very easy to work out that, out of 1,748 records with a DOI, a total of 260 were free to read (14.9%).
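The percentage itself is simple arithmetic: 260 out of 1,748. A small Python sketch, with a made-up function name, that works on a column of true/false flags:

```python
def percent_free(flags):
    """Share of records flagged free to read, as a percentage."""
    flags = list(flags)
    if not flags:
        return 0.0
    return 100 * sum(1 for flag in flags if flag) / len(flags)


# 260 free-to-read records out of the 1,748 that have a DOI.
flags = [True] * 260 + [False] * (1748 - 260)
print(f"{percent_free(flags):.2f}%")  # 14.87%
```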
You can use the Free Text URL column to look up the URLs of the free full text, to study where researchers are putting up their papers, or even to download the papers and deposit them into your own IR if you have a practice of doing so.
You can go further and use OpenRefine with the Sherpa Romeo API to check all articles, including those without free-to-read versions, and see how much more could potentially be made open access.
Scenario 2: Check how much of a journal title you are considering subscribing or cancelling is free to read.
One of the most long-standing debates in the open access world is whether embargoes are needed. Publishers of course claim embargoes are needed to protect themselves; otherwise librarians would start cancelling subscriptions because self-archived versions are available.
Some open access advocates claim that librarians can never cancel subscriptions because of the self-archiving allowed by Green OA, since this could happen only if Green OA reached 100% for the title.
My view is this.
The “hard to figure out” part is slowly changing with commercial services like 1science’s OAfigr, but with the magic of oaDOI and OpenRefine you can figure out a similar statistic with some effort, following the same steps as before.
The only difference is that you use the DOIs of the articles in the journal title you are interested in. You can again use Scopus or, better yet, Crossref’s API (if the title is not indexed in Scopus) to do so. Once you have the list of DOIs, the same steps as above apply.
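As a rough sketch of the Crossref route, the snippet below builds a request against the Crossref REST API's journal works endpoint and pulls the DOIs out of one page of the response. The ISSN and the sample page are placeholders, the helper names are made up, and the actual downloading is left out.

```python
import json
from urllib.parse import quote, urlencode


def crossref_journal_url(issn, cursor="*", rows=1000):
    """URL for one page of a journal's works, asking only for the DOI field."""
    query = urlencode({"select": "DOI", "rows": rows, "cursor": cursor})
    return "https://api.crossref.org/journals/" + quote(issn) + "/works?" + query


def parse_page(raw_json):
    """Return (dois, next_cursor) from one response page."""
    message = json.loads(raw_json)["message"]
    dois = [item["DOI"] for item in message.get("items", []) if item.get("DOI")]
    return dois, message.get("next-cursor")


# Placeholder ISSN and a made-up response page, for illustration only.
print(crossref_journal_url("1234-5678"))
SAMPLE_PAGE = json.dumps({
    "message": {
        "items": [{"DOI": "10.1000/example.1"}, {"DOI": "10.1000/example.2"}],
        "next-cursor": "abc123",
    }
})
print(parse_page(SAMPLE_PAGE))
```

To collect a whole journal, keep requesting the next page with the returned cursor until a page comes back with no items.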
So far I have only tried to do this for 2 journals, focusing on LIS-related ones. What do you think I found?
What % of LIS journals do you expect to be free to read?
I’ve so far tried with only 2 LIS titles, and these are the results:
Journal of Business and Finance Librarianship (Taylor & Francis): Out of 310 articles (1990–2017) with a DOI, only 11 articles were free (Green), that’s 3.5%.
Journal of Academic Librarianship (Elsevier): Out of 1,398 articles (1993–1996, 2001–2017) with a DOI, only 112 articles were free (11 Blue, 101 Green), that’s 8.0%.
I don’t see any particular pattern between year of publication and likelihood of being made free to read, though for the latter journal the highest counts are in 2014 (20), 2007 (14) and 2013 (11), two of which are among the later years. These correspond to 18.9%, 17.1% and 12.4% of each year’s output.
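A per-year breakdown like this is a simple group-and-count. Here is a minimal Python sketch. The sample records are made up and are not the journal's data, and the function name is hypothetical.

```python
from collections import defaultdict


def free_rate_by_year(records):
    """records: iterable of (year, is_free) pairs -> {year: (free, total, percent)}."""
    totals = defaultdict(int)
    free = defaultdict(int)
    for year, is_free in records:
        totals[year] += 1
        if is_free:
            free[year] += 1
    return {
        year: (free[year], totals[year], round(100 * free[year] / totals[year], 1))
        for year in sorted(totals)
    }


# Made-up sample: four articles per year.
sample = [
    (2013, True), (2013, False), (2013, False), (2013, False),
    (2014, True), (2014, True), (2014, False), (2014, False),
]
print(free_rate_by_year(sample))
# {2013: (1, 4, 25.0), 2014: (2, 4, 50.0)}
```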
For reference, this is what Sherpa Romeo says about what is allowed with the Journal of Academic Librarianship:
So there you have it: how to do your own analysis using the free OpenRefine and oaDOI API service.
More analysis can be done on the data. For example, how do open access rates differ by year? Is there more Green, Hybrid or Gold OA? What CC licenses are employed?
Go on to “Open access rates of a institution’s output vs a LIS Journal output — or are librarians walking the talk?” for answers to these questions and more.