Cashing the cheque of open access or Machine learning and Scholarly tools — Meta, Scite, Paper Digest and more

Aaron Tay
Academic librarians and open access
10 min read · Dec 25, 2018

In July I wrote about a dozen new tools of interest to academic researchers. However, the march of progress never stands still, so I am back to talk about several new tools that have launched, or are in closed beta at the time of writing, that might be worth keeping an eye on.

1. Keyword-based search — Lens.org, Meta

2. Citation-based tools — Sciride, Scite

3. Autosummarization — Paper Digest, Scholarcy, Get the Research, Iris.AI

The key thing to note is that we are in the beginning stages of an explosion in ideas and innovation made possible by the rise of open access and open data. Many of the tools I mention exist today only because they have free access to millions of full-text open access articles and can apply the latest machine learning, NLP and AI techniques.

1. Keyword-based search — Lens and Meta

Lens.org isn’t new, but it has recently improved its capabilities to the point that it is worth a new look.

It is hard to beat Google Scholar for pure discovery, but in terms of size, Lens, which brings together scholarly records from Crossref, Microsoft Academic and PubMed Central, certainly isn’t lacking in scale. As I write this, I see almost 200 million records, which makes it one of the biggest scholarly indexes out there.

Unlike Google Scholar, Lens is also a power-user tool: its powerful and flexible filters and advanced search let you narrow results by anything from funding information to author affiliation.

All this is available thanks to the blending of open data sources from ORCID, CORE, Unpaywall and more.

The combination of data and filters means you can do a lot of analysis that you would be hard pressed to do even with commercial tools.

For instance, I was recently asked to identify papers that were a) written by our institution’s authors, b) funded, and c) released open access. This is trivial in Lens.
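If you would rather script such a query, something along these lines should be possible against Lens’s scholarly API. This is only a minimal sketch: the endpoint shape, field names and response keys below are my assumptions for illustration, not a verified schema, so check the official Lens documentation before relying on them.

```python
# Minimal sketch only: endpoint, field names and response keys are
# illustrative assumptions about the Lens scholarly API, not a verified schema.
import requests

API_KEY = "YOUR_LENS_API_KEY"  # hypothetical placeholder

query = {
    "query": {
        "bool": {
            "must": [
                # a) written by our institution's authors (field name assumed)
                {"match": {"author.affiliation.name": "Singapore Management University"}},
                # b) has funding information attached (field name assumed)
                {"exists": {"field": "funding"}},
                # c) released open access (field name assumed)
                {"term": {"is_open_access": True}},
            ]
        }
    },
    "size": 50,
}

resp = requests.post(
    "https://api.lens.org/scholarly/search",
    json=query,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json().get("total", 0), "matching records")
```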

In fact, Lens is clearly designed to be an amazing analysis tool: you can set up collections, get alerts and even do bulk exports of up to 50,000 records (more generous than what you get from expensive commercial tools like Web of Science).

The analysis panel also shows some simple visualizations you can use without exporting.

Visualization panel showing 482 records from my institution that have funding information and an OA license

I could go on about its many features (e.g. patent citations), and if you are interested here’s a very long, detailed review. But do give it a try, since it is available for the low price of nothing from the non-profit Cambia.

Meta is a highly anticipated tool sponsored by the Chan Zuckerberg Initiative and, like Lens, is likely to be free forever without ads.

Unlike Lens, which is cross-disciplinary, this tool focuses on supporting biomedical researchers.

Meta — a new smart tool for biomedical researchers

As I write this, the tool is still in closed beta and I’m still playing with it.

The main thrust of the tool appears to be learning what interests you: it sets up feeds that you can tweak to teach it what you care about.

2. Citation-based tools — Scite, Sciride, Semantic Scholar

In an earlier post, I talked about Citation Gecko, a tool that lets you enter some seed papers and then uses open citations to try to identify related articles.

As nice as Citation Gecko and other citation-based tools are, they only tell you that a citation exists; they can’t tell you the exact nature of the relationship between the linked items. And in the case of citations between publications, we are told there are 13 different reasons to cite. Can we innovate in discovery by exposing the nature of the citations between items and allowing users to filter this way?

This is what Sciride does by indexing citation statements from open access biomedical articles in PMC.

What are citation statements? They are “sentences from scientific publications, supported by citing other peer-reviewed manuscripts”.

In other words, a citation statement would be something like this:

“Google Scholar is shown to have high recall but low precision.” (Tay, 2010)

Sciride allows you to do a keyword search over the citation statements. So in this example I search for the terms “Google Scholar high recall” and I get…

I’m sure you can think of many uses (e.g. looking for citation statements around a certain piece of software, practice or even person, or searching for something you know exists but whose title you forgot), but currently Sciride is of limited use outside the domain it covers (the life sciences).

It seems to me that discovery services that have full text might be able to implement something like this with some effort (assuming they have the rights to do so). Presumably work would be needed to reliably identify citation statements and index them.
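To make that extraction step concrete, here is a toy sketch in Python. It only handles simple “(Author, Year)” patterns and uses a naive sentence split; a production system like Sciride’s would need to cope with many citation styles.

```python
# A toy sketch of citation-statement extraction, the step a discovery service
# would need before indexing. Only matches simple "(Author, Year)" patterns;
# purely illustrative, not Sciride's actual method.
import re

IN_TEXT_CITE = re.compile(r"\([A-Z][A-Za-z\-]+(?: et al\.)?,? \d{4}\)")

def citation_statements(full_text: str) -> list[str]:
    """Return sentences that contain at least one in-text citation."""
    # naive sentence split; production code would use a proper tokenizer
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    return [s for s in sentences if IN_TEXT_CITE.search(s)]

text = ("Google Scholar is shown to have high recall but low precision (Tay, 2010). "
        "We evaluate it on a new corpus.")
print(citation_statements(text))
# -> ['Google Scholar is shown to have high recall but low precision (Tay, 2010).']
```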

Arguably Sciride could have gone further. After all, all it does is allow keyword search over citation statements. Could we apply some kind of sentiment analysis to tell whether a citation statement is positive or negative?

Sentiment analysis of citations

http://rfactor.verumanalytics.io/ goes further than Sciride and tells you whether a paper supports or refutes the paper it is citing.

How does it do this? Apparently via manual tagging, which limits the feature’s ability to scale.

Even more complicated relationships between citations have been proposed that go beyond this.

For example, there is the very interesting proposal of CiTO (the Citation Typing Ontology), which aims “to enable characterization of the nature or type of citations, both factually and rhetorically, and to permit these descriptions to be published on the Web.”

Factual typing of citations could include properties like “is cited by” or “has quotation”, while rhetorical typing is divided into three subclasses: positive (e.g. “supports”), negative (e.g. “disputes”) and neutral (e.g. “reviews”). See more here.

CiTO object property list
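To see what a published, machine-readable CiTO annotation could look like in practice, here is a minimal sketch using the rdflib library. cito:supports is a real CiTO property, but the DOIs are placeholders and this is my illustration, not any particular tool’s data format.

```python
# Minimal sketch of a CiTO annotation as RDF. The DOIs below are
# placeholders; cito:supports is a real property from the CiTO ontology.
from rdflib import Graph, Namespace, URIRef

CITO = Namespace("http://purl.org/spar/cito/")

g = Graph()
g.bind("cito", CITO)

citing = URIRef("https://doi.org/10.1234/citing-paper")  # placeholder DOI
cited = URIRef("https://doi.org/10.5678/cited-paper")    # placeholder DOI

# Assert that the citing paper rhetorically supports the cited one
g.add((citing, CITO.supports, cited))

print(g.serialize(format="turtle"))
```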

The main problem with this, of course, is: who is going to code all the citations? This paper reviews some author annotation tools, such as Chrome extensions and other writing tools, but I doubt this is enough without an automated or semi-automated coding system built with machine learning; see for example CiTalO or the CiTO algorithm.

But do automatic methods exist for telling whether a citation is a positive or negative cite?

Scite — Shepardizing for Science

Based on the above, we now come to the logical next idea: a system that can automatically learn to classify citations. Indeed, Scite is a system that uses machine learning to categorize citations into “mentioning”, “supporting” and “contradicting” (plus an “unclassified” category for when the system isn’t confident).

Scite example showing cites of 10.1056/nejmoa1200303

At the time of this post, the tool was still in closed beta (EDIT: it is now available), but as you can see above, besides letting you filter by type of citation, you can also filter citations by the section in which they appear (e.g. “intro”, “method”, “results”, “discussion”).
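Scite’s actual models and training data aren’t public, but the general shape of such a pipeline is easy to sketch: train a text classifier on labelled citation statements and fall back to “unclassified” when the predicted probability is low. A toy version with scikit-learn, using an invented three-example training set:

```python
# Toy sketch of a confidence-thresholded citation classifier. The training
# set is invented for demonstration; this is not Scite's actual model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

statements = [
    "Our results confirm the findings of Smith et al.",
    "We failed to replicate the effect reported previously.",
    "Several studies have examined this question.",
]
labels = ["supporting", "contradicting", "mentioning"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(statements, labels)

def classify(statement: str, threshold: float = 0.5) -> str:
    """Return the predicted label, or 'unclassified' if confidence is low."""
    probs = clf.predict_proba([statement])[0]
    best = probs.argmax()
    return clf.classes_[best] if probs[best] >= threshold else "unclassified"

print(classify("These results support earlier observations."))
```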

One can think of ways such tools could be used, e.g. mapping claims to supporting evidence, or building new metrics, but this all hinges on the precision and recall of the judgements made by the machine learning tool.

Can we determine which citations are influential in a paper?

If telling whether a citation is positive or negative is hard, how about telling whether a cite is important or critical to the paper? We know that many of the cites people make are not really critical to the paper, but what if we could identify the cites that are significant?

In fact, yes we can, and this is a feature in Semantic Scholar, another fairly new niche search engine limited to the domain of computer science.

Semantic Scholar not only shows cites but tries to identify influential citations.

How does that work? In a fascinating paper entitled Identifying Meaningful Citations, the authors describe the work they do to identify which citations to a given paper are important and which are not.

Using a hand-coded set of citations, they use machine learning to train the system to recognise important citations. Impressively, it is designed to catch not just direct citations but also “indirect citations”.

Some citations are direct, i.e., the citation follows an established proceedings format; others are indirect, where the work is cited by mentioning the name of an author, typically the first author, the name of the cited algorithm, or a description of the algorithm

For instance,

Some indirect citations it is trained to recognise

They tested a bunch of features, but it turns out that the number of times a citation appears in the paper (both in total and per section), the section in which it appears (e.g. an appearance in the methods section usually matters more than one in the review section) and author overlap are the important features.

Their system has high recall for recognising important citations but only moderate precision (0.65).
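The paper’s actual feature set and model differ in detail, but the approach is easy to sketch: turn each citation into a small feature vector (mention counts, section, author overlap) and train a supervised classifier on hand-coded labels. All numbers below are invented for illustration:

```python
# Toy sketch of the feature-based approach in "Identifying Meaningful
# Citations". Feature values and labels are invented; the paper's actual
# feature set and model differ in detail.
from sklearn.ensemble import RandomForestClassifier

# features: [total_mentions, mentions_in_methods, mentions_in_review,
#            author_overlap (0/1)]
X = [
    [6, 3, 0, 1],  # cited often, in methods, shared author -> important
    [1, 0, 1, 0],  # single mention in the review section -> incidental
    [4, 2, 1, 0],
    [1, 0, 0, 0],
]
y = [1, 0, 1, 0]   # 1 = important citation, 0 = incidental

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[5, 2, 0, 1]]))  # likely predicted important
```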

Another interesting bit about Semantic Scholar is that it can identify surveys and reviews using heuristics.

3. Autosummarization of text — Scholarcy, Paper Digest, Iris.AI

In an earlier post, I talked about Scholarcy, a tool that takes text from articles and book chapters and provides an automatic summary of the text.

Scholarcy summary of abstract, key points etc.

A similar tool is Paper Digest, which takes PDFs of articles and summarizes them.

Paper Digest summaries
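The vendors don’t publish their methods, but the classic baseline behind tools like these is extractive summarization: score each sentence and keep the top few. A toy frequency-based version, purely for illustration:

```python
# Toy extractive summarizer: score sentences by word frequency and keep the
# top n, in original order. Real tools add positional cues, section
# weighting and much more; this is illustrative only.
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # score each sentence by the total frequency of its words
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    top = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in top)

print(summarize(
    "Google Scholar has high recall. It has low precision. "
    "Precision and recall both matter for systematic reviews.", 1))
```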

I’m personally a little on the fence about the utility of such tools. I suppose it is a question of the value they add versus simply skimming a paper’s abstract and discussion.

Of course, the more familiar you are with a field, the less help you need from auto-summaries. So the question is: can these tools go beyond summarizing and help novices in the area get their bearings? In other words, can they help provide context?

Scholarcy provides some basic support for this by pointing to Wikipedia articles on the topics it detects.

Background reading with links to Wikipedia provided by Scholarcy
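Scholarcy’s actual topic detection is surely more sophisticated, but the basic idea of turning detected terms into background links is simple to sketch; the heuristic below (frequent capitalized terms) is invented for illustration:

```python
# Toy sketch of the Wikipedia-linking idea: pick candidate topic terms and
# build Wikipedia URLs for them. The heuristic is invented; this is not
# Scholarcy's actual method.
import re
from collections import Counter

def background_links(text: str, n_topics: int = 3) -> dict[str, str]:
    # naive topic detection: most frequent capitalized multi-letter terms
    terms = re.findall(r"\b[A-Z][a-z]{3,}\b", text)
    top = [t for t, _ in Counter(terms).most_common(n_topics)]
    return {t: f"https://en.wikipedia.org/wiki/{t}" for t in top}

sample = ("Bibliometrics studies citation patterns. "
          "Bibliometrics uses citation databases.")
print(background_links(sample))
# -> {'Bibliometrics': 'https://en.wikipedia.org/wiki/Bibliometrics'}
```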

But this only scratches the surface of what might be possible in this area. A very ambitious project that has been announced is “Get the Research” by Impactstory, the people behind Unpaywall.

Get the research

Unpaywall has built up an archive of 20 million open access papers, and Impactstory has now obtained funding to see if it can use the latest machine learning techniques to build an “AI-powered Explanation Engine”.

What kind of tools? Well, let’s go back to the Hamlet example…today, publishers solve the context problem for readers of Shakespeare by adding notes to the text that define and explain difficult words and phrases. We’re gonna do the same thing for 20 million scholarly articles. And that’s just the start…we’re also working on concept maps, automated plain-language translations (think automatic Simple Wikipedia), structured abstracts, topic guides, and more. Thanks to recent progress in AI, all this can be automated, so we can do it at scale. That’s new. And it’s big. — https://gettheresearch.org/

Finally, we come to Iris.AI, an interesting tool that aims to outdo traditional search tools: you can search by writing problem statements or uploading papers. It uses

a combination of keyword extraction, word embeddings, neural topic modeling, word importance based similarity of document metrics and hierarchical topic modeling. The approach is mainly unsupervised but we utilize an evaluated annotation set from our community of AI Trainers for benchmarking and improving our tools.

Time saved by Iris.AI
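To make one of those ingredients concrete, here is a toy sketch of matching a problem statement to papers via word embeddings and cosine similarity. The three-dimensional “embeddings” are invented; a real system like Iris.AI’s would use trained embeddings over a large corpus and far richer models:

```python
# Toy sketch of embedding-based matching of a problem statement to papers.
# The tiny vectors below are invented for illustration only.
import numpy as np

EMB = {
    "citation":  np.array([0.9, 0.1, 0.0]),
    "analysis":  np.array([0.7, 0.3, 0.1]),
    "sentiment": np.array([0.6, 0.4, 0.2]),
    "protein":   np.array([0.0, 0.2, 0.9]),
    "folding":   np.array([0.1, 0.1, 0.8]),
}

def embed(text: str) -> np.ndarray:
    """Average the embeddings of known words (zero vector if none known)."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

problem = "sentiment analysis of citation statements"
for paper in ["citation sentiment analysis", "protein folding"]:
    print(paper, round(cosine(embed(problem), embed(paper)), 3))
```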

Conclusion

Many of the tools above are very new, and many take advantage of the open access and open data wave, in which anyone can gain access to millions of full-text articles and apply the latest machine learning and AI techniques to help researchers do science better.

Indeed, we are in the very early stages of this development, and I suspect we will be seeing quite a few new innovative tools appearing along the same lines.

As “Get the research” notes, such developments will allow us to “finally cash the cheques written by the Open Access movement.”

Aaron Tay
A librarian from Singapore Management University. Into social media, bibliometrics, library technology and above all libraries.