The next generation of discovery citation indexes — a review of the landscape in 2020 (I)

Aaron Tay
Academic librarians and open access
30 min read · Oct 7, 2020


A Chinese translated version is available here

Some Discovery Citation Indexes in 2020

In terms of cross-disciplinary citation indexes used for discovery, everyone knows the two incumbents — Web of Science and Scopus (2004). Joined by the large, web-scale Google Scholar (2004), these three reigned as the “Big 3” of citation indexes, more or less unchallenged, for roughly a decade.

However, roughly a decade later, around 2015 and in the years after, a new generation of citation indexes started to emerge to challenge the Big 3 in a variety of ways.

As of the time of writing in 2020, some of these new challengers have had a few years of development behind them. How do things look now?

First off, using newer techniques and paradigms, we have for-profit companies like Digital Science launching Dimensions (2018), which strikes me as a challenger to Scopus and Web of Science in the arena of citation/bibliometric assessment, just as Scopus itself was a challenger to the older Web of Science back in 2004.

On the other end of the spectrum, we have the rise of more “open” citation indexes. A very important player in this area is the relaunched Microsoft Academic (2016), which not only uses web-crawling technologies like Google Scholar to scour the web, but also applies the latest Natural Language Processing (NLP)/“semantic” technologies and makes the resulting dataset, dubbed the Microsoft Academic Graph (MAG), available under open licenses.

New [May 2021]: Microsoft Research has announced it will discontinue MAG as of 31 Dec 2021. This is a great loss, though other sources of open citations and data remain available. Our Research (the team behind Unpaywall) has announced a possible partial replacement called OpenAlex.

Semantic Scholar (2015) is yet another project with Microsoft ties (it is funded by the Allen Institute for AI) that plays in the same arena and releases data under open licenses (S2ORC, the Semantic Scholar Open Research Corpus, is the newer version with some significant differences versus the older Semantic Scholar Open Research Corpus). One of the more “semantic” features of this search engine is that it uses machine learning to classify citations by whether the cite is for background, methods or results.

While scite (2018), a new citation index by a startup, does not provide open data, its selling point is the use of NLP to classify citation relationships into “Supporting”, “Disputing” and “Neutral” cites, which is yet another way of contextualizing research by describing citation relationships.

Besides the two above-mentioned well-funded projects, we also see more grassroots movements such as 2017’s I4OC (Initiative for Open Citations), an amazingly successful push to get publishers to deposit references in Crossref and make them open, as well as efforts by OpenCitations.net (a founding member of I4OC) to extract citations from open access papers in PMC to produce the OpenCitations Corpus (OCC). These have served to further increase the pool of scholarly metadata and citations available in the public domain/CC0.

New Hybrid/combined discovery citation indexes

For the first time, by combining some or all of the below:

a) citations made open in Crossref by publishers (about 50% of works with references in Crossref are now open thanks to I4OC)

New! [Mar 2021]: Elsevier, which was the last hold-out among the five biggest publishers when it came to making references open, changed its policy in Dec 2020. This was followed by two other big publishers, ACS and Kluwer, resulting in a leap to 87% of the scholarly literature indexed in Crossref having open references!

b) citations and metadata from sources like OpenCitations.net, Wikidata/WikiCite and Fatcat (Internet Archive)

c) data made available in Microsoft Academic Graph and other sources

it is now possible for big, comprehensive and mostly free-to-use “hybrid/merged” citation indexes to arise from the aggregation of the above-mentioned sources. Still, even with the raw materials freely available, one should not underestimate the effort needed to combine, normalize and clean the data, nor the effort to create compelling user interfaces that add value.
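
To make source (a) concrete, here is a minimal sketch (in Python, against the public Crossref REST API) that pulls the openly deposited reference list, and the citation count Crossref derives from such references, for a single work. The DOI used is an arbitrary illustrative example, not one drawn from the studies discussed here.

```python
import requests

# Fetch Crossref metadata for a single work. References that a publisher has
# deposited openly show up in the "reference" field of the response.
# The DOI below is just an arbitrary, illustrative example.
doi = "10.1038/nature12373"
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]

print(work.get("title"))
print("References deposited:", len(work.get("reference", [])))
print("Cited by (Crossref count):", work.get("is-referenced-by-count"))

# Each reference entry may carry a matched DOI or just an unstructured string.
for ref in work.get("reference", [])[:5]:
    print(" ", ref.get("DOI") or ref.get("unstructured"))
```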

There are at least half a dozen such discovery citation indexes. A typical example is Cambia’s Lens.org (2017), which marries patent sources (their speciality) with scholarly metadata and citations from Microsoft Academic, Crossref, Unpaywall, PubMed, JISC CORE and more to create a powerful, free discovery citation index with power-user search and visualization features.

Others include Scinapse, NAVER academic, Scilit and more. But do these new alternatives bring anything interesting or desired by users to the table?

Lens.org, Scinapse, Semantic Scholar — some “hybrid” discovery indexes

Such hybrid-type citation indexes may also have question marks hanging over them in terms of speed of updates (see, for example, the informal analyses by expert Jeroen Bosman here and here), cleanliness and consistency of data when merging so many varied sources, and sustainability, as they rely on upstream projects and initiatives to continue providing the raw materials required to construct the index (e.g. will Microsoft Research simply stop updating Microsoft Academic?).

What follows below is Part I of this series, which is an overview of what I call the “big 3” and my assessment of their strengths and weaknesses.

This will be followed by Part II, which is an overview of a selected number of new discovery citation indexes, their interesting features, and the value they may have in two major areas:

a. As a discovery tool for individuals

b. As a research assessment tool for tracking and measuring the performance of individuals, groups, departments, institutions and even countries.

Discovery tools compared

Discovery services/Citation indexes that are included in the comparison are

  1. Dimensions (Digital Science)
  2. Microsoft Academic (Microsoft Research)
  3. Semantic Scholar (Allen Institute for Artificial Intelligence)
  4. Various new “hybrid” engines based on merging open sources like Lens.org, Scinapse, NAVER academic, Scilit — focusing on Lens.org (Cambia)
  5. scite (scite) & others³

Note: to be eligible for comparison, a service can be either free or commercial, but it must cover cross-disciplinary domains (so, for example, Meta.org is excluded) and provide its own citation counts¹ (so we exclude mega repository aggregators like CORE or BASE as well as library discovery search engines like Primo or Summon²).

I’m also focusing on systems which can reasonably be expected to be used in the real world by users, which is why I’m excluding the OpenCitations Corpus (OCC) and COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations), which are more datasets than search engines for end users, though some of them are in fact already included in some of the above-mentioned citation indexes.

  1. In general, many of the newer discovery citation indexes today blend together and dedupe citations from various open citation sources like Microsoft Academic, Crossref, PubMed etc. Any such work creates a “new” citation index in my book.
  2. Library discovery services like Summon and Primo are traditionally not citation indexes, but libraries that also subscribe to citation indexes like Scopus and Web of Science can see citation counts from those sources in Summon or Primo. The Citation Trails feature in both Summon and Primo does look like the beginning of a citation index, with data drawn from Crossref and other sources; however, I exclude them from this analysis, as this still isn’t a major component of such systems.
  3. This is a fast-evolving field. Another interesting startup that produced its own discovery index through harvesting the web was 1Science. Its product, initially named OAfinder and later renamed 1Findr, was eventually acquired by Elsevier. As I write this, its future is unclear, but the free version is still up at https://1findr.1science.com/home. Another interesting one is ResearchGate, which has its own citation index. Yet another I missed is ScienceOpen, a relatively small one at 60 million items indexed; I’m not very familiar with it.

Part I — An overview of the Big 3: Web of Science, Scopus and Google Scholar

Before going into the new generation of citation indexes, it is perhaps important to understand the positions that the big three citation indexes (Web of Science, Scopus and Google Scholar) hold in this area.

Web of Science is, of course, the OG citation index. Started by Eugene Garfield in the 60s, the Science Citation Index, as it was known then, existed in hard-copy form before moving to the first legacy computer systems in the 80s. Owned by a series of companies, most notably Thomson Reuters in the 2000s, it was spun off as a separate company, Clarivate, in 2016.

The Clarivate timeline

If you are confused by the terminology, Web of Science is technically the name of the web platform, which houses different databases and citation indexes. The most important of these are what is called the “Core Collection”. Traditionally it has consisted of the Science Citation Index Expanded (SCIE), Social Sciences Citation Index (SSCI) and Arts & Humanities Citation Index (AHCI); there are more now, but these are the ones you still hear about the most.

Due to its legacy, Web of Science for years had a dated interface with search limitations that look strange to modern eyes (e.g. due to the storage/processing restrictions of bygone days, only the first author was indexed). Much of this has slowly been remedied in the last 5 years.

In the meantime, Elsevier launched Scopus in 2004, which took aim at such weaknesses. Scopus could be described as Web of Science redesigned from the ground up with the capabilities of the 2000s in mind, and as a result it boasted a relatively modern UI and better search capabilities.

Typical Scopus interface as of 2020

Content-wise, when Scopus first launched it was pitched as a citation index that covered more ground in terms of journal titles than Web of Science. On the other hand, the newcomer could not match Web of Science’s retrospective indexing of sources (many of which were in print format at the time) and restricted itself to indexing content from 1996 onwards.

Today the differences between the two have diminished. Scopus has gone back to fill its back files to the 70s (though Web of Science still goes back further), while Web of Science has added additional indexes such as the Emerging Sources Citation Index (ESCI) to counter arguments that it is too selective.

Both Scopus and Web of Science have also expanded to index source materials beyond journal articles, such as conference papers and books (e.g. Clarivate’s Book Citation Index (BKCI) and Conference Proceedings Citation Index (CPCI)).

Critiques of Web of Science and Scopus

However, despite this expansion, both Scopus and Web of Science have been critiqued by various bibliometric and “science of science” studies for being skewed towards STEM fields and biased against non-English-language journals (e.g. ignoring regional journals).

The coverage as measured in citations of Web of Science and Scopus is far poorer in non-STEM fields (Table 3)

In particular, because Scopus and Web of Science dominate as the bibliometric sources behind university rankings (e.g. the THE and QS rankings have generally used only Scopus or Web of Science in past editions), there is concern that using such citation indexes alone may not be sufficient to get a true picture of research quality and performance.

Times Higher Education World University Rankings has used Web of Science in the past as a source, but as of 2020 is using Scopus

Recent 2020 studies, such as “Comparison of bibliographic data sources: Implications for the robustness of university rankings” and “Evaluating institutional open access performance: Methodology, challenges and assessment”, suggest that metrics, and ranking orders based on those metrics, can look very different if you use different and often bigger sources of data than what is in Web of Science or Scopus alone.

Let’s now move on to the last of the Big 3, Google Scholar, probably the largest source of data.

Enter Google Scholar

Just as Scopus was entering the market and eventually gaining a foothold next to Web of Science to form a duopoly, another challenger was emerging — Google Scholar.

Anurag Acharya, co-founder of Google Scholar, in 2015 — reflecting on change due to the launch of Google Scholar

In a 10-year retrospective published in 2014, Anurag Acharya wrote about his journey inventing and developing Google Scholar (see also a more recent reflection by Anurag in 2020).

What problem was Google Scholar trying to solve? Anurag shared his experience as a student in India, where access was often an issue. But he found that when he couldn’t get access to something, he could at least write letters asking people for it, and surprisingly (to me at least), when he did that,

“Roughly half of the people would send you something, maybe a reprint.”

and yet he reflected that

“… if you didn’t know the information was there, there was nothing you could do about it.”

In other words, it seems to me that he felt the discovery problem was almost as important to solve as the access issue, and that is reflected in the work of Google Scholar.

It’s safe to say that today, 16 years on, he has been wildly successful, to the point that many readers of this piece may barely be able to empathise with Anurag’s experience as a student (access, rather than discovery, is seen as the bigger issue today). Google Scholar is probably the most popular and widely used cross-disciplinary academic search engine in the world, and nothing comes close.

What are the virtues of Google Scholar?

1. Size of index and speed of updating

The first thing that comes to mind is its size and speed of indexing compared to conventional databases.

Anecdotally, I have found that Google Scholar tends to be really quick at indexing newly published papers (including “article in press” and “early view” type papers) compared to most traditional A&I and library databases.

And indeed, a 2016 paper found that

“the median difference in delay between GS and Scopus of indexing documents in Scopus-covered journals is about 2 months. This finding suggests that the indexing speed of Scopus-covered journals in GS is faster than that of the same journals in Scopus. The delay is largely, but not exclusively, caused by the fact that reference lists in articles in press are added with a delay into Scopus”

In terms of index size, unlike Web of Science and Scopus, which are famously selective about the journals they index, Google Scholar flipped the whole thing on its head and tried to index anything on the web its harvesters came across, as long as it looked scholarly.

Back in those days, predatory journals were not quite a thing yet; today some indexes like Microsoft Academic do attempt various statistical methods to filter them out. But even now, some citation indexes simply try to include everything and can be said to follow an “index them all and let God (or the reader) sort them out” policy.

While all this sounds easy to say, it is hard to describe how big a technological leap and paradigm shift this was in 2004 compared to what Web of Science and Scopus were doing at the time, particularly since it was done almost entirely automatically (even today Google Scholar has only a small team of people).

It required Google to solve a series of tricky technical problems: navigating and harvesting content in repositories, often accompanied by poor metadata (to the point that they ended up recommending the Highwire tags because Dublin Core didn’t cut it), scraping academic paper PDFs for data, and identifying, grouping and merging different variants of papers to identify the primary item so that citations could do their magic for relevance ranking.

This is not to say there were no problems in the early years after Google Scholar launched in 2004, but my sense is that by the 2010s, and certainly by 2015, many of the early critiques of Google Scholar (essentially holes in coverage and extremely poor-quality metadata that could be surfaced with a few clicks) had mostly been remedied.

How big is the index of Google Scholar?

Unlike other indexes, both traditional ones and up-and-coming ones like Microsoft Academic (fueled by the Microsoft Academic Graph, or MAG) and Semantic Scholar, Google Scholar has been coy about revealing the size of its index, either in terms of article numbers or in terms of sources/journal titles indexed.

The lack of an API to extract data adds to the difficulty.

This has led to a series of academic papers that try to estimate the size of Google Scholar’s index using a myriad of indirect methods. For example, Khabsa & Giles (2014)’s capture-recapture technique leverages the known size of Microsoft Academic Search to estimate the size of Google Scholar, while “Methods for estimating the size of Google Scholar” (2015) employed as many as six different methods, many involving “absurd queries” and range queries to try to force Google Scholar to return all results. See also a more recent 2019 paper comparing the size of Google Scholar against other big indexes.

Methods for estimating the size of Google Scholar — Figure 6
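
For intuition, the capture-recapture idea works like a wildlife population estimate: if you know the size of one index and measure how often documents sampled from another index also appear in it, you can estimate the size of the combined population. The toy Lincoln-Petersen sketch below uses made-up numbers purely for illustration; it is not the actual procedure or data from the papers cited above.

```python
def lincoln_petersen(known_index_size, sample_size, overlap_found):
    """Toy capture-recapture estimate of a total document population.

    known_index_size: documents in the index whose size we already know
                      (playing the role of the "marked" population)
    sample_size:      documents sampled from the other, unknown-size index
    overlap_found:    how many of those sampled documents also appear in
                      the known index
    Estimator: N_hat = (n1 * n2) / m
    """
    if overlap_found == 0:
        raise ValueError("no overlap found; the estimate is undefined")
    return known_index_size * sample_size / overlap_found

# Illustrative numbers only (not figures from the actual studies):
# a known index of 50M records, a sample of 1,000 records from the other
# index, of which 300 are also found in the known index.
estimate = lincoln_petersen(50_000_000, 1_000, 300)
print(f"Estimated total population: {estimate:,.0f} documents")  # ~167 million
```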

So by 2015, a probably safe estimate is that Google Scholar had maybe 160 million articles, and it may be past 200 million by now. In comparison, Scopus in 2020 shows about 70 million articles.

When I write about indexes with more than 100 million articles, I often get comments that this is impossible. After all, Crossref, the main (but not only) DOI registration agency that grants DOIs for scholarly content, only hit 100 million DOIs issued in Sept 2018, and not all Crossref DOIs are registered to journal articles. The answer is multifold. For example, not all journal articles have a DOI (or necessarily one from Crossref). And a lot of entries in some indexes may indeed not be “journal type” content or even preprints (which today are starting to be issued DOIs by some preprint servers), but may include blog posts, research guides and more. As always, be careful with statistics!

2. Comprehensive full-text indexing, great relevancy ranking, and indexing & linking to free-to-read articles

Secondly, as described in the 10-year retrospective, Anurag didn’t simply stop at indexing metadata, which was standard for all A&Is at the time. Instead he knocked on the doors of publishers from Elsevier to ACS to get permission for Google Scholar crawlers to go behind the paywall and index the full text.

While some initially resisted, one by one the publishers gave in, as ignoring the traffic Google Scholar could bring was simply foolish, particularly if their competitors did not (I’ve seen estimates by publishers revealing that the bulk of their referrals come from Google and Google Scholar and not library systems).

The fact that Google Scholar indexes the full text of almost all major publishers is a big advantage that I seldom see mentioned. Its ability to show snippets of where your query terms matched in the paper gives you an insane amount of context that isn’t possible without full-text indexing.

You can easily tell whether a paper is likely to be relevant without even opening it.

Google Scholar matches full text, and the search snippets show you which articles match your keywords in context

Indeed the popularity of Google Scholar isn’t a big mystery.

It does the basics very well in a focused way, with almost unmatched coverage, full-text indexing, and excellent relevance ranking across many disciplines (which I think is itself a function of the large number of eyeballs and clicks that help Google optimise its relevancy ranking; this is something newer services will have a problem matching).

Another reason for its popularity, seldom remarked upon because most researchers are affiliated with institutions with paywall access, is that before open access was a big thing, Google Scholar was almost single-handedly carrying the torch by finding and reliably linking to free-to-read copies wherever available (journal page, repository, academic research network, author homepage), long before other discovery search engines and databases started to take providing access to free-to-read articles seriously.

Even today, most of the open access finding capabilities you see in databases come via Unpaywall, and Google Scholar is still one of the most reliable sources of links to free-to-read copies.

One more point that might be relevant for the humanities is that Google Scholar intersperses its results with results from the Google Books service. This is an utterly unique service, possible only due to large-scale cooperation with the biggest academic libraries allowing digital scanning of books, and as is well known it ran into copyright challenges.

While organizations like Microsoft might be able to match the Google Scholar service itself by crawling the web to index journal articles, they have nothing to match Google Books, and it is unlikely such an experiment can be repeated…

The academic libraries react to Google Scholar

In the 2010s, libraries tried to create their own version of Google Scholar and pinned their hopes on “web-scale discovery services” like Summon, Primo, EDS and WorldCat Discovery.

None of them managed to make much of a dent in the popularity of Google Scholar.

Publishers have found that the vast majority of their traffic originates from Google, Google Scholar and PubMed, which are what Lorcan Dempsey calls services at the network level, as opposed to library discovery services, which work at the institutional level.

96% of Springer’s traffic originates from Google, Google Scholar and Pubmed

Moreover, while Google Scholar was razor-focused on its objective (discovery of article-like content), library systems were often called upon to serve multiple roles, e.g. known-item searching for undergraduates looking for textbooks, or for historians looking for archives, leading to tradeoffs.

Exlibris Primo- a popular library discovery service and its stated uses

While I wouldn’t be so bold as to say that Google Scholar is the only and best discovery tool for all situations (for example, a focused disciplinary tool like PubMed or PsycINFO might often be a better choice), if you only knew one tool to use across all use cases, defaulting to Google Scholar probably wouldn’t be a bad choice.

Google Scholar as a bibliometric source

While citation indexes by their nature provide citation counts, arguably many, like Google Scholar, are focused primarily on use as discovery tools rather than as tools designed for senior administrators and bibliometricians to run bibliometric analyses and research assessments.

By the 2010s, Google Scholar started, almost belatedly, to add features. Nothing too fancy, mostly no-brainers: saved lists, easy citation functions, recommendations based on your profile and related work by profiles you follow, query suggestions, better mobile support, recognising when you cut and paste citations, etc.

Google Scholar — My Library feature

They even started to expand towards metrics provision and tracking, launching Google Scholar Profiles in 2011 and creating Scholar Metrics in 2012, an annual ranking of journals that compares with Clarivate’s ranked journal lists (Journal Citation Reports) and Elsevier’s Scopus journal metrics.

Google Scholar profile

How has Google Scholar fared in this attempt to get into the citation game versus Scopus and Web of Science?

My sense of the matter is that as a discovery search tool, Google Scholar has long eclipsed Web of Science and Scopus (see also various surveys on researcher behavior, such as this JISC report in 2015). Sure, there are some hold-out researchers (either old-school professors who started their careers when the Science Citation Index was all-powerful, or researchers from certain countries where publishing in “SCIE/SSCI” journals is a big deal, though even that might be ending soon) who insist on searching in Web of Science or Scopus because they can only cite journals included in those indexes. But by and large they are outnumbered by researchers whose first instinct is to search Google Scholar, particularly in fields less well covered by Web of Science.

But what about Google Scholar use as a bibliometric source for assessment?

There is no doubt, I think, that Google Scholar Profiles are extremely popular due to ease of setup and maintenance and increased visibility (a profile even gives you a chance to appear in the Google Knowledge Panel for Google searches!).

And there are researchers who love Google Scholar metrics for the higher citation counts they get, typically extracted using Harzing’s Publish or Perish software, the only tool officially allowed (or at least suggested to be so by some) by Google Scholar to scrape its results for bibliometric analysis.

One of the most often asked questions about Google Scholar is why it doesn’t offer an API or some way to bulk-extract data. Currently, ways to get the rich Google Scholar data en masse are limited to scraping the GS result pages using scripts, browser extensions and other tools (most famously Harzing’s Publish or Perish). These methods are very limited for large-scale use and constantly run into counter-measures like CAPTCHAs. I do not know if there is an official answer to why Google Scholar does not offer an API, but the general belief/suspicion going around is that in return for permission to index full text behind publisher paywalls, Google Scholar is not allowed to provide the content via an API. (Compare this with how Microsoft Academic does have an API but does not provide/index full text.)
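
To make the scraping point concrete, here is a minimal sketch using the unofficial, community-maintained scholarly Python package, which automates exactly this kind of page scraping. Function and field names are as I recall them for recent versions and may differ in yours; expect CAPTCHAs and rate limits for anything beyond small, occasional queries.

```python
# pip install scholarly  -- an unofficial, community-maintained scraper
from scholarly import scholarly

# Scrape a few Google Scholar results for a query. Each result is a dict
# assembled from the result page, so available fields can vary between
# versions of the package and between result types.
search = scholarly.search_pubs("open citation index coverage comparison")

for _ in range(3):
    pub = next(search)
    bib = pub.get("bib", {})
    print(bib.get("title"))
    print("  cited by:", pub.get("num_citations"))
```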

However, in the race for acceptability as a metrics provider, Google Scholar is at best a distant third to the other two. Part of this is that, unlike Web of Science or Scopus, there is no easy way to obtain Google Scholar data in bulk, due to the lack of APIs or interfaces designed for deep bibliometric analysis (Clarivate offers the add-on InCites and Elsevier offers SciVal).

There is also still a sense that only citation counts and metrics from “proper citation indexes” like Web of Science and Scopus count. Related to this, but somewhat less strongly held now, is the idea that only publishing in a journal indexed in these citation indexes counts.

Some of this is pure inertia, and a little is rooted in the possibly correct idea that Google Scholar data is still too unclean and inaccurate to be used compared to Scopus and Web of Science. But I think the fact that the reputable university rankings all use Scopus or Web of Science definitely has something to do with it too.

I would argue that as long as Scopus and Web of Science retain their grip as the de facto tools for measurement and assessment of research, their diminished role in discovery (mostly taken over by Google Scholar) will not hurt them that badly.

Is it worth paying so much for better quality data?

Quality costs. Representatives from Clarivate and Elsevier will no doubt justify the price tags on their products with this argument. When they were the only game in town, this argument was strong, but now, with competitors emerging, many of which are free (Microsoft Academic, COCI etc.), the question rears its head again, particularly when the coverage of some open datasets is comparable or bigger (e.g. the Microsoft Academic Graph). Though it is important to note that coverage and quality/accuracy may not be correlated.

What is the relative coverage of the various new citation sources versus Web of Science and Scopus? Studies on this are still not numerous, but general patterns have started to emerge. In terms of coverage (measured by citations), Google Scholar is the undisputed largest, followed by Microsoft Academic (open data available via the Microsoft Academic Graph). Dimensions, Scopus and Web of Science are in the next tier in terms of size.

Unfortunately, the set of open citations in Crossref, e.g. OpenCitations’ COCI (which is derived from Crossref data and focuses on DOI-to-DOI citation links), takes up the rear due to hold-outs among a very few big publishers, namely Elsevier, ACS and IEEE, who refused to make the references their journals deposit into Crossref open.
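
If you want to see what COCI’s DOI-to-DOI citation links look like in practice, the sketch below queries the public OpenCitations COCI REST API for the citations pointing at a single, arbitrarily chosen DOI; field names follow the COCI API documentation as I understand it.

```python
import requests

# Ask OpenCitations' COCI index for the DOI-to-DOI citation links pointing at
# one (arbitrary, illustrative) DOI. COCI only knows about citations where the
# citing work's references were deposited openly in Crossref.
doi = "10.1002/asi.23329"
url = f"https://opencitations.net/index/coci/api/v1/citations/{doi}"
citations = requests.get(url, timeout=30).json()

print("Open DOI-to-DOI citations found:", len(citations))
for record in citations[:5]:
    # Each record links a citing DOI to the cited DOI, with a creation date.
    print(record["citing"], "->", record["cited"], record.get("creation"))
```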

Update March 2021! Elsevier surprised the academic world by announcing it would sign the Declaration on Research Assessment (DORA), which includes a commitment to make the reference lists of all articles openly available via Crossref. This is a big shot in the arm for filling the gap of missing citations in Crossref! Kluwer and ACS have since followed suit in 2021, and now 87% of articles in Crossref have open references!

Major holdouts to supporting open citations, Elsevier, Kluwer and ACS have given in as of March 2021, leaving IEEE.

Early studies comparing new-generation citation indexes (often based on free sources such as the Microsoft Academic Graph and Crossref) with gold-standard ones like Scopus do indeed find that the former tend to be less accurate, for various reasons (lots of scraped data etc.).

Just as Google Scholar benefits from a Matthew-effect/network effect, with more visitors leading to more data for optimising relevancy, Scopus and Web of Science, as *the* definitive bibliometric/citation sources, benefit from decades of researchers and librarians who are motivated to go through the data and point out errors to correct (particularly errors concerning their own works).

One also wonders if institutions using products from Elsevier and Clarivate like SciVal, InCites and Pure might further provide signals for improving quality, e.g. affiliation accuracy.

After all, there is a limit to how much AI/ML can do in spotting such errors, disambiguating authors, etc.

Still, in the long run, justifying huge costs by pointing to quality differences can only go so far.

One wonders how legacy systems like Web of Science or Scopus will continue to justify their existence as pricey A&Is when you can get much of the same coverage, and perhaps even more, from freely available sources. You can choose sources like Lens.org, which have nice, user-friendly interfaces, or, for those with technical chops, work with the open data provided by the Microsoft Academic Graph to produce custom dashboards.

Will such open citation data continue to improve in quality, perhaps via crowdsourcing efforts? Will we reach a point where the open citation sources are good enough and the virtues of openness start to trump quality?

The answers to questions like these will, I think, shape how the future of citation indexes plays out in the field of bibliometrics.

Can Google Scholar be beaten in the discovery game?

  1. The ethical/moral argument

One type of argument against Google Scholar targets the danger of over-relying on a single monopolistic party to provide all academic discovery needs. Typically the argument goes: Google is famous for abruptly giving up on projects, and it doesn’t help that Google Scholar isn’t a particularly core service. It would be a disaster if we put all our eggs in one basket and the basket breaks…

Academic libraries have used this argument to push back against calls to give up on discovery, cede it to Google and Google Scholar, and focus instead on delivery.

Another moral/ethical argument against Google Scholar is that if everyone uses it, Google might gain too much power; even if they are not spying on us for any evil purpose, they could easily access and mine the worldwide pattern of searches in academia to gain even more insight. After all, knowledge is power.

Note: Anurag Acharya has maintained many times, in fairly strong terms, that Google Scholar, unlike Google, does not do much tracking of individual searches or personalization. Listen here to his answer to a clarifying question by Lisa Hinchliffe in 2015. But of course, mining in aggregate can still be very valuable without tracking individuals.

While all these arguments are fine as far as they go, I suspect pragmatism tends to prevail and academics will use what works. To make me move away from my default go-to tool, which increasingly is Google Scholar, you will need to show me something that is at least as good, and more likely appreciably better. Moral and ethical arguments only go so far, after all.

Perhaps the main ethical-type argument I can think of that might work is to leverage the push for Open Science and argue that Google Scholar data is not transparent, as there is no API or underlying data to check against, so doing literature reviews with it is not kosher because the results are not reproducible.

On the other hand, some of the latest competitors to Google Scholar actually provide and/or consume open data (examples include Microsoft, with their Microsoft Academic Graph data licensed under ODC-BY, and Semantic Scholar’s S2ORC), making results somewhat more transparent.

One of the problems of wanting to make such data open and the search transparent, I suspect, is trading off the desirable user feature of being able to search within full text, even full text behind paywalls. Because of copyright issues, S2ORC can only release metadata and open access papers, while the Microsoft Academic Graph data that powers Microsoft Academic can be released as open data precisely because it contains no full text (though full text may be processed for transformative uses, e.g. to extract fields of study).

For whatever reason, large-scale full-text indexing in discovery citation indexes is rare. Besides Google Scholar, Dimensions is perhaps the only other one that does it at any scale. Most others, like MAG, either do not match full text or only match full text from a relatively small corpus of open access papers (e.g. Lens.org drawing from JISC CORE).

2. Competing with Google Scholar as a discovery tool on utility

Still, leaving aside such moral arguments, I see two ways new competitors can distinguish themselves from Google Scholar in the discovery game.

But first, for you to even have a chance of being competitive, we assume you have resources similar to Google’s to throw at the problem of crawling the web and mining the required data (say, a Microsoft Research), or, failing that, the chops to merge existing open data sources (of which there are many now) to create your own index that is competitive in size with Google Scholar.

In fact, it is likely that despite all the competition from players like Microsoft that use similar techniques, Google Scholar still has the biggest index. The studies I have followed on index sizes all still point one way.

Simply put, Google Scholar is bigger than all the rest, often by significant amounts, no matter how you slice it. To be fair, some, like Microsoft Academic, do get close in some studies, but the undisputed coverage king is still Google Scholar as I write this in 2020.

For example, see the 2020 study here comparing Google Scholar against other important sources including Web of Science, Scopus, COCI (essentially Crossref open citations; note this was done prior to Elsevier, ACS and Kluwer opening their references, which boosted the percentage of works with open references in Crossref from 50% to 87%) and Dimensions.

Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations (Figure 1)

As you can see, Google Scholar covers 88% of the whole aggregated set, while the closest competitor, Microsoft Academic, covers only 60%.

Going more old school

It has long been known, particularly by researchers and librarians specializing in systematic reviews, that Google Scholar doesn’t allow particularly good control over searches.

While Google Scholar does include in its index most papers that are identified for a systematic review (high recall), the lack of precision-control features means you cannot efficiently run searches to get at those papers (low precision).

Some of the features it lacks include:

  1. Support for long, complex queries (searches are limited to 256 characters)
  2. Support for nested boolean searches
  3. Wildcard and proximity operators (auto-stemming is supported but cannot easily be turned off)
  4. More than limited support for field searches
  5. More than 1,000 results per search, or bulk export

Add the fact that Google Scholar search results are believed not always to be reproducible, and even fears of a Google Scholar filter bubble (which may or may not pan out), and these seem to be areas a competitor could improve on.

This seems to be the approach Lens.org has taken, with extremely powerful structured search features such as numerous field searches and facets as well as sophisticated boolean search syntax. I’ll review Lens.org in Part 2 of this series, but for now, if you are curious, take a look at “7 reasons why you should try Lens.org” (updated for Release 5.16.0, March 2019).

Lens.org structured search

That said, how big is the market of users who want such features? Most researchers do seem satisfied with the limited control they have in Google Scholar.

Going more “semantic” and pushing past 10 blue links

Secondly, one could take the opposite approach and push towards the fully semantic way of doing search.

What do I mean by this?

While Google Scholar isn’t 100% strictly boolean, as we have seen earlier, it still has the trappings of a keyword search system: you can use OR operators, quote phrases, etc., giving you some control over the search (though you may not realise that Google Scholar sometimes quietly changes parts of your search by expanding terms, dropping a term or two, etc.).

But this is nothing compared to systems like Microsoft Academic or Semantic Scholar, where boolean operators, even simple ones like OR, are thrown out of the window entirely and the system tries to interpret your search.

Microsoft Academic interprets your search such that even with one word off in the title it can guess the right paper title (example from blog)

And of course, Google Scholar has been quite conservative in adding features (at least on the surface). Over the years, while Google search has progressed way beyond the “10 blue links” paradigm by adding Knowledge Graphs, featured snippets from webpages to handle Q&A queries, and even the latest NLP techniques like BERT, it is unclear how much Google Scholar has benefited beyond the generic changes.

Perhaps some of the new, innovative “semantic” features that add further context to searches can help one push past Google Scholar?

Contextualizing research

One major approach I have seen in some of the newer citation indexes is the push towards contextualizing research by tracking links beyond just those between papers, or even conference papers and books.

For example, Digital Science’s Dimensions tracks links between publications, grants, funders, clinical trials, datasets, patents, policy documents, etc.

Semantic Scholar does the same by linking papers to preprints, slides, videos, presentations, code libraries and even online mentions (tweets, blog posts, news stories).

The other major area has been automatic topic/entity extraction to assign concepts, best seen in Microsoft Academic’s auto-generated “field of study” tags.

Using advanced NLP techniques like hierarchical topic modeling, they are able to automatically generate hundreds of thousands of controlled topics in a six-layer hierarchy and classify papers into them; the hierarchy is not only used under the hood by the engine but is also exposed to the human user, who can use it to browse.

A Microsoft Academic Field of study page — “Library classification”

They claim this system is self-learning and was able to quickly recognise new clusters of research emerging, such as the COVID-19 topic.

Another interesting area is the application of semantics to citing behavior.

For example, instead of just counting citations, Semantic Scholar uses NLP techniques to classify citations by whether the cite is of the methods, results or background, and also to flag whether a reference is highly influential to the citing paper.

This allows Semantic Scholar to put its own spin on Google Scholar’s very useful “search within citing articles” feature, which I often use when looking at seminal papers or review papers with hundreds of cites.

Typical search within citations of a paper in Google Scholar

In Semantic Scholar, on top of doing a keyword search within the citing papers, you can filter them using various criteria, such as citation type.

Semantic Scholar tutorial on citation overview
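
These citation classifications are also exposed through Semantic Scholar’s public API, so they can be used programmatically as well as in the web interface. Below is a minimal sketch against the v1 paper endpoint for an arbitrary DOI; the exact field names (“intent”, “isInfluential”) are assumptions worth checking against the current API documentation.

```python
import requests

# Fetch one paper (arbitrary example DOI) from the Semantic Scholar v1 API and
# look at how its incoming citations are classified. The field names below
# ("intent", "isInfluential") are assumptions to verify against the docs.
doi = "10.1002/asi.23329"
url = f"https://api.semanticscholar.org/v1/paper/{doi}"
paper = requests.get(url, timeout=30).json()

print(paper.get("title"))
for citation in paper.get("citations", [])[:5]:
    print(citation.get("title"))
    print("  intent:", citation.get("intent"), "| influential:", citation.get("isInfluential"))
```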

Other possible innovations, such as scite’s classification of citations into “supporting” and “disputing” (again via NLP), are also interesting attempts at a different approach.

scite visualization by citation types

One of the latest scite improvements even allows you to selectively drill down and follow the citation graph by citation type, giving a new spin on the old practice of mining citations.

Going after the bibliometric provider role

Of course, at this point it is unclear whether any of this will be more than cool tricks that don’t pan out, and it may well be that outdoing Google Scholar as a pure discovery tool would not be profitable even if it were doable.

While competitors such as Microsoft Academic and Semantic Scholar, backed by deep pockets, can afford to be in the game, it is not hard to argue that trying to make money in an industry where the excellent, free Google Scholar exists, and where giants like Microsoft are giving away equally free discovery services, is a fool’s errand.

Throw in the increasing number of discovery indexes emerging by leveraging open metadata (e.g. Lens.org, Scinapse), and it seems to me that the discovery game is fast becoming a “red ocean” if you are a for-profit company.

Perhaps recognising this, Dimensions, by the for-profit company Digital Science, gives away a freemium discovery service that pretty much matches Google Scholar on most features (e.g. like Google Scholar, Dimensions takes an inclusive approach, including all the journals it can see) and even throws in some additional filters.

However, I suspect Dimensions is not actually meant to compete against Google Scholar but is targeted more at Scopus and Web of Science and their positions as the arbiters of research quality.

Unlike Microsoft Academic, the freemium version of Dimensions actually hides the institutional filter, smartly recognising that a major use case would be bibliometrics at the institutional level, and locking it away to ensure libraries or research offices that want to use it this way will have to pay.

You see a similar decision made for the Elsevier-acquired 1Science 1Findr service, where again the freemium version lacks institutional filters.

Think of Dimensions as a Web of Science or Scopus with more inclusive coverage, and, unlike Google Scholar, with APIs and easy ways to bulk-extract the data.

Looking at the additional feature sets of Dimensions Plus and in particular Dimensions Analytics (custom analytics and dashboards, support for Google BigQuery, etc.), I think it is obvious that the premium product is targeted at people who want more inclusive coverage than Scopus or Web of Science offer (similar to Google Scholar) plus easy bulk access to bibliometrics for assessment (unlike Google Scholar).

Of course, trying to topple Scopus or Web of Science, or at least muscle into their business and be recognised as a credible provider of metrics, is no easier than displacing Google Scholar in discovery.

As already mentioned, Scopus and Web of Science have built up brand names in this area and have a built-in base of thousands of librarians and researchers who will inevitably spot errors and feed them back for correction, leading to relatively clean data.

Also, for a citation index to be recognised as credible, it needs to be studied by as many third-party researchers as possible, and Scopus and Web of Science, for all their weaknesses, are extremely well studied thanks to decades of research. This is where Digital Science’s policy of encouraging bibliometricians who are interested in doing research on Dimensions data to apply for access comes in. Working with libraries and institutions to use Dimensions data for measurement is, of course, ongoing.

As of 2020, it is still too early to tell if Dimensions will get a foothold to challenge the front-runners, but it does look promising. Another possible contender for the bibliometric throne is Microsoft Academic’s open data, which, like Dimensions, has inclusive coverage and also offers easy bulk access via API or Azure cloud storage (technically the data is free; you pay for storage/access on Azure).

Conclusion

There is a reason why Google Scholar and Web of Science/Scopus are kings of the hill in their respective arenas.

They have strong brand recognition, a head start in development, and a mass of eyeballs and users that feeds an almost virtuous cycle of improvement. Competing against such well-established incumbents is not easy, even when one has deep pockets (Microsoft) or a killer idea (scite).

That said, the push towards more open sources of citations and metadata seems to be continuing.

In particular, open citations from Crossref were given a big shot in the arm in 2021: with the last major publisher holdouts Elsevier, ACS and Kluwer joining the fold, this pool of data might start to get competitive, particularly if merged with other open citation sources such as the Microsoft Academic Graph (MAG).

It will be interesting to see how the landscape looks in 2030.


A Librarian from Singapore Management University. Into social media, bibliometrics, library technology and above all libraries.