Google Scholar Metrics revisited (again): normalizing, “ma non troppo”

The 2017 edition of the Google Scholar Metrics rankings of publication venues has just been released, covering citations to articles published in different venues throughout the 2012–2016 period. I have always found these rankings particularly interesting because they mix conferences, journals and even arXiv, allowing comparisons between them. This is invaluable when one is an academic in a country whose government and funding institutions believe with religious fervor that Thomson’s journal impact factors are the measure of all things, and therefore conferences, which do not have one, are worth zero. So I have used these data in appeals to convince them that really, ACL can be better than the shoddiest of bottom-tier journals. With mixed success, but I won’t bore you with that, because my purpose is to have a look at the ranking itself and its accuracy in the field of Computational Linguistics.

Here is a screenshot of the 2017 ranking for this discipline:

2017 Google Scholar Metrics ranking for Computational Linguistics (source)

A first thing that may be surprising is that arXiv, an increasingly popular repository in our field and host to recent controversies, takes first place, ahead of ACL. While arXiv definitely hosts highly influential papers, it is by definition a non-peer-reviewed site that lacks the quality requirements of top-tier conferences and journals. A closer look at the list reveals other trends that researchers familiar with these venues may find dubious.

In his analysis of last year’s ranking, Leon Derczynski discusses the reason: Google Scholar Metrics are based on the raw h5-index of each venue, i.e., their h-index computed over a period of 5 years. The h-index is the largest number n such that the venue has published at least n papers in that period that have each been cited at least n times. The problem is that this is a raw metric that does not take into account the total number of papers published. When applied to an individual researcher, this is probably reasonable as far as bibliometrics go (after all, in our “publish or perish” culture, more papers is better). When applied to publication venues, it can also be acceptable if one wants to measure raw impact (it is likely that arXiv cs.CL has the largest raw impact in computational linguistics, even if it also hosts inferior papers). But if one intends to measure venue quality, it does not make much sense, as it rewards sheer volume and thus punishes highly selective conferences and journals precisely for being selective: publishing more papers can never bring down your h-index. For this reason, Ani Nenkova suggested last year on Twitter that the rankings should be normalized by venue size.
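To make the definition concrete, here is a minimal sketch in Python of how an h5-index would be computed from a venue’s per-paper citation counts over a five-year window (the citation counts below are made-up illustration values, not real data):

```python
def h_index(citations):
    """Largest n such that at least n of the papers have at least n citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for the papers a venue published in 2012-2016.
venue_citations = [52, 40, 33, 18, 12, 9, 7, 4, 2, 0]
print(h_index(venue_citations))  # -> 7 (seven papers with at least 7 citations each)
```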

(By the way, this vice of ranking by raw counts without normalization is no stranger to influential university rankings — see for example the CWUR rankings).

Heeding the suggestion, Leon analyzes last year’s rankings after dividing each venue’s h5-index by its publication count, and finds that this corrects some of the surprising trends that favor larger venues.

However, I think he might have gone a bit too far with the normalization: the h-index does not scale linearly with the number of publications of a researcher or venue; it grows much more slowly than that. To see this, suppose that we have a conference that publishes p papers and has an h-index of n, meaning that n of these p papers have at least n citations. Now suppose that the conference publishes another set of p papers with the same citation counts as the previous p papers. The number of publications and citations in the conference has doubled, but its h-index has not, as that would require 2n papers with at least 2n citations, and what we now have is 2n papers with at least n citations.

Instead, the h-index scales roughly linearly with the square root of the number of publications (or, equivalently, of the number of citations, as publications and citations can be taken to be linearly related). This is because reaching an h-index of n requires n papers with at least n citations each, i.e., at least n² citations in total. Or, if you prefer to reason geometrically, the h-index is the side length of the largest square that fits inside a histogram of papers sorted by citation count, as seen in this Wikipedia image:

Source: Wikipedia, public domain

In fact, some experts have gone much further and determined that the h-index (for individual authors) can be approximated very well by 0.54 times the square root of the number of citations.
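As a quick sanity check of this square-root behaviour, here is a small simulation sketch. It assumes a hypothetical Zipf-like citation distribution, which is only a rough model of a real venue, and shows that duplicating the venue’s output (doubling both papers and citations) multiplies the h-index by roughly √2 ≈ 1.41, not by 2:

```python
def h_index(citations):
    """Largest n such that at least n of the papers have at least n citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical Zipf-like venue: the i-th most cited of 500 papers gets ~1000/i citations.
venue = [1000 // i for i in range(1, 501)]

h_single = h_index(venue)      # the venue as is
h_double = h_index(venue * 2)  # every paper (and its citations) duplicated

print(h_single, h_double, h_double / h_single)  # -> 31, 44, ~1.42: roughly sqrt(2), not 2
```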

For this reason, I think that Google Scholar Metrics should provide normalized rankings, but the normalization should consist of dividing the h5-index by the square root of the raw number of citations (or, as a proxy, of publications).

And here is what you may have been waiting for, if you have read this far: the data!

First, if we apply this square-root normalization to the 2016 ranking for Computational Linguistics that Leon Derczynski published, using his publication count data directly, we obtain the following results:

Then, I have also computed the ranking on 2017 data, using publication counts for the 2012–2016 period, obtaining the following:

I must warn here that these data need to be taken with a considerable grain of salt, as there is not always an obvious criterion for counting the number of publications in a given venue (in conferences, should we count demos? Student session papers? In journals, should we count book reviews? Editorials?). The ideal approach would be to count exactly the publications that Google Scholar indexes for each venue, but that does not seem possible. So I tried to follow reasonable criteria (counting from 2012 to 2016, as I understand that is the period over which citations are computed; using official published paper counts where available; counting whatever is in the main proceedings volumes but not in separate volumes such as demo session proceedings; etc.), but they may be far from optimal, especially since I didn’t have time to count papers one by one and used aggregates where possible. However, with the data above, anyone can plug in their own publication counts and see how that affects the rankings.
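As a minimal illustration of how to do that, the sketch below recomputes normalized scores from an h5-index and a publication count; the venue names and numbers are hypothetical placeholders, not the actual Google Scholar figures:

```python
import math

# (venue, h5-index, publication count for 2012-2016) -- hypothetical placeholder values
venues = [
    ("Big Venue A",       80, 1500),
    ("Mid Venue B",       35,  200),
    ("Selective Venue C", 25,   90),
]

# Normalized score: h5-index divided by the square root of the venue's size.
scored = sorted(((h5 / math.sqrt(pubs), name) for name, h5, pubs in venues), reverse=True)

for score, name in scored:
    print(f"{name}: {score:.2f}")
```

Note how, under this scheme, a small selective venue can end up above a much larger one even though the larger one has the higher raw h5-index.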

With that out of the way, what do the normalized rankings say?

  • CL and TACL are both top notch, no doubt. Apparently, both the classic journal process of CL and the new agile process of TACL succeed at gathering a selection of papers that people cite often.
  • ACL is also on the podium in the newly normalized 2016 rankings, but drops to 7th place in the 2017 ones, placing it behind EACL and NAACL. I don’t know whether this drop between 2016 and 2017 reflects some real trend or just a discrepancy in how Leon and I obtained paper counts. I suspect the latter, as my EACL and NAACL counts are smaller than his, and this cannot be an effect of the change in period, especially as there was no EACL in 2011. Personally, I find the idea of the global ACL conference ranking behind the continental ones surprising (although the jury is out!), so I’m inclined to trust Leon’s numbers more.
  • Top conferences, top journals and our own top journal-conference hybrid can be said to be roughly on par according to this metric.
  • arXiv loses prominence once this normalization is applied, but it’s still a venue to be reckoned with.
  • The bottom half of the list has seen a lot of changes with respect to the previous year. Note, however, that this bottom half is probably not very reliable: there are likely venues ranked below 20th place in Google’s list that should appear in this normalized ranking, and vice versa. Unfortunately, we are limited to the venues appearing in Google’s ranking, as it is our source for h5-index values.
  • Regardless of this, there is no doubt that some workshops are starting to kick serious butt.

And that is all for this year’s Computational Linguistics Google Scholar Metrics analysis. Let’s see who takes the baton next year and improves the methodology further. In the meantime, of course, comments and Twitter flamewars (not serious ones, as “objective” metrics should not be taken seriously) are welcome!