Google Scholar Metrics revisited: normalising for publication count

Leon Derczynski
Jul 25, 2016

Recently, Google Scholar updated their metrics. There was a bit of surprise in the computational linguistics community, as there is every year, where ideas of a venue’s grandeur align imperfectly with the citation data. Let’s have a look at what happened in this field. Here’s the ranking from Google:

Google Scholar metrics for CL — 2016 edition.

Without going into too deep a discussion of the evils of bibliometrics — that impact factor is broken, that citations are not all equal, that h-indices can be gamed (as in the successful experiment that faked personal and journal scores on a large scale) — let’s take this at face value and see where it goes.

First off, we can see that arXiv has rocketed to number three. Well, that’s a surprise for a non-peer-reviewed venue — or is it? In fact, arXiv’s relevant category has a lot of papers that are published in other venues or workshops; it contains many draft papers; it contains a lot of work published in other fields, where the authors have just ticked the “cs.CL” (computational linguistics) category box to market their paper to us. Good for them. But in short, arXiv contains a lot of papers — and it’s probably this sheer volume that means it also has a lot of well-cited papers.

That fact — that arXiv contains versions of a lot of well-cited papers — is what drives it to the top of the h5 listing. The h5-index is like the normal h-index — the largest number h such that h papers each have at least h citations — except measured over just the past five years. So a venue that gets a lot of papers, as long as the sample isn’t somehow selected to be bad papers, will end up having a good h-index. Another way you can think of it: it’s the side of the biggest square that’ll fit under a plot of papers ranked by citation count.

Here comes the square:

Image from Wikimedia Foundation

Found it? Good. That’s enough of that.
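The h-index definition above can be sketched in a few lines of Python (a minimal sketch; the citation counts are invented for illustration):

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Invented per-paper citation counts for a hypothetical venue:
print(h_index([10, 8, 5, 4, 3]))  # → 4
```

Note that adding more papers to the list can only raise the result, never lower it — which is exactly why sheer volume helps arXiv here.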

Another surprise for some was to see the sun & beach conference, LREC, up as high as fifth. This conference has a relatively high acceptance rate, and publishes all accepted submissions as full-length papers in its proceedings. It’s probably the largest conference in our field, attracting about 1500 delegates once every other year. It’s also a perfect subject fit for many papers, and as such attracts a lot of quality work. It doesn’t have the prestige and age one might expect from other venues like the CL journal, but we regularly see great, hard-hitting papers at LREC; take, for example, recent ones like the universal part-of-speech tagset, FreeLing 3.0, or the DBpedia paper. But it is a huge venue, so it could benefit from the same factors as arXiv.

To “correct” for these unexpected rankings, the suggestion to normalise h5 scores for paper count quickly emerged on Twitter. The idea is to correct for any distortions that come from a place publishing a lot of papers. This normalisation in fact just gives us a derivative of impact scores — which are a rough, violent method for measuring quality (citations tell us nothing, etc.) and something I personally find a rather vomitous metric, unless of course an article of my own lands in a high-impact journal, in which case they can’t be so bad, right? Chasing impact factors leads to nasty behaviours that take us away from useful scientific content; behaviours like demanding one cite a journal when submitting to it, or putting up papers with little intrinsic content that really push citation stats.
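To see how that normalisation plays out, here’s a minimal sketch with invented numbers (the venue names and figures below are made up): dividing a venue’s h5 by its paper count produces an impact-factor-like ratio that favours small, selective venues over large ones.

```python
# Invented numbers, purely for illustration.
venues = {
    # name: (h5_index, papers_published_over_the_period)
    "BigVenue": (60, 1500),
    "SmallVenue": (30, 200),
}

# Normalised score: h5 divided by publication count.
for name, (h5, papers) in venues.items():
    print(f"{name}: {h5 / papers:.3f}")
```

The big venue’s higher raw h5 (60 vs 30) is swamped by its volume: 0.040 against the small venue’s 0.150. That’s the distortion — and the new distortion — this post’s table trades in.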

Indeed, many citations don’t directly acknowledge or use the content of the paper they reference. I know this is the case for my TempEval-3 paper. It’s just as common to give a hand-wavy grounding reference to justify the existence of a problem area, to anchor readers to the literature on a topic, or to satisfy the herd-of-sheep-like motivation of “many others have looked at this problem, so we are going to as well”. Not a strong motivation — quite the opposite — but I digress! Many of the citations to highly-cited papers are like this; meanwhile, your own favourite work, the work people actually engage with, may be cited much less — and a citation from work that genuinely engages with your research still counts as just one citation. Personally, I found that the papers that were most scientifically satisfying to work on — validating Reichenbach’s model of tense, and our generalisation of Brown clustering — received less citation traction than my other projects.

So we could even say that focusing on citations really just encourages us all to write pop lit; write things that are general purpose, get them some visibility at the right time, and cross your fingers. That’s another one of those nasty behaviours impact factor encourages. Luckily, most good scientists would rather do good science and disseminate that — and most bad scientists haven’t noticed the hack, or haven’t managed to pull it off, at least.

Anyway, rant aside, the normalised table is below. The short name for the venue/conference is on the left.

Notes on the data:

  • For journals, front matter, back matter, tables of contents and corrections are not counted. Introductions to special issues are counted.
  • For conferences, the table of contents and tutorial notes are excluded, as are student research workshops, special workshops and other workshops.
  • TAC 2014 paper count is an estimate
  • The dates are from 2011 to some time around June 2016. Looking at the citation counts mentioned in Scholar metrics, it’s clear that they were a little old when the rankings were published (July 2016).
  • For the above reason, NAACL is shown with and without NAACL 2016
  • SLT is excluded for being too speech-y
  • Source data including per-year counts available here

So, what do we see? TACL is doing great, which is nice: really, we all know deadlines generate crap content, and I believe TACL should become the only method of entry to the ACL conference. This is just the same setup as VLDB, the best conference in the database community, which has a rapid-review journal, PVLDB, taking submissions monthly and then once a year gathering up its recent crop of papers and presenting them as the annual conference. PVLDB is now better thought of than the top classical journal, VLDBJ, in fact. Another benefit of this setup is that reviewing load can be somewhat amortised over the year (you do less of it all in one lump). Finally, we’d also get to satisfy the crotchety old idea (from outside computer science) that journal publications are somehow better.

Though that said, perhaps it’s not so crotchety. Journals are doing well in these normalised rankings — including JNLE — perhaps due to their reduced number of publications. Of course, these guys have had to court impact factors for decades already, and so should be experts now, but perhaps that’s unfairly cynical, and of course I’m joking here. Ha ha.

NAACL’s ranking (the North American version of the top conference, ACL) sucks in comparison to EACL (the European ACL) if we include the NAACL 2016 papers. Ignore this year’s crop from San Diego and the conference performs a bit more as expected.

And, unsurprisingly, LREC and arXiv are down at the bottom of the top. Just like China’s big population leads it to suffer in GDP per capita rankings and excel with super-green CO2 levels per capita, so these high-volume venues’ stats really suffer when normalised by publication volume.

TAC, the Text Analysis Conference, is apparently reasonably hard hitting. For those not familiar, the event focuses on shared tasks, like Knowledge Base Population (KBP) in which teams try to fill in Wikipedia infoboxes based on an unannotated 1.8M document corpus, or Summarization, where teams have to produce summaries from input document collections. TAC has driven a lot of research topics over the years and provided some great datasets to the community.

But is TAC, as this normalised ranking suggests, really a better venue than EMNLP, which was second before? Looking at EMNLP’s proceedings, we see:

  • 2011: 149 papers
  • 2012: 139 papers
  • 2013: 205 papers
  • 2014: 226 papers
  • 2015: 312 papers
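A quick back-of-the-envelope check on those counts — EMNLP more than doubled its output in four years:

```python
# EMNLP paper counts as listed above.
counts = {2011: 149, 2012: 139, 2013: 205, 2014: 226, 2015: 312}
growth = counts[2015] / counts[2011]
print(f"{growth:.2f}x more papers in 2015 than in 2011")  # → 2.09x
```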

That’s quite the acceleration! Something similar is happening with ACL, too. Publishing lots of papers reduces a venue’s normalised score. Does it make these conferences worse? I doubt it — rather, it highlights how much impact sucks as a measure. Sure, those monster poster sessions are impossible and unfair to everyone, presenter and listener alike, but I’d rather have those than a small ACL with bitter, undeserved rejections and low recall.

Goodness knows what’s going on with RANLP for it to beat ACL — apparently the methods for choosing content there are more likely to give well-cited papers than at ACL. That’s good for RANLP. I quite like the conference — there’s a pool, with a bar in it, and the food’s healthy and refreshing (or deep fried dead animal, whatever works for you). Though well-cited-ness is a dodgy proxy for quality and scientific contribution, especially over a time period as short as just five years. Sometimes you just have to read the paper and know its area in order to judge it. Outside on a sunny day over a shopska salad, preferably.


Of course, the conferences that have already had their 2016 iterations are at a disadvantage; their impact stats are accordingly reduced by this year’s extra paper volume, but no-one’s had time to read and cite their papers yet, so it’s a little unbalanced. That’s just the way this will always be — bad luck for not having your conference during July/August, just after the Google Scholar update, I guess. Rien à faire.

Note that we’re missing a few venues entirely; JAIR, for example, the Journal of Artificial Intelligence Research, is a big destination for CL content. It was confined to the Machine Learning category instead — a typical deficiency of Scholar’s hard categorisation, one might say. HyperText, WWW, ICWSM et al. similarly end up in other categories and are excluded from the Computational Linguistics rankings, despite being strong venues, especially in the social media space. And where’s SemEval?

So what can we take away from all this? Well, first, you can get a high impact factor by rejecting (almost) everything. But we also see that this doesn’t really matter; a low impact factor does not stop excellence from appearing at a venue. Which is just as it should be.

Enjoy! And don’t forget — ignore bibliometrics!
