The Association Between Early ArXiv Posting and Citations

May 15, 2018

TL;DR

Should authors post their papers on arXiv before the reviewing process is complete? On the one hand, prepublishing on arXiv allows for accelerating research and making papers more freely available. On the other hand, there is a real danger of breaking the double-blindness of the review process, which may lead to peer reviews unfairly favoring some researchers over others. Further pluses and minuses of prepublishing on arXiv are covered elsewhere.

In this post, we analyze an additional potential source of unfairness: we find that posting a paper on arXiv before it is accepted as opposed to after is associated with about 65% more citations in the calendar year following the conference. We take author influence into account (via the authors’ h-index), as well as differences between conferences and overall time a paper has been available on arXiv, and observe a significant effect of pre-acceptance posting.

We hope that this finding will encourage the adoption of anonymized submissions (with pre-specified time limits on the anonymity) on arXiv and related platforms, similar to how the OpenReview platform was used for ICLR 2018.

A more in-depth version of this blog post is available as an arXiv paper: https://arxiv.org/abs/1805.05238

1. Introduction

ArXiv is a great tool for researchers. It’s a way to freely self-distribute drafts of papers of all kinds: already-published, soon-to-be-published, and never-published. It’s also a place for publishing work that does not fit traditional publication venues, like negative results and blog posts, so that it can be easily cited. And arXiv is becoming more popular, especially within the computer science community, as shown in recent work by Charles Sutton and Linan Gong.

But arXiv is not without its controversy. Yoav Goldberg wrote in his blog post that “there is also a rising trend of people using arXiv for flag-planting, and to circumvent the peer-review process. This is especially true for work coming from ‘strong’ groups. Currently, there is practically no downside to posting your (often very preliminary, often incomplete) work to arXiv, only potential benefits.”

In our blog post, we quantify some of the potential advantages for authors by studying the number of times a paper is cited when it is posted to arXiv before vs. after it is accepted for publication at major venues. In this post, we will only be looking at papers that (1) were eventually accepted, and (2) were posted on arXiv either before or after the acceptance notification. We account for several factors: time available on arXiv, publication venue, and author popularity.

We note that any perceived personal “advantages” for the authors may translate into “costs” for the overall community, and we are certainly not advocating using the findings in this blog post as motivation for pre-acceptance arXiv posting. The purpose of this blog post is to paint a sharper picture of arXiv-related fairness issues already brought to light by others.

2. Data

2.1. Venues

We chose to look at top-tier CS conferences which have a significant portion of their papers on arXiv both before and after acceptance: AAAI, ACL, CVPR, ECCV, EMNLP, FOCS, HLT-NAACL, ICCV, ICML, ICRA, IJCAI, INFOCOM, KDD, NIPS, SODA, WWW. We look at papers published from 2007 through 2016, so that we can count the number of citations they receive during the year following their publication.

To obtain this data, we queried Semantic Scholar for all the papers published in a particular conference. We then looked up each of these papers in the arXiv metadata dump (our source was Sutton and Gong’s work here), and obtained initial arXiv submission dates for each paper that was posted. The result was a total of 4392 papers that were both (a) accepted for publication and (b) posted to arXiv.

Below is a table that breaks down the papers by conference:

Conference   Papers
AAAI           3726
NIPS           3393
IJCAI          3001
WWW            2958
ACL            2676
ICML           2200
KDD            1661
ECCV           1477
EMNLP          1248
SODA           1234
HLT-NAACL       876
CVPR            467
FOCS            305
INFOCOM         183
ICRA            182
ICCV            156

Next, we needed to know the paper submission deadline of each conference. WikiCFP had about 90% of this data, and the rest we obtained through Google sleuthing (we would like to thank the maintainers of WikiCFP for their excellent and ongoing work).

For the outcome variable, we experimented with two variants of citations:

All citations (cites_1year): the number of times the paper in question was cited by any other paper published during the calendar year following the conference.
Influential citations (influential_cites_1year): similar to cites_1year but captures a smaller subset of citations which are more likely to indicate that the paper in question is critical for the citing paper. It only counts non-self citations (i.e., with no overlap in the author lists) where the paper of interest is referenced three times or more in the narrative of the citing paper, not always combined with other references, mentioned in context of experimental results, or explicitly mentioned as foundation for the citing paper.

Here is a histogram, showing that most papers in this population receive < 20 citations in the calendar year following the conference:

2.2. Author influence

We measured author influence by computing the h-index for all authors of each paper one year before the conference took place. Then we took the maximum h-index among all the authors of a paper and used this single value as a per-paper author influence summary statistic. For example, let’s take the (fictional) paper “Machine Learning for Common Good” by authors F. Humpty, S. Dumpty and T. King in ICML 2016. In 2015, the three authors had h-indices of 1, 5, and 25, of which the maximum (i.e. 25) we use as the summary statistic for the paper.
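The per-paper summary described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used for the analysis: the function names `h_index` and `max_h_index` are hypothetical, and the citation counts are invented only so that the three fictional authors end up with h-indices of 1, 5, and 25.

```python
def h_index(citation_counts):
    """Largest h such that the author has h papers with at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def max_h_index(per_author_citations):
    """Summarize a paper by the highest h-index among its authors."""
    return max(h_index(counts) for counts in per_author_citations)


# Hypothetical per-paper citation counts for each author, as of the year before the conference.
humpty = [3]                  # h-index 1
dumpty = [9, 8, 7, 6, 5, 1]   # h-index 5
king = [30] * 25 + [2, 1]     # h-index 25

paper_summary = max_h_index([humpty, dumpty, king])
print(paper_summary)  # 25
```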

Because the effect of h-index is non-linear, we turned it into a categorical variable with ten buckets (deciles), each containing the same number of papers.
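One way to form such deciles is sketched below using only the Python standard library. The helper names and the input values are hypothetical, and the exact binning procedure used in the analysis is not specified here, so treat this as an assumption-laden illustration.

```python
import bisect
import statistics
from collections import Counter


def decile_cut_points(values):
    """Return the nine cut points that split the values into ten equal-frequency groups."""
    return statistics.quantiles(values, n=10)


def decile_of(value, cut_points):
    """Map a value to a decile index 0..9; a value equal to a cut point goes to the lower bucket."""
    return bisect.bisect_left(cut_points, value)


# Hypothetical maximum h-index per paper (one entry per paper).
max_hindices = list(range(1, 101))

cuts = decile_cut_points(max_hindices)
deciles = [decile_of(v, cuts) for v in max_hindices]
bucket_sizes = Counter(deciles)
print(sorted(bucket_sizes.items()))
```

With real h-index data many papers share the same value, so the buckets will only be approximately equal in size.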

2.3. Time available on arXiv

Papers prepublished on arXiv before acceptance have had more time to gather citations than those posted to arXiv after acceptance, which may explain any differences in citation counts. To control for this factor, we compute the fraction of the year the paper has been available on arXiv. In particular, we measure the number of days between the first arXiv submission and the beginning of the calendar year in which we count citations of that paper, then divide by the number of days in the year, as illustrated by the following Python code.

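from datetime import datetime
import numpy as np

# conf_year is an int; arxiv_submission_date is a datetime.date
next_year_jan_1 = datetime(year=conf_year + 1, month=1, day=1).date()
delta = next_year_jan_1 - arxiv_submission_date
frac_year_remaining = np.maximum(delta.days / 365, 0)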

We clamp the difference delta.days at a minimum of zero because a paper may be put on arXiv for the first time long after it is officially published.

For example, let’s say “Machine Learning for Statisticians” was posted to arXiv June 1st, 2016, and accepted for publication in NIPS 2016. The above code would then be “number of days you get when subtracting June 1st, 2016 from January 1st of 2017, and then divided by 365”.
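The example can be checked with a short, self-contained sketch. It wraps the same computation in a hypothetical helper function and uses the built-in `max` in place of `np.maximum`; the second call uses an invented late posting date purely to show the clamping.

```python
from datetime import date


def frac_year_remaining(arxiv_submission_date, conf_year):
    """Fraction of a year between the arXiv posting and January 1 of the citation year."""
    next_year_jan_1 = date(conf_year + 1, 1, 1)
    delta = next_year_jan_1 - arxiv_submission_date
    return max(delta.days / 365, 0)


# Posted June 1st, 2016 and accepted at NIPS 2016: 214 days before January 1st, 2017.
example = frac_year_remaining(date(2016, 6, 1), conf_year=2016)
print(round(example, 3))  # 0.586

# Hypothetical paper first posted long after publication: clamped to zero.
late = frac_year_remaining(date(2018, 3, 1), conf_year=2016)
print(late)  # 0
```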

2.4. Posted to arXiv before vs after acceptance

First, we compute the number of days between arXiv submission and the deadline of the conference in which the paper was eventually accepted.

(arxiv_submission_date - conference_deadline_date).days

And here is its histogram:

To simplify our analysis, we binarized this variable into papers that were submitted before acceptance and those submitted after. Unfortunately, we did not have acceptance notification dates available for each conference, so we used a constant cutoff of submission deadline + 28 days as a conservative estimate, knowing of no conference where reviews are over so quickly.
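The binarization can be sketched as follows. The function name mirrors the variable name used later in this post, but the code itself is an illustration: the deadline date is invented, and treating a posting that falls exactly on the cutoff day as "after" is an assumption.

```python
from datetime import date, timedelta

# Constant offset used as a conservative stand-in for the acceptance notification date.
ACCEPTANCE_LAG = timedelta(days=28)


def submitted_before_deadline(arxiv_submission_date, conference_deadline_date):
    """True if the paper was posted to arXiv before the submission deadline plus 28 days."""
    cutoff = conference_deadline_date + ACCEPTANCE_LAG
    return arxiv_submission_date < cutoff  # strict inequality is an assumption


# Hypothetical submission deadline, not the real date of any conference.
deadline = date(2016, 5, 20)

print(submitted_before_deadline(date(2016, 6, 1), deadline))   # True: 12 days after the deadline
print(submitted_before_deadline(date(2016, 9, 15), deadline))  # False: well past the cutoff
```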

To summarize, we now have the following variables for paper p available for our analysis:

cites_1year: number of papers that cited p and were published in the calendar year following the official publication of p (count).
influential_cites_1year: number of papers that cited p influentially and were published in the calendar year following the official publication of p (count).
max_hindex_decile: the decile into which the maximum (across all authors) h-index of p falls (categorical, 10 levels).
submitted_before_deadline: whether p was posted to arXiv before the conference deadline plus 28 days (binary).
frac_year_remaining: time from the first arXiv submission of p until January 1 of the year after the conference, as a fraction of a year and clamped at zero (continuous).
conf: the conference where p was published (categorical, 16 levels).

3. Analysis

3.1. Model

We fit generalized linear fixed-effects models in which our outcome variable cites_1year is treated as a negative binomial-distributed count variable. We model its mean using the regression model:

$$\mathbb{E}[y \mid x] = \exp\Big(\sum_i w_i x_i\Big)$$

where y is the outcome variable, x is the vector of covariates/features, and w_i is the learned weight of the ith feature x_i. We chose a negative binomial distribution instead of a Poisson distribution to account for the high amount of overdispersion that is typical of most real-world count data. One can interpret the negative binomial distribution as a marginalized Poisson distribution where its mean is drawn from a Gamma distribution. Our tool of choice is Python’s statsmodels.
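The Gamma-Poisson interpretation also shows where the overdispersion comes from. The derivation below is a standard result, written with a shape parameter r that is our notation here and does not appear elsewhere in this post; the parameterization used internally by statsmodels may differ.

```latex
% Assumed parameterization: Gamma with shape r and scale \mu / r, so that E[\lambda] = \mu.
\lambda \sim \mathrm{Gamma}\!\left(r,\ \tfrac{\mu}{r}\right), \qquad
y \mid \lambda \sim \mathrm{Poisson}(\lambda)

\mathbb{E}[y] = \mathbb{E}[\lambda] = \mu

\mathrm{Var}[y]
  = \mathbb{E}\big[\mathrm{Var}(y \mid \lambda)\big] + \mathrm{Var}\big(\mathbb{E}[y \mid \lambda]\big)
  = \mu + \frac{\mu^{2}}{r}
```

A Poisson model forces the variance to equal the mean, whereas here the variance exceeds the mean by the extra term, which shrinks to zero only as r grows large.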

To glean whether submitted_before_deadline is an important parameter, we fit the following two different models (expressed in the standard formula mini-language from R that is also used in statsmodels):

cites_1year ~ max_hindex_decile + frac_year_remaining + conf

cites_1year ~ max_hindex_decile + frac_year_remaining + conf + submitted_before_deadline

For more on this formula language see the documentation for patsy here.

The only difference between these two models is the presence of the submitted_before_deadline binary variable. We then repeat the comparison with influential_cites_1year as the response variable.

3.2. Results

We conducted a likelihood ratio test on the two models and the resulting p-value was tiny: 6.27e-29. This means that the second model has a significantly better likelihood, and better explains the data. Let’s take a look at the coefficients of the full model that includes submitted_before_deadline:
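For readers unfamiliar with the likelihood ratio test, here is a minimal standard-library sketch for two nested models that differ by a single coefficient, as is the case here. The function name and the log-likelihood values are hypothetical and are not the values from our fitted models; fitted statsmodels results expose their log-likelihood as the `llf` attribute.

```python
import math


def lr_test_one_extra_param(llf_reduced, llf_full):
    """Likelihood ratio test when the full model adds exactly one parameter.

    Under the null hypothesis the statistic is approximately chi-square with
    1 degree of freedom, whose upper tail is erfc(sqrt(stat / 2)).
    """
    stat = max(2.0 * (llf_full - llf_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p_value = lr_test_one_extra_param(llf_reduced=-1000.0, llf_full=-998.0)
print(stat, round(p_value, 4))
```

The closed form with `erfc` is specific to one degree of freedom; comparing models that differ by more parameters would need a general chi-square tail function.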

Due to the exp term in the regression function, these coefficients can be interpreted as having a multiplicative effect instead of an additive effect as in linear regression. We can thus look at the 0.5029 coefficient of submitted_before_deadline and interpret its effect as multiplying the number of citations by exp(0.5029) = 1.65. In other words, the fitted regression model estimates that papers submitted to arXiv before acceptance, on average, tend to have 65% more citations next year compared to papers submitted after, even after controlling for a variety of factors.
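The arithmetic behind this interpretation is short enough to verify directly. The variable names below are only for illustration; the coefficient is the one reported above.

```python
import math

coef = 0.5029  # reported coefficient of submitted_before_deadline

# With a log link, adding coef to the linear predictor multiplies the mean by exp(coef).
multiplier = math.exp(coef)
percent_increase = (multiplier - 1) * 100

print(f"{multiplier:.2f}x, about {percent_increase:.0f}% more citations")
# 1.65x, about 65% more citations
```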

The difference is even more pronounced when we look at the number of influential citations. Papers submitted to arXiv before acceptance, on average, tend to have 75% more influential citations next year compared to papers submitted after. Of course, this result is not based on a randomized controlled experiment, so we can’t conclude that pre-acceptance posting has a causal effect on citation counts.

Note that in this framework, each categorical variable with k levels has only k-1 coefficients. Each coefficient can be interpreted as being relative to some baseline level, which is the level left out of the reported results. For example, the baseline max_hindex_decile is [0, 6], and the coefficients for the other nine deciles capture how many more citations you can expect to have with higher h-indices. In particular, an h-index between 42 and 99 is associated (on average) with more than double the number of next-year citations compared with an h-index between 0 and 6. These coefficients increase in a nearly monotonic way as h-index deciles increase, which fits our intuition about the increasing importance of author influence. Similarly, the baseline conference is AAAI.
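To make the k-1 coefficients concrete, here is a small sketch of baseline (treatment) coding. The helper `treatment_code` is hypothetical and only mimics what a formula library does internally; only three of the sixteen conferences are used to keep the output short.

```python
def treatment_code(value, levels, baseline):
    """Encode a categorical value as k-1 indicator columns, omitting the baseline level."""
    return [1 if value == level else 0 for level in levels if level != baseline]


conferences = ["AAAI", "ACL", "CVPR"]  # three levels, so two indicator columns

# The baseline level is all zeros: its effect is absorbed into the intercept.
print(treatment_code("AAAI", conferences, baseline="AAAI"))  # [0, 0]
print(treatment_code("ACL", conferences, baseline="AAAI"))   # [1, 0]
print(treatment_code("CVPR", conferences, baseline="AAAI"))  # [0, 1]
```

Each non-baseline coefficient then measures the shift in the linear predictor relative to the all-zeros baseline row.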

The results suggest that frac_year_remaining is a minor variable, with an estimate of 0 being part of the 95% confidence interval (last two columns). This is somewhat surprising since we expected papers which have been on arXiv for a longer fraction of a given year to have more citations in the following year.

4. Conclusion

Our exploratory analysis shows that posting a paper on arXiv before it is accepted (as opposed to after) is associated with 65% more citations in the calendar year following the conference. Although we take into account other factors which may influence number of citations (namely, author influence, publication venue, time available on arXiv), there may be other confounding factors which we did not include in our study (e.g., author affiliation, paper quality). We invite researchers interested in this analysis to explore the effect of other factors we haven’t included in the model, and invite conference chairs to conduct randomized controlled experiments in which authors submitting their drafts to the conference agree to prepublish their drafts on arXiv if they are randomly selected.

We note that identifying the potential unfair advantage given to prepublished papers may not give researchers a sufficiently compelling reason to delay posting their paper drafts on arXiv until the review process has completed. Instead, we encourage the community to adopt anonymous prepublished submissions (with pre-specified time limits on the anonymity) on arXiv and related platforms, similar to how the OpenReview platform implemented the peer reviewing process for ICLR 2018.

Written by Sergey Feldman