Academia.edu, Citations, and Open Science in Action

Jul 7, 2015 · 9 min read

On May 8, we announced the results of a year-long study of articles posted to Academia.edu. In the study, we asked whether posting an article to Academia.edu was associated with more citations. We found — after controlling for a number of factors and applying several statistical models — that a typical paper posted to the site received about 83% more citations than similar papers that were only available behind paywalls. This translated to about one extra citation every year for the median paper.

We announced the results of the study on our home page, and it was covered by Fortune Magazine. More importantly, we put all of our data and code online. Anyone could — and still can — download our data and code, and easily replicate or modify any part of our study.

The study generated some discussion. A week after our announcement, Phil Davis published a blog post raising some questions about our data sample. He pointed out several “non-research” articles in our sample and asked whether the presence of these in the data might explain the result.

In response to that question, we have spent the last several weeks classifying the nearly 45,000 articles in our dataset, and identifying such “non-research” articles.

Today, we’re pleased to announce our revised study, which answers that question. Excluding any articles that we did not identify with high confidence to be original research or scholarship, we find a 73% citation increase associated with articles posted to Academia.edu. This is a little less in relative terms than the 83% we found in our original data, but amounts to approximately the same in absolute terms — about one extra citation every year for the median paper. We also find a 64% citation increase for articles posted to Academia.edu compared with articles posted to other open access venues, such as an open access journal or a personal homepage (down from 75% in the original data).

The revised paper is available for download here.

Just as with the original version of the study, all of our data and code are available online at the paper’s GitHub repository. We encourage those interested to download these materials and do their own review of our work. Promoting open science is one of our core values. We believe that opening up research accelerates, improves, and democratizes scholarship and scientific discovery.

In the rest of this post, we’ll describe some of the background leading to our revised paper, and the methodology we used to refine our result.

Our Original Paper

Our data consists of a sample of journal articles published and posted to Academia.edu from 2009 to 2012, along with a sample of articles published in the same journals, but not posted to the site. You can find more detail about our data collection in the paper.

Controlling for the age of the articles, the impact factors of their journals, their academic disciplines, and whether they were freely available online elsewhere (besides on Academia.edu), we compared the citations to articles posted to the site with those to similar articles not posted. We found that a typical article with full text available online on a site other than Academia.edu received about 20% more citations than a similar article with only paid access. We also found that an article posted only to Academia.edu received about 83% more citations than a similar paid-access article.

A Review from The Scholarly Kitchen

A few days after we publicized our results, we were contacted by Phil Davis, a researcher specializing in academic readership and citation. We corresponded with Dr. Davis for a few days, answering questions he had about our data and methodology, pointing him to the public data, and encouraging him to review it.

After that review, Dr. Davis raised a question about “non-research” articles in our dataset: such articles, which often receive few citations, could be dragging down the citation estimates in the sample and biasing our results. Dr. Davis subsequently wrote up this question in a blog post on The Scholarly Kitchen, “Citation Boost Or Bad Data: Academia.edu Research Under Scrutiny”.

How could “non-research” articles affect our study?

Academic journals publish more than just articles with original research. They also publish content such as book reviews, errata, editorials, letters responding to recent articles, and even obituaries. These “non-research” articles often receive few or no citations.

In his blog post, Dr. Davis claimed that we were comparing articles posted to Academia.edu that were mostly research articles with articles not posted to the site that contained these other types of content — comparing citable apples to uncitable oranges.

For example, consider an on-Academia sample with eight research articles, one book review, and one erratum. The research articles each receive 10 citations, while the book review and the erratum each receive only one. The average citation count in this group is 82 citations divided by ten articles, or 8.2.

Next, consider an off-Academia sample with five research articles, three book reviews, one editorial, and one erratum. The research articles all receive 10 citations, while the others each receive only one. The average in this sample is 55 citations divided by ten articles, or 5.5.

Therefore, even though research articles in both the on- and off-Academia sample received the same number of citations (10), if we didn’t account for the “non-research” articles, we would estimate a 50% citation difference (8.2 over 5.5).
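The arithmetic in the hypothetical example above can be checked in a few lines (the article counts and citation figures are the illustrative ones from this post, not real data):

```python
# Hypothetical citation counts for the two samples described above.
# Research articles receive 10 citations each; "non-research" items receive 1.
on_academia = [10] * 8 + [1] * 2   # 8 research articles, 1 book review, 1 erratum
off_academia = [10] * 5 + [1] * 5  # 5 research, 3 book reviews, 1 editorial, 1 erratum

avg_on = sum(on_academia) / len(on_academia)     # 82 / 10 = 8.2
avg_off = sum(off_academia) / len(off_academia)  # 55 / 10 = 5.5

# The naive comparison shows a ~50% difference, even though research
# articles are cited identically (10 each) in both samples.
print(avg_on, avg_off, round(avg_on / avg_off - 1, 2))  # 8.2 5.5 0.49
```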

So it’s a good question — does the presence of “non-research” articles in the data set explain any or all of the citation difference? For the presence of “non-research” articles to cause a big effect: (1) there have to be a lot of them, and (2) they have to be more prevalent in the off-Academia sample. If (1) isn’t true, then they’ll have little effect on the average citations in each sample. If (2) isn’t true, then they’ll have the same effect in both groups, which will get canceled out in the comparison.

Could this explain our result?

Looking at our data, we saw that in order to conclude that “non-research” articles explained the entire result, we would have to make some extreme assumptions:

  1. There were no “non-research” articles in the on-Academia sample.
  2. All “non-research” articles in the off-Academia sample receive no citations.
  3. 25% of all papers in the off-Academia set are “non-research.”

It was easy to confirm that (1) and (2) were not true. This means that the share of non-research articles in the off-Academia sample would have to be even higher — something like 1 in 3.

Even without a thorough review of all the articles in the data, it was unlikely that a third of all the articles in our off-Academia sample were non-research articles. But, in order to answer the question definitively, we decided to conduct a thorough review of all the articles.

As we’ll see below, our review of the data found that the share of non-research articles was closer to 1 in 10, and that their presence in our data accounts for very little of our result.

Classifying the articles in our sample

One way to address this problem is to identify “non-research” articles, and remove them from our data.

To identify “non-research” articles, we used Amazon’s Mechanical Turk. Mechanical Turk (or MTurk) is a marketplace for crowd-sourcing surveys or large data processing tasks. It is commonly used by academics to, for example, perform online experiments, collect survey data, or train and validate machine learning algorithms.

We provided links to the journal page for each article in our sample to over 300 MTurk workers, and asked them to answer some simple questions about the abstract or full text they found there (a sample instruction page is available in an appendix to the paper). Primarily, we asked them to classify the article as one of the following types:

  1. A summary of a meeting or conference
  2. An Editorial or Commentary
  3. A response to a recent article in the same journal
  4. An article with original research, analysis or scholarship, or a broad survey of research on a topic
  5. A Book Review, Software Review, or review of some other recent work or performance
  6. An Erratum, Correction, or Retraction of an earlier article
  7. Something else

Sometimes, a worker might fail to categorize an article, giving one of these reasons: the link was broken, there was no abstract or text available on the site, the article was in a foreign language, or they just couldn’t tell.

We had each article reviewed by three different workers. In the final version of our analysis, we only included articles that all three workers agreed were “Original Research” (option 4 above). This left us with about 35,000 articles from the original 45,000. The other 10,000 were excluded either because the workers could not agree on a classification, or because they unanimously agreed on a “non-research” classification. Many of the excluded 10,000 are surely original research articles as well, but to have the highest confidence that we excluded non-research articles, we only kept the unanimous results. (If we had gone by majority rule — two out of three workers’ classifications — we would have classified about 40,000 (90%) of the 45,000 as original research articles.)
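The unanimous-agreement rule described above can be sketched as follows. The label codes match the numbered options listed earlier; the article names and worker labels are illustrative, not the study's actual data or code:

```python
from collections import Counter

RESEARCH = 4  # option 4: original research, analysis, or scholarship

def is_research(labels, rule="unanimous"):
    """Decide whether an article counts as original research,
    given the classifications from its three MTurk workers."""
    if rule == "unanimous":
        return all(label == RESEARCH for label in labels)
    # majority rule: at least two of the three workers chose "research"
    return Counter(labels)[RESEARCH] >= 2

# Three hypothetical articles with their three worker labels each:
articles = {
    "a1": [4, 4, 4],  # unanimous research -> kept in the final analysis
    "a2": [4, 4, 5],  # majority research  -> excluded under the unanimous rule
    "a3": [2, 2, 2],  # unanimous non-research -> excluded either way
}

kept_unanimous = [a for a, ls in articles.items() if is_research(ls, "unanimous")]
kept_majority = [a for a, ls in articles.items() if is_research(ls, "majority")]
print(kept_unanimous)  # ['a1']
print(kept_majority)   # ['a1', 'a2']
```

The gap between the two rules mirrors the numbers in the study: the strict unanimous rule keeps about 35,000 of 45,000 articles, while majority rule would keep about 40,000.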

We checked the quality of the workers in two ways. First, we removed any workers with suspicious results, such as giving all their articles the same classification, or completing tasks unreasonably fast. Second, we compared their results to a “Gold Standard” set of 100 articles that were classified by our staff. We included in this set a number of articles we considered tricky to classify. Based on a majority-rule classification, the workers agreed with us on whether the articles were original research or not over 85% of the time.
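A gold-standard agreement check of this kind amounts to comparing the workers' majority-rule verdicts against the staff labels. Here is a minimal sketch with a toy five-article set (the labels are made up; the real set had 100 articles):

```python
def majority_is_research(labels, research_label=4):
    """Majority rule: at least two of three workers chose 'research'."""
    return sum(label == research_label for label in labels) >= 2

# Hypothetical gold-standard set: staff verdict (True = original research)
# paired with the three worker labels for the same article.
gold = [
    (True,  [4, 4, 2]),
    (True,  [4, 4, 4]),
    (False, [5, 5, 4]),
    (False, [2, 2, 2]),
    (True,  [4, 2, 3]),  # workers disagree with staff on this one
]

agreements = sum(staff == majority_is_research(ls) for staff, ls in gold)
agreement_rate = agreements / len(gold)
print(agreement_rate)  # 0.8 in this toy set; the real study found over 85%
```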

Updated Results

This is an extract of Table 13 in the original version of our paper. It shows the citations predicted over time for articles (1) not freely available online, (2) available online but not on Academia.edu, (3) available only on Academia.edu, and (4) available on Academia.edu as well as elsewhere online. These figures are based on articles published in the median-impact-factor journal in our sample, and are estimated using the original sample of 45,000 articles.

After five years, we predicted that an article in the median journal available only behind a paywall would receive 7 citations. An article posted to Academia.edu, on the other hand, would receive 12.9 citations — a difference of almost 6 citations, or 83%.

Below is the same table, updated to only include the 35,000 articles that were unanimously classified as original research. Notice that the predicted citations are higher across the board — in part a result of excluding low-citation “non-research” articles. Here a paid-access-only article is predicted to receive 8.1 citations, while an article posted to Academia.edu is predicted to receive 14 — a difference of, again, almost 6 citations, but only 73% (because of the higher baseline).
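Using the table values quoted above, the absolute and relative differences work out as follows. This is a quick consistency check on the quoted figures, not the paper's actual model code:

```python
# Predicted 5-year citations in the median-impact-factor journal,
# as quoted above from the paper's tables.
original = {"paywalled": 7.0, "on_academia": 12.9}  # full 45,000-article sample
revised = {"paywalled": 8.1, "on_academia": 14.0}   # 35,000 unanimous-research articles

for name, d in [("original", original), ("revised", revised)]:
    absolute = d["on_academia"] - d["paywalled"]
    relative = d["on_academia"] / d["paywalled"] - 1
    print(f"{name}: +{absolute:.1f} citations ({relative:.0%})")

# Both samples show an absolute gap of roughly 5.9 citations over five years,
# while the relative gap shrinks because the revised baseline (8.1) is higher
# than the original one (7.0). (Small differences from the quoted 83% arise
# because the table values themselves are rounded.)
```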

More detailed results are in the paper.

There’s no perfect data

Data on academic publications and their citations is known to be complicated and imperfect. Even large, widely used data sources, such as Thomson’s Web of Science, have been found to have inaccuracies.

Categorizing academic journal content is especially complicated. There’s no single agreed-upon way to classify content, and the wide variety of types of content found in different journals across different disciplines makes it difficult to make comprehensive rules for doing so. For example, Scopus and Web of Science use very different categories for document types, and often disagree. Even the authors of our study sometimes debated with each other the proper classifications of certain articles in our sample.

It is possible that, in reviewing the raw data, someone will find some classifications they disagree with or believe are inaccurate. But this is also the case with nearly every study of academic citations. We believe that our classification process, while it has some unavoidable imperfections, is sufficiently accurate for us to conclude that “non-research” articles do not explain a significant amount of the citation advantage we find for articles posted to Academia.edu.


We provide the data and code for our work because we want to engage in an open scientific discussion of our research. Having data and code available means that any researcher with a competing hypothesis can evaluate it in a complete and rigorous way. Dr. Davis’s critical feedback was useful and much appreciated. We think the work resulting from it has strengthened our analysis.

The outcome is a small modification of the original result: a paper in the median-impact-factor journal receives 73% more citations over five years if uploaded to Academia.edu (rather than 83%). Dr. Davis’s hypothesis that the original result may have been “entirely explainable by bad data” is not borne out.

— Carl Vogel, Yuri Niyazov, Ben Lund, Richard Price


Written by The Academia Team
Accelerating the World’s Research
