Forgotten Knowledge: Examining the Citational Amnesia in NLP

Saif M. Mohammad
Jul 4, 2023


Picture credit: Nadjib BR

Have you ever wondered:

  • How old are the papers we cite?

Or, perhaps you have wondered whether, amid the push to read and cite all the shiny new papers:

  • are we failing to read older papers and benefit from important ideas?

Forgetting some amount of old things is useful! For example, to make room for new ideas. But, over the last few years, have we gone too far?

Join me on this empirical and visual quest as we explore questions like these about Natural Language Processing (NLP) papers with data and graphs. This blog post presents some of the key results from our ACL 2023 paper:

Forgotten Knowledge: Examining the Citational Amnesia in NLP. Janvijay Singh, Mukund Rungta, Diyi Yang, and Saif M. Mohammad. In Proceedings of the 61st Annual Meeting of the Association of Computational Linguistics (ACL-2023), Toronto, Canada. BibTeX

Motivation

(If you prefer video to text: 2-min video summarizing the Motivation)

Innovations arise on the backs of ideas from the past.

Or as Confucius says: “Study the past if you would define the future.”

Thus, it is not surprising that a central characteristic of the scientific method and modern scientific writing is to discuss other work: to build on ideas, to critique or reject earlier conclusions, to borrow ideas from other fields, and to situate the proposed work. Even when proposing something that others might consider dramatically novel, it is widely believed that these new ideas have been made possible because of a number of older ideas. Citation (referring to another paper in a prescribed format) is the primary mechanism to point the reader to these prior pieces of work and also to assign credit for shaping current work. Thus, we argue that:

examining citation patterns across time can lead to crucial insights into what we value, what we have forgotten, and what we should do in the future.

Of particular interest is the extent to which good older work is being forgotten — citational amnesia. More specifically, for this work, we define:

Citational Amnesia: the tendency to not cite enough relevant good work from the past (more than some k years old).

We cannot directly measure citational amnesia empirically, because determining “enough”, “relevance”, and “good” requires expert researcher judgment. However, what we can measure is the collective tendency of a field to cite older work. Such an empirical finding enables reflection on citational amnesia. A dramatic drop in our tendency to cite older work should give us cause to ponder whether we are putting in enough effort to read older papers (and stand on the proverbial shoulders of giants).

WHY NLP?

  • Language is social. NLP applications have complex social implications.
  • There have been notable and frequent gains in NLP research in recent years.
  • NLP technology is becoming increasingly ubiquitous in society.

The incredibly short research-to-production cycle and the move-fast-and-break-things attitude in NLP (and Machine Learning more broadly) have also led to considerable adverse outcomes for various sections of society, especially those with the least power. Thus, reading and citing more broadly is especially important now.

Responsible NLP research/development needs broad literature engagement.

QUESTIONS

We traced hundreds of thousands of citations by NLP papers to answer a series of questions on trends in citations, especially temporal patterns.

  1. What is the average number of unique references in NLP papers? How has this average changed over the years?
  2. On average, how far back in time do we go to cite papers? As in, what is the average age of cited papers (age of citation)?
  3. What is the trend in the variation of age of citation over time and how does this variation differ across different publication venues in NLP?
  4. What percentage of cited papers are old papers? How has this varied across years and publication venues?
  5. Relative to each other, which subareas in NLP tend to cite more older papers and which subareas have a strong bias towards recent papers?
  6. Do well-cited papers cite more old papers and have higher age of citation diversity?

Data Used

  • 77K NLP papers from 1965 to 2022 (ACL Anthology)
  • their citation information (Semantic Scholar): For each citation, the age of the cited paper when the paper was cited (year of publication of citing paper minus the year of publication of the cited paper)

Let's jump in!

Q1. What is the average number of unique references in AA (ACL Anthology) papers? How does this number vary by publication type, such as workshop, conference, and journal? Has this average stayed roughly the same or has it changed markedly over the years?

Ans. We calculated the average number of unique references for all papers in the AoC dataset, as well as for each publication type (workshops, conferences, and journals). We then binned all papers by publication year and computed the mean and median for each bin.
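For concreteness, here is a minimal Python sketch of this binning step. The records and values are illustrative (not from the actual pipeline); note that duplicate references within a paper are counted only once.

```python
from statistics import mean, median
from collections import defaultdict

def refs_per_year(papers):
    """Bin papers by publication year; report the mean and median of the
    number of *unique* references per paper in each bin."""
    bins = defaultdict(list)
    for year, refs in papers:
        bins[year].append(len(set(refs)))   # unique references only
    return {y: (mean(ns), median(ns)) for y, ns in bins.items()}

# Hypothetical records: (year_of_publication, list_of_referenced_paper_ids).
papers = [
    (2014, ["p1", "p2", "p2", "p3"]),   # "p2" counted once -> 3 unique refs
    (2014, ["p4", "p5"]),               # 2 unique refs
]
print(refs_per_year(papers))            # 2014 bin: mean 2.5, median 2.5
```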

Results. The table below shows the means and medians of the number of unique references in AA papers.

Table 1: Mean and Median of the number of unique references in an AA paper.

Figure 1 shows how the mean has changed across the years.

Figure 1: Average number of unique references in an AA paper published in different years.

The graph shows a general upward trend. The trend seems roughly linear until the mid-2000s, at which point the slope of the trend line increases markedly. Even just considering the last seven years, there was a 41.74% increase in the number of referenced papers from 2014 to 2021.

Some contributing factors: since the late 2000s, we have had electronic proceedings and higher page limits (for the main text, references, and appendices).

Q2. On average, how far back in time do we go to cite papers? As in, what is the average age of cited papers? What is the distribution of this age across all citations? How do these vary by publication type?

Ans. If a paper x cites a paper yᵢ, then the Age of the Citation (AoC) is taken to be the difference between the year of publication (YoP) of x and yᵢ:

AoC(x, yᵢ) = YoP(x) − YoP(yᵢ)

We calculated the AoC for each of the citations in the AoC dataset. For each paper, we also calculated the mean AoC (mAoC) of all papers cited by it:

mAoC(x) = (1/N) · Σᵢ₌₁ᴺ AoC(x, yᵢ)

where N refers to the number of papers cited by x.
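These two quantities are simple to compute. A minimal Python sketch, using a made-up example paper:

```python
def age_of_citation(citing_year, cited_year):
    """AoC: year of publication of the citing paper minus that of the cited paper."""
    return citing_year - cited_year

def mean_aoc(citing_year, cited_years):
    """mAoC of a paper: the mean AoC over all N papers it cites."""
    return sum(citing_year - y for y in cited_years) / len(cited_years)

# A hypothetical 2021 paper citing papers from 2020, 2019, and 2013:
print(mean_aoc(2021, [2020, 2019, 2013]))   # mean of ages 1, 2, 8 -> ~3.67
```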

Results. The average mAoC for all the papers in the:

  • Full AoC dataset: 7.02
  • Journal articles: 8.16
  • Conference papers: 6.93
  • Workshop papers: 7.01

The Figure below shows the distribution of AoCs in the dataset across the years after the publication of the cited paper (overall, and across publication types).

Figure 2: Distribution of AoC for papers in AA (overall and by publication type).

For example, the y-axis point for year 0 corresponds to the average percentage of citations that papers received in the same year as they were published. The y-axis point for year 1 corresponds to the average percentage of citations that papers received in the year after they were published. And so on.

Observe that the majority of the citations are to papers published one year prior (AoC = 1). This is true for the conference and workshop subsets as well, but for journal papers, the most frequent citations are to papers published two years prior. Overall, though, all the curves have a similar shape: rising sharply from the value at year 0 to the peak, and then dropping off at an exponential rate in the years after the peak is reached. For the full set of citations, this exponential decay from the peak has a half-life of about 4 years. Roughly speaking, the line plot for journals is shifted to the right by a year compared to the line plots for conferences and workshops. It also has a lower peak value, and its citations for the years after the peak are at a higher percentage than those for conferences and workshops. Additionally, citations in workshop papers have the highest percentage of current-year citations (AoC = 0), whereas citations in journal articles have the lowest percentage of current-year citations.

Analogous to Figure 2, Figure 3 presents the distribution of AoCs, albeit broken down by the total citations received by a paper.

Figure 3: Distribution of AoC for AA papers with different citation counts (shown in legend).

It is worth noting that the distribution leans more towards the right for papers with a higher number of citations. This shows that papers with a higher citation count continue to receive significant numbers of citations long after publication, which is intuitive.

Discussion. Overall, we observe that papers are cited most in years immediately after publication, and their chances of citation fall exponentially after that. The slight right-shift for the journal article citations is likely, at least in part, because journal submissions have a long turn-around time from the first submission to the date of publication (usually between 6 and 18 months).

Q3. What is the trend in the variation of AoC over time and how does this variation differ across different publication venues in NLP?

Ans. To answer this question, we split the papers into bins corresponding to the year of publication, and then examined the distribution of mAoC in each bin. We define a new metric: Citation Age Diversity (CAD) Index.

CAD Index measures the diversity in the mAoC for a set of papers. In simpler terms, a higher CAD Index indicates that the mAoCs cover a broader range, implying that the cited papers span a wider time period of publication. This metric offers insights into the temporal spread of scholarly influence and the long-term impact of research. The CAD Index for a bin of papers b is defined using the Gini coefficient as follows:

CAD(b) = ( Σᵢ₌₁ᴺ Σⱼ₌₁ᴺ |bᵢ − bⱼ| ) / ( 2N² b̄ )

here, bᵢ corresponds to the mAoC of the iᵗʰ paper within bin b, N denotes the total number of papers in bin b, and b̄ represents the mean of the mAoC of papers associated with bin b. A CAD Index close to 0 indicates minimum temporal diversity in citations (citing papers from just one year), whereas a CAD Index close to 1 indicates maximum temporal diversity in citations (citing papers uniformly from past years). In addition to the CAD Index, we also compute the median mAoC of each such yearly bin.
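Here is a small Python sketch of the CAD Index as a Gini coefficient computed over the mAoC values of the papers in a bin (the sample values are illustrative):

```python
def cad_index(maocs):
    """Citation Age Diversity (CAD) Index of a bin: the Gini coefficient
    of the mAoC values, sum_i sum_j |b_i - b_j| / (2 * N^2 * b_bar)."""
    n = len(maocs)
    b_bar = sum(maocs) / n
    if b_bar == 0:
        return 0.0                      # all mAoCs are zero: no diversity
    total = sum(abs(bi - bj) for bi in maocs for bj in maocs)
    return total / (2 * n * n * b_bar)

# All papers in the bin cite at the same temporal distance -> no diversity:
print(cad_index([5.0, 5.0, 5.0]))   # 0.0
# A spread of mAoC values -> higher temporal diversity:
print(cad_index([1.0, 4.0, 10.0]))  # 0.4
```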

Results. Figure 4 shows the CAD Index across years (higher CAD Index indicates high diversity), and across different publication types. (The results for both CAD Index and median mAoC have roughly identical trends across the years.)

Figure 4: Citation Age Diversity (CAD) Index across years.

The CAD Index plot of Figure 4 shows that the temporal diversity of citations had an increasing trend from 1990 until about 2014, followed by a marked decreasing trend thereafter. These intervals coincide with the year intervals in which we observed an increasing and then decreasing trend in the median mAoC of published papers (not shown here).

The CAD Index plots by publication type all have similar trends, with journal papers consistently having markedly higher scores (indicating markedly higher temporal diversity) across the years studied. However, they also seem to be the most impacted by the trend since 2014 to cite very recent papers. (The CAD Index not only falls back to the 1990 level, but also drops below it.)

Discussion. Overall, we find that all the gains in the temporal diversity of citations from 1990 to 2014 (a period of about 25 years) have been negated in the 7 years since 2014. This change was driven largely by the deep neural revolution of the early 2010s and strengthened further by the substantial impact of transformers on NLP and Machine Learning.

Q4. What percentage of cited papers are old papers? How has this varied across years and publication venues?

Ans. Following Verstak et al. (2014), we define a cited paper as older if it was published at least ten years prior to the citing paper. We then divided all AA papers into groups based on the year in which they were published. For each AA paper, we determined the number of citations to older papers.
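This "older paper" percentage is straightforward to compute; a minimal sketch, with a hypothetical citing paper:

```python
def pct_older_citations(citing_year, cited_years, k=10):
    """Percentage of cited papers published at least k years before the
    citing paper (k = 10, following Verstak et al.'s definition)."""
    old = sum(1 for y in cited_years if citing_year - y >= k)
    return 100.0 * old / len(cited_years)

# Hypothetical 2021 paper citing papers from 2020, 2011, 2009, and 1998:
# three of the four cited papers are at least 10 years old.
print(pct_older_citations(2021, [2020, 2011, 2009, 1998]))   # 75.0
```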

Results. Figure 5 shows the percentage of older papers cited by papers published in different years.

Figure 5: Percentage of citations in AA papers where the cited paper is at least 10 years old.

Observe that this percentage increased steadily from 1990 to 1999, before decreasing until 2002. After 2002, the trend of citing older papers picked up again, reaching an all-time high of ∼30% by 2014. However, since 2014, the percentage of citations to older papers has dropped dramatically, falling by 12.5 percentage points to a historic low of ∼17.5% in 2021. Similar patterns are observed for the different publication types. However, we note that a greater percentage (usually around 5 percentage points more) of a journal paper's citations are to older papers than is the case for conference and workshop papers.

Q5. What is the mAoC distribution for different areas within NLP? Relative to each other, which areas tend to cite more older papers and which areas have a strong bias towards recent papers?

Ans. What makes a paper belong to a particular sub-area is fuzzy business. We use a simple approximation, as we only want to draw inferences about broad trends: we use paper title word bigrams as indicators of topics relevant to the paper. A paper with machine translation in the title is very likely to be relevant to the area of machine translation.
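Extracting title bigrams can be sketched in a few lines of Python (the example titles are made up for illustration):

```python
from collections import Counter

def title_bigrams(titles):
    """Count word bigrams across paper titles (lowercased), as a rough
    proxy for the sub-area a paper belongs to."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(zip(words, words[1:]))   # adjacent word pairs
    return counts

titles = [
    "Neural Machine Translation with Attention",
    "Statistical Machine Translation Revisited",
]
print(title_bigrams(titles)[("machine", "translation")])   # 2
```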

Figure 6 shows the mAoC violin plots for each of the bins pertaining to the most frequent title bigrams (in decreasing order of median mAoC).

Figure 6: Distribution of mAoC for frequent bigrams appearing in the titles of citing papers.

Observe that papers with the title bigrams word alignment, parallel corpus/corpora, Penn Treebank, sense disambiguation and word sense (common in the word sense disambiguation area), speech tagging, coreference resolution, named entity and entity recognition (common in the named entity recognition area), and dependency parsing have some of the highest median mAoC (cite more older papers). In contrast, papers with the title bigrams glove vector, BERT pre, deep bidirectional, and bidirectional transformers (which correspond to newer technologies) and papers with title bigrams reading comprehension, shared task, question answering, language inference, language models, and social media have some of the lowest median mAoC (cite more recent papers).

Discussion. The above results suggest that not all NLP subareas are equal in terms of the age of cited papers. This could be due to factors such as early adoption or greater applicability of the latest developments, the relative newness of the area itself (possibly enabled by new inventions such as social media), etc. Thus, lower mAoC for some areas is not cause for value judgment. Instead, every area can reflect on the degree of its collective literature engagement, and strive to improve on that.

Q6. Do well-cited papers cite more old papers and have greater AoC diversity?

Ans. We introduce three hypotheses to explore the correlation between temporal citation patterns of target papers and the number of citations the target papers themselves get in the future.

H1: The degree of citation has no correlation with citation age patterns.
H2: Highly cited papers have more citation age diversity than less cited papers.
H3: Highly cited papers have less citation age diversity than less cited papers.

Without an empirical experiment, it is difficult to know which hypothesis is true. H1 seemed likely; however, there were reasons to suspect H2 and H3 as well. Perhaps citing more widely is correlated with other factors such as the quality of the work, and thus correlates with higher citations (supporting H2). Or, perhaps, early work in a new area receives lots of subsequent citations, and work in a new area often tends to have limited citation diversity, as there is no long history of publications in the area (supporting H3).

On Nov 30, 2022, we used the Semantic Scholar API to extract the number of citations for each of the papers in the AoC dataset. We divided the AoC papers into nine bins as per the number of citations: 0, 1–9, 10–49, 50–99, 100–499, 500–999, 1000–1999, 2000–4999, or 5000+ citations. For each bin, we calculated the mean mAoC and the CAD Index. We also computed the Spearman's rank correlation between the CAD Index of the citation bins and the mean of the citation range of each of these bins.
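The binning step can be sketched as follows; the bin edges come from the list above, while the label strings are an illustrative choice. (The rank correlation between each bin's mean citation count and its CAD Index can then be computed with, e.g., scipy.stats.spearmanr.)

```python
import bisect

# Citation-count bin edges: 0, 1-9, 10-49, 50-99, 100-499, 500-999,
# 1000-1999, 2000-4999, 5000+.
EDGES = [1, 10, 50, 100, 500, 1000, 2000, 5000]
LABELS = ["0", "1-9", "10-49", "50-99", "100-499",
          "500-999", "1000-1999", "2000-4999", "5000+"]

def citation_bin(n_citations):
    """Map a paper's citation count to its bin label."""
    return LABELS[bisect.bisect_right(EDGES, n_citations)]

print(citation_bin(0))      # 0
print(citation_bin(75))     # 50-99
print(citation_bin(12000))  # 5000+
```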

Results. Figure 7 shows the mAoC and CAD Index for each bin (a) for the full AoC dataset, and (b) for the subset of papers published between 1990 and 2000.

Figure 7: Variation of mean mAoC and Citation Age Diversity (CAD) Index (shown on y-axis) for papers with different citation counts (shown on x-axis).

On the full dataset (Figure 7a), we observe a clear pattern: the CAD Index decreases with increasing citation bin (with the exception of papers in the 1K–2K and 2K–5K bins). The mean mAoC follows a similar trend to the CAD Index.

These results show that, for the full dataset, papers with higher citation counts tend to have less temporal citation diversity than papers with lower citation counts. However, on the 1990s subset (Figure 7b), the CAD Index decreased up to citation counts below 50, and increased markedly after that. This shows that during the 1990s, highly cited papers also cited papers more widely in time.

Figures 7c and 7d below show plots for papers from two additional time periods (the 2000s and the 2010s).

Figure 7 (continued): Variation of mean mAoC and Citation Age Diversity (CAD) Index (shown on y-axis) for papers with different citation counts (shown on x-axis).

Plots for the 2000s and 2010s follow a similar trend as the overall plot (Figure 7a), indicating that the trend of highly cited papers having less temporally diverse citations started around the year 2000.

The Spearman's rank correlation coefficients between the mean number of citations in a bin and the CAD Index of the bins are shown in Table 2.

Table 2: Correlation between the mean of citation bins and the CAD Index for the bins, for various time periods. The ∗ indicates that the correlation is statistically significant (p-value < 0.05).

Observe that for the 1990s papers there is essentially no correlation, but there are strong inverse correlations for the 2000s, the 2010s, and the full dataset.

Discussion. Papers may receive high citations for a number of reasons, and those that receive high citations are not necessarily model research papers. While they may have some aspects that are appreciated by the community (leading to high citations), they may also have flaws. Highly cited papers are (by definition) more visible to the broader research community and are likely to influence early-career researchers more. Thus, their strong recency focus in citations is a cause for concern. Multiple anecdotal incidents in the community suggest that early-career researchers often consider papers published more than two or three years ago as “old papers”. This goes hand-in-hand with a feeling that they should not cite old papers and, therefore, do not need to read them. The lack of temporal citation diversity in recent highly cited papers may be perpetuating such harmful beliefs.

Demo: CAD Index of Your Paper

To encourage authors to be more cognizant of the age of the papers they cite, we created an online demonstration page where one can provide the Semantic Scholar ID of any paper, and the system returns the number of papers referenced, the mean Age of Citation (mAoC), and the top-5 oldest cited papers along with their years of publication. Notably, the demo also plots the distribution of mAoC for all the considered papers (all papers published till 2021) and compares it with the mAoC of the input paper. Figure 8 below shows a screenshot of the demo portal for an example input.

Figure 8: The demo shows how your paper compares to other NLP papers in terms of citational age diversity.

Key Takeaways

  • Both the diversity of age of citations and the percentage of older papers cited increased from 1990 to 2014, but then dropped dramatically.
  • By 2021 (the final year of analysis), both reached historical lows.
  • We studied the correlation between the number of citations a paper receives and the diversity of the age of the papers it cites. We found:
    no correlation in the 1990s; a strong inverse correlation in the 2000s and 2010s.

These results point to an urgent need for reflection:

As researchers, advisors, reviewers, area chairs, and funding agencies, how are we contributing to this intense recency focus in NLP?

See our paper for additional details and questions explored: Forgotten Knowledge: Examining the Citational Amnesia in NLP (ACL, 2023)

Acknowledgements

Many thanks to Roland Kuhn, Rebecca Knowles, Jan Philip Wahle, and Tara Small for thoughtful discussions.

Ongoing and Future Work

  • Explore citational amnesia in various fields such as Psychology, Linguistics, Computer Science, Machine Learning
  • Explore various aspects of citational diversity (beyond age of papers cited)

Dr. Saif M. Mohammad
Senior Research Scientist, National Research Council Canada

Twitter: @saifmmohammad
Webpage: http://saifmohammad.com


Saif M. Mohammad

Saif is Senior Research Scientist at the National Research Council Canada. His interests are in NLP, especially emotions, creativity, and fairness in language.