About the NLP Scholar Project

Saif M. Mohammad
13 min read · Oct 22, 2019

--

Acknowledgments, Caveats, Ethical Considerations, and Related Work

Photo credit: Maddy Baker

About

This work began as a side project out of my interests in information visualization, the ACL Anthology, and Google Scholar. I must confess, I greatly underestimated the amount of effort this would take, but it has been rewarding to see the large number of interesting questions that can be investigated with the data.

Contact
Saif M. Mohammad
Twitter: @saifmmohammad
Email: uvgotsaif@gmail.com, saif.mohammad@nrc-cnrc.gc.ca
Webpage: http://saifmohammad.com

Project Homepage: http://saifmohammad.com/WebPages/nlpscholar.html

Acknowledgments

This work was possible due to the helpful discussion and encouragement from a number of awesome people, including: Dan Jurafsky, Tara Small, Michael Strube, Cyril Goutte, Eric Joanis, Matt Post, Patrick Littell, Torsten Zesch, Ellen Riloff, Norm Vinson, Iryna Gurevych, Rebecca Knowles, Isar Nejadgholi, and Peter Turney. Also, a big thanks to the ACL Anthology Team for creating and maintaining a wonderful resource.

Papers

Examining Citations of Natural Language Processing Literature. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.

  • Summary: Examines nine questions pertaining to broad trends in citations of NLP papers (across time, across venue types, across paper types, across areas, etc.).
  • BibTeX:
    @inproceedings{mohammad2020citations,
    title={Examining Citations of Natural Language Processing Literature},
    author={Mohammad, Saif M.},
    booktitle={Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics},
    address={Seattle, USA},
    year={2020} }

Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.

  • Summary: Examines eight questions pertaining to disparities across gender in authorship and citations of NLP papers.
  • BibTeX:
    @inproceedings{mohammad2020gender,
    title={Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations},
    author={Mohammad, Saif M.},
    booktitle={Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics},
    address={Seattle, USA},
    year={2020} }

The State of NLP Literature: A Diachronic Analysis of the ACL Anthology. Saif M. Mohammad. arXiv preprint arXiv:1911.03562. November 2019.

  • Summary: A manuscript that brings together the analyses of NLP papers first presented in the four State of NLP blog posts.
  • BibTeX:
    @article{mohammad2019nlpscholar,
    title={The State of NLP Literature: A Diachronic Analysis of the ACL Anthology},
    author={Mohammad, Saif M.},
    journal={arXiv preprint arXiv:1911.03562},
    year={2019} }

NLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.

  • Summary: Presents an interactive visualization tool to help users find (related) work published in the ACL Anthology.
  • BibTeX:
    @inproceedings{mohammad2020demo,
    title={NLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature},
    author={Mohammad, Saif M.},
    booktitle={Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics},
    address={Seattle, USA},
    year={2020} }

NLP Scholar: A Dataset for Examining the State of NLP Research. Saif M. Mohammad. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020). May 2020. Marseille, France.

  • Summary: Presents the NLP Scholar Dataset — a single unified source of information from both the ACL Anthology (AA) and Google Scholar for tens of thousands of NLP papers. Presents initial work on analyzing the volume of research in NLP over the years, identifies some of the most cited papers in AA, and outlines a list of applications of the dataset.
  • BibTeX:
    @inproceedings{mohammad2020data,
    title={NLP Scholar: A Dataset for Examining the State of NLP Research},
    author={Mohammad, Saif M.},
    booktitle={Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020)},
    address={Marseille, France},
    year={2020} }

Data: The dataset used for the analyses will be made freely available shortly.

Caveats, Limitations, and Ethical Considerations

NLP Scholar comes with several caveats, limitations, and ethical considerations as listed below.

Aspects of Analysis

  • The analyses presented in The State of NLP Literature posts cover only some aspects of the literature. Prior work has explored other aspects such as citation link analysis, co-author networks, influence, types of citations, etc. Yet, several interesting questions remain unexplored.

Accessing Information about the Papers

  • Google does not provide an API to extract information about the papers. Martín-Martín et al. (2018) and others have pointed out that this is likely because of its agreement with publishing companies that have scientific literature behind paywalls. The ACL Anthology is in the public domain and free to access. We extracted citation information from Google Scholar profiles of people who published in the ACL Anthology. This is explicitly allowed by their robots exclusion standard, and is how past work has studied Google Scholar:
    — Martín-Martín, A., Orduna-Malea, E., Thelwall, M. and López-Cózar, E.D., 2018. Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of Informetrics, 12(4), pp.1160–1177.
    — Khabsa, M. and Giles, C.L., 2014. The number of scholarly documents on the public web. PloS one, 9(5), p.e93949.
    — Orduña-Malea, E., Ayllón, J.M., Martín-Martín, A. and López-Cózar, E.D., 2014. About the size of Google Scholar: playing the numbers. arXiv preprint arXiv:1407.6239.

Errors

  • Even though the ACL Anthology and Google Scholar are outstanding resources, they contain some errors. Also, aligning information from the two resources can never be perfect. (More details in the bullets below.) Thus NLP Scholar is bound to include some errors. We apologize for any misrepresentations, and will fix things as best we can.

Inconsistencies and Missing Values in the ACL Anthology

Information in the ACL Anthology is not always consistent and some attributes may be missing:

  • The same venue may be described in different ways.
  • There is no consistent way to identify short papers, tutorials, demo papers, book reviews, etc. Main conference short papers are sometimes clearly marked in the booktitle field of the BibTeX, but at other times they are not distinguished from long papers. Occasionally, they are marked in other idiosyncratic ways, such as by appending “(short paper)” to the paper title.
  • Some papers have a missing author field in the BibTeX entry. These papers are omitted. (These are often proceedings, lists of tutorials, etc. that we would want to omit anyway.)
  • For some papers, the title in the BibTeX entry uses non-accented letters even though the actual title has accented letters. For example, a title recorded with the word “sémantique” in the main records of AA may be written as “semantique” in the BibTeX entry. We use the BibTeX entry to extract author names, and the mismatch in titles causes the system to not find the authors. Papers with missing values for authors are omitted.

We use high-precision heuristics to identify the necessary information. However, some omissions and misclassifications are inevitable.
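As an illustration, a high-precision heuristic of the kind described above might look like the following sketch. (This is not the project's actual code; the entry fields and markers shown are assumptions based on the idiosyncrasies noted in the bullets.)

```python
import re

def is_short_paper(entry):
    """Conservatively flag a paper as a short paper from its BibTeX fields.

    Returns True only when an explicit marker is found, which keeps
    precision high: unmarked short papers are simply not flagged,
    rather than risking misclassifying long papers.
    """
    booktitle = entry.get("booktitle", "").lower()
    title = entry.get("title", "").lower()
    # Marker in the volume title, e.g. "... (Volume 2: Short Papers)"
    if "short papers" in booktitle:
        return True
    # Idiosyncratic marker appended to the paper title itself
    if re.search(r"\(short paper\)\s*$", title):
        return True
    return False

entry = {"title": "A Study of X (short paper)",
         "booktitle": "Proceedings of ACL 2005"}
print(is_short_paper(entry))  # True: explicit marker in the title
```

The same "flag only on explicit evidence" pattern extends to detecting demo papers, tutorials, and book reviews.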

Citation Information From Google Scholar

  • Google Scholar is used widely in research. However, it has received criticisms regarding the amount of curation, reducing academic worth to citations and h-index, etc. (see Criticisms of the Citation System, and Google Scholar in Particular, How Has Google Scholar Changed Academia?, 4 reasons why Google Scholar isn’t as great as you think it is).
  • For some papers, none of the authors have created a Google Scholar profile. We do not have citation information for those papers. Such papers are still displayed in NLP Scholar; only their citation information has a null value. This means that, in terms of citation information, work done in the past is likely under-represented (authors who left academia or retired may be less likely to have created a Google Scholar profile). Nonetheless, we do not expect this to markedly impact the inferences drawn from the analyses, as we do have citation information for over 35,000 papers.

Aligning Information in AA and Google Scholar is Tricky

  • They do not have a common paper ID or author ID.
  • Occasionally, two different papers have the same title.
  • The same author may use different forms of their name in different articles.
  • Multiple authors might have the same name.

We use the paper title and publication year combination as the unique identifier for a paper. However, there are some pairs of papers that have the same title and year of publication. These are omitted.
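A minimal sketch of this alignment key is shown below. (The field names and the whitespace/case normalization are illustrative assumptions, not the project's exact implementation.)

```python
from collections import Counter

def paper_key(title, year):
    """Normalize title + year into an alignment key.

    Lowercasing and collapsing whitespace absorbs superficial
    formatting differences between the two resources.
    """
    norm = " ".join(title.lower().split())
    return (norm, year)

def align_keys(papers):
    """Build keys for all papers and drop any key shared by two or
    more papers, since those cannot be matched unambiguously."""
    keys = [paper_key(p["title"], p["year"]) for p in papers]
    counts = Counter(keys)
    return {k: p for k, p in zip(keys, papers) if counts[k] == 1}

papers = [
    {"title": "Neural Parsing", "year": 2018},
    {"title": "neural  parsing", "year": 2018},  # same key: both dropped
    {"title": "Neural Parsing", "year": 2019},
]
unique = align_keys(papers)
print(len(unique))  # 1: only the 2019 paper gets an unambiguous key
```

Dropping both members of a colliding pair (rather than guessing) is what keeps the alignment high precision.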

New Papers are Constantly Added to AA

The current instantiation of NLP Scholar is based on the papers in AA as of June 2019. We will update NLP Scholar with new AA information periodically.

Papers Receive More Citations with Time

The current instantiation of NLP Scholar is based on the citations papers received as of June 2019. We will update NLP Scholar with new citations information periodically.

Rich get Richer

Visualizations in NLP Scholar present papers with more citations more prominently than papers with fewer citations. This can have the effect of making highly cited papers even more cited. (This is not unlike Google Scholar, which also ranks papers by relevance and citation counts.) Citations are one (somewhat noisy) indicator of the impact a paper has had. While they can be useful for finding interesting and impactful papers, papers get cited for a number of other reasons as well, and it is entirely possible that some papers of interest are among the less cited.

There are, however, several ways in which NLP Scholar can cast light on less cited papers too. Here are some examples:

  • By showing the papers on a timeline, one can easily track papers that influenced a high-citation paper in an area.
  • When searching for papers in an area, one can compare citations of papers within that area. This places a target paper in a more appropriate context. For example, a target paper may not have received hundreds of citations, but one can see that, within its area of research, it is one of the most highly cited papers.
  • The Languages visualizations highlight work in various languages.

Search based on Words in Titles

  • Even though there is an association between terms and areas of research, the association can be weak for some terms. I use the association as one (imperfect) source of information about areas of research; it may be combined with other sources of information to draw more robust conclusions. Planned future work on searching within abstracts and full papers, and on finding documents related to a query term via word-embedding-based document representations, will alleviate the current limitations. That said, search based on title words is a simple and powerful method for finding relevant documents.
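The simple title-word search described above can be sketched as a conjunctive, case-insensitive match over query terms. (The data shape is an assumption for illustration; NLP Scholar's own matching may differ in details.)

```python
def search_titles(papers, query):
    """Return papers whose titles contain every query term
    (case-insensitive substring match over the title)."""
    terms = query.lower().split()
    return [p for p in papers
            if all(t in p["title"].lower() for t in terms)]

papers = [
    {"title": "Sentiment Analysis of Tweets"},
    {"title": "Neural Machine Translation"},
    {"title": "Aspect-Based Sentiment Analysis"},
]
# Both sentiment-analysis papers match; the MT paper does not.
print([p["title"] for p in search_titles(papers, "sentiment analysis")])
```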

Demographics

  • Data is often a representation of people (Zook 2017). This is certainly the case here and we acknowledge that the use of such data has the potential to harm individuals. This work has several limitations, and some have ethical considerations in terms of who is left out. Further, while the methods used are not new, their use merits reflection.
  • Analysis focused on women and men leaves out non-binary people. Not disaggregating cis and trans people means that the statistics are largely reflective of the more populous cis class. We hope future work will explore gender gaps between non-binary and binary people, between trans and cis people, etc. Similarly, tracking the skew in authors across income, experience, and ability is also crucial. This work does not address those dimensions, but we hope more work on them will follow.
  • The use of female- and male-associated names to infer population-level statistics for women and men can reinforce harmful stereotypes. It is also exclusionary to people who do not have such names, to people from cultures where names are not as strongly associated with gender, and to trans people who have not been able to change their name.
  • Since the names dataset used is for American children, there is lower representation of names from other nationalities. However, many names are common in more than one country, and the large immigrant population in the US means that there still exists substantial coverage of names from around the world.
  • Chinese names (especially in the romanized form) are not good indicators of gender. Thus the method presented here disregards most Chinese names, and the results of the analysis do not apply to researchers with Chinese names.
  • Some might argue that names partially address the gender inclusiveness guidelines listed in (Keyes 2018): names can be changed to indicate (or not indicate) gender, people can choose to keep their birth name or change it, and the name, more so than appearance, can be independent of physiology. However, changing names can be quite difficult. Also, names do not capture gender fluidity or contextual gender.
  • A more inclusive way of obtaining gender information is through optional self-reported surveys. However, even if one allows for a self-report checkbox so that the respondent can have the primacy and autonomy to express gender, downstream data science either ignores such data or combines information in ways that are not in control of the respondent. Further, as is the case here, it is not easy to obtain self-reported historical information.
  • A small number of names change association from one gender to another with time. We hope that the ≥99% rule filters them out, but this is not guaranteed.
  • Social category detection can potentially lead to harms, for example, depriving people of opportunities simply because of their race or gender. However, one can also see the benefits of NLP techniques and social category detection in public health (e.g., developing targeted initiatives to improve health outcomes of vulnerable populations), as well as in psychology and social science (e.g., to better understand the unique challenges of belonging to a social category).
  • Some papers may have more than one joint first author or more than one joint last author. The analyses presented here do not take that into consideration.
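The ≥99% rule mentioned in the bullets above can be sketched as a simple threshold over per-name birth-registration counts. (The counts below are made up for illustration, in the style of the US Social Security baby-names data; the project's exact thresholding may differ.)

```python
def gender_association(name, counts, threshold=0.99):
    """Given (female, male) registration counts for a first name,
    return 'female' or 'male' only if one gender accounts for at
    least `threshold` of the total; otherwise None, i.e. the name
    is disregarded as not strongly gender-associated."""
    f, m = counts.get(name, (0, 0))
    total = f + m
    if total == 0:
        return None
    if f / total >= threshold:
        return "female"
    if m / total >= threshold:
        return "male"
    return None

# Illustrative (made-up) counts: (female, male)
counts = {"mary": (41000, 120), "jordan": (9000, 11000)}
print(gender_association("mary", counts))    # 'female' (>= 99% female)
print(gender_association("jordan", counts))  # None (ambiguous name)
```

Raising the threshold trades coverage for confidence; the high ≥99% cutoff is what keeps ambiguous and most romanized Chinese names out of the analysis, as noted above.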

References

  • See Mihaljevic (2019) for a discussion on the limitations and biases in using author names to infer gender statistics in the Gender Gap in Science Project.
  • See Larson (2017), Keyes (2018), Cao and Daume III (2020), Blodgett et al. (2020) for discussions on the lack of adequate and inclusive considerations of gender in NLP systems.
  • See Scheuerman (2019) and Keyes (2018) for concerns about inferring gender via face-recognition techniques.
  • See Zook (2017) for tips on responsible use of data.

Key Related Work

Articles:

  • Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev and Yee Fan Tan (2008) The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proc. of Language Resources and Evaluation Conference (LREC 08). Marrakech, Morocco, May.
  • Yogatama, D., Heilman, M., O’Connor, B., Dyer, C., Routledge, B.R. and Smith, N.A., 2011, July. Predicting a scientific community’s response to an article. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 594–604). Association for Computational Linguistics.
  • Anderson, A., McFarland, D. and Jurafsky, D., 2012, July. Towards a computational history of the ACL: 1980–2008. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries (pp. 13–21). Association for Computational Linguistics.
  • Khabsa, M. and Giles, C.L., 2014. The number of scholarly documents on the public web. PloS one, 9(5), p.e93949.
  • Orduña-Malea, E., Ayllón, J.M., Martín-Martín, A. and López-Cózar, E.D., 2014. About the size of Google Scholar: playing the numbers. arXiv preprint arXiv:1407.6239.
  • Radev, D.R., Joseph, M.T., Gibson, B. and Muthukrishnan, P., 2016. A bibliometric and network analysis of the field of computational linguistics. Journal of the Association for Information Science and Technology, 67(3), pp.683–706.
  • Mariani, J., Francopoulo, G. and Paroubek, P., 2018. The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing. Frontiers in Research Metrics and Analytics, 3, p.36.
  • Martín-Martín, A., Orduna-Malea, E., Thelwall, M. and López-Cózar, E.D., 2018. Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of Informetrics, 12(4), pp.1160–1177.
  • Schluter, N., 2018. The glass ceiling in NLP. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2793–2798).

Appendix

I. Categorizations of NLP Papers

By types:

  • journal papers
  • main conference papers
  • student research papers
  • system demonstration papers
  • shared task papers
  • workshop papers
  • tutorial abstracts
  • doctoral consortium papers
  • squibs¹

By length:

  • long papers (8 pages including references, or 8 pages plus references)
  • short papers (4 to 6 pages including references, or 4 to 6 pages plus references)

By mode of presentation:

  • oral
  • poster
  • demo

AA does not explicitly or systematically capture many of the paper types (because the conferences and journals do not do so). Thus there are several challenges in automatically categorizing a paper into one of these categories for the NLP Scholar project.

  • Most papers in AA are not explicitly marked as long or short. Some short papers are marked as short in the booktitle. The page numbers of papers in the proceedings are indicated, but the stipulated length of long and short papers has changed over the years. For example, for many years, long papers had a maximum of 8 pages (including references). Then, at some point, an additional page was allowed for addressing reviewer comments, and now many conferences allow an unlimited number of pages for references.
  • In the early 2000s, short papers were often presented as posters, and proceedings may not clearly distinguish posters from demos (for example, ACL-2005 had a volume for “Posters and Demos”). So, separating posters, demos, and short papers from that period is problematic.
  • SemEval is technically a workshop, but is included with *Sem (a conference). It is a platform for shared tasks, but some main conferences also have shared tasks (independent of SemEval).
  • Tutorial abstracts are not really papers, but tutorials are cited in scientific articles.
  • The distinction between a conference and a workshop can sometimes be fuzzy. Some venues, such as EMNLP and CoNLL, started off as workshops but later transformed into conferences. For this work, we will consider them to be conferences.
  • Sometimes there are joint events, such as the 2007 joint EMNLP-CoNLL conference. In such cases, we treat the papers to belong to the conference whose code is assigned to the joint event by AA.

¹ Squibs are short research articles presenting a focused discussion or position. There were 43 squibs in AA (all from CL Journal) at the time of data compilation.

--


Saif M. Mohammad

Saif is a Senior Research Scientist at the National Research Council Canada. His interests are in NLP, especially emotions, creativity, and fairness in language.