The State of NLP Literature: Part I

Size and Demographics

Saif M. Mohammad
16 min read · Oct 22, 2019

This series of posts presents a diachronic analysis of the ACL Anthology —
Or, as I like to think of it, making sense of NLP Literature through pictures.

The world of scientific publishing is a rain forest: ideas compete for sunlight and attention; some win out and grow taller, while others are forgotten. (Photo credit: Héctor J. Rivas)

The ACL Anthology (AA) is a digital repository of tens of thousands of articles on Natural Language Processing (NLP) / Computational Linguistics (CL). It includes papers published in the family of ACL conferences as well as in other NLP conferences such as LREC and RANLP.

AA is the largest single source of scientific literature on NLP.

This project, which I call NLP Scholar, examines the literature as a whole to identify broad trends in productivity, focus, and impact. I will present the analyses in a sequence of questions and answers. The questions range from fairly mundane to oh-that-will-be-good-to-know. My broader goal here is simply to record the state of the AA literature: who and how many of us are publishing? what are we publishing on? where and in what form are we publishing? and what is the impact of our publications? The answers are usually in the form of numbers, graphs, and inter-connected visualizations.

The posts in this series include:

Subsequent parts will be published in the coming days.

Before we begin, some quick notes:

  • Target Audience: The posts are likely to be of interest to any NLP researcher. This might be particularly the case for those who are new to the field and want a broad view of the NLP publishing landscape, current and past. Then again, even if you attended NLP conferences long before deep learning was a thing, you have likely wondered about the questions raised here and will be interested in what the data tells us.
  • Data: The analyses presented below are based on information about the papers taken directly from AA (as of June 2019) and citation information extracted from Google Scholar (as of June 2019). Thus, papers published after that date, and citations accrued since, are not included in the analysis. A fresh data collection is planned for January 2020.
  • Interactive Visualizations and Anonymity: The visualizations I am developing for this work (using Tableau) are interactive — so one can hover, click to select and filter, move sliders, etc. However, I am not currently able to publish the interactive visualizations in a way that can be anonymized. Since I want to be able to anonymize public posts about this work as per the ACL guidelines, I include here relevant screenshots. The visualizations and data will be available once the work is published in a peer-reviewed conference. During the relevant anonymity period, this post and the associated paper will be anonymized.
  • Caveats and Ethical Considerations: This is work in progress and is not meant to be a complete or comprehensive view of the AA literature.
    See the About NLP Scholar page for a list of caveats, ethical considerations, related work, and acknowledgments.

Papers (most pertinent to this post):

  • Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.
  • NLP Scholar: A Dataset for Examining the State of NLP Research. Saif M. Mohammad. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020). May 2020. Marseille, France.
  • The State of NLP Literature: A Diachronic Analysis of the ACL Anthology. Saif M. Mohammad. arXiv preprint arXiv:1911.03562. November 2019.

See the full list of associated papers on the About page.

Let’s jump in!!

Size

Q1. How big is the ACL Anthology (AA)? How is it changing with time?

A. As of June 2019, AA had ~50K entries; however, this includes a number of entries that are not truly research publications (for example, forewords, prefaces, tables of contents, programs, schedules, indexes, calls for papers/participation, lists of reviewers, lists of tutorial abstracts, invited talks, appendices, session information, obituaries, book reviews, newsletters, lists of proceedings, lifetime achievement awards, errata, and notes). We discard them for the analyses here. (Note: The CL journal includes position papers such as squibs, letters to the editor, and opinion pieces. We do not discard those.)
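For concreteness, below is a minimal sketch of the kind of title-based filtering described above, assuming the AA metadata has been exported to a CSV with a title column. The file name, column name, and keyword list are illustrative assumptions, not the exact procedure used for the post.

```python
# Hypothetical sketch: flag likely non-research entries by title keywords.
import pandas as pd

NON_RESEARCH_KEYWORDS = [
    "foreword", "preface", "table of contents", "schedule", "index",
    "call for papers", "list of reviewers", "invited talk", "obituary",
    "book review", "newsletter", "proceedings of the", "erratum",
]

def looks_like_non_research(title: str) -> bool:
    """Coarse heuristic: does the title match typical front-matter terms?"""
    t = title.lower()
    return any(keyword in t for keyword in NON_RESEARCH_KEYWORDS)

aa = pd.read_csv("aa_metadata.csv")   # hypothetical AA export with a "title" column
papers = aa[~aa["title"].fillna("").map(looks_like_non_research)]
print(f"{len(papers)} research articles kept out of {len(aa)} entries")
```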

We are then left with 44,896 articles. Below is a graph of when they were published:

Discussion: Observe that there was a spurt in the 1990s, but things really took off after the year 2000, and the growth continues. Also, note that the number of publications is considerably higher in alternate years. This is due to biennial conferences. Since 1998, the largest of these has been LREC (in 2018 alone, LREC had over 700 main conference papers plus additional papers from its 29 workshops). COLING, another biennial conference (also occurring in even years), has about 45% as many main conference papers as LREC.

Q2. How many people publish in the ACL Anthology (NLP conferences)?

A. This graph shows the number of authors (of AA papers) over the years:

Discussion: It is a great sign for the field to have a growing number of people join its ranks as researchers. A further interesting question would be:

Q3. How many people are actively publishing in NLP?

A. It is hard to know the exact number, but we can determine the number of people who have published in AA in the last N years.

#people who published at least one paper in 2017 or 2018 (2 years): ~12k (11,957 to be precise)
#people who published at least one paper 2015 through 2018 (4 years): ~17.5k (17,457 to be precise)

Of course, some number of researchers published NLP papers in non-AA venues, and some number are active NLP researchers who may not have published papers in the last few years.
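A rough sketch of how such "active in the last N years" counts can be computed is shown below, assuming a table with one row per (paper, author) pair; the file and column names are hypothetical.

```python
# Hypothetical sketch: count distinct authors active in a given year window.
import pandas as pd

paper_authors = pd.read_csv("aa_paper_authors.csv")   # columns: paper_id, author, year

def active_authors(df: pd.DataFrame, start_year: int, end_year: int) -> int:
    """Number of distinct authors with at least one AA paper in [start_year, end_year]."""
    window = df[(df["year"] >= start_year) & (df["year"] <= end_year)]
    return window["author"].nunique()

print("2017-2018:", active_authors(paper_authors, 2017, 2018))   # ~12k in the post
print("2015-2018:", active_authors(paper_authors, 2015, 2018))   # ~17.5k in the post
```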

Q4. How many journal papers exist in the AA? How many main conference papers? How many workshop papers?

A. See graph below:

Discussion: The number of journal papers is dwarfed by the number of conference and workshop papers. (This is common in computer science. Even though NLP is a broad interdisciplinary field, the influence of computer science practices on NLP is particularly strong.) Shared task and system demo papers are relatively new (introduced in the 2000s), but their numbers are already significant and growing.

Creating a separate class for “Top-tier Conference” is somewhat arbitrary, but it helps make certain comparisons more meaningful (for example, when comparing the average number of citations, etc.). For this work, I consider ACL, EMNLP, NAACL, COLING, and EACL as top-tier conferences, but certainly other groupings are also reasonable.

Q5. How many papers have been published at ACL (main conference papers)? What are the other NLP venues and what is the distribution of the number of papers across various CL/NLP venues?

A. # ACL (main conference papers): 4,839

The same workshop can co-occur with different conferences in different years, so I grouped all workshop papers in their own class. I did the same for tutorials, system demonstration papers (demos), and student research papers.

The graph below shows the number of main conference papers for the various venues and paper types (workshop papers, demos, etc.).

Discussion: Even though LREC is a relatively new conference that occurs only once every two years, it tends to have a high acceptance rate (~60%) and enjoys substantial participation. Thus, LREC is already the largest single source of NLP conference papers.

SemEval, which started as SenseEval in 1998 and was initially held once every two or three years, has since morphed into an annual two-day workshop. It is the largest single source of NLP shared task papers.

Demographics (focus of analysis: gender, age, and geographic diversity)

Beatrice “Trixie” Worsley (1921–1972) — brilliant scientist and the first person in the world to earn a PhD in Computer Science — joined the National Research Council Canada (NRC) before moving to the new University of Toronto Computation Centre to run the NRC-funded IBM punch card mechanical calculators. She was overlooked for the assistant professor position at the university for many years while her male colleagues were promoted.

NLP, like most other areas of research, suffers from poor demographic diversity. There is little representation from certain nationalities, races, genders, languages, income levels, ages, physical abilities, etc.

This impacts the breadth of technologies we create, how useful they are, and whether they reach those who need them most.

In this section, I analyze three specific attributes among many that deserve attention: gender (specifically, the number of women researchers in NLP), age (more precisely, the number of years of NLP paper publishing experience), and the amount of research in various languages (which loosely captures geographic diversity).

It should be noted that there exists very little work tracing the participation and contributions of those with non-binary and other gender identities. Similarly, tracking the skew across authors of diverse income levels, experiences, and abilities is crucial; hopefully more work on these will follow.

Demographics: Gender

The ACL Anthology does not record demographic information about the paper authors. (Until recently, ACL and other NLP conferences did not record demographic information of the authors.) However, many first names have strong associations with a male or female gender. I will use these names to estimate the percentage of female first authors in NLP.

The United States Social Security Administration publishes a large database of names and genders of newborns. We use this dataset to identify 55,133 first names that are strongly associated with females (probability ≥99%) and 29,873 first names that are strongly associated with males (probability ≥99%). (As an aside, it is interesting to note that there is markedly greater diversity in female names than in male names.) We identified 26,637 of the 44,896 AA papers (~60%) whose first authors have one of these names and determined the percentage of female first author papers across the years. We will refer to this subset of AA papers as AA*.

Caveats:

  • Since the names dataset only includes children born in the United States, there is lower representation of names from other nationalities. However, many names are common to more than one country, and because there is a large expatriate population in the US, there is still substantial coverage of names from around the world.
  • Chinese names (especially in the romanized form) are not good indicators of gender. Thus the method presented here disregards most Chinese names, and the results of the analysis apply to the group of researchers excluding those with Chinese names.
  • The dataset only records names associated with two genders.
  • A small number of names change association from one gender to another with time. We hope that the ≥99% rule filters them out, but this is not guaranteed.

The approach used here is not meant to be perfect, but it is a useful approximation in the absence of true gender information.
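To make the procedure concrete, here is a simplified sketch of the name-based estimation, assuming the SSA data has been aggregated into per-name counts by sex and that first-author first names are available per paper. The file and column names are hypothetical, and the exact pipeline behind the numbers in this post may differ.

```python
# Hypothetical sketch: map first names to likely gender and compute FFA%.
import pandas as pd

names = pd.read_csv("ssa_names.csv")          # columns: name, female_count, male_count
names["p_female"] = names["female_count"] / (names["female_count"] + names["male_count"])

# Keep only names strongly associated with one gender (probability >= 99%).
female_names = set(names.loc[names["p_female"] >= 0.99, "name"].str.lower())
male_names   = set(names.loc[names["p_female"] <= 0.01, "name"].str.lower())

papers = pd.read_csv("aa_first_authors.csv")  # columns: paper_id, year, first_author_first_name
fn = papers["first_author_first_name"].str.lower()
papers["ffa"] = fn.isin(female_names)
aa_star = papers[fn.isin(female_names) | fn.isin(male_names)]  # the AA* subset

print(f"Overall FFA%: {100 * aa_star['ffa'].mean():.1f}")      # ~30% in the post
ffa_by_year = aa_star.groupby("year")["ffa"].mean() * 100       # FFA% over time
```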

So what can we learn from this mapping of a subset of AA authors to likely gender?

Q6. What percent of the AA* papers have female first authors (FFA)? How has this percentage changed with time?

A. Overall FFA%: 30.3%.
The graph below shows how FFA% has changed with time.

Common paper title words and FFA% of papers that have those words are shown in the bottom half of the image.

Note that the slider at the bottom has been set to 400, i.e., only those title words that occur in 400 or more papers are shown. The legend on the bottom right shows that low FFA scores are shown in shades of blue, whereas relatively higher FFA scores are shown in shades of green.

Discussion: Observe that as a community, we are far from obtaining male-female parity in terms of first authors.

A further striking (and concerning) observation is that the female first author percentage has not improved since 1999 and 2000, when the FFA percentages were highest (32.9% and 32.8%, respectively). In fact, there even seems to be a slight downward trend in recent years.

The calculations shown above are for the percentage of papers that have female first authors. Computing the percentage of female first authors (rather than papers) produces a similar number (~31%). Male authors had a slightly higher average number of publications than female authors.

To put these numbers in context, the percentage of female scientists worldwide (considering all areas of research) has been estimated to be around 30%. The reported percentages for many computer science sub-fields are much lower. (See Women in science: quarterly thematic publication, issue I, March 2015 and Employed Scientists and Engineers, by occupation, highest degree level, and sex (2010).) The percentages are much higher for certain other fields such as psychology and linguistics. (See this study for psychology and this study for linguistics.) If we can figure out how to move the needle on the FFA percentage and get it closer to 50% (or more!), NLP can be a beacon to many other fields, especially in the sciences.

There are some areas within NLP that enjoy a healthier female-male parity in terms of first authors of papers. Below is the graph of FFA percentages for papers that have the word discourse in the title.

There is burgeoning research on neural NLP in the last few years. Below is the graph of FFA percentages for papers that have the word neural in the title.

FFA percentages are particularly low for papers that have parsing, neural, and unsupervised in the title.

When considering terms that occur in at least 50 paper titles (instead of 400 in the analysis above), below are lists of terms with the highest and lowest FFA percentages, respectively.

Discussion: FFA percentages are relatively higher in non-English European language research such as papers on Russian, Portuguese, French, and Italian.

FFA percentages are also relatively higher for certain areas of NLP such as work on prosody, readability, discourse, dialogue, paraphrasing, and individual parts of speech such as adjectives and verbs.

FFA percentages are particularly low for papers on theoretical aspects of statistical modelling, and areas such as machine translation, parsing, and logic. The full lists of terms and FFA percentages will be made available with the rest of the data.
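Here is a rough sketch of the per-title-term analysis: for every word that appears in at least min_papers AA* paper titles, compute the FFA% of the papers containing it. It assumes the aa_star frame from the earlier sketch, extended with a title column; the tokenization is deliberately simplistic.

```python
# Hypothetical sketch: FFA% of papers whose titles contain a given term.
import re
import pandas as pd

def ffa_by_title_term(aa_star: pd.DataFrame, min_papers: int = 400) -> pd.Series:
    rows = []
    for _, paper in aa_star.iterrows():
        for term in set(re.findall(r"[a-z]+", str(paper["title"]).lower())):
            rows.append((term, paper["ffa"]))
    terms = pd.DataFrame(rows, columns=["term", "ffa"])
    counts = terms.groupby("term")["ffa"].agg(["size", "mean"])
    frequent = counts[counts["size"] >= min_papers]      # keep sufficiently common terms
    return (frequent["mean"] * 100).sort_values()         # lowest-FFA% terms first

# e.g. ffa_by_title_term(aa_star, min_papers=50) for the rarer-term lists above
```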

Demographics: (Academic) Age

“Playing in mud and streams is the best thing.” -Ben Wicks describing the photo he took.

While the actual age of NLP researchers might be an interesting aspect to explore, we do not have that information. Thus, instead, we can explore a slightly different (and perhaps more useful) attribute: NLP academic age. We can define NLP academic age as the number of years one has been publishing in AA. So if this is the first year one has published in AA, then their NLP academic age is 1. If one published their first AA paper in 2001 and their latest AA paper in 2018, then their academic age is 18.
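A minimal sketch of this definition is shown below, assuming the same hypothetical (paper, author, year) table as before: an author's academic age in a given year is the number of years since, and including, their first AA paper.

```python
# Hypothetical sketch: NLP academic age of authors who published in a given year.
import pandas as pd

paper_authors = pd.read_csv("aa_paper_authors.csv")   # columns: paper_id, author, year
first_year = paper_authors.groupby("author")["year"].min()

def academic_age(df: pd.DataFrame, in_year: int) -> pd.Series:
    """Academic age (in `in_year`) of every author who published in `in_year`."""
    active = df.loc[df["year"] == in_year, "author"].unique()
    return in_year - first_year.loc[active] + 1          # first year counts as age 1

ages_2018 = academic_age(paper_authors, 2018)
print("mean:", ages_2018.mean(), "median:", ages_2018.median())
print("first-time authors:", (ages_2018 == 1).mean() * 100, "%")
```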

Q7. How old are we? That is, what is the average NLP academic age of those who published papers in 2018? How has the average changed over the years? That is, have we been getting older or younger? What percentage of authors that published in 2018 were publishing their first AA paper?

A. Average NLP Academic Age of people that published in 2018: 5.41 years
Median NLP Academic Age of people that published in 2018: 2 years
Percentage of 2018 authors that published their first AA paper in 2018: 44.9%

Graphs for how these numbers have changed over the years are shown below:

Graphs showing average academic age, median academic age, and percentage of first-time publishers in AA over time.

Discussion: Observe that the average academic age increased steadily over the years until 2016 and 2017, when the trend shifted and the average academic age started to decrease. The median age was 1 year for most of the 1965 to 1990 period, 2 years for most of the 1991 to 2006 period, 3 years for most of the 2007 to 2015 period, and has been back to 2 years since then. The first-time AA author percentage decreased until about 1988, after which it stayed roughly steady at around 48% until 2004, with occasional bursts to ~56%. Since 2005, the first-time author percentage has gone up and down every other year. It seems that the even years (which are also LREC years) have a higher first-time author percentage. Perhaps this oscillation in the first-time author percentage is related to LREC's high acceptance rate.

Q8. What is the distribution of authors in various academic age bins? For example, what percentage of authors that published in 2018 had an academic age of 2, 3, or 4? What percentage had an age between 5 and 9? And so on?

A. Below is the distribution of number of authors by NLP academic age bin for papers published in each of the years from 2018 to 2011:

Discussion: Observe that:

About 65% of the authors who published in 2018 had an academic age of less than 5.

This number steadily declined after 1965, was in the 60 to 70% range in the 1990s, rose to the 70 to 72% range in the early 2000s, then declined again until it reached its lowest value (~60%) in 2010, and has steadily risen since, reaching 65% in 2018.

What all of this means is that even though it may sometimes seem at recent conferences that there is a large influx of new people into NLP (and that is true), proportionally speaking, the average NLP academic age today is higher (that is, the community is more experienced) than it has been for much of its history.

Demographics: Location (Languages)

“The limits of my language mean the limits of my world.”
Ludwig Wittgenstein

Automatic systems with natural language abilities are becoming increasingly pervasive in our lives. Not only are they sources of convenience, but they are also crucial in making sure large sections of society and the world are not left behind by the information divide.

Building on what Wittgenstein said, the limits of what automatic systems can do in a language limit the world for the speakers of that language.

Q9. How much NLP work has been done on various languages?

A. We know that much of the research in NLP is on English or uses English datasets. Many reasons have been proffered, and I will not go into that here. Instead, I will focus on estimating how much research pertains to non-English languages.

I will make use of the idea that when work focuses on a non-English language, the language is often mentioned in the title. I collected a list of 122 languages indexed by Wiktionary and looked for the presence of these words in the titles of AA papers. (Of course, there are hundreds of other lesser-known languages as well, but here I wanted to see the representation of these more prominent languages in NLP literature.)
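Below is a rough sketch of this title search, assuming a plain-text list of language names (such as the 122 from Wiktionary) and the AA paper titles; the file names and the whole-word matching heuristic are illustrative assumptions.

```python
# Hypothetical sketch: count AA paper titles that mention each language name.
import re
import pandas as pd

with open("languages.txt") as f:                               # one language name per line
    languages = [line.strip() for line in f if line.strip()]

papers = pd.read_csv("aa_papers.csv")                          # columns: paper_id, title

def count_language_mentions(titles: pd.Series, languages: list[str]) -> pd.Series:
    counts = {}
    for lang in languages:
        pattern = r"\b" + re.escape(lang) + r"\b"              # whole-word match
        counts[lang] = titles.str.contains(pattern, case=False, regex=True).sum()
    return pd.Series(counts).sort_values(ascending=False)

mentions = count_language_mentions(papers["title"].fillna(""), languages)
print(mentions.head(20))    # the counts behind the treemap shading
```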

Below is a treemap of the 122 languages arranged alphabetically and shaded such that languages that appear more often in AA paper titles have a darker shade of green.

Discussion:

Even though the amount of work done on English is much larger than that on any other language, the word English often does not appear in the title; this explains why English is not the first but the second-most common language name to appear in titles. This is likely because many papers fail to mention the language of study or of the datasets used when it is English. There is growing realization in the community that this is not quite right. That said, the language of study may be named in less prominent places than the title (for example, the abstract, the introduction, or where the datasets are introduced), depending on how central it is to the paper.

We can see from the treemap that the most widely spoken Asian and Western European languages enjoy good representation in AA. These include: Chinese, Arabic, Korean, Japanese, and Hindi (Asian) as well as French, German, Swedish, Spanish, Portuguese, and Italian (European). This is followed by the relatively less widely spoken European languages (such as Russian, Polish, Norwegian, Romanian, Dutch, and Czech) and Asian languages (such as Turkish, Thai, and Urdu). Most of the well-represented languages are from the Indo-European language family.

Yet, even in the limited landscape of the most common 122 languages, vast swathes are barren with inattention. Notable among these is the extremely low representation of languages from Africa, languages from non-Indo-European language families, and Indigenous languages from around the world.

Examples of Positive Change

Some examples of efforts to bring more attention to less popular languages include:

Some efforts by ACL to improve demographic representation and inclusiveness:

NLP in Africa:

NLP in South America:

Future Work

  • Showing the languages on an interactive world map. Showing the coverage of the most widely spoken languages.
    Motivation: Showing the linguistic diversity in the world. Showing the paucity of work on languages from large geographical areas.
  • A tool to show papers on a user-chosen language. The tool will have an option to show work on languages in the same language family as the user-chosen language.
    Motivation: Techniques that work on one language are likely to also work on other languages in the same language family.
  • An analysis of the rate at which researchers are leaving NLP (using the academic age data). Separate analyses for female and male researchers to identify whether there are differences in the drop-off rates, and to what extent.
    Motivation: There is growing realization that women leave academic research at a greater rate than men due to systematic biases, abuse, poor work environment, and sexism.
  • Extract author affiliations and locations from what is listed on the paper.
    Motivation: Track quantity and impact of academic, industrial, and government NLP research publications over time.
  • Develop a classifier to determine non-AA papers that are NLP papers and include them in the analysis.
    Motivation: Track NLP papers at large, and not just those in AA.

Contact
Saif M. Mohammad
Twitter: @saifmmohammad
Email: uvgotsaif@gmail.com, saif.mohammad@nrc-cnrc.gc.ca
Webpage: http://saifmohammad.com

Project Homepage: http://saifmohammad.com/WebPages/nlpscholar.html


Saif M. Mohammad

Saif is a Senior Research Scientist at the National Research Council Canada. His interests are in NLP, especially emotions, creativity, and fairness in language.