Ranking U.S. Universities for NLP Research Part 1: Data Collection

Chloe Lee
Feb 13, 2020


As the Information Age evolves with the wave of big data, the demand to analyze unstructured textual data keeps increasing, bringing tremendous attention to the field of Natural Language Processing (NLP) and leading to the emergence of numerous NLP programs at academic institutions.

Each institution is unique and has its own strengths, which makes it difficult for prospective students and faculty candidates to choose the right programs to apply to when there are so many good ones.

Emory University is ranked #21 among top national universities by U.S. News in 2020.

To give a comprehensive understanding of the research environments provided by these academic institutions, so that researchers can make informed decisions about how best to proceed with their careers in NLP, the NLP Research Lab at Emory University (a.k.a. Emory NLP) created the open-source project NLP Rankings and the website nlprankings.org.

http://nlprankings.org

Inspired by CSRankings, NLP Rankings is entirely metrics-based, weighing academic institutions by their publications in the ACL Anthology, an open-source website hosting papers on computational linguistics and NLP. This makes it an unbiased ranking methodology that reflects each institution’s research advancement in NLP.

Venues and Time Range

Papers published in the last 10 years (2010 ~ 2019) at selected venues are collected from ACL Anthology. All venues hosted by ACL events, as well as a few venues hosted by non-ACL events, are considered for NLP Rankings.

Retrieving Bibliography from ACL Anthology

Thanks to ACL Anthology (which did most of the heavy lifting), the papers are readily organized by venues and publication years.

The ACL Anthology homepage, organized by venues and years.

Each venue also provides a bibliography file that contains the details of all papers published at that venue.

The following shows an example of the information presented in these bibliography files.

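A typical entry looks roughly like the following. All field values below, including the citation key, URL, and page numbers, are placeholders for illustration rather than values copied from a real file:

@inproceedings{tao-etal-2019-example,
    title = "An Example Paper Title",
    author = "Tao, Chongyang and Wu, Wei and Xu, Can and
              Hu, Wenpeng and Zhao, Dongyan and Yan, Rui",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    year = "2019",
    url = "https://www.aclweb.org/anthology/P19-XXXX",
    pages = "1--11",
}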

All downloaded bib files, together with the code used to scrape the information, can be found in our open-source project.
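As a quick illustration, the following sketch downloads the bibliography file for one volume. The volume ID and the .bib URL pattern here are assumptions made for the example; the actual scraper in the project resolves the real URLs from the Anthology site.

import requests

# Hypothetical volume ID and .bib URL pattern; check the Anthology
# site for the exact scheme before running this.
VOLUME_ID = 'P19-1'
URL = f'https://www.aclweb.org/anthology/volumes/{VOLUME_ID}.bib'

response = requests.get(URL)
response.raise_for_status()

# Save the bibliography file locally for later parsing.
with open(f'{VOLUME_ID}.bib', 'w', encoding='utf-8') as fout:
    fout.write(response.text)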

Retrieving Email Addresses from Publications

Once we have all the bibliography files in place, it is easy to access each paper’s URL and download the respective PDF. All PDF files are converted into the TXT format using Apache Tika, which allows us to obtain the information we need to rank universities: the authors’ email addresses.
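A minimal sketch of this conversion using the tika Python package (the file names are placeholders):

from tika import parser

# Apache Tika parses the PDF and returns its plain-text content.
parsed = parser.from_file('paper.pdf')
text = parsed['content'] or ''

with open('paper.txt', 'w', encoding='utf-8') as fout:
    fout.write(text)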

The authors’ email domains reveal which institutions they belong to. Taking the above paper as an example, the authors Chongyang Tao, Wenpeng Hu, Dongyan Zhao, and Rui Yan belong to Peking University, whereas Wei Wu and Can Xu work for Microsoft.

Matching the email addresses with authors by their names may seem straightforward, but it encounters the following challenges:

  1. Emails may be presented in groups (e.g. {wuwei,caxu}@microsoft.com must be separated to wuwei@microsoft.com and caxu@microsoft.com).
  2. Only a naming convention is provided (e.g. {firstname.lastname}@abc.edu), meaning that the actual name literals need to be filled in place of the template.
  3. Naming conventions are not consistent across different institutions (e.g. firstname.lastname@abc.edu, {first_initial}{lastname}@abc.edu).
  4. Not all authors provide their email addresses in their publications (e.g. we may have 4 authors but only 3 email addresses).

1. Separating Emails that Come in Groups

First, for each paper in TXT format, we extract the first 2,000 characters, which most likely cover the area on the first page containing email addresses. We then use a tedious set of regular expressions, considering almost all possible forms of email addresses, to extract a list of id@domain entries, ideally one per author; this also handles the case of grouped email addresses described above.
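A minimal sketch of this step is shown below, with a single pattern standing in for the project’s much larger set of regular expressions (the file name is a placeholder):

import re

# Matches both grouped addresses ({id1,id2}@domain) and plain ones
# (id@domain). A simplified stand-in for the project's patterns.
EMAIL_RE = re.compile(r'\{([^}]+)\}@([\w.-]+)|([\w.+-]+)@([\w.-]+)')

def extract_emails(text):
    """Extract email addresses, expanding grouped IDs such as
    {wuwei,caxu}@microsoft.com into individual addresses."""
    emails = []
    for group_ids, group_domain, single_id, single_domain in EMAIL_RE.findall(text):
        if group_ids:  # grouped form: {id1,id2}@domain
            emails.extend(f'{i.strip()}@{group_domain}' for i in group_ids.split(','))
        else:          # plain form: id@domain
            emails.append(f'{single_id}@{single_domain}')
    return emails

# Only the first 2,000 characters are scanned, which usually covers
# the header area of the first page where emails appear.
first_page = open('paper.txt', encoding='utf-8').read()[:2000]
print(extract_emails(first_page))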

Running this extraction on the example paper retrieves the following email addresses:

['chongyangtao@pku.edu.cn', 'wenpeng.hu@pku.edu.cn', 'zhaody@pku.edu.cn', 'ruiyan@pku.edu.cn', 'wuwei@microsoft.com', 'caxu@microsoft.com']

2. Filling in Details to the Email Address Format

In some publications, instead of presenting the IDs or the full email addresses, the authors only provide a template such as firstname.lastname@school.edu.

Without any modification, the extraction step above returns the following:

['firstname.lastname@imag.fr']

To handle this issue, we again use regular expressions to substitute the placeholders with the respective names. It is relatively easy to extract the authors’ names from the bibliography files, and to distinguish each author’s last name from their first name (and middle initial, if provided).

Once we have the first name, (middle initial), and last name of each author, we iterate through the author list to generate the following:

['bo.li@imag.fr', 'eric.gaussier@imag.fr']
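The sketch below shows one way such a substitution could work; the placeholder tokens and the helper function are illustrative, not the project’s actual code:

import re

# Placeholder tokens commonly seen in email templates; the real set
# of conventions is larger.
TOKEN_RE = re.compile(r'firstname|lastname|first|last', re.IGNORECASE)

def fill_template(template, authors):
    """Substitute each author's name into a template such as
    firstname.lastname@imag.fr."""
    filled = []
    for first, last in authors:
        def sub(match):
            token = match.group(0).lower()
            return first.lower() if token.startswith('first') else last.lower()
        filled.append(TOKEN_RE.sub(sub, template))
    return filled

print(fill_template('firstname.lastname@imag.fr', [('Bo', 'Li'), ('Eric', 'Gaussier')]))
# ['bo.li@imag.fr', 'eric.gaussier@imag.fr']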

3–4. Matching Emails with Author Names

Now that we have a list of email addresses extracted from each paper, the only step left is to match the emails with the corresponding authors, since emails are not always provided in the same order as the authors listed in the bibliography files.

Fortunately, academic authors tend to use their institutional emails, and these email addresses often follow typical naming conventions. Thus, we pseudo-generate the following email IDs for every author, where f/m/l is the initial of the first/middle/last name, and (m) is optional:

  • firstname lastname (e.g., jinho choi)
  • f (m) lastname (e.g., j d choi)
  • lastname f (m) (e.g., choi j d)
  • firstname (e.g., jinho)
  • lastname (e.g., choi)
  • f (m) l (e.g., j d c)

Since there are 6 naming conventions in our template, 6 email IDs are pseudo-generated for every author, and each is compared independently, in terms of the Levenshtein distance, to the list of email addresses extracted above. The package FuzzyWuzzy is used to measure the Levenshtein distance.
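A sketch of the candidate generation, together with the similarity measurement via fuzz.ratio from FuzzyWuzzy (the helper name and the example values are ours):

from fuzzywuzzy import fuzz

def candidate_ids(first, last, middle=''):
    """Pseudo-generate the six candidate email IDs for one author,
    following the naming conventions listed above."""
    f, m, l = first[0], middle[:1], last[0]
    candidates = [
        first + last,   # firstname lastname -> jinhochoi
        f + m + last,   # f (m) lastname     -> jdchoi
        last + f + m,   # lastname f (m)     -> choijd
        first,          # firstname          -> jinho
        last,           # lastname           -> choi
        f + m + l,      # f (m) l            -> jdc
    ]
    return [c.lower() for c in candidates]

print(candidate_ids('Jinho', 'Choi', 'D'))
# ['jinhochoi', 'jdchoi', 'choijd', 'jinho', 'choi', 'jdc']

# fuzz.ratio is a Levenshtein-based similarity in [0, 100]; here it is
# applied to the ID part of a hypothetical address.
print(fuzz.ratio('jinhochoi', 'jinho.choi'))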

To match emails and authors accurately, a matrix M of size e × c is created for every paper, where

  • e is the number of extracted email addresses,
  • c = n · a,
  • n = 6 is the number of naming conventions above,
  • a is the number of authors in the corresponding bibliography file.

Thus, one email ID is pseudo-generated per column by substituting the corresponding author’s name into the naming convention. Each cell in M is then filled with the Levenshtein distance between the email of its row and the pseudo-generated ID of its column. Finally, the argmin of each row is taken, so that the author corresponding to that column is matched to the email represented by the row.

A publication may have more authors than emails; in that case, the authors whose IDs are least similar to every email address end up paired with an empty string, and the contributions of those unmatched authors are discarded from scoring (for now).

Because we match emails based on argmin, an email is never assigned to an author simply because that author is processed first. Thus, when the number of emails is less than the number of authors, which authors end up paired with an empty string depends only on the similarity scores, not on the order of processing.
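Putting it together, a sketch of the row-wise argmin matching, reusing candidate_ids from the sketch above:

from fuzzywuzzy import fuzz

# Assumes candidate_ids() from the previous sketch is in scope.
def match_emails(email_ids, authors):
    """Match each extracted email ID (row) to an author by taking the
    argmin distance over all c = 6 * a candidate IDs (columns)."""
    columns = [(idx, cand)
               for idx, (first, last) in enumerate(authors)
               for cand in candidate_ids(first, last)]
    matches = {}
    for eid in email_ids:
        # fuzz.ratio is a similarity, so its negation acts as the
        # distance that we minimize.
        idx, _ = min(columns, key=lambda col: -fuzz.ratio(eid, col[1]))
        matches[eid] = authors[idx]
    return matches

emails = ['wuwei', 'caxu']                 # IDs before the '@'
authors = [('Wei', 'Wu'), ('Can', 'Xu')]
print(match_emails(emails, authors))       # {'wuwei': ('Wei', 'Wu'), 'caxu': ('Can', 'Xu')}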

Next Part …

Part 1 describes:

  • How the data are collected and extracted from ACL Anthology.
  • How email addresses are matched with the corresponding author names.

In Part 2, we will explain our scoring mechanism for measuring the research contribution of each author and academic institution.

Please visit the following pages for more information, or leave a comment if you have any questions or suggestions!
