Ranking U.S. Universities for NLP Research Part 2: Scoring Mechanism

Chloe Lee
4 min read · Feb 21, 2020


This is Part 2 of the NLP Rankings series, which explains the scoring mechanism. Part 1 discussed the data collection procedure used to measure the ranking scores.

NLP Rankings is an open-source project created by the NLP Research Lab at Emory University to provide a comprehensive understanding of the NLP research environment in academic institutions, so that researchers can make informed decisions about how to proceed with their careers in NLP.

Emory NLP is the Natural Language Processing Research Laboratory at Emory University

Scoring Mechanism

Our scoring mechanism is similar to that of CSRankings, but it differs in three major ways:

  1. Unlike CSRankings, which considers only three venues in NLP (ACL, NAACL, EMNLP), NLP Rankings considers most major venues and allows users to customize the weights different venues receive in scoring.
  2. Unlike CSRankings, where an author’s current institution takes all the credit for work the author did even before joining that institution, NLP Rankings distinguishes between publications made at different institutions (e.g., the research contribution an author makes during their Ph.D. stays with the Ph.D. institution, whereas contributions made after becoming a faculty member are credited to the institution where the author holds that position).
  3. Unlike CSRankings, which considers research contributions only from faculty members, NLP Rankings considers contributions from all authors (e.g., if a paper has 5 authors, 3 of whom are students from University A, 1 a professor from University A, and 1 a professor from University B, NLP Rankings gives 4/5 of the credit to University A and 1/5 to University B, whereas CSRankings gives 1/5 to each; see the sketch below).
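
To make the third point concrete, here is a minimal sketch (not the project’s actual code) contrasting the two attribution schemes for the 5-author example above; the author and institution names are placeholders.

```python
from collections import Counter

# Hypothetical example: 5 authors — 3 students and 1 professor from University A,
# 1 professor from University B.
authors = [
    ("student_1", "University A"),
    ("student_2", "University A"),
    ("student_3", "University A"),
    ("prof_1", "University A"),
    ("prof_2", "University B"),
]
faculty = {"prof_1", "prof_2"}

a = len(authors)  # total number of authors

# NLP Rankings: every author contributes 1/a of the paper to their institution.
nlp_rankings = {inst: b / a
                for inst, b in Counter(inst for _, inst in authors).items()}

# CSRankings-style: only faculty authors earn credit, 1/a each.
csrankings = {inst: b / a
              for inst, b in Counter(inst for name, inst in authors
                                     if name in faculty).items()}

print(nlp_rankings)  # {'University A': 0.8, 'University B': 0.2}
print(csrankings)    # {'University A': 0.2, 'University B': 0.2}
```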

Measuring Author Scores

Each publication is counted only once, and credit is evenly distributed across all authors, so that every author receives the same score of w/a, where w is the weighted credit of the publication venue and a is the total number of authors on the paper.

By default, papers from the major venues (CL, TACL, ACL, NAACL, EMNLP) are credited with a weight of 3, other conferences (COLING, CoNLL, EACL, IJCNLP) with a weight of 2, and workshops/demonstrations with a weight of 1. Note that workshop and demonstration papers under 5 pages are not included because they are often incomplete.

ACL 2020

The overall score of each author is the sum of the scores from all of that author’s publications.
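
As a rough illustration, the sketch below computes author scores under the default weights described above. The weight table mirrors the post, but the paper data structure is an assumption for illustration, not the project’s actual schema.

```python
# Default venue weights as described above; all other structures are illustrative.
VENUE_WEIGHTS = {
    "CL": 3, "TACL": 3, "ACL": 3, "NAACL": 3, "EMNLP": 3,  # major venues
    "COLING": 2, "CoNLL": 2, "EACL": 2, "IJCNLP": 2,        # other conferences
    "workshop": 1, "demo": 1,                                # workshops / demonstrations
}

def author_scores(papers):
    """papers: list of dicts with 'venue' and 'authors' keys (hypothetical schema)."""
    scores = {}
    for paper in papers:
        w = VENUE_WEIGHTS.get(paper["venue"], 1)  # weighted credit of the venue
        a = len(paper["authors"])                 # total number of authors
        for author in paper["authors"]:
            scores[author] = scores.get(author, 0.0) + w / a  # each author gets w / a
    return scores

papers = [
    {"venue": "ACL", "authors": ["alice", "bob"]},  # each author receives 3 / 2 = 1.5
    {"venue": "EMNLP", "authors": ["alice"]},       # alice receives 3 / 1 = 3.0
]
print(author_scores(papers))  # {'alice': 4.5, 'bob': 1.5}
```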

Measuring University Scores

From the data collection step, the publication information for each selected venue is organized into an individual JSON file with extracted metadata such as the publication ID, title, author names, email addresses, author IDs, number of pages, year of publication, etc., where the lists of author names, email addresses, and author IDs are aligned in the same order.
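
For illustration only, one entry in such a per-venue file might look roughly like the record below; the field names are guesses based on the description above rather than the project’s actual schema, and the three author-related lists are aligned by position.

```python
# Hypothetical record from a per-venue JSON file (field names are assumed).
example_record = {
    "id": "P19-1234",                                    # publication ID (illustrative)
    "title": "An Example Paper Title",
    "authors": ["Alice Smith", "Bob Jones"],             # author names
    "emails": ["alice@cs.emory.edu", "bob@example.com"], # matched to authors by position
    "author_ids": ["alice-smith", "bob-jones"],          # matched to authors by position
    "pages": 9,                                          # number of pages
    "year": 2019,                                        # year of publication
}
```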

In Part 1, we also went through the process of correctly matching authors with their corresponding email addresses, which are the only indicators of the institution to which an author dedicates a given piece of work.

  • The JSON file that contains the email domains of U.S. universities (used to allocate publication contributions): university_info_us.json

If an author uses a non-institutional email address (e.g., a personal email or an email from an industrial organization), we assume that the author does not dedicate that particular work to any academic institution.

For each paper, a score of (w*b)/a is credited to each institution, where w is the weighted credit of the publication venue, a is the total number of authors on the paper, and b is the number of authors from that particular institution.

The overall score of an academic institution is the sum of the scores from all of the institution’s publications.
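
Putting these pieces together, here is a minimal sketch of the university scoring described above, assuming a simple mapping from email domains to universities (the structure of university_info_us.json shown here is an assumption) and a hypothetical paper record that carries the authors’ email addresses.

```python
from collections import defaultdict

# Illustrative stand-in for university_info_us.json (structure assumed).
DOMAIN_TO_UNIVERSITY = {
    "emory.edu": "Emory University",
    "stanford.edu": "Stanford University",
}

def institution_of(email):
    """Map an email address to a U.S. university, or None for personal/industry addresses."""
    domain = email.split("@")[-1].lower()
    for suffix, university in DOMAIN_TO_UNIVERSITY.items():
        if domain == suffix or domain.endswith("." + suffix):
            return university
    return None  # non-institutional email: no academic credit

def university_scores(papers, venue_weights):
    """papers: list of dicts with 'venue' and 'emails' keys (hypothetical schema)."""
    scores = defaultdict(float)
    for paper in papers:
        w = venue_weights.get(paper["venue"], 1)  # weighted credit of the venue
        a = len(paper["emails"])                  # total number of authors
        per_institution = defaultdict(int)        # b: number of authors per institution
        for email in paper["emails"]:
            institution = institution_of(email)
            if institution is not None:
                per_institution[institution] += 1
        for institution, b in per_institution.items():
            scores[institution] += w * b / a      # (w * b) / a credited to the institution
    return dict(scores)
```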

Key Features

All authors count!

NLP Rankings considers all authors; if a paper is published by students and professors from the same university, that university receives the full credit for the paper.

A paper where all students and a faculty member are from the same university.

This is to ensure that contributions from students are not overlooked. Most often, students collaborate with professors from their own university to publish papers; if only faculty members were considered, the ranking score for each university would be less representative.

Author movement should not affect prior scores

Scores from NLP Rankings are sensitive to institutional authorship; in other words, scores earned by an author at one institution are not transferred to another institution when the author moves.

Although a reputable author with numerous publications is very likely to continue performing at a high level at another institution, such an expectation cannot be guaranteed because the research environment and student quality vary by institution.

Instead, NLP Rankings indicates how active each author is (or has been) at each institution by showing the year of the author’s last publication dedicated to that particular institution.
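
A minimal sketch of that indicator, assuming each paper record carries a year and a per-author list of resolved institutions (an assumed schema, not the project’s actual one):

```python
def last_active_year(papers):
    """For each (author_id, institution) pair, return the year of the most recent
    publication the author dedicated to that institution (schema is assumed)."""
    last_seen = {}
    for paper in papers:
        for author_id, institution in zip(paper["author_ids"], paper["institutions"]):
            if institution is None:
                continue  # non-institutional email: nothing to credit
            key = (author_id, institution)
            last_seen[key] = max(last_seen.get(key, 0), paper["year"])
    return last_seen
```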

Next Part …

Part 2 describes:

  • The distinctive features of NLP Rankings as opposed to CSRankings
  • Which data are used to calculate each university’s ranking scores
  • How the ranking scores are derived and calculated

In the next part, we will conduct preliminary analyses and create visualizations that provide more structured insight into the NLP research environments at U.S. universities.

Please visit the following pages for more information, or leave a comment if you have any questions or suggestions!
