How to write a persuasive ICLR review: visualizing the ICLR 2018 open review dataset

March 20, 2018

I recently discovered the site Openreview.net, which not only lists every paper submitted to various conferences, but often includes the anonymous reviews! For example, 930 submissions to ICLR 2018, including titles, abstracts, reviews and acceptance decisions are available online.

The public availability of this data and permissive terms of service (https://openreview.net/terms) invite some data exploration. I’m not aware of any other efforts on this front, although I’m sure they exist.

The goal of the analysis is to identify language characteristic of positive and negative reviews and of reviews of accepted and rejected papers, and to see whether there are commonalities among reviewers who went against the grain. In other words, does anything interesting distinguish positive reviews of rejected papers and negative reviews of accepted papers?

In short, clear, low-jargon language appears to be a hallmark of influential reviews (e.g., the phrase “main idea”), while verbose constructions seem to be signs of uninfluential reviews (e.g., “formulation of the problem”).

We’ll first look at how to scrape openreview.net and wrangle the data into a categorized data frame we can analyze. Then we’ll use Scattertext (Kessler 2017) to see the differences in content among these various classes of reviews, and also look at some class association scores and phrase detection methods.

Plot 0: The final output.

Scraping openreview.net and building the corpus

Scraping the site itself is slightly non-trivial since most of the content is rendered through AJAX, but by monitoring the network activity the task becomes very easy. The following code block shows how to politely scrape all ICLR reviews and represent them as a Pandas data frame.
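
As a rough illustration of what polite paging looks like, here is a minimal sketch using only the standard library. The endpoint in `NOTES_URL`, the `invitation`/`offset`/`limit` query parameters and the `"notes"` key in the response are assumptions of the kind you would confirm in the browser's network tab; they are not taken from the notebook. The helper names are hypothetical.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: illustrative endpoint; confirm the real one in the network tab.
NOTES_URL = "https://openreview.net/notes"


def page_urls(invitation, total, page_size=100, base_url=NOTES_URL):
    """Yield one URL per page of results (parameter names are assumed)."""
    for offset in range(0, total, page_size):
        query = urlencode(
            {"invitation": invitation, "offset": offset, "limit": page_size}
        )
        yield f"{base_url}?{query}"


def fetch_json(url):
    """Download a URL and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def crawl(urls, fetch=fetch_json, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to stay polite."""
    notes = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        payload = fetch(url)
        notes.extend(payload.get("notes", []))  # assumed response shape
    return notes
```

The list of dictionaries returned by `crawl` can be handed to `pandas.DataFrame` to get a data frame with one row per note. Passing `fetch` and `sleep` as arguments keeps the paging logic testable without touching the network.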

Code to crawl openreview.net.

Next, we can parse through the data frame and assemble it into a set of categorized reviews. The code is a bit long to include here, so please see the Jupyter notebook (http://nbviewer.jupyter.org/github/JasonKessler/ICLR18ReviewVis/blob/master/MiningICLR2018.ipynb). The result is that 930 papers were identified along with 2,806 unique reviews.

Below is a sample of the first 10 reviews scraped, including their metadata.

The first 10 reviews and their metadata. This is our core dataset.

ICLR has a decent acceptance rate, although the vast majority of accepted papers were accepted as posters. For the purpose of this study, we’ll group all accepted papers together and omit the workshop category.

The reviewer ratings were fairly tentative, with most hovering between 4 and 7 (out of 10). Ratings of 4 and below were labeled by the conference committee as “reject”, 7 and up as “accept”, and 5 and 6 as either “marginally below” or “marginally above” the acceptance threshold. For the analysis, I grouped [1,4] as “Negative” and [7,10] as “Positive”, and omitted [5,6].
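
The binning rule is simple enough to write down directly. This is a minimal sketch, assuming ratings are integers from 1 to 10; the function name `bin_rating` is hypothetical, and the “Neutral” label is what the omitted [5,6] band is called later in the post.

```python
def bin_rating(rating):
    """Map a 1-10 reviewer rating to the class label used in the analysis."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 4:
        return "Negative"
    if rating >= 7:
        return "Positive"
    return "Neutral"  # ratings of 5 and 6, dropped before plotting


# Keep only the reviews that fall into one of the two contrasted classes.
ratings = [3, 5, 6, 7, 9]
kept = [r for r in ratings if bin_rating(r) != "Neutral"]
```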

The distribution of the ratings found in the data.

Difference in Frequency Ranks for Term-Category Associations

Plot 1. A Scattertext plot showing how positive and negative reviews differ. Click the image for an interactive version. We can see positive sentiment expressions like “nice”, “well written”, “useful” and “novel” are highly associated with positive reviews, while concerns about a paper’s “novelty”, its “limited” nature, and markers of skepticism dominate negative language. The “novel” vs. “novelty” dichotomy provides some justification for the decision not to stem or lemmatize.

Let’s first use Scattertext to visualize the difference in language between positive reviews and negative reviews. We’ll grab the review data frame we created in the last section, parse it with spaCy (Honnibal and Johnson 2015), and then use Scattertext to plot unigrams and bigrams which occur at least three times. By default, Scattertext requires that bigrams meet the PMI “phrase-like” criterion, with a threshold coefficient of 8, which is fairly stringent.

The PMI formula for identifying bigrams as phrases. Used in Scattertext, but originally introduced for this purpose in Schwartz et al. (2013).
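
For intuition, pointwise mutual information compares how often a bigram occurs with how often it would occur if its two words were independent: log(p(w1 w2) / (p(w1) p(w2))). The sketch below computes that quantity from a token list. It assumes natural logarithms and simple maximum-likelihood probabilities, and it does not reproduce how Scattertext applies its threshold coefficient; the function name `bigram_pmi` is hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_bigram = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # Positive values mean the pair co-occurs more often than chance.
        pmi[(w1, w2)] = math.log(p_bigram / (p_w1 * p_w2))
    return pmi


tokens = "the paper is well written and the idea is well written".split()
scores = bigram_pmi(tokens)
```

A phrase filter would then keep only the bigrams whose score clears some threshold.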

The code to produce the visualization is fairly concise. One can read the generated set of reviews as a Pandas data frame, add a column containing the spaCy-parsed reviews, and create a Scattertext Corpus object from the data frame. The Corpus object categorizes documents based on their binned rating — i.e., “Positive”, “Negative” or “Neutral”. Documents from the Neutral class are removed.

The visualization is created in HTML form, with frequency ranks from the Positive category on the y-axis, contrasted with those from the “not-categories” (in this case, only the “Negative” category) on the x-axis. The words are scored and colored based on their difference in rank. The ranks used in this plot (Plot 1) are dense: tied terms share a rank, and no rank values are skipped.

Snippet 1. The code to produce the scatter plot.

Note that spaCy tokenizes contractions as multiple words, so the “s” that appears in the upper right-hand corner is likely part of a possessive.

In this plot, the axes both refer to the ranks of unigrams and bigrams in each category. The higher a word is on the y-axis, the higher its frequency rank among positive reviews, while the further right a word is on the x-axis, the higher its frequency rank in negative reviews.
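
To make the scoring concrete, here is a minimal sketch of a dense-rank difference computed in plain Python. It is an illustration of the idea rather than Scattertext's own implementation: each term gets a dense rank within each category, the ranks are scaled to (0, 1] by the largest rank, and the score is the scaled rank in the category of interest minus the scaled rank in the other. The counts are made up for the example.

```python
def dense_ranks(counts):
    """Rank terms by count: lowest count gets 1, ties share a rank, no gaps."""
    distinct = sorted(set(counts.values()))
    rank_of_count = {count: i + 1 for i, count in enumerate(distinct)}
    return {term: rank_of_count[count] for term, count in counts.items()}


def rank_difference(category_counts, other_counts):
    """Score each term by the difference of its scaled dense ranks."""
    terms = set(category_counts) | set(other_counts)
    cat = dense_ranks({t: category_counts.get(t, 0) for t in terms})
    other = dense_ranks({t: other_counts.get(t, 0) for t in terms})
    max_cat, max_other = max(cat.values()), max(other.values())
    return {t: cat[t] / max_cat - other[t] / max_other for t in terms}


# Toy counts, invented for illustration only.
positive = {"novel": 30, "novelty": 5, "the": 100}
negative = {"novel": 5, "novelty": 40, "the": 100}
scores = rank_difference(positive, negative)
```

Terms that are equally common in both categories, like “the” here, score zero and sit on the diagonal of the plot.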

No stop-listing or normalization (other than case-insensitivity) is applied to the terms. Had we lemmatized, we’d fail to pick up that the word “novelty” is the best predictor that a review is negative, while “novel” is a good indicator that a review is positive.

To see how the words appear in context, click the image, mouseover a term, and click on it.

We can use the same workflow to visualize the difference between reviews of accepted papers and rejected papers (omitting workshop papers).

Figure 2. Dense ranks of words used in reviews of accepted and rejected papers.

While the terms highlighted here are similar to those associated with a review’s polarity (e.g., “well written” or “unclear”), we can see that terms associated with a paper’s content (e.g., “memory” and “theoretical” vs. “regularization” and “LSTM”) appear to be influential, potentially reflecting topics favored by the organizers.

Snippet 2. Code used to make Figure 2.

Making use of neutral data: the Log-Odds-Ratio with an Informative Dirichlet Prior for term-associations

Let’s take a brief digression to see how neutral data can be put to use.

The next plot we’ll look at should be similar: it will show the difference between reviews of accepted papers and reviews of rejected papers. Here, we’ll use a different technique for finding interesting terms: the log-odds-ratio with an informative Dirichlet prior from Monroe et al. (2008). It was popularized in the NLP world by Jurafsky et al. (2014).

Feel free to look at the above papers for an explanation of this score. I’ve created a Jupyter notebook which describes how this score works, along with Python code and a lot of charts. This notebook also covers the Dense Rank Difference measure, a measure derived from tf.idf, and a novel scoring measure, Scaled F-Score. http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb

Snippet 3. Source code for producing the log-odds-ratio with an informative Dirichlet prior chart.

We use the reviews of Workshop papers as the background corpus and, instead of looking at the absolute number of word occurrences, use the number of times a word or phrase occurred in a document as our term-count definition. This is accomplished in lines 12–13 of the snippet above. Finally, following Monroe et al., we scale the sum of the prior vector to that of a typical document length to create the term-scorer object (lines 14–16).
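
For readers who want to see the arithmetic, here is a minimal sketch of the score itself, following the formula in Monroe et al. (2008) as I understand it. It is independent of Scattertext's term-scorer object. Adding one to every background count so that no prior weight is zero, the `prior_scale` parameter and the toy counts are all assumptions made for the example.

```python
import math


def log_odds_z_scores(counts_a, counts_b, background, prior_scale):
    """z-scores of the log-odds-ratio of each term in corpus A vs. corpus B.

    The Dirichlet prior is proportional to the background counts and is
    rescaled so that its weights sum to prior_scale.
    """
    vocab = set(counts_a) | set(counts_b)
    # Assumption: add-one smoothing keeps every prior weight positive.
    smoothed = {t: background.get(t, 0) + 1 for t in vocab}
    total = sum(smoothed.values())
    alpha = {t: prior_scale * smoothed[t] / total for t in vocab}
    alpha_0 = sum(alpha.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())

    z = {}
    for t in vocab:
        y_a = counts_a.get(t, 0) + alpha[t]
        y_b = counts_b.get(t, 0) + alpha[t]
        delta = (math.log(y_a / (n_a + alpha_0 - y_a))
                 - math.log(y_b / (n_b + alpha_0 - y_b)))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of delta
        z[t] = delta / math.sqrt(variance)
    return z


# Toy counts, invented for illustration only.
accepted = {"well written": 20, "novelty": 2, "the": 100}
rejected = {"well written": 3, "novelty": 25, "the": 100}
workshop = {"well written": 5, "novelty": 5, "the": 50}
z = log_odds_z_scores(accepted, rejected, workshop, prior_scale=10.0)
```

A larger `prior_scale` pulls the scores of rare terms more strongly toward zero, which is why the choice of a typical document length matters.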

The result is below.

Plot 3. Looking at how language differs between reviews of papers that were subsequently rejected or accepted. The log-odds-ratio with an informative Dirichlet prior was used. In this plot, the x-axis is the log frequency of a word or phrase, while the y-axis is the z-score of the log-odds-ratio. Terms with a z-score in (-1.96,1.96) are listed in gray.

Reviews of accepted papers praised their writing (“well written”), discussed the papers’ appendices, and praised them (“thank”, “nice to see”). Reviews of rejected papers questioned their novelty, contained a number of negatives (“is not”, “is no”, “never”), and criticized the writing style (“unclear”).

Acceptance vs. Positivity

Inevitably, some accepted papers will receive negative reviews while many rejected papers will receive positive reviews. Below, we’ll construct a plot that shows how terms are associated with both a review’s polarity (i.e., whether it was positive or negative) and the ultimate acceptance decision of the paper being reviewed.

There are many ways to construct this chart, but we will define the axes in a way that distinguishes terms that are present in reviews which aligned with the acceptance decision (“good” reviews) from those which went contrary to it (“bad” reviews).

In this case, we’ll only look at unigrams.

  • X-axis: how positive or negative a term is among reviews of accepted papers. The further to the left the more positive a term was among accepted papers.
  • Y-axis: how positive or negative a term is among reviews of rejected papers. The higher a term is, the more positive it was among rejected papers.

Plot 4: contrasting term-sentiment associations of reviews of rejected and accepted papers. Click for an interactive version.

This leads to a division of four quadrants.

  • The upper-left is the Pollyanna quadrant. These are terms which tended to be used in positive reviews, regardless of the acceptance decision. “Nice” dominates here, along with terms like “great” and “potential”. Interestingly, the word “suffer” appears as well, although it was mostly used to summarize shortcomings of comparable approaches.
  • The upper-right is the uninfluential reviewers’ quadrant. These are terms that were used in negative reviews of accepted papers and positive reviews of rejected papers. Discussing a paper’s “usefulness” or “scalability” was, oddly, a relatively good marker of an uninfluential review. Perhaps the ICLR area chairs are looking for something beyond sheer practicality.
  • The lower-right is the negative quadrant. Some of these terms are stylistic (“actually”, “sure” and “enough”), but the terms “novelty” and “hard” point to fundamental criticisms of the concept and to criticism of the writing style (“hard to read”).
  • The lower-left corner is the influential quadrant. These were negative reviews of rejected papers and positive reviews of accepted papers. There’s less jargon here than in the other quadrants, and there are closed-class terms that imply simple, coherent discourse structure (“about”, “first”, “then”, and “given”). Also, words like “interesting”, “idea”, and “clear” are firmly in this quadrant.
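
The quadrant logic above can be sketched in a few lines. This assumes each term already has two association scores, one computed among reviews of accepted papers and one among reviews of rejected papers, where a positive value means the term leans toward positive reviews. The function name and the example scores are hypothetical.

```python
def quadrant(accepted_score, rejected_score):
    """Name the quadrant a term falls in.

    Positive scores mean the term leans toward positive reviews within that
    group. On the plot, positive accepted_score is to the left and positive
    rejected_score is toward the top.
    """
    if accepted_score >= 0 and rejected_score >= 0:
        return "Pollyanna"      # upper-left: positive either way
    if accepted_score < 0 and rejected_score >= 0:
        return "uninfluential"  # upper-right: against the decision
    if accepted_score < 0 and rejected_score < 0:
        return "negative"       # lower-right: negative either way
    return "influential"        # lower-left: in line with the decision


# Invented scores, for illustration only.
example = {
    "nice": (0.8, 0.6),
    "scalability": (-0.3, 0.4),
    "novelty": (-0.7, -0.9),
    "clear": (0.5, -0.4),
}
labels = {term: quadrant(a, r) for term, (a, r) in example.items()}
```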

Below is the code used to make this “FourSquareAxes” chart in Scattertext. Note that we disable “censor_points” on line 22 to allow for terms to be labeled over points.

Code to create Plot 4.

From words to phrases

We can also look, in a similar way, at how noun phrases can indicate how likely a review is to align with the acceptance decision or indicate polarity.

Phrases were found through Handler et al. (2016)’s PhraseMachine (https://github.com/slanglab/phrasemachine), which applies regular expressions over sequences of part-of-speech tags to identify probable NPs. PhraseMachine is integrated into Scattertext in an incarnation that can work in Python 2 or 3.
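
To give a feel for the technique, here is a minimal sketch of matching a regular expression over part-of-speech tags. It does not use PhraseMachine itself. The pattern is written in the spirit of the noun-phrase grammar from the paper as I recall it, so treat it as an assumption, and the input is assumed to be already tagged with four coarse classes: A (adjective), N (noun), P (preposition) and D (determiner). Unlike PhraseMachine, which can return overlapping sub-phrases, this sketch returns only non-overlapping matches.

```python
import re

# One character per token, so regex offsets line up with token offsets.
NP_PATTERN = re.compile(r"[AN]*N(?:PD*[AN]*N)*")
KNOWN_TAGS = {"A", "N", "P", "D"}


def noun_phrases(tagged_tokens, min_words=2):
    """Return multi-word noun phrases from a list of (word, coarse_tag) pairs."""
    tag_string = "".join(
        tag if tag in KNOWN_TAGS else "O" for _, tag in tagged_tokens
    )
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        if len(words) >= min_words:
            phrases.append(" ".join(words))
    return phrases


sentence = [
    ("the", "D"), ("formulation", "N"), ("of", "P"), ("the", "D"),
    ("problem", "N"), ("is", "V"), ("unclear", "A"),
]
found = noun_phrases(sentence)
```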

Plot 5: Noun phrases, polarity and acceptance decisions.

Interestingly, discussions of “previous work” are a very good indicator of an influential review. As in the unigram example, phrases in the lower-left quadrant aren’t genre-heavy and deal with parts of the paper (“main text”, “main idea”) and basic ML concepts (“loss function”, “neural network”, etc.).

Uninfluential language is more pompous and verbose (“aspect of the paper”, “formulation of the problem”) and involves some in-the-weeds concepts like “inception distance.”

Code to create phrase-based Plot 5.

Postscript and Acknowledgements

Thanks to Julien Chaumond for the spelling correction: https://github.com/JasonKessler/jasonkessler.github.io/pull/1

After this blog post was published, the Allen Institute for Artificial Intelligence released a large corpus of paper reviews and corresponding acceptance decisions: https://github.com/allenai/PeerRead

References

Handler, A., Denny, M. J., Wallach, H., & O’Connor, B. (2016). “Bag of What? Simple Noun Phrase Extraction for Text Analysis”. In Proceedings of the Workshop on Natural Language Processing and Computational Social Science at the 2016 Conference on Empirical Methods in Natural Language Processing.

Honnibal, M., & Johnson, M. (2015). An Improved Non-monotonic Transition System for Dependency Parsing. EMNLP.

Jurafsky, D., Chahuneau, V., Routledge, B., and Smith, N. (2014) Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19(4). http://firstmonday.org/ojs/index.php/fm/article/view/4944.

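Kessler, J. S. (2017). Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations.

Monroe, B. L., Colaresi, M. P., & Quinn, K. M. (2008). Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict. Political Analysis, 16(4), 372–403.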

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., . . . Ungar, L. H. (2013). Personality, gender, and age in the language of social media: The open vocabulary approach. PLoS One, 8(9), e73791. doi:10.1371/journal.pone.007379