Blog Post 3: It’s Decision Time.

Bryan Hanner
GatesNLP
Apr 17, 2019

Hello! Dying to know which project idea we chose? We’re here with all the answers, because it’s decision time!

We hope you’re as excited as this puppy. Photo by Joe Caione on Unsplash

Our Chosen Idea

We decided on a project inspired by Idea 3 from our last blog post, albeit with a few significant changes. We are now building a way to understand associations, and the strength of those connections, between academic research papers based on their text. More formally, we plan to build a model that takes research papers as input and outputs the top-k most similar research papers for a new research paper in the test set.

So, what motivates us? The scientific community is complex, and there is a vast amount of literature out there with little to no organization. While many search engines have tried to estimate associations between papers, we are not aware of a single tool that compares papers based on the ideas in the text itself, particularly for the use case of finding what is related to a paper someone is still in the process of writing. We believe this is an extremely interesting space to be in because it lets us start organizing the vast amount of data in academic literature, while applying natural language processing to that organization problem.

Minimum Viable Plan

As a minimum viable plan, we hope to achieve two main goals:

  • Build both a supervised and an unsupervised model that generates a list of the top-k “similar” papers for a new research paper. For the supervised model, we would use citations as the “labels” and then check whether the associations we develop match existing citations. For the unsupervised clustering model, we would take research papers as input and output papers that discuss related topics (mapping semantic meaning into a vector space for comparison and/or k-nearest-neighbor search).
  • Build an evaluation framework that includes metrics for the unsupervised and supervised models, makes comparisons across those models, and uses human evaluation to further understand the data and how well our models generate associations between papers.

Stretch Goals

We believe there is immense potential to improve our unsupervised and supervised models. This includes:

  • Having an interpretable confidence score for each output paper. It will be helpful for the user to see how similar the results are to their input paper.
  • Creating a richer set of associations related to keywords, authors, topic areas, etc. between different components of our model. This would widen the utility of our model (by allowing outputs beyond just papers) while also improving the accuracy and strength of the connections we find.
  • Giving reasoning (keywords, etc.) that might help the user understand why each paper was relevant or similar. This is related to the previous point.
  • Incorporating other natural language input, such as questions. This would improve the utility of our application by letting people interface with our model the way they would in real life.
  • Writing a parser to allow the user to simply upload their paper as a file instead of copying the content as plaintext.

Related Work

The closest work that we found to our goal of modeling paper similarity was the study of text similarity. For general comparisons, there are adaptations of Word Mover’s Distance, random walks through graphs such as WordNet, and mathematical comparisons of pairs of texts including Jaccard and cosine similarity. It does not appear that neural models have been used much in this area or that there has been much work on this topic in the last five years, so we hope to bring some modern approaches to this task while learning from past research.

We found a few works that apply unsupervised learning to analyze text content and model topics. An example of work that specifically deals with papers is the Toronto Paper Matching System, which automates the task of matching paper reviewers with papers. They also struggled with the evaluation of their system, so we might start with their lines of thinking, including building multiple models and directly comparing the models’ results (especially for our unsupervised models).

There are also several major instances of people using unsupervised learning to observe trends in text data over time. Hall et al. used latent Dirichlet allocation (LDA) as an unsupervised topic model, and then proceeded to observe and analyze trends. Prabhakaran et al. expanded on this idea by considering the context in which a topic is observed to better understand what that occurrence means (e.g. whether the topic is getting more or less prominent in research). Context will be important to consider in our models as well. Tan et al. in Noah’s ARK also used LDA among other techniques to look at the relationships between ideas and organized these relationships into four different categories. They used case studies to showcase their results, which would be a potential path for us. This analysis work may be useful as inspiration for stretch goals, or as a different way to look at evaluation if we want to learn specific topics to use for clustering.

Since in our models we plan to use citations as our “true” related papers, there is also work on citation recommendation that we can look into, including Semantic Scholar’s work. It would also be worth starting a conversation with Jevin D. West to get more domain knowledge since he has extensive experience with citation networks and academic publishing.

Project Objectives

For the first part of our minimum viable plan, we want to build models that generate a list of the top-k “similar” research papers for a new research paper by learning the relationships between research papers during training. To do that, we will train on research papers as input. For the supervised model, we will label the data using citations and then build a model that lets users input a paper and outputs a top-k ranking of related papers (with the strength of those associations as a stretch goal). For the unsupervised model, we will implement a clustering model that groups similar papers together, enabling us to understand associations between papers.

For the second part of our minimum viable plan, we will develop an evaluation framework to understand how effective our models are. This framework includes both a quantitative component (using citations as “true” similar papers) and a qualitative component (human evaluation and comparison across models). We explain this in more detail below.

Proposed Methodologies

We will start with a strawman approach: use Jaccard similarity or another word-overlap metric to compute the percentage of words shared by a pair of documents, and sort by that percentage to find the most similar papers. To build on that, we can improve the model by removing function words and focusing only on technical or otherwise meaningful content words, and we can use inverse document frequency (idf) to give extra weight to rare words that are shared.
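To make the strawman concrete, here is a rough sketch, assuming simple whitespace tokenization and a toy corpus (the documents, query, and helper names are placeholders, not our final implementation):

    import math
    from collections import Counter

    def tokenize(text):
        # Naive lowercase whitespace tokenization; a real system would use a proper tokenizer.
        return text.lower().split()

    def jaccard(doc_a, doc_b):
        # Fraction of unique words shared between the two documents.
        a, b = set(tokenize(doc_a)), set(tokenize(doc_b))
        return len(a & b) / len(a | b) if a | b else 0.0

    def idf_weighted_overlap(doc_a, doc_b, corpus):
        # Weight shared words by inverse document frequency so rare, technical
        # terms count more than common function words.
        doc_sets = [set(tokenize(d)) for d in corpus]
        n = len(doc_sets)
        df = Counter(w for s in doc_sets for w in s)
        idf = {w: math.log(n / df[w]) for w in df}
        shared = set(tokenize(doc_a)) & set(tokenize(doc_b))
        return sum(idf.get(w, math.log(n)) for w in shared)

    # Rank a tiny toy corpus against a query paper by Jaccard similarity.
    corpus = ["neural parsing of scientific text",
              "gaussian mixture models for clustering",
              "parsing text with neural networks"]
    query = "neural models for parsing text"
    ranking = sorted(corpus, key=lambda d: jaccard(query, d), reverse=True)
    print(ranking[:2])  # top-2 most similar papers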

In terms of datasets, many collections of raw papers are available, and some already have the text extracted into a database, such as the one on Kaggle below. This reduces the complexity of our problem and the barriers we must overcome to implement our strawman approach.

For our unsupervised approach, we will implement a model that clusters similar papers together, letting us understand associations between papers and find nearest neighbors. We plan to learn a clustering of the papers with a method like expectation-maximization (EM) over Gaussian mixture models, ideally one that also learns the number of clusters. Treating this as a clustering problem lets us see the relationships between papers in more depth, and we think it will also make finding the top-k related papers less computationally expensive. After exploring clustering methods, we are also interested in other ways to map papers into a vector space where we can learn their semantic distances without needing a finite set of clusters.
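As a rough sketch of the clustering direction, assuming scikit-learn, TF-IDF features reduced with truncated SVD, and a hard-coded number of mixture components (the toy corpus and parameter values are placeholders; in practice we would pick the number of components more carefully, e.g. with BIC):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus standing in for full paper texts.
    papers = [
        "neural machine translation with attention",
        "attention mechanisms for neural translation",
        "gaussian mixture models for clustering documents",
        "clustering scientific documents with topic models",
    ]

    # Represent each paper as a dense vector: TF-IDF followed by truncated SVD (LSA).
    tfidf = TfidfVectorizer(stop_words="english")
    X = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf.fit_transform(papers))

    # Fit a Gaussian mixture over the paper vectors and read off cluster assignments.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    clusters = gmm.predict(X)
    print(clusters)  # cluster assignment for each paper

    # Restricting the search to the query paper's cluster would cut down the
    # candidate set; here we simply rank every other paper by cosine similarity.
    query = 0
    others = [i for i in range(len(papers)) if i != query]
    sims = cosine_similarity(X[query:query + 1], X[others])[0]
    print([others[i] for i in sims.argsort()[::-1]])  # most to least similar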

For the supervised model, we will label the data using citations (under the assumption, which itself needs exploration, that cited papers are similar in some way) and then build a model that outputs pairwise similarity scores, which we can use to rank the papers. We treat the problem as classifying pairs of documents as “similar” or not. We hypothesize that the supervised model will have more domain-specific knowledge to learn by tuning to citations, though this adds bias that needs to be analyzed.
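A minimal sketch of the pairwise formulation, assuming scikit-learn, a toy corpus, a toy citation graph, and a single hand-crafted feature (all placeholders for whatever features or learned representations we end up using):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus and citation edges; citations[i] is the set of papers that paper i cites.
    papers = [
        "neural machine translation with attention",
        "attention mechanisms for neural translation",
        "gaussian mixture models for clustering documents",
        "clustering scientific documents with topic models",
    ]
    citations = {0: {1}, 1: set(), 2: {3}, 3: set()}

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(papers)

    def pair_features(i, j):
        # A single hand-crafted feature (text cosine similarity) for the pair;
        # a neural model could replace this with learned representations.
        return [cosine_similarity(X[i], X[j])[0, 0]]

    # Each (citing, cited) pair is a positive example; all other pairs are negatives.
    pairs = [(i, j) for i in range(len(papers)) for j in range(len(papers)) if i != j]
    features = np.array([pair_features(i, j) for i, j in pairs])
    labels = np.array([1 if j in citations[i] else 0 for i, j in pairs])

    clf = LogisticRegression().fit(features, labels)

    # Rank candidate papers for a query by the predicted probability of "similar".
    query = 0
    candidates = [j for j in range(len(papers)) if j != query]
    scores = clf.predict_proba(np.array([pair_features(query, j) for j in candidates]))[:, 1]
    print([candidates[i] for i in np.argsort(scores)[::-1]])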

Though we are not sure whether this is possible, we are interested in exploring whether we can combine the supervised and unsupervised ideas by training our model to minimize the distance, in its vector space, between papers and the papers they cite. We describe our approach to evaluation below.
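We have not committed to this, but one way to realize the combined idea is a triplet-style objective that pulls a paper’s embedding toward the papers it cites and away from randomly sampled papers. A minimal sketch, assuming PyTorch and placeholder bag-of-words inputs (the encoder, dimensions, and batch are all stand-ins):

    import torch
    import torch.nn as nn

    # Placeholder encoder mapping a bag-of-words vector to a 64-dimensional embedding;
    # in practice this would be a real text encoder (e.g. built with AllenNLP/PyTorch).
    encoder = nn.Sequential(nn.Linear(1000, 128), nn.ReLU(), nn.Linear(128, 64))
    loss_fn = nn.TripletMarginLoss(margin=1.0)
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

    # Dummy batch: each anchor paper is paired with a paper it cites (positive)
    # and a randomly sampled paper it does not cite (negative).
    anchor, cited, random_paper = (torch.rand(32, 1000) for _ in range(3))

    loss = loss_fn(encoder(anchor), encoder(cited), encoder(random_paper))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()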

Available Resources

Our project leverages existing technologies and datasets, piecing the different aspects together. In terms of datasets, our top choice is a NeurIPS dataset that contains the extracted text of NeurIPS papers from 1987 to 2017.

Additionally, here are some other potential datasets that we might use:

Finally, we plan to leverage existing Python technologies and libraries such as AllenNLP and scikit-learn.

The Evaluation Plan

In the supervised case, we can use a ranking metric such as Discounted Cumulative Gain (DCG) to give more credit to papers that are cited in a given work and also appear near the top of our model’s ranking of “similar” papers. There are many related metrics to choose from, and we will need to adapt whichever one we pick to the restriction that citations only give us a binary “similar or not” relationship between papers.
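To make this concrete, here is a small sketch of DCG/nDCG with binary relevance derived from citations (the ranking, citation set, and cutoff are made-up examples):

    import math

    def dcg_at_k(ranked_ids, cited_ids, k):
        # Binary relevance: 1 if the ranked paper is cited by the query paper.
        return sum(
            1.0 / math.log2(rank + 2)  # rank is 0-based, so +2
            for rank, paper_id in enumerate(ranked_ids[:k])
            if paper_id in cited_ids
        )

    def ndcg_at_k(ranked_ids, cited_ids, k):
        # Normalize by the ideal DCG, where all cited papers are ranked first.
        ideal = dcg_at_k(list(cited_ids), cited_ids, k)
        return dcg_at_k(ranked_ids, cited_ids, k) / ideal if ideal > 0 else 0.0

    # Example: our model ranks papers 7, 2, 9, 4; the query paper cites 2 and 4.
    print(ndcg_at_k([7, 2, 9, 4], {2, 4}, k=4))  # credit for 2 (rank 2) and 4 (rank 4)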

In the unsupervised case, our evaluation plan is first to check how close cited works are to the citing paper in our model’s vector space, where we could reuse the ranking metrics from the supervised evaluation. We may also be able to check that papers with the same keywords (as defined in the papers themselves) end up close together in our model. Both of these options capture the meaning we want to extract from the model, but because we are using clustering, we may also need a standard clustering metric such as the silhouette coefficient.
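For the clustering metric, scikit-learn already provides the silhouette coefficient; a tiny sketch, assuming placeholder paper vectors and cluster labels that would come from the model above:

    import numpy as np
    from sklearn.metrics import silhouette_score

    # Placeholder paper embeddings and cluster labels; in our pipeline these
    # would come from the TF-IDF/SVD vectors and the GMM assignments.
    X = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80], [0.85, 0.95]])
    labels = np.array([0, 0, 1, 1])

    # Ranges from -1 to 1; higher means papers sit closer to their own cluster
    # than to neighboring clusters.
    print(silhouette_score(X, labels))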

Part of our research this quarter will be exploring our options here. We plan to supplement the automatic metrics with human evaluation, possibly in the form of case studies. Through human evaluation, inspection of both the data and our models, and comparison across models, we hope to learn about patterns in the data and what each model is learning, though we expect interpretability challenges with our neural models.

That’s what our path ahead looks like for now. We’ll be back soon with more updates!
