Blog Post 7: We are Getting Fancy

Swojit Mohapatra
Published in GatesNLP
May 10, 2019

Hello again! This week we made some solid progress on our models. Our approach is to try several different models and see which one performs best, so we implemented a supervised technique, a simple model with an RNN and a feedforward layer, as well as unsupervised techniques that apply cosine similarity to word embeddings. We also took some time to reflect on how we’ve done so far as a team and what we can do to improve.

Supervised Learning

This week, Bryan focused on getting our first supervised model up and running. For the first time this quarter, we have a fully working AllenNLP pipeline including dataset reading, modeling, and prediction. Our supervised setup required a dataset formatted differently than what we originally had, so we’ll start by explaining the data preprocessing.

Since AllenNLP trains on one line at a time and we originally had one paper per line, we needed to reformat the data to have a pair of papers on each line for our pairwise approach. Since we don’t have the computational power or storage to handle all N² pairs, we needed to pick the particular pairs that we care about. The first category was clear: we needed to include pairs that were actually cited (“true pairs”). We also wanted to include one-hop examples because AI2 found that they provide tough examples for the model to learn from in their citation recommendation system (Bhagavatula et al., 2018). A and C form a one-hop pair when A cites B and B cites C (since citing isn’t transitive by our definition, C is not cited by, and therefore not relevant to, A). Later, we added random negative examples by sampling until we have five negative examples for each paper (up to our desired total). We currently bias towards the order of the original data, which we hope to reconsider more thoroughly in the upcoming week.
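
To make this pair construction a bit more concrete, here is a minimal sketch of the sampling logic. The function and field names are just illustrative (not our exact script), and it leaves out the overall cap on the total number of pairs:

import random

def build_pairs(papers, num_random_negatives=5):
    """Build (query, candidate, label) pairs from a citation graph.

    papers: dict mapping paper_id -> {"text": ..., "citations": [paper_ids]}
    """
    pairs = []
    all_ids = list(papers)
    for pid, paper in papers.items():
        cited = {c for c in paper["citations"] if c in papers}
        # True pairs: papers that are actually cited.
        pairs.extend((pid, c, 1) for c in cited)
        # One-hop negatives: cited by a cited paper, but not cited directly.
        for c in cited:
            for hop in papers[c]["citations"]:
                if hop in papers and hop not in cited and hop != pid:
                    pairs.append((pid, hop, 0))
        # Random negatives: up to five uncited papers sampled at random.
        candidates = [c for c in all_ids if c != pid and c not in cited]
        for c in random.sample(candidates, min(num_random_negatives, len(candidates))):
            pairs.append((pid, c, 0))
    return pairs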

We have some initial findings showing that the pairs we pick play an important role. We report accuracy below; the first number is on the training set and the second is on the dev set:

20K true pairs and 20K one-hop negatives: 0.75 / 0.47
10K true pairs, 10K one-hops, and 10K random negatives: 0.91 / 0.65
Second model with smaller dimensions: 0.91 / 0.63

For error analysis, we looked into the difference between one-hop pairs and random negative pairs, as well as the predictions of the model. When we looked at the one-hop pairs, as one of our TAs pointed out in class, it was difficult even for a human to tell the paper that was actually cited apart from the one-hop-away paper. For example, we had one query paper called “A Listwise Approach to Coreference Resolution in Multiple Languages” that cited “Supervised Noun Phrase Coreference Research” but did not cite “Enforcing Transitivity in Coreference Resolution”. I had no idea which one to pick! Perhaps if we had the full paper text, we could have used the specific text in which the reference occurs, but with only titles and abstracts it was nearly impossible to tell. It makes sense that papers in these close citation chains would have very similar topics. However, as one would expect, the random negatives produced truly unrelated examples, such as one paper from NLP and one from security. These are much easier to tell apart, so it makes sense that our model would be able to learn a more general idea of similarity by comparing the true pairs with the random negatives.

The model’s predictions seem decent. One thing we have found so far is that the model struggles with text that does not have correct spelling. The difficulty of identifying misspelled words could be compounded by different word choices between native and non-native speakers. One tricky example we found was

“This paperdescribesa wide-coveragestatistical parserthatusesCombinatoryCategorial Grammar (CCG) to derive dependenc y structures”

This may also be an issue with the original parser for the papers, so we will also look into this.

Here’s an example of a correct prediction that the first paper does cite the second. This example is actually fairly interesting because it is not clear on the surface that these documents would be related. However, it’s possible that it is simply learning the difference between NLP and security papers, or more optimistically it is learning something more nuanced about the text:

{
  "logits": [
    1.8649840354919434,
    -2.0288710594177246
  ],
  "query_paper": "MPQA 3.0: An Entity/Event-Level Sentiment Corpus This paper presents an annotation scheme for adding entity and event target annotations to the MPQA corpus, a rich span-annotated opinion corpus. The new corpus promises to be a valuable new resource for developing systems for entity/event-level sentiment analysis. Such systems, in turn, would be valuable in NLP applications such as Automatic Question Answering. We introduce the idea of entity and event targets (eTargets), describe the annotation scheme, and present the results of an agreement study.",
  "candidate_paper": "Local and Global Algorithms for Disambiguation to Wikipedia Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call \u201cglobal\u201d approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.",
  "class_probabilities": [
    0.9800398349761963,
    0.019960155710577965
  ],
  "label": "1"
}

We also set a stretch goal of hooking up the model’s predictor to our ranking script to use as a scoring function. However, we found out today that this is prohibitively slow: the model with the smaller dimensions takes an average of 75 seconds per ranking. This makes sense because, for each ranking, we currently need to iterate through all the training papers and pass the dev paper together with each training paper to the model to score. With more than 20,000 training papers, this takes a long time. The model’s results also vary widely: some rankings place the cited paper as high as position 11, while several place it in the tens of thousands. The use of this model will need to be analyzed closely. Note that there are also some differences between this lightweight version and the larger version we analyzed in the previous paragraphs.
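
For context, the scoring loop looks roughly like the sketch below, where score_pair stands in for a call to the AllenNLP predictor (the names here are illustrative, not our exact script). One model call per candidate is exactly what makes a single ranking so slow:

def rank_candidates(dev_paper_text, train_papers, score_pair):
    """Rank every training paper for one dev paper.

    score_pair(query_text, candidate_text) returns the model's probability
    that the query cites the candidate. With 20,000+ training papers this
    means 20,000+ model calls for a single ranking.
    """
    scored = [(score_pair(dev_paper_text, text), pid)
              for pid, text in train_papers.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored]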

Just as we are writing this, we realize that we are currently drawing the pairs from the complete dataset when we should only be drawing them from the training set. Otherwise, we might see dev or test papers early, which would be cheating. We will address this as our next task. We also currently allow a negative pair to consist of the same paper twice, which we shouldn’t allow in our selection of pairs. Other next steps include considering an approximation that avoids the N² computation with our supervised model and thinking about our model’s architecture more carefully, since it is very bare bones right now.

Cosine Similarity on Word Embeddings

As discussed in previous blog posts, we have formulated our problem as “what are the top-k most similar papers for a single input paper?” To answer this, this week we experimented with applying cosine similarity to word embeddings. We used two main methods to generate the embeddings: Doc2Vec, which encodes the semantic meaning of a whole abstract in a single vector, and BERT, which generates contextualized word embeddings.
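
As a refresher, once we have one vector per abstract, ranking the top-k most similar papers by cosine similarity looks roughly like this minimal sketch (not our exact ranking script):

import numpy as np

def top_k_similar(query_vec, candidate_vecs, candidate_ids, k=10):
    """Return the ids of the k candidates most cosine-similar to the query."""
    candidates = np.asarray(candidate_vecs, dtype=np.float32)
    query = np.asarray(query_vec, dtype=np.float32)
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-8)
    best = np.argsort(-sims)[:k]
    return [candidate_ids[i] for i in best]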

In the case of Doc2Vec, we generate embeddings for all the abstracts in both our development and training sets, creating a single vector for each abstract. To do this, we make sure to UNK all unknown words as well as words that appear in the training data fewer than UNK_THRESHOLD times (which we set to 3 in this case). We do not perform any lemmatization on the words. We are currently running this model on the GPUs and will hopefully present some form of results in our next blog post!
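
A minimal sketch of this step, assuming gensim’s Doc2Vec implementation (which may not match our exact training script), looks like the following:

from collections import Counter
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

UNK_THRESHOLD = 3
UNK = "@@UNK@@"

def train_doc2vec(train_abstracts):
    """train_abstracts: dict of paper_id -> list of tokens (no lemmatization)."""
    counts = Counter(tok for tokens in train_abstracts.values() for tok in tokens)
    docs = [
        TaggedDocument(
            words=[tok if counts[tok] >= UNK_THRESHOLD else UNK for tok in tokens],
            tags=[pid])
        for pid, tokens in train_abstracts.items()
    ]
    model = Doc2Vec(docs, vector_size=300, min_count=1, epochs=20)
    # One vector per training abstract; dev abstracts get vectors via infer_vector.
    return model, {pid: model.dv[pid] for pid in train_abstracts}

Development abstracts then get the same UNK treatment (using the training counts) and are embedded with model.infer_vector.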

At the same time, we also generate contextualized word embeddings with BERT. For this, we generate a BERT embedding for each word in the abstract and then average these embeddings to produce a single vector per abstract. While implementing this, we ran into several problems because we were trying to extract word embeddings rather than complete a specific task, so this took a lot longer than we expected. We are currently running experiments on this model as well, and hopefully we’ll have experimental results to report in our next blog post!
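
Roughly, the averaging step looks like the sketch below. It uses the Hugging Face transformers library purely for illustration; the exact setup we are running on the GPUs may differ:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_abstract(text):
    """Average BERT's contextualized token embeddings into one vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (1, num_tokens, hidden_size); average over tokens.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)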

As discussed in class, due to GPU and memory issues as well as bugs in our models, creating and running these models has taken a lot longer than we anticipated. As such, for the unsupervised learning techniques we do not have solid error analysis results to report as of yet. Stay tuned for error analysis in our next blog post though!

More Supervised Models

We have some ideas for our future plan. We are considering taking more features of our dataset into account instead of only encoding the title and abstract and using citations as the label. Also, once we have better models, we will start tuning hyperparameters to get better results.

As we mentioned in our updates in lecture, one part of our next advanced attempt (#2) will be Learning to Rank. Learning to rank (LTR) is the application of machine learning to ranking problems in information retrieval systems. The three main approaches to learning to rank are pointwise, pairwise, and listwise, and each has its own pros and cons. We already have baselines and a supervised model for the pairwise approach, so listwise is the new one we are going to work on this week. We are going to implement ListNet (Cao et al., 2007), one of the earliest listwise models, as a starting point. The main challenges will be defining the function we use to calculate the score and choosing the evaluation metric. We are looking forward to getting some real rankings from this model.
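
At its core, ListNet turns each list of scores into a “top-one” probability distribution with a softmax and minimizes the cross entropy between the distribution induced by the gold relevance scores and the one induced by the model’s scores. A minimal PyTorch sketch of that loss (not yet wired into our model) looks like:

import torch
import torch.nn.functional as F

def listnet_loss(predicted_scores, true_scores):
    """ListNet top-one cross entropy (Cao et al., 2007).

    Both tensors have shape (list_size,): one score per candidate paper for
    a single query. Softmax turns each score list into a distribution over
    "which candidate is ranked first", and we minimize the cross entropy
    between the gold and the predicted distributions.
    """
    true_top_one = F.softmax(true_scores, dim=0)
    log_pred_top_one = F.log_softmax(predicted_scores, dim=0)
    return -(true_top_one * log_pred_top_one).sum()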

Group Feedback

We have decided to shift our meetings from coding to status updates and brainstorming. Most of our work is parallelized now, so the whole team no longer needs to write code together during meetings, and we can spend more time individually on our assigned tasks.

We have also decided to invest some time in looking for alternative approaches. It doesn’t have to be a lot, but we believe it’ll be fruitful to try new things. For instance, we are working on BERT, but we decided it would be interesting to see scores for Doc2Vec and ELMo as well. For supervised learning approaches, the jury is still out on what the best model would be. Hopefully with the added time, we can research better models and refine our approaches.

That is it for this week! See you in the next blog post. 👋👋👋

Here’s the link for week 6’s blog.

Bibliography

Bhagavatula, C., Feldman, S., Power, R., & Ammar, W. (2018). Content-Based Citation Recommendation. In Proceedings of NAACL-HLT 2018.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning (pp. 129–136). ACM.
