Blog Post 8: It’s all fun and games.

Mitali Palekar
Published in GatesNLP · 7 min read · May 17, 2019

Welcome back! This week we have more progress on our unsupervised model using BERT and our supervised model using AllenNLP! For the unsupervised model, we fixed some bugs and made some improvements, not only optimizing the speed of our model (!!) but also generating some interesting results! For our supervised model, we made modest gains by trying several variations of the task to see how different approaches compare.

Pairwise Unsupervised Model using BERT

To recap, we planned to use contextualized embeddings (specifically BERT) to generate document embeddings for our abstracts. However, as a lot of groups have already mentioned, this is definitely easier said than done :(
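For context, the kind of document embedding we are after looks roughly like the sketch below: feed the (lowercased) abstract through BERT and mean-pool the token embeddings into a single vector. This is a minimal sketch using the Hugging Face transformers library, not our exact pipeline.

```python
# Minimal sketch: mean-pool BERT token embeddings into one document vector.
# Uses the Hugging Face `transformers` library; our actual pipeline differs.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_abstract(abstract: str) -> torch.Tensor:
    """Return one vector for an abstract: the mean of its token embeddings."""
    inputs = tokenizer(abstract.lower(), return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state has shape (1, num_tokens, hidden_size)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```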

We ran into quite a few problems while using BERT. First and foremost, our biggest challenge was memory. We found that very often, at around 60% of evaluation completion, our model would crash. While it crashed at around the same spot, the exact spot was different every time, as was the error message. We spent a huge amount of time trying to fix this, ultimately landing on the approach of writing the training embeddings to a file and reading them back. When we first implemented this, it would also crash and take a long time to run. After much debugging and a few different bug fixes, we got our script to run in around 1.5 hours. At the very least, this lets us iterate on our script quickly (much faster than the earlier version, which took around a day to run).
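The caching idea itself is simple; a rough sketch follows (the file name and the embed_fn argument are illustrative, not our actual script):

```python
# Sketch of the caching idea: pay the embedding cost once, then read from disk.
# The file name and the embed_fn argument are illustrative.
import os
import numpy as np

CACHE_PATH = "train_embeddings.npy"

def get_train_embeddings(abstracts, embed_fn):
    """embed_fn maps one abstract string to a 1-D numpy array."""
    if os.path.exists(CACHE_PATH):
        return np.load(CACHE_PATH)           # fast path: reuse cached vectors
    embeddings = np.stack([embed_fn(a) for a in abstracts])
    np.save(CACHE_PATH, embeddings)          # slow path happens only once
    return embeddings
```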

However, much to our dismay, when we ran BERT (after the above fixes), the scores were extremely poor. We achieved the following results:

Min rank = 1
MRR = 0.004468199495249607
Matching citation count = 1551

Note that these are the updated results obtained with bert-uncased; to use this model, we made sure to lowercase the entire abstract. Additionally, we tried adding segment ids (which indicate which sentence each token comes from), but that didn't seem to help much either. With such a poor MRR, we postulated that there might be bugs in our code; however, we weren't able to find anything significant.

When we talked with the TAs about our results, they pointed out that it's actually really hard to debug BERT and figure out where specifically the issue is, whether it's in the embeddings or in the cosine similarity. As such, we're planning on implementing GloVe embeddings over the weekend and hopefully getting back to fixing the issues with BERT early next week. We also performed some error analysis, as shown below!
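For reference, the evaluation step itself is straightforward, which is part of why we suspect the embeddings rather than the ranking. Below is a small sketch of ranking candidates by cosine similarity and computing MRR; variable names are illustrative and this is not our exact evaluation script.

```python
# Sketch of the ranking and MRR computation; names are illustrative.
import numpy as np

def mean_reciprocal_rank(query_vecs, candidate_vecs, gold_sets):
    """gold_sets[i] is the set of candidate indices that query i actually cites."""
    # Normalize rows so that a dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = q @ c.T                                    # (num_queries, num_candidates)
    reciprocal_ranks = []
    for i, gold in enumerate(gold_sets):
        if not gold:
            continue                                  # skip queries with no citations
        order = np.argsort(-sims[i])                  # most similar candidate first
        rank = next(pos + 1 for pos, idx in enumerate(order) if idx in gold)
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```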

Error Analysis

At a high level, we realized that our BERT model didn't do as well as we expected. The poor MRR is fairly evenly distributed: the model is not doing noticeably better on one specific set of papers and worse on others.

Here are the specific details for a single paper:

Sample Evaluation Paper Title: ‘CREDAL: Towards Locating a Memory Corruption Vulnerability with Your Core Dump’
Top Ranked Paper: ‘Enriching Interlinear Text using Automatically Constructed Annotators’
Top Correctly Ranked Paper rank: 6570

From these results, we see exactly what we expected: our model isn't really learning anything well, and the results aren't very strong. What this sort of error analysis does tell us, however, is that the problem seems to lie in our document embeddings, which are what we use BERT to generate.

Continuing with the Pairwise Supervised Model

This week, we also grappled with ways to improve and apply our pairwise model to our ranking problem. As a reminder, the model is trying to predict if the first “query” paper cites the second “candidate” paper, given each paper’s text. Though the supervised model is currently too slow to reasonably evaluate on every pair of training and dev papers, we now have more information about our data and what our model can learn from it to inform next week’s work. Importantly, it seems that the pairwise model from last week was not learning patterns that generalized. The first sign was when I was getting 0.63 dev accuracy when the split was ⅓ positive (actually cited) and ⅔ negative (not cited) examples, and 0.48 on a 50/50 split. Since these values were just below the accuracy of picking the most common answer, I focused on improving the model’s learning this week.

The first key takeaway is that the way we extract pairs from the dataset has a massive impact on the results. Model changes such as adding GloVe embeddings, extra neural layers, and bi-directionality made little difference compared to the choice of which pairs we trained and validated on.
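For concreteness, the general shape of the pairwise classifier is sketched below in plain PyTorch: encode each paper's text with a (bi-directional) LSTM over word embeddings, concatenate the two encodings, and classify cite vs. not cite. Our actual model is configured in AllenNLP, so the layer sizes and details here are illustrative.

```python
# Simplified PyTorch sketch of a pairwise citation classifier; our real model
# is built in AllenNLP, so the sizes and details here are illustrative.
import torch
import torch.nn as nn

class PairwiseCitationClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # could load GloVe here
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),                           # cites / does not cite
        )

    def encode(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.encoder(embedded)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([hidden[0], hidden[1]], dim=-1)        # (batch, 2 * hidden_dim)

    def forward(self, query_ids, candidate_ids):
        query_vec = self.encode(query_ids)
        candidate_vec = self.encode(candidate_ids)
        return self.classifier(torch.cat([query_vec, candidate_vec], dim=-1))
```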

Last week, I took the pairs starting from the beginning of the file with only a small amount of randomness (for instance, in picking the second paper in random negative pairs), but this still left quite a lot of the pair choices up to the order of the file. This week, I made the choices much more random (though still enforcing the 50/50 positive/negative split), and these changes finally made my models rise above the “most common answer” benchmark.
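The sampling we moved to looks roughly like the sketch below: draw positives uniformly from the citation pairs and negatives from random non-cited pairs, enforcing the 50/50 split. The cites mapping and function names are illustrative, not our exact script.

```python
# Sketch of fully random pair sampling with an enforced 50/50 split.
import random

def sample_pairs(cites, num_pairs, seed=0):
    """cites maps a paper id to the set of paper ids it cites (illustrative)."""
    rng = random.Random(seed)
    paper_ids = list(cites)
    all_citations = [(a, b) for a, cited in cites.items() for b in cited]
    # Positives: randomly chosen cited pairs.
    positives = [(a, b, 1) for a, b in rng.sample(all_citations, num_pairs // 2)]
    # Negatives: random pairs that are not cited.
    negatives = []
    while len(negatives) < num_pairs // 2:
        a, b = rng.choice(paper_ids), rng.choice(paper_ids)
        if a != b and b not in cites[a]:
            negatives.append((a, b, 0))
    examples = positives + negatives
    rng.shuffle(examples)
    return examples
```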

Further improvements came from taking out the difficult “one-hop” negative examples (Bhagavatula et al., 2018). As a reminder, these are examples A and C where A cites B and B cites C, but A doesn’t cite C. This is a negative example (i.e. “not related”), but the content will be very similar since they are all related papers. Though the model’s poor performance still confuses me a bit, my hypothesis is that the one-hop papers were confusing the model as to what was related or not, so the model was mostly learning from noise. However, getting a random sampling of positive and negative examples without one-hops gives the model a more general idea of what is relevant or not.
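Identifying and dropping one-hop negatives is cheap given the citation graph; here is a sketch, using the same illustrative cites mapping as above:

```python
# Sketch: drop "one-hop" negatives, i.e. pairs (a, c) where a cites some b
# that cites c, but a does not cite c. `cites` is the illustrative mapping above.
def is_one_hop_negative(a, c, cites):
    if c in cites[a]:
        return False                      # it's a positive, not a negative
    return any(c in cites.get(b, set()) for b in cites[a])

def filter_one_hop_negatives(examples, cites):
    return [(a, b, label) for a, b, label in examples
            if label == 1 or not is_one_hop_negative(a, b, cites)]
```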

I also tried shuffling the data before writing it to disk, and it did seem to make a difference, though I am not sure why, since AllenNLP's trainer already shuffles for us. Perhaps more randomness helps the trainer send positive and negative examples to the model more evenly, instead of all the positives and then all the negatives.

We made a final improvement by increasing the size of our dataset, first doubling and then quadrupling it. Interestingly, the larger the dataset got, the lower the training accuracy and the higher the dev accuracy. This points towards the model learning something generalizable, since it does similarly well on both sets.

Below are the results of each model on the dataset it was trained/tuned on. Though we adapted the task to be easier, it is exciting that we have much better results now! For context, I started with 30,000 pairs total, and then later tried 60,000 and 120,000. These are each split into 80%, 10%, and 10% between training, development, and test sets respectively. Also note that I adjusted a couple model parameters throughout the experiments as noted in the table, so these are not perfect comparisons.

Our results for the different types of datasets that we tried

Now that we have a model that has learned something about what signals that one paper cites another, we have several directions to go. One is analysis: looking more at the model's predictions and at the agreement between this model and our unsupervised methods. Another is application: looking at ways to filter out options before doing pairwise comparisons on a smaller set of pairs, to make the ranking evaluations computationally reasonable. We can do this by putting all the papers in an embedding space using triplet loss as Bhagavatula et al. (2018) do, or by using prototypes as a fellow capstone group is doing (Hathi et al., 2019).

We also want to look into how this model can be used for rankings. Specifically, we want to determine whether the confidence score from the improved binary classification model is useful as a measure of paper relevance, or whether a multi-class model predicting multiple degrees of relevance (say 1 through 5) would help us rank. Starting from the work of Guo et al. (2017), we did significant research into calibration methods such as Platt scaling that output proper probabilities. However, since methods like Platt's don't change the relative confidence between inputs (which is what would matter for ranking), we could not find a clear way to make the neural network's output better represent a confidence score (Zygmunt, 2014). This makes sense, because the model is only learning from a binary decision, so there are no degrees of confidence it can directly learn. A multi-class relevance scoring system, however, would need to make more assumptions about the metadata, which leads us to the analysis below.
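To make the calibration point concrete: Platt scaling fits a sigmoid on top of the model's raw scores, which is a monotonic transformation, so it changes the probability values but never the ordering of candidates. A small sketch with hypothetical scores and scikit-learn:

```python
# Sketch: Platt scaling is a monotonic map over raw scores, so it changes the
# probability values but not the relative ordering of candidates.
# The scores and labels below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_scores = np.array([-2.1, -0.3, 0.4, 1.7, 3.0]).reshape(-1, 1)  # model outputs
labels = np.array([0, 0, 1, 1, 1])                                  # held-out labels

platt = LogisticRegression()
platt.fit(raw_scores, labels)
calibrated = platt.predict_proba(raw_scores)[:, 1]

# The calibrated probabilities differ from the raw scores, but sorting by
# either gives the same order, so the ranking of candidates is identical.
print(np.argsort(raw_scores.ravel()))
print(np.argsort(calibrated))
```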

Dataset Analysis

We also did some analysis on the pairwise dataset. A pair <paper1, paper2> means that paper1 cites paper2. Out-citations are the citations a paper makes, and in-citations are the papers that cite it.

In the pairwise dataset, we found that a relatively high percentage of cited pairs share entities, out-citations, and in-citations. 72% of pairs share at least one out-citation (average 2.62, maximum 49). 59% of pairs share at least one in-citation (average 4.87, maximum 1712). 63% of pairs share at least one entity (average 1.14, maximum 17). It is not surprising that shared authors and venues are relatively rare, since there are many researchers in the world and many academic conferences in any given research field.

Our results for our dataset analysis
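A sketch of how overlap statistics like these can be computed for one feature (the field name and the averaging convention are illustrative):

```python
# Sketch of the overlap statistics for one feature, e.g. shared out-citations.
# `papers` maps a paper id to its metadata dict; the field name is illustrative,
# and the average here is taken over pairs that share at least one item.
def overlap_stats(pairs, papers, field="outCitations"):
    overlaps = [len(set(papers[a][field]) & set(papers[b][field]))
                for a, b in pairs]
    shared = [n for n in overlaps if n > 0]
    return {
        "percent_with_overlap": 100.0 * len(shared) / len(overlaps),
        "average_overlap": sum(shared) / len(shared),
        "max_overlap": max(shared),
    }
```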

We are almost there! This is all we have for this week! See you in the next blog post 👋👋👋

Bibliography

Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar. “Content-Based Citation Recommendation”. NAACL-HLT, 2018.

Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger. “On Calibration of Modern Neural Networks.” ICML, 2017.

Shobhit Hathi, Yifan Xu, Michael Zhang. “Blog Post 7.” 2019.

Zygmunt Z. “Classifier calibration with Platt’s scaling and isotonic regression.” FastML, August 2014.
