Blog Post 6: To Err is Human.

Mitali Palekar · GatesNLP · May 3, 2019

Hello again! This week was a ton of bug fixing (which meant improved baseline scores, yay!), improving code quality (which hopefully means fewer bugs in the future), and the beginning of our implementation of new models using contextualized word vectors.

Fixing those darn bugs

Last week, we reported that our baseline scores weren’t as strong as we had hoped, and that TF-IDF performed much worse than our other baseline, Jaccard similarity, which didn’t match up with current literature. As such, early this week we dove into our code head first, carefully going through everything we had written. We looked closely at the way we were ranking our papers and realized that we were computing the sorted ordering, but weren’t taking this ordering into account when determining the “top-k” papers related to a single paper in our development set. Once we fixed this bug, we achieved much better scores, and our findings matched those of academic literature. Hooray!
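To make the fix concrete, here’s a minimal sketch of the corrected ranking logic. This isn’t our actual code; `train_papers` is a hypothetical dict mapping paper IDs to token sets, and the point is simply that we now sort by score before slicing the top k.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_top_k(query_tokens: set, train_papers: dict, k: int = 10) -> list:
    """Score every training paper, then sort by score (descending)
    BEFORE taking the top k -- the step our original code skipped."""
    scores = [(pid, jaccard(query_tokens, tokens))
              for pid, tokens in train_papers.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```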

Here are our updated scores:

Jaccard similarity score without lemmatization = 0.19
Jaccard similarity score with lemmatization = 0.25
TF-IDF score without lemmatization = 0.36

These scores were recorded on our dataset of ACL, EMNLP, and NAACL papers. They also match current literature, which states that TF-IDF should perform better than Jaccard similarity (and that lemmatization should improve performance). We’re definitely super excited about these scores, as they show us that we’re heading in the right direction, yay!
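For reference, the TF-IDF baseline can be set up along the lines below with scikit-learn. This is a sketch under the assumption that each paper is represented by its abstract text, not our exact pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top_k(query_text: str, train_texts: list, k: int = 10):
    """Fit TF-IDF on the training papers, then rank them against the
    query paper by cosine similarity (highest first)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    train_matrix = vectorizer.fit_transform(train_texts)
    query_vec = vectorizer.transform([query_text])
    sims = cosine_similarity(query_vec, train_matrix).ravel()
    return sims.argsort()[::-1][:k]
```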

Improving our dataset

Last week, we also reported that we curated a dataset of papers from NLP conferences: EMNLP, ACL, and NAACL. However, this dataset only contained ~7500 papers. We thought this was a little too small, so we sought to extend our dataset to a second field of computer science, namely security. We chose this field because one of our team members, Mitali, has done research in it and has connections to graduate students and professors working in this area.

We added papers from the following five venues (together, these seem to cover a large portion of the work in the security field):

  • ACM Conference on Computer and Communications Security
  • IEEE Symposium on Security and Privacy
  • IEEE International Conference on Information Theory and Information Security
  • IEEE Transactions on Information Forensics and Security
  • USENIX Security Symposium (and its related workshops)

Once we added in papers from these conferences, we had around ~26.5K papers in our dataset from the subfields of computer security and NLP. After curating this dataset, we then ran our old baselines again. Below are our updated scores:

Jaccard similarity with no lemmatization: 0.21894744805412408
Jaccard similarity with lemmatization: 0.290622022347491
TF-IDF score without lemmatization: 0.3679469330736681

Extending our dataset to include security conferences didn’t have an adverse impact on our evaluation metrics. This is expected and desired behavior, since the language/vocabulary used in different subfields is different, and as such our models, which use token-wise similarity, tend not to cluster papers from different subfields together. These results are also something we are super excited about, as they mean that our current models are valuable across fields and datasets. It’s definitely looking a lot more promising now!

Implementing contextualized word embeddings (BERT)

This week we also began our first implementation of our advanced model — specifically using BERT. We plan to leverage BERT to produce word embeddings of the papers in our dataset and then apply cosine similarity to determine the top-k papers in the training set for papers in the test/development set.
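Concretely, once each paper has a single embedding vector, that ranking step is just a cosine-similarity search over the training set. Here’s a minimal sketch, assuming the training papers’ embeddings have already been stacked into a matrix (how we get those vectors is discussed below):

```python
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, paper_matrix: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the indices of the k training papers whose embeddings are
    most similar to the query embedding, by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    P = paper_matrix / np.linalg.norm(paper_matrix, axis=1, keepdims=True)
    sims = P @ q                      # cosine similarity to every training paper
    return np.argsort(-sims)[:k]      # indices sorted by similarity, descending
```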

Getting this working was actually a pretty large project! For one, half of our group had taken NLP last year, when BERT/ELMo was not yet a thing. So, we first spent some time reading up on current literature and blog posts to understand how BERT works as well as the different tasks it’s trained on. Once we understood the logic behind BERT as well as its inner workings, we were able to dive right in and begin thinking about coding things up.

As we began to code things up, we ran into another roadblock: the current libraries that use BERT, as well as the code released by Google, are very easy to use if you want to perform a common NLP task such as question answering, NER tagging, etc. However, since our project isn’t a common NLP task and we instead want access to the word embeddings themselves, we needed to dig a little deeper. As we dug around a bit, we found a library built by Hugging Face that does something similar to what we want. After much back and forth (as well as a ton of “how does this work?!?” haha), we currently have a working implementation. We are running it overnight (once the GPUs start working again; as of right now, they’re super slow), and will hopefully have some interesting results to report in our next blog post!
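To give a flavor of what an implementation like this does, here is a rough sketch written against Hugging Face’s current transformers API (at the time we were working with their earlier pytorch-pretrained-bert package, so the exact calls differed). Mean-pooling the last hidden states is just one simple way to collapse token vectors into a single paper vector; it isn’t necessarily what our final version will do.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Encode a paper's text (e.g., its abstract) into a single vector by
    mean-pooling BERT's last-layer hidden states over the tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state has shape (1, seq_len, hidden_size)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```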

For now, it seems like we don’t have any confusing results, but I’m sure we’re going to run into some issues once we generate our first set of results from our BERT embeddings. Look out for some fun learnings in our next blog post!

So, what’s next?

We have two major goals for next week. First and foremost is to continue our implementation of contextual word embeddings using BERT. This is something I’ve talked about above, so I won’t focus on it much for now!

Our second biggest goal for next week is to port our codebase to AllenNLP. We want to do this because it will enable us to leverage SciBERT, contextualized word embeddings that have been trained on scientific literature and have been shown to perform better on scientific text than embeddings trained on general-domain text (e.g., Wikipedia). We have begun the process and already have a dataset reader to parse what we need for our citation prediction task. However, we are still figuring out how to design our model in coordination with the data, particularly how to feed pairs into the model in a way that doesn’t give us a vast majority of “not relevant” pairs. One option is to select all the cited pairs plus the same number of one-hop pairs (papers cited by one of the query paper’s citations, but not by the query paper itself), as in the citation recommendation paper by Bhagavatula et al. (2018). The idea is that these are harder examples, so the model would need to learn to make harder choices. We are also running into several issues with matrix multiplication, which we’re hoping to fix next week.
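Here’s a rough sketch of that pair-selection idea. The `citations` dict is a hypothetical structure mapping each paper ID to the set of paper IDs it cites; this illustrates the sampling scheme from Bhagavatula et al. (2018), not our in-progress AllenNLP code.

```python
import random

def build_training_pairs(citations: dict) -> list:
    """For each query paper, emit its cited papers as positives and an equal
    number of one-hop papers (cited by a cited paper, but not by the query
    itself) as hard negatives."""
    pairs = []
    for query_id, cited in citations.items():
        one_hop = {c for p in cited for c in citations.get(p, set())}
        one_hop -= cited | {query_id}
        negatives = random.sample(sorted(one_hop), min(len(cited), len(one_hop)))
        pairs += [(query_id, pos, 1) for pos in cited]
        pairs += [(query_id, neg, 0) for neg in negatives]
    return pairs
```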

Finally, over the next few weeks, we also plan to try out different models in addition to our current ones to see how they perform. We plan to try out word2vec/doc2vec and GloVe embeddings, both word embedding models, to see how they perform in conjunction with and separately from our current models. We also plan to use autoencoders or GANs in conjunction with word embeddings. In the case of supervised learning, we might use LSTMs. We are excited to see what different models we can explore and how each of them performs!

Observations/Error analysis

We also performed some error analysis, mostly based on our TF-IDF model. We found that TF-IDF does fairly well at picking relevant papers, especially when multiple shared rare topics make the text comparison closely track the content similarity.

It was interesting to note that extended examples in abstracts make recommendations more difficult, because a smaller portion of the abstract talks about the ideas of the work itself. There was an abstract that included a long example involving the words “doctor” and “nurse”, and we could imagine cases where these words could lead a model (especially one doing text comparison) away from the main ideas of the paper’s work.

We also found that methodology citations are not necessarily connected content-wise or textually to the actual NLP task, so these papers get low ranks even if they are cited. A common example here is citations of the GloVe paper: a lot of papers cite it because GloVe is a standard word embedding used in many NLP tasks, even though the NLP tasks themselves might be very different.

We also found that sometimes a finer distinction within a shared topic is not represented in text comparison (for instance, a shared context such as dialogue beats out the difference between generation and classification). Additionally, context, methods, or results sometimes get more words than the actual description of the work, which can make it hard to tell whether two papers have the same content.

That’s all our updates from this week! See you in the next one!

Bibliography

Iz Beltagy, Arman Cohan, and Kyle Lo. “SciBERT: Pretrained Contextualized Embeddings for Scientific Text.” arXiv preprint, 2019.

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. “Content-Based Citation Recommendation.” NAACL-HLT, 2018.
