Blog Post 9: The Quest Continues

Chia-ko Wu
May 25, 2019 · 9 min read


Hi all! We are back with updates on what we have done this last week and what we are preparing for our last couple weeks of the quarter. It is quickly approaching!

Supervising our Ranking Task

Coming off of last week, we were concerned about two main issues: improving our pairwise model accuracy and speeding up our ranking evaluation. We took steps to address both and, along the way, unearthed some more fundamental questions about our project.

First off, since calculating the Mean Reciprocal Rank (MRR) on the development set would take several weeks with the supervised model, we needed some sort of speedup to iterate through our experiments faster. We found one by batching the predictions needed to create each ranking so that each batch fits on the GPU. Through experiments (shown below), we found that a batch size of 650 was empirically the fastest. Moving from the CPU to the GPU made things significantly faster, but evaluation still takes two days. There is definitely room for improvement here.
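As a rough illustration of the batching idea (the `model` interface, tensor shapes, and function names here are placeholders for this sketch, not our exact code), the evaluation loop looks roughly like this:

```
import torch

BATCH_SIZE = 650  # empirically the fastest batch size in our experiments

def rank_candidates(model, query, candidates, device="cuda"):
    """Score one query paper against all candidates in GPU-sized batches.

    `model` is assumed to return a relevance score for each (query, candidate)
    pair in a batch; `query` and `candidates` are pre-encoded tensors.
    """
    model.to(device).eval()
    scores = []
    with torch.no_grad():
        for start in range(0, len(candidates), BATCH_SIZE):
            batch = candidates[start:start + BATCH_SIZE].to(device)
            # Repeat the query so it pairs with every candidate in this batch.
            queries = query.unsqueeze(0).expand(len(batch), -1).to(device)
            scores.append(model(queries, batch).squeeze(-1).cpu())
    # Higher score = more likely to be cited; return candidate indices by rank.
    return torch.argsort(torch.cat(scores), descending=True)
```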

With evaluation time improved, we took several steps to improve the accuracy of our pairwise model. The first was a more disciplined approach to separating the train, dev, and test sets used with the pairwise model. Since this model is in turn used to evaluate the dev and test rankings, the pair splits need to respect the divisions used for the MRR evaluation, which are based on papers sorted by year, so we drew each train, dev, and test pair from its corresponding paper split. We sized the splits around the smallest number of cited pairs, which came to 10% of the total, for 25,940 pairs overall. We also now enforce uniqueness within train, dev, and test, so we never see the same pair twice. With this new dataset, we did some hyperparameter tuning on the functions in the feedforward neural network component of the model, getting the best results with a sigmoid nonlinearity followed by a linear layer (as pictured below). These scores are comparable to those from our previous dataset of 30,000 pairs (which also reached 0.71 dev accuracy, though with 0.85 training accuracy).
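The winning feedforward configuration is simple to sketch; the dimensions below are placeholders, since the real sizes depend on the encoder that produces the pair representation upstream:

```
import torch.nn as nn

# Hypothetical sizes for illustration only.
PAIR_DIM, HIDDEN_DIM = 512, 128

# Feedforward component: a hidden layer with a sigmoid nonlinearity,
# followed by a linear layer producing a single cited/not-cited score.
pairwise_head = nn.Sequential(
    nn.Linear(PAIR_DIM, HIDDEN_DIM),
    nn.Sigmoid(),
    nn.Linear(HIDDEN_DIM, 1),
)
```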

To make the pairs more similar to our task and let us draw more pairs while maintaining an 80/10/10 split, we tried a second extraction process: training pairs come only from within the training set of papers, dev pairs from citations going from dev to train, and test pairs from citations going from test to train. We also experimented with frozen versus fine-tuned GloVe embeddings. The scores here again improved, jumping to 0.81 train and 0.82 dev accuracy (more scores below). This matches what the old pair extraction achieved with the larger dataset (0.79 train, 0.81 dev), which suggests that the additional data, rather than the new pair extraction, is responsible for the higher scores. It was good to see that we weren't relying on mistakes in our dataset to get such high accuracies.
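A simplified version of this second extraction scheme, assuming the citations are (citing, cited) ID pairs and each split is a set of paper IDs (our actual data handling is more involved):

```
def extract_pairs(citations, train_ids, dev_ids, test_ids):
    """Assign citation pairs to splits: train pairs stay within the train
    papers, while dev and test pairs cite from their split into train."""
    train_pairs, dev_pairs, test_pairs = set(), set(), set()
    for citing, cited in citations:
        if citing in train_ids and cited in train_ids:
            train_pairs.add((citing, cited))
        elif citing in dev_ids and cited in train_ids:
            dev_pairs.add((citing, cited))
        elif citing in test_ids and cited in train_ids:
            test_pairs.add((citing, cited))
    # Sets give the uniqueness guarantee: no pair is seen twice.
    return train_pairs, dev_pairs, test_pairs
```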

In terms of error analysis, since our model did not substantially change, it still generally picks the correct relevant pairs, but it is sometimes confused by different terms used for the same concept (e.g. “sense” vs. “word sense”) and by very similar papers that happen not to cite each other. In one false negative where both the query and candidate paper discuss social media, the model may have been misled by the candidate paper using only its first sentence to describe the field it works in and spending the rest of the abstract on non-task-specific questions and goals such as “What are the CENTRAL PROPOSITIONS associated…” One hypothesis from the predictions is that the model learns better from longer sequences of matching text than from shorter similarities (i.e. single words), because the smaller similarities may be more easily forgotten as the model iterates through the text. For instance, in a correctly predicted positive example, both papers included the phrase “Neural Machine Translation,” which is by far the clearest match between the two to a human looking at the examples.

With these new models, we kicked off the evaluation again; it is still running. It seems that the speedup from a model with smaller dimensions on a CPU is mostly obscured by the speedup of moving to the GPU. For the first 676 iterations, the MRR is 0.09, which is lower than both our baselines but reasonable.

Unsupervised Learning

To recap from our previous blog post, we were running into significant challenges getting BERT to produce good results. Consequently, after talking with the course staff, we decided to focus on implementing GloVe as our unsupervised approach, with the hope that this would help us uncover the problems with BERT.

We first implemented our base GloVe model. To do this, we used pre-trained 50-D GloVe vectors (trained on the Wikipedia dataset) and took the average of the individual word embeddings to get a document embedding [Pennington, 2014]. In this case, each document corresponded to a single paper title, since using abstracts was taking too long (around a week to evaluate all the papers in our dataset). We then used Word Mover's Distance to determine how close two titles were. With this setup, we achieved the following results:

```
min rank = 1
MRR = 0.17574750606967016
```
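A minimal sketch of this title-comparison step, assuming gensim's packaged GloVe vectors and its built-in Word Mover's Distance (which needs the pyemd/POT dependency); our actual pipeline differs in its data handling:

```
import gensim.downloader as api

# gensim's packaged 50-dimensional GloVe vectors.
glove = api.load("glove-wiki-gigaword-50")

def title_distance(title_a, title_b):
    """Word Mover's Distance between two lowercased, whitespace-tokenized titles."""
    return glove.wmdistance(title_a.lower().split(), title_b.lower().split())

title_distance("Context-based Speech Recognition Error Detection and Correction",
               "Open-Domain Name Error Detection using a Multi-Task RNN")
```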

Once we had the base model working, we experimented with several different approaches. For one, instead of using pre-trained GloVe vectors, we trained our own GloVe vectors on our own corpus (the rest of the pipeline from the base model above stayed the same). This was much more complicated than we expected, since, among other things, we had to change the format of our data to match what the GloVe training scripts expect. We expected this to improve our scores, but unfortunately the scores were worse:

```
min rank = 1
MRR = 0.14325120590296692
```
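For reference, the GloVe training scripts want a single whitespace-tokenized text file, so the reformatting step looked roughly like the sketch below (the file name is a placeholder; the field names follow our dataset):

```
import json

# Flatten our paper records into one big lowercased text file,
# which is the format the GloVe training scripts consume.
with open("papers.jsonl") as src, open("glove_corpus.txt", "w") as dst:
    for line in src:
        paper = json.loads(line)
        dst.write(f"{paper['title']} {paper['paperAbstract']}".lower() + "\n")
```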

Additionally, we spent some time investigating other changes that might improve our model. After reading several blog posts (here and here), we realized that many people recommend lowercasing all input data and removing stop words. Initial analysis suggests this might improve our scores: on the first 100 eval titles, we achieved an MRR of 0.24. The full run is still in progress, and we will update the scores in our final report.
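A minimal version of that preprocessing, using scikit-learn's built-in English stop word list as a stand-in for whatever list those posts recommend:

```
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(title):
    """Lowercase a title and drop English stop words before embedding it."""
    return [tok for tok in title.lower().split() if tok not in ENGLISH_STOP_WORDS]

preprocess("The Quest for Content-Based Citation Recommendation")
# -> ['quest', 'content-based', 'citation', 'recommendation']
```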

Finally, to get the GloVe vectors running in a reasonable time (less than one week), we had to switch from abstracts to titles as input. As a result, we re-ran our baselines on titles instead of abstracts. Here are the results:

```
Jaccard Similarity — 0.1690233346351762 (titles only)
TF-IDF — 0.2793150964725672 (titles only)
```
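For reference, the Jaccard baseline on titles is just token-set overlap; a minimal version (with simplified tokenization) looks like:

```
def jaccard_similarity(title_a, title_b):
    """Token-set Jaccard similarity between two titles."""
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0
```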

Named Entity Recognition (NER)

Something outside our supervised and unsupervised models is Named Entity Recognition (NER). NER is widely used for information retrieval and other NLP tasks, so we want to see whether it could be an improvement that fits into our model. The model we tried uses SciBERT, a BERT model trained on scientific text that holds the state of the art for NER on scientific papers [Beltagy, 2019]. In our dataset, each paper comes with a list of entities. We match these given entities against the NER tags we extract from the input text with the pretrained model and rank candidates by their shared entities. As a small demo, we put the following paper into the model:

```
"title": "Context-based Speech Recognition Error Detection and Correction"

"paperAbstract": "In this paper we present preliminary results of a novel unsupervised approach for high-precision detection and correction of errors in the output of automatic speech recognition systems. We model the likely contexts of all words in an ASR system vocabulary by performing a lexical co-occurrence analysis using a large corpus of output from the speech system. We then identify regions in the data that contain likely contexts for a given query word. Finally, we detect words or sequences of words in the contextual regions that are unlikely to appear in the context and that are phonetically similar to the query word. Initial experiments indicate that this technique can produce high-precision targeted detection and correction of misrecognized query words."
```

We concatenate the title and the abstract as input, the pretrained model extracts its entities, and we get the result below. Each row is listed as <title, number of shared entities, shared entity list>, and the ranking is ordered by the number of entities shared with the input text.

1 Open-Domain Name Error Detection using a Multi-Task RNN, 4, shared: [‘vocabulary’, ‘error detection and correction’, ‘speech recognition’, ‘speech recognition’]

2 Limited-Domain Speech-to-Speech Translation between English and Pashto, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

3 Improved Alignment Models for Statistical Machine Translation, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

4 Automatic Chinese Abbreviation Generation Using Conditional Random Field, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

5 State-Transition Interpolation and MAP Adaptation for HMM-based Dysarthric Speech Recognition, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

6 Clitics in Arabic Language: A Statistical Study, 3, shared: [‘lexical analysis’, ‘speech recognition’, ‘speech recognition’]

7 Unsupervised Vocabulary Adaptation for Morph-based Language Models, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

8 Language Dynamics and Capitalization using Maximum Entropy, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

9 The “Casual Cashmere Diaper Bag”: Constraining Speech Recognition Using Examples, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

10 Using Chunk Based Partial Parsing of Spontaneous Speech in Unrestricted Domains for Reducing Word Error Rate in Speech Recognition, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

11 Characterizing and Recognizing Spoken Corrections in Human-Computer Dialogue, 3, shared: [‘error detection and correction’, ‘speech recognition’, ‘speech recognition’]

12 Automatic Editing in a Back-End Speech-to-Text System, 3, shared: [‘error detection and correction’, ‘speech recognition’, ‘speech recognition’]

13 Unsupervised Learning of Acoustic Sub-word Units, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

14 Processing Broadcast Audio for Information Access, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

15 Searching the Audio Notebook: Keyword Search in Recorded Conversation, 3, shared: [‘vocabulary’, ‘speech recognition’, ‘speech recognition’]

Since this is the state-of-the-art model for NER, it unsurprisingly does very well at recognizing entities. We are currently considering it as a feature or an add-on improvement to our model.

Some error analysis: the numbers of shared entities are all below 10, and it is difficult to build a fine-grained ranking from a score that only ranges from 0–10. As we can see in the results above, ranks 2 through 15 all share the same number of entities, so to make the ranking better we probably want a more specific scoring function. Our next step is therefore to add more factors into the ranking score, and we hope to have something better before the final week.
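For reference, a minimal sketch of the shared-entity scoring described above (the candidate dictionary is a placeholder; our real pipeline runs over the full dataset):

```
from collections import Counter

def shared_entities(query_entities, candidate_entities):
    """Multiset intersection of entity lists (duplicates count, which is why
    'speech recognition' can appear twice in the shared lists above)."""
    return list((Counter(query_entities) & Counter(candidate_entities)).elements())

def rank_by_entities(query_entities, candidates):
    """`candidates` maps each title to its entity list; rank by shared count."""
    scored = []
    for title, ents in candidates.items():
        shared = shared_entities(query_entities, ents)
        scored.append((title, len(shared), shared))
    return sorted(scored, key=lambda row: row[1], reverse=True)
```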

Action Plan

Our advanced solutions do not yet outperform our baselines, but they are becoming competitive. From here, we need to reconsider both our progress so far and what we can reasonably do in the next two weeks to reach a good stopping point.

One definite to-do is to fix a known issue with our TF-IDF baseline, where we currently use the dev and test sets when computing the training vectors; these splits should be better isolated for experimental integrity. From discussions with the course staff and inspiration from Bhagavatula et al. [2018], we also have ideas for using a TF-IDF indexer or our own word embeddings to take nearest neighbors, which would reduce how many candidate papers we need to rank with the models we have built so far. We are also considering a new evaluation framework centered around qualitatively analyzing random pairs, so we can see whether our models are (or are not) learning something the numbers aren't catching, as a more thorough investigation for our final report.
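The TF-IDF fix itself is straightforward; a minimal sketch with scikit-learn (the tiny title lists are placeholders for our real year-sorted splits):

```
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder title lists standing in for our real splits.
train_titles = ["Content-Based Citation Recommendation"]
dev_titles = ["Context-based Speech Recognition Error Detection and Correction"]

# Fit the vocabulary and IDF statistics on the training split only,
# then merely transform dev/test so no information leaks across splits.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
train_vecs = vectorizer.fit_transform(train_titles)
dev_vecs = vectorizer.transform(dev_titles)
```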

Specifically, on the unsupervised front, we are hoping to get our GloVe pipeline to use abstracts as input. For one, we realized that removing stop words speeds up our model by 100% (it's twice as fast), but that is still not enough. We are thinking about methods such as attention and PCA to reduce the dimensionality of the vectors that encode the abstracts. We plan to meet with the course staff early next week to get their intuition on what will and won't work.
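As a very rough sketch of the PCA option (random numbers stand in for our averaged abstract embeddings, and 20 output dimensions is an arbitrary choice for illustration):

```
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for averaged 50-D GloVe embeddings of the abstracts.
abstract_vectors = np.random.rand(1000, 50)

# Compress the abstract representations before the expensive distance step.
reduced = PCA(n_components=20).fit_transform(abstract_vectors)
```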

Bibliography

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-Based Citation Recommendation. In NAACL-HLT.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP.

Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained Contextualized Embeddings for Scientific Text. arXiv preprint arXiv:1903.10676.
