Blog Post #4: Strawman Time.

Bryan Hanner
GatesNLP · Apr 19, 2019

Hello, back so soon, you say? We had a couple of days to take in a lot of feedback, so here is a follow-up on how we are doing.

Strawman Approach

As a starting point, we split our dataset into train, dev, and test sets in an 80/10/10 split, which we hope to keep consistent for comparison with our later models. We used a simple bag-of-words comparison between each pair of papers to get a similarity score, specifically Jaccard similarity, which measures the fraction of words shared across both abstracts (intersection over union). We processed the text with spaCy tokenization and lemmatization, which assumes English text. Out of the resulting tokens, we keep only alphabetic tokens that are not stop words, and we ignore casing. We tried to remove the papers’ headers and citations so we could learn strictly from each paper’s semantic content, but struggled with this because it required parsing the raw text; this was resolved later by switching to the Semantic Scholar dataset. Preprocessing all of the paper text also took a significant amount of time, which was later resolved by our pivot to using only specific parts of the text, such as the abstract (described below).
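To make this concrete, here is a minimal sketch of the strawman similarity, assuming the en_core_web_sm English spaCy model is installed. The function names and example abstracts are ours for illustration, not our actual project code.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def preprocess(text):
    # Lowercased lemmas of alphabetic, non-stop-word tokens.
    return {tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha and not tok.is_stop}

def jaccard(tokens_a, tokens_b):
    # Shared words over all words across both texts (intersection over union).
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

abstract_a = "We study citation recommendation using paper abstracts."
abstract_b = "A study of recommending citations from the abstracts of papers."
print(jaccard(preprocess(abstract_a), preprocess(abstract_b)))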

Once we got to implementing our first draft of the evaluation framework and wanted to use citations for each paper, we ran into roadblocks because there was no citations field. We would have needed to parse the different parts of each paper out of the raw text in the initial NeurIPS dataset, which we found to be a non-trivial task. Fortunately, Iz at AI2 recommended using the Semantic Scholar corpus this afternoon (Ammar, 2018). This corpus proved very useful because it already has each paper organized by its components (citations, abstract, etc.) in a simple JSON format, and it will continue to be helpful as we experiment with using different characteristics of the papers to improve our models and framework. Once we changed datasets, our initial evaluation framework and training process fell into place, and we could conduct our experiments.
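As a rough idea of what working with the corpus looks like, the sketch below reads one JSON record per line and keeps the fields we care about. The exact field names ("paperAbstract", "outCitations") and the file name are assumptions for illustration, so treat them as placeholders for whatever the release you download uses.

import json

papers = {}
with open("sample-S2-records.json") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        papers[record["id"]] = {
            "title": record.get("title", ""),
            "abstract": record.get("paperAbstract", ""),
            "citations": record.get("outCitations", []),
        }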

Experiments

For this blog post, we prioritized following through on pivots (with the dataset, for example) over running multiple experiments with what we had already set up (and probably wouldn’t end up using). We hope this helped us do work that is more relevant to our project goals. With the Semantic Scholar datasets recently set up, we report the MRR score on the test set below. Since we only use citations to decide which papers are “relevant”, we simply ignore papers in the dev/test sets that don’t have any citations. Going forward, we should check our train, dev, and test sets to make sure they are still appropriately sized after this filter. Here is one example of input and output:

INPUT

Test File Title: 8-year analysis of the prevalence of lymph nodes metastasis, oncologic and pregnancy outcomes in apparent early-stage malignant ovarian germ cell tumors.

OUTPUT

Data in train that closely matches it (sorted from most to least relevant)

[

‘[Clinicopathologic analysis of primary pure squamous cell carcinoma of the breast].’,

‘Postoperative radiotherapy for adenocarcinoma of the ethmoid sinuses: treatment results for 47 patients.’,

‘Low-Dose Whole Brain Radiotherapy with Tumor Bed Boost after Methotrexate-Based Chemotherapy for Primary Central Nervous System Lymphoma’,

‘Truncus arteriosus: ten-year experience with homograft repair in neonates and infants.’,

‘Thorough debridement under endoscopic visualization with bone grafting and stabilization for femoral head osteonecrosis in children.’,

‘The Feasibility of Median Sternotomy With or Without Thoracotomy for Locally Advanced Non-Small Cell Lung Cancer Treated With Induction Chemoradiotherapy.’,

‘Multimodality treatment for sinonasal neuroendocrine carcinoma.’,

‘[Correction of high myopia with anterior chamber angle-supported phakic intraocular lenses — own results].’,

‘Influence of age on outcome after radical treatment of esophageal cancer with surgery or chemoradiotherapy (dCRT).’,

‘[Surveillance after orchiectomy for stage I testicular seminoma].’

]

MRR score: 0.00150147668 (we are addressing a bug in our system that is likely making this number too small)
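The ranking above comes from scoring every training paper against the test paper’s text and sorting from most to least similar. A rough sketch of that loop, reusing the helpers from the earlier snippets (all names are illustrative, not our exact code), looks like this:

def rank_train_papers(query_text, train_papers, top_k=10):
    # train_papers: {paper_id: {"title": ..., "abstract": ...}} as loaded above.
    query_tokens = preprocess(query_text)
    scored = [
        (jaccard(query_tokens, preprocess(paper["abstract"])), paper["title"])
        for paper in train_papers.values()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]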

Since AI2’s citation recommendation paper only used abstracts and titles for its models, we are interested in exploring how using other content would affect our ranking score, building off of their work (Bhagavatula, 2018). We plan to do another round of experiments once we get more feedback from Iz and finalize our evaluation framework.

Evaluation Framework

The first major change we made from the proposal is using only the abstract instead of the entire paper, to minimize the amount of text we are trying to encode in a vector. We also fleshed out how we will mark papers as relevant for our ranking evaluation: to have a baseline that is consistent with AI2’s citation recommendation paper, we started with mean reciprocal rank (MRR) to score each ranking, where a “relevant” paper is simply one that the query paper actually cites. In the future, we can include other information when creating our gold ranking (author/citation graphs, shared authors, etc.), as discussed in class today. At that point, we will likely need other ranking metrics that account for all relevant papers in some top k, because relevancy will no longer be a binary “relevant” or not. This is our first baseline, and we plan to nail down our gold rankings (which will also affect our evaluation metrics) by Thursday of next week.
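For reference, here is a hedged sketch of MRR with citation-based relevance, where a ranked paper counts as relevant only if the query paper actually cites it. The data shapes and names are illustrative, not our exact evaluation code.

def mean_reciprocal_rank(rankings, gold_citations):
    # rankings: {query_id: [train paper ids, ordered most to least similar]}
    # gold_citations: {query_id: set of train paper ids the query actually cites}
    reciprocal_ranks = []
    for query_id, ranked_ids in rankings.items():
        cited = gold_citations.get(query_id, set())
        if not cited:
            continue  # papers with no citations are skipped, as described in the Experiments section
        rr = 0.0
        for rank, paper_id in enumerate(ranked_ids, start=1):
            if paper_id in cited:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)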

It is important to note that we recently switched from the NeurIPS dataset to the Semantic Scholar datasets, specifically a smaller sample of 10,000 papers from 2017. As mentioned earlier, we found Semantic Scholar’s parsing of each paper into its components to be critical for accessing them reliably. For instance, we can now simply read the abstract field to retrieve that text for training, which is a big step as we continue defining this problem space.

We also learned today that the citation recommendation work at AI2 is more similar to what we want to do than we initially realized. This means we need to explore how we can extend what AI2 has done, take a unique approach to the problem, and/or target a somewhat different use case. This will all tie into our evaluation framework as we finalize what we are learning and how we want to evaluate it.

That’s all of our updates! See you in a week.

Bibliography

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, Oren Etzioni. “Construction of the Literature Graph in Semantic Scholar”. NAACL 2018.

Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar. “Content-Based Citation Recommendation”. NAACL-HLT 2018.
