Blog #2: Digging Deeper.

Bryan Hanner · Published in GatesNLP
Apr 12, 2019 · 7 min read

Hello! We are back with more thoughts on our project ideas for our capstone this quarter. We will go into the pros and cons, as well as libraries, tools, and datasets we might use for each.

Digging towards the light of a project idea: Photo by Dane Deaner on Unsplash

Idea 1: Analyzing model drift to understand what models are really learning

Pros

We believe this is an extremely interesting idea that approaches model interpretability from a unique angle: it lets us interpret models from an "output" perspective (what kinds of outputs/summaries the model generates) rather than a "parameter" perspective (what kinds of parameters the model has). Moreover, it contributes to NLP standards for model analysis, a growing area of research that many researchers are increasingly interested in, especially since the advent of neural networks.

Additionally, as we did more research into this topic area, we realized that there are in fact different methods for summarization, specifically abstractive vs. extractive methods. While we haven't delved deeper into specific models, we believe this is a promising start. Another positive is that there are several interesting directions we could take: with just a little brainstorming, we came up with two, analyzing model drift to see how different models emphasize different information, and analyzing how different initial corpora affect model drift. This suggests there might be a lot to uncover in this area of NLP.
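Purely as our own sketch of the first direction (not a settled metric), one simple way to compare what two summarization models emphasize is to look at which content words from the source each model's summary retains and how much the two summaries overlap. The model outputs below are stubbed as strings; real outputs would come from whatever models we end up setting up.

```python
# Toy sketch: compare which source content words two summaries retain.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "are"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def emphasis_report(source, summary_a, summary_b):
    src = content_words(source)
    kept_a = content_words(summary_a) & src  # source words model A kept
    kept_b = content_words(summary_b) & src  # source words model B kept
    union = kept_a | kept_b
    return {
        "only_model_a": kept_a - kept_b,
        "only_model_b": kept_b - kept_a,
        "shared": kept_a & kept_b,
        "overlap": len(kept_a & kept_b) / max(len(union), 1),
    }

print(emphasis_report(
    "the model learns syntax and semantics from large corpora",
    "model learns syntax from corpora",   # stand-in for model A's summary
    "model learns semantics",             # stand-in for model B's summary
))
```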

Cons

On the negative side, we believe the biggest cons stem from all the unknowns in this project. One of the biggest challenges is developing a robust metric to evaluate how different models encode different information (we received feedback on this after our most recent blog post, but haven't spent much time thinking about it yet; we'll address it in more depth in our next post). As discussed in class, it is hard to come up with a robust metric that removes biases and covers everything we want to cover. Another challenge we foresee is that we do not have a strong grounding in how to extract the information encoded in a summary. While there may be enough prior research and model development in this field, it is an area we haven't delved into deeply yet, so we might run into unforeseen problems.

Another challenge is actually building the models for this project. We would have to set up multiple models from different papers, which is sometimes a challenge in itself. That means a large amount of upfront work just to see whether there are interesting analyses to be made across models before we could develop any meaningful insights or indicators of potential success. Finally, a large part of this project is not easily "demonstrable" since it is more of an analysis project, so we are a little unsure about its feasibility as a capstone project.

Potential Datasets to Explore (for summarization)

WikiHow: a large dataset of article and summary pairs built from WikiHow articles, commonly used for abstractive summarization

CNN/DailyMail: a dataset of news articles paired with highlight summaries

DUC: the Document Understanding Conference datasets, important benchmarks for summarization (though relatively old and no longer actively maintained)

Implementation Libraries/Platforms

AllenNLP: useful as a starting place for training, evaluating, and predicting with models, as well as connecting standard components together into a larger model (see the short usage sketch after this list).

PyTorch: the framework AllenNLP is built on; we hope to use it mostly through AllenNLP.
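To give a feel for the tooling (this is not project code yet), a minimal sketch of loading a pretrained AllenNLP model and running a prediction might look like the following. The archive URL is a placeholder, and the JSON keys expected by predict_json depend on the specific model's dataset reader.

```python
# Minimal sketch of AllenNLP's Predictor API; the model URL is hypothetical.
from allennlp.predictors.predictor import Predictor

MODEL_URL = "https://example.com/pretrained-summarizer.tar.gz"  # placeholder archive

predictor = Predictor.from_path(MODEL_URL)
output = predictor.predict_json({"source": "Long article text to summarize..."})
print(output)
```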

Idea 2: Summarizing textbooks to glean the most important points

Pros

There is unlimited knowledge out in the world, but we humans have only a limited lifespan. This project could help summarize textbooks in large quantities without human intervention, making it more efficient to absorb information from textbooks, novels, and so on. Even though there are many human-written summaries out there, not every textbook has a good summary, if it has a summary at all. Also, if we implement our stretch goal of query-based summarization, it would provide context-driven summaries to the end user, making it even more useful. We would be able to do some work in the relatively untouched field of query-based summarization, and maybe add some interesting insights. All of us are interested in working in this area.

Unlike Idea 1, this is an application, so it will be easy to demonstrate and will have tangible real-world uses. It will need a fair amount of abstractive and/or extractive summarization, depending on how we want to glean the key points from the text. Thankfully, there is a lot of existing work in these areas, and we have plenty of evaluation metrics to pick from if we choose this project. We can use ROUGE and METEOR to get an idea of how much relevant content a summary captures from the source.
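As a rough illustration of the idea behind ROUGE (not our eventual evaluation pipeline, which would use an established implementation with stemming and multiple references), here is a minimal ROUGE-1 computation based on unigram overlap between a reference summary and a candidate summary.

```python
# Minimal ROUGE-1 sketch: unigram precision, recall, and F1 (no stemming).
from collections import Counter

def rouge1(reference, candidate):
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(rouge1("the cat sat on the mat", "a cat was on a mat"))
```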

Cons

It seems that the only dataset we can use is from Wikipedia; SparkNotes and CliffsNotes don't permit web scraping in their terms of use. Training models that ingest large corpora will take significant resources, and it's unclear what data would be most useful for our project. For instance, a model trained on Shakespearean summaries would not work well for scientific writing, and training on shorter texts might not generalize well to longer ones, so we might have to stick to one kind of source text and summary to keep the project viable. Finally, as far as we know, there is no perfect metric for how well a summary captures the key findings of the original text; ROUGE and METEOR have their own limitations, which are beyond the scope of this article.

Libraries/Platforms

AllenNLP and PyTorch, as mentioned above.

Idea 3: Recommending experts that can help

Pros

You often hear that people are willing to help if you just ask, but it can be hard to know who to ask. For this idea, we would build a product that lets users directly search for people who could help with a certain topic or answer a certain natural language question. The most appealing part of the project for us is that it is a tangible product that would let people help each other. Whether the collaboration is interdisciplinary or within the same field, we would be supporting new collaborations in industry and academia.

As we understand it, this project would require a specific application of information extraction to model each author based on the text they have written. If we pursue our stretch goal of taking natural language questions as input, we would also be doing a form of question answering. Both of these subfields of NLP are rich in previous work, which should give us an advantage as we figure out what approaches to take. The information extraction task also seems likely to center on the architecture of the model, where our team could apply our systems approach to engineering. This project would also take advantage of the sheer amount of writing people post online. Lastly, it would be easy to demo for the community.
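To make the author-modeling idea concrete, here is a rough sketch of our own (not a finalized design): represent each author by the concatenated text of their papers, embed those profiles with TF-IDF, and rank authors by cosine similarity to a topic query. The author corpora below are toy placeholders, and a real system would need far richer information extraction than bag-of-words.

```python
# Toy expert-ranking sketch: TF-IDF author profiles + cosine similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpora; in practice each value would be hundreds of papers' text.
authors = {
    "author_a": "dependency parsing syntax trees treebank grammar",
    "author_b": "convolutional networks image classification vision",
}

vectorizer = TfidfVectorizer()
profiles = vectorizer.fit_transform(authors.values())

query_vec = vectorizer.transform(["natural language processing parsing"])
scores = cosine_similarity(query_vec, profiles).ravel()

ranking = sorted(zip(authors.keys(), scores), key=lambda pair: -pair[1])
print(ranking)  # highest-scoring authors first
```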

Cons

Training models that ingest up to hundreds of papers or writings for a single person would need a significant amount of computational power. We are not sure that the text itself, without the field's context, will be enough to differentiate experts from novices, and it will be a challenge to evaluate how accurate our rankings are. A method similar to those used in citation recommendation may be useful here. We also need to address the fact that many papers list authors who played very different roles in the related research, so we could attach a weighting based on heuristics: the estimated main contributor gets the most credit, the estimated advisor gets the second most, and so on.
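One possible version of that heuristic (purely illustrative, with weights we made up): give the first author the largest share of credit, give the last author (often the advisor) the second largest, and split the remainder among middle authors.

```python
# Hypothetical author-position weighting; the specific shares are assumptions.
from typing import List

def author_weights(num_authors: int) -> List[float]:
    if num_authors == 1:
        return [1.0]
    if num_authors == 2:
        return [0.6, 0.4]
    middle_share = 0.2 / (num_authors - 2)  # middle authors split 20% evenly
    return [0.5] + [middle_share] * (num_authors - 2) + [0.3]

print(author_weights(4))  # [0.5, 0.1, 0.1, 0.3] -- sums to 1.0
```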

It may be necessary to create a taxonomy of topics so we know which subfields are relevant to our rankings. For example, if someone searched for the topic "natural language processing", they would probably want to include a paper about syntax trees even if the paper never explicitly mentions "natural language processing". This adds another component to our model. If we want to extend beyond research papers, we might want to (legally) gather other types of writing published online, which would be another challenge. Lastly, we would want to differentiate our work from previous related work, including GrapAL, which deals strictly with the metadata of academic literature, and citation recommendation.
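A small sketch of what that taxonomy component might look like, with hypothetical entries: expand a query topic into its subfields before matching against author profiles.

```python
# Hand-built, hypothetical topic hierarchy used to expand a query into subfields.
TOPIC_TREE = {
    "natural language processing": ["parsing", "summarization", "question answering"],
    "parsing": ["syntax trees", "dependency parsing"],
}

def expand_topic(topic):
    """Return the topic plus all of its descendants in the hierarchy."""
    expanded = {topic}
    for child in TOPIC_TREE.get(topic, []):
        expanded |= expand_topic(child)
    return expanded

print(expand_topic("natural language processing"))
```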

Libraries/Platforms/Datasets

Apart from AllenNLP and PyTorch, mentioned above, there are other codebases that might prove useful.

GrapAL: a tool from AI2 that could help us get useful metadata about research articles and authors.

CORE: A good starting place for open-access research papers to use as unlabeled training data.

Citeomatic and Semantic Scholar's OpenCorpus: resources from the related task of citation recommendation that are worth using as reference points.

Lecture Idea

Since two of our project ideas involve summarization, it would likely be helpful to have a lecture covering both modeling and evaluation techniques for summarization. We could also use recommendations of open-source summarization models that work out of the box, as a check that our projects are non-trivial. Information extraction and topic taxonomies may also be relevant to us if we pick the last project idea.

See you next week to hear about our final project choice!

Check out part 1 here!
