The beginning of OutStanding CApstoneRs (OsCar)

Ravi S Patel · Published in OsCar Capstone · Apr 9, 2019 · 4 min read
A Boundless OutStanding CApstoneR boi — Source: http://disguise.com/oscar-the-grouch-adult.html

Our names are Ravi Patel, Tyler Ohlsen, and Blarry Wang. Our purpose? To follow in the footsteps of our predecessors, ELMo and BERT. We are OsCar: the Outstanding Capstoners.

Our group’s GitHub: https://github.com/tohlsen/OsCar.

Here are our top three project ideas that we are looking to accomplish in the 10-week-long capstone project.

Project Idea 1: Question Answering (DROP dataset)

Summary:

DROP stands for Discrete Reasoning Over Paragraphs. It is a reading comprehension benchmark consisting of paragraphs and questions whose answers take the form of a number, a word span, or a date. Unlike most other QA datasets, however, DROP requires discrete reasoning over the information presented in the paragraphs to arrive at the right answers. For example, a model may need to count how many touchdowns happened in a football game based on a description of the game, where the answer is not explicitly stated in the paragraph. In addition to counting, the dataset requires other kinds of discrete reasoning, such as subtraction, comparison, and selection.
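
To make the counting case concrete, here is a small, hypothetical example in the spirit of DROP (the passage, numbers, and answer below are invented for illustration and are not taken from the dataset):

```python
# Hypothetical DROP-style example (invented text, not from the dataset).
# The answer "3" never appears in the passage; a model must count the
# touchdown mentions rather than extract a span.
example = {
    "passage": (
        "The Seahawks opened the scoring with a 12-yard touchdown run. "
        "In the second quarter, Wilson threw a 25-yard touchdown pass, "
        "and the defense later returned a fumble for a touchdown."
    ),
    "question": "How many touchdowns were scored in the game?",
    "answer": {"number": "3"},  # discrete reasoning: counting, not span extraction
}

# A naive keyword count happens to work here, but DROP questions generally
# require reasoning (subtraction, comparison, date arithmetic) that simple
# pattern matching cannot handle.
print(example["passage"].lower().count("touchdown"))  # -> 3
```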

One of the biggest attractions of this dataset is that it is relatively new. The baseline model presented in the paper achieved an F1 score of only 47.01 on the test set, whereas humans achieve 95.98. We are excited about the large room for improvement.

Source: https://arxiv.org/pdf/1903.00161.pdf

Minimum viable action plan:

  1. Replicate the baseline model presented in the paper (Augmented QANet);
  2. tune the baseline model as much as possible to replicate the F1 score shown in the DROP paper; and
  3. analyze the strengths and weaknesses of Augmented QANet by category of reasoning (see the analysis sketch after this list).
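
For step 3, here is a minimal sketch of how we might break F1 down by reasoning category; the categories and record format below are our own assumptions, not an official DROP schema:

```python
from collections import defaultdict

# Hypothetical per-example records: the reasoning category we assigned during
# error analysis and the F1 the model achieved on that example.
predictions = [
    {"category": "counting", "f1": 0.35},
    {"category": "subtraction", "f1": 0.60},
    {"category": "span selection", "f1": 0.78},
    {"category": "counting", "f1": 0.20},
]

totals = defaultdict(list)
for p in predictions:
    totals[p["category"]].append(p["f1"])

# Average F1 per reasoning category, sorted from weakest to strongest,
# to highlight where a specialized model would help most.
for category, scores in sorted(totals.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    print(f"{category:15s} avg F1 = {sum(scores) / len(scores):.2f}")
```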

Stretch Goals:

  1. Build more specialized models that tackle Augmented QANet’s weakest categories;
  2. integrate our models and Augmented QANet; and
  3. submit our model to the DROP public leaderboard (https://leaderboard.allenai.org/drop/submissions/public) for evaluation.

Project Idea 2: Question Answering (HotpotQA dataset)

Summary:

Most Question-Answering (QA) datasets use a single passage as the reference for answering each question. The HotpotQA dataset seeks to further test existing QA models by requiring them to perform complex, multi-hop reasoning: models must answer questions using supporting evidence drawn from multiple documents or passages rather than just one. Additionally, each HotpotQA example annotates which parts of the passages count as supporting evidence for the answer, which lets a model learn what qualifies as viable evidence for a specific question. The HotpotQA dataset is based on Wikipedia data and is relatively new (released in 2018).
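
To give a flavor of the data, here is an illustrative HotpotQA-style example; the text is invented, and the field names reflect our reading of the public release, so treat them as approximate:

```python
# Illustrative HotpotQA-style example (invented text; field names follow our
# reading of the public release). "supporting_facts" pairs a document title
# with the index of the evidence sentence inside that document.
example = {
    "question": "Which country is the birthplace of the director of Film X?",
    "answer": "France",
    "context": [
        ["Film X", ["Film X is a 1999 drama.", "It was directed by Jane Doe."]],
        ["Jane Doe", ["Jane Doe is a film director.", "She was born in Paris, France."]],
    ],
    "supporting_facts": [["Film X", 1], ["Jane Doe", 1]],
}

# Multi-hop: answering requires chaining evidence across both documents.
for title, sent_idx in example["supporting_facts"]:
    sentences = dict(example["context"])[title]
    print(f"{title}: {sentences[sent_idx]}")
```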

Source: https://nlp.stanford.edu/pubs/yang2018hotpotqa.pdf

Minimum viable action plan:

The HotpotQA paper reports the results of one of the leading QA models on the new dataset, and its performance was poor: an F1 score of 34.40 on the test set for producing the correct answer, and an F1 score of 40.69 for identifying the correct supporting facts for an answer (a sketch of how answer F1 is computed appears after the list below). Our goals would then be as follows:

  1. Start with the same (or simpler) model, implemented in AllenNLP.
  2. Tune the simple model as much as possible to try to replicate the F1 scores shown in the HotpotQA paper.
  3. Analyze the strengths and weaknesses of the original/simple model and discuss approaches to improve the model.
  4. Design a model that can beat the F1 scores given in the paper.
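
As referenced above, here is a minimal sketch of the token-overlap F1 used for SQuAD-style answer evaluation; the official HotpotQA evaluation also normalizes case, punctuation, and articles, which we omit here:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string.

    Simplified: whitespace tokenization only; the official evaluation also
    normalizes case, punctuation, and articles before comparing.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in Paris France", "Paris France"))  # -> 0.8
```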

Stretch Goals:

If we are able to beat the baseline with our model, we should create a demo website that will help others understand the task and specifically how we designed our model to handle complex, multi-hop reasoning.

Project Idea 3: Semantic Textual Similarity (Quora Question Pairs Dataset)

Summary:

The Quora Question Pairs dataset contains 400,000 potential duplicate question pairs, each with a binary label indicating whether the pair is a duplicate or not. It is quite new (released in January 2017), and models have already achieved impressive accuracies (almost 90%).
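
To get a feel for the data, here is a quick sketch of loading the released TSV with pandas; the filename and column names below reflect the public release and should be checked against the copy you download:

```python
import pandas as pd

# Load the Quora Question Pairs TSV; the public release includes columns such
# as question1, question2, and is_duplicate (verify against your copy).
pairs = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

print(len(pairs))                    # roughly 400k pairs
print(pairs["is_duplicate"].mean())  # fraction of pairs labeled as duplicates
print(pairs[["question1", "question2", "is_duplicate"]].head())
```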

Models for semantic textual similarity are useful in many NLP-related tasks, such as identifying duplicate posts (which this particular dataset annotates), and can consequently help filter out unnecessary or duplicate questions on a QA forum such as Quora.

Minimum viable action plan:

  1. Build a model (& dataset reader) similar to those provided, implemented in AllenNLP.
  2. Tune the model (trying different pre-trained word embeddings and different encoders for the questions, such as LSTMs, Bi-LSTMs, LSTMs with attention, and ELMo) to achieve an accuracy similar to those reported on NLPProgress (87%-89%); a minimal encoder sketch follows this list.
  3. Describe how changing our model architecture affected our accuracy and why.
  4. Design a model that beats the highest reported accuracy on NLPProgress (89.12%).
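
As noted in step 2, here is a minimal PyTorch sketch of the kind of siamese Bi-LSTM pair classifier we have in mind; the dimensions, vocabulary size, and feature combination are placeholders, and in practice we would build this as an AllenNLP model with pre-trained embeddings:

```python
import torch
import torch.nn as nn

class SiameseBiLSTM(nn.Module):
    """Encode both questions with a shared Bi-LSTM and classify the pair."""

    def __init__(self, vocab_size: int = 30000, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Concatenate the two question vectors plus their absolute difference
        # and elementwise product, a common pairwise feature combination.
        self.classifier = nn.Linear(4 * 2 * hidden_dim, 2)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)
        outputs, _ = self.encoder(embedded)
        return outputs.max(dim=1).values  # max-pool over time steps

    def forward(self, q1_ids: torch.Tensor, q2_ids: torch.Tensor) -> torch.Tensor:
        v1, v2 = self.encode(q1_ids), self.encode(q2_ids)
        features = torch.cat([v1, v2, (v1 - v2).abs(), v1 * v2], dim=-1)
        return self.classifier(features)  # logits for duplicate / not duplicate

# Toy usage with random token ids (batch of 2 pairs, 8 tokens each).
model = SiameseBiLSTM()
q1 = torch.randint(1, 30000, (2, 8))
q2 = torch.randint(1, 30000, (2, 8))
print(model(q1, q2).shape)  # torch.Size([2, 2])
```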

Stretch Goals:

  1. If we can beat the highest reported accuracy (or at least come very close), we could build a simple demo website that describes the motivation for the model and what it does, and lets users input question pairs and see our model's prediction.
  2. Apply our model to accomplish semantic similarity tasks, such as filtering out duplicate questions from a website or other source.

We hope you will join us on this outstanding journey and follow us on our blog to keep you updated on our project!

Stoners out.
