Natural Language Processing at Quill.org

Catherine Alvarado · Published in Open Source Quill
Oct 8, 2017 · 8 min read

During my Quill.org software engineering fellowship, I worked on Quill’s NLP effort to automatically detect whether a sequence of words is a complete sentence or an incomplete sentence. This blog post describes my fellowship and my experience using spacy.io, an open-source natural language processing (NLP) library for Python.

Background:

I started my fellowship at Quill in June 2016 after receiving a grant from the Edwin Gould Foundation. Quill is an education technology start-up and nonprofit in New York City that provides free online tools to help K-12 students improve their writing skills through personalized writing instruction. Quill’s mission is to eliminate the writing gap, which is the disparity in writing skills that exists between children from low-income and high-income backgrounds.

I primarily focused on Quill’s diagnostic tool, a tool that provides feedback on a student’s sentence structure skills. One part of the diagnostic asks students to turn a fragment into a complete sentence. The software then analyzes whether the student wrote a fragment or a sentence.

To predict whether a string of words is a complete sentence or an incomplete one, each word is analyzed by its part of speech (POS). Consider the example below:

Sentence:

The dog is smiling.

POS taken from the Stanford NLP Parser:

The — Determiner

Dog — Noun

Is — Verb, 3rd person singular present

Smiling — Verb, gerund or present participle

Suppose you are a grade school student going through the diagnostic. You are provided with a fragment and need to correct it:

The dog smiling.

There are many solutions you might come up with. For example:

The dog is smiling.

The dog is smiling, running, and barking.

Remy, the neighbor’s dog, has a wonderful smile.

There are many ways to turn a fragment into a complete sentence! At Quill, we built a quick initial solution to determine if a student response was a complete sentence, but realized our solution needed to be more robust. Our initial solution fell short because it did not take into account the infinitely many ways a student could modify a sentence fragment. To tackle this problem I initially worked directly with Donald McKendrick, the CTO of Quill, and Raghav Mehrotra, a software engineering summer intern from Stanford. I then transitioned into developing my own solution with guidance from Donald.

Initial Solution:

Donald, Raghav, and I developed a solution that stored several correct responses for each sentence fragment in Firebase, a NoSQL cloud database, and for each response we stored the POS of every word. Our algorithm compared the POS of a student’s response to the POS of the answers stored for that question. If a student’s response exactly matched the POS of a correct answer, the response was marked as correct; if no match was found, the response was marked as incorrect. Our assumption was that students would only write sentences that matched the POS sequences we stored in Firebase, but we quickly realized that assumption was wrong. Students were very creative with their answers, and it was inefficient and nearly impossible for us to know all of the correct answers to store for each sentence fragment.
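A minimal sketch of that exact-match check is below, written in Python and using Spacy as a stand-in for the tagger; the function names and stored answers are illustrative, not the production code.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_sequence(sentence):
    """Return the coarse part-of-speech tag for every word in a sentence."""
    return [token.pos_ for token in nlp(sentence)]

def is_complete_sentence(student_response, stored_correct_answers):
    """Mark a response correct only if its POS sequence exactly matches
    the POS sequence of one of the stored correct answers."""
    response_pos = pos_sequence(student_response)
    return any(pos_sequence(answer) == response_pos
               for answer in stored_correct_answers)

# Illustrative stored answers for the fragment "The dog smiling."
stored_answers = ["The dog is smiling.", "The dog was smiling."]

print(is_complete_sentence("The dog is barking.", stored_answers))
# likely True: same POS pattern as a stored answer
print(is_complete_sentence("Remy, the neighbor's dog, has a wonderful smile.", stored_answers))
# likely False: no stored answer shares this POS pattern
```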

The problem we faced was that a student could write as many words as he or she pleased to make the sentence fragment into a complete sentence. There was no way for us to predict what kinds of responses a student was going to write. We needed to find a better way.

The start of a better solution:

Through research we found that the best way to solve our problem was to use NLP: we needed to train a model that could correctly predict whether or not a sentence was a fragment. Peter Gault, the CEO of Quill, and Donald found a research paper called “Automatic Detection of Sentence Fragments”, written by Chak Yan Yeung and John Lee, and we decided to try to replicate the work done in the paper. I took over the project from there and assembled a dataset of over 100,000 sentences composed of complete sentences and sentence fragments. I gathered 60,000 sentences from Wikipedia and turned each one into a fragment by removing a different POS from it. By removing a specific POS from each sentence, I could generate a large number of sentence fragments to use in the NLP model.

Data Cleaning:

A big part of any NLP project is having enough data to work with. Most of my contribution to Quill involved automatically extracting and cleaning over 60,000 sentences gathered through Wikipedia’s API.

Wikipedia Sentence Gathering:

In order to get started on this project I needed to put together a dataset of complete sentences. Wikipedia has a list of well-written featured articles that are reviewed by editors and serve as examples of the kind of content authors should write on Wikipedia. I used the MediaWiki API and JavaScript to automatically scrape and parse content from hundreds of featured articles.
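The scraper itself was written in JavaScript, but the same request is easy to sketch in Python. The snippet below pulls the plain-text body of one article through the MediaWiki API; the article title is just an example.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_article_text(title):
    """Fetch the plain-text content of a Wikipedia article as one string."""
    params = {
        "action": "query",
        "prop": "extracts",    # TextExtracts: return the article body
        "explaintext": 1,      # as plain text rather than HTML
        "format": "json",
        "titles": title,
    }
    response = requests.get(API_URL, params=params).json()
    pages = response["query"]["pages"]
    # The API keys the result by internal page id, so take the first value.
    return next(iter(pages.values()))["extract"]

text = fetch_article_text("Hurricane Irene (2005)")  # example article title
print(text[:200])
```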

The MediaWiki API allowed me to pull all of the content from any Wikipedia article and dump it into one string. I essentially had hundreds of sentences within one string, but I needed each sentence to be its own string. Initially this problem seemed simple. I could identify the end of a complete sentence if it was followed by a period, space, and capital letter marking the start of the next sentence. Unfortunately, when you have a string with hundreds of sentences you cannot just rely on simple characters to identify sentences. The content I pulled from Wikipedia had honorifics (Mr., Mrs.), different kinds of name suffixes (Jr., M.B.A.), decimal numbers, time periods (a.m., p.m.), acronyms, abbreviations, non-alphanumeric characters, missing spaces between sentences, and many other edge cases. To solve my problem, I quickly became a regex ninja.

I used the power of regexes to identify which characters and circumstances truly marked the end of a sentence. I was able to parse more than a hundred Wikipedia articles and separate out over 100,000 sentences. In the process of cleaning the data I also removed sentences that were too long or that contained non-alphanumeric characters, quotations, or excessive punctuation marks, because the dataset needed to be representative of the content students would write.
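The real regexes handled many more edge cases, but the basic idea looks something like this sketch: protect known abbreviations, then split on sentence-ending punctuation followed by whitespace and a capital letter. The abbreviation list here is illustrative.

```python
import re

# Abbreviations that end in a period but do not end a sentence (illustrative list).
ABBREVIATIONS = ["Mr.", "Mrs.", "Dr.", "Jr.", "Sr.", "a.m.", "p.m.", "e.g.", "i.e."]

def split_sentences(text):
    """Split a blob of article text into individual sentences."""
    protected = text
    # Temporarily replace the periods in known abbreviations so they are not
    # mistaken for the end of a sentence.
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    # A sentence ends with ., !, or ? followed by whitespace and a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected)
    return [part.replace("<DOT>", ".").strip() for part in parts]

text = "Mr. Smith met Dr. Jones at the office. They talked for hours."
for sentence in split_sentences(text):
    print(sentence)
# -> Mr. Smith met Dr. Jones at the office.
# -> They talked for hours.
```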

NLP Data Processing:

Once I had clean data I closely followed the research paper mentioned earlier. Out of the final 60,000 sentences I gathered, I created four subsets of 15,000 sentences each. Each subset was used for a specific POS removal: nouns, verbs, nouns and verbs, or subordinate conjunctions.

I used spacy.io to parse the POS of each sentence. Spacy is an open-source Python library that is widely used for fast natural language processing. For this project I only used two Spacy functionalities, illustrated in the short sketch after this list:

  • POS tagger: Gives the POS of every word in a sentence.
  • Word dependency tagger: Gives the dependency between words in a sentence.
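For example, running the earlier sentence through Spacy exposes both kinds of tags on every token (the exact labels depend on the model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model; any English model works
doc = nlp("The dog is smiling.")

for token in doc:
    # token.pos_ is the coarse part of speech, token.tag_ the detailed tag,
    # and token.dep_ the dependency relation to the token's syntactic head.
    print(f"{token.text:<10} {token.pos_:<6} {token.tag_:<5} {token.dep_:<8} head={token.head.text}")
```

For this sentence, “The” comes back as a determiner, “dog” as a noun, and “is smiling” as an auxiliary followed by a present participle, roughly matching the Stanford tags shown at the start of this post.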

For each dataset mentioned earlier, I focused on a specific POS removal. I will explain each removal in detail below:

Noun removal:

I removed the first noun or group of consecutive nouns from a sentence, while keeping the POS tags of the remaining sentence the same. For example, suppose the sentence is “Julio, Narcisa, and Catherine went to the beach.” The updated sentence would be “And Catherine went to the beach.”

My algorithm identified the first noun or group of nouns by looking at the POS and dependency tags for each word. As soon as the POS tag was a noun, pronoun, proper noun, or number, or the dependency tag was possessive, the algorithm greedily grabbed words until it reached a word that was not a noun. In the example above the algorithm only identified Julio and Narcisa as nouns because the word “and” in the sentence was not a noun. Consider another example, “I went to visit pandas.” The algorithm would only identify “I” as a noun because the word “went“ is not a noun.
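A simplified sketch of that noun pass is below. Treating commas inside the run as part of it is an assumption I make here so that the “Julio, Narcisa” example comes out as described; the tag sets in my real code were more detailed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Coarse Spacy tags treated as "noun-like": nouns, proper nouns, pronouns, numbers.
NOUN_LIKE = {"NOUN", "PROPN", "PRON", "NUM"}

def remove_first_noun_group(sentence):
    """Remove the first run of consecutive noun-like tokens (plus any commas
    inside that run) and return (removed_words, remaining_words)."""
    doc = nlp(sentence)
    removed, kept = [], []
    grabbing = False    # currently inside the first noun run
    finished = False    # the first noun run has already ended
    for token in doc:
        noun_like = token.pos_ in NOUN_LIKE or token.dep_ == "poss"
        comma_in_run = grabbing and token.text == ","
        if not finished and (noun_like or comma_in_run):
            grabbing = True
            removed.append(token.text)
        else:
            if grabbing:
                finished = True
            kept.append(token.text)
    return removed, kept

removed, kept = remove_first_noun_group("Julio, Narcisa, and Catherine went to the beach.")
print(removed)          # e.g. ['Julio', ',', 'Narcisa', ',']
print(" ".join(kept))   # e.g. "and Catherine went to the beach ."
```

The verb removal described next follows the same pattern, with Spacy’s verb tags as the targets instead of the noun-like set.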

Verb removal:

I removed the first verb or group of consecutive verbs from a sentence. For example, in the sentence “He is tall.” the verb identified would be “is”. The updated sentence would be “He tall.”

Similar to the noun removal process described earlier, my algorithm identified the first verb or group of consecutive verbs by looking at the POS tags of a sentence. If a word is labeled as a verb, the algorithm grabs it and continues to grab the next word as long as it is also a verb. For example, in the sentence “She is running up the hill.”, the words identified as verbs are “is running” and the updated sentence would be “She up the hill.”

Noun and verb removal:

I removed both consecutive nouns and verbs using the methods described above.

Subordinate conjunction removal:

The process of removing subordinate conjunctions was different from the processes mentioned earlier. In order to create this dataset I needed to make sure that all 15,000 sentences actually contained subordinate conjunctions. A subordinate conjunction is a word such as “because”, “although”, or “when” that joins a dependent clause to the rest of the sentence. I made a list of the most common subordinate conjunctions and filtered out sentences that did not contain any of them.

Once I had a clean dataset composed of sentences with subordinate conjunctions, I wrote a function to remove parts of these sentences. If a subordinate conjunction was at the beginning of a sentence, I removed that word and all of the words following it until a comma was reached. If a subordinate conjunction was in the middle of a sentence, I removed that word and all of the words following it to the end of the sentence.
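In code, that rule is simple enough to sketch without Spacy; the conjunction list below is much shorter than the one I actually used.

```python
# A few common subordinate conjunctions (illustrative; the real list was longer).
SUBORDINATE_CONJUNCTIONS = {"because", "although", "while", "since",
                            "unless", "when", "after", "before"}

def remove_subordinate_clause(sentence):
    """Drop the subordinate conjunction and the words it governs.

    If the conjunction starts the sentence, remove everything up to the first
    comma; if it appears mid-sentence, remove everything through the end."""
    words = sentence.rstrip(".").split()
    lowered = [w.lower().strip(",") for w in words]

    index = next((i for i, w in enumerate(lowered)
                  if w in SUBORDINATE_CONJUNCTIONS), None)
    if index is None:
        return sentence  # no subordinate conjunction found

    if index == 0:
        # Keep only the words after the first comma.
        for i, word in enumerate(words):
            if word.endswith(","):
                return " ".join(words[i + 1:]) + "."
        return ""  # no comma: the whole sentence was the subordinate clause
    return " ".join(words[:index]) + "."

print(remove_subordinate_clause("Although it was raining, we went to the beach."))
# -> "we went to the beach."
print(remove_subordinate_clause("We stayed inside because it was raining."))
# -> "We stayed inside."
```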

Final dataset:

My final dataset contained detailed information for complete sentences and sentence fragments. Each sentence fragment had additional information including the corresponding complete sentence, word or words removed, updated sentence, POS for the original sentence, POS for the removed words, and POS for the updated sentence. This data was then used by the team to train a model for detecting fragments and sentences.
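Concretely, each fragment row carried fields along these lines (the field names and tags here are illustrative rather than the exact schema):

```python
# One illustrative record for a fragment generated by verb removal.
fragment_record = {
    "original_sentence": "She is running up the hill.",
    "removed_words": "is running",
    "removal_type": "verb",
    "updated_sentence": "She up the hill.",
    "original_pos": ["PRON", "AUX", "VERB", "ADP", "DET", "NOUN", "PUNCT"],
    "removed_pos": ["AUX", "VERB"],
    "updated_pos": ["PRON", "ADP", "DET", "NOUN", "PUNCT"],
    "label": "fragment",  # complete sentences carry the label "sentence"
}
```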

Quick reflection on Spacy:

Spacy was very fast and identified the POS I needed, but it had several problems that required custom solutions. The main problem was that Spacy automatically splits words joined by hyphens and then tags each piece separately. This was a problem because most of the hyphenated words were nouns, but Spacy would break the noun into separate words that were no longer classified as nouns. Another problem was that some of the POS tags were simply incorrect, and I needed to identify which kinds of words were being misclassified.
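To illustrate the hyphen issue, the sketch below shows how a hyphenated compound is split into several tokens, along with one possible fix using the retokenizer available in newer Spacy versions (not necessarily the workaround I used at the time):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My mother-in-law is visiting.")

# The hyphenated compound comes back as several tokens, e.g.
# ('My', ...), ('mother', ...), ('-', ...), ('in', ...), ('-', ...), ('law', ...), ...
print([(t.text, t.pos_) for t in doc])

# One possible fix in newer Spacy versions: merge the span back into one noun token.
# doc[1:6] assumes the tokens 'mother', '-', 'in', '-', 'law' sit at positions 1-5.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:6], attrs={"POS": "NOUN"})

print([(t.text, t.pos_) for t in doc])
# ('mother-in-law', 'NOUN') is now a single token
```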

Overall, Spacy was very quick and despite the challenges I faced, it still helped me get my job done.

Contribution to Quill:

The first step towards solving a data science problem is gathering enough data to test and build models on. I was the first person at Quill to put together a labeled NLP training set of complete sentences and sentence fragments. The entire process required me to deal with very messy data, pick up new technologies quickly, and think on my feet. From start to finish the project was challenging, but very rewarding. They say that more than 50% of a data scientist's work is just cleaning data, and I'm glad I was able to contribute that to kick-start Quill's NLP work.
