Quill is Open Sourcing NLP Tools To Aid Student Writing
Imagine a student writing, “My favorite writer is J.K. Rowling, she is very descriptive.”
In the example above, the student has spliced two sentences together with a comma, creating a run-on sentence. In the United States, 75% of students struggle with writing, and these students often struggle with fragments and run-on sentences. Here’s how one teacher describes his students’ writing:
“[Students] produced massive run-on sentences connected by a long series of and’s, while still others scattered sentence fragments throughout their compositions.”
Computer programs often cannot detect or remediate these errors. The state of the art in sentence fragment detection today produces abysmal results: in our test of 100 sentence fragments, programs like Microsoft Word and Google Docs correctly detected, at best, 30% of the fragments. Students rely on these tools to improve their writing, and they need better solutions.
Quill.org’s free online tools help students become strong writers, and to date the organization has helped over 300,000 students improve their skills. For a couple of years, however, Quill could not detect sentence fragments, and one of the most common pieces of feedback we heard from teachers was that their students need sentence fragment remediation.
Quill is now building a machine learning-powered algorithm to detect and remediate fragment errors, and today we have open sourced our fragment detection algorithm. In our tests, the algorithm accurately detects fragments 84% of the time, and we aim to launch it by September 1st, 2017.
The Quill team does not have a machine learning expert on staff. Instead, our developer team was able to solve this problem by using online courses and open source tools. Specifically, we used Udacity’s machine learning course, which we highly recommend to other software engineers interested in machine learning. Quill was able to successfully implement a machine learning solution because we had a discrete problem with two potential outcomes: either a text string is a sentence or it is a fragment. For larger problems you may need an expert, but if you have the right problem, you can solve it yourself.
Here’s how we solved our detection problem. To start, Quill turned to the machine learning research literature. Researchers at the City University of Hong Kong used a combination of natural language processing and machine learning to detect sentence fragments. Machine learning requires a large data set, and these researchers used 100,000 sentences pulled from Reuters newspaper articles. Their data set and code were closed source, however, and required an expensive license to access. Quill set out to build a free, open source fragment detection tool that any organization could use to help people understand and remediate fragment errors, and for that we needed an open source data set.
Here’s how our implementation works:
- Compile a Wikipedia data set: Wikipedia’s featured articles have been reviewed by multiple editors and locked to ensure accuracy. We pulled 100,000 highly edited sentences from these articles.
- Build training data: To detect whether a sentence is a fragment or a complete sentence, you need a large data set that contains both types. We took the 100,000 sentences and turned half of them into fragments using Spacy.io, a natural language processing tool. Spacy.io let us convert each sentence into a string of its parts of speech (“I run” becomes “I: Subject, Run: Verb”). We then removed certain parts of speech from those sentences, such as the subject, verb, or dependent clause. We now had 50,000 fragments.
- Make a numerical representation of the data: Next we label the data. The classifier needs two sets of data: inputs and labels. Each label is a binary value, where 1 indicates the input is a sentence and 0 indicates a fragment. For the inputs, we convert the fragments and sentences into lists of their parts of speech and create trigrams of those parts of speech. A trigram is a group of three consecutive parts of speech; for example, “subject verb noun” is one trigram. For efficiency, we only use the 1,200 most common trigrams. For each sentence and fragment, we create a 1,200-value array indicating which of those trigrams it contains.
- Feed the data into the classifier: We reserve 10% of the data for testing and use the remaining 90% to build the model. We then feed the data into TensorFlow. The network has 1,200 input nodes, followed by a 200-node hidden layer and a second hidden layer of 25 nodes, and outputs to 2 nodes, one for each outcome: sentence or fragment.
- Analyze the fragments: Our team manually created a set of 100 fragments and sentences and ran them through the algorithm. While we know whether each item is a complete sentence or a fragment, the algorithm does not; it makes its own judgment. By comparing the algorithm’s judgments to our labels, we found that it accurately detected fragments 84% of the time, and we aim to increase that accuracy by working with the open source community. One area we can improve, for example, is detecting short sentences: we found they were hard to detect because they were under-represented in the data set.
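To make the fragment-creation and trigram steps above concrete, here is a minimal sketch in plain Python. The hand-tagged tokens and the tag names are illustrative placeholders; the real pipeline tags sentences with Spacy.io and builds its vocabulary from the 1,200 most common trigrams across 100,000 sentences.

```python
from collections import Counter

def make_fragment(tagged_tokens, drop_pos="VERB"):
    """Create a fragment by removing every token with a given
    part of speech (the real pipeline uses Spacy.io tags)."""
    return [(word, pos) for word, pos in tagged_tokens if pos != drop_pos]

def pos_trigrams(tagged_tokens):
    """List the part-of-speech trigrams in a tagged sentence."""
    tags = [pos for _, pos in tagged_tokens]
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def build_vocabulary(tagged_corpus, size=1200):
    """Keep only the `size` most common trigrams across the corpus."""
    counts = Counter()
    for sent in tagged_corpus:
        counts.update(pos_trigrams(sent))
    return [tri for tri, _ in counts.most_common(size)]

def vectorize(tagged_tokens, vocabulary):
    """One fixed-length array per example: 1 if the trigram occurs."""
    present = set(pos_trigrams(tagged_tokens))
    return [1 if tri in present else 0 for tri in vocabulary]

# Toy example with hand-tagged tokens:
sentence = [("She", "PRON"), ("writes", "VERB"),
            ("vivid", "ADJ"), ("stories", "NOUN")]
fragment = make_fragment(sentence)  # drops the verb "writes"
vocab = build_vocabulary([sentence, fragment])
print(vectorize(sentence, vocab))
print(vectorize(fragment, vocab))
```

The sentence and its fragment produce different trigram vectors, which is exactly the signal the classifier learns from.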
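The network shape described in the classifier step can be sketched as a forward pass in NumPy. This is only a shape check, not the trained model: the weights are random placeholders, and the post does not state the activation functions, so ReLU hidden layers and a softmax output are assumptions; the actual model is built and trained in TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the post: 1,200 trigram inputs, hidden layers of
# 200 and 25 nodes, and 2 output nodes (sentence vs. fragment).
sizes = [1200, 200, 25, 2]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """One forward pass: ReLU hidden layers, softmax output (assumed)."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0, x @ w + b)      # ReLU
    logits = x @ weights[-1] + biases[-1]
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# A random 1,200-value binary trigram vector stands in for a real example.
probs = forward(rng.integers(0, 2, 1200).astype(float))
print(probs)  # two probabilities summing to 1
```

Training would adjust the weights so that the first output approaches 1 for fragments and the second for complete sentences (or vice versa, depending on the label encoding).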
As a nonprofit, open source organization building free writing tools, we invite everyone to join us. You can join our Slack channel or follow our Medium publication for updates. We will continue to share what we are learning, and we welcome your thoughts.
Ready to dive into the code? You can take a closer look at the code here. The Readme file is a complete walkthrough of the code needed to prepare and model the data. You can run and modify the code by cloning the repo and running Jupyter Notebook from the directory; you’ll need Python 3.5 installed. Feel free to open a pull request or an issue with any feedback you have.