Written Comm Analyzer — Scoring Grammar

Sajid Rahman
Published in Prod.IO
2 min read · Sep 27, 2018

We can’t overstate the importance of correct grammar in written communication, so a grammar check was a clear necessity in the suite. As one might imagine, though, grammar checking is one of the harder problems in NLP, and building better models for it remains an active focus of machine learning research. We explored a few techniques in this regard.

1. Rule-Based Solution

There are a few Python packages that provide rule-based grammar corrections. One popular option is LanguageTool, which exposes out-of-the-box methods for checking a text passage for grammatical errors. The package is pretty straightforward: a set of common grammatical rules evaluates the text and suggests corrections.
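Below is a minimal sketch of how such a check can be run, assuming the language_tool_python wrapper around LanguageTool is installed (pip install language-tool-python); the sample sentence and printed fields are illustrative, not output from our pipeline.

```python
import language_tool_python

# Start a local LanguageTool instance for English.
tool = language_tool_python.LanguageTool('en-US')

text = "He go to school every day and she bought a apple."

# Each match is one rule violation, carrying a rule id, a message,
# and a list of suggested replacements.
matches = tool.check(text)
for match in matches:
    print(match.ruleId, '-', match.message, match.replacements)

# Apply the suggested corrections in one pass.
print(tool.correct(text))
```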

As can be observed, each correction is mapped to a ruleId such as TOT_THE or EN_A_VS_AN, which gives more information on why the correction was suggested.

This package works pretty well for most corrections, but we wanted to explore a deep learning solution as well.

2. Deep Learning-Based Solution

We found open-source, deep-learning-based models for grammar correction. One such project is Deep Text Corrector.

The basic idea of the project is to inject artificial grammatical errors into a text corpus of short English dialogues, then train a sequence-to-sequence model on the corrupted text with the original text as the target, in the expectation that the model learns these errors and predicts the relevant corrections. The specification of the model is as follows:

  1. Text corpus: the Cornell Movie-Dialogs Corpus, which contains parsed short dialogues from movies. It was processed to inject the artificial grammatical errors described below.
  2. Grammatical errors: removal of articles (a, an, the), removal of the second part of a verb contraction (e.g. “‘ve”, “‘ll”, “‘s”, “‘m”), and replacement of a few common homophones with one of their counterparts (e.g. replacing “their” with “there”, “then” with “than”). A sketch of this perturbation step follows the list.
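The snippet below is a rough sketch of such an error-injection step; the token lists, corruption probability, and helper name are illustrative assumptions rather than the project’s exact implementation.

```python
import random

# Illustrative token lists for the three error categories described above.
ARTICLES = {"a", "an", "the"}
CONTRACTION_SUFFIXES = ("'ve", "'ll", "'s", "'m")
HOMOPHONES = {"their": "there", "there": "their", "then": "than", "than": "then"}

def perturb(sentence, p=0.25):
    """Return a corrupted copy of a whitespace-tokenized sentence."""
    noisy = []
    for token in sentence.split():
        lower = token.lower()
        if lower in ARTICLES and random.random() < p:
            continue  # drop the article entirely
        if lower in HOMOPHONES and random.random() < p:
            token = HOMOPHONES[lower]  # swap in the wrong homophone
        elif lower.endswith(CONTRACTION_SUFFIXES) and random.random() < p:
            token = token.rsplit("'", 1)[0]  # "I've" -> "I"
        noisy.append(token)
    return " ".join(noisy)

# Training pairs: the model reads the noisy sentence and must emit the clean one.
clean = "I've seen their house, and then we left the party."
print(perturb(clean))
```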

We trained a sequence-to-sequence model with LSTM encoders and decoders, an attention mechanism, and stochastic gradient descent. The attention mechanism gives the decoder a way to focus on the relevant parts of the input sequence at each decoding step. We achieved decent performance on the test set.
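For illustration, here is a rough tf.keras sketch of such an encoder-decoder with attention; the vocabulary size, layer widths, and the use of Luong-style dot-product attention are assumptions, not the exact configuration of the original project.

```python
import tensorflow as tf

VOCAB, EMB, HIDDEN = 20000, 128, 256

# Encoder: embeds the (possibly corrupted) input sentence.
enc_in = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = tf.keras.layers.Embedding(VOCAB, EMB)(enc_in)
enc_out, state_h, state_c = tf.keras.layers.LSTM(
    HIDDEN, return_sequences=True, return_state=True)(enc_emb)

# Decoder: predicts the corrected sentence token by token (teacher forcing).
dec_in = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = tf.keras.layers.Embedding(VOCAB, EMB)(dec_in)
dec_out = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])

# Attention lets each decoder step look back at all encoder positions.
context = tf.keras.layers.Attention()([dec_out, enc_out])
concat = tf.keras.layers.Concatenate()([dec_out, context])
logits = tf.keras.layers.Dense(VOCAB)(concat)

model = tf.keras.Model([enc_in, dec_in], logits)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

That said, there are some limitations.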

  1. The model only works when it encounters perturbations that fall into the three categories of errors induced in the corpus; it is unable to detect errors of other types.
  2. The model is only fit for error detection in short texts. This could be addressed with a dataset containing longer sentences.

Our pipeline (consisting of the open-source package and the DL model) is reliable for correcting commonly occurring grammatical errors, but there is still a fairly long way to go before it covers the less common errors that tools like Grammarly can detect and correct. Nonetheless, the pipeline can be used for a basic screening of text for grammatical errors, which serves as a reasonable proxy for the grammatical quality of the entire passage, and that is exactly what we need for a grammar score.
