Grammatical Error Correction using Deep Learning

Rohan Sawant
10 min read · Nov 30, 2021



Table of Contents

  1. Introduction
  2. Business problem Formulation
  3. DL problem formulation
  4. Data Collection
  5. Data Cleaning & Exploratory Data Analysis
  6. Data preparation for model
  7. Model Training
  8. Post Training Analysis
  9. Deployment
  10. Future work
  11. References

1. Introduction

We have always found it quirky when Ross Geller from F.R.I.E.N.D.S. corrected the grammar of the people around him. Although it was intimidating for his friends at times, grammar correction is definitely very useful in day-to-day tasks, be it writing an email, an article, or a professional chat. In fact, while writing this article, I had two or three browser add-ons for grammar correction running.

Natural language processing is a subfield of artificial intelligence concerned with the interactions between computers and human language, i.e., making computers understand human languages the way they understand programming languages like Python, C, Java, etc. Typical NLP problems include machine translation, named entity recognition, and part-of-speech tagging.

In this article, we explore how grammatical correction can be achieved using neural networks. Grammatical error correction is a problem similar to machine translation, where, given an input sentence ‘x’ of length ‘i’, the output ‘y’ is a sentence of length ‘j’.

2. Business Problem Formulation

2.1 Problem Statement

This is the problem of detecting grammar errors in an input sequence and providing the corrected sentence as output.

Incorrect and Correct Sentences Overview

2.2 Business Applications

This can be used as a browser add-on that detects text input on any website, e.g., email or article-writing websites.

It can be incorporated into document-creation applications such as Microsoft Word, Google Docs, etc.

3. DL problem formulation

In the NLP context, grammatical error correction is a sequence-to-sequence problem: the input is an erroneous sentence and the output is the corrected sentence. Such problems are solved using the encoder-decoder model architecture.

In layman's terms, the encoder takes an input sentence and represents it as a vector of numbers. The decoder takes this encoded vector as input and attempts to produce the output sentence.

Encoder Decoder Architecture

We are trying to minimize Categorical Cross-Entropy loss.

We will measure the performance of the model with the GLEU (Generalized Language Evaluation Understanding) score. It is a variant of BLEU proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, and is calculated as follows:

1. All sub-sequences of 1, 2, 3, or 4 tokens (n-grams) in the output and target sequences are recorded.

2. Recall is then calculated as the ratio of the number of matching n-grams to the total number of n-grams in the target (ground-truth) sequence.

Recall Formula

3. Precision is computed as the ratio of the number of matching n-grams to the total number of n-grams in the generated output sequence.

Precision Formula

4. The GLEU score is the minimum of recall and precision. It always lies between 0 (no matches) and 1 (all n-grams match).
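The four steps above can be sketched in plain Python. This is a minimal, illustrative implementation (the function names are my own, and clipped counting over pooled 1-4-grams is an assumption about the exact bookkeeping):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous sub-sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gleu(output, target, max_n=4):
    """Sentence-level GLEU as described above: min(recall, precision)
    over pooled 1..4-gram counts."""
    out_counts, tgt_counts = Counter(), Counter()
    for n in range(1, max_n + 1):
        out_counts.update(ngrams(output, n))
        tgt_counts.update(ngrams(target, n))
    # Clipped matches: an n-gram counts only as often as it appears in both
    matches = sum((out_counts & tgt_counts).values())
    recall = matches / max(sum(tgt_counts.values()), 1)
    precision = matches / max(sum(out_counts.values()), 1)
    return min(recall, precision)
```

A perfect correction scores 1.0 and a completely disjoint output scores 0.0, with partial overlaps landing in between.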

4. Data Collection

We are using the Lang-8 dataset from the NAIST Lang-8 Learner Corpora; to download it, a request has to be submitted through the corpus's download form.

The data is in the “m2” file format. Lines starting with S contain the source (incorrect) sentence, and the following lines starting with A contain annotations, i.e., the edits needed to produce the correct sentence.

M2 file overview

The above data has to be converted into a pandas dataframe for subsequent processing. The conversion code used in the project is adapted from here.
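A minimal sketch of such an m2-to-dataframe converter is shown below. The edit-application logic is simplified to the single-annotator case (an assumption; real Lang-8 files can carry several annotators per sentence):

```python
import pandas as pd

def m2_to_dataframe(m2_text):
    """Parse an M2-format string into a dataframe of (incorrect, correct)
    sentence pairs. S lines hold the source sentence; A lines hold edits
    as 'A start end|||type|||replacement|||...'."""
    rows = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.split("\n")
        tokens = lines[0].split()[1:]           # drop the leading 'S'
        corrected = tokens[:]
        offset = 0                               # index shift caused by earlier edits
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            span, _etype, repl = line[2:].split("|||")[:3]
            start, end = map(int, span.split())
            if start == -1:                      # 'noop' edit: sentence already correct
                continue
            repl_tokens = repl.split() if repl else []
            corrected[start + offset:end + offset] = repl_tokens
            offset += len(repl_tokens) - (end - start)
        rows.append({"incorrect": " ".join(tokens),
                     "correct": " ".join(corrected)})
    return pd.DataFrame(rows)
```

Applying it to a block like `S This are a sentence .` with the edit `A 1 2|||Vform|||is|||…` yields the pair ("This are a sentence .", "This is a sentence .").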

5. Data Cleaning & Exploratory Data Analysis

5.1 Data Cleaning

In this step, all rows with missing values and all duplicate rows were removed. We also removed rows where the input and output sequences were identical, as well as sentences with fewer than 2 words.

After reducing the size of the dataset, preprocessing steps such as de-contraction and removal of special characters were performed.
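These preprocessing steps can be sketched as follows. The contraction map here is a small illustrative subset, and the exact cleaning rules are an assumption:

```python
import re

# A small illustrative contraction map; the full mapping used in
# practice would be much larger.
CONTRACTIONS = {
    "won't": "will not", "can't": "can not", "n't": " not",
    "'re": " are", "'ll": " will", "'ve": " have", "'m": " am",
}

def clean_sentence(text):
    """De-contract and strip special characters (sketch)."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # drop special characters
    return re.sub(r"\s+", " ", text).strip()      # collapse extra whitespace
```

For example, `clean_sentence("I can't go!")` returns `"i can not go"`.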

5.2 Univariate Analysis

5.2.1 Character count of correct sentences

The following plots show KDE and percentile distribution of length of correct sentences.

5.2.2 Character count of incorrect sentences

The following plots show KDE and percentile distribution of length of incorrect sentences.

5.2.3 Word count of correct sentences

The following plots show KDE and percentile distribution of word count for correct sentences.

5.2.4 Word count of incorrect sentences

The following plots show KDE and percentile distribution of word count for incorrect sentences.

Observations

  1. The character-count distributions for correct and incorrect sentences show that most sentences are between 50 and 80 characters long.
  2. The word-count distribution shows that most sentences contain 12 to 37 words.
  3. The distributions are heavily skewed, which indicates that extreme outliers are present.

5.3 Bivariate Analysis

5.3.1 Correct sentence character count vs Incorrect sentence character count

5.3.2 Correct sentence word count vs Incorrect sentence word count

Observations

  1. The character lengths and word counts of correct and incorrect sentences are strongly linearly correlated, which tells us their distributions are closely similar.
  2. Prima facie, we can state that heavy edits are not made while correcting a sentence.

5.4 Stop word Analysis

Stop words are the most commonly used words in a language, such as “the”, “a”, “an”, etc. As these words are typically short, they may have caused the above distributions to be skewed.

Observations

  1. Around 46% of stop-word occurrences come from to, the, is, a, and, in, of, my, not, and it, with individual shares ranging from 2.57% to 7.96% in both correct and incorrect sentences.

5.5 Word Cloud

A word cloud represents an input corpus such that the size and color of each word indicate its frequency or importance.

Observations

  1. We can observe that words like But, Also, elderly, One, need, Japan, and think are frequently used in both correct and incorrect sentences.

5.6 Sentiment analysis

Sentiment analysis determines whether a text is positive, negative, or neutral. We have used the TextBlob library to calculate polarity scores.

5.7 Text complexity using Flesch Reading Ease (FRE)

FRE score represents readability of the text. The score is interpreted as follows.

Observations

  1. Around 67% of the text is categorized as readable at a 5th-grade level or above.
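The FRE score follows a fixed formula: 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words). A small sketch, using a crude vowel-group heuristic for syllable counting (an approximation; dedicated readability libraries are more accurate):

```python
import re

def count_syllables(word):
    """Approximate syllables as the number of vowel groups."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)

def flesch_reading_ease(text):
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / max(len(sentences), 1)
            - 84.6 * syllables / max(len(words), 1))
```

Higher scores mean easier text; 90-100 roughly corresponds to a 5th-grade reading level.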

6. Data preparation for model

6.1 Start and End token to decoder input and output

As we will be using the encoder-decoder architecture for modelling, the data for the decoder has to be prepared as follows.

The ground truth is sent as input to the decoder with a ‘<start>’ token added at the beginning, and the target output is the ground truth with an ‘<end>’ token added at the end. This is called the teacher forcing technique, where we give the ground truth as input to the decoder along with the states learned by the encoder.

Decoder Overview
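The token bookkeeping described above is a one-liner per sequence; a sketch (the function name is illustrative):

```python
def prepare_decoder_sequences(correct_sentences):
    """Teacher forcing prep: the decoder input gets '<start>' prepended,
    and the decoder target gets '<end>' appended."""
    decoder_input = ["<start> " + s for s in correct_sentences]
    decoder_target = [s + " <end>" for s in correct_sentences]
    return decoder_input, decoder_target
```

At each step the decoder then sees the ground-truth word, not its own (possibly wrong) previous prediction, which stabilises training.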

6.2 Tokenizing and padding

The raw text data can’t be sent directly to the model; the model accepts only numerical representations of text. We first have to tokenize the data so that each unique word in the corpus gets a unique integer index. This is performed using the TensorFlow tokenizer API.

Here, the encoder tokenizer learns from the incorrect-sentence corpus and the decoder tokenizer from the correct sentences. The encoder tokenizer represents 51,949 words and the decoder tokenizer 43,124.
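A minimal sketch of this step using the Keras `Tokenizer` and `pad_sequences` APIs; the two-sentence corpora are toy stand-ins for the Lang-8 columns:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative corpora; the real ones are the incorrect/correct columns.
incorrect = ["he go to school", "she have a apple"]
correct = ["<start> he goes to school <end>", "<start> she has an apple <end>"]

# filters='' keeps the angle brackets of '<start>'/'<end>' intact
encoder_tok = Tokenizer(filters='')
encoder_tok.fit_on_texts(incorrect)
decoder_tok = Tokenizer(filters='')
decoder_tok.fit_on_texts(correct)

# Convert words to integer indices and pad to a common length
enc_seqs = pad_sequences(encoder_tok.texts_to_sequences(incorrect),
                         padding='post')
dec_seqs = pad_sequences(decoder_tok.texts_to_sequences(correct),
                         padding='post')
```

Note the `filters=''` argument: the default filter set would strip the `<` and `>` characters and merge the start/end tokens with ordinary words.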

7. Model Training

7.1 Encoder Decoder

The model architecture can use recurrent cells such as LSTM, GRU, or vanilla RNN. The encoder takes the incorrect sentence as input and represents it in the form of hidden and cell states. The decoder, which also consists of LSTM, GRU, or RNN cells, is initialised with the encoder representation and fed the ground truth. Given the ‘<start>’ token, it attempts to predict the next word in the sentence, iteratively minimizing the loss and updating the weights.

Encoder Decoder Architecture
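A minimal Keras sketch of the LSTM variant of this architecture (the vocabulary and layer sizes are illustrative assumptions, not the values used in the project):

```python
from tensorflow.keras import layers, Model

# Hypothetical sizes; the real ones come from the fitted tokenizers
ENC_VOCAB, DEC_VOCAB, EMB_DIM, UNITS = 1000, 1000, 64, 128

# Encoder: embeds the incorrect sentence, keeps the final LSTM states
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(ENC_VOCAB, EMB_DIM)(enc_inputs)
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_emb)

# Decoder: consumes the ground truth (teacher forcing), initialised
# with the encoder's hidden and cell states
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(DEC_VOCAB, EMB_DIM)(dec_inputs)
dec_out, _, _ = layers.LSTM(UNITS, return_sequences=True,
                            return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# Per-time-step softmax over the decoder vocabulary
outputs = layers.Dense(DEC_VOCAB, activation='softmax')(dec_out)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```

The `sparse_categorical_crossentropy` loss matches the categorical cross-entropy objective stated earlier while letting the targets stay as integer indices rather than one-hot vectors.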

GLEU Score-0.226

7.2 Attention mechanism

The attention mechanism is incorporated into neural networks to replicate the human behavior of selectively grasping information. While correcting a sentence grammatically, we focus on the specific part of the sentence that needs “attention”. E.g., while correcting the sentence “I had a English class”, we focus on “a”, since in the context of the word “English” it should be “an”.

Ref- https://blog.floydhub.com/attention-mechanism/

The encoder remains the same as in the plain encoder-decoder architecture. In the decoder, for every time step we compute a context vector which holds relevant information from the encoder, such that more weight is given to the states of the words that deserve more attention. It is calculated as follows:

  1. First, we take all encoder hidden states (hs) and the decoder hidden state from the previous time step (ht), and calculate attention scores between them. The scores are passed through a softmax, which ensures the attention values range from 0 to 1 and form a probability distribution. For this case study we have implemented the “dot” scoring function.

Luong proposed the following ways of calculating the score.

2. Once we have the attention scores, we calculate the context vector by multiplying the encoder hidden states by the attention scores and summing.

3. For the decoder input at time step ‘t’, the context vector is concatenated with the output state from the previous time step ‘t-1’.

4. This process is repeated iteratively until the ‘<end>’ token is reached.
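Steps 1-3 for a single decoder time step can be sketched in numpy. This is an illustration of Luong's “dot” scoring followed by the softmax, context vector, and concatenation (shapes and names are my own):

```python
import numpy as np

def dot_attention(decoder_state, encoder_states):
    """One decoder time step of Luong 'dot' attention.
    decoder_state: (units,); encoder_states: (src_len, units)."""
    scores = encoder_states @ decoder_state      # dot score per source position
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()                     # attention distribution over source
    context = weights @ encoder_states           # weighted sum of encoder states
    return np.concatenate([context, decoder_state]), weights
```

Source positions whose encoder states align with the current decoder state receive attention weights close to 1 and dominate the context vector.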

GLEU Score-0.412

8. Post Training Analysis

Now we will analyze the best model’s predictions and the corresponding GLEU scores.

8.1 Box Plot and percentile values of GLEU Score

Box Plot of GLEU Score

Observations

  1. The minimum and maximum GLEU scores are 0 and 1.
  2. The median GLEU score is 0.346, while the IQR ranges from 0.20 to 0.529.

Now we divide the data into 3 subsets: low-, medium-, and high-score predictions. All sentences with a GLEU score below 0.3 go into the low dataset, scores between 0.3 and 0.6 into the medium dataset, and scores above 0.6 into the high dataset.

8.2 Count plot of word count

As the word count of a sentence increases, the metric drops. Most sentences in the low-score dataset have a word count of 11 or 12; in the medium-score dataset, most sentences have around 9 words; in the high-score dataset, most have 8 or 9 words.

8.3 Word Cloud

9. Deployment

Github Repository-

Linkedin Profile-

10. Future Work

  1. We can attempt to improve the score further by using other attention mechanisms.
  2. The model can be trained with a more diverse dataset.
  3. Transformer-based models can be utilized.

11. References

  1. Applied AI Course
  2. Word Level English to Marathi Neural Machine Translation using Encoder-Decoder Model | by Harshall Lamba | Towards Data Science
  3. Intuitive Understanding of Attention Mechanism in Deep Learning | by Harshall Lamba | Towards Data Science
  4. Lang-8 Corpus of Learner English Dataset
  5. https://arxiv.org/pdf/1508.04025.pdf
  6. Attention Mechanism
  7. Exploratory Data Analysis for Natural Language Processing: A Complete Guide to Python Tools — neptune.ai


Rohan Sawant

A data science professional with the skills to analyze and solve real-world industry problems, and the ability to understand and research business needs.