Multi-headed model based on BERT to solve Grammatical Error Correction tasks more efficiently

Julia Shaptala
Beyond WebSpellChecker
6 min read · Sep 27, 2019

Development of NLP-related solutions is gaining momentum.

Thanks to the family of pre-trained models built on the Transformer architecture, such as BERT and GPT, NLP tasks, and Grammatical Error Correction (GEC) in particular, can be handled far more efficiently.

In this story I’d like to talk about the WebSpellChecker approach to GEC. We developed it while participating in the Shared Task: Grammatical Error Correction, run within the framework of the Building Educational Applications (BEA) 2019 Workshop. As WebSpellChecker CEO and a co-author of the paper, I attended the workshop, which took place in Florence, Italy, on the 2nd of August, 2019, and presented our model and the results of our participation in a poster session to conference participants from all over the world.

Before the conference, our Deep Learning Engineer, BDidenko, carried out extensive research and developed a model. For the BEA shared task, we also created and published an accompanying paper. Below is a summary of our model along with its findings and future milestones.

Background

The shared task was aimed at creating innovative approaches to automatic correction of all types of errors in written text.

Participants had access to datasets representing various levels and domains of the English language.

The end goal of the competition was to transform incorrect sentences given as an input into correct equivalents as an output.

To create a unique system for solving GEC tasks, we took several steps detailed below.

Data Preprocessing

Datasets in M2 format (a standard format for annotated GEC data; an illustrative example follows the list) included:

  • The First Certificate in English (FCE);
  • Lang-8 Corpus of Learner English (Lang-8);
  • The National University of Singapore Corpus of Learner English (NUCLE);
  • English Write & Improve (W&I) and LOCNESS corpus (W&I+LOCNESSv2.1).
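For readers who haven’t worked with M2, here is a minimal, made-up example and a tiny parser. The sentence and the error-type labels are invented for illustration and are not taken from the corpora above.

```python
# Illustrative M2 entry (invented): "S" is the tokenized source sentence,
# each "A" line is one edit: token span, error type, correction, annotator id.
m2_entry = """S This are a example sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 3 4|||R:DET|||an|||REQUIRED|||-NONE-|||0"""

def parse_m2(entry):
    lines = entry.strip().split("\n")
    tokens = lines[0][2:].split()                     # drop the leading "S "
    edits = []
    for line in lines[1:]:
        span, err_type, correction = line[2:].split("|||")[:3]
        start, end = map(int, span.split())
        edits.append((start, end, err_type, correction))
    return tokens, edits

tokens, edits = parse_m2(m2_entry)
# edits -> [(1, 2, 'R:VERB:SVA', 'is'), (3, 4, 'R:DET', 'an')]
```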

Despite being renewed and improved, the datasets had several issues:

  • lots of irrelevant (noisy) data;
  • the complex form of the information presented in M2 format.

To eliminate them, we started with pre-processing.

  1. First off, we adjusted the form of the information in the datasets by combining related changes represented by several edit operations: U (Unnecessary), M (Missing), and R (Replacement). A rough sketch of this merging step follows the figure below.
The example shows the process and results of adjusting the information in the datasets.
Adjusting the data form
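The paper describes the exact rules; as a rough, hypothetical sketch, each annotation can be treated as a (start, end, replacement) span over the source tokens (M edits are zero-width insertions, U edits have an empty replacement), and touching spans can be folded into a single replacement:

```python
def merge_adjacent_edits(edits):
    """Fold adjacent/overlapping (start, end, replacement) spans into one.

    Hypothetical sketch: M (Missing) edits are zero-width spans, U (Unnecessary)
    edits have an empty replacement, R edits replace the span with new text.
    """
    merged = []
    for start, end, repl in sorted(edits):
        if merged and start <= merged[-1][1]:          # touches the previous span
            prev_start, prev_end, prev_repl = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end),
                          (prev_repl + " " + repl).strip())
        else:
            merged.append((start, end, repl))
    return merged

# e.g. a deletion of tokens 3-4 followed by an insertion at token 4
# collapses into one replacement of tokens 3-4:
print(merge_adjacent_edits([(3, 4, ""), (4, 4, "an")]))   # [(3, 4, 'an')]
```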

2. Then, we got rid of the noisy data. For this purpose, we used Textual Semantic Similarity.

In this approach, a source sequence is compared with the sequence after corrections. Sentence pairs whose similarity is equal to or above a pre-established threshold are accepted, while the rest are discarded.

The chart shows pairwise semantic similarity of an original sentence and a corrected sentence on a scale of zero to one.
Credit: Advances in Semantic Textual Similarity — Google AI Blog

In technical terms, the similarity score is calculated as the scalar (dot) product of the 512-dimensional sentence vectors generated by the Universal Sentence Encoder.
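A rough sketch of this filtering step is below. The 0.87 threshold comes from our experiments described further on; the specific TF Hub module version and the explicit normalization are assumptions made for the example.

```python
import tensorflow_hub as hub
import numpy as np

# Universal Sentence Encoder produces 512-dimensional sentence embeddings.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def similarity(source, corrected):
    """Scalar product of the (normalized) USE embeddings of the two sentences."""
    a, b = use([source, corrected]).numpy()
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    return float(np.dot(a, b))

def filter_pairs(pairs, threshold=0.87):
    """Keep only (source, corrected) pairs whose similarity reaches the threshold."""
    return [(src, cor) for src, cor in pairs if similarity(src, cor) >= threshold]
```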

Having processed about 600K sentences from the given datasets, we found that the majority of them had a similarity score lower than 0.87 (see the figure below); we considered such sentences noise and unsuitable for further training.

The graph shows the results of filtering noisy data using Textual Semantic Similarity analysis.
Data filtering results

3. Finally, we flattened the data.

In order to extend the datasets after discarding the majority of sentences, we converted every sentence with N edits into N sentences with one edit each (see the example below; a code sketch of this step follows the figure).

Find more details on how we did it in the original paper.

The scheme shows the results of data flattening by extending the number of sentences.
Extending datasets
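One way to read this flattening step in code is sketched below; apply_edit and the (start, end, replacement) edit format are hypothetical helpers, not the exact implementation from the paper.

```python
def apply_edit(tokens, edit):
    """Apply a single (start, end, replacement) edit to a token list."""
    start, end, replacement = edit
    return tokens[:start] + (replacement.split() if replacement else []) + tokens[end:]

def flatten(tokens, edits):
    """Turn one sentence with N edits into N training pairs with one edit each."""
    return [(tokens, apply_edit(tokens, edit)) for edit in edits]

# One source sentence with two edits becomes two single-edit pairs.
source = ["This", "are", "a", "example", "."]
pairs = flatten(source, [(1, 2, "is"), (2, 3, "an")])
# pairs[0][1] -> ['This', 'is', 'a', 'example', '.']
# pairs[1][1] -> ['This', 'are', 'an', 'example', '.']
```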

The Multi-headed Model for GEC Tasks

Our “multi-headed” architecture consists of BERT used as the Encoder and specialized “Heads”: fully-connected networks responsible for processing the text and making corrections for particular error types.

Long story short, each module works as follows.

  1. BERT generates output that the Heads take as input.

There are several Heads distinguished by error types:

The Heads are classified along three dimensions: type of operation, type of error, and correction method.
Heads classification by error types

To learn more about Heads and their parameters, check the original paper.

2. Heads analyze each token of the BERT output, detect an error, and then act according to one of two scenarios: they either correct the error themselves or highlight its position for the Decoder to correct.

The output of ByDictionary Heads is a suggestion taken from a dictionary, while the output of ByDecoder Heads is the positions of errors, presented as a “Head type mask”.

ByDictionary Heads are classified by the type of operation: Replace Heads and Insert Heads.

ByDecoder Heads also come in two types: Replace Heads and Range Heads (Range Start and Range End), which are responsible for defining the start and end positions of an error.

The whole process is illustrated below.

A schematic representation of the multi-headed model based on BERT suggesting grammatical error corrections.
The multi-headed architecture for GEC tasks
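For intuition, the overall shape of the model can be sketched with Hugging Face transformers as follows; the Head names, dictionary sizes, and layer shapes here are placeholders, and the real Heads and their parameters are listed in the paper.

```python
import torch.nn as nn
from transformers import BertModel

class MultiHeadedGEC(nn.Module):
    """Sketch: a shared BERT encoder with one fully-connected Head per error type."""

    def __init__(self, head_vocab_sizes):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-cased")
        hidden = self.encoder.config.hidden_size
        # ByDictionary Heads: per-token logits over that Head's dictionary.
        self.dictionary_heads = nn.ModuleDict(
            {name: nn.Linear(hidden, size) for name, size in head_vocab_sizes.items()}
        )
        # ByDecoder Range Heads: per-token scores for error span start / end.
        self.range_start = nn.Linear(hidden, 1)
        self.range_end = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        outputs = {name: head(states) for name, head in self.dictionary_heads.items()}
        outputs["range_start"] = self.range_start(states).squeeze(-1)
        outputs["range_end"] = self.range_end(states).squeeze(-1)
        return outputs
```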

The “Highlight and Decode” Technique

And now a few words about the Highlight and Decode technique.

To avoid reconstructing the entire sentence, as existing solutions do, we developed a special “highlight and decode” technique.

In simple terms, one of the Heads points at a particular place to correct, and that place is passed to the Decoder as a highlighted BERT output.

A highlighted BERT output is a combination of special embeddings at the error positions detected by one of the ByDecoder Heads and zero vectors everywhere else.

In this way, the Decoder learns how to predict a suggestion only for the highlighted place.
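A minimal sketch of how such a highlighted output could be assembled is below; whether the special embedding is added to or replaces the encoder states at the error span is an assumption here, and the paper describes the actual setup.

```python
import torch
import torch.nn as nn

hidden_size = 768                                   # BERT base hidden size
highlight_embedding = nn.Parameter(torch.randn(hidden_size))

def highlight(bert_output, start, end):
    """Keep special embeddings at the error span, zero vectors everywhere else.

    bert_output: (seq_len, hidden_size) encoder states for one sentence.
    start, end:  error span predicted by a ByDecoder Range Head.
    """
    highlighted = torch.zeros_like(bert_output)
    highlighted[start:end] = bert_output[start:end] + highlight_embedding
    return highlighted   # fed to the Decoder, which predicts a suggestion for the span
```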

Criteria for Selecting Final Suggestions

Sentences were corrected iteratively. At each iteration, the model produced an output together with a probability distribution for each Head.

The Head with the highest “confidence rate” was selected, and the changes it proposed were applied.

All the edits were saved in a history of previous changes, and the process was repeated until one of the following conditions was met:

  1. the probabilities of edits in all Heads are lower than 0.5, which means all the errors are fixed;
  2. the history length exceeds 10, which means the network has tried to improve the original sentence more than ten times.

In addition, at each step we calculated the Textual Semantic Similarity between the original input and the current output; if it was more than 0.87, the loop stopped and the latest sentence from the history was used.
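Putting the three stopping conditions together, the inference loop looks roughly like the sketch below; correct_once and similarity are placeholders for the model step and the Textual Semantic Similarity score, and the exact order of the checks is an assumption.

```python
def iterative_correction(sentence, correct_once, similarity,
                         confidence_threshold=0.5, max_steps=10,
                         similarity_threshold=0.87):
    """Apply the most confident Head's edit repeatedly until a stop condition fires."""
    history = [sentence]
    current = sentence
    for _ in range(max_steps):
        corrected, best_head_confidence = correct_once(current)
        if best_head_confidence < confidence_threshold:   # no Head is confident: done
            break
        history.append(corrected)
        current = corrected
        if similarity(sentence, current) > similarity_threshold:
            break                                          # output close enough to input
    return history[-1]
```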

Takeaways and Roadmap

Unlike other participants, we didn’t generate large volumes of synthetic data or use ensembles (several networks with similar architectures combined to deliver more accurate results).

Throughout the competition we faced several challenges:

  • Since the Heads had dictionaries of different sizes and quality, they learned at different speeds, which worsened the outcome;
  • All Heads worked independently, which led to imbalances in the overall model performance.

With this model, our team achieved the results below.

A table with test results where the F-score for Span-level correction is 39.75.
Research and test results summary

Understanding the limitations of our model, we’ve created a brand-new architecture intended to improve the accuracy of the results.

A Brand-new Model for GEC Tasks

Here’s a visual representation of the advanced model for GEC tasks.

An advanced model for GEC tasks using the Transformer Decoder to predict suggestions for errors.
A brand-new model for GEC tasks

The Transformer Decoder is trained to predict suggestions for errors; the positions of the errors are obtained from the attention weights of the Decoder’s last layer.

These attention weights are reshaped and processed by a fully-connected layer.

The fully-connected layer generates a probability distribution over a “dictionary” of input positions, whose size equals the maximum length of the input sequence.
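A sketch of that position-prediction step is below; the tensor shapes, the use of only the final target step, and the padding of inputs to a fixed maximum length are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PositionFromAttention(nn.Module):
    """Sketch: turn the decoder's last-layer attention into a distribution over input positions."""

    def __init__(self, num_heads, max_input_len):
        super().__init__()
        # Inputs are assumed to be padded so that src_len == max_input_len.
        self.fc = nn.Linear(num_heads * max_input_len, max_input_len)

    def forward(self, last_layer_attention):
        # last_layer_attention: (batch, num_heads, tgt_len, src_len)
        batch, num_heads, tgt_len, src_len = last_layer_attention.shape
        # Take the attention of the final target step and flatten the heads.
        flat = last_layer_attention[:, :, -1, :].reshape(batch, num_heads * src_len)
        return torch.softmax(self.fc(flat), dim=-1)   # probability over input positions
```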

The advantages of the new approach:

  • Suggestions are predicted without reconstructing the original sequence; in this way, the network specializes specifically in correcting errors.
  • Suggestions and their positions are obtained directly as the output, so proofreading systems don’t need additional post-processing steps.

In the redesigned architecture we still use BERT; however, we’re going to replace it with next-generation pretrained models such as XLNet and RoBERTa.

The new approach can improve the accuracy of the results by at least a factor of two, reduce the time needed to produce corrections and suggestions, and eliminate additional post-processing steps.

Check below the results we got by implementing the new approach.

A table with test results for a new model where the F-score for Span-level correction is 58.78.
Research and test results summary

More articles on the brand-new model for GEC are to come. Follow WebSpellChecker on Medium and stay tuned.
