Training a Grammar Error Correction (GEC) Model from Scratch with Marian NMT and Using It with Huggingface’s Sequence to Sequence Transformer API
Objective
The project requirement was simple: develop a web service to correct sentences written in English, using Huggingface's Transformer API. Why did I consider this project simple? Well, because if you have a trained model, the NLP part basically boils down to a few lines of Python code, like the snippet below (adapted from here).
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained('path/to/gec/model')
model = MarianMTModel.from_pretrained('path/to/gec/model')
tokenized_text = tokenizer(text, return_tensors="pt")
tk_corrected_text = model.generate(**tokenized_text)
corrected_text = [tokenizer.decode(t, skip_special_tokens=True) for t in tk_corrected_text]
print(corrected_text)
So the only thing left to finish the project was to find a good pretrained GEC model to load into Huggingface's MarianTokenizer and MarianMTModel Python classes. Needless to say (otherwise I wouldn't be writing this article), I couldn't find one.
Shortly I will address what Marian NMT is and why a Machine Translation (MT) framework is used to build and train a GEC model, but before that it is worth stating the objective of this article and who might benefit from reading it.
While Marian NMT is a stable and remarkable piece of software, its documentation, IMHO, is fair at best. If you are going to use a model developed and trained in this framework in a scenario significantly different from the ones on its examples page (my case), there are important gaps in that documentation. The situation is better at Huggingface's end, but, nevertheless, there are a few gaps there as well. So my objective here is to fill in those gaps while guiding the reader through a step-by-step procedure, starting from building a corpus and ending with the few lines of Python code above correcting sentences in English.
I think that anyone who needs to train a grammar error corrector from scratch for any language written in the Latin alphabet can benefit from this article, as can anyone who needs to train or fine-tune a machine translation model and, of course, use any of these models with Huggingface's sequence-to-sequence Transformer API.
The last thing worth mentioning is that this story is by no means a comprehensive tutorial on how to train a model in Marian NMT and use it with Huggingface's API.
Prerequisites:
- Basic knowledge of Natural Language Processing
- Basic knowledge of the Transformer architecture and how to tune such a model
- Bash scripting
- Python
- Java
All code used to train this model can be checked in this repo: https://github.com/baosiek/gec
Grammar Error Correction as a Machine Translation Problem
The idea is simple. Machine translation consists of automatically converting a piece of text written in one language (let's say Portuguese) into another (say, English).
For GEC, the input text is the one with possible grammar and/or spelling errors, while the output is the text with no mistakes.
The big difference is that in machine translation the two languages are different, while in GEC they are the same language. So if the input sentence has no mistakes, the input and output texts should be identical.
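For example, the source sentence "She go to school yesterday." would be paired with the target "She went to school yesterday.", while an already correct source sentence would simply be paired with itself.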
This is a somewhat minimalist explanation. If the reader wants a formal one, and a first look at some of the features used for training the model, I recommend reading the following paper: "Grammatical error correction using neural machine translation".
Marian NMT
I came to know Marian NMT while investigating how to use Huggingface API to perform grammar error correction.
Marian NMT is “an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs”. Nowadays it is “mainly being developed by the Microsoft Translator team”.
Among it’s many features, I point a few that addressed my specific needs:
- Single GPU training and batched translation on GPU and CPU;
- Transformer-based language models;
- Dynamically sized mini-batches for maximum memory usage; and
- Large mini-batches even on a single GPU via cumulative/delayed updates
Huggingface hosts more than 1,000 language-pair models that can be promptly used in any machine translation application by simply pointing the already mentioned Python classes at any of them. These models were converted into Huggingface's binary format from models trained by the Language Technology Research Group at the University of Helsinki. The latter were trained using Marian NMT, which is the reason I chose to work with this framework.
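To illustrate, loading one of those pretrained pairs takes only a handful of lines. The sketch below uses the German-to-English checkpoint as an example; any other language pair on the hub works the same way.

from transformers import MarianMTModel, MarianTokenizer

# One of the Helsinki-NLP Opus-MT checkpoints hosted by Huggingface
# (German to English); any other language pair works the same way.
model_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Maschinelle Übersetzung ist nützlich."], return_tensors="pt")
translated = model.generate(**batch)
print(tokenizer.decode(translated[0], skip_special_tokens=True))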
Installing Marian NMT in Ubuntu or Pop!_OS 20.04
Installing Marian NMT is a straightforward process. For this project I installed it as guided here. Nevertheless, the list of steps I took follows, considering that my Linux distribution is Pop!_OS 20.04 LTS, an Ubuntu-based distribution, running with one GeForce RTX 3070:
- Install prerequisite packages from Ubuntu repositories. In my case these were already installed.
sudo apt install git cmake build-essential libboost-system-dev openssl libssl-dev libgoogle-perftools-dev
- As we will be using SentencePiece, the following packages are also necessary.
sudo apt install libprotobuf17 protobuf-compiler libprotobuf-dev
- Clone Marian NMT from git.
git clone https://github.com/marian-nmt/marian
- Enter the cloned directory, create the build directory and change directory (cd) into it.
cd marian
mkdir -p build
cd build
- Execute cmake as follows.
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=ON
- Execute make with 8 cores. If the reader has fewer or more cores, adjust this number accordingly.
make -j 8
- This took an hour or so to conclude. To test if Marian was compiled with SentencePiece, execute from the build directory:
./marian --help |& grep sentencepiece
If the grep output lists the SentencePiece-related options, Marian was compiled with SentencePiece support; no output means it was not.
Preparing the Corpus
The corpus used for training, validation and testing was built by merging two datasets. The first one is Google Research's cLang8 dataset for grammatical error correction. The second one is the OpenWebText Corpus, "an open source effort to reproduce OpenAI's WebText dataset", for which there is a distribution created by Aaron Gokaslan and Vanya Cohen from Brown University.
At the time of writing this story I have no publicly available storage where I can upload the processed corpus to make it available. If and when I have one, the link to it will be published in this space.
cLang8
After preparing cLang8, the English portion of the dataset consists of 2,372,119 pairs of sentences, where the first component of each pair is the source sentence, i.e., the one with mistakes, and the second component is the target, i.e., the corrected sentence. The cLang8 website states the following: [All corrected sentences were] "generated by our state-of-the-art GEC method called gT5. The method is described in our ACL-IJCNLP 2021 paper."
The process to download and prepare the corpus is straightforward. The instructions can be found here. There are three things I had to do in order to use this dataset with Marian's transformer architecture:
- In the mentioned instructions there is the following recommendation: "Running the above script takes about 1 hour when spaCy tokenization is enabled (recommended to make tokenization consistent with ConLL-14 and BEA eval sets)". Do not tokenize at all, because down the line we will be building SentencePiece models, which apply their own tokenization. Check 'run.sh', making sure the 'tokenize_text' parameter is set to 'False', like below:
python -m prepare_clang8_dataset \
--lang8_dir="${LANG8_DIR}" \
--tokenize_text='False' \
--languages='ru,de,en'
- cLang8 content was sourced from Lang-8, an online language-learning resource where learners post their texts and native speakers, in turn, correct them. Because of this, characters are not normalized. An example is the apostrophe: it is oftentimes represented with U+2019 (right single quotation mark) instead of U+0027 (apostrophe), causing problems in the text preprocessing required to feed the trainer. To remedy that I developed the UnicodeNormalizer.java class, which first converts mapped UTF-16 characters to their respective ASCII counterparts; second, converts remaining non-ASCII characters to white space; and third, erases all non-printing characters except for the tab (a minimal Python sketch of these three steps follows right after this list).
- I divided this dataset into three parts: one for training, with 98% of the sentences, and two additional ones, each with 1% of the sentences, for validation and testing. The validation and test sets are small because of computing resource limitations.
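For illustration, here is a minimal Python sketch of the three normalization steps implemented in UnicodeNormalizer.java; the character map is just a small, hypothetical excerpt, and the authoritative version is the Java class in the repository.

# Minimal sketch of the three steps in UnicodeNormalizer.java.
# The mapping below is only a small illustrative excerpt.
ASCII_MAP = {
    "\u2019": "'",  # right single quotation mark -> ASCII apostrophe
    "\u2018": "'",  # left single quotation mark  -> ASCII apostrophe
    "\u201c": '"',  # left double quotation mark  -> ASCII quote
    "\u201d": '"',  # right double quotation mark -> ASCII quote
}

def normalize_line(line: str) -> str:
    # 1. Convert mapped UTF-16 characters to their ASCII counterparts.
    for source, target in ASCII_MAP.items():
        line = line.replace(source, target)
    # 2. Convert remaining non-ASCII characters to white space.
    line = "".join(ch if ord(ch) < 128 else " " for ch in line)
    # 3. Erase non-printing characters, keeping the tab.
    return "".join(ch for ch in line if ch == "\t" or ch.isprintable())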
OpenWebText Corpus
OpenWebText is a huge dataset (approx. 38 GB) of 8,013,769 unique documents written in English, each with at least 128 tokens. One has to bear in mind that this dataset is not meant specifically for training a GEC model, meaning there are no pairs of documents like cLang8 has. To remedy that I will be using a process called back translation, based on a model trained on cLang8. More on that follows.
Besides applying the already mentioned Unicode normalizer, I also filtered out all documents containing a string such as "0000966-a1a03443bc32fdcb2a9df262fe12f1fe.txt 0000644 0000000 0000000 00000014666 00000000000 015376 0 ustar 0000000 0000000", using the following regular expression:
"[0-9]{7}-[0-9a-z]{32}.txt ([0-9]{7} ){2,3}([0-9]{11} ){2}[0-9]{6} [0-9]? ustar ([0-9]{7} ?){2}"
Additionally, I removed all documents containing HTML lines, mostly consisting of JavaScript code.
As with cLang8, I divided these documents among three files, but since the number of sentences here is huge, 99.8% of them were reserved for training, with the remaining 0.2% (about 40,000 sentences) split between validation and testing. A rough sketch of this filtering and splitting follows below.
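The sketch below shows the idea in Python, with hypothetical file names and one document per line; the actual scripts are in the repository.

import random
import re

# The 'ustar' header pattern quoted above and a crude check for leftover HTML/JavaScript markup.
USTAR_RE = re.compile(
    r"[0-9]{7}-[0-9a-z]{32}\.txt ([0-9]{7} ){2,3}([0-9]{11} ){2}[0-9]{6} [0-9]? ustar ([0-9]{7} ?){2}")
HTML_RE = re.compile(r"</?(html|script|div|span)[^>]*>", re.IGNORECASE)

def keep(document: str) -> bool:
    return not USTAR_RE.search(document) and not HTML_RE.search(document)

# Split the kept documents 99.8% / 0.1% / 0.1% into train, validation and test files.
random.seed(1111)
with open("openwebtext.normalized.txt", encoding="utf-8") as source, \
        open("owt.train.txt", "w", encoding="utf-8") as train, \
        open("owt.valid.txt", "w", encoding="utf-8") as valid, \
        open("owt.test.txt", "w", encoding="utf-8") as test:
    for document in source:
        if not keep(document):
            continue
        draw = random.random()
        target = train if draw < 0.998 else (valid if draw < 0.999 else test)
        target.write(document)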
The Development Steps
While the project was divided into the five steps below, this story covers just the first one. Everything I developed is ready to implement steps 2 to 5.
- Train the first model with cLang8 and assess its performance;
- Train a back translation model;
- Decode OpenWebText with the back translation model;
- Merge cLang8 and OpenWebText decoded; and
- Train on the merged corpus and assess the second model, comparing performance gains against the first one.
The Model
The transformer model used was basically Google's model, described in "Attention Is All You Need", with one adjustment. In Google's model, after each sub-layer, there is a normalization layer. In my case I added a dropout layer, trying to avoid the vanishing gradient problem.
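To make that adjustment concrete, below is a minimal PyTorch-style sketch (an illustration under my reading of the setup, not Marian's actual code) of the post-processing applied to each sub-layer output, matching the --transformer-postprocess dan option used later, where 'd', 'a' and 'n' stand for dropout, residual add and layer normalization.

import torch
import torch.nn as nn

class SublayerPostprocess(nn.Module):
    """Dropout ('d'), residual add ('a') and layer normalization ('n')
    applied to the output of each transformer sub-layer."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, residual: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # d: dropout on the sub-layer output, a: add the residual input, n: normalize
        return self.norm(residual + self.dropout(sublayer_out))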
Corpus Preprocessing
Huggingface’s MarianTokenizer and MarianMTModel expects a vocabulary containing all the words in both source sentences, in this case sentences with grammar or spelling mistakes, and target sentences, in this case the corrected sentences.
Additionally it expects two Sentence Piece models; one for source sentences and the other for target sentences.
The only way I managed to make trained models in Marian NMT work in Huggingface’s API was to train the sentence piece models outside of Marian (although using the Sentence Piece installed together with Marian); export respective vocabularies form trained models; merge them into one vocabulary; and train the grammar corrector model with the merged vocabulary (the same for both source and target sentences) in Marian NMT.
I developed a bash script named “preprocess_to_huggingface.sh” to perform the above preprocessing steps.
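The gist of that script, sketched here in Python with hypothetical file names and an assumed vocabulary size (the authoritative version is preprocess_to_huggingface.sh in the repository), is roughly the following.

import sentencepiece as spm
import yaml

# Train one SentencePiece model per side of the parallel corpus.
spm.SentencePieceTrainer.train(
    input="corpus.train.src", model_prefix="source", vocab_size=32000)
spm.SentencePieceTrainer.train(
    input="corpus.train.trg", model_prefix="target", vocab_size=32000)

def read_spm_vocab(path: str) -> list:
    # Each line of a *.vocab file is "<token>\t<log probability>".
    with open(path, encoding="utf-8") as vocab_file:
        return [line.split("\t")[0] for line in vocab_file]

# Merge the two vocabularies, keeping the first occurrence of each token.
merged, seen = [], set()
for token in read_spm_vocab("source.vocab") + read_spm_vocab("target.vocab"):
    if token not in seen:
        seen.add(token)
        merged.append(token)

# Write the single vocab.yml (token -> id) used for both sides during training.
with open("vocab.yml", "w", encoding="utf-8") as out:
    yaml.safe_dump({token: idx for idx, token in enumerate(merged)},
                   out, allow_unicode=True)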
Training the Model
The command I used to train the model, extracted from "train_to_huggingface.sh", follows, where $L1 refers to the source sentences and $L2 to the target sentences:
$MARIAN/build/marian \
--devices $GPUS \
--type transformer \
--model $MODEL/model.npz \
--train-sets $CORPUS/corpus.train.encoded.batch.$L1 $CORPUS/corpus.train.encoded.batch.$L2 \
--vocabs $MODEL/vocab.yml $MODEL/vocab.yml \
--dim-vocabs $VOCAB_SIZE $VOCAB_SIZE --max-length 100 \
--guided-alignment $CORPUS/corpus.aligned.$L1-$L2 \
--mini-batch 16 --valid-mini-batch 8 -w 4000 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout 0.1 --label-smoothing 0.1 \
--early-stopping $EARLY_STOP \
--tied-embeddings-all --sync-sgd \
--valid-freq 100000 --save-freq 100000 --disp-freq 10000 \
--cost-type ce-mean-words --valid-metrics ce-mean-words bleu-detok \
--valid-sets $CORPUS/corpus.valid.encoded.$L1 $CORPUS/corpus.valid.encoded.$L2 \
--log $MODEL/train.log --valid-log $MODEL/valid.log --tempdir $MODEL/model \
--overwrite --keep-best --shuffle data \
--seed 1111 --exponential-smoothing \
--enc-depth 6 --dec-depth 6 --transformer-heads 8 \
--normalize 0.6 --beam-size 6 --quiet-translation \
--lr-warmup 32000 --lr-decay-inv-sqrt 32000 \
--learn-rate 0.00003 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5
Training and validation are performed with all texts encoded with the respective SentencePiece model.
The parameters:
--mini-batch 16 --valid-mini-batch 8 -w 4000
were tuned to enable training on a GeForce RTX 3070 with 8 GB of memory (-w is Marian's workspace size, in MB).
The parameter:
--early-stopping $EARLY_STOP
was set to 10. As I will discuss soon, there are indications that this was too early (underfitting); increasing this parameter would increase the number of epochs, which is one way to reduce underfitting.
The parameter:
--learn-rate 0.00003
is low in my experience but, together with the dropout layer, it was the only way I managed to avoid the vanishing gradient problem. Marian's default learning rate is 0.0001.
The aligned corpus as in:
--guided-alignment $CORPUS/corpus.aligned.$L1-$L2
was obtained following the guide provided with fast_align, which also comes with the Marian installation. The code to perform the alignment follows, extracted from "train_to_huggingface.sh":
paste $CORPUS/corpus.train.encoded.batch.$L1 $CORPUS/corpus.train.encoded.batch.$L2 > $TEMP/align.$L1-$L2
sed -i 's/\t/ ||| /g' $TEMP/align.$L1-$L2
$TOOLS/fast_align/build/fast_align -vdo -i $TEMP/align.$L1-$L2 > $TEMP/forward.align.$L1-$L2
$TOOLS/fast_align/build/fast_align -vdor -i $TEMP/align.$L1-$L2 > $TEMP/reverse.align.$L1-$L2
$TOOLS/fast_align/build/atools -c grow-diag-final -i $TEMP/forward.align.$L1-$L2 -j $TEMP/reverse.align.$L1-$L2 > $CORPUS/corpus.aligned.$L1-$L2
The parameters:
--cost-type ce-mean-words
--valid-metrics ce-mean-words bleu-detok
mean that for each mini-batch the cost function is the per-word cross-entropy (CE) averaged over the words of the batch, and that, during validation, the detokenized BLEU score, a metric popular in MT, is computed in addition to CE.
The training chart spans 62 epochs, which is the number of times all sentences were fed to the model during training. That took a little over three days and nineteen hours to finish on my machine. There are visual and statistical indications (a linear regression on the last 20 epochs of the curve) that training should have continued (underfitting). Around epoch 25, training almost stopped because it reached 9 validations (each validation was performed every 100,000 sentences, as per the "--valid-freq 100000" parameter) without improving validation CE. It seems that with the chosen learning rate the early-stopping parameter should have had a higher value.
Performance
The cLang8 dataset is too small to cope with the challenges of GEC. Thus an underfitted model was expected, but due to the lack of training examples and not to early stopping.
BLEU scores cannot be compared easily: many choices influence the score, such as normalization, tokenization and others. I could have provided Marian with the path to a validation script, but I used its built-in scorer instead (at this moment it isn't clear to me whether it is the same as "multi-bleu-detok.perl", the one I used for testing; my validation script is "to_validate_bleu.sh"). The best score achieved during validation was 76.52 and on the test corpus it was 77.70. The higher the BLEU, the better the model is performing, so the test score being higher than the validation one was unexpected for an underfitted model.
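For readers who want a score that is easier to compare across papers, a small sketch using the sacrebleu Python package (with hypothetical file names; this is not how the numbers above were produced) could look like the following.

import sacrebleu

# Detokenized system output and reference corrections, one sentence per line.
with open("corpus.test.hypothesis.en", encoding="utf-8") as hyp_file:
    hypotheses = [line.strip() for line in hyp_file]
with open("corpus.test.reference.en", encoding="utf-8") as ref_file:
    references = [line.strip() for line in ref_file]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")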
Despite the advances in NLP, GEC is still a challenging task for computers, and assessing GEC performance isn't simple. An interesting discussion can be found here. Specifically, there is the problem of overcorrection, in which the model changes sentences that are already correct. In other words, if a correct sentence is given to the model, the output should be the same sentence.
Practically all spelling mistakes were corrected. Nevertheless, in a significant number of cases the model replaced the misspelled word instead of correcting it. An example:
Incorrect: “Harry was transfered to New York where he aquired an apartment.”
Model: “Harry was transferred to New York where he established an apartment.”
The model corrected “transfered” to “transferred”, but changed “aquired” to “established” instead of “acquired”.
Some subject-verb agreement mistakes were detected and corrected, but not all:
Incorrect: “You is supposed to write the document”
Model: “You are supposed to write the document.” (corrected)
But:
Incorrect: “Is you supposed to write the document?”
Model: “Is you supposed to write the document?” (not corrected)
Run-on sentences were detected and corrected most of the time, while punctuation in compound sentences was not:
Incorrect: “Mary is a clever girl, she began writing when she was three year old.”
Model: “Mary is a clever girl. She began writing when she was three years old.”
And:
Incorrect: “The pilot lost his map but he still got to his destiny.”
Model: “The pilot lost his map but he still got to his destiny.”
The corrected sentence should be “The pilot lost his map, but he still got to his destiny.”
Exporting Model to Huggingface
Huggingface provides a handy tool to convert models trained with Marian NMT so they can be used with its APIs. The Python script is convert_marian_to_pytorch.py.
python convert_marian_to_pytorch.py --src marian_dir/ --dest hugg_dir
The above command converts the files in marian_dir (which needs to be created), storing the converted ones in hugg_dir. These directory names are just examples; use whatever suits you best.
The following files need to be in marian_dir, otherwise expect an error:
source.spm
target.spm
vocab.yml
model.npz
decoder.yml
If you follow the scripts I provided, all these files will be in the model directory. Consult the script transfer_to_huggingface.sh for further information. With the provided training configuration, the best models for both cross-entropy and BLEU are saved. Choose one of them and rename the respective model and decoder files to the names the converting Python script expects. Example:
cp model/model.npz.best-bleu-detok.npz marian_dir/model.npz
cp model/model.npz.best-bleu-detok.npz.decoder.yml marian_dir/decoder.yml
Using the Model in Huggingface
The Python code at the beginning of this story turns into the code below, with one extra line:
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(hugg_dir)
model = MarianMTModel.from_pretrained(hugg_dir)
tokenized_text = tokenizer(text, return_tensors="pt")
tk_corrected_text = model.generate(**tokenized_text)
corrected_text_tokenized = [tokenizer.decode(t, skip_special_tokens=True) for t in tk_corrected_text]
detokenized = corrected_text_tokenized[0].replace("▁'", "'").replace("▁", " ").lstrip()
print(detokenized)
The function “from_pretrained()” in both MarianTokenizer and MarianMTModel points to hugg_dir, the directory where the converted files above are located.
The extra line (the detokenization step) is due to the use of SentencePiece. The text to be corrected is the content of the variable "text" in the line "tokenized_text = tokenizer(text, return_tensors="pt")" above.
Conclusion
The model stopped training when CE stalled, but BLEU was clearly still improving. The performance of this model is tied to the performance of gT5, the corrector used to generate cLang8's corrected sentences.
To improve performance there are at least two possibilities as far as training corpus size goes:
- Back-translation, for which the scripts I developed are ready to execute: development steps 2–5, using OpenWebText, can be followed, provided adequate computing power is available; or
- Google's C4_200M Synthetic Dataset for Grammatical Error Correction, which also requires substantial computing power.
Augmentation like the two options above is another tool to reduce underfitting. On one side I was pleased with Marian NMT as a framework for training a GEC model and using it with Huggingface's API; on the other, the model clearly underfitted, and no one can be pleased with that, even though it was expected with so few sentences available for training.