Applying Context Aware Spell Checking in Spark NLP

Image credit: Pexels.

Introduction

Take, for example, tweets, instant messaging, blog posts, OCR output, or any other user-generated text content. Being able to rely on correct data, free of spelling problems, reduces vocabulary sizes at different stages of the pipeline and improves the performance of every model in it.

More specifically, we're going to explore Spark-NLP's ContextSpellChecker annotator, a special class of Spell Checker that benefits from contextual information to both detect errors and produce the best corrections.

Spark NLP is a Natural Language Processing (NLP) library that allows you to run NLP algorithms and models in a distributed environment. Being one of the most widely used open-source libraries for NLP, and the only one that can run in parallel on a distributed cluster of computers, Spark NLP provides state-of-the-art models for many different tasks, from Tokenization, Sentence Detection, and Named Entity Recognition to contextual embeddings like BERT or ELMo. You may be interested in looking at the complete list of models.

Spark NLP Pipelines & annotations

The DocumentAssembler will take your input text and create a number of annotations inside a Spark Dataframe to represent your data. Annotations are just an alternative representation of your data, one that is useful for attaching metadata to it. A full list of members of the Annotation structure can be seen in Figure 2.

You're going to have different flavors of these annotations at every stage of the pipeline. For example, annotations coming out of the DocumentAssembler will contain the text of the document, typically a paragraph or a line, together with indexes describing the beginning and end of the text.
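To make this concrete, here is a minimal sketch of such an annotation as a plain Python object. The field names follow the Annotation structure shown in Figure 2, but the class and the values are purely illustrative, not Spark NLP code:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # fields mirroring Spark NLP's Annotation structure (see Figure 2)
    annotator_type: str        # e.g. "document" or "token"
    begin: int                 # index of the first character covered
    end: int                   # index of the last character covered
    result: str                # the text content of the annotation
    metadata: dict = field(default_factory=dict)

# a document-level annotation, as the DocumentAssembler would produce it
doc = Annotation("document", 0, 20, "I will call my siter.", {"sentence": "0"})
print(doc.result)
```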

Figure 1: A typical Spark NLP pipeline.

Continuing with our example, our next annotator, the RecursiveTokenizer, will split your document(s) into a set of tokens according to rules that depend on what you want to accomplish in the following annotators.

Next, we have our ContextSpellChecker, which will take each of these tokens, detect whether it contains an error, correct it, and return clean tokens at the output.

Finally, our last annotator, the Finisher, will take the corrected text and transform the set of annotations into a set of strings. This enables other applications, for example, to display the text to users, as reading through the annotation structure can be confusing.

Figure 2: The structure of an Annotation.

These structures can get more complex than this, depending on the use case, but we now have a good enough understanding to guide us through the rest of the topics. If you're interested in knowing more, please check this great tutorial freely available on YouTube.

Spell Checking Task

Spell checking involves two sub-tasks:

  1. Detecting which words need correction.
  2. Proposing a correction, or a list of candidate corrections, for each such word.

The first point is straightforward: you need to be accurate when detecting the words that need correction. This allows you to preserve correct words, and also makes for faster correction times.

The second point can lead to some discussion. For spell checking in the context of digital writing tools, like word processors or the keyboard on your cell phone, the task is evaluated as precision@k, meaning that we are satisfied with the model producing a suggestion list of k elements containing the right answer.

On the other hand, Spark-NLP is typically used on complete chunks of text, meaning that we don't have a human in the loop while we produce corrections. This has both a good side and a bad one. On the good side, we can use all the text surrounding a word (both before and after it) to produce a correction; on the bad side, we only have one shot to produce the right correction.

OK, too much talking so far; let's warm up with an example,

“I will call my siter.”

“Due to bad weather, we had to move to a different siter.”

“We travelled to three siter in the summer.”

The appropriate corrections, starting from the first sentence, would be {sister, site, sites}. So how did we know that? Contextual information.

If you look carefully, all three corrections are at edit distance 1 from the misspelled word,

siter -> insert 's' -> sister

siter -> remove 'r' -> site

siter -> replace 'r' with 's' -> sites

Also, the correction candidates {sister, site, sites} have very similar uni-gram probabilities, so traditional spell checking methods are going to have trouble making the right choice here.
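We can verify the edit distances with a few lines of plain Python. This is a standard Levenshtein implementation written for illustration, not Spark-NLP code:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# all three candidate corrections sit at distance 1 from the typo
for candidate in ("sister", "site", "sites"):
    print(candidate, edit_distance("siter", candidate))  # prints 1 for each
```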

Spell Checking in Spark-NLP

Going back to our ContextSpellChecker: we have all witnessed how Deep Learning changed the field of NLP in the last couple of years. Deep Learning is great, but it comes with a caveat: it's difficult to customize pretrained models to adjust for specific datasets without full retrains.

The approach the ContextSpellChecker takes is to blend the best of both worlds: benefit from the great results of Deep Learning, while still allowing some level of customization, as in traditional methods. This makes configurable, off-the-shelf, contextual spell checking possible.

Let’s have a look!

Getting Started

As usual, let's start by building a pipeline: a spell correction pipeline. This pipeline is going to include an instance of the DocumentAssembler, RecursiveTokenizer, ContextSpellChecker, and Finisher.

We will use pretrained models from our library for each of these annotators.

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "(", "[", "\n"])\
    .setSuffixes([".", ",", "?", ")", "!", "'s"])

spellModel = ContextSpellCheckerModel\
    .pretrained()\
    .setInputCols("token")\
    .setOutputCol("checked")

finisher = Finisher()\
    .setInputCols("checked")

pipeline = Pipeline(stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    finisher
])

# let's create an empty dataframe just to call fit()
empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

As usual, we built our pipeline as a sequence of annotators, feeding the output of one annotator to the input of the next. The RecursiveTokenizer comes with a default set of rules to handle English; however, you can define your own rules to handle other languages or other scenarios. The ContextSpellChecker comes pretrained from the model repository.

Finally, notice the use of the LightPipeline; this is a convenience class that allows us to play with the models without having to deal with Spark Dataframes.

OK, so we built the pipeline, let’s see what we can do with it!

lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste")
>>{'checked': ['Please', 'allow', 'me', 'to', 'introduce', 'myself', ',', 'I', 'am', 'a', 'man', 'of', 'wealth', 'and', 'taste']}

Not bad, right?

How does the ContextSpellChecker work?

The ContextSpellChecker combines three levels of information:

  1. Different correction candidates for each word (word level).
  2. The surrounding text of each word, i.e., its context (sentence level).
  3. The relative cost of different correction candidates, according to the character-level edit operations they require (subword level).

Let’s explore these different levels, and let’s see how we can customize the pretrained model in Spark-NLP to better fit our particular needs.

Word Level Corrections

Word-level corrections come from two sources:

  • a general vocabulary that is built from the training corpus during model training (and remains immutable during the life of the model), and
  • special classes for dealing with special types of words, like numbers or dates. These are configurable, and you can modify them so they adjust better to your data.

The general vocabulary is learned during training and cannot be modified; however, the special classes on a pretrained model can be updated. This means you can modify how existing classes produce corrections, but not the number or type of the classes. Let's see how we can accomplish this.

First, let's go back to our pretrained model and check which classes it has been trained with:

spellModel.getWordClasses()
>>['(_AGE_,RegexParser)',
'(_LOC_,VocabParser)',
'(_DATE_,RegexParser)',
'(_NAME_,VocabParser)',
'(_NUM_,RegexParser)']

We have five predefined classes,

  • AGE: age tokens like '21-year-old'.
  • LOC: tokens representing locations, like a city, state, country, etc.
  • DATE: tokens representing dates like 'Jan-03'.
  • NAME: tokens representing names and surnames.
  • NUM: tokens representing numbers, like 22 or twenty-two.

These classes come in two different types: some are vocabulary based and others are regex based. Let's see what this means,

  • Vocabulary based classes can propose correction candidates from the provided vocabulary, for example a dictionary of names.
  • Regex classes are defined by a regular expression, and they can be used to generate correction candidates for things like numbers. Internally, the Spell Checker will enumerate your regular expression and build a fast automaton, not only for recognizing the word (a number in this example) as valid and preserving it, but also for generating a correction candidate. Thus the regex should be finite (it must define a finite regular language).
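To get a feel for why finiteness matters, here is a toy enumeration of a small date pattern in plain Python. This is a hand-rolled illustration; the Spell Checker builds a proper automaton internally:

```python
from itertools import product

# toy finite pattern: (january|february|march)-<day>, day in 1..31
months = ["january", "february", "march"]
days = [str(d) for d in range(1, 32)]

# because the language is finite, every member can be enumerated,
# and each one is a potential correction candidate
candidates = [f"{m}-{d}" for m, d in product(months, days)]
print(len(candidates))  # 3 months x 31 days = 93 candidates
```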

Now let’s see this in action with an example. Suppose that you have a new friend from Poland whose name is Jowita, let’s see how the pretrained Spell Checker does with this name.

# Foreign name without errors
sample = 'We are going to meet Jowita in the city hall.'
lp.annotate(sample)
>>{'checked': ['We', 'are', 'going', 'to', 'meet', 'Moita', 'in', 'the', 'city', 'hall', '.']}

Well, the result is not very good; that's because this model has been trained mainly on American English texts. At least the surrounding words are helping to obtain a correction that is a name. We can do better; let's see how.

Updating a predefined word class

Vocabulary Classes

# add some more, in case we need them
spellModel.updateVocabClass('_NAME_', ['Monika', 'Agnieszka', 'Inga', 'Jowita', 'Melania'], True)
# Let's see what we get now
sample = 'We are going to meet Jowita in the city hall.'
lp.annotate(sample)
>>{'checked': ['We', 'are', 'going', 'to', 'meet', 'Jowita', 'in', 'the', 'city', 'hall', '.']}

We included Jowita, the name of our foreign friend. Much better, right?

Now suppose that we want not only to preserve the word, but also to propose meaningful corrections to the name of our foreign friend. Let's see what happens when the ContextSpellChecker receives 'Jovita' instead of 'Jowita'.

# Foreign name with an error
sample = 'We are going to meet Jovita in the city hall.'
lp.annotate(sample)
>>{'checked': ['We', 'are', 'going', 'to', 'meet', 'Jowita', 'in', 'the', 'city', 'hall', '.']}

Because we added the new word, Jowita, to the name class, the ContextSpellChecker was able to propose corrections for it, and the right correction, Jowita, was obtained. Furthermore, the new word has been treated as a name, meaning that the model used information about the typical context for names to produce the best correction.

Regex Classes

# Date with custom format
sample = 'We are going to meet her in the city hall on february-3.'
lp.annotate(sample)
>>{'checked': ['We', 'are', 'going', 'to', 'meet', 'her', 'in', 'the', 'city', 'hall', 'on', 'February', '.']}

# simple example with 3 months and days 1 to 31
spellModel.updateRegexClass('_DATE_', '(january|february|march)-([1-9]|[12][0-9]|3[01])')
lp.annotate(sample)
>>{'checked': ['We', 'are', 'going', 'to', 'meet', 'her', 'in', 'the', 'city', 'hall', 'on', 'february-3', '.']}

Now our date wasn’t destroyed!

# now check that it produces good corrections to the date
sample = 'We are going to meet her in the city hall on febbruary-3.'
lp.annotate(sample)
>>{'checked': ['We', 'are', 'going', 'to', 'meet', 'her', 'in', 'the', 'city', 'hall', 'on', 'february-3', '.']}

And the model produces good corrections for the special regex class. Remember that each regex you pass to the model must be finite. In all these examples, the new definitions for our classes didn't prevent the model from continuing to use the context to produce corrections. Let's see why being able to use the context is important.

Sentence Level Corrections

# check for the different occurrences of the word "ueather"
example2 = ["During the summer we have the best ueather.",\
"I have a black ueather jacket, so nice.",\
"I introduce you to my sister, she is called ueather."]
lp.annotate(example2)
>>[{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.']},
{'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket',
',', 'so', 'nice', '.']},
{'checked': ['I', 'introduce', 'you', 'to', 'my', 'sister', ',',
'she', 'is', 'called', 'Heather', '.']}]

Now we've seen how the context can help pick the best possible correction, and why it is important to be able to leverage the context even after other parts of the Spell Checker have been updated.

Unfortunately, the only way to customize how the context is used in the ContextSpellChecker is to train a new language model from scratch. If you want to train your own custom language model, please keep an eye on my next posts.

Advanced: Subword level corrections

Errors at the character level can come from several sources:

  • Homophones are words that sound similar but are written differently and have different meanings. Some examples: {there, their, they're}, {see, sea}, {to, too, two}. You will typically see these errors in text obtained through Automatic Speech Recognition (ASR).
  • Characters can also be confused because they look similar. A 0 (zero) can be confused with an O (capital o), or a 1 (number one) with an l (lowercase L). These errors typically come from OCR.
  • Input device related: keyboards can make certain error patterns more likely than others due to letter locations, for example on a QWERTY keyboard.
  • Last but not least, orthographic errors, related to the writer making mistakes: forgetting a double consonant, or using it in the wrong place, interchanging letters (e.g., 'becuase' for 'because'), and many others.

The goal is to continue using all the other features of the model and still be able to adapt the model to handle each of these cases in the best possible way. Let’s take the case of substitutions,

# sending or lending ?
sample = 'I will be 1ending him my car'
lp.annotate(sample)
>>{'checked': ['I', 'will', 'be', 'sending', 'him', 'my', 'car']}

# let's make the replacement of a '1' by an 'l' cheaper
weights = {'1': {'l': .1}}
spellModel.setWeights(weights)
lp.annotate(sample)
>>{'checked': ['I', 'will', 'be', 'lending', 'him', 'my', 'car']}

What happened here? By default, the cost of each of the edit operations {substitution, deletion, insertion, exchange} is 1. By providing custom weights for individual edit operations, we can change the cost of any candidate word whose derivation involves that operation. In this case, we're making the replacement with an 'l' in lending cheaper than with an 's' in sending, so lending is preferred.

The weights dictionary that we passed to the model is part of the edit matrix, a larger structure whose most appropriate content, as discussed earlier, depends on your application.
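The effect of such weights can be sketched with a weighted edit distance in plain Python. This is a simplified stand-in for the model's edit matrix: only substitution costs are customized here, and every other operation keeps cost 1:

```python
def weighted_distance(a, b, sub_weights=None):
    """Levenshtein distance where substituting x by y can be given a
    custom cost via sub_weights[x][y]; all other operations cost 1."""
    sub_weights = sub_weights or {}
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # look up a custom substitution cost, defaulting to 1
            sub = 0 if ca == cb else sub_weights.get(ca, {}).get(cb, 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + sub))
        prev = cur
    return prev[-1]

weights = {'1': {'l': .1}}   # replacing '1' by 'l' is now cheap
print(weighted_distance('1ending', 'lending', weights))  # 0.1
print(weighted_distance('1ending', 'sending', weights))  # 1
```

With the custom weight, 'lending' is strictly cheaper to reach than 'sending', which mirrors why the model flips its choice.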

Unfortunately, assembling this matrix by hand can sometimes be challenging. This will soon be included as an option during training for the Context Spell Checker. Stay tuned for new releases!

Advanced: The Mysterious tradeoff parameter

The tradeoff parameter balances two competing forces:

  • The context information, by which the model wants to change words based on the surrounding words.
  • The word and subword information, by which the model wants to preserve as much of an input word as possible to avoid destroying it.

Changing input words that are in the vocabulary for others that seem more suitable according to the context is one of the most challenging tasks in spell correction, because you run the risk of destroying existing 'good' words.

The models that you will find in the Spark-NLP library have already been configured in a way that balances these two forces and produces good results in most situations. But your dataset may differ from the one used to train the model, so we encourage you to play a bit with the hyperparameters. To give you an idea of how they can be modified, let's look at the following example,

sample = 'have you been two the falls?'
lp.annotate(sample)
>>{'checked': ['have', 'you', 'been', 'two', 'the', 'falls', '?', '.']}

Here 'two' is clearly wrong, probably a typo, and we would expect the model to choose the right correction candidate according to the context. To make the model rely more on the context and less on word information, we have the setTradeoff() method.

You can think of the tradeoff as how much a single edit operation (insert, delete, etc.) affects a candidate correction sequence when compared to other sequences.
So the lower the tradeoff, the less we care about the edit operations in the word, and the more we care about the word fitting properly into its context. The range for this parameter depends on how well the model was trained, but a reasonable rule of thumb is to expect it to be between 5 and 25.
Let's play a bit with this parameter to relax the importance the model puts on individual words in our example,

spellModel.getTradeoff()
>>10.0

# let's decrease the influence of word-level errors
spellModel.setTradeoff(5.0)
empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))
lp.annotate(sample)
>>{'checked': ['have', 'you', 'been', 'to', 'the', 'falls', '?', '.']}

Now we can see that the right correction was produced.
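Conceptually, what happened can be pictured with a toy scoring function. The numbers and the linear scoring form below are made up for illustration; the real model combines neural language-model scores with subword costs:

```python
def best_candidate(candidates, tradeoff):
    """Each candidate is (word, context_cost, edit_cost); lowest total wins."""
    return min(candidates, key=lambda c: c[1] + tradeoff * c[2])[0]

# 'two' is a valid vocabulary word (edit cost 0) but fits the context badly;
# 'to' needs one edit but the language model strongly prefers it here
candidates = [("two", 9.0, 0.0), ("to", 2.0, 1.0)]

print(best_candidate(candidates, tradeoff=10.0))  # 'two': 9.0 beats 2.0 + 10.0
print(best_candidate(candidates, tradeoff=5.0))   # 'to': 2.0 + 5.0 beats 9.0
```

Lowering the tradeoff shrinks the penalty attached to edit operations, so the context-preferred candidate can win.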

Performance Issues

Basically, the error detection stage of the model decides whether a word needs a correction; there are two reasons for a word to be considered incorrect,

  • The word is OOV: the word is out of the vocabulary.
  • The context: the word doesn’t fit well within its neighbouring words.

The only parameter we can control at this point is the second one, and we do so with the setErrorThreshold() method, which sets a threshold score above which a word will be considered suspicious and a good candidate for correction.
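A rough sketch of this filtering step, with made-up context scores and a hypothetical helper (the real scores come from the model's language model):

```python
def suspicious_words(scored_tokens, error_threshold):
    """Return the tokens whose context score exceeds the threshold;
    only these are passed on to the (more expensive) correction stage."""
    return [word for word, score in scored_tokens if score > error_threshold]

# illustrative scores: higher means the word fits its context worse
scored = [("have", 2.1), ("you", 1.4), ("been", 3.0), ("two", 14.8)]

print(suspicious_words(scored, error_threshold=10.0))  # only 'two' is flagged
print(suspicious_words(scored, error_threshold=2.5))   # a lower threshold flags more
```

Raising the threshold means fewer words are treated as suspicious, which is exactly the accuracy-for-speed trade shown in the table below.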

The parameter that comes with the pretrained model has been set so that you get both decent performance and accuracy. For reference, this is how the F-score and total time vary on a sample dataset for different values of errorThreshold,

fscore | total time | threshold
-------------------------------
52.69  | 405s       | 8
52.43  | 357s       | 10
52.25  | 279s       | 12
52.14  | 234s       | 14

You can trade some minor points in accuracy for a nice speedup.

Conclusion

In this article, we explored how the ContextSpellChecker leverages context to detect errors and produce corrections, how the word classes and edit weights of a pretrained model can be customized, and, finally, how performance can be improved by compromising some minor points in accuracy.

Next Steps

Update: the new article about training a Spell Checker is out,
https://medium.com/@albertoandreotti/training-a-contextual-spell-checker-for-italian-language-66dda528e4bf
